Pluggable analysis framework¶
The pluggable analysis framework is built to allow the different bits of analysis done by buckaroo to be mixed and matched as configuration, without requiring editting the core of buckaroo. Pieces of analysis can build on each other, and errors are flagged intelligently.
Analysis In Action¶
The most obvious piece of analysis is adding a new measure/fact to summary stats.
A monolithic summary_stats would look like this
def summarize_string(ser):
l = len(ser)
val_counts = ser.value_counts()
distinct_count= len(val_counts)
return dict(
dtype=ser.dtype,
length=l,
min='',
distinct_count= distinct_count,
distinct_percent = distinct_count/l)
def summarize_df(df):
summary_df = pd.DataFrame({col:summarize_string(df[col]) for col in df})
return summary_df
If you wanted to enhance that simple summary_stat by adding nan_count and nan_percent, you would have to rewrite the entire function and get buckaroo to use your new function. Instead, imagine that the existing sumarize_string
was already built into Buckaroo and you wanted to add to it. Here’s what the code looks like.
bw = BuckarooWidget(df)
@bw.add_analysis
class NanStats(ColAnalysis):
provided_summary = [
'nan_count', 'nan_percent']
requires_summary = ['length']
@staticmethod
def summary(sampled_ser, summary_ser, ser):
l = summary_ser.loc['min']
nan_count = l - len(ser.dropna())
return dict(
nan_count = nan_count,
nan_percent = nan_count/l)
bw #render the buckaroo widget
Now buckaroo will be displayed with the new summary stats of ‘nan_count’ and ‘nan_percent’. Buckaroo knows that NanStats
must be called after the ColAnalysis
object that provides length
and that ordering is handled for you automatically (via a DAG).
You can then iteratively and interactively update your analysis in the jupyter notebook focussing on the core of your business logic without worrying about how to structure your code. When add_analysis
is called, a set of simple unit tests are run that catch common errors with weird datashapes. This means you can focus on coding and not worry about all the edge cases.
Features of the Pluggable Analysis Framework¶
Every aspect of the interactive experience of buckaroo can be manipulated through Analysis objects
- summary_stats
different facts can be added to the summary stats view… This is the most basic type of analysis to add
- table_hints
# used for styling hints like coloring or font. The frontend currently doesn’t do much with table_hints
- column_ordering
Allows a custom ordering of columns
class NumbersFirst(ColAnalysis): requires_summary = ['is_numeric'] @staticmethod def column_ordering(summary_df): # I know that this could be done in pure pandas on the summary_df col_dicts = [] for i, col in enumerate(df.columns): col_facts = { 'name':col, 'existing_order_score': i/len(df.columns)} if summary_df[col]['is_numeric']: col_facts['numeric_boost'] = 2 else: col_facts['numeric_boost'] = 0 col_facts['total_score'] = col_facts['existing_order_score'] + col_facts['numeric_boost'] return [cd['name'] for cd in sorted(col_dicts, key=lambda x: x['total_score'])]
- summary_stats_display
a list of which rows from summary stats to display. Currently only the last added summary_stats_set is used
multiple column_orderings and summary_stats_facts can be added. Then the UI allows the user to toggle through the different column orderings to see the view of the table they want.
Order of Operations¶
The pluggable analysis framework runs different functions on analysis functions in a specific order. First the DAG for all of the analysis objects and verifies that all of the needed facts can be computed. once all of the analysis objects are ordered the computations start.
Compute the order of analysis objects. This builds a DAG and makes sure all of the facts can be computed.
Run all of the
summary
methods and build thesummary_df
extract table_hints from the
summary_df