Pluggable analysis framework¶
The pluggable analysis framework makes it easy to add custom analysis to table applications built with Buckaroo. It powers summary stats and styling for Buckaroo.
Why¶
When writing analysis code, I frequently wrote code that iterated over columns and built a resulting summary dataframe. This is initially simple. First you write transformations inline, then you probably iterate over functions that operate on each column. Eventually this type of code becomes difficult to maintain. A single error is hard to track down because it occurs in the middle of nested for loops. For pandas in particular, you face the problem of either repeating expensive analyses (such as value counts) over and over, or depending on state in an ad hoc way. Your simple functions become complex and dependent on order of execution.
How¶
The pluggable analysis framework addresses these problems by:
Writing analysis into classes that extend ColAnalysis.
Requiring each analysis class to receive previously computed values, specify which keys it depends on, and specify the keys it provides along with defaults.
Ordering analysis classes into a DAG so users don’t have to manually order dependent classes. If the DAG contains cycles or the required keys aren’t provided, an error is thrown before execution with a more understandable message.
Displaying sensible error messages, along with explicit steps to reproduce, when an error occurs during execution. No more navigating through nested for loop stack traces and wondering what state was passed into functions.
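The dependency ordering described above can be sketched with the standard library alone. This is a minimal illustration, not Buckaroo's implementation: the `Analysis` base class, the `requires` and `compute` names, and the `order_analyses` and `summarize` helpers are all hypothetical stand-ins, and plain Python lists are used in place of pandas or polars series.

```python
from graphlib import CycleError, TopologicalSorter


class Analysis:
    """Hypothetical stand-in for an analysis class (not Buckaroo's ColAnalysis)."""
    requires = []            # keys this analysis reads from earlier results
    provides_defaults = {}   # keys this analysis writes, with fallback values

    @staticmethod
    def compute(values, summary):
        return {}


class Length(Analysis):
    provides_defaults = {"length": 0}

    @staticmethod
    def compute(values, summary):
        return {"length": len(values)}


class NullCount(Analysis):
    provides_defaults = {"null_count": 0}

    @staticmethod
    def compute(values, summary):
        return {"null_count": sum(v is None for v in values)}


class NullFraction(Analysis):
    requires = ["length", "null_count"]
    provides_defaults = {"null_fraction": 0.0}

    @staticmethod
    def compute(values, summary):
        if summary["length"] == 0:
            return {}  # fall back to the declared default
        return {"null_fraction": summary["null_count"] / summary["length"]}


def order_analyses(analyses):
    """Return analyses in dependency order, or raise before anything runs."""
    provider = {}
    for a in analyses:
        for key in a.provides_defaults:
            provider[key] = a
    graph = {}
    for a in analyses:
        missing = [k for k in a.requires if k not in provider]
        if missing:
            raise ValueError(
                f"{a.__name__} requires {missing}, which no analysis provides")
        # Each analysis depends on whichever analyses provide its required keys.
        graph[a] = {provider[k] for k in a.requires}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        names = [a.__name__ for a in exc.args[1]]
        raise ValueError(f"dependency cycle: {' -> '.join(names)}") from exc


def summarize(values, analyses):
    """Run each analysis in order, passing along previously computed values."""
    summary = {}
    for a in order_analyses(analyses):
        summary.update(a.provides_defaults)
        summary.update(a.compute(values, summary))
    return summary


# The caller lists the classes in any order; the DAG sorts them.
stats = summarize([1, None, 3, None], [NullFraction, NullCount, Length])
```

Because the missing-key and cycle checks happen in `order_analyses`, a misconfigured set of classes fails before any column is touched, which is the behavior the list above describes.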
The pluggable analysis framework powers three main areas:
Summary stats. A dictionary of measures about each column. These can be independently computed on a per column basis.
Column styling. This is a function that takes the “required” measures about an individual column and returns a column_config. Once again, this can be computed independently per column. Styling can also generally be agnostic to pandas vs polars, as long as the other analysis classes provide similar measures.
Transform functions. Transform functions operate on the entire dataframe, and return extra summary_stats. This is the only place you can operate on related columns.
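To make the column styling idea concrete, here is a minimal sketch of a styling function that depends only on a column's measures. The measure names (`is_numeric`, `is_integer`) and the config keys (`displayer`, `max_fraction_digits`, `max_length`) are assumptions chosen for illustration; they are not taken from Buckaroo's column_config schema.

```python
def style_column(measures):
    """Map one column's measures to a display config (illustrative keys only)."""
    if measures.get("is_numeric", False):
        # Integers need no fractional digits; other numbers get a fixed precision.
        digits = 0 if measures.get("is_integer", False) else 3
        return {"displayer": "float", "max_fraction_digits": digits}
    return {"displayer": "string", "max_length": 35}


# Hypothetical per-column summary stats, as earlier analysis classes might provide.
summary_stats = {
    "price": {"is_numeric": True, "is_integer": False},
    "name": {"is_numeric": False},
}

# Each column is styled independently of every other column.
configs = {col: style_column(m) for col, m in summary_stats.items()}
```

Since the function only reads a dictionary of measures, it never touches a dataframe, which is why the same styling logic can serve both pandas and polars as long as both supply the same measure keys.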
Methods to override¶
Pandas / Polars specific methods to produce raw facts (covered separately)
style_column
Return a column_config given column_metadata
post_process_df
Modify the entire dataframe
Properties to override¶
post_processing_method
Name of the post_processing function for display in the UI
pinned_rows
Ordered list of pinned_row configs that will be shown before any main data
df_display_name
Name of the display view that is visible in the UI
data_key
Which key to read the non_pinned rows from, use “main” or “empty”
Pandas specific methods¶
series_summary
Passed the series and sampled series, returns a dictionary of measures
The extending-pandas notebook shows all of these methods being used
Polars specific methods¶
select_clauses
A list of polars expressions to be called on the dataframe. Use this as much as possible, since select queries are heavily optimized by polars.
column_ops
A dictionary from measure_key to a tuple of a polars selector and a function to apply to each matching polars series. Some polars operations can only be called on a series and cannot be executed as a select query.
The extending-polars notebook shows all of these methods being used
Future Improvements¶
Future releases of Buckaroo should include pydantic for better typing of summary stats methods
Better error messages. The error messages in the pluggable analysis framework seek to give you a one-line reproduction of the error found. Through some refactorings, the method names have changed.