Data Flow through Buckaroo

Buckaroo is extensible. Its architecture is crafted to expose specific, composable extension points in an opinionated manner. It was designed this way based on experience writing many ad hoc analysis pipelines; previous "simpler" attempts at extensibility ran into bugs that couldn't be cleanly accommodated.

Buckaroo aims to let users toggle highly opinionated configurations. With Buckaroo, you can add a cleaning_method of "interpret int as milliseconds from unix epoch"; it will look at a column of ints, decide that the values map to datetimes in the past year (as opposed to centered around 1970), and treat the column as a datetime. That is a highly opinionated view of your data, and the cost of that opinionated view is lower when multiple opinions can be cycled through quickly.
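As a rough illustration (not Buckaroo's actual implementation), the heuristic behind such a cleaning_method might look something like this:

import pandas as pd

def looks_like_epoch_millis(ser, window_years=1):
    # Hypothetical heuristic: treat an integer column as epoch milliseconds
    # only if most values decode to datetimes within the last `window_years`.
    as_dt = pd.to_datetime(ser, unit='ms', errors='coerce')
    now = pd.Timestamp.now()
    recent = as_dt.between(now - pd.DateOffset(years=window_years), now)
    return recent.mean() > 0.8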

This approach differs from most tools, which aim to be generic and customizable through bespoke configuration. It would be a bad thing if a generic table tool displayed integers as dates on the assumption that they are milliseconds from the unix epoch; normally that behavior requires custom code written and called after manual inspection of the data.

This document describes the multiple ways of extending Buckaroo to add your own toggleable opinions:

  1. understanding the dataflow through Buckaroo

  2. a quick start to extending Buckaroo

  3. a description of the extension points

Customization points of Buckaroo

  1. sample_method: used to specify the conditions for downsampling a dataframe and the method of sampling. Example alternatives include sampling in chunks, showing only the first and last rows, random sampling, and limiting the number of columns. Returns sampled_df.

  2. cleaning_method: receives sampled_df. Used to control how dataframes are cleaned before summary stats are run. Examples include special parsing rules for unusual date formats and removing strings from primarily numeric columns. Returns cleaned_df and cleaned_summary_dict.

  3. post_processing_method: receives the entire cleaned dataframe. Used to perform multi-column operations, like adding a running_diff column or combining latitude and longitude columns into a single lat/long column. Returns processed_df and processed_summary_dict.

  4. analysis_klasses: receive individual columns from processed_df. Column-level analysis classes used to fill out summary_stats. Examples include mean, median, min, max, and more complex results like histograms. Each class returns a summary_dict about a single column.

  5. style_method: receives col_name, col_summary_dict, and default_config. Takes a column's summary_dict and returns the column_config for that column. Examples include formatting a datetime as time-only when the min/max fall within a single day, or conditionally turning on tooltips and color_maps based on other info in the summary_dict. A rough sketch of these signatures appears after this list.
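The sketch below is a simplified view of how these extension points line up, written as plain functions with assumed signatures so the flow of dataframes and summary dicts is clear. It omits analysis_klasses, which are classes with a richer interface; see the Quick Start for how the real class attributes (post_processing_function, style_methods) are wired up.

def my_sample_method(raw_df):
    # downsample: here, just the first 1,000 rows
    return raw_df.head(1000)

def my_cleaning_method(sampled_df):
    # return the cleaned dataframe plus a per-column summary dict
    return sampled_df, {}

def my_post_processing_method(cleaned_df):
    # whole-dataframe operations; returns processed_df and a summary dict
    return cleaned_df, {}

def my_style_method(col_name, col_summary_dict, default_config):
    # per-column display config overrides keyed off the summary dict
    return {}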

Full Flow

Starting with raw_df, data flows through Buckaroo as follows. If one of the values on the right-hand side of a step changes, that step and all steps below it are re-executed (the steps are sketched as pseudocode after the table).

The final result, widget, is what is displayed to the user.

Destination | args                                                    | named-tuple parts
----------- | ------------------------------------------------------- | ---------------------------------------
sampled_df  | raw_df, sample_method                                   |
cleaned     | sampled_df, sample_method, cleaning_method, lowcode_ops | cleaned_df, cleaned_sd, generated_code
processed   | cleaned_df, post_processing_method                      | processed_df, processed_sd
summary_sd  | processed_df, analysis_klasses                          |
merged_sd   | cleaned_sd, summary_sd, processed_sd                    |
widget      | processed_df, merged_sd, style_method, generated_code   |
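The same flow, written out as conceptual pseudocode. Each assignment is one dataflow step; when any name on the right-hand side changes, that step and every step below it re-run. The lowercase helper functions here are illustrative, not Buckaroo internals.

sampled_df = sample(raw_df, sample_method)

cleaned = clean(sampled_df, sample_method, cleaning_method, lowcode_ops)
cleaned_df, cleaned_sd, generated_code = cleaned        # named-tuple parts

processed = post_process(cleaned_df, post_processing_method)
processed_df, processed_sd = processed                  # named-tuple parts

summary_sd = summarize(processed_df, analysis_klasses)

merged_sd = merge(cleaned_sd, summary_sd, processed_sd)

widget = render(processed_df, merged_sd, style_method, generated_code)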

[Dataflow diagram (Graphviz): raw_df and sample_method feed sampled_df; sampled_df, cleaning_method, and lowcode_ops feed the cleaned group (cleaned_df, cleaned_sd, generated_code); cleaned_df and post_processing_method feed the processed group (processed_df, processed_sd); processed_df and analysis_klasses feed summary_sd; cleaned_sd, summary_sd, and processed_sd merge into merged_sd; processed_df, merged_sd, style_method, and generated_code feed the widget.]

Glossary

  1. dataflow-result: the result of a step. Updates to this variable trigger the steps that watch it as a dataflow-arg.

  2. dataflow-arg: a dataflow-result used as a function argument. Updates to it cause the current step to execute.

  3. UI-variable: specified in the UI and changeable interactively. Updates to it cause the current step to execute.

  4. class-state: defined at class instantiation time; these can be customized, but not interactively.

  5. named-tuple-result: some results return as a tuple; the tuple is what is watched, and the sub-parts of the tuple can be referenced later.

  6. tuple-param: read from a named-tuple-result, but not watched itself (setting the named-tuple-result will not trigger this step). A short sketch follows this list.
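To make the named-tuple terms concrete, a small illustrative sketch (Buckaroo's actual internals may differ):

from typing import NamedTuple
import pandas as pd

class Cleaned(NamedTuple):
    # the named-tuple-result: the tuple as a whole is the watched variable
    cleaned_df: pd.DataFrame
    cleaned_sd: dict
    generated_code: str

cleaned = Cleaned(pd.DataFrame({'a': [1, 2]}), {}, "df")

# cleaned_df, cleaned_sd, and generated_code are tuple-params: downstream
# steps read them, but only reassigning `cleaned` itself triggers the steps
# that watch it.
cleaned_df = cleaned.cleaned_df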

[Diagram legend (Graphviz): green = instance variable, light blue = dataflow variable, orange = UI variable; house shape = dataframe, inverted trapezium = summary_dict; thick purple edges = data-flow changes that trigger recomputation, plain edges = UI-variable changes that also trigger recomputation, dashed edges = read-only and do not trigger a recompute, dotted red edges = error flow where data-flow steps are skipped.]

Quick Start to extending Buckaroo

In this exercise we are going to add a custom coloring method to Buckaroo. We will take an OHLCV dataframe and color Volume based on the change from the previous day.

First we need to craft the column config that will enable this conditional coloring.

We want to use ColorFromColumn; the config for the Volume column should look like this:

volume_config_override = {
    'color_map_config' : {
        'color_rule': 'color_from_column',
        'col_name': 'Volume_colors'}}

Using this in Buckaroo will look like this:

df = get_ohlcv("IBM")
df['Volume_colors'] = 'red'
BuckarooWidget(df, override_column_config={'Volume': volume_config_override})
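get_ohlcv isn't part of Buckaroo; it stands in for any function that returns an OHLCV dataframe. If you want to follow along, a hypothetical stub might look like this:

import numpy as np
import pandas as pd

def get_ohlcv(ticker, periods=30):
    # Hypothetical stand-in: a random-walk OHLCV frame indexed by date.
    rng = np.random.default_rng(0)
    close = 100 + rng.normal(0, 1, periods).cumsum()
    return pd.DataFrame({
        'Open': close + rng.normal(0, 0.5, periods),
        'High': close + 1,
        'Low': close - 1,
        'Close': close,
        'Volume': rng.integers(1_000, 10_000, periods),
    }, index=pd.date_range('2024-01-01', periods=periods, name=ticker))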

This is a nice start, but now our analysis depends on remembering and typing specific config lines each time we want this display.

Buckaroo provides built-in ways of handling this.

First we want to use a post_processing_function to add the Volume_colors column all of the time, and to make it conditional on the day-over-day change. We need a post_processing_function because we specifically need to operate on the whole dataframe, not just a single column.

def volume_post(df):
    if 'Volume' not in df.columns:
        return [df, {}]
    # green when volume increased from the previous row, red otherwise
    df['Volume_colors'] = df['Volume'].diff().apply(
        lambda d: 'green' if d > 0 else 'red')
    extra_summary_dict = {
        'Volume': {
            'column_config_override': {
                'color_map_config': {
                    'color_rule': 'color_from_column',
                    'col_name': 'Volume_colors'}}},
        'Volume_colors': {
            'column_config_override': {
                'displayer': 'hidden'}}}
    return [df, extra_summary_dict]

class OHLVCBuckarooWidget(BuckarooWidget):
    post_processing_function = volume_post

OHLVCBuckarooWidget(get_ohlcv("IBM"))

Now when you instantiate OHLVCBuckarooWidget there will be a UI-toggleable function named volume_post, so you can turn this feature on and off interactively. OHLVCBuckarooWidget has your own opinions baked in, and the user can turn them on or off.

What if we want to switch between the red/green color map and a color map based on the size of the diff from the previous day? In this case we want to add two style_methods, which are toggleable in the UI. A style_method takes a summary_dict and returns the column config.

def volume_post(df):
    if 'Volume' not in df.columns:
        return [df, {}]
    # green when volume increased from the previous row, red otherwise
    df['Volume_colors'] = df['Volume'].diff().apply(
        lambda d: 'green' if d > 0 else 'red')
    df['Volume_diff'] = df['Volume'].diff()
    extra_summary_dict = {
        'Volume_colors': {'column_config_override': {'displayer': 'hidden'}},
        'Volume_diff': {'column_config_override': {'displayer': 'hidden'}}}
    return [df, extra_summary_dict]

def volume_style_red_green(col_name, col_summary_dict, default_config):
    if col_name == 'Volume':
        return {'override': {
            'color_map_config': {'color_rule': 'color_from_column',
                                 'col_name': 'Volume_colors'}}}
    return {}

def volume_style_color_map(col_name, col_summary_dict, default_config):
    if col_name == 'Volume':
        return {'override': {
            'color_map_config': {'color_rule': 'color_map',
                                 'map_name': 'BLUE_TO_YELLOW',
                                 'val_column': 'Volume_diff'}}}
    return {}

class OHLVCBuckarooWidget(BuckarooWidget):
    post_processing_function = volume_post
    style_methods = [volume_style_red_green, volume_style_color_map]

OHLVCBuckarooWidget(get_ohlcv("IBM"))

With this implementation, the frontend can cycle through three style_methods: volume_style_red_green, volume_style_color_map, and the default.