Roadmap

Buckaroo is maturing, and I decided to write a roadmap.

Priorities

Buckaroo is still, for the most part, pre-users. It is maturing though, and the feedback gathered so far maps out some reasonable principles.

  • Function as a reliable replacement for the default display of dataframes

    • Exceptions in the basic display of a dataframe are a P1 error.

    • Dataframes that don’t display are a P1 error.

    • Taking more than a second to display a dataframe with fewer than 1M values is a P2 error.

  • Buckaroo should do the least surprising thing.

    • Autocleaning should be turned off by default.

  • Bug/feature request priorities

    • This is the roadmap and I’ll stick with it.

    • If a user has a feature or bug request that is preventing them from using Buckaroo, it gets priority.

Release Plans

0.4 Series

  1. Documentation

    • Readme refresh

    • How to create a formatter

    • Pluggable analysis framework refresh

    • Customizing autocleaning

    • Customizing enable/instantiation

    • Order of operations Dataflow doc

  2. Promotion

  3. Devops improvements (CI, testing, end-to-end testing, packaging)

    • CI passing - Done

    • CI testing - Done

    • End-to-end testing - Done

    • CI version bump - needed

    • Ruff Python linter - needed

  4. Jupyter notebook compatibility

    • Google colab - Done

    • VSCode - Done

    • Warning message on notebook < 7 - Done

    • Notebook 6.0 compatibility - undecided

  5. Code cleanup

    • TypeScript passes linter - Done

    • snake_case/camelCase normalization

    • Better naming

    • Submodule organization

  6. Python repr bugs (see the display-serialization sketch after this list)

    • List

    • Tuple

    • Nested lists and tuples across Python types (int, float, boolean)

    • Dictionary?

  7. Formatters (see the formatter sketch after this list)

    • DateTime formatter

    • Float formatter with specificity

  8. Frontend

    • Autoclean toggle
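
A minimal sketch of the kind of conversion the repr fixes above call for, assuming the goal is to turn arbitrary nested Python containers into JSON-safe values before they reach the frontend; display_repr is a hypothetical helper, not Buckaroo’s actual API.

```python
# Hypothetical sketch (not Buckaroo's actual code): convert arbitrary nested
# Python containers into JSON-safe values so object columns display instead
# of raising during serialization.
from typing import Any


def display_repr(val: Any) -> Any:
    """Return a JSON-serializable stand-in for val."""
    if isinstance(val, (bool, int, float, str)) or val is None:
        return val
    if isinstance(val, (list, tuple)):
        # Recurse so nested lists/tuples of mixed types still display.
        return [display_repr(v) for v in val]
    if isinstance(val, dict):
        return {str(k): display_repr(v) for k, v in val.items()}
    # Fall back to repr() for anything else (sets, custom objects, ...).
    return repr(val)


print(display_repr([1, (2.5, True), {"a": [None, "x"]}]))
# -> [1, [2.5, True], {'a': [None, 'x']}]
```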

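A Python sketch of the intended formatting behavior; the function names are hypothetical and this is not Buckaroo’s formatter API.

```python
# Illustrative only: a datetime formatter and a float formatter with a
# configurable number of significant digits.
from datetime import datetime


def format_datetime(dt: datetime) -> str:
    return dt.strftime("%Y-%m-%d %H:%M:%S")


def format_float(x: float, sig_digits: int = 3) -> str:
    return f"{x:.{sig_digits}g}"


print(format_datetime(datetime(2024, 1, 15, 9, 30)))  # 2024-01-15 09:30:00
print(format_float(1234.56789, 5))                    # 1234.6
print(format_float(0.000123456))                      # 0.000123
```
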
0.5 Series

I’m a bit fuzzy on this one; it’s either going to be a backend port to Polars or filtering. I’ll write it up as filtering for now.

  1. Filtering (see the pandas sketch after this list)

    • Any-field text search

    • Should work with codegen

    • Per-column exact filtering

  2. Additional sampling techniques (see the sketch after this list)

    • Chunks (50 contiguous rows)

    • Outliers - the extreme percentiles of each column, all in a single view

    • Straight random sample

  3. UI cycling

    • Everything that is now binary (e.g. summary stats on or off) is actually a single choice among multiple possible options. Allow repeated clicks to cycle through the different options.

    • Enable cycling for summary_stats and sample method

  4. Low code UI

    • Add Commands for filtering
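
A sketch of the two filtering modes from item 1, written in plain pandas; the actual implementation would also need to emit equivalent generated code.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["apple", "banana", "cherry"],
    "color": ["red", "yellow", "red"],
    "count": [3, 5, 7],
})

# Any-field text search: keep rows where any column contains the term.
term = "red"
row_mask = df.astype(str).apply(
    lambda col: col.str.contains(term, case=False, na=False)
).any(axis=1)
any_field_matches = df[row_mask]

# Per-column exact filtering: keep rows where one column equals a value.
exact_matches = df[df["color"] == "red"]

print(any_field_matches)
print(exact_matches)
```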

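The sampling techniques from item 2, sketched in pandas; the thresholds and sizes are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=10_000), "b": rng.normal(size=10_000)})

# Chunk sample: 50 contiguous rows starting at a random offset.
start = rng.integers(0, len(df) - 50)
chunk = df.iloc[start:start + 50]

# Outlier sample: the rows holding each column's extreme percentiles,
# combined into a single view.
lo, hi = df.quantile(0.01), df.quantile(0.99)
outliers = df[((df < lo) | (df > hi)).any(axis=1)]

# Straight random sample.
random_sample = df.sample(50, random_state=0)

print(len(chunk), len(outliers), len(random_sample))
```
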
0.6 Series

Polars backend

All of the same tests should pass.

  1. Low code UI Commands in Polars

    • Gives autocleaning and filtering at much higher performance. A nice way to get my feet wet with Polars.

    • Testing that verifies eval(_to_py) == transform(df) and pl.transform(df) == pd.transform(df)

    • Pandas/Polars equivalence is key to codegen continuing to be useful (see the equivalence-test sketch after this list)

  2. Serialization in Polars

    • 2x speed bump

    • Straightforward

  3. Pluggable analysis framework - for Polars

    • Same pluggable analysis framework, now lazy

    • Summary stats run on the whole dataframe - up to 1 GB (see the lazy summary-stats sketch after this list)
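
A sketch of the equivalence property from item 1: the pandas and Polars versions of a generated transform should agree. pd_drop_nulls and pl_drop_nulls are hypothetical stand-ins for generated code.

```python
import pandas as pd
import polars as pl


def pd_drop_nulls(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a pandas transform emitted by the low-code UI.
    return df.dropna()


def pl_drop_nulls(df: pl.DataFrame) -> pl.DataFrame:
    # Stand-in for the equivalent Polars transform.
    return df.drop_nulls()


def test_pandas_polars_equivalence():
    pd_df = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
    pl_df = pl.from_pandas(pd_df)
    pd_result = pd_drop_nulls(pd_df).reset_index(drop=True)
    pl_result = pl_drop_nulls(pl_df).to_pandas()
    pd.testing.assert_frame_equal(pd_result, pl_result)
```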

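A sketch of lazy summary stats in Polars for item 3, assuming a recent Polars version; nothing is computed until .collect().

```python
import polars as pl

# In practice this would be pl.scan_parquet()/pl.scan_csv() on a large file.
lf = pl.DataFrame({
    "a": [1, 2, None, 4],
    "b": [10.0, None, 30.0, 40.0],
}).lazy()

summary = lf.select(
    pl.all().min().name.suffix("_min"),
    pl.all().max().name.suffix("_max"),
    pl.all().null_count().name.suffix("_null_count"),
).collect()
print(summary)
```
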
0.7 Series

  1. Serialization speedup

    • Integrate parquet-wasm in the frontend

    • Parquet serialization on the backend (see the sketch below)

    • Maintain JSON serialization
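
A backend-side sketch, assuming the frontend decodes Parquet bytes with parquet-wasm, pyarrow is available on the backend, and JSON remains the fallback path; serialize_df is a hypothetical helper.

```python
import io

import pandas as pd


def serialize_df(df: pd.DataFrame, use_parquet: bool = True) -> bytes:
    """Serialize a dataframe for the frontend."""
    if use_parquet:
        buf = io.BytesIO()
        df.to_parquet(buf, index=False)  # requires pyarrow
        return buf.getvalue()
    # JSON fallback: larger and slower, but always available.
    return df.to_json(orient="records").encode("utf-8")


df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(len(serialize_df(df)), len(serialize_df(df, use_parquet=False)))
```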