Auto Cleaning

By default, Buckaroo aggressively tries to type data and clean it up.

Better typing

What do I mean by cleaning types? By default, if an integer column contains a single missing value, pandas uses the float64 dtype so that the missing value can be represented as NaN. The autotyping functionality instead casts the column to Int64, a nullable integer type in pandas that allows NA values in integer columns. Work is also done to constrain types to their narrowest form: if every int value in a column is between 0 and 255, autotyping casts the column to UInt8, which uses a single byte per value instead of the 8 bytes needed for float64 or int64.
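The difference is easy to see in plain pandas. The sketch below is not Buckaroo's implementation; it only illustrates the dtypes involved, using made-up sample values.

```python
import pandas as pd

# A single missing value makes plain pandas fall back to float64.
raw = pd.Series([1, 2, None])
print(raw.dtype)

# The nullable Int64 extension dtype keeps the integers and stores the gap as <NA>.
nullable = raw.astype("Int64")
print(nullable.dtype)

# Every value fits in the 0-255 range, so one byte per value (UInt8) is enough.
narrow = nullable.astype("UInt8")
print(narrow.dtype)
```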

Heuristic cleaning

The autocleaning tool also heuristically removes errant, mistyped values from a column. If a column is primarily ints with a single string, that string is stripped so the column can be treated as numeric.
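A rough equivalent of this idea in plain pandas looks like the following. This is a sketch of the general technique (coercing unparseable values to missing), not the code Buckaroo runs; the sample values are invented.

```python
import pandas as pd

# Mostly integers, with one stray string; pandas stores this as object dtype.
mostly_ints = pd.Series([10, 20, "oops", 40])

# errors="coerce" replaces anything that cannot be parsed as a number with NaN.
numeric = pd.to_numeric(mostly_ints, errors="coerce")

# Casting to the nullable Int64 dtype keeps the remaining values as integers.
cleaned = numeric.astype("Int64")
print(cleaned)
```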

Using Autocleaning

Changing individual coercions

Autotyping operations are added to the lowcode UI by default. To remove one, open the lowcode UI from the λ menu, click on operations, and delete the operation with the X.

Turning off auto cleaning

Buckaroo’s auto cleaning is aggressive and sometimes not wanted. To use Buckaroo without autotyping, invoke it this way:

.. code-block:: python

    from buckaroo import BuckarooWidget

    # df is an existing pandas DataFrame
    BuckarooWidget(df, autoType=False)

How Autotyping Works

There are three steps to auto cleaning:

First, frequency metadata is collected with get_typing_metadata. This is a dictionary of values between 0 and 1, each giving the proportion of values in a column that could be int, float, bool, or datetime.

Next, recommend_type takes the typing metadata and returns bool, datetime, int, float, or string.

Finally, emit_command returns a JLisp operation that will perform the conversion.
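The three steps can be sketched as three small functions. Everything below is an illustration under stated assumptions, not Buckaroo's code: the sketch_ function names, their signatures, the 0.9 threshold, and the list shape returned in the last step are all hypothetical, and datetime detection is left out for brevity.

```python
import pandas as pd


def sketch_typing_metadata(ser):
    """Step 1: proportion (0 to 1) of non-null values that parse as each type."""
    vals = ser.dropna().astype(str).str.strip()
    n = max(len(vals), 1)  # avoid dividing by zero on an empty column
    as_num = pd.to_numeric(vals, errors="coerce")
    is_float = as_num.notna()
    is_int = is_float & (as_num % 1 == 0)
    is_bool = vals.str.lower().isin(["true", "false"])
    return {
        "int": float(is_int.sum() / n),
        "float": float(is_float.sum() / n),
        "bool": float(is_bool.sum() / n),
    }


def sketch_recommend_type(meta, threshold=0.9):
    """Step 2: pick the first type whose proportion clears the threshold."""
    for name in ("bool", "int", "float"):
        if meta[name] >= threshold:
            return name
    return "string"


def sketch_emit_command(col, type_name):
    """Step 3: hypothetical stand-in for a JLisp operation (real format may differ)."""
    return [{"symbol": "to_" + type_name}, {"symbol": "df"}, col]


# Nine of ten values are integers, so the single stray string is outvoted.
ser = pd.Series(["1", "2", "3", "x", "5", "6", "7", "8", "9", "10"])
meta = sketch_typing_metadata(ser)
kind = sketch_recommend_type(meta)
command = sketch_emit_command("a", kind)
print(meta, kind, command)
```

Because each step only consumes the output of the previous one, any of the three can be swapped out without touching the others.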

Why three functions?

Splitting this into three distinct phases makes it much easier to customize behavior. It also allows improvements to accrue without requiring complete rewrites of the auto-typing functionality. My guess is that recommend_type is the easiest to override and will be the function overridden most often.

How do I add special replacement functionality?

What if you commonly deal with a dataset that treats y as True and n as False? How would you recognize those values and convert them to boolean?

Code coming soon.
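In the meantime, the recognition and conversion steps can be sketched in plain pandas. This does not use any Buckaroo hook; YN_MAP, yn_fraction, and yn_to_boolean are hypothetical names chosen for illustration, and the sample values are invented.

```python
import pandas as pd

# Hypothetical mapping from the dataset's markers to booleans.
YN_MAP = {"y": True, "n": False}


def yn_fraction(ser):
    """Proportion of non-null values that are 'y' or 'n', ignoring case and spaces."""
    vals = ser.dropna().astype(str).str.strip().str.lower()
    if len(vals) == 0:
        return 0.0
    return float(vals.isin(list(YN_MAP)).mean())


def yn_to_boolean(ser):
    """Map y/n to the nullable boolean dtype; anything else becomes <NA>."""
    vals = ser.str.strip().str.lower()
    return vals.map(YN_MAP).astype("boolean")


answers = pd.Series(["y", "N", " y ", None, "maybe"])
fraction = yn_fraction(answers)    # share of non-null values that look like y/n
converted = yn_to_boolean(answers)
print(fraction)
print(converted)
```

A frequency like the one returned by yn_fraction is the kind of number the metadata step could report, leaving the decision about whether to convert to the recommendation step.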