.. _using: ============= Auto Cleaning ============= By default buckaroo aggresively tries to type data and clean it up. Better typing ------------- What do I mean by cleaning types? By default if an integer column contains a single missing value, pandas will use the ``float64`` dtype to represent that value as a NaN. The autotyping functionality instead casts that as ``Int64`` a new type in pandas that allows ``NA`` values in Int columns. Work is also done to constrain types to their narrowest, so if an int value is between 0 and 255, autotyping will cast that to `UInt8` using a single byte instead of 8 for a float64 or int64. Heuristic cleaning ------------------ The autocleaning tool also heursitically removes errant mistyped values from column. If a column is primarily Ints with a single string, that string is stripped so the column can be treated as numeric. Using Autocleaning ================== Changing individual coercions ---------------------------- Autotyping operations are added to the lowcode UI by default, open up the lowcode UI with the λ menu, then click on operations and delete them with the X. Turning off auto cleaning ------------------------- Buckaroo's auto cleaning is aggressive and sometimes not wanted to use Buckaroo without autotyping, invoke it this way .. code-block:: python from buckaroo import BuckarooWidget BuckarooWidget(df, autoType=False) How Autotyping Works ==================== There are three steps to auto_cleaning First frequency metadata is collected with ``get_typing_metadata``, this is a dictionary with ``0`` - ``1`` ranges for the proportion of values that could be ``int``, ``float``, ``bool``, ``datetime``. Next ``recommend_type`` takes the typing metadata and returns ``bool``, ``datetime``, ``int``, ``float``, or ``string`` Finally ``emit_command`` returns a JLisp operation that will perform the conversion. Why three functions? -------------------- Splitting this into three distinct phases makes it much easier to customize behavior. It also allows improvements to accrue without requiring complete rewrites of the auto-typing functionality. My guess is that ``recommend_type`` is the easiest to override, and will be the most frequently. How do I add special replacement functionality? ----------------------------------------------- What if you commonly deal with a dataset that treats ``y`` as ``True`` and ``n`` as ``False``, how would you recognize those types of values and convert them to boolean? Code coming soon.