BPt.Dataset.filter_categorical_by_percent

Dataset.filter_categorical_by_percent(scope='category', drop_percent=1, drop=True, inplace=False)

This method performs outlier filtering on categorical variables. Note that this method assumes all columns passed are of type ‘category’; any that are not will first be cast to the pandas data type ‘category’.

Note: NaN values will be skipped. To treat them as a class instead, first use the method nan_to_class. Note also that this method will not work on data files.

This method operates on each of the columns specified by scope independently. If multiple columns are passed, the overlap of all outliers from each column will be dropped only after all have been calculated (therefore the order won’t matter).
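
The flag-then-drop behaviour described above can be illustrated with plain pandas. This is only a minimal sketch, not BPt's implementation: the helper flag_rare and the example columns are hypothetical, and it assumes a row is removed if it is flagged as an outlier in any of the selected columns.

```python
import pandas as pd


def flag_rare(series: pd.Series, drop_percent: float) -> pd.Index:
    """Hypothetical helper: index labels whose category makes up less
    than drop_percent % of the non-NaN values in this column."""
    valid = series.dropna()
    freqs = valid.value_counts(normalize=True)
    rare = freqs[freqs < drop_percent / 100].index
    return valid.index[valid.isin(rare)]


df = pd.DataFrame({
    "site": ["a"] * 9 + ["b"],
    "sex": ["f"] + ["m"] * 9,
})

# Flag outliers in every column against the unmodified data first...
to_drop = set()
for col in ["site", "sex"]:
    to_drop.update(flag_rare(df[col], drop_percent=15))

# ...and only then drop, so the column order cannot change the result.
filtered = df.drop(index=sorted(to_drop))
```

Because every column is evaluated before any row is removed, the percentages for one column are never affected by rows removed on account of another.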

Parameters

scope : Scope

A BPt style Scope used to select the subset of column(s) to which the current function is applied. See Scope for more information on how this can be applied.

default = 'category'

drop_percent : float, optional

This parameter acts as a percentage threshold for dropping categories when loading categorical data. This parameter should be passed as a percent, such that a category will be dropped if it makes up less than that % of the data points. For example:

drop_percent = 1

In this case, within each column selected by scope, any data point whose category makes up less than 1% of that column's total valid (non-NaN) data points will be dropped (or set to NaN if drop=False).

default = 1
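
To make the threshold arithmetic concrete, the following pandas-only sketch computes the per-category percentages for a single column. It is an illustration under assumptions rather than BPt's code: the example data are made up, and the comparison is taken to be strictly "less than", as the description above states.

```python
import numpy as np
import pandas as pd

values = pd.Series(
    ["a"] * 150 + ["b"] * 49 + ["c"] + [np.nan] * 50,
    dtype="category",
)

drop_percent = 1

# NaN values are skipped, so percentages use 200 valid points, not 250.
valid = values.dropna()
percent = valid.value_counts() / len(valid) * 100

# "c" is 1 of 200 valid points (0.5%), which is below the 1% threshold.
rare = percent[percent < drop_percent].index
is_outlier = values.isin(rare)
```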

drop : bool, optional

By default, this function will drop any subjects / index determined to be outliers. Alternatively, set drop=False to keep those subjects and replace only the specific values identified as outliers with NaN.

default = True
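
The difference between the two settings can be sketched with plain pandas. The data frame, subject labels, and the choice of "b" as the flagged category are all hypothetical; the snippet mirrors the described behaviour and is not a call into BPt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"site": ["a"] * 9 + ["b"], "age": range(20, 30)},
    index=[f"subj_{i}" for i in range(10)],
)

# Assume category "b" in "site" was identified as an outlier.
is_outlier = df["site"] == "b"

# Analogous to drop=True: the whole subject (row) is removed.
dropped = df[~is_outlier]

# Analogous to drop=False: the subject stays, only the value becomes NaN.
masked = df.copy()
masked.loc[is_outlier, "site"] = np.nan
```

With the second approach the subject's values in other columns (here, age) remain available.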

inplace : bool, optional

If True, perform the current function in place and return None.

default = False