BPt.Dataset.filter_outliers_by_percent

Dataset.filter_outliers_by_percent(scope='float', fop=1, drop=True, reduce_func=<function mean>, n_jobs=-1, inplace=False)

This method drops a fixed percent of outliers from the requested columns. It is designed to work on float type / continuous variables.

Note: This method operates on each of the columns specified by scope independently. If multiple columns are passed, the outliers for every column are calculated first, and only then is the overlap of all outliers from each column dropped (therefore the order of the columns won’t matter).
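The note above can be illustrated with a minimal sketch in plain Python. It is not BPt's implementation: the helper names, the toy data, and the fixed cutoffs are all hypothetical, and it assumes that "overlap" means the combined set of rows flagged in any column. The point is only that flagging happens per column on the unmodified data, so the column order cannot change the result.

```python
def flag_outliers(column, lower, upper):
    """Return the row labels whose value lies outside [lower, upper]."""
    return {label for label, value in column.items()
            if value < lower or value > upper}


# Toy data: column name -> {row label: value}. All values are made up.
data = {
    "height": {"s1": 150.0, "s2": 171.0, "s3": 250.0, "s4": 169.0},
    "weight": {"s1": 20.0, "s2": 70.0, "s3": 68.0, "s4": 72.0},
}

# Hypothetical cutoffs; in the real method they come from percentiles.
cutoffs = {"height": (155.0, 200.0), "weight": (40.0, 120.0)}


def rows_to_drop(column_order):
    """Flag each column on the original data, then combine the flags."""
    flagged = set()
    for name in column_order:
        lower, upper = cutoffs[name]
        flagged |= flag_outliers(data[name], lower, upper)
    return flagged


forward = rows_to_drop(["height", "weight"])
backward = rows_to_drop(["weight", "height"])

# Rows are removed once, after every column has been checked.
kept = [label for label in data["height"] if label not in forward]
print(sorted(forward), kept)  # ['s1', 's3'] ['s2', 's4']
```

Because nothing is removed until every column has been checked, `forward` and `backward` are the same set.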

This method can also be used with data files; the reduce_func and n_jobs parameters are specific to this case.

Parameters

scope : Scope

A BPt style Scope used to select the subset of column(s) to which the current function is applied. See Scope for more information on how this can be applied.

default = 'float'

fop : float, tuple, optional

This parameter represents the percent of outliers to drop. It should be passed as a percent, e.g., 1 for one percent or 5 for five percent.

This can also be passed as a tuple with two elements, where the first element represents the percent to filter from the lower part of the distribution and the second element the percent to filter from the upper part of the distribution. For example,

fop = (5, 1)

This setting will drop 5 percent from the lower part of the distribution and only 1 percent from the upper part. Likewise, you can use None on one side to skip dropping from that side, for example:

fop = (5, None)

This would drop five percent from the lower part only, and nothing from the upper part.

default = 1
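As a rough illustration of how a fop value could translate into cutoff values, here is a minimal standard-library sketch. The helpers `percentile` and `fop_to_thresholds` are hypothetical and are not part of BPt. The sketch assumes that a single number applies to both tails and that cutoffs are found by linear interpolation, and BPt's exact quantile rule may differ.

```python
def percentile(values, q):
    """Return the q-th percentile (0-100) using linear interpolation."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fop_to_thresholds(values, fop):
    """Map fop to (lower_cutoff, upper_cutoff); None means no cutoff."""
    # Assumption: a single number is applied to both tails.
    if not isinstance(fop, tuple):
        fop = (fop, fop)
    lower_pct, upper_pct = fop
    lower = None if lower_pct is None else percentile(values, lower_pct)
    upper = None if upper_pct is None else percentile(values, 100 - upper_pct)
    return lower, upper


values = list(range(101))  # 0, 1, ..., 100

print(fop_to_thresholds(values, (5, 1)))     # (5.0, 99.0)
print(fop_to_thresholds(values, (5, None)))  # (5.0, None)
print(fop_to_thresholds(values, 1))          # (1.0, 99.0)
```

Values below the lower cutoff or above the upper cutoff would then be treated as outliers.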

drop : bool, optional

By default this function will drop any subjects / index that are determined to be outliers. On the other hand, you may prefer to keep those subjects and set only the specific outlier values to NaN. To do this, set drop=False; the specific values identified as outliers will then be replaced with NaN.

default = True
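The difference between the two modes can be sketched on a toy table of plain dictionaries. This is an illustration only: `apply_outlier_rule`, `is_score_outlier`, and the cutoff of 10.0 are hypothetical and do not come from BPt.

```python
import math


def apply_outlier_rule(rows, column, is_outlier, drop=True):
    """Return new rows with outliers in `column` dropped or set to NaN."""
    if drop:
        # drop=True: the whole row (subject) is removed.
        return [dict(row) for row in rows if not is_outlier(row[column])]
    # drop=False: the row stays and only the flagged value becomes NaN.
    return [
        {**row, column: math.nan} if is_outlier(row[column]) else dict(row)
        for row in rows
    ]


def is_score_outlier(value):
    """Hypothetical rule standing in for a percentile-based cutoff."""
    return value > 10.0


rows = [
    {"subject": "s1", "age": 9.5, "score": 0.2},
    {"subject": "s2", "age": 10.1, "score": 57.0},
    {"subject": "s3", "age": 9.8, "score": 0.4},
]

dropped = apply_outlier_rule(rows, "score", is_score_outlier, drop=True)
masked = apply_outlier_rule(rows, "score", is_score_outlier, drop=False)

print([row["subject"] for row in dropped])  # ['s1', 's3']
print(masked[1])  # s2 is kept, with score set to nan and age untouched
```

With drop=False the subject's values in other columns are kept, which matters when the outlying column is only one of many.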

reduce_func : python function, optional

The passed Python function is applied only if the requested col/column is a ‘data file’. In that case, the function should accept as input the data from one data file and return a single scalar value. For example, the default value is numpy’s mean function, which returns one value.

default = np.mean
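The contract described for reduce_func (one data file in, one scalar out) can be sketched as follows. In-memory lists stand in for loaded data files, `statistics.fmean` stands in for numpy's mean so that the example needs only the standard library, and `reduce_data_files` is a hypothetical helper, not a BPt function.

```python
from statistics import fmean


def reduce_data_files(files, reduce_func=fmean):
    """Collapse each subject's data to a single scalar with reduce_func."""
    return {subject: reduce_func(values) for subject, values in files.items()}


# Hypothetical stand-ins for the contents of three data files.
files = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [10.0, 20.0, 30.0],
    "s3": [2.0, 2.0, 5.0],
}

by_mean = reduce_data_files(files)
by_max = reduce_data_files(files, reduce_func=max)

print(by_mean)  # {'s1': 2.0, 's2': 20.0, 's3': 3.0}
print(by_max)   # {'s1': 3.0, 's2': 30.0, 's3': 5.0}
```

The resulting scalars, one per subject, are what an outlier rule can then be applied to. A different reduce_func changes which subjects look extreme.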

n_jobs : int, optional

As with reduce_func, this parameter is only valid when the passed col/column is a ‘data file’. In that case, it specifies the number of cores to use when loading each data file and applying reduce_func to it. Using all available cores can provide a significant speedup, but can sometimes be memory intensive, depending on the underlying size of the files.

If set to -1, the method will try to automatically use all available cores.

default = -1
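To make the n_jobs behavior concrete, here is a small sketch that resolves -1 to the number of available cores and applies a reduce function to several data files concurrently. It is not how BPt implements this: a thread pool is used here only to keep the example self-contained, and `resolve_n_jobs` and `reduce_in_parallel` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean


def resolve_n_jobs(n_jobs):
    """Turn n_jobs into a worker count; -1 means all available cores."""
    if n_jobs == -1:
        return os.cpu_count() or 1  # cpu_count() can return None
    return max(1, n_jobs)


def reduce_in_parallel(files, reduce_func=fmean, n_jobs=-1):
    """Apply reduce_func to every data file using a pool of workers."""
    subjects = list(files)
    with ThreadPoolExecutor(max_workers=resolve_n_jobs(n_jobs)) as pool:
        # map() preserves input order, so results line up with subjects.
        reduced = list(pool.map(reduce_func, (files[s] for s in subjects)))
    return dict(zip(subjects, reduced))


# Hypothetical stand-ins for the contents of three data files.
files = {
    "s1": [1.0, 2.0, 3.0],
    "s2": [10.0, 20.0, 30.0],
    "s3": [2.0, 2.0, 5.0],
}

parallel = reduce_in_parallel(files, n_jobs=2)
print(parallel)  # {'s1': 2.0, 's2': 20.0, 's3': 3.0}
```

Each worker holds one file's data at a time, so more workers also means more files in memory at once. This is the trade-off noted in the parameter description.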

inplace : bool, optional

If True, perform the current function in place and return None.

default = False