BPt.Dataset.filter_outliers_by_std
- Dataset.filter_outliers_by_std(scope='float', n_std=10, drop=True, reduce_func=<function mean>, n_jobs=-1, inplace=False)
This method drops outliers from the requested columns based on comparisons with each column's standard deviation.
Note: This method operates on each of the columns specified by scope independently. If multiple columns are passed, the outliers identified in each column are combined and dropped together after all columns have been processed (so the order of the columns does not matter).
This method can be used with data files as well; the reduce_func and n_jobs parameters apply only in that case.
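For instance, here is a hedged sketch of the data file case (the column name 'timeseries' is hypothetical and not from this documentation): each data file is first reduced to a single scalar by reduce_func, and that scalar is what gets compared against the column's standard deviation.

import numpy as np

# Assumes `data` is a BPt Dataset in which a hypothetical column named
# 'timeseries' has already been loaded as a 'data file'.
data = data.filter_outliers_by_std(scope='timeseries',
                                   n_std=3,                # treated as (3, 3)
                                   reduce_func=np.median,  # reduce each file to one scalar
                                   n_jobs=4)               # load / reduce files on 4 cores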
- Parameters
- scope : Scope
A BPt style Scope determining which column(s) this method is applied to.
default = 'float'
- n_std : float or tuple, optional
This value sets the outlier threshold in units of standard deviation. For example, if n_std = 10 is passed, it is converted internally to (10, 10). A data point in a relevant column (as determined by the scope argument) is treated as an outlier if its value is less than the column's mean - n_std[0] * the column's standard deviation, or greater than the column's mean + n_std[1] * the column's standard deviation (see the sketch after this parameter list).
If a single number is passed, it is applied to both the lower and upper bound. If a tuple with None on one side is passed, e.g. (None, 3), then no filtering is applied on that side.
default = 10
- drop : bool, optional
By default this method will drop any subjects / index that are determined to be outliers. Alternatively, you may set drop=False, in which case the specific values identified as outliers are replaced with NaN instead of dropping the whole row.
default = True
- reduce_func : python function, optional
This function is applied only if the requested column is a 'data file'. In that case, it should accept as input the data loaded from one data file and return a single scalar value. For example, the default is numpy's mean function, which returns one value.
default = np.mean
- n_jobs : int, optional
As with reduce_func, this parameter is only relevant when the passed column is a 'data file'. In that case, it specifies the number of cores used to load each data file and apply reduce_func to it. Passing the number of available cores can provide a significant speed up, but can be memory intensive depending on the size of the underlying files.
If set to -1, all available cores will be used.
default = -1
- inplace : bool, optional
If True, perform the operation in place and return None.
default = False
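To make the n_std rule referenced above concrete, here is a small standalone sketch in plain numpy / pandas. It only illustrates the documented rule; it is not BPt's internal implementation.

import numpy as np
import pandas as pd

col = pd.Series([1, 1, 2, 2, 3, 3, 4, 4, 5, 5])
n_std = (1, 1)  # a single number n is treated as (n, n)

mean, std = col.mean(), np.std(col)
lower = mean - n_std[0] * std  # passing None for the lower side skips this bound
upper = mean + n_std[1] * std  # passing None for the upper side skips this bound

# Values outside (lower, upper) are the ones filter_outliers_by_std flags;
# by default the corresponding rows are dropped (or set to NaN if drop=False).
outlier_mask = (col < lower) | (col > upper)
print(col[outlier_mask])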
Examples
If we define a dataset, we can check its standard deviation and mean.
In [1]: import BPt as bp

In [2]: import numpy as np

In [3]: data = bp.Dataset()

In [4]: data.verbose = 1

In [5]: data['1'] = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

In [6]: np.std(data['1']), np.mean(data['1'])
Out[6]: (1.4142135623730951, 3.0)
We can now see how different thresholds work.
# This won't drop anything
In [7]: data.filter_outliers_by_std(n_std=2)
Out[7]:
   1
0  1
1  1
2  2
3  2
4  3
5  3
6  4
7  4
8  5
9  5

# This will
In [8]: data.filter_outliers_by_std(n_std=1)
Dropped 4 Rows
Out[8]:
   1
2  2
3  2
4  3
5  3
6  4
7  4
What if there was more than one column?
In [9]: data['2'] = [1, 1, 1, 1, 10, 1, 1, 1, 1, 1]

# Now a subject will be dropped
In [10]: data.filter_outliers_by_std(n_std=2)
Dropped 1 Rows
Out[10]:
   1  2
0  1  1
1  1  1
2  2  1
3  2  1
5  3  1
6  4  1
7  4  1
8  5  1
9  5  1

In [11]: data.filter_outliers_by_std(n_std=1)
Dropped 5 Rows
Out[11]:
   1  2
2  2  1
3  2  1
5  3  1
6  4  1
7  4  1
We can also apply the filter to just one column and, instead of dropping subjects, replace the outlier values with NaNs.
In [12]: data.filter_outliers_by_std(n_std=1, scope='1', drop=False)
Out[12]:
     1   2
0  NaN   1
1  NaN   1
2  2.0   1
3  2.0   1
4  3.0  10
5  3.0   1
6  4.0   1
7  4.0   1
8  NaN   1
9  NaN   1
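Finally, a minimal sketch of the inplace parameter (not part of the original example output): with inplace=True the filtering is applied to the Dataset directly and None is returned, so the result should not be re-assigned.

# Modifies `data` directly; nothing is returned.
result = data.filter_outliers_by_std(n_std=1, scope='1', inplace=True)
assert result is None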