BPt.Dataset.filter_outliers_by_std

Dataset.filter_outliers_by_std(scope='float', n_std=10, drop=True, reduce_func=<function mean>, n_jobs=-1, inplace=False)

This method drops outliers from the requested columns by comparing each value against that column's mean and standard deviation.

Note: This method operates on each of the columns specified by scope independently. If multiple columns are passed, outliers are first calculated for every column, and then any row flagged as an outlier in at least one column is dropped (therefore the order of the columns won’t matter).
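
The note above can be illustrated with a minimal sketch in plain pandas and NumPy. This is not BPt's implementation: the helper name outlier_mask is hypothetical, and the use of the population standard deviation (np.nanstd, ddof=0) is an assumption chosen to match the np.std call in the examples further down.

```python
import numpy as np
import pandas as pd

def outlier_mask(col: pd.Series, n_std: float) -> pd.Series:
    """Flag values more than n_std standard deviations from the column mean.

    Hypothetical helper; assumes the population standard deviation (ddof=0).
    """
    values = col.to_numpy(dtype=float)
    mean, std = np.nanmean(values), np.nanstd(values)
    return (col < mean - n_std * std) | (col > mean + n_std * std)

df = pd.DataFrame({
    "1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "2": [1, 1, 1, 1, 10, 1, 1, 1, 1, 1],
})

# Every column's mask is computed on the full data before any row is dropped,
# which is why the order of the columns does not matter.
masks = pd.DataFrame({name: outlier_mask(df[name], n_std=1) for name in df.columns})

# A row is dropped if it is an outlier in at least one column.
to_drop = masks.any(axis=1)
filtered = df[~to_drop]
print(filtered)
```

With this data, the rows remaining in filtered are the same ones kept by the two-column n_std=1 example in the Examples section.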

This method can also be used with data files; the reduce_func and n_jobs parameters are specific to that case.
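
As a rough sketch of the data file case, each data file is reduced to a single scalar, and those scalars are what get compared against the thresholds. The arrays below are hypothetical stand-ins for loaded data files; how BPt actually loads and parallelizes files is not shown here.

```python
import numpy as np

# Hypothetical stand-ins for the arrays loaded from three data files
loaded_files = [
    np.array([[1.0, 2.0], [3.0, 4.0]]),
    np.array([[2.0, 2.0], [2.0, 2.0]]),
    np.array([[10.0, 20.0], [30.0, 40.0]]),
]

# Same function as the documented default: takes one file's data, returns one scalar
reduce_func = np.mean

# One scalar per data file
reduced = np.array([reduce_func(arr) for arr in loaded_files])
print(reduced)
```

Any function with the same shape (one array in, one scalar out), such as np.max or np.median, could be substituted for reduce_func in this sketch.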

Parameters
scope : Scope

A BPt style Scope used to select the subset of column(s) to which the current function is applied. See Scope for more information on how this can be applied.

default = 'float'
n_std : float, tuple, optional

This value sets the outlier threshold in units of standard deviation. For example, if passed n_std = 10, it will be converted internally to (10, 10). A data point within a relevant column (as determined by the scope argument) is considered an outlier if its value is less than the mean of the column - n_std[0] * the standard deviation of the column, or greater than the mean of the column + n_std[1] * the standard deviation of the column.

If a single number is passed, that number is applied to both the lower and upper bounds. If a tuple with None on one side is passed, e.g. (None, 3), then no threshold is applied on that side; in this example, only values above the upper bound are treated as outliers.

default = 10
drop : bool, optional

By default, this function drops any subject / index determined to be an outlier. On the other hand, you may set only the specific outlier values to NaN. To do this, set drop=False; the values identified as outliers will then be replaced with NaN and no rows will be dropped.

default = True
reduce_func : python function, optional

The passed python function will be applied only if the requested col/column is a ‘data file’. In the case that it is, the function should accept as input the data from one data file, and should return a single scalar value. For example, the default value is numpy’s mean function, which returns one value.

default = np.mean
n_jobs : int, optional

As with reduce_func, this parameter is only valid when the passed col/column is a ‘data file’. In that case, this specifies the number of cores to use in loading and applying the reduce_func to each data file. This can provide a significant speed up when passed the number of available cores, but can sometimes be memory intensive depending on the underlying size of the file.

If set to -1, will try to automatically use all available cores.

default = -1
inplace : bool, optional

If True, perform the current function in place and return None.

default = False
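
The n_std handling described above can be sketched as follows. The helper name std_bounds is hypothetical and this is not BPt's internal code; it only mirrors the documented behavior (a scalar becomes a pair, and None disables one side), and it assumes the population standard deviation.

```python
import numpy as np

def std_bounds(values, n_std=10):
    """Return (lower, upper) outlier thresholds; None means that side is unbounded.

    Hypothetical helper; assumes the population standard deviation (ddof=0).
    """
    # A single number applies to both sides, e.g. 10 -> (10, 10)
    if not isinstance(n_std, tuple):
        n_std = (n_std, n_std)

    values = np.asarray(values, dtype=float)
    mean, std = np.nanmean(values), np.nanstd(values)

    lower = None if n_std[0] is None else mean - n_std[0] * std
    upper = None if n_std[1] is None else mean + n_std[1] * std
    return lower, upper

values = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
print(std_bounds(values, n_std=1))          # symmetric thresholds around the mean
print(std_bounds(values, n_std=(None, 1)))  # only an upper threshold
```

Values below lower or above upper would be treated as outliers; with (None, 1), nothing is ever flagged on the low side.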

Examples

If we define a dataset, we can check the std.

In [1]: import BPt as bp

In [2]: import numpy as np

In [3]: data = bp.Dataset()

In [4]: data.verbose = 1

In [5]: data['1'] = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]

In [6]: np.std(data['1']), np.mean(data['1'])
Out[6]: (1.4142135623730951, 3.0)

We can now see how different thresholds work.

# This won't drop anything
In [7]: data.filter_outliers_by_std(n_std=2)
Out[7]: 
   1
0  1
1  1
2  2
3  2
4  3
5  3
6  4
7  4
8  5
9  5

# This will
In [8]: data.filter_outliers_by_std(n_std=1)
Dropped 4 Rows
Out[8]: 
   1
2  2
3  2
4  3
5  3
6  4
7  4

What if there was more than one column?

In [9]: data['2'] = [1, 1, 1, 1, 10, 1, 1, 1, 1, 1]

# Now a subject will be dropped
In [10]: data.filter_outliers_by_std(n_std=2)
Dropped 1 Rows
Out[10]: 
   1  2
0  1  1
1  1  1
2  2  1
3  2  1
5  3  1
6  4  1
7  4  1
8  5  1
9  5  1

In [11]: data.filter_outliers_by_std(n_std=1)
Dropped 5 Rows
Out[11]: 
   1  2
2  2  1
3  2  1
5  3  1
6  4  1
7  4  1

We can also apply it to only one column and, instead of dropping subjects, replace the outliers with NaN values.

In [12]: data.filter_outliers_by_std(n_std=1, scope='1', drop=False)
Out[12]: 
     1   2
0  NaN   1
1  NaN   1
2  2.0   1
3  2.0   1
4  3.0  10
5  3.0   1
6  4.0   1
7  4.0   1
8  NaN   1
9  NaN   1
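
For comparison, the drop=False behavior in the last example can be approximated in plain pandas. This is only a sketch under the assumption that the population standard deviation is used; it does not call BPt.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "2": [1, 1, 1, 1, 10, 1, 1, 1, 1, 1],
})

col = df["1"]
mean, std = np.mean(col.to_numpy()), np.std(col.to_numpy())

# n_std=1 on column '1' only
is_outlier = (col < mean - std) | (col > mean + std)

result = df.copy()
# Series.mask replaces flagged values with NaN, so the integer column becomes float
result["1"] = col.mask(is_outlier)
print(result)
```

Because NaN cannot be stored in an integer column, column '1' is shown as floats (2.0, 3.0, ...), while column '2' keeps its integer values, including the 10 that would be an outlier if that column were in scope.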