BPt.Dataset.drop_cols_by_nan

Dataset.drop_cols_by_nan(scope='all', threshold=0.5, inplace=False)[source]

This method drops columns based on the amount of missing values per column, removing any column that meets or exceeds a user-defined threshold of NaN values.

Parameters
scope : Scope

A BPt style Scope used to select the subset of column(s) to which the current function is applied. See Scope for more information on how this can be applied.

default = 'all'
threshold : float or int, optional

Passed as either a float between 0 and 1, or as an int greater than 1. If between 0 and 1, this parameter represents a proportion threshold: any column in which this fraction or more of the values are NaN will be dropped.
If passed a value greater than 1, the threshold instead represents an absolute count: any column with that number or more of NaN values will be dropped.
For example, if a column within a dataset containing 10 total rows has 3 non-missing values and 7 missing values, a threshold of .7 or lower will drop the column, while anything above .7 will not.

default = .5
inplace : bool, optional

If True, perform the current function inplace and return None.

default = False
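The threshold behavior described above can be sketched with plain pandas. This is a simplified approximation of the documented semantics, not BPt's actual implementation: a float threshold between 0 and 1 is first scaled by the number of rows into an absolute NaN count (which is the count printed in the examples below), and any column whose NaN count meets or exceeds it is dropped.

```python
import pandas as pd
import numpy as np

def drop_cols_by_nan_sketch(df, threshold=0.5):
    """Sketch of the documented threshold logic (not BPt's code).

    A float strictly between 0 and 1 is treated as a proportion and
    scaled by the number of rows; a value greater than 1 is used as
    an absolute NaN count directly.
    """
    if 0 < threshold < 1:
        threshold = threshold * len(df)
    print('Setting NaN threshold to:', threshold)

    # Count NaN values per column, then drop any column at or above threshold
    nan_counts = df.isna().sum()
    to_drop = nan_counts[nan_counts >= threshold].index
    print('Dropped', len(to_drop), 'Columns')
    return df.drop(columns=to_drop)

df = pd.DataFrame({'animals': ['cat', 'cat', 'dog', 'dog', 'elk'],
                   'numbers': [1.0, 2.0, 1.0, 2.0, np.nan]})

dropped = drop_cols_by_nan_sketch(df, threshold=.1)  # 0.5 absolute: drops 'numbers'
kept = drop_cols_by_nan_sketch(df, threshold=.5)     # 2.5 absolute: keeps both
```

With 5 rows, a float threshold of .1 scales to an absolute count of 0.5, so the numbers column (1 NaN) is dropped; a threshold of .5 scales to 2.5, so nothing is dropped.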

Examples

Consider a brief example below where we first load in a simple Dataset and then apply the drop_cols_by_nan method.

In [1]: data = bp.read_csv('data/example1.csv')

In [2]: data
Out[2]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN

In [3]: data.drop_cols_by_nan(threshold=.1)
Setting NaN threshold to: 0.5
Dropped 1 Columns
Out[3]: 
  animals
0   'cat'
1   'cat'
2   'dog'
3   'dog'
4   'elk'

Alternatively, note that if we pass a threshold above .2, then no columns will be dropped, since the numbers column is only 20% NaN (1 of 5 rows).

In [4]: data.drop_cols_by_nan(threshold=.5)
Setting NaN threshold to: 2.5
Out[4]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN
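The absolute-count case can also be mirrored with pandas' own DataFrame.dropna, whose thresh parameter counts non-NaN values rather than NaN values. This is a sketch of the equivalence, not BPt's implementation: dropping columns with k or more NaNs is the same as keeping columns with at least len(df) - k + 1 non-NaN values.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'animals': ['cat', 'cat', 'dog', 'dog', 'elk'],
                   'numbers': [1.0, 2.0, 1.0, 2.0, np.nan]})

# Drop columns with k or more NaN values (an absolute int threshold of 1):
# equivalently, keep columns with at least len(df) - k + 1 non-NaN values.
k = 1
result = df.dropna(axis=1, thresh=len(df) - k + 1)
print(result.columns.tolist())  # 'numbers' (1 NaN) is dropped
```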