BPt.Dataset.summary

Dataset.summary(scope, subjects='all', measures=['count', 'nan count', 'mean', 'max', 'min', 'std', 'var', 'skew', 'kurtosis'], cat_measures=['count', 'freq', 'nan count'], decode_values=True, save_file=None, decimals=3, reduce_func=np.mean, n_jobs=-1)

This method is used to generate a summary across some data.
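The float summary measures correspond to standard pandas reductions. The following is an illustration of what is computed, not BPt's internal implementation; the column name is hypothetical:

```python
import numpy as np
import pandas as pd

# Toy float column with one missing value.
col = pd.Series([1.0, 2.0, 3.0, 4.0, np.nan], name="anthro_height")

# The default float measures as plain pandas reductions.
# Missing values are excluded from every statistic except 'nan count'.
summary = {
    "count": col.count(),           # number of non-missing data points
    "nan count": col.isna().sum(),  # number of missing data points
    "mean": col.mean(),
    "max": col.max(),
    "min": col.min(),
    "std": col.std(),
    "var": col.var(),
    "skew": col.skew(),
    "kurtosis": col.kurtosis(),
}
print(summary["count"], summary["nan count"], summary["mean"])
```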

Parameters
scope : Scope

A BPt style Scope used to select a subset of column(s) in which to apply the current function to. See Scope for more information on how this can be applied.

subjects : Subjects

This argument can be any of the BPt accepted subject style inputs. E.g., None, ‘nan’ for subjects with any nan data, ‘not nan’ for subjects without any, the str location of a file formatted with one subject per line, or directly an array-like of subjects, to list a few options.

See Subjects for all options, and a more detailed description of the already mentioned options.

measures : list of str, optional

The summary measures which should be computed for any float / continuous type columns within the passed scope.

Valid options are:

  • ‘count’

    Calculates the number of non-missing data points for each column.

  • ‘nan count’

Calculates the number of missing data points in each column; these missing values are excluded from the other statistics.

  • ‘mean’

    Calculates the mean value for each column.

  • ‘max’

    Calculates the maximum value for each column.

  • ‘min’

    Calculates the minimum value for each column.

  • ‘std’

    Calculates the standard deviation for each column.

  • ‘var’

    Calculates the variance for each column.

  • ‘skew’

    Calculates the skew for each column.

  • ‘kurtosis’

    Calculates the kurtosis for each column.

  • ‘mean +- std’

Returns the mean and standard deviation together as a single str, each rounded to decimals, formatted as mean ± std.

These values should be passed as a list.

default =  ['count', 'nan count',
            'mean', 'max', 'min',
            'std', 'var',
            'skew', 'kurtosis']
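To illustrate the special ‘mean +- std’ option, a minimal sketch of the kind of string produced, with both values rounded to decimals (the exact formatting BPt uses may differ):

```python
import pandas as pd

col = pd.Series([1.0, 2.0, 3.0, 4.0])

# 'mean +- std' combines two measures into one string,
# each rounded to `decimals` (3 by default).
decimals = 3
mean_pm_std = f"{round(col.mean(), decimals)} ± {round(col.std(), decimals)}"
print(mean_pm_std)
```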
cat_measures : list of str, optional

These measures will be used to compute statistics for every categorical column within the passed scope. Likewise, these measures will be used to compute statistics for each unique class value within each categorical column.

Valid options are:

  • ‘count’

    Calculates the number of non-missing data points for each column and unique value.

  • ‘freq’

Calculates the proportion of values that each unique value makes up. Note: for whole-column measures this will always be 1.

  • ‘nan count’

Calculates the number of missing data points in each column, which are excluded from the other statistics. Note: for class values this will always be 0.

These values should be passed as a list.

default =  ['count', 'freq', 'nan count']
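The categorical measures can likewise be illustrated with plain pandas (a sketch of what is computed, not BPt's internal code; the column name is hypothetical):

```python
import numpy as np
import pandas as pd

col = pd.Series(["a", "a", "b", np.nan], name="sex")

# Whole-column measures.
count = col.count()           # non-missing values
nan_count = col.isna().sum()  # missing values

# Per-class measures: 'count' and 'freq' for each unique value.
# 'freq' is the proportion of non-missing values each class makes up.
class_counts = col.value_counts()
class_freqs = col.value_counts(normalize=True)
print(dict(class_counts), dict(class_freqs))
```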
decode_values : bool, optional

When handling categorical variables that have been encoded through a BPt dataset method, e.g., Dataset.ordinalize(), you may use either the original categorical values before encoding (decode_values = True) or the current internal values (decode_values = False).

default = True
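decode_values only changes which labels appear in the output. A sketch with pandas of what encoding and decoding look like in general (Dataset.ordinalize's exact internal mapping is managed by BPt):

```python
import pandas as pd

# A column that has been ordinally encoded: internally stored as
# integer codes, with a mapping back to the original categories.
original = pd.Series(["low", "high", "low", "medium"])
cat = pd.Categorical(original,
                     categories=["low", "medium", "high"],
                     ordered=True)

encoded = pd.Series(cat.codes)  # what decode_values=False would label by
decoded = pd.Series(cat)        # what decode_values=True would label by

print(list(encoded), list(decoded.astype(str)))
```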
save_file : None or str, optional

You may optionally save this summary as a table in a docx file. If set to a str, the string should be the path to a docx file; if that file already exists, the table will be added to it, and if it does not, a new file containing the summary table will be created.

Keep as None, to skip this option.

default = None
decimals : int, optional

If save_file is not None, then this parameter sets the number of decimal places to which values in the saved table will be rounded.

This parameter is also used when a special str measure is requested, e.g., ‘mean +- std’.

default = 3
reduce_func : python function, optional

The passed python function is applied only if a requested column is of type ‘data file’. In that case, the function should accept as input the loaded data from one data file, and should return a single scalar value. For example, the default value is numpy’s mean function, which reduces each file to a single value.

default = np.mean
n_jobs : int, optional

As with reduce_func, this parameter is only relevant when a requested column is of type ‘data file’. In that case, it specifies the number of cores to use when loading each data file and applying reduce_func. Setting this to the number of available cores can provide a significant speed up, but can sometimes be memory intensive depending on the size of the underlying files.

If set to -1, will try to automatically use all available cores.

default = -1
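For ‘data file’ columns, each file is first reduced to one scalar before the summary statistics are computed. A minimal sketch of the role reduce_func plays (file loading and the n_jobs parallelism are handled by BPt internally; the arrays here stand in for loaded files):

```python
import numpy as np

# Simulate three loaded 'data files', each an array of values.
data_files = [np.array([1.0, 2.0, 3.0]),
              np.array([4.0, 5.0]),
              np.array([6.0])]

# reduce_func must map one file's data to a single scalar;
# np.mean (the default) does exactly that. BPt would then compute
# the requested summary measures over these per-file scalars.
reduce_func = np.mean
reduced = [float(reduce_func(f)) for f in data_files]
print(reduced)  # [2.0, 4.5, 6.0]
```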
Returns
cont_info_df : pandas DataFrame

A dataframe containing the summary statistics as computed for any float / continuous type data. If the passed scope contains no float columns, this DataFrame will be empty.

This corresponds to the measures argument.

cat_info_df : pandas DataFrame

A dataframe containing the summary statistics as computed for any categorical type data. If the passed scope contains no categorical columns, this DataFrame will be empty.

This corresponds to the cat_measures argument.