Dataset#

Constructor#

Dataset([data, index, columns, dtype, copy, ...])

The BPt Dataset class is the main class used for preparing data.
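
A minimal sketch of creating a Dataset, assuming the package imports as BPt; the column and subject names below are placeholders:

    import BPt as bp

    # The Dataset constructor accepts the same arguments as pandas.DataFrame
    data = bp.Dataset({'age': [22, 31, 47, 25],
                       'sex': [0, 1, 1, 0],
                       'score': [1.2, 0.7, 2.3, 1.9]},
                      index=['s1', 's2', 's3', 's4'])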

Base#

Dataset.get_cols(scope[, limit_to])

This method is the main internal and external facing way of getting the names of columns which match a passed scope from the Dataset.
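
For example, a short sketch of selecting columns by scope (the 'category' scope is assumed to have been added to some columns first):

    # Names of all loaded columns
    all_cols = data.get_cols('all')

    # Only the columns tagged with the 'category' scope
    cat_cols = data.get_cols('category')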

Dataset.get_subjects(subjects[, return_as, ...])

Method to get a set of subjects, from a set of already loaded ones, or from a saved location.

Dataset.get_values(col[, dropna, ...])

This method is used to obtain either the normally loaded and stored values from a passed column or, in the case of a data file column, the data file proxy values.
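
A brief sketch of both subject and value access; the column name 'age' is a placeholder and only the default return style is shown:

    # The currently loaded subjects
    subjects = data.get_subjects('all')

    # Values from a single column, optionally with missing values dropped
    age_values = data.get_values('age', dropna=True)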

Dataset.add_scope(scope, scope_val[, inplace])

This method is designed as a helper for adding a new scope value to a number of columns at once, using the existing scope system.
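
For example, a sketch of tagging columns with a custom scope value (the column and scope names are placeholders):

    # Tag two columns with a new scope value, then select by it later
    data = data.add_scope(['age', 'score'], 'covars')
    covar_cols = data.get_cols('covars')

As with the other methods that accept an inplace parameter, the default behavior is to return a modified copy rather than change the Dataset in place.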

Dataset.remove_scope(scope, scope_val[, inplace])

This method is used for removing scopes from an existing column or subset of columns, as selected by the scope parameter.

Dataset.set_role(scope, role[, inplace])

This method is used to set a role for a single column or multiple columns, as selected by the scope parameter.

Dataset.set_roles(scopes_to_roles[, inplace])

This method is used to set multiple roles at once, via a dictionary with scopes as keys and, as values, the role to set for all columns matching that scope.
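
A minimal sketch of assigning roles, one column at a time or through a scope-to-role dictionary (column names are placeholders):

    # Mark a single column as the target
    data = data.set_role('score', 'target')

    # Or set several roles at once
    data = data.set_roles({'score': 'target', 'age': 'non input'})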

Dataset.get_roles()

This function can be used to get a dictionary with the currently loaded roles. See Role for more information on how roles are defined and used within BPt.

Dataset.rename([mapper, index, columns, ...])

Calls the corresponding pandas method; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html

Dataset.copy([deep])

Creates and returns a copy of this Dataset, either a deep or shallow copy.

Dataset.auto_detect_categorical([scope, ...])

This function will attempt to automatically add scope "category" to any loaded categorical variables.

Dataset.get_Xy([problem_spec])

This function is used to get a sklearn-style grouping of input data (X) and target data (y) from the Dataset, according to a passed problem_spec.
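
A hedged sketch of pulling sklearn-style arrays from the Dataset, assuming a ProblemSpec with a target parameter and a placeholder target column:

    import BPt as bp

    ps = bp.ProblemSpec(target='score')
    X, y = data.get_Xy(problem_spec=ps)

    # X and y can now be passed to any sklearn-style estimator
    print(X.shape, y.shape)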

Dataset.get_permuted_Xy([problem_spec, ...])

This method is otherwise identical to Dataset.get_Xy(), except a version of X, y where the values in y are permuted is returned.

Dataset.split_by(scope[, decode_values])

This method allows splitting the dataset into sub datasets by the different unique values of a passed scope.

Encoding#

Dataset.to_binary(scope[, drop, inplace])

This method works by setting all columns within scope to just two binary categories.

Dataset.binarize(scope, threshold[, ...])

This method contains utilities for binarizing a variable.
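
For example, a short sketch of both binary encodings (the column names and threshold are placeholders):

    # Collapse an already two-valued column into binary categories
    data = data.to_binary('sex')

    # Binarize a continuous column around a cutoff value
    data = data.binarize('score', threshold=1.5)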

Dataset.k_bin(scope[, n_bins, strategy, inplace])

This method is used to apply k-binning to a column or columns.

Dataset.ordinalize(scope[, nan_to_class, ...])

This method is used to ordinalize a group of columns.
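
A brief sketch of k-binning and ordinalizing; the column names are placeholders and the specific strategy string is an assumption:

    # Discretize a continuous column into 5 bins
    data = data.k_bin('age', n_bins=5, strategy='uniform')

    # Encode string categories as ordinal integers
    data = data.ordinalize('site')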

Dataset.nan_to_class([scope, inplace])

This method will cast any passed columns that are not already categorical to categorical.

Dataset.copy_as_non_input(col, new_col[, ...])

This method is used for making a copy of an existing column, ordinalizing it, and then setting its role to 'non input'.
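
For example, a sketch of keeping an ordinalized, non-input copy of a column (the new column name is a placeholder):

    # 'site' is left unchanged; 'site_copy' is ordinalized
    # and given the role 'non input'
    data = data.copy_as_non_input('site', 'site_copy')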

Dataset.add_unique_overlap(cols, new_col[, ...])

This function is designed to add a new column containing the unique overlap of values across the passed columns.

Data Files#

Dataset.add_data_files(files[, ...])

This method allows adding columns of type 'data file' to the Dataset class.
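
A hedged sketch of adding data file columns; the files dictionary maps a new column name to a list of file paths, and the file_to_subject argument shown here is an assumption about how paths are mapped back to index values:

    # Hypothetical helper mapping a file path back to a subject / index value
    def file_to_subject(path):
        return path.split('/')[-1].split('.')[0]

    files = {'timeseries': ['data/s1.npy', 'data/s2.npy',
                            'data/s3.npy', 'data/s4.npy']}
    data = data.add_data_files(files, file_to_subject=file_to_subject)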

Dataset.to_data_file(scope[, load_func, inplace])

This method can be used to cast any existing columns where the values are file paths, to a data file.

Dataset.consolidate_data_files(save_dr[, ...])

This function is designed as a helper to consolidate all or a subset of the loaded data files into one column.

Dataset.update_data_file_paths(old, new)

Go through and update saved file paths within the Dataset's file mapping attribute.

Dataset.get_file_mapping([cols])

This function is used to access the up-to-date file mapping.

Filtering & Drop#

Dataset.filter_outliers_by_std([scope, ...])

This method is designed to allow dropping outliers from the requested columns based on comparisons with that column's standard deviation.
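
For example, a sketch of standard-deviation based filtering; the n_std keyword is an assumption about the elided parameters and the scope is a placeholder:

    # Drop subjects more than 3 standard deviations from the mean
    # in any float-type column
    data = data.filter_outliers_by_std(scope='float', n_std=3)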

Dataset.filter_outliers_by_percent([scope, ...])

This method is designed to allow dropping a fixed percent of outliers from the requested columns.

Dataset.filter_categorical_by_percent([...])

This method is designed to allow performing outlier filtering on categorical type variables.

Dataset.drop_cols([scope, exclusions, ...])

This method is designed to allow dropping columns based on some flexible arguments.

Dataset.drop_nan_subjects(scope[, inplace])

This method is used for dropping all of the subjects which have NaN values for a given scope / column.

Dataset.drop_subjects_by_nan([scope, ...])

This method is used for dropping subjects based on the amount of missing values found across a subset of columns as selected by scope.

Dataset.drop_cols_by_unique_val([scope, ...])

This method will drop any columns with a number of unique values less than or equal to a specified threshold.

Dataset.drop_cols_by_nan([scope, threshold, ...])

This method is used for dropping columns based on the amount of missing values per column, dropping any which exceed a user defined threshold.
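
A short sketch of the NaN-based drops above; treating the threshold as a fraction of missing values is an assumption:

    # Drop subjects with a missing value in any target column
    data = data.drop_nan_subjects(scope='target')

    # Drop any column missing more than half of its values
    data = data.drop_cols_by_nan(scope='all', threshold=.5)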

Dataset.drop_id_cols([scope, inplace])

This method will drop any str-type / object-type columns where the number of unique values is equal to the length of the dataframe.

Dataset.drop_duplicate_cols([scope, inplace])

This method is used for checking to see if there are any columns loaded with duplicate values.

Dataset.apply_inclusions(subjects[, inplace])

This method will drop all subjects that do not overlap with the subjects passed to this function.

Dataset.apply_exclusions(subjects[, inplace])

This method will drop all subjects that overlap with the subjects passed to this function.

Plotting / Viewing#

Dataset.plot(scope[, subjects, cut, ...])

This function creates plots for each of the passed columns (as specified by scope) separately.

Dataset.plots(scope[, subjects, ncols, ...])

This function creates a multi-figure plot containing all of the passed columns (as specified by scope) in their own axes.

Dataset.plot_bivar(scope1, scope2[, ...])

This method can be used to plot the relationship between two variables.
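
For example, a sketch of the plotting helpers (column names are placeholders):

    # One figure per matching column
    data.plot('category')

    # All float-type columns in a single multi-axes figure
    data.plots('float', ncols=3)

    # Relationship between two specific columns
    data.plot_bivar('age', 'score')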

Dataset.nan_info([scope])

Dataset.summary(scope[, subjects, measures, ...])

This method is used to generate a summary across some data.

Dataset.display_scopes()

Display an HTML representation of the Dataset, as split by scope, instead of the default repr html as split by role.

Train / Test Split#

Dataset.set_test_split([size, subjects, ...])

Defines a set of subjects to be reserved as test subjects.

Dataset.set_train_split([size, subjects, ...])

Defines a set of subjects to be reserved as train subjects.

Dataset.test_split([size, subjects, ...])

This method defines and returns a Train and Test Dataset.

Dataset.train_split([size, subjects, ...])

This method defines and returns a Train and Test Dataset.
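
A minimal sketch of both split styles, assuming set_test_split follows the same copy-and-return pattern as the other Dataset methods; the split size is arbitrary:

    # Store a 20% test split on the Dataset itself
    data = data.set_test_split(size=.2)

    # Or get two separate Dataset objects back
    train_data, test_data = data.test_split(size=.2)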

Dataset.save_test_split(loc)

Saves the currently defined test subjects in a text file with one subject / index per line.

Dataset.save_train_split(loc)

Saves the currently defined train subjects in a text file with one subject / index per line.