Dataset#

Constructor#

Dataset([data, index, columns, dtype, copy, ...])

The BPt Dataset class is the main class used for preparing data

Base#

`Dataset.get_cols`(scope[, limit_to])	This method is the main internal and external facing way of getting the names of columns which match a passed scope from the Dataset.
`Dataset.get_subjects`(subjects[, return_as, ...])	Method to get a set of subjects, from a set of already loaded ones, or from a saved location.
`Dataset.get_values`(col[, dropna, ...])	This method is used to obtain either the normally loaded and stored values from a passed column, or in the case of a data file column, the data file proxy values will be loaded.
`Dataset.add_scope`(scope, scope_val[, inplace])	This method is designed as a helper for adding a new scope val to a number of columns at once, using the existing scope system.
`Dataset.remove_scope`(scope, scope_val[, inplace])	This method is used for removing scopes from an existing column or subset of columns, as selected by the scope parameter.
`Dataset.set_role`(scope, role[, inplace])	This method is used to set a role for either a single column or multiple, as set through the scope parameter.
`Dataset.set_roles`(scopes_to_roles[, inplace])	This method is used to set multiple roles across multiple scopes as specified by a passed dictionary with keys as scopes and values as the role to set for all columns corresponding to that scope.
`Dataset.get_roles`()	This function can be used to get a dictionary with the currently loaded roles. See Role for more information on how roles are defined and used within BPt.
`Dataset.rename`([mapper, index, columns, ...])	Calls method according to: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
`Dataset.copy`([deep])	Creates and returns a copy of this dataset, either a deep copy or shallow.
`Dataset.auto_detect_categorical`([scope, ...])	This function will attempt to automatically add scope "category" to any loaded categorical variables.
`Dataset.get_Xy`([problem_spec])	This function is used to get a sklearn-style grouping of input data (X) and target data (y) from the Dataset according to a passed problem_spec.
`Dataset.get_permuted_Xy`([problem_spec, ...])	This method is otherwise identical to `Dataset.get_Xy()`, except a version of X, y where the values in y are permuted is returned.
`Dataset.split_by`(scope[, decode_values])	This method allows splitting the dataset into sub datasets by the different unique values of a passed scope.
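
The scope-based selection that the methods above rely on can be pictured with a minimal sketch. This is plain Python, not BPt's implementation: the `scopes` mapping and the `cols_matching` and `add_scope` helpers are hypothetical names introduced only for illustration, and the sketch assumes that a scope can be either a column name or a tag shared by several columns.

```python
# Hypothetical sketch of scope-based column selection (not BPt's actual code).
# Each column carries a set of scope tags; a scope selects columns either
# by column name or by tag.
scopes = {
    "age": {"demographic"},
    "sex": {"demographic", "category"},
    "thickness": {"imaging"},
}


def cols_matching(scope, scopes):
    """Return the sorted column names whose name or tags match `scope`."""
    return sorted(
        col for col, tags in scopes.items()
        if col == scope or scope in tags
    )


def add_scope(cols, scope_val, scopes):
    """Return a new mapping with `scope_val` added to every column in `cols`."""
    return {
        col: (tags | {scope_val}) if col in cols else set(tags)
        for col, tags in scopes.items()
    }


demographic_cols = cols_matching("demographic", scopes)
updated = add_scope(demographic_cols, "covariate", scopes)
print(demographic_cols)
print(cols_matching("covariate", updated))
```

Returning a new mapping rather than mutating the input loosely mirrors the `inplace` option listed on several of the methods above, with the non-inplace behaviour chosen here for simplicity.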

Encoding#

`Dataset.to_binary`(scope[, drop, inplace])	This method works by setting all columns within scope to just two binary categories.
`Dataset.binarize`(scope, threshold[, ...])	This method contains utilities for binarizing a variable.
`Dataset.k_bin`(scope[, n_bins, strategy, inplace])	This method is used to apply k binning to a column, or columns.
`Dataset.ordinalize`(scope[, nan_to_class, ...])	This method is used to ordinalize a group of columns.
`Dataset.nan_to_class`([scope, inplace])	This method will cast any columns that were not categorical that are passed here to categorical.
`Dataset.copy_as_non_input`(col, new_col[, ...])	This method is used for making a copy of an existing column, ordinalizing it and then setting it to have role = non input.
`Dataset.add_unique_overlap`(cols, new_col[, ...])	This function is designed to add a new column
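
As a rough illustration of two of the encoding ideas listed above, thresholded binarization and ordinalization, here is a minimal plain-Python sketch. It is not BPt's implementation: the helper names, the strict `>` comparison, and the choice to order categories by sorting are all assumptions made for the example.

```python
# Hypothetical sketch of two encoding ideas (not BPt's actual code).

def binarize(values, threshold):
    """Map each value to 1 if it is strictly above `threshold`, else 0."""
    return [1 if v > threshold else 0 for v in values]


def ordinalize(values):
    """Replace each value with an integer code for its category.

    Categories are ordered by sorting the unique values, so the codes
    run from 0 to n_categories - 1. Returns the codes and the mapping.
    """
    codes = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


scores = [2.5, 7.0, 4.9, 5.1]
binary = binarize(scores, threshold=5)

sites = ["site_b", "site_a", "site_c", "site_a"]
encoded, mapping = ordinalize(sites)
print(binary)
print(encoded, mapping)
```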

Data Files#

`Dataset.add_data_files`(files[, ...])	This method allows adding columns of type 'data file' to the Dataset class.
`Dataset.to_data_file`(scope[, load_func, inplace])	This method can be used to cast any existing columns where the values are file paths, to a data file.
`Dataset.consolidate_data_files`(save_dr[, ...])	This function is designed as a helper to consolidate all or a subset of the loaded data files into one column.
`Dataset.update_data_file_paths`(old, new)	Go through and update saved file paths within the Dataset's file mapping attribute.
`Dataset.get_file_mapping`([cols])	This function is used to access the up-to-date file mapping.
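
The idea of updating saved file paths, replacing an `old` portion of each path with a `new` one, can be sketched as follows. This is a hypothetical illustration only: it assumes the file mapping is a dictionary from integer keys to path strings, which may not match how BPt stores data files internally, and `update_paths` is an invented helper name.

```python
# Hypothetical sketch of rewriting stored file paths (not BPt's actual code).
# Assumes a file mapping of integer keys to path strings.
file_mapping = {
    0: "/old/root/sub-01/timeseries.npy",
    1: "/old/root/sub-02/timeseries.npy",
}


def update_paths(mapping, old, new):
    """Return a new mapping with `old` replaced by `new` in every path."""
    return {key: path.replace(old, new) for key, path in mapping.items()}


moved = update_paths(file_mapping, "/old/root", "/new/root")
print(moved)
```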

Filtering & Drop#

`Dataset.filter_outliers_by_std`([scope, ...])	This method is designed to allow dropping outliers from the requested columns based on comparisons with that column's standard deviation.
`Dataset.filter_outliers_by_percent`([scope, ...])	This method is designed to allow dropping a fixed percent of outliers from the requested columns.
`Dataset.filter_categorical_by_percent`([...])	This method is designed to allow performing outlier filtering on categorical type variables.
`Dataset.drop_cols`([scope, exclusions, ...])	This method is designed to allow dropping columns based on some flexible arguments.
`Dataset.drop_nan_subjects`(scope[, inplace])	This method is used for dropping all of the subjects which have NaN values for a given scope / column.
`Dataset.drop_subjects_by_nan`([scope, ...])	This method is used for dropping subjects based on the amount of missing values found across a subset of columns as selected by scope.
`Dataset.drop_cols_by_unique_val`([scope, ...])	This method will drop any columns whose number of unique values is less than or equal to a set threshold.
`Dataset.drop_cols_by_nan`([scope, threshold, ...])	This method is used for dropping columns based on the amount of missing values per column, dropping any which exceed a user defined threshold.
`Dataset.drop_id_cols`([scope, inplace])	This method will drop any str-type / object type columns where the number of unique values is equal to the length of the dataframe.
`Dataset.drop_duplicate_cols`([scope, inplace])	This method is used for checking to see if there are any columns loaded with duplicate values.
`Dataset.apply_inclusions`(subjects[, inplace])	This method will drop all subjects that do not overlap with the passed subjects to this function.
`Dataset.apply_exclusions`(subjects[, inplace])	This method will drop all subjects that overlap with the passed subjects to this function.
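
To make the standard-deviation-based filtering described above concrete, here is a minimal plain-Python sketch. It is not BPt's implementation: the function below is a standalone helper that only shares the idea, its `n_std` parameter and the use of the sample standard deviation are assumptions, and the subject labels and values are made up for the example.

```python
# Hypothetical sketch of outlier filtering by standard deviation
# (not BPt's actual code).
from statistics import mean, stdev


def filter_outliers_by_std(values_by_subject, n_std):
    """Keep subjects whose value lies within `n_std` sample standard
    deviations of the mean; drop the rest."""
    vals = list(values_by_subject.values())
    center = mean(vals)
    spread = stdev(vals)
    lower = center - n_std * spread
    upper = center + n_std * spread
    return {
        subject: value
        for subject, value in values_by_subject.items()
        if lower <= value <= upper
    }


values = {"s1": 10.0, "s2": 11.0, "s3": 9.0, "s4": 10.5, "s5": 9.5, "s6": 40.0}
kept = filter_outliers_by_std(values, n_std=2)
print(sorted(kept))
```

Note that the mean and standard deviation are computed with the outlier included, so a single extreme value widens the accepted range; this is a property of the simple approach sketched here.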

Plotting / Viewing#

`Dataset.plot`(scope[, subjects, cut, ...])	This function creates plots for each of the passed columns (as specified by scope) separately.
`Dataset.plots`(scope[, subjects, ncols, ...])	This function creates a multi-figure plot containing all of the passed columns (as specified by scope) in their own axes.
`Dataset.plot_bivar`(scope1, scope2[, ...])	This method can be used to plot the relationship between two variables.
`Dataset.nan_info`([scope])
`Dataset.summary`(scope[, subjects, measures, ...])	This method is used to generate a summary across some data.
`Dataset.display_scopes`()	Display an HTML representation of the Dataset, as split by scope, instead of the default repr html as split by role.

Train / Test Split#

`Dataset.set_test_split`([size, subjects, ...])	Defines a set of subjects to be reserved as test subjects.
`Dataset.set_train_split`([size, subjects, ...])	Defines a set of subjects to be reserved as train subjects.
`Dataset.test_split`([size, subjects, ...])	This method defines and returns a Train and Test Dataset
`Dataset.train_split`([size, subjects, ...])	This method defines and returns a Train and Test Dataset
`Dataset.save_test_split`(loc)	Saves the currently defined test subjects in a text file with one subject / index per line.
`Dataset.save_train_split`(loc)	Saves the currently defined train subjects in a text file with one subject / index per line.
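
The workflow implied by the methods above, reserving a set of test subjects and saving them one per line, can be sketched in plain Python. This is not BPt's implementation: `make_test_split` and `save_subjects` are hypothetical helpers, the `random_state` parameter and the subject labels are assumptions for the example, and the file is written to a temporary directory purely for demonstration.

```python
# Hypothetical sketch of defining and saving a test split
# (not BPt's actual code).
import random
import tempfile
from pathlib import Path


def make_test_split(subjects, size, random_state):
    """Reserve a fraction `size` of subjects as test; the rest are train."""
    rng = random.Random(random_state)
    ordered = sorted(subjects)
    n_test = round(len(ordered) * size)
    test = set(rng.sample(ordered, n_test))
    train = set(ordered) - test
    return train, test


def save_subjects(subjects, loc):
    """Write one subject per line to the text file at `loc`."""
    Path(loc).write_text("\n".join(sorted(subjects)) + "\n")


subjects = [f"sub-{i:02d}" for i in range(10)]
train, test = make_test_split(subjects, size=0.2, random_state=0)

with tempfile.TemporaryDirectory() as tmp:
    loc = Path(tmp) / "test_subjects.txt"
    save_subjects(test, loc)
    # Reading the file back recovers the same set of test subjects.
    reloaded = set(loc.read_text().split())

print(len(train), len(test))
```

Saving the split as plain text keeps it reusable across analyses, since the same subjects can be reloaded later and passed wherever a list of subjects is accepted.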