Loading Data

Intro

To get data ready for machine learning, BPt provides a specially designed Dataset class, built on top of the DataFrame class from the pandas library. As we will see, the recommended way of preparing data actually starts with using the pandas DataFrame class directly. The general idea is to use pandas and the DataFrame class to load all of the data you might end up wanting to use; luckily, pandas already contains a wealth of useful functions for accomplishing this. Next, once all of the data is loaded, we cast the DataFrame to the BPt Dataset class, and then use the built-in Dataset methods to get the data ready for use with the rest of BPt. This includes steps like specifying the role of each variable (e.g., target variables vs. data variables), outlier detection, transformations like binning and converting to binary, plotting / viewing distributions, and specifying a global train / test split. We will introduce all of this functionality below!

Data of interest is inevitably going to come from a wide range of different sources. Luckily, the Python library pandas has an incredible amount of support for loading data from different sources into DataFrames. Likewise, pandas offers a huge amount of support material, e.g., [here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html), for getting started with loading in raw data (or a Google search with a specific question will almost always help). Pandas should be used to accomplish the initial loading and merging of all tabular data of interest into a single DataFrame.

For example, let’s say our data of interest is stored in a file called data.csv. We could load it with:

import pandas as pd

data = pd.read_csv('data.csv')

Next, let’s say the subject identifiers are stored in a column called ‘subject’; we can set this column as the index with another call to the native pandas API.

data = data.set_index('subject')

Then, when we are finished loading and merging the data of interest into a DataFrame, we can cast it to a BPt Dataset!

from BPt import Dataset
data = Dataset(data)

We can still use a number of the native pandas API methods, now in addition to the added functionality of the BPt Dataset!
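As a minimal sketch of this (assuming the Dataset constructor accepts the same arguments as a pandas DataFrame), both inherited pandas methods and BPt-specific functionality can be called on the same object:

import BPt as bp

data = bp.Dataset({'col 1': [1, 2, 3]})

# Inherited pandas methods continue to work
print(data.shape)   # (3, 1)
print(data.head())

# BPt-specific additions are also available, e.g., the roles attribute
print(data.roles)   # {'col 1': 'input data'}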

There are a few key concepts that are important to know when using Dataset: Role, Scope, Subjects, Data Types, and Data Files.

Warnings when using Dataset:

Column names within the Dataset class must be strings in order for the concept of scopes to work. Therefore, if any columns are loaded with non-string names, they will be renamed to the string version of that name.
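For instance, a rough sketch of the renaming behavior described above (exactly when the rename is applied may vary by BPt version):

import pandas as pd
import BPt as bp

# A DataFrame with a non-string (integer) column name
df = pd.DataFrame({0: [1, 2, 3]})

# Once cast to a Dataset, the column should be accessible as the string '0'
data = bp.Dataset(df)
print(list(data))  # expected: ['0']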

There are some caveats to using certain DataFrame functions once the DataFrame has been cast as a Dataset. While a great deal will continue to work, there are certain types of operations which can end up either re-casting the result back to a plain DataFrame (therefore losing all of the associated metadata), or renaming columns, which may cause internal errors and metadata loss.
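One defensive pattern is therefore to check that a result is still a Dataset before relying on its metadata. The ensure_dataset helper below is purely illustrative (not part of the BPt API):

import BPt as bp

def ensure_dataset(obj):
    # Re-cast to Dataset if a pandas operation returned a plain DataFrame.
    # Note: any BPt metadata (roles, etc.) attached to the original Dataset
    # may already have been lost at this point.
    if not isinstance(obj, bp.Dataset):
        obj = bp.Dataset(obj)
    return obj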

Basic Example

In [1]: import BPt as bp

In [2]: data = bp.Dataset()

In [3]: data['col 1'] = [1, 2, 3]

In [4]: data
Out[4]: 
   col 1
0      1
1      2
2      3

We can then perform operations on it, for example changing its role.

In [5]: data.set_role('col 1', 'target')
Out[5]: 
   col 1
0      1
1      2
2      3

In [6]: data.roles
Out[6]: {'col 1': 'input data'}

What happened here? It looks like the role of ‘col 1’ is still ‘input data’ and not ‘target’. That is because the Dataset class, like the underlying pandas DataFrame, has an inplace argument. This gives us two options, where both of the operations below will correctly set the role.

In [7]: data = data.set_role('col 1', 'target')

In [8]: data.set_role('col 1', 'target', inplace=True)

In [9]: data.roles
Out[9]: {'col 1': 'target'}

Preparing Data

There are some pre-modelling steps that, depending on the dataset and the question, might also be explored at this stage, and that can be performed using the Dataset object directly. These steps occur after the actual loading and setting of explicit roles as described above, and include decisions like:

  • Generating exploratory plots of the different features in the dataset.

  • Should any data be removed or set to missing based on status as an outlier?

  • Should missing data be kept and imputed, or dropped?

  • Are there any prerequisite transformations that should be applied to the data? E.g., conversion from the strings ‘Male’ and ‘Female’ to 0s and 1s (see the sketch after this list).
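Below is a rough sketch of a few of these steps using plain pandas-style operations on a Dataset (the example columns, values, and outlier threshold are all made up; the Dataset class also provides its own dedicated methods for steps like outlier filtering and plotting):

import numpy as np
import BPt as bp

# Hypothetical example data
data = bp.Dataset({'sex': ['Male', 'Female', 'Male', None],
                   'score': [1.2, 3.4, 250.0, 2.1]})

# Convert the strings 'Male' / 'Female' to 0s and 1s
data['sex'] = data['sex'].map({'Male': 0, 'Female': 1})

# Treat implausibly large scores as outliers by setting them to missing
# (the threshold of 100 here is arbitrary, for illustration only)
data.loc[data['score'] > 100, 'score'] = np.nan

# Here we choose to drop missing data rather than keep and impute it
data = data.dropna()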