# Loading Data

## Intro
To get data ready for machine learning, BPt provides a specially designed Dataset class, built on top of the DataFrame class from the pandas library. As we will see, the recommended way of preparing data actually starts with using the pandas DataFrame class directly. The general idea is to use pandas and the DataFrame class to load all of the data you might end up wanting to use; luckily, pandas already contains a wealth of useful functions for accomplishing this. Next, once all of the data is loaded, we cast the DataFrame to the BPt Dataset class, and then use the built-in Dataset methods to get the data ready for use with the rest of BPt. This includes steps like specifying which variables play which role (e.g., target variables vs. data variables), outlier detection, transformations like binning and converting to binary, tools for plotting / viewing distributions, and specifying a global train / test split. We will introduce all of this functionality below!
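As a quick preview, the overall workflow looks something like the following minimal sketch (data.csv, subject, and my_target are hypothetical names, and set_test_split is an assumed method name to verify against the API reference):

import pandas as pd
import BPt as bp

# Load and merge everything with pandas first
df = pd.read_csv('data.csv').set_index('subject')

# Cast to a BPt Dataset, then prepare with Dataset methods
data = bp.Dataset(df)
data = data.set_role('my_target', 'target')

# e.g., define a global train / test split (assumed method name)
data = data.set_test_split(size=0.2)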
Data of interest will inevitably come from a wide range of different sources. Luckily, the Python library pandas has an incredible amount of support for loading data from different sources into DataFrames. Likewise, pandas offers a huge amount of support material, e.g., [here](https://pandas.pydata.org/docs/getting_started/intro_tutorials/02_read_write.html), for getting started with loading in raw data (or a Google search with a specific question will almost always help). Pandas should be used to accomplish the initial loading and merging of all tabular data of interest into a DataFrame.
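For instance, if the variables of interest are spread across multiple files, a typical pattern (file and column names here are hypothetical) is:

import pandas as pd

demo = pd.read_csv('demographics.csv')
scores = pd.read_csv('scores.csv')

# Merge the two sources on a shared subject column
df = demo.merge(scores, on='subject')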
For example, let’s say our data of interest is stored in a file called data.csv. We could load it with:

import pandas as pd
data = pd.read_csv('data.csv')
Next, let’s say we wanted to use the column called ‘subject’ as the index. We can do this with another call to the native pandas API:

data = data.set_index('subject')
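Equivalently, pandas can set the index directly at load time:

data = pd.read_csv('data.csv', index_col='subject')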
Then when we are finished with loading and merging the data of interest into a DataFrame, we can cast it to a BPt Dataset!
from BPt import Dataset
data = Dataset(data)
We can still use a number of the native pandas API methods, now in addition to the added functionality of the BPt Dataset!
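For example, standard pandas methods keep working on a Dataset (a small sketch; whether a given method preserves the Dataset class is subject to the caveats below):

data = data.dropna()   # native pandas method
data.shape             # native pandas attribute
data.roles             # BPt-added metadata (introduced below)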
There are a few key concepts which are important to know when using Dataset. These are Role, Scope, Subjects, Data Types and Data Files.
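For example, roles are set with set_role (demonstrated below), while scopes can be attached and queried roughly like this (add_scope and get_cols are assumed from the BPt API; double-check the reference):

data = data.add_scope('col 1', 'my custom scope')
data.get_cols('my custom scope')   # -> ['col 1']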
Warnings when using Dataset:

Column names within the Dataset class must be strings in order for the concept of scopes to work more easily. Therefore, if any columns are loaded with a non-string name, they will be renamed to the string version of that name.
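For example (a small sketch of the behavior described above; whether the rename happens at cast time or on the first Dataset method call is an implementation detail):

import pandas as pd
import BPt as bp

df = pd.DataFrame({0: [1, 2, 3]})   # non-string column name
data = bp.Dataset(df)
list(data.columns)                  # expected: ['0']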
There are some caveats to using certain DataFrame functions once the DataFrame has been cast as a Dataset. While a great deal will continue to work, there are certain types of operations which can end up either re-casting the result back to a DataFrame (therefore losing all of the associated metadata), or renaming columns, which may cause internal errors and metadata loss.
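If an operation does hand back a plain DataFrame, one workaround is to re-cast the result, keeping in mind that any previously set metadata may need to be re-applied (some_pandas_operation is a stand-in for any such operation):

result = some_pandas_operation(data)   # hypothetical operation
if not isinstance(result, bp.Dataset):
    result = bp.Dataset(result)        # re-cast; re-set roles etc. as needed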
## Basic Example
In [1]: import BPt as bp
In [2]: data = bp.Dataset()
In [3]: data['col 1'] = [1, 2, 3]
In [4]: data
Out[4]:
col 1
0 1
1 2
2 3
We can then perform operations on it, for example change its role.
In [5]: data.set_role('col 1', 'target')
Out[5]:
col 1
0 1
1 2
2 3
In [6]: data.roles
Out[6]: {'col 1': 'input data'}
What happened here? It looks like the role of ‘col 1’ is still ‘input data’ and not ‘target’. That is because the Dataset class, like the underlying pandas DataFrame, has an inplace argument, and operations return a copy by default. This gives us two options, where both of the below operations will correctly set the role.
In [7]: data = data.set_role('col 1', 'target')
In [8]: data.set_role('col 1', 'target', inplace=True)
In [9]: data.roles
Out[9]: {'col 1': 'target'}
## Preparing Data
There are some pre-modelling steps that, depending on the dataset and the question, might also be explored at this stage, and that can be performed using the Dataset object directly. These are steps that occur after the actual loading and setting of explicit roles as described above. They include decisions like the following (a short sketch follows the list):
- Should exploratory plots of the different features in the dataset be generated?
- Should any data be removed or set to missing based on its status as an outlier?
- Should missing data be kept and imputed, or dropped?
- Are there any pre-requisite transformations that should be applied to the data? E.g., converting the strings ‘Male’ and ‘Female’ to 0s and 1s.
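A hedged sketch of what this stage might look like (plot, filter_outliers_by_std, and binarize are method names assumed from the BPt Dataset API, and their exact signatures should be verified against the reference):

# Explore the distribution of a column
data.plot('col 1')

# Remove outliers, e.g., values beyond some number of standard deviations
data = data.filter_outliers_by_std(n_std=10, scope='float')

# Drop (rather than keep and impute) subjects with missing values
data = data.dropna()   # native pandas

# Convert a binary string column, e.g., 'Male' / 'Female', to 0s and 1s
data = data.binarize('sex')   # assumed signature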