Data Types

We consider loaded variables to be essentially one of three types: ‘float’ (continuous variables), categorical, or a data file. If no type is specified, a variable is considered to be of type ‘float’.

Setting aside Data Files, which we will discuss below, all one generally has to worry about with respect to data types is telling the Dataset class which columns are categorical. By default, if any columns are set to pandas type ‘category’, e.g., via:

data['col'] = data['col'].astype('category')

Then this example column, ‘col’, is already set within BPt as categorical too.
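As a minimal sketch of the cast itself, the snippet below uses plain pandas only (no BPt import) on a hypothetical toy frame. The column names 'col' and 'other' are placeholders, and the values in 'col' are assumed to be integer-encoded already.

```python
import pandas as pd

# Hypothetical toy frame; 'col' holds integer-encoded groups.
df = pd.DataFrame({"col": [0, 1, 0, 2], "other": [0.5, 1.5, 2.5, 3.5]})

# Cast one column to the pandas 'category' dtype.
df["col"] = df["col"].astype("category")

# List the columns that now carry the categorical dtype.
categorical_cols = df.select_dtypes(include="category").columns.tolist()
print(categorical_cols)
```

Only 'col' is reported as categorical here; 'other' keeps its float dtype.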

Warning

If the categorical column is composed of, say, strings or objects, it is not enough to just cast it as type ‘category’ for use as input to a machine learning pipeline! To be valid input, the strings or objects must first be converted to float / int representations. The best way to do this is usually ordinalize.
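To illustrate why the cast alone is not enough, here is a pandas-only sketch. It stands in for the idea behind ordinalizing and is not BPt's own implementation, so the codes BPt assigns may differ. The column name 'col' and its labels are made up for the example.

```python
import pandas as pd

# Hypothetical column of string labels.
df = pd.DataFrame({"col": ["cow", "horse", "cat", "cow"]})

# Casting alone keeps the underlying values as strings.
as_category = df["col"].astype("category")
print(as_category.tolist())

# Map each label to an integer code, then mark the result as categorical.
codes = as_category.cat.codes
df["col"] = codes.astype("category")
print(df["col"].tolist())
```

After the second step the column holds integer codes rather than strings, which is the form a machine learning pipeline can accept.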

You may also mark a column as categorical by adding ‘category’ to that column’s scope via add_scope. Again though, this should only be done for columns which are already ordinally or one-hot encoded.

data.add_scope('col', 'category')

In addition to explicitly setting columns as categorical, it is important to note that a number of Dataset methods will automatically cast relevant columns to type ‘category’. These methods include auto_detect_categorical, which will try to automatically detect categorical columns, as well as functions like binarize, filter_categorical_by_percent, ordinalize, copy_as_non_input and more.
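For intuition about what automatic detection could look like, the sketch below applies one simple rule in plain pandas: treat a column as categorical when it has few distinct values. This rule, the helper name guess_categorical, and the max_unique threshold are all assumptions made for illustration; they are not taken from BPt and may not match what auto_detect_categorical actually does.

```python
import pandas as pd


def guess_categorical(df, max_unique=3):
    """Return names of columns with at most `max_unique` distinct values.

    Hypothetical helper for illustration only, not part of BPt.
    """
    return [c for c in df.columns if df[c].nunique() <= max_unique]


# Hypothetical data: 'site' is an integer-encoded group, 'age' is continuous.
df = pd.DataFrame({
    "site": [0, 1, 0, 1, 2, 2],
    "age": [21.5, 34.0, 29.1, 40.2, 18.7, 55.3],
})

candidates = guess_categorical(df)

# Cast every detected column to the pandas 'category' dtype.
for name in candidates:
    df[name] = df[name].astype("category")

print(candidates)
```

A fixed threshold like this is easy to reason about but can misfire on small datasets, where a continuous variable may happen to take only a few distinct values.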

Basic Example

import BPt as bp

data = bp.Dataset([['cow', 1, 3],
                   ['horse', 2, 2],
                   ['cat', 3, 2]],
                  columns=['f1', 'f2', 'f3'])

data

This defines a basic dataset with 3 columns. We will assume they are all categorical and use ordinalize on all of them.

In [1]: data = data.ordinalize(scope='data')

In [2]: data

We can confirm they were cast to categorical:

In [3]: data.get_cols(scope='category')

Using functions like ordinalize should be the preferred way of letting a Dataset know which variables are categorical.