Data Types
Loaded variables are treated as one of three types: ‘float’ (continuous variables), categorical, or data file. Unless specified otherwise, a variable is considered to be of type ‘float’.
Setting aside Data Files, which are discussed below, generally all one has to worry about with respect to data types is telling the Dataset class which columns are categorical. By default, if a column is set to pandas type ‘category’, e.g., via:
data['col'] = data['col'].astype('category')
then this example column, ‘col’, is already treated as categorical within BPt too.
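As a minimal sketch of the pandas side of this, the snippet below casts a column to type ‘category’ and then lists the columns pandas treats as categorical. It uses plain pandas only (no BPt calls), and the toy DataFrame and its column names are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; 'col' already holds integer codes.
data = pd.DataFrame({'col': [0, 1, 0, 2],
                     'x': [0.5, 1.2, 3.3, 0.1]})

# Cast 'col' to the pandas categorical dtype.
data['col'] = data['col'].astype('category')

# Collect the names of all columns with dtype 'category'.
cat_cols = list(data.select_dtypes(include='category').columns)
print(cat_cols)
```

Only ‘col’ is reported as categorical here; ‘x’ keeps its float dtype.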
Warning
In the case where the categorical column is composed of, say, strings or objects,
it is not enough to just cast it to type ‘category’ for use as input to a machine learning pipeline!
In order for it to be valid input, the strings or objects must be converted to float / int
representations. The best way to do this is usually ordinalize.
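To illustrate what converting strings to int representations means, here is a minimal pure-Python sketch of ordinal encoding. It is not BPt's implementation of ordinalize; the helper name ordinal_encode and the choice to assign codes in sorted order are assumptions made for this example.

```python
def ordinal_encode(values):
    """Map each distinct value to an integer code, assigned in sorted order."""
    categories = sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


animals = ['cow', 'horse', 'cat', 'cow']
codes, mapping = ordinal_encode(animals)
print(codes)    # [1, 2, 0, 1]
print(mapping)  # {'cat': 0, 'cow': 1, 'horse': 2}
```

The returned mapping makes it possible to translate the integer codes back to the original labels later.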
You may also specify that a column is categorical by adding ‘category’ to that column’s
scope via add_scope. Again though, this should only be done for columns which
are already ordinally or one-hot encoded.
data.add_scope('col', 'category')
In addition to explicitly setting columns as categorical, it is important to note
that a number of Dataset methods will automatically cast relevant columns to type ‘category’.
These methods include auto_detect_categorical, which will try to automatically detect
categorical columns, but also functions like binarize, filter_categorical_by_percent,
ordinalize, copy_as_non_input and more.
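To give a feel for what automatic detection of categorical columns could involve, the sketch below applies one plausible heuristic: a column is flagged if it contains non-numeric values or has only a few distinct values. This is an assumption for illustration and not a description of how auto_detect_categorical actually decides; the function looks_categorical and the max_unique threshold are hypothetical.

```python
def looks_categorical(values, max_unique=10):
    """Hypothetical heuristic: non-numeric values, or few distinct values."""
    non_numeric = any(not isinstance(v, (int, float)) for v in values)
    return non_numeric or len(set(values)) <= max_unique


# Toy columns: strings, many distinct floats, and a binary indicator.
columns = {
    'f1': ['cow', 'horse', 'cat'],
    'f2': [0.13, 2.71, 3.14, 1.41, 1.73, 2.24,
           2.65, 2.83, 3.32, 3.61, 4.12, 4.47],
    'f3': [0, 1, 1, 0, 1, 0],
}

detected = [name for name, vals in columns.items() if looks_categorical(vals)]
print(detected)  # ['f1', 'f3']
```

Any such heuristic can misclassify columns, so it is worth checking which columns ended up in the ‘category’ scope after running an automatic step.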
Basic Example
import BPt as bp
data = bp.Dataset([['cow', 1, 3],
                   ['horse', 2, 2],
                   ['cat', 3, 2]],
                  columns=['f1', 'f2', 'f3'])
data
This defines a basic dataset with 3 columns. We will assume they are all categorical, and use ordinalize on all of them.
In [1]: data = data.ordinalize(scope='data')
In [2]: data
We can confirm they were cast to categorical:
In [3]: data.get_cols(scope='category')
Using functions like ordinalize should be the preferred way of letting Datasets know which variables are categorical.