BPt.Dataset.set_test_split

Dataset.set_test_split(size=None, subjects=None, cv_strategy=None, random_state=None, inplace=False)
Defines a set of subjects to be reserved as test subjects. This method includes utilities for either defining a new test split, or loading an existing one.
This method applies the passed parameters in order to define a test set, which is stored in the current Dataset.
Parameters
size : float, int or None, optional
The size of the test split to apply. If passed as a float, this value should be between 0.0 and 1.0, and specifies the proportion of the data to be set as the test set. If passed as an integer, this value is interpreted as the absolute number of subjects to be set as the test set.
Note that either this parameter or subjects should be used, not both, as they define different behaviors. Leave as the default (None) if using subjects.
default = None
subjects : Subjects, optional
This parameter can optionally be used instead of size when a specific set of subjects should define the split. It accepts any valid Subjects style input. Either this parameter or size should be used, not both, as they define different behaviors.
If additional subjects are specified here, i.e., ones not loaded in the current Dataset, they will simply be ignored, and the split will be set to the overlap of the passed subjects with the loaded subjects.
default = None
cv_strategy : None or CVStrategy, optional

This parameter is only relevant when size is used, i.e., when a new split is defined (and subjects is not used). In this case, pass an instance of CVStrategy defining any validation behavior that the train/test split should follow, or leave as None (the default) to use random splits. This parameter is typically used to ensure, for example, that the same distribution of a target variable is present in both folds, or that members of the same family are kept together in the same fold.

default = None
random_state : int or None, optional

This parameter is only relevant when size is used, i.e., when a new split is defined (and subjects is not used). In this case, it sets the random state used to perform the split. A fixed random state allows the same train/test split to be reproduced across different runs, given the same input Dataset. If left as None, the split is performed with a different random seed each time the function is called.

default = None
inplace : bool, optional

If True, perform the operation in place and return None.

default = False
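
The semantics of size and subjects described above can be illustrated with a small standalone sketch. This is not BPt's implementation: the helper names resolve_test_size and resolve_test_subjects are hypothetical, the rounding rule for a float size is an assumption, and subjects are modeled as plain Python lists rather than a Dataset index.

```python
def resolve_test_size(size, n_subjects):
    """Return the number of test subjects implied by `size` (hypothetical helper)."""
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        raise TypeError("size must be a float or an int")
    if isinstance(size, float):
        if not 0.0 <= size <= 1.0:
            raise ValueError("a float size must be between 0.0 and 1.0")
        # Assumption: round to the nearest whole subject.
        # The rounding rule actually used by the library may differ.
        return round(size * n_subjects)
    if not 0 <= size <= n_subjects:
        raise ValueError("an int size cannot exceed the number of subjects")
    return size


def resolve_test_subjects(passed, loaded):
    """Keep only the passed subjects that are loaded (hypothetical helper)."""
    loaded_set = set(loaded)
    # Subjects that are not loaded are ignored rather than raising an error.
    test = [s for s in passed if s in loaded_set]
    test_set = set(test)
    train = [s for s in loaded if s not in test_set]
    return train, test


loaded = [0, 1, 2, 3, 4]
print(resolve_test_size(0.6, len(loaded)))        # float: proportion of the 5 subjects
print(resolve_test_size(2, len(loaded)))          # int: absolute number of subjects
print(resolve_test_subjects([0, 1, 99], loaded))  # 99 is not loaded, so it is dropped
```

The point of the sketch is the distinction between the two modes: size only determines how many subjects end up in the test set, while subjects determines exactly which ones do.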

See also

set_train_split

Apply a train/test split by specifying which subjects are training subjects.

test_split

Apply a test split, returning two separate train and test Datasets.

save_test_split

Save the test subjects from a split to a text file.

Examples

In [1]: import BPt as bp

# Load example data
In [2]: data = bp.read_pickle('data/example1.dataset')

In [3]: data
Out[3]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN

In [4]: data.set_test_split(size=.6, inplace=True)
Performing test split on: 5 subjects.
random_state: None
Test split size: 0.6

Performed train/test split
Train size: 2
Test size:  3

In [5]: data.train_subjects
Out[5]: Int64Index([1, 2], dtype='int64')

In [6]: data.test_subjects
Out[6]: Int64Index([0, 3, 4], dtype='int64')

Note that the split is stored in the dataset itself. We can also pass specific subjects.

In [7]: data = data.set_test_split(subjects=[0, 1])
Overriding existing train/test split.
Performed train/test split
Train size: 3
Test size:  2

In [8]: data.train_subjects
Out[8]: Int64Index([2, 3, 4], dtype='int64')

In [9]: data.test_subjects
Out[9]: Int64Index([0, 1], dtype='int64')
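
The role of random_state can be sketched independently of BPt. The following is a minimal illustration using only the Python standard library; the function random_test_split is hypothetical and is not how the library performs its split. It only shows why a fixed seed reproduces the same split for the same input.

```python
import random


def random_test_split(subjects, n_test, random_state=None):
    """Split `subjects` into (train, test) with a seeded shuffle (hypothetical helper)."""
    # random.Random(None) seeds from system entropy, so the split
    # differs between calls unless an explicit seed is given.
    rng = random.Random(random_state)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    test = sorted(shuffled[:n_test])
    train = sorted(shuffled[n_test:])
    return train, test


subjects = [0, 1, 2, 3, 4]
first = random_test_split(subjects, n_test=3, random_state=42)
second = random_test_split(subjects, n_test=3, random_state=42)
print(first == second)  # same seed and same input, so the two splits match
```

Reproducibility here depends on both the seed and the input: changing the set or order of subjects can change the resulting split even with the same random_state.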