BPt.Dataset.test_split

Dataset.test_split(size=None, subjects=None, cv_strategy=None, random_state=None)
This method defines and returns a train Dataset and a test Dataset based on the passed parameters. The test set can be defined either as a new split or from an existing set of subjects.
This method’s parameters describe how the test set should be generated; the training set is then made up of every subject that is not in the test set.
Parameters
size : float, int or None, optional
This parameter represents the size of the test set in the train/test split. If passed as a float, it should be between 0.0 and 1.0, and it specifies the proportion of the data to place in the test set. If passed as an integer, it is interpreted as the absolute number of subjects to place in the test set.
Note that either this parameter or subjects should be used, not both, as they define different behaviors. Keep as default = None if using subjects.
default = None
subjects : Subjects, optional
This parameter can optionally be used instead of size when a specific set of subjects should define the test set. This argument accepts any valid Subjects style input. Either this parameter or size should be used, not both, as they define different behaviors.
If additional subjects are specified here, i.e., ones not loaded in the current Dataset, they will simply be ignored, and the split is defined by the overlap of the passed subjects with the loaded subjects.
default = None
cv_strategy : None or CVStrategy, optional

This parameter is only relevant when size is used, i.e., when a new split is defined (and subjects is not used). In this case, pass an instance of CVStrategy describing the validation behavior the train/test split should follow, or leave as None (the default) to use a random split. This parameter is typically used to make sure the same distribution of a target variable is present in both folds, or that members of the same family are kept together in the same fold.

default = None
random_state : int or None, optional

This parameter is only relevant when size is used, i.e., when a new split is defined (and subjects is not used). In this case, it sets the random state used to perform the split. A fixed random state reproduces the same train/test split across different runs, given the same input Dataset. If left as None, the split is performed with a different random seed each time the function is called.

default = None
Returns
train_data : Dataset

The current Dataset indexed to only the subjects in the requested training set, as defined by the passed parameters. This Dataset has all the same metadata as the original Dataset, but changes made to it will not influence the original Dataset.

test_data : Dataset

The current Dataset indexed to only the subjects in the requested test set, as defined by the passed parameters. This Dataset has all the same metadata as the original Dataset, but changes made to it will not influence the original Dataset.

See also

train_split

Return a train/test split, but by specifying which subjects are training subjects.

set_test_split

Apply a test split, but store the split information in the Dataset.

save_test_split

Save the test subjects from a split to a text file.

Examples

In [1]: import BPt as bp

# Load example data
In [2]: data = bp.read_pickle('data/example1.dataset')

In [3]: data
Out[3]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN

In [4]: tr_data, test_data = data.test_split(size=.2)
Performing test split on: 5 subjects.
random_state: None
Test split size: 0.2

Performed train/test split
Train size: 4
Test size:  1

In [5]: tr_data
Out[5]: 
  animals  numbers
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN

In [6]: test_data
Out[6]: 
  animals  numbers
0   'cat'      1.0
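
The size and random_state behavior described under Parameters can be pictured with a small standard-library sketch. This is not BPt's implementation: the function name sketch_test_split is hypothetical, and rounding a float size up to a whole number of subjects is an assumption made only for illustration.

```python
import math
import random


def sketch_test_split(subjects, size, random_state=None):
    """Illustrative stand-in for a size-based split (not BPt's code)."""
    subjects = list(subjects)
    n = len(subjects)

    # A float is read as a proportion, an int as an absolute count.
    # Rounding floats up is an assumption of this sketch.
    if isinstance(size, float):
        if not 0.0 < size < 1.0:
            raise ValueError("a float size must be between 0.0 and 1.0")
        n_test = math.ceil(size * n)
    else:
        n_test = int(size)

    # A fixed seed draws the same test subjects on every call;
    # random_state=None draws a fresh seed each time.
    rng = random.Random(random_state)
    test = set(rng.sample(subjects, n_test))

    # The training set is simply every subject not in the test set.
    train = [s for s in subjects if s not in test]
    return train, sorted(test)


train, test = sketch_test_split(range(5), size=0.2, random_state=1)
print(len(train), len(test))  # prints: 4 1
```

Calling the sketch twice with the same random_state returns the same split, while random_state=None may return a different one each time.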

We can also define a split by passing specific subjects.

In [7]: tr_data, test_data = data.test_split(subjects=[3, 4])
Performed train/test split
Train size: 3
Test size:  2

In [8]: tr_data
Out[8]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0

In [9]: test_data
Out[9]: 
  animals  numbers
3   'dog'      2.0
4   'elk'      NaN
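
The handling of subjects that are passed but not loaded can be sketched the same way. The helper below is hypothetical and only mirrors the documented behavior (extra subjects are ignored, and the training set is whatever is left over); it is not taken from BPt.

```python
def sketch_subject_split(loaded_subjects, requested_test_subjects):
    """Illustrative stand-in for a subjects-based split (not BPt's code)."""
    loaded = list(loaded_subjects)
    requested = set(requested_test_subjects)

    # Requested subjects that are not loaded never match, so they are dropped.
    test = [s for s in loaded if s in requested]
    # Every loaded subject outside the test set becomes training data.
    train = [s for s in loaded if s not in requested]
    return train, test


# Subject 99 is not loaded, so it has no effect on the split.
train, test = sketch_subject_split([0, 1, 2, 3, 4], [3, 4, 99])
print(train, test)  # prints: [0, 1, 2] [3, 4]
```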

We see that the parameters are used to define which subjects are set as test subjects.
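
The cv_strategy parameter is described as a way to keep members of the same family together. As a rough picture of what that constraint means, the sketch below samples whole groups rather than individual subjects. It uses only the standard library; the function name, the families mapping, and the n_test_groups argument are all hypothetical, and it does not show how CVStrategy itself is constructed or applied.

```python
import random
from collections import defaultdict


def sketch_group_split(groups, n_test_groups, random_state=None):
    """Split so each group lands entirely in train or entirely in test.

    `groups` maps subject -> group label (for example, a family id).
    Illustrative only; not how CVStrategy is implemented.
    """
    members = defaultdict(list)
    for subject, group in groups.items():
        members[group].append(subject)

    # Sample whole groups, never individual subjects.
    rng = random.Random(random_state)
    test_groups = set(rng.sample(sorted(members), n_test_groups))

    train = [s for s, g in groups.items() if g not in test_groups]
    test = [s for s, g in groups.items() if g in test_groups]
    return train, test


families = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
train, test = sketch_group_split(families, n_test_groups=1, random_state=0)

# No family appears on both sides of the split.
assert {families[s] for s in train}.isdisjoint({families[s] for s in test})
```

Note that the resulting test set size depends on how large the sampled groups are, which is one reason a group constraint and an exact size cannot always both be met.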