BPt.Dataset.train_split

Dataset.train_split(size=None, subjects=None, cv_strategy=None, random_state=None)

This method defines and returns a train and a test Dataset based on the passed parameters. The training set can be defined either as a new split or from an existing set of subjects.

The parameters describe how the training set is generated; the testing set then consists of all subjects that are not in the training set.

Parameters

size : float, int or None, optional

This parameter represents the size of the training set in the train/test split. If passed as a float, it should be between 0.0 and 1.0 and specifies the proportion of subjects to place in the training set. If passed as an integer, it is interpreted as the absolute number of subjects to place in the training set.

Note that either this parameter or subjects should be used, not both, as they define different behaviors. Keep the default of None if using subjects.

default = None
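
As a rough illustration of how the two accepted types of size differ, the sketch below converts either form into a number of training subjects. The helper name resolve_train_size and the choice to round down are assumptions made for this example; they are not taken from BPt's internal code.

```python
import math


def resolve_train_size(size, n_subjects):
    """Hypothetical helper: turn `size` into a count of training subjects."""
    # bool is a subclass of int, so reject it explicitly
    if isinstance(size, bool):
        raise TypeError("size must be a float or an int")
    if isinstance(size, int):
        # Integer: absolute number of training subjects
        if not 0 < size < n_subjects:
            raise ValueError("integer size must be between 1 and n_subjects - 1")
        return size
    if isinstance(size, float):
        # Float: proportion of subjects (rounding down is an assumption)
        if not 0.0 < size < 1.0:
            raise ValueError("float size must be between 0.0 and 1.0")
        return math.floor(size * n_subjects)
    raise TypeError("size must be a float or an int")


n_train = resolve_train_size(0.6, 5)     # proportion of 5 subjects
n_train_abs = resolve_train_size(2, 5)   # absolute count
```

With five subjects, a size of 0.6 resolves to three training subjects in this sketch, while an integer size of 2 is used as given.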

subjects : Subjects, optional

This parameter can be optionally used instead of size in the case that a specific set of subjects should be used to define the split. This argument can accept any valid Subjects style input. Explicitly, either this parameter or size should be used, not both as they define different behaviors.

If additional subjects are specified here, i.e., ones not loaded in the current Dataset, they are simply ignored, and the split is based on the overlap of the passed subjects with the loaded subjects.

default = None
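
The overlap behavior described above can be pictured with plain Python lists. This is only a sketch of the idea: split_by_subjects is a hypothetical helper, and subject 99 stands in for a subject that was passed but is not loaded.

```python
def split_by_subjects(loaded, passed):
    """Hypothetical helper: split loaded subjects given explicit train subjects."""
    passed = set(passed)
    # Only subjects that are actually loaded can end up in the training set
    train = [s for s in loaded if s in passed]
    # Every other loaded subject becomes part of the test set
    test = [s for s in loaded if s not in passed]
    return train, test


loaded_subjects = [0, 1, 2, 3, 4]
# Subject 99 is not loaded, so it has no effect on the split
train, test = split_by_subjects(loaded_subjects, [0, 1, 99])
```

Here train holds subjects 0 and 1, and test holds the remaining loaded subjects 2, 3 and 4.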

cv_strategy : None or CVStrategy, optional

This parameter is only relevant when size is used, i.e., when a new split is defined (and subjects is not used). In this case, pass an instance of CVStrategy defining the validation behavior the train/test split should follow, or leave it as None (the default) to use random splits. This parameter is typically used to ensure, for example, that the target variable has the same distribution in both folds, or that members of the same family are kept together within a single fold.

default = None
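
To make the family example concrete, the sketch below shows one way a group-preserving split could work in plain Python. It does not use CVStrategy and is not how BPt implements it; group_preserving_split, the family labels, and the fact that train_fraction is applied to groups rather than subjects are all assumptions for illustration.

```python
import random


def group_preserving_split(groups, train_fraction, seed=None):
    """Hypothetical helper: assign whole groups to train or test.

    `groups` maps each subject to a group label (e.g., a family id).
    """
    rng = random.Random(seed)
    labels = sorted(set(groups.values()))
    rng.shuffle(labels)
    # Pick whole groups for training (at least one group)
    n_train_groups = max(1, int(len(labels) * train_fraction))
    train_groups = set(labels[:n_train_groups])
    train = [s for s, g in groups.items() if g in train_groups]
    test = [s for s, g in groups.items() if g not in train_groups]
    return train, test


family = {"s1": "famA", "s2": "famA", "s3": "famB", "s4": "famC", "s5": "famC"}
train, test = group_preserving_split(family, train_fraction=0.67, seed=0)
```

Because groups are assigned as units, no family ends up with members in both the training and the test set.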

random_state : int or None, optional

This parameter is only relevant when size is used, i.e., when a new split is defined (and subjects is not used). In this case, it sets the random state used to perform the split. A fixed random state reproduces the same train/test split across different runs, given the same input Dataset. If left as None, a different random seed is used each time the method is called.

default = None
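
The effect of a fixed random state can be sketched with the standard library alone. The random_split helper below is hypothetical and uses Python's random module, which is an assumption for this example and not necessarily the generator BPt relies on.

```python
import random


def random_split(subjects, n_train, random_state=None):
    """Hypothetical helper: random split driven by an optional seed."""
    # A seed of None draws entropy from the system, so results vary per call
    rng = random.Random(random_state)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    train = sorted(shuffled[:n_train])
    test = sorted(shuffled[n_train:])
    return train, test


subjects = [0, 1, 2, 3, 4]
first = random_split(subjects, n_train=3, random_state=42)
second = random_split(subjects, n_train=3, random_state=42)
```

Both calls use the same seed and the same input, so first and second contain identical splits; passing random_state=None instead may give a different split on each call.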

Returns

train_data : Dataset

The current Dataset restricted to the subjects in the requested training set, as defined by the passed parameters. This Dataset has all the same metadata as the original Dataset, but changes made to it will not affect the original Dataset.

test_data : Dataset

The current Dataset restricted to the subjects in the requested test set, as defined by the passed parameters. This Dataset has all the same metadata as the original Dataset, but changes made to it will not affect the original Dataset.

See also

test_split: Return a train/test split, but by specifying which subjects are test subjects.
set_train_split: Apply a train split and store the split information in the Dataset.
save_train_split: Save the train subjects from a split to a text file.

Examples

In [1]: import BPt as bp

# Load example data
In [2]: data = bp.read_pickle('data/example1.dataset')

In [3]: data
Out[3]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN

In [4]: tr_data, test_data = data.train_split(size=.6)
Performing train split on: 5 subjects.
random_state: None
Train split size: 0.6

Performed train/test split
Train size: 3
Test size:  2

In [5]: tr_data
Out[5]: 
  animals  numbers
1   'cat'      2.0
2   'dog'      1.0
4   'elk'      NaN

In [6]: test_data
Out[6]: 
  animals  numbers
0   'cat'      1.0
3   'dog'      2.0

In [7]: tr_data, test_data = data.train_split(subjects=[0, 1])
Performed train/test split
Train size: 2
Test size:  3

In [8]: tr_data
Out[8]: 
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0

In [9]: test_data
Out[9]: 
  animals  numbers
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN