BPt.Dataset.train_split
- Dataset.train_split(size=None, subjects=None, cv_strategy=None, random_state=None)
- This method defines and returns a train and a test Dataset based on the passed parameters. The training set can be defined either as a new split or from an existing set of subjects. The parameters describe how the training set is generated; the testing set is then made up of whichever subjects are not in the training set.
- Parameters
- sizefloat, int or None, optional
- The size of the split to apply. If passed as a float, the value should be between 0.0 and 1.0, and it specifies the proportion of the data to be placed in the training set. If passed as an integer, the value is interpreted as the absolute number of subjects to be placed in the training set. Note that either this parameter or subjects should be used, not both, as they define different behaviors. Keep as the default, None, if using subjects.
default = None
- subjectsSubjects, optional
- This parameter can optionally be used instead of size when a specific set of subjects should define the split. It accepts any valid Subjects style input. Either this parameter or size should be used, not both, as they define different behaviors. If additional subjects are specified here, i.e., ones not loaded in the current Dataset, they will simply be ignored, and the resulting split is set to the overlap of the passed subjects with the loaded subjects.
default = None
- cv_strategyNone or CVStrategy, optional
- This parameter is only relevant when size is used, i.e., a new split is defined (and subjects is not used). In this case, pass an instance of CVStrategy defining any validation behavior the train/test split should follow, or leave it as None (the default) to use random splits. This parameter is typically used to define behavior like making sure the same distribution of a target variable is present in both folds, or that members of the same family are kept together in the same fold.
default = None
- random_stateint or None, optional
- This parameter is only relevant when size is used, i.e., a new split is defined (and subjects is not used). In this case, it sets the random state with which the split is performed. Passing the same random state reproduces the same train/test split across different runs, given the same input Dataset. If left as None, the split is performed with a different random seed each time the function is called.
default = None
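The interplay of size, subjects and random_state described above can be sketched in plain Python. This is only an illustration of the documented semantics, not BPt's implementation: the function name sketch_train_split is hypothetical, subjects are modeled as a simple list of labels, and rounding a float size to the nearest whole subject is an assumption (the rounding rule is not stated here).

```python
import random


def sketch_train_split(loaded, size=None, subjects=None, random_state=None):
    """Hypothetical stand-in illustrating the documented parameter semantics."""
    loaded = list(loaded)

    # Either size or subjects should be used, not both.
    if (size is None) == (subjects is None):
        raise ValueError("Pass exactly one of size or subjects.")

    if subjects is not None:
        # Subjects that are not loaded are ignored: only the overlap is kept.
        chosen = set(subjects) & set(loaded)
    else:
        if isinstance(size, float):
            if not 0.0 < size < 1.0:
                raise ValueError("A float size should be between 0.0 and 1.0.")
            # Assumption: round the proportion to the nearest whole subject.
            n_train = round(size * len(loaded))
        else:
            # An integer size is the absolute number of training subjects.
            n_train = size
        # random_state=None seeds from system entropy, so each call differs.
        rng = random.Random(random_state)
        chosen = set(rng.sample(loaded, n_train))

    # The test set is whichever loaded subjects are not in the training set.
    train = [s for s in loaded if s in chosen]
    test = [s for s in loaded if s not in chosen]
    return train, test


loaded_subjects = [0, 1, 2, 3, 4]

# Explicit subjects: 99 is not loaded, so it is silently dropped.
print(sketch_train_split(loaded_subjects, subjects=[0, 1, 99]))

# New split: the same random_state gives the same split on the same input.
print(sketch_train_split(loaded_subjects, size=0.6, random_state=1))
print(sketch_train_split(loaded_subjects, size=0.6, random_state=1))
```

In this sketch, a fixed random_state only guarantees reproducibility when the input list is the same, which mirrors the "if given the same input Dataset" caveat above.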
- Returns
- train_dataDataset
- The current Dataset as indexed by (i.e., containing only the subjects from) the requested training set, as defined by the passed parameters. This Dataset will have all the same metadata as the original Dataset, though if changes are made to it, they will not influence the original Dataset.
- test_dataDataset
- The current Dataset as indexed by (i.e., containing only the subjects from) the requested test set, as defined by the passed parameters. This Dataset will have all the same metadata as the original Dataset, though if changes are made to it, they will not influence the original Dataset.
See also
test_split
Return a train/test split but via specifying which subjects are test subjects.
set_train_split
Apply a train split, but storing the split information in the Dataset.
save_train_split
Save the train subjects from a split to a text file.
Examples
In [1]: import BPt as bp

# Load example data
In [2]: data = bp.read_pickle('data/example1.dataset')

In [3]: data
Out[3]:
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN

In [4]: tr_data, test_data = data.train_split(size=.6)
Performing train split on: 5 subjects.
random_state: None
Train split size: 0.6

Performed train/test split
Train size: 3
Test size: 2

In [5]: tr_data
Out[5]:
  animals  numbers
1   'cat'      2.0
2   'dog'      1.0
4   'elk'      NaN

In [6]: test_data
Out[6]:
  animals  numbers
0   'cat'      1.0
3   'dog'      2.0

In [7]: tr_data, test_data = data.train_split(subjects=[0, 1])
Performed train/test split
Train size: 2
Test size: 3

In [8]: tr_data
Out[8]:
  animals  numbers
0   'cat'      1.0
1   'cat'      2.0

In [9]: test_data
Out[9]:
  animals  numbers
2   'dog'      1.0
3   'dog'      2.0
4   'elk'      NaN
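The cv_strategy description mentions keeping members of the same family together in a split. The idea can be sketched without BPt as follows. This is a minimal illustration under stated assumptions, not how CVStrategy is implemented: the function sketch_group_split and the family labels are hypothetical, and here size is applied to the number of groups rather than the number of subjects.

```python
import random


def sketch_group_split(groups, size, random_state=None):
    """Hypothetical group-preserving split.

    groups maps each subject to a group label (e.g., a family id).
    Whole groups are assigned to train or test, never divided.
    """
    rng = random.Random(random_state)

    # Shuffle the unique group labels, not the individual subjects.
    labels = sorted(set(groups.values()))
    rng.shuffle(labels)

    # Assumption: size is a proportion of groups, rounded to the nearest group.
    n_train_groups = round(size * len(labels))
    train_labels = set(labels[:n_train_groups])

    train = [s for s, g in groups.items() if g in train_labels]
    test = [s for s, g in groups.items() if g not in train_labels]
    return train, test


# Subjects 0 and 1 share family "A", 2 and 3 share "B", 4 is alone in "C".
families = {0: "A", 1: "A", 2: "B", 3: "B", 4: "C"}
train, test = sketch_group_split(families, size=0.67, random_state=0)

# No family appears on both sides of the split.
assert {families[s] for s in train}.isdisjoint({families[s] for s in test})
print(train, test)
```

Because whole groups move together, the resulting number of training subjects can differ from what a purely random split of the same size would give.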