BPt.ProblemSpec

class BPt.ProblemSpec(target=0, scorer='default', scope='all', subjects='default', problem_type='default', n_jobs=1, random_state=1, base_dtype='float32')

Problem Spec is defined as an object encapsulating the set of parameters used by the different Evaluation Functions.
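
For instance, a minimal construction sketch (assuming BPt is installed and imported as BPt); the arguments shown simply mirror the defaults in the signature above:

    import BPt

    # Construct a ProblemSpec, spelling out the default values explicitly
    ps = BPt.ProblemSpec(target=0, scorer='default', scope='all',
                         subjects='default', problem_type='default',
                         n_jobs=1, random_state=1, base_dtype='float32')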

Parameters
target : int or str, optional

The target variable to predict, where the target variable is a loaded column within the Dataset that will eventually be used, and which has been set to the role of target.

This parameter can be passed either as the name of the column, or as an integer index. If passed an integer index (default = 0), then all loaded target variables will be sorted in alphabetical order and that index will be used to select the target to model.

default = 0
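
For example, a small sketch of both ways of selecting the target (the column name 'anxiety' is a hypothetical target column assumed to be loaded with role target):

    import BPt

    # Select the target by column name (hypothetical column 'anxiety')
    ps_by_name = BPt.ProblemSpec(target='anxiety')

    # Or select by integer index: loaded targets are sorted alphabetically
    # and the index picks one of them (0 selects the first)
    ps_by_index = BPt.ProblemSpec(target=0)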
scorer : str or list, optional

Indicator str for which scorer(s) to use when calculating validation scores in the context of different Evaluation Functions.

A list of str’s can be passed as well; in this case, scores for all of the requested scorers will be calculated and returned. In some cases though, for example cross_val_score(), only one scorer can be used, and if a list is passed here, the first element of the list will be used.

Note: If using a nested ParamSearch, the ParamSearch object has its own separate scorer parameter.

For a full list of the base sklearn supported scorers please view the scikit-learn docs at: https://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules

You can also view the BPt reference to these options at Scorers.

If left as ‘default’, reasonable scorers will be assigned based on the underlying problem type.

  • ‘regression’ : [‘r2’, ‘neg_mean_squared_error’]

  • ‘binary’ : [‘matthews’, ‘roc_auc’, ‘balanced_accuracy’]

  • ‘categorical’ : [‘matthews’, ‘roc_auc_ovr’, ‘balanced_accuracy’]

default = 'default'
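
For example, a sketch passing either a single scorer or a list of scorers (the scorer strings shown are taken from the defaults listed above):

    import BPt

    # Request a single scorer
    ps_single = BPt.ProblemSpec(scorer='r2')

    # Request several scorers; functions that support only one scorer
    # will fall back to the first element of the list
    ps_multi = BPt.ProblemSpec(scorer=['r2', 'neg_mean_squared_error'])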
scope : Scope, optional

This parameter allows for specifying that only a subset of columns be used in whatever modelling this ProblemSpec is passed to.

See Scope for a more detailed explanation / guide on how scopes are defined and used within BPt.

default = 'all'
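
For example, a sketch restricting which columns are used (the column names 'cortical_thickness' and 'age' are hypothetical, and passing a list of column names is assumed here to be a valid Scope input):

    import BPt

    # Use every loaded column (the default)
    ps_all = BPt.ProblemSpec(scope='all')

    # Restrict modelling to a hypothetical subset of columns
    ps_subset = BPt.ProblemSpec(scope=['cortical_thickness', 'age'])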
subjects : Subjects, optional

This parameter allows for specifying that the current experiment be run with only a subset of the current subjects.

A common use of this parameter is to pass the reserved keyword ‘train’, indicating that only the training subjects should be used.

If set to ‘default’, special behavior is used: if a train/test split is defined, then subjects will be set to ‘train’ by default (unless cv=’test’, in which case subjects will be set to ‘all’). If a train/test split is not defined, then subjects will be set to ‘all’.

See Subjects for more information on the different accepted BPt subject style inputs.

default = 'default'
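
For example, a sketch using the reserved keywords mentioned above:

    import BPt

    # Run only on the training subjects of a defined train/test split
    ps_train = BPt.ProblemSpec(subjects='train')

    # Or explicitly run on all subjects
    ps_everyone = BPt.ProblemSpec(subjects='all')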
problem_type : str or ‘default’, optional

This parameter controls what type of machine learning should be conducted: either regression or classification, where ‘categorical’ represents a special case of binary classification in which, typically, a binary classifier is trained on each class.

  • ‘default’

    Determine the problem type based on how the requested target variable is loaded.

  • ‘regression’, ‘f’ or ‘float’

    For ML on float/continuous target data.

  • ‘binary’ or ‘b’

    For ML on binary target data.

  • ‘categorical’ or ‘c’

    For ML on categorical target data, as multiclass.

This can almost always be left as default.

default = 'default'
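
For example, a sketch setting the problem type explicitly instead of inferring it from the target (the short aliases listed above are interchangeable with the full names):

    import BPt

    # Infer the problem type from how the target is loaded (the default)
    ps_auto = BPt.ProblemSpec(problem_type='default')

    # Force binary classification ('b' is an equivalent shorthand)
    ps_binary = BPt.ProblemSpec(problem_type='binary')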
n_jobs : int

n_jobs is used within the context of a call to Evaluate or Test.

In general, n_jobs is propagated to the different pipeline pieces on the backend as follows: if there is a parameter search, the base ML pipeline will be set to use 1 job, and the n_jobs budget will be used to train pipelines in parallel while exploring different params. Otherwise, if there is no param search, n_jobs will be used by each piece individually, though some pieces might not support it.

default = 1
random_state : int, RandomState instance or None, optional

Random state, either as an int for a specific seed, or None, in which case the random seed is set by np.random.

This parameter is used to ensure replicability of experiments (wherever possible!). In some cases, even with a random seed, you might still not get exact replicability: depending on the pipeline pieces being used, some components, e.g., some model optimizations, occasionally yield different results even with the same random seed.

Note

There are some good arguments to be made for not using a fixed seed in some cases. See: https://scikit-learn.org/stable/common_pitfalls.html#controlling-randomness

default = 1
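
For example, a sketch combining an n_jobs budget with control over the random seed:

    import BPt

    # Use 4 jobs; with a parameter search these parallelize the search itself,
    # otherwise they are passed to the individual pipeline pieces
    ps_seeded = BPt.ProblemSpec(n_jobs=4, random_state=42)

    # Let np.random control the seed instead of fixing it
    ps_unseeded = BPt.ProblemSpec(n_jobs=4, random_state=None)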
base_dtype : numpy dtype

The dataset is cast to a numpy array of floats. This parameter can be used to change the default behavior, e.g., if more or less resolution is needed.

default = 'float32'
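
For example, a sketch requesting more floating-point resolution than the default (assuming the dtype may be given as a string, just like the default 'float32'):

    import BPt

    # Cast the underlying data to float64 instead of the default float32
    ps_hi_res = BPt.ProblemSpec(base_dtype='float64')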

Methods

copy()

This method returns a deepcopy of the base object.

get_params([deep])

Get parameters for this estimator.

print_all([show_header, show_scorer, _print])

This method can be used to print a formatted representation of this object.

set_params(**params)

Set the parameters of this estimator.
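
For example, a short sketch exercising the methods above (get_params / set_params follow the usual scikit-learn estimator convention):

    import BPt

    ps = BPt.ProblemSpec(target=0, n_jobs=1)

    # Deep copy the spec, then change a parameter on the copy only
    ps_copy = ps.copy()
    ps_copy.set_params(n_jobs=4)

    # Inspect the current parameters as a dict
    params = ps.get_params()

    # Print a formatted representation of the object
    ps.print_all()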