Predict BMI#

This notebook shows a real-world example of using BPt to study the relationship between BMI and the brain. The data used in this notebook cannot be made public, as it is from the ABCD Study, which requires a data use agreement.

This notebook covers a number of different topics:

  • Preparing Data

  • Evaluating a single pipeline

  • Considering different options for how to use a test set

  • Introducing and using the Compare input option

[1]:
import pandas as pd
import BPt as bp
import numpy as np

# Don't show sklearn convergence warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
simplefilter("ignore", category=ConvergenceWarning)

# Display tables up to five decimals
pd.options.display.float_format = "{:,.5f}".format

Preparing Data#

We will first load the underlying dataset for this project, which has been saved as an Excel file. It contains multi-modal ROI change data from two timepoints of the ABCD Study (the difference between follow-up and baseline).

This saved dataset doesn't include the real family IDs, but an interesting aspect of the ABCD Study derived data is that a number of subjects come from the same family. We will handle that in this example (granted, with a fake family structure) by ensuring that, for any cross-validation split, members of the same family stay in the same training or testing fold.
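To make the grouping idea concrete before we set it up in BPt, here is a minimal, self-contained sketch of group-aware cross-validation using plain scikit-learn's GroupKFold (the toy arrays and family ids below are purely illustrative):

[ ]:
# A minimal sketch of the idea using scikit-learn directly (not BPt):
# rows sharing a group label (e.g., a family id) are never split across
# the train and validation sides of the same fold.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(8, 3)             # toy features
y = np.random.rand(8)                # toy target
families = [0, 0, 1, 1, 2, 2, 3, 3]  # toy family ids

for train_idx, val_idx in GroupKFold(n_splits=4).split(X, y, groups=families):
    print('train families:', sorted({families[i] for i in train_idx}),
          '| val families:', sorted({families[i] for i in val_idx}))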

[2]:
data = pd.read_excel('data/structure_base.xlsx')
list(data)[:10]
[2]:
['Unnamed: 0',
 'src_subject_id',
 'b_averaged_puberty',
 'b_agemos',
 'b_sex',
 'b_race_ethnicity_categories',
 'b_demo_highest_education_categories',
 'b_site_id_l',
 'b_subjects_no_missing_data',
 'bmi_keep']

This dataset contains a number of columns we don’t need. We will use the next cell both to group the variables of interest together and to select only the relevant columns to keep.

[3]:
# Our target variable
targets = ['b_bmi']

# Columns with different traditional 'co-variates'.
covars = ['b_sex', 'b_demo_highest_education_categories',
          'b_race_ethnicity_categories', 'b_agemos',
          'b_mri_info_deviceserialnumber']

# Let's also note which of these are categorical
cat_covars = ['b_mri_info_deviceserialnumber',
              'b_demo_highest_education_categories',
              'b_race_ethnicity_categories',
              'b_sex']

# These variables are any which we might want to use
# but not directly as input features! E.g., we
# might want to use them to inform choice of cross-validation.
non_input = ['b_rel_family_id']

# The different imaging features
thick = [d for d in list(data) if 'thick' in d]
area = [d for d in list(data) if 'smri_area_cort' in d]
subcort = [d for d in list(data) if 'smri_vol' in d]
dti_fa = [d for d in list(data) if 'dmri_dti_full_fa_' in d]
dti_md = [d for d in list(data) if 'dmri_dti_full_md_' in d]
brain = thick + area + subcort + dti_fa + dti_md

# All to keep
to_keep = brain + targets + covars + non_input

data = data[to_keep]
data.shape
[3]:
(9724, 668)

Now let’s convert from a pandas DataFrame to a BPt Dataset.

[4]:
data = bp.Dataset(data)

# This is optional, to print some extra statements.
data.verbose = 1
data.shape
[4]:
(9724, 668)

Next, we perform some actions specific to the Dataset class. These include specifying which columns are ‘target’ and ‘non input’; any columns we don’t set to one of these roles keep the default role, ‘data’.

[5]:
# Set targets to the target role
data = data.set_role(targets, 'target')

# Set non input to non input role
data = data.set_role(non_input, 'non input')

# Drop any missing values in the target variable
data = data.drop_subjects_by_nan(scope='target')

# We can optionally add the categories we made as scopes!
data.add_scope(covars, 'covars', inplace=True)
data.add_scope(cat_covars, 'cat covars', inplace=True)
data.add_scope(thick, 'thick', inplace=True)
data.add_scope(area, 'area', inplace=True)
data.add_scope(subcort, 'subcort', inplace=True)
data.add_scope(dti_fa, 'dti_fa', inplace=True)
data.add_scope(dti_md, 'dti_md', inplace=True)
data.add_scope(brain, 'brain', inplace=True)

# Drop all rows with any NaN values.
# Though BPt can generally handle NaN data fine,
# dropping them keeps this example simpler, as we don't
# have to worry about imputation.
data = data.dropna()

# Just show the first few rows
data.head()
Setting NaN threshold to: 0.5
Dropped 8 Rows
[5]:

Data

b_agemos b_demo_highest_education_categories b_mri_info_deviceserialnumber b_race_ethnicity_categories b_sex dmri_dti_full_fa_subcort_aseg_accumbens_area_lh dmri_dti_full_fa_subcort_aseg_accumbens_area_rh dmri_dti_full_fa_subcort_aseg_amygdala_lh dmri_dti_full_fa_subcort_aseg_amygdala_rh dmri_dti_full_fa_subcort_aseg_caudate_lh ... smri_vol_scs_subcorticalgv smri_vol_scs_suprateialv smri_vol_scs_tplh smri_vol_scs_tprh smri_vol_scs_vedclh smri_vol_scs_vedcrh smri_vol_scs_wholeb smri_vol_scs_wmhint smri_vol_scs_wmhintlh smri_vol_scs_wmhintrh
0 124 2 HASHe4f6957a 2 1 0.17195 0.17389 0.19923 0.16654 0.15518 ... 52,440.00000 935,475.83514 6,220.90000 5,787.40000 3,554.80000 3,427.90000 1,045,923.63514 478.40000 0.00000 0.00000
1 122 5 HASH1314a204 1 2 0.16897 0.16477 0.20000 0.17713 0.15848 ... 62,550.00000 1,078,644.86257 8,222.60000 7,571.40000 3,872.50000 3,837.40000 1,197,394.06258 769.70000 0.00000 0.00000
2 114 3 HASH69f406fa 1 1 0.22881 0.25250 0.23221 0.22298 0.22545 ... 60,695.00000 1,133,471.50460 7,040.50000 6,752.90000 4,588.80000 4,631.50000 1,291,126.70460 1,114.70000 0.00000 0.00000
3 130 5 HASH1314a204 1 2 0.16737 0.18840 0.15135 0.16215 0.18848 ... 65,614.00000 1,055,580.45187 8,365.70000 7,656.40000 4,600.10000 4,920.60000 1,189,841.05187 1,788.50000 0.00000 0.00000
4 115 5 HASHc3bf3d9c 1 2 0.18902 0.21375 0.23945 0.20581 0.20100 ... 59,174.00000 1,010,567.33270 6,577.90000 6,612.70000 3,434.30000 3,942.60000 1,144,069.73270 1,036.30000 0.00000 0.00000

5 rows × 666 columns

Target

b_bmi
0 15.17507
1 16.45090
2 24.43703
3 17.38701
4 17.59670

Non Input

b_rel_family_id
0 2257
1 11328
2 7607
3 11324
4 7608
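To double-check that the roles were applied as expected, we can list the columns assigned to a given role-based scope (a quick sketch using get_cols, which also appears below):

[ ]:
# Quick sanity check (sketch): which column(s) hold the 'target' role?
data.get_cols('target')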

The scopes we defined are handy, as they let us check columns or compose different scopes together. For example, we can check the scope we set as ‘cat covars’ composed with another variable:

[6]:
data.get_cols([cat_covars, 'b_rel_family_id'])
[6]:
['b_demo_highest_education_categories',
 'b_mri_info_deviceserialnumber',
 'b_race_ethnicity_categories',
 'b_rel_family_id',
 'b_sex']

These are notably the columns we want to make sure are categorical, so let’s ordinalize them, then plot.

[7]:
data = data.ordinalize([cat_covars, 'b_rel_family_id'])

# Then plot just the categorical variables
data.plot(scope='category', subjects='all', decode_values=True)
b_demo_highest_education_categories: 9478 rows
../_images/user_guide_bmi_full_example_14_1.png
b_mri_info_deviceserialnumber: 9478 rows
../_images/user_guide_bmi_full_example_14_3.png
b_race_ethnicity_categories: 9478 rows
../_images/user_guide_bmi_full_example_14_5.png
b_rel_family_id: 9478 rows
b_sex: 9478 rows
/home/sage/anaconda3/envs/bpt/lib/python3.9/site-packages/BPt/dataset/helpers.py:167: UserWarning: Skipping plot: b_rel_family_id as >= categories!
  warnings.warn(as_str)
../_images/user_guide_bmi_full_example_14_8.png

Let’s plot the target variable as well.

[8]:
data.plot('target')
b_bmi: 9478 rows
../_images/user_guide_bmi_full_example_16_1.png

Okay, we note that there are some extreme outliers.

[9]:
data = data.filter_outliers_by_std(n_std=10, scope='target')
data['target'].max()
Dropped 1 Rows
[9]:
b_bmi   53.90845
dtype: float64
[10]:
data.plot('target')
b_bmi: 9477 rows
../_images/user_guide_bmi_full_example_19_1.png

Okay, this maximum seems much more reasonable. Let’s also assume that there may be some extreme values present in the input data as well, and that these represent corrupted data that we therefore want to drop.

[11]:
# Repeat it twice, to deal with outliers at multiple scales
data = data.filter_outliers_by_std(n_std=10, scope='float')
data = data.filter_outliers_by_std(n_std=10, scope='float')
Dropped 62 Rows
Dropped 14 Rows

Next, we consider splitting up our data with a global train and test split. This can be useful in some instances. Note that we also define a CV strategy which says to perform the train-test split while keeping members of the same family in the same fold.

[12]:
# Let's say we want to keep family members in the same train or test split
cv_strategy = bp.CVStrategy(groups='b_rel_family_id')

# Test split
data = data.set_test_split(.2, random_state=5, cv_strategy=cv_strategy)
data
Performing test split on: 9401 subjects.
random_state: 5
Test split size: 0.2

Performed train/test split
Train size: 7481
Test size:  1920
[12]:

Data

b_agemos b_demo_highest_education_categories b_mri_info_deviceserialnumber b_race_ethnicity_categories b_sex dmri_dti_full_fa_subcort_aseg_accumbens_area_lh dmri_dti_full_fa_subcort_aseg_accumbens_area_rh dmri_dti_full_fa_subcort_aseg_amygdala_lh dmri_dti_full_fa_subcort_aseg_amygdala_rh dmri_dti_full_fa_subcort_aseg_caudate_lh ... smri_vol_scs_subcorticalgv smri_vol_scs_suprateialv smri_vol_scs_tplh smri_vol_scs_tprh smri_vol_scs_vedclh smri_vol_scs_vedcrh smri_vol_scs_wholeb smri_vol_scs_wmhint smri_vol_scs_wmhintlh smri_vol_scs_wmhintrh
0 124 1 26 1 0 0.17195 0.17389 0.19923 0.16654 0.15518 ... 52,440.00000 935,475.83514 6,220.90000 5,787.40000 3,554.80000 3,427.90000 1,045,923.63514 478.40000 0.00000 0.00000
1 122 4 2 0 1 0.16897 0.16477 0.20000 0.17713 0.15848 ... 62,550.00000 1,078,644.86257 8,222.60000 7,571.40000 3,872.50000 3,837.40000 1,197,394.06258 769.70000 0.00000 0.00000
2 114 2 13 0 0 0.22881 0.25250 0.23221 0.22298 0.22545 ... 60,695.00000 1,133,471.50460 7,040.50000 6,752.90000 4,588.80000 4,631.50000 1,291,126.70460 1,114.70000 0.00000 0.00000
3 130 4 2 0 1 0.16737 0.18840 0.15135 0.16215 0.18848 ... 65,614.00000 1,055,580.45187 8,365.70000 7,656.40000 4,600.10000 4,920.60000 1,189,841.05187 1,788.50000 0.00000 0.00000
4 115 4 20 0 1 0.18902 0.21375 0.23945 0.20581 0.20100 ... 59,174.00000 1,010,567.33270 6,577.90000 6,612.70000 3,434.30000 3,942.60000 1,144,069.73270 1,036.30000 0.00000 0.00000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9717 119 1 2 2 1 0.16392 0.17222 0.19906 0.17829 0.15375 ... 47,422.00000 893,537.35606 6,066.30000 5,342.00000 2,808.40000 3,469.60000 999,770.35606 1,152.40000 0.00000 0.00000
9720 129 4 7 0 1 0.23469 0.25338 0.25971 0.26321 0.22641 ... 59,348.00000 1,070,071.00812 7,036.60000 6,669.10000 3,817.40000 4,103.40000 1,198,435.10812 794.90000 0.00000 0.00000
9721 108 2 0 1 0 0.15469 0.20223 0.17145 0.15300 0.15900 ... 63,328.00000 1,093,051.50897 8,123.20000 7,346.20000 3,992.80000 4,219.80000 1,213,030.80897 805.50000 0.00000 0.00000
9722 110 2 17 0 1 0.16530 0.22484 0.19524 0.16084 0.16671 ... 57,037.00000 997,648.58273 6,692.50000 6,437.90000 3,833.40000 3,887.70000 1,127,133.48273 1,477.90000 0.00000 0.00000
9723 113 4 26 0 1 0.15050 0.16066 0.16982 0.16484 0.15355 ... 61,090.00000 989,701.56266 7,113.50000 6,835.30000 4,029.60000 3,826.00000 1,134,202.76266 2,304.80000 0.00000 0.00000

9401 rows × 666 columns

7481 rows × 666 columns - Train Set

1920 rows × 666 columns - Test Set

Target

b_bmi
0 15.17507
1 16.45090
2 24.43703
3 17.38701
4 17.59670
... ...
9717 16.26895
9720 31.95732
9721 14.19771
9722 24.42694
9723 18.25486

9401 rows × 1 columns

7481 rows × 1 columns - Train Set

1920 rows × 1 columns - Test Set

Non Input

b_rel_family_id
0 1589
1 7867
2 5174
3 7864
4 5175
... ...
9717 8070
9720 1814
9721 5719
9722 6412
9723 1740

9401 rows × 1 columns

7481 rows × 1 columns - Train Set

1920 rows × 1 columns - Test Set

Single Training Set Evaluation#

Our Dataset is now fully prepared, so we can define and evaluate a machine learning pipeline. We will start by considering a pipeline with a few steps:

  1. Winsorize just the brain data.

  2. Standard scale all ‘float’ type columns (i.e., the ‘brain’ features plus any continuous ‘covars’).

  3. One hot encode any categorical features.

  4. Fit an Elastic-Net regression with a nested random hyper-parameter search.

Let’s also use one other feature of the toolbox: a custom cross-validation strategy. This is the same idea we used when defining the train-test split, but now we want it to apply both to the evaluation folds and to the internal splits made when searching over hyper-parameters.

[13]:
# 1. Winsorize just the brain data
w_scaler = bp.Scaler('winsorize', quantile_range=(2, 98), scope='brain')

# 2. Standard scale all float type columns
s_scaler = bp.Scaler('standard', scope='float')

# 3. One hot encode any categorical features
ohe = bp.Transformer('one hot encoder', scope='category')

# 4. Elastic-net with a nested random hyper-parameter search
param_search = bp.ParamSearch('RandomSearch', n_iter=60,
                              cv=bp.CV(splits=3, cv_strategy=cv_strategy))
elastic = bp.Model('elastic', params=1,
                   param_search=param_search)

# Now we can actually define the pipeline
pipe = bp.Pipeline(steps=[w_scaler, s_scaler, ohe, elastic])
pipe
pipe
[13]:
Pipeline(steps=[Scaler(extra_params={'quantile_range': (2, 98)},
                       obj='winsorize', scope='brain'),
                Scaler(obj='standard'),
                Transformer(obj='one hot encoder', scope='category'),
                Model(obj='elastic',
                      param_search=ParamSearch(cv=CV(cv_strategy=CVStrategy(groups='b_rel_family_id')),
                                               n_iter=60),
                      params=1)])

Let’s say we want to use 5-fold CV to evaluate this model on just the training set. We first define the CV object, the same as for the param_search above, but this time with splits=5.

[14]:
cv=bp.CV(splits=5, n_repeats=1, cv_strategy=cv_strategy)
cv
[14]:
CV(cv_strategy=CVStrategy(groups='b_rel_family_id'), splits=5)

We will also make a problem_spec to store some common parameters: in this case, the random state (for reproducibility of results) and the number of jobs to use.

[15]:
# Use problem spec just to store n jobs and random state
ps = bp.ProblemSpec(n_jobs=8, random_state=10)
ps
[15]:
ProblemSpec(n_jobs=8, random_state=10)

Now we are ready to evaluate this pipeline; let’s check an example using the function evaluate.

We will set just one specific parameter to start: a scope of ‘brain’, to say we want to run the model with only the brain features (i.e., not the co-variates).

[16]:
evaluator = bp.evaluate(pipeline=pipe,
                        dataset=data,
                        problem_spec=ps,
                        scope='brain',
                        cv=cv)
evaluator
[16]:
BPtEvaluator
------------
mean_scores = {'explained_variance': 0.25655836420425737, 'neg_mean_squared_error': -12.15230817006252}
std_scores = {'explained_variance': 0.02393695017649266, 'neg_mean_squared_error': 1.1099605948532965}

Saved Attributes: ['estimators', 'preds', 'timing', 'train_subjects', 'val_subjects', 'feat_names', 'ps', 'mean_scores', 'std_scores', 'weighted_mean_scores', 'scores', 'fis_', 'coef_']

Available Methods: ['get_preds_dfs', 'get_fis', 'get_coef_', 'permutation_importance']

Evaluated with:
ProblemSpec(n_jobs=8, problem_type='regression', random_state=10, scope='brain',
            scorer={'explained_variance': make_scorer(explained_variance_score),
                    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
            subjects='train', target='b_bmi')

Wait, what subset of subjects did we use? We want to make sure we are using just the training set of subjects at this stage. We used the default value for subjects, which in this case (i.e., when a test set is defined) means the training set is used by evaluate. We can confirm this by checking the intersection of the first train and validation fold with the test subjects.

[17]:
evaluator.train_subjects[0].intersection(data.test_subjects), evaluator.val_subjects[0].intersection(data.test_subjects)
[17]:
(Int64Index([], dtype='int64'), Int64Index([], dtype='int64'))

As we can see, they are empty. We could also specify subjects=‘train’ above to make sure this behavior occurred.
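For reference, the fully explicit version of the call might look like the sketch below, which assumes (as with other problem_spec parameters) that subjects can be passed directly to evaluate; it should behave the same as relying on the default:

[ ]:
# Sketch: the same evaluation as before, but with subjects='train'
# passed explicitly rather than relying on the default behavior.
evaluator = bp.evaluate(pipeline=pipe,
                        dataset=data,
                        problem_spec=ps,
                        scope='brain',
                        subjects='train',
                        cv=cv)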

How to employ Test Set?#

Next, we will briefly discuss two methods of evaluating on the test set.

  1. There is a notion in some of the literature that the test set should be used to evaluate an already trained estimator. In this case, that would consist of selecting, from the 5 trained models we evaluated on the train set, the one which performed best, and testing it.

  2. On the other hand, we could instead re-train an estimator on the full training set and then test that on the test set. This is actually the recommended strategy, but it’s still worth comparing both below.

Let’s first do strategy 1.

[18]:
# Get best model
best_model_ind = np.argmax(evaluator.scores['explained_variance'])
best_model = evaluator.estimators[best_model_ind]

# Get the correct test data - setting the problem spec
# as the problem spec saved when running this evaluator
X_test, y_test = data.get_Xy(problem_spec=evaluator.ps,
                             subjects='test')

# Score
score = best_model.score(X_test, y_test)
print(score)
0.2724861166614897

Next, let’s compare that with re-training our model on the full training set, then testing.

[19]:
evaluator = bp.evaluate(pipeline=pipe,
                        dataset=data,
                        problem_spec=ps,
                        scope='brain', # Same as before
                        cv='test')

evaluator
[19]:
BPtEvaluator
------------
mean_scores = {'explained_variance': 0.27514029708500876, 'neg_mean_squared_error': -12.269150360411253}
std_scores = {'explained_variance': 0.0, 'neg_mean_squared_error': 0.0}

Saved Attributes: ['estimators', 'preds', 'timing', 'train_subjects', 'val_subjects', 'feat_names', 'ps', 'mean_scores', 'std_scores', 'weighted_mean_scores', 'scores', 'fis_', 'coef_']

Available Methods: ['get_preds_dfs', 'get_fis', 'get_coef_', 'permutation_importance']

Evaluated with:
ProblemSpec(n_jobs=8, problem_type='regression', random_state=10, scope='brain',
            scorer={'explained_variance': make_scorer(explained_variance_score),
                    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
            subjects='all', target='b_bmi')

Note that in this case, when cv=‘test’, all of the subjects will be used. If we wanted to make this behavior explicit, we could pass subjects=‘all’.
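As a sketch, and under the same assumption as before that subjects can be passed directly to evaluate, the explicit form would be:

[ ]:
# Sketch: the same test-set evaluation, with subjects='all' made explicit.
evaluator = bp.evaluate(pipeline=pipe,
                        dataset=data,
                        problem_spec=ps,
                        scope='brain',
                        subjects='all',
                        cv='test')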

We can see that the results differ, though not hugely. One explanation is that at these sample sizes we still get boosts in performance from additional training data, which means we expect our final model, trained on the full and larger training set, to do better. Another contributing factor is that by selecting from the 5 folds the model which does ‘best’, we may actually be choosing an overfit model which happened to do better on its corresponding validation set. We therefore recommend the latter strategy.

In fact, we can formalize this intuition and test it on just the training set. We can do this because the first method is really just a type of meta-estimator: in an internal nested step, it trains five models and selects the best one. We can formalize this as a custom model and compare it below.

We will also change the pipeline being compared a little bit for simplicity, replacing the elastic net we made via BPt with the ElasticNetCV object from scikit-learn.

[20]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import ElasticNetCV
from sklearn.base import BaseEstimator

class BestOfModel(BaseEstimator):
    '''Meta-estimator which runs an internal CV, then keeps only the
    single estimator that scored best on its validation fold.'''

    def __init__(self, estimator, cv=5):

        self.estimator = estimator
        self.cv = cv

    def fit(self, X, y):

        # Train one copy of the estimator per internal fold
        scores = cross_validate(self.estimator, X, y, cv=self.cv,
                                return_estimator=True)

        # Keep only the estimator with the best validation score
        best_ind = np.argmax(scores['test_score'])
        self.estimator_ = scores['estimator'][best_ind]

        # Return self to follow the scikit-learn fit convention
        return self

    def predict(self, X):

        return self.estimator_.predict(X)


# The new elastic net cv to use
new_elastic = ElasticNetCV()

# The best of model
best_of_model = bp.Model(
    BestOfModel(estimator=new_elastic, cv=5))

# The static model
static_model = bp.Model(new_elastic)

# Use the same initial steps as before, but now with the best-of model and the static model
best_of_pipe = bp.Pipeline(steps=[w_scaler, s_scaler, ohe, best_of_model])
static_pipe = bp.Pipeline(steps=[w_scaler, s_scaler, ohe, static_model])

Let’s try the best-of model first, evaluating it on the training set with 5-fold CV.

Note: This is going to be a bit slow, as each time a model is trained in one of the 5 folds it needs to internally train 5 more models, so it should take roughly 5x longer to evaluate than its static counterpart.

[21]:
evaluator = bp.evaluate(pipeline=best_of_pipe,
                        dataset=data,
                        problem_spec=ps,
                        scope='brain',
                        cv=cv)

evaluator
[21]:
BPtEvaluator
------------
mean_scores = {'explained_variance': 0.2485223109713471, 'neg_mean_squared_error': -12.281307958444154}
std_scores = {'explained_variance': 0.024857512022276998, 'neg_mean_squared_error': 1.112832141546892}

Saved Attributes: ['estimators', 'preds', 'timing', 'train_subjects', 'val_subjects', 'feat_names', 'ps', 'mean_scores', 'std_scores', 'weighted_mean_scores', 'scores']

Available Methods: ['get_preds_dfs', 'permutation_importance']

Evaluated with:
ProblemSpec(n_jobs=8, problem_type='regression', random_state=10, scope='brain',
            scorer={'explained_variance': make_scorer(explained_variance_score),
                    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
            subjects='train', target='b_bmi')
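Since ‘timing’ is listed among the saved attributes above, we can also sanity-check this extra fit cost directly (a sketch; the exact structure of the stored timings may vary between BPt versions):

[ ]:
# Sketch: inspect the fit / scoring times recorded for each fold.
evaluator.timing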

Next, let’s run its corresponding ‘static’ counterpart.

[22]:
evaluator = bp.evaluate(pipeline=static_pipe,
                        dataset=data,
                        problem_spec=ps,
                        scope='brain',
                        cv=cv)

evaluator
[22]:
BPtEvaluator
------------
mean_scores = {'explained_variance': 0.25832965982663025, 'neg_mean_squared_error': -12.120236239339546}
std_scores = {'explained_variance': 0.02451883500218578, 'neg_mean_squared_error': 1.0828442666660394}

Saved Attributes: ['estimators', 'preds', 'timing', 'train_subjects', 'val_subjects', 'feat_names', 'ps', 'mean_scores', 'std_scores', 'weighted_mean_scores', 'scores', 'fis_', 'coef_']

Available Methods: ['get_preds_dfs', 'get_fis', 'get_coef_', 'permutation_importance']

Evaluated with:
ProblemSpec(n_jobs=8, problem_type='regression', random_state=10, scope='brain',
            scorer={'explained_variance': make_scorer(explained_variance_score),
                    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
            subjects='train', target='b_bmi')

These results confirm the intuition, namely that it is better to re-train using as many subjects as possible rather than choose a ‘best model’ from an internal CV.

Compare & Evaluate#

Let’s now try out a newer feature, the input class Compare. We will define, as the options to compare, different subsets (scopes) of the input data.

[23]:
compare_scopes = bp.Compare(['covars', 'brain', 'all',
                             'thick', 'area', 'subcort',
                             'dti_fa', 'dti_md'])

evaluators = bp.evaluate(pipeline=pipe,
                         scope=compare_scopes,
                         dataset=data,
                         problem_spec=ps,
                         cv=cv)

# Look at a summary of the results. Note: this option is only available
# after a call to evaluate in which one or more Compare's were passed!
evaluators.summary()
[23]:
mean_scores_explained_variance mean_scores_neg_mean_squared_error std_scores_explained_variance std_scores_neg_mean_squared_error mean_timing_fit mean_timing_score
scope
covars 0.11317 -14.51761 0.01192 1.45243 4.73712 0.03311
brain 0.25656 -12.15231 0.02394 1.10996 15.58119 0.05231
all 0.27908 -11.78751 0.02294 1.10537 15.45347 0.07103
thick 0.10118 -14.69564 0.02711 1.38971 8.06942 0.01529
area 0.02774 -15.89653 0.00991 1.46288 7.71240 0.01417
subcort 0.05189 -15.50615 0.00831 1.39405 4.38906 0.02439
dti_fa 0.10302 -14.66791 0.00659 1.39366 8.24234 0.01827
dti_md 0.14541 -13.96297 0.01127 1.21864 8.31585 0.01754
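Because summary() returns a regular pandas DataFrame, we can manipulate it further; for example, a quick sketch ranking the compared scopes by mean explained variance:

[ ]:
# Sketch: rank the compared scopes by their mean explained variance.
summary = evaluators.summary()
summary.sort_values('mean_scores_explained_variance', ascending=False)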

We can also try another available method, which runs a pairwise t-test comparison between all of the options:

[24]:
evaluators.pairwise_t_stats(metric='explained_variance')
[24]:
scope (1) scope (2) t_stat p_val
0 covars brain -11.36825 0.00478
1 covars all -14.32368 0.00193
2 covars thick 0.86450 1.00000
3 covars area 7.66506 0.02180
4 covars subcort 9.07301 0.01145
5 covars dti_fa 1.97369 1.00000
6 covars dti_md -4.56856 0.14381
7 brain all -15.67812 0.00135
8 brain thick 30.46195 0.00010
9 brain area 17.32117 0.00091
10 brain subcort 17.40024 0.00090
11 brain dti_fa 12.94696 0.00287
12 brain dti_md 11.98864 0.00388
13 all thick 31.58027 0.00008
14 all area 19.16281 0.00061
15 all subcort 20.49440 0.00047
16 all dti_fa 15.58172 0.00139
17 all dti_md 15.37322 0.00146
18 thick area 5.10927 0.09713
19 thick subcort 3.32372 0.40987
20 thick dti_fa -0.14599 1.00000
21 thick dti_md -3.56252 0.32946
22 area subcort -2.76953 0.70498
23 area dti_fa -11.38268 0.00476
24 area dti_md -12.59938 0.00320
25 subcort dti_fa -7.99938 0.01854
26 subcort dti_md -31.20696 0.00009
27 dti_fa dti_md -6.47580 0.04102

We can also explicitly compare two evaluators:

[25]:
e1 = evaluators['scope=covars']
e2 = evaluators['scope=brain']

# Look at function docstring
print('       ', e1.compare.__doc__)
print()

e1.compare(e2)
        This method is designed to perform a statistical comparison
        between the results from the evaluation stored in this object
        and another instance of :class:`BPtEvaluator`. The statistics
        produced here are explained in:
        https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_stats.html

        .. note::
            In the case that the sizes of the training and validation sets
            at each fold vary dramatically, it is unclear if this
            statistics are still valid.
            In that case, the mean train size and mean validation sizes
            are employed when computing statistics.

        Parameters
        ------------
        other : :class:`BPtEvaluator`
            Another instance of :class:`BPtEvaluator` in which
            to compare which. The cross-validation used
            should be the same in both instances, otherwise
            statistics will not be generated.

        rope_interval : list or dict of
            | This parameter allows for passing in a custom
                region of practical equivalence interval (or rope interval)
                a concept from bayesian statistics. If passed as
                a list, this should be a list with two elements, describing
                the difference in score which should be treated as two
                models or runs being practically equivalent.

            | Alternatively, in the case of multiple underlying
                scorers / metrics. A dictionary, where keys correspond
                to scorer / metric names can be passed with a separate
                rope_interval for each. For example:

            ::

                rope_interval = {'explained_variance': [-0.01, 0.01],
                                 'neg_mean_squared_error': [-1, 1]}

            This example would define separate rope regions depending
            on the metric.

            ::

                default = [-0.01, 0.01]

        Returns
        -------
        compare_df : pandas DataFrame
            | The returned DataFrame will generate separate rows
                for all overlapping metrics / scorers between the
                evaluators being compared. Further, columns with
                statistics of interest will be generated:

                - 'mean_diff'
                    The mean score minus other's mean score

                - 'std_diff'
                    The std minus other's std

            | Further, only in the case that the cross-validation
                folds are identical between the comparisons,
                the following additional columns will be generated:

                - 't_stat'
                    Corrected paired ttest statistic.

                - 'p_val'
                    The p value for the corrected paired ttest statistic.

                - 'better_prob'
                    The probability that this evaluated option is better than
                    the other evaluated option under a bayesian framework and
                    the passed value of rope_interval. See sklearn example
                    for more details.

                - 'worse_prob'
                    The probability that this evaluated option is worse than
                    the other evaluated option under a bayesian framework and
                    the passed value of rope_interval. See sklearn example
                    for more details.

                - 'rope_prob'
                    The probability that this evaluated option is equivalent to
                    the other evaluated option under a bayesian framework and
                    the passed value of rope_interval. See sklearn example
                    for more details.



[25]:
mean_diff std_diff t_stat p_val better_prob worse_prob rope_prob
explained_variance -0.14339 -0.01202 -11.36825 0.00017 0.00013 0.99977 0.00009
neg_mean_squared_error -2.36530 0.34247 -8.09697 0.00063 0.00062 0.99936 0.00002

Okay, so it looks like it ran all the different combinations, but how do we look at each individual evaluator and the full results? There are a few different ways to index the object storing the different evaluators, but it’s essentially a dictionary. We can index one run as follows:

[26]:
evaluators['scope=covars']
[26]:
BPtEvaluator
------------
mean_scores = {'explained_variance': 0.11316930054010477, 'neg_mean_squared_error': -14.517611278919068}
std_scores = {'explained_variance': 0.011916342728361404, 'neg_mean_squared_error': 1.4524277905958483}

Saved Attributes: ['estimators', 'preds', 'timing', 'train_subjects', 'val_subjects', 'feat_names', 'ps', 'mean_scores', 'std_scores', 'weighted_mean_scores', 'scores', 'fis_', 'coef_']

Available Methods: ['get_preds_dfs', 'get_fis', 'get_coef_', 'permutation_importance']

Evaluated with:
ProblemSpec(n_jobs=8, problem_type='regression', random_state=10,
            scope='covars',
            scorer={'explained_variance': make_scorer(explained_variance_score),
                    'neg_mean_squared_error': make_scorer(mean_squared_error, greater_is_better=False)},
            subjects='train', target='b_bmi')
[27]:
# Note: with get_fis, if mean=True it will return the mean
# feature importances, keeping only non-null, non-zero features.
evaluators['scope=covars'].get_fis(mean=True)
[27]:
b_demo_highest_education_categories=1           0.59780
b_demo_highest_education_categories=2           0.62281
b_demo_highest_education_categories=4          -0.79331
b_demo_highest_education_categories=5          -1.12268
b_mri_info_deviceserialnumber='HASH03db707f'    0.17417
b_mri_info_deviceserialnumber='HASH11ad4ed5'   -0.60833
b_mri_info_deviceserialnumber='HASH1314a204'   -0.70919
b_mri_info_deviceserialnumber='HASH311170b9'   -0.14871
b_mri_info_deviceserialnumber='HASH3935c89e'   -0.42419
b_mri_info_deviceserialnumber='HASH4b0b8b05'    0.02187
b_mri_info_deviceserialnumber='HASH5ac2b20b'   -0.00810
b_mri_info_deviceserialnumber='HASH5b0cf1bb'    0.02770
b_mri_info_deviceserialnumber='HASH5b2fcf80'   -0.32682
b_mri_info_deviceserialnumber='HASH65b39280'    0.44976
b_mri_info_deviceserialnumber='HASH69f406fa'    0.00030
b_mri_info_deviceserialnumber='HASH6b4422a7'    0.09136
b_mri_info_deviceserialnumber='HASH7911780b'    0.73788
b_mri_info_deviceserialnumber='HASH7f91147d'   -0.05698
b_mri_info_deviceserialnumber='HASH96a0c182'   -0.01863
b_mri_info_deviceserialnumber='HASHa3e45734'   -0.09858
b_mri_info_deviceserialnumber='HASHb640a1b8'    0.56248
b_mri_info_deviceserialnumber='HASHc3bf3d9c'    0.10516
b_mri_info_deviceserialnumber='HASHd422be27'    0.00579
b_mri_info_deviceserialnumber='HASHd7cb4c6d'    0.01132
b_mri_info_deviceserialnumber='HASHdb2589d4'    0.00784
b_mri_info_deviceserialnumber='HASHe4f6957a'   -0.19978
b_mri_info_deviceserialnumber='HASHfeb7e81a'    0.03217
b_race_ethnicity_categories=1                  -0.73047
b_race_ethnicity_categories=2                   0.87523
b_race_ethnicity_categories=3                   0.74217
b_race_ethnicity_categories=4                  -0.32284
b_sex=1                                        -0.08126
b_sex=2                                         0.07723
b_agemos                                        0.39502
dtype: float64

Warning: this doesn’t mean that all of these features were selected in every model! Remember, this is the average over the 5 trained models, so a feature showing up here means it was selected at least once in at least one of the models.
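To see how often each feature was actually selected, we could instead look at the per-model feature importances (a sketch, assuming get_fis() with its default arguments returns one row per trained model):

[ ]:
# Sketch: count, per feature, in how many of the 5 trained models the
# feature importance was present and non-zero.
fis = evaluators['scope=covars'].get_fis()
(fis.fillna(0) != 0).sum().sort_values(ascending=False).head(10)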

[28]:
# These are the different options
evaluators.keys()
[28]:
dict_keys([Options(scope=covars), Options(scope=brain), Options(scope=all), Options(scope=thick), Options(scope=area), Options(scope=subcort), Options(scope=dti_fa), Options(scope=dti_md)])