BPt.EvalResults.compare

EvalResults.compare(other, rope_interval=[-0.01, 0.01])

This method performs a statistical comparison between the results from the evaluation stored in this object and those from another instance of EvalResults. The statistics produced are explained in: https://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_stats.html

Note

In the case that the sizes of the training and validation sets vary dramatically across folds, it is unclear whether these statistics are still valid. In that case, the mean train size and mean validation size are used when computing the statistics.
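
For intuition, the following is a minimal sketch of the corrected paired t-test described in the linked scikit-learn example, showing where the (mean) train and validation sizes enter the correction. It only illustrates the statistic; it is not BPt's internal implementation, and the function name corrected_paired_ttest is hypothetical.

import numpy as np
from scipy import stats

def corrected_paired_ttest(scores_a, scores_b, n_train, n_test):
    # Per-fold score differences between the two evaluations
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    k = len(diff)
    # Nadeau-Bengio style correction: inflate the variance of the mean
    # difference to account for overlapping training sets across folds,
    # using the (mean) validation and train sizes
    corrected_var = diff.var(ddof=1) * (1 / k + n_test / n_train)
    t_stat = diff.mean() / np.sqrt(corrected_var)
    # Two-sided p-value with k - 1 degrees of freedom
    p_val = 2 * stats.t.sf(np.abs(t_stat), df=k - 1)
    return t_stat, p_val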

Parameters
other : EvalResults

Another instance of EvalResults with which to compare. The cross-validation used should be the same in both instances, otherwise statistics will not be generated.

rope_interval : list or dict of lists
This parameter allows passing in a custom region of practical equivalence (ROPE) interval, a concept from Bayesian statistics. If passed as a list, it should contain two elements describing the range of score differences within which two models or runs are treated as practically equivalent.
Alternatively, in the case of multiple underlying scorers / metrics, a dictionary can be passed, with keys corresponding to scorer / metric names and a separate rope_interval for each as values. For example:
rope_interval = {'explained_variance': [-0.01, 0.01],
                 'neg_mean_squared_error': [-1, 1]}

This example would define a separate ROPE region depending on the metric.

default = [-0.01, 0.01]
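
As a usage sketch, the per-metric form might be passed like this (results and other_results are hypothetical EvalResults instances):

compare_df = results.compare(other_results,
                             rope_interval={'explained_variance': [-0.01, 0.01],
                                            'neg_mean_squared_error': [-1, 1]})
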
Returns
compare_df : pandas DataFrame
The returned DataFrame will contain a separate row for each metric / scorer that overlaps between the evaluators being compared. Further, columns with statistics of interest will be generated:
  • ‘mean_diff’

    The mean score minus other’s mean score

  • ‘std_diff’

    The standard deviation of the scores minus other's standard deviation

Further, the following additional columns will be generated only in the case that the cross-validation folds are identical between the two evaluations:
  • ‘t_stat’

    The corrected paired t-test statistic.

  • ‘p_val’

    The p-value for the corrected paired t-test statistic.

  • ‘better_prob’

    The probability that this evaluated option is better than the other evaluated option under a Bayesian framework and the passed value of rope_interval. See the scikit-learn example linked above for more details.

  • ‘worse_prob’

    The probability that this evaluated option is worse than the other evaluated option under a Bayesian framework and the passed value of rope_interval. See the scikit-learn example linked above for more details.

  • ‘rope_prob’

    The probability that this evaluated option is practically equivalent to the other evaluated option under a Bayesian framework and the passed value of rope_interval. See the scikit-learn example linked above, and the sketch below, for more details.
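
As in the linked scikit-learn example, these three probabilities can be read off a posterior t-distribution over the mean score difference. The sketch below only illustrates that idea with made-up fold differences and sizes; the variable names and values are assumptions, not BPt internals.

import numpy as np
from scipy import stats

# Illustrative per-fold score differences (this option minus other) and sizes
diff = np.array([0.02, 0.01, -0.005, 0.03, 0.015])
k, n_train, n_test = len(diff), 800, 200
corrected_std = np.sqrt(diff.var(ddof=1) * (1 / k + n_test / n_train))

# Posterior t-distribution over the mean difference, with k - 1 degrees of freedom
posterior = stats.t(df=k - 1, loc=diff.mean(), scale=corrected_std)

rope_interval = [-0.01, 0.01]
better_prob = 1 - posterior.cdf(0)   # this option scores higher
worse_prob = posterior.cdf(0)        # the other option scores higher
rope_prob = (posterior.cdf(rope_interval[1])
             - posterior.cdf(rope_interval[0]))  # practically equivalent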