Scope#

Scope’s represent a key concept within BPt, that are present when preparing data with the Dataset class (See functions for adding and removing scopes to the Dataset: add_scope and remove_scope), and during ML. The easiest way to think about the scope argument is simply as an expanded way of index’ing the loaded columns.

The scope argument can also be found across different ModelPipeline pieces and within ProblemSpec. The fundamental idea is that during loading, plotting, ML, etc… it is often desirable to specify a subset of the total loaded columns/features. This is accomplished within BPt via the concept of ‘scope’ and the ‘scope’ parameter.

The concept of scopes extends beyond the Dataset class to the rest of BPt. The fundamental idea is that it provides a utility for more easily selecting different subsets of columns from the full dataset. This is accomplished by providing different functions and methods with a scope argument, which accepts any BPt style Scope input, and then operates just on that subset of columns. For example consider the example below with the function get_cols.

# Empty Dataset with 3 columns
data = Dataset(columns=['1', '2', '3'])

# scope of 'all' will return all columns
cols = data.get_cols(scope='all')

# cols == ['1', '2', '3']

In this example, we pass a fixed input str scope: ‘all’. This is a special reserved scope which will always return all columns. In addition to ‘all’ there are a number of other reserved special scopes which cannot be set, and have their own fixed behavior. These are:

‘all’
All loaded columns
‘float’
All loaded columns of type ‘float’, i.e., a continuous variable and not a categorical variable or a data file, see: Data Types
‘category’
All loaded columns of type / scope ‘category’, see Data Types.
‘data file’
All loaded columns of type / scope ‘data file’, see Data Types.
‘data’
All loaded columns with role ‘data’, see Role.
‘target’
All loaded columns with role ‘target’, see Role.
‘non input’
All loaded columns with role ‘non input’, see Role.
‘data float’
All loaded columns of type ‘float’ with role ‘data’.
‘data category’
All loaded columns of type ‘float’ with role ‘data’.
‘target float’
All loaded columns of type ‘float’ with role ‘target’.
‘target category’
All loaded columns of type ‘float’ with role ‘target’.

Those enumerated, the scope system also passing other strings, which are not one of the above, reserved scopes. In the case that a string is passed, the following options are possible and are checked in this order:

1. Passing the name of a column directly. In this case that column will be returned by name. E.g., with the variable data from before:

cols = data.get_cols(scope='1')

This will specify just the column ‘1’.

2. Passing the name of a scope. What this refers to is the ability to add custom scopes to columns with add_scope. This acts as a tagging system, where you can create custom subsets. For example if we wanted the subset of ‘1’ and ‘3’, we can pass scope=[‘1’, ‘3’], but if we were using this same set many times, we can also set the scopes of each of these columns to a custom scope, e.g.,

data.set_scopes({'1': 'custom', '3': 'custom'})

cols = data.get_cols(scope='custom')

In this case, cols would return us the scope ‘custom’. Likewise, you may remove scopes with remove_scope.

3. Passing a stub. This functionality allows us to pass a common substring present across a number of columns, and lets us select all columns with that substring. For example, let’s say we have columns ‘my_col1’, ‘my_col2’ and ‘target’ loaded. By passing scope=’my_col’ we can select both ‘my_col1’ and ‘my_col2, but not ‘target’.

In addition to the 4 different ways scopes can be used enumerated above, we can also compose any combination by passing a list of scopes. For example:

cols = data.get_cols(scope=['1', '2'])

Returns columns ‘1’ and ‘2’. We can also combine across methods. E.g.,

cols = data.get_cols(scope=['1', 'category', 'custom', 'non input'])

In this example, we are requesting the union (NOT the overlap) of column ‘1’, any category columns, any columns with the scope ‘custom’ and any ‘non input’ columns.

Scopes can also be associated 1:1 with their corresponding base ModelPipeline objects (except for the ProblemSpec scope). One useful function designed specifically for objects with Scope is the Duplicate Input Wrapper, which allows us to conveniently replicate pipeline objects across a number of scopes. This functionality is especially useful with Transformer objects, (though still usable with other pipeline pieces, though other pieces tend to work on each feature independency, ruining some of the benefit). For example consider a case where you would like to run a PCA transformer on different groups of variables separately, or say you wanted to use a categorical encoder on 15 different categorical variables. Rather then having to manually type out every combination or write a for loop, you can use Duplicate.

See Duplicate for more information on how to use this functionality.