Subjects#
Various functions within BPt, and Dataset
can accept subjects or some variation on this
name as an argument. The idea of subjects is similar to Scope, where essentially subjects allows
for an expanded upon index’ing system but at the row-level, whereas Scope operates as the column level.
The parameter can accept a few different values. These are explained below:
1. array-like#
You may pass any array-like (e.g., list, set, pandas Index, etc…) of subjects directly. Warning: Passing a python tuple is reserved for a special MultiIndex case!
For example:
subjects = ['subj1', 'subj2', 'subj3']
Would select those three subjects, where the list could also be a numpy array or pandas index for example.
2. location#
You may pass the location of a text file were subject’s names are stored as one subject’s name per line. Names should be saved with python style types, e.g., quotes around str’s, but if they are not, it should in most cases still be able to figure out the correct type. For example if subjects.txt contained:
'subj1'
'subj2'
'subj3'
We could pass:
subjects = 'subjects.txt'
To select those three subjects.
3. reserved keyword#
A reserved key word may be passed. These include:
‘all’ Operate on all subjects
‘nan’ Select any subjects with any missing values in any of their loaded columns, regardless of scope or role.
‘not nan’ Like ‘nan’ but not. Or in english, any subjects without any missing values in any of their loaded columns, regardless of scope or role.
‘train’ Select the set of train subjects as defined by a split in the Dataset, e.g.,
set_train_split
.‘test’ Select the set of test subjects as defined by a split in the Dataset, e.g.,
set_test_split
.‘default’ This is the default subjects value for
ProblemSpec
, it refers to special behavior where when evaluating if the passed dataset has a train/test split defined, and a cv value is passed that isn’t ‘test’, then subjects = ‘train’ will be used. Otherwise, subjects=’all’ will be used.
1. value subset case#
You can pass the special input wrapper ValueSubset
. This can be
used to select subsets of subject by a column’s
value or values. See ValueSubset
for more
information on how this input class is used.
5. multi-index case There also exists the case where you may wish for the underlying index of subjects to be a MultiIndex. In this case, there is some extra functionality to discuss. Say for example we have a Dataset multi-indexed by subject and eventname, e.g.,
data.set_index(['subject', 'eventname'], inplace=True)
We now have more options for how we might want to index this dataset, and therefore more options for valid arguments to pass to a subjects argument. Consider first all of the examples from above, where we are just specifying a subject-like index. In this case, all of those arguments will still work, and will just return all subjects with all of their eventnames. E.g., assuming there were two eventname values for each subjects ‘e1’ and ‘e2’:
subjects = ['subj1', 'subj2']
Would select subject eventname pairs: (‘subj1’, ‘e1’), (‘subj1’, ‘e2’), (‘subj2’, ‘e1’), (‘subj2’, ‘e2’) and likewise with loading from a text file which just specified ‘subj1’ and ‘subj2’.
Note that if we pass arguments in this manner, BPt will assume they refer to whatever index is first, in this case ‘subject’, and not ‘eventname’. If we wish to also select explicitly by eventname, we have two options.
6. multi-index array-like#
You may pass fully indexed tuples in an array-like manner, the same as 1. from before, e.g.:
subjects = ('subj1', 'e1'), ('subj2', 'e2')
To just keep ‘subj1’ at event ‘e1’ and ‘subj2’ at ‘e2’. Likewise, we could select this same subset if subjects.txt was formatted as:
('subj1', 'e1')
('subj2', 'e2')
Our second option is to use the special tuple reserved input. In this case, we must pass a python tuple with the same length at the number of levels in the underlying MultiIndex, e.g., in the example before, of length two. Each index in the tuple will then be used to specify the BPt subjects compatible argument for just that level of the index. For example:
subjects = ('all', ['e1'])
Would select all subjects, and then note the array-like list in the second index of the tuple, would filter that to include only subject eventname pairs with an eventname of ‘e1’. Consider another example:
subjects = ('subjects.txt', 'events.txt')
In this case, the subjects to select would be loaded from ‘subjects.txt’ and the corresponding eventnames from ‘events.txt’.
7. name of column#
This option only works in the context of the function evaluate
, where you may
pass the name of a loaded column, and have the argument converted intenerally to a
special Compare
style object, CompareSubset
, where subsets of
of subjects would be defined for each unique value in the name of the column passed.
For example:
subjects='sex'
Where ‘sex’ is a loaded column in a dataset, and would then define a Compare
object
with seperate options for each unique value of ‘sex’.
Examples#
First let’s define an example dataset to show some examples with.
In [1]: import BPt as bp
In [2]: import numpy as np
In [3]: data = bp.Dataset()
In [4]: data['index'] = ['subj1', 'subj2', 'subj3']
In [5]: data['col1'] = [1, 2, np.nan]
In [6]: data.set_index('index', inplace=True)
In [7]: data
Out[7]:
col1
index
subj1 1.0
subj2 2.0
subj3 NaN
Next, we will use Dataset.get_subjects()
to explore what passing
different values to subjects will return.
In [8]: data.get_subjects(subjects=['subj1'])
Out[8]: Index(['subj1'], dtype='object', name='index')
In [9]: data.get_subjects(subjects=['subj1', 'subj2'])
Out[9]: Index(['subj1', 'subj2'], dtype='object', name='index')
One gotcha is that if we pass a single str value, it will be assumed to
be a file path, so if we pass subjects=’subj1’, we will get an error.
Let’s try using ValueSubset
next.
In [10]: data.get_subjects(bp.ValueSubset('col1', 1))
Out[10]: Index(['subj1'], dtype='object', name='index')
In [11]: data.get_subjects(bp.ValueSubset('col1', [1, 2]))
Out[11]: Index(['subj1', 'subj2'], dtype='object', name='index')
We can also use special reversed keywords.
In [12]: data.get_subjects('nan')
Out[12]: Index(['subj3'], dtype='object', name='index')
In [13]: data.get_subjects('not nan')
Out[13]: Index(['subj1', 'subj2'], dtype='object', name='index')