BPt.Dataset.add_data_files

Dataset.add_data_files(files, file_to_subject='auto', load_func=<function load>, inplace=False)

This method allows adding columns of type ‘data file’ to the Dataset class.

Parameters
files : dict
This argument specifies the files to be loaded as Data Files. Files must be passed as a python dict, where each key refers to the name of that feature / column of data files to load, and each value is either a list-like of str file paths, or a single globbing str which will be used to determine the files.
In addition to this parameter, you must also specify, via the file_to_subject param, how to convert from a passed file path to a subject name.
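For example, both forms are valid (the paths here are hypothetical):

files = dict()
files['feat1'] = ['f1/subj_0.npy', 'f1/subj_1.npy']  # explicit list of paths
files['feat2'] = 'f2/*.npy'  # single globbing str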
file_to_subject : python function, dict of functions, or 'auto', optional
This parameter represents how the subject name should be determined from the passed file paths. Any python function can be passed here, where the function accepts a full file path as its first argument and returns a subject name.
This parameter should be passed either as a single function to be used for all columns, or as a dictionary corresponding to the passed files dictionary, in the case that each column requires a different mapping from path to subject. If just one function is passed, it will be used to load all dictionary entries.
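For example, either form below is valid, where file_to_subject_func refers to a user-defined function like the one shown in the Examples section:

file_to_subject = file_to_subject_func
# or
file_to_subject = dict()
file_to_subject['feat1'] = file_to_subject_func
file_to_subject['feat2'] = file_to_subject_func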
You may also pass the custom str ‘auto’, which is the default, to specify that the subject should be automatically inferred. Note that the way in which this inference occurs is somewhat complex, and potentially a bit brittle, so make sure to check that all index names were inferred correctly. Also note that already having a column of reference index names loaded can help the method infer the correct index format, e.g., if your desired index names are ‘subj_1’, ‘subj_2’, then unless you already have an example column loaded, this method will likely just set the index as ‘1’ and ‘2’.
In the case that the underlying index is a MultiIndex, this function should be designed to return the subject in the correct tuple form. See Examples below.
default = 'auto'
load_func : python function, optional
Fundamentally, columns of type ‘data file’ represent a path to a saved file, which means you must also provide some information on how to load the saved file. This parameter is where that loading function should be passed. The passed load_func will be called on each file individually, and whatever the function outputs will be passed on to the different loading functions.
You might need to pass a user-defined custom function in some cases, e.g., if you want to use numpy.load(), but then also numpy.stack(). Just wrap those two functions in one, and pass the new function:

import numpy as np

def my_wrapper(x):
    return np.stack(np.load(x))
Note that in the case where a custom function is defined, it is recommended that you define this function in a separate file from where the main script will be run, and then import the function.
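For instance, a minimal sketch, where my_funcs.py is a hypothetical separate file containing my_wrapper as defined above, and files is a dict as described earlier:

# In the main script, import the custom loader
from my_funcs import my_wrapper

data = data.add_data_files(files=files, load_func=my_wrapper)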
By default, data files are assumed to be saved numpy arrays, and numpy.load() is used when nothing else is specified.
default = np.load
inplace : bool, optional

If True, perform the current operation in-place and return None.

default = False

See also

to_data_file

Cast existing columns to type Data File.

get_file_mapping

Returns the raw file mapping.

Examples

Consider the brief example below of loading two fake subjects with the files parameter.

files = dict()
files['feat1'] = ['f1/subj_0.npy', 'f1/subj_1.npy']
files['feat2'] = ['f2/subj_0.npy', 'f2/subj_1.npy']

This could be matched with file_to_subject as:

def file_to_subject_func(file):
    subject = file.split('/')[1].replace('.npy', '')
    return subject

file_to_subject = file_to_subject_func
# or
file_to_subject = dict()
file_to_subject['feat1'] = file_to_subject_func
file_to_subject['feat2'] = file_to_subject_func

In this example, subjects are loaded as ‘subj_0’ and ‘subj_1’, and they have associated loaded data files ‘feat1’ and ‘feat2’.
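These pieces could then be passed together as below (a sketch, assuming BPt is imported as bp and the listed file paths exist on disk):

data = bp.Dataset()
data = data.add_data_files(files=files, file_to_subject=file_to_subject)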

Next, we consider an example with fake data. In this example we will first generate and save some fake data files. These fake files will correspond to left hemisphere vertex files.

In [1]: import numpy as np

In [2]: import os

In [3]: dr = 'data/fake_surface/'

In [4]: os.makedirs(dr, exist_ok=True)

# 20 subjects each with 10,242 vertex values
In [5]: X = np.random.random(size=(20, 10242))

# Save the data as numpy arrays
In [6]: for x in range(len(X)):
   ...:     np.save(dr + str(x), X[x])
   ...: 

In [7]: os.listdir(dr)[:5]
Out[7]: ['17.npy', '15.npy', '14.npy', '0.npy', '19.npy']

Next, we will use add_data_files to add these to a Dataset (this assumes BPt has been imported as bp).

In [8]: data = bp.Dataset()

In [9]: files = dict()

In [10]: files['fake_surface'] = dr + '*' # Add * for file globbing

In [11]: data = data.add_data_files(files=files, file_to_subject='auto')

In [12]: data.head(5)
Out[12]: 
    fake_surface
17           0.0
15           1.0
14           2.0
0            3.0
19           4.0
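Note that the values shown under fake_surface are integer keys into the Dataset’s internal file mapping, not the loaded arrays themselves. A minimal sketch of inspecting that mapping, assuming get_file_mapping can be called with no arguments to return the full mapping:

# Maps each integer key to the corresponding file
file_mapping = data.get_file_mapping()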

Lastly, let’s also consider a MultiIndex example:

# The underlying dataset is indexed by subject and event
data.set_index(['subject', 'event'], inplace=True)

# Only one feature
files = dict()
files['feat1'] = ['f1/s0_e0.npy',
                  'f1/s0_e1.npy',
                  'f1/s1_e0.npy',
                  'f1/s1_e1.npy']

def file_to_subject_func(file):

    # This selects the substring
    # at the last part separated by the '/'
    # so e.g. the stub, 's0_e0.npy', 's0_e1.npy', etc...
    subj_split = file.split('/')[-1]

    # This removes the .npy from the end, so
    # stubs == 's0_e0', 's0_e1', etc...
    subj_split = subj_split.replace('.npy', '')

    # Set the subject name as the first part
    # and the eventname as the second part
    subj_name = subj_split.split('_')[0]
    event_name = subj_split.split('_')[1]

    # Lastly put it into the correct return style
    # This is tuple style e.g., ('s0', 'e0'), ('s0', 'e1')
    ind = (subj_name, event_name)

    return ind
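With this function defined, the files could then be added just as before (a sketch, assuming the listed paths exist):

data = data.add_data_files(files=files, file_to_subject=file_to_subject_func)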