loading#

load#

neurotools.loading.load(f, index_slice=None, dtype=None)#

Smart loading function for neuro-imaging type data.

Parameters
  • f (str, array or other) – The location / file path of the data to load, passed as a str. Alternatively, numpy arrays or objects from nibabel, etc., can be passed directly.

  • index_slice (slices, tuple of slices, or None, optional) –

    You may optionally pass index slicing here. The typical benefit of index slicing over masking in a later step is when the data to load will be read with nibabel (e.g., saved as .nii or .mgz), in which case an index_slice can be used to load only the data of interest.

    For example, let’s say the shape of the data on disk is 5 x 20, but you only need the first row of 20 values, which with traditional slicing would be [0]. With a mask you would have to load the full data from disk and then slice it, but if you pass slice(0) here, the same result is achieved while only the requested slice is loaded from disk.

    You must use the python built-in slice for complex slicing. In the example above you may pass either (0) or slice(0), but for something more complex, e.g., the equivalent of my_array[1:5:2, ::3], you should pass (slice(1, 5, 2), slice(None, None, 3)) here.

    default = None
    

  • dtype (str or None, optional) –

    The data type to cast the loaded data to. If left as None, the original data type is kept; otherwise pass the string name of a data type to cast to, for example ‘float32’ for 32 bit floating point precision.

    default = None
    

Returns

data – The requested data is returned as a numpy array, or, if passed None, then None is returned. The shape of the returned data is the shape of the data array portion of whatever is passed, with extra singular dimensions removed (i.e., np.squeeze is applied). If passed an index slice, only the portion of the data specified by the index slice is loaded and returned.

Note: The data type of the returned data will depend on how the data to be loaded is saved.

Return type

array or None

Examples

Data can be loaded by simply passing a file location to load. For example, we can load just an array of ones:

In [1]: from neurotools.loading import load

In [2]: data = load('data/ones.nii.gz')

In [3]: data.shape
Out[3]: (10,)

This function is likewise robust to being passed non-file paths, for example passing an already defined array to load:

In [4]: import numpy as np

In [5]: data = np.ones((5, 3, 1))

In [6]: load(data).shape
Out[6]: (5, 3)

Note here that load will also apply a squeeze function to any loaded data, getting rid of any unused (singular) dimensions.

Next, let’s look at an example where we pass a specific index slice.

In [7]: data = np.random.random((5, 3))

# Let's say we want to select columns 0 and 1
In [8]: data[:, 0:2]
Out[8]: 
array([[0.72985059, 0.00102751],
       [0.98263662, 0.34106399],
       [0.57454001, 0.23615256],
       [0.84292275, 0.74310495],
       [0.06495864, 0.60732639]])

# It's a bit more awkward but we pass:
In [9]: load(data, index_slice=(slice(None), slice(0, 2)))
Out[9]: 
array([[0.72985059, 0.00102751],
       [0.98263662, 0.34106399],
       [0.57454001, 0.23615256],
       [0.84292275, 0.74310495],
       [0.06495864, 0.60732639]])
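The tuple-of-slice form used for index_slice mirrors numpy bracket slicing exactly; the correspondence can be checked with plain numpy, independently of load:

```python
import numpy as np

arr = np.arange(30).reshape(6, 5)

# The tuple-of-slice form you would pass as index_slice...
index_slice = (slice(1, 5, 2), slice(None, None, 3))

# ...selects the same elements as the bracket syntax arr[1:5:2, ::3]
assert np.array_equal(arr[index_slice], arr[1:5:2, ::3])
```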

load_data#

neurotools.loading.load_data(subjects, template_path, contrast=None, mask=None, index_slice=None, zero_as_nan=False, nan_as_zero=False, dtype=None, n_jobs=1, verbose=1, _print=None, **cache_args)#

This method is designed to load data saved in a particular way, specifically where each subject / participant’s data is saved separately.

Parameters
  • subjects (array-like) – A list or array-like with the names of the subjects to load, where the names correspond in some way to how each subject’s data is saved. This correspondence is specified in template_path.

  • template_path (str or func) –

    A str indicating the template form for how a single subject’s data should be loaded, where SUBJECT will be replaced with that subject’s name, and optionally CONTRAST will be replaced with the contrast name.

    For example, to load subject X’s contrast Y saved under ‘some_loc/X_Y.nii.gz’ the template_path would be: ‘some_loc/SUBJECT_CONTRAST.nii.gz’.

    You may alternatively pass template_path as a python function, with the first argument accepting the subject and an optional second argument accepting the contrast. In this case, the function should return the correct path for that subject / contrast pair when given one or both of these arguments.

    Note: The use of CONTRAST within the template path is optional, but this parameter is not.
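Both forms of template_path can be sketched in plain python (using the illustrative ‘some_loc’ path from above):

```python
# String form: SUBJECT and CONTRAST are replaced literally
template = 'some_loc/SUBJECT_CONTRAST.nii.gz'
path = template.replace('SUBJECT', 'X').replace('CONTRAST', 'Y')
assert path == 'some_loc/X_Y.nii.gz'

# Function form: first argument subject, optional second argument contrast
def template_func(subject, contrast=None):
    return f'some_loc/{subject}_{contrast}.nii.gz'

assert template_func('X', 'Y') == path
```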

  • contrast (str or None, optional) –

    The name of the contrast, used along with the template path to define where to load data. If passed None, then it is assumed that CONTRAST is not present in the template_path and will be ignored.

    default = None
    

  • mask (str, array or None, optional) –

    After data is loaded, it can optionally be masked according to a specific array.

    If None, and the data to be loaded is multi-dimensional, then the data will be flattened by default, e.g., if each data point originally has shape 10 x 2, the flattened shape will be 20.

    If passed a str, it is assumed to be the file location of a mask to load, where the shape of the mask should match the data.

    When passing a mask, either by location or by array directly, values set to either 1 or True indicate the value should be kept, whereas values of 0 or False indicate the values at that location should be discarded.

    If a mask is used, see the function funcs.reverse_mask_data() for reversing masked data (e.g., an eventual statistical output) back into its pre-masked state.

    default = None
    

  • index_slice (slices, tuple of slices, or None, optional) –

    You may optionally pass index slicing here. The typical benefit of index slicing over masking is when the data to load will be read with nibabel (e.g., saved as .nii or .mgz), in which case an index_slice can be used to load only the data of interest.

    For example, let’s say the shape of the data on disk is 5 x 20, but you only need the first row of 20 values, which with traditional slicing would be [0]. With a mask you would have to load the full data from disk and then slice it, but if you pass slice(0) here, the same result is achieved while only the requested slice is loaded from disk.

    You must use the python built-in slice for complex slicing. In the example above you may pass either (0) or slice(0), but for something more complex, e.g., the equivalent of my_array[1:5:2, ::3], you should pass (slice(1, 5, 2), slice(None, None, 3)) here.

    If passed None (the default), this option is ignored and the full available data is loaded.

    default = None
    

  • zero_as_nan (bool, optional) –

    Often in neuroimaging data, NaN values are encoded as 0’s (I’m not sure why either). If this flag is set to True, then any 0’s found in the loaded data will be replaced with NaN.

    default = False
    

  • nan_as_zero (bool, optional) –

    As an alternative to zero_as_nan, any NaNs found can instead be set to zero. Note, if this is True then zero_as_nan cannot also be True.

    default = False
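The two flags correspond to simple element-wise replacements; a numpy sketch of the intended behavior (not the library's internals):

```python
import numpy as np

data = np.array([0., 1., np.nan])

# zero_as_nan=True: any 0's become NaN
zeros_to_nan = np.where(data == 0, np.nan, data)
assert np.isnan(zeros_to_nan[0])

# nan_as_zero=True: any NaNs become 0
nans_to_zero = np.nan_to_num(data, nan=0.0)
assert nans_to_zero[2] == 0.0
```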
    

  • dtype (str or None, optional) –

    The data type to cast the loaded data to. If left as None, the original data type is kept; otherwise pass the string name of a data type to cast to, for example ‘float32’ for 32 bit floating point precision.

    default = None
    

  • n_jobs (int, optional) –

    The number of threads to use when loading data.

    Note: Often n_jobs refers to the number of separate processes, but in this case it refers to threads, which are not limited by the number of cores available.

    default = 1
    

  • verbose (int, optional) –

    This parameter controls the verbosity of this function.

    • If -1, then no message at all will be printed.

    • If 0, only warnings will be printed.

    • If 1, general status updates will be printed.

    • If >= 2, full verbosity will be enabled.

    default = 1
    

  • cache_args (keyword arguments) –

    There are a number of optional cache arguments that can be set via kwargs, as listed below.

    • cache_dr : str or None

      The location of where to cache the results of this function, for faster loading in the future.

      If None, do not cache. The default if not set is ‘default’, which will cache in a location defined by the function name, in a folder called neurotools_cache in the user’s home directory.

    • cache_max_sz : str or int

      This parameter defines the maximum size of the cache directory. If saving a new cached function call would exceed this max size, previously saved caches (oldest first, by last use) are deleted, ensuring the cache directory remains under this size.

      Can either pass the size in bytes directly as a number, or as a str with a byte marker, e.g., ‘14G’ for 14 gigabytes, or ‘10 KB’ for 10 kilobytes.

      The default if not set is ‘30G’.

    • use_base_name : bool

      Controls how file path arguments are keyed in the cache: by their full file path if use_base_name is False, or by just the file name itself if True, e.g., /some/path/location vs. just location. The default if not set is True, which assumes that a file with the same name in another location is the same file.
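As a rough illustration of the accepted cache_max_sz formats, a hypothetical parser (not the library's actual implementation) might look like:

```python
import re

def parse_size(sz):
    """Turn a raw byte count or a marker string like '14G' / '10 KB' into bytes."""
    if isinstance(sz, (int, float)):
        return int(sz)
    m = re.match(r'\s*([\d.]+)\s*([KMGT]?)B?\s*$', sz.upper())
    units = {'': 1, 'K': 1024, 'M': 1024 ** 2, 'G': 1024 ** 3, 'T': 1024 ** 4}
    return int(float(m.group(1)) * units[m.group(2)])

# '30G' matches the cache_max_sz value printed in the example output below
assert parse_size('30G') == 32212254720
assert parse_size('10 KB') == 10 * 1024
```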

Returns

data – Loaded data across all specified subjects is returned as a 2D numpy array (shape = subjects x data), where the first dimension is subjects and the second is a flat, single dimension representation of each subject’s data.

Return type

array

Examples

Consider the following simplified example where we are interested in loading and concatenating the data from three subjects: ‘subj1’, ‘subj2’ and ‘subj3’.

In [1]: import os

In [2]: from neurotools.loading import load, load_data

# Data are saved in separate folders
In [3]: subjs = os.listdir('data/ex1')

In [4]: subjs
Out[4]: ['subj3', 'subj1', 'subj2']

# Within each subject's directory is a single data file
# E.g., loading the first subjects data
In [5]: load('data/ex1/subj1/data.nii.gz')
Out[5]: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# Now we can load all subjects with the load_data method
In [6]: all_data = load_data(subjects=subjs,
   ...:                      template_path='data/ex1/SUBJECT/data.nii.gz')
   ...: 
No existing cache found, loading directly
Saving loaded data to cache_loc: /home/runner/neurotools_cache/_base_load_data/819d62c6a327bffd76a4294b40916b17
Current saved size is: 368 / 32212254720 (cache_max_sz)
Loaded data with shape: (3, 10) in 0.0044171810150146484 seconds.

In [7]: all_data.shape
Out[7]: (3, 10)

Additional arguments can of course be added allowing for more flexibility. Consider an additional case below:

# We can add a mask, keeping only
# the first and last data point
In [8]: import numpy as np

In [9]: mask = np.zeros(10)

In [10]: mask[0], mask[9] = 1, 1

In [11]: all_data = load_data(subjects=subjs,
    ....:                      template_path='data/ex1/SUBJECT/data.nii.gz',
    ....:                      mask=mask)
    ....: 
No existing cache found, loading directly
Saving loaded data to cache_loc: /home/runner/neurotools_cache/_base_load_data/7cec15fac7eec08a38d4a389b22ca11c
Current saved size is: 544 / 32212254720 (cache_max_sz)
Loaded data with shape: (3, 2) in 0.00432586669921875 seconds.
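Conceptually, the masking step above amounts to boolean indexing of each subject's flattened data (a plain numpy sketch of the idea, not the library internals):

```python
import numpy as np

all_data = np.ones((3, 10))        # 3 subjects x 10 data points each
mask = np.zeros(10)
mask[0], mask[9] = 1, 1            # keep only the first and last point

masked = all_data[:, mask.astype(bool)]
assert masked.shape == (3, 2)      # matches the (3, 2) shape in the log above
```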

reverse_mask_data#

neurotools.loading.reverse_mask_data(data, mask)#

This function is used to reverse the application of a mask to a data array.

Parameters
  • data (numpy.ndarray or list of) – Some data to transform back to its original, pre-masking shape, either a single dimensional array of data or a 2D array with shape = subjs x data. Instead of a 2D array, a list of single arrays can be passed as well.

  • mask (loc or numpy.ndarray) – The location of, or an array mask (1/True = keep, 0/False = discard) whose application should be reversed. If the passed mask is either a Nifti1Image with an affine, or saved as a nifti, then as part of the reverse mask process the data will be converted into a Nifti1Image.

Returns

reversed_data – In the case that the mask is passed in a format that provides an affine, data will be returned as a Nifti1Image. If instead the mask is passed as an array, or as the location of a saved data array without affine information, then the data will be returned as a numpy array.

If either a list of data arrays is passed, or data is passed as a 2D array with shape = subjs x data, then the returned reversed data will be a list with each element corresponding to each passed subject’s reversed data.

Return type

numpy.ndarray, Nifti1Image or list of

Examples

Consider the following brief example:

In [1]: import numpy as np

In [2]: from neurotools.loading import reverse_mask_data

In [3]: data = np.random.random((4, 2))

In [4]: mask = np.array([1, 0, 0, 1])

In [5]: reversed_data = reverse_mask_data(data, mask)

In [6]: reversed_data
Out[6]: 
[array([0.36550292, 0.        , 0.        , 0.48926144]),
 array([0.06822381, 0.        , 0.        , 0.56173037]),
 array([0.86999783, 0.        , 0.        , 0.01415011]),
 array([0.78179444, 0.        , 0.        , 0.90679894])]

In [7]: reversed_data[0].shape
Out[7]: (4,)

Or in the case where the mask is a Nifti1Image:

In [8]: import nibabel as nib

# Convert mask to nifti
In [9]: mask = nib.Nifti1Image(mask, affine=np.eye(4))

In [10]: reversed_data = reverse_mask_data(data, mask)

In [11]: reversed_data
Out[11]: 
[<nibabel.nifti1.Nifti1Image at 0x7fde79325220>,
 <nibabel.nifti1.Nifti1Image at 0x7fde79325580>,
 <nibabel.nifti1.Nifti1Image at 0x7fde79325fa0>,
 <nibabel.nifti1.Nifti1Image at 0x7fde79337070>]

# Closer look
In [12]: reversed_data[0].get_fdata()
Out[12]: array([0.36550292, 0.        , 0.        , 0.48926144])

In [13]: reversed_data[0].affine
Out[13]: 
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
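At its core, reversing a mask scatters the kept values back to their original positions, filling the rest with zeros (a plain numpy sketch of the idea, not the library's code):

```python
import numpy as np

mask = np.array([1, 0, 0, 1], dtype=bool)
masked_vals = np.array([0.5, 0.7])       # the two values that were kept

reversed_arr = np.zeros(mask.shape)
reversed_arr[mask] = masked_vals         # scatter back to kept positions

assert np.array_equal(reversed_arr, np.array([0.5, 0., 0., 0.7]))
```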

get_overlap_subjects#

neurotools.loading.get_overlap_subjects(subjs, template_path=None, contrast=None, data_df=None, verbose=1, _print=None)#
Helper function to be used when working with template_path style saved data, in order to compute an overlapping set of subjects between a dataframe and a template path, or alternatively between two dataframes.

Parameters
  • subjs (list-like or pandas.DataFrame) – Either a list-like, or a pandas.DataFrame that is indexed by the names of the subjects to overlap, where the names correspond in some way to how each subject’s data is saved.

  • template_path (str, list of str, optional) –

    A str indicating the template form for how a single subject’s data should be loaded (or in this case located), where SUBJECT will be replaced with that subject’s name, and optionally CONTRAST will be replaced with the contrast name.

    For example, to load subject X’s contrast Y saved under ‘some_loc/X_Y.nii.gz’ the template_path would be: ‘some_loc/SUBJECT_CONTRAST.nii.gz’.

    Note that the use of a CONTRAST argument is optional.

    You can also pass a list of template_path’s to overlap.

    Note that if this parameter is passed, then data_df will be ignored!

    default = None

  • contrast (str, optional) –

    The name of the contrast, used along with the template path to define where to load data.

    If passed None, then it is assumed that CONTRAST is not present in the template_path and will be ignored.

    This parameter is used only with the template_path option; if data_df is passed instead, it is ignored.

    default = None

  • data_df (pandas.DataFrame or None, optional) –

    Optionally specify a pandas.DataFrame, indexed by subject, to overlap with subjs, INSTEAD of a template_path and / or contrast set of arguments. Explicitly, if specifying a data_df, a template_path should not be passed!

    default = None

  • verbose (int, optional) –

    This parameter controls the verbosity of this function.

    • If -1, then no message at all will be printed.

    • If 0, only warnings will be printed.

    • If 1, general status updates will be printed.

    • If >= 2, full verbosity will be enabled.

    default = 1
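The overlap computation itself boils down to keeping the subjects whose templated path exists (a sketch with hypothetical file names, not the library's code):

```python
subjs = ['subj1', 'subj2', 'subj3']

# Hypothetical set of files actually present on disk
present = {'data/subj1.nii.gz', 'data/subj3.nii.gz'}

template = 'data/SUBJECT.nii.gz'
overlap = [s for s in subjs if template.replace('SUBJECT', s) in present]

assert overlap == ['subj1', 'subj3']
```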

from_data.get_surf_loc#

neurotools.loading.from_data.get_surf_loc(space, hemi, key)#

Find the saved surface file based on space, hemi and key, with tolerance to different naming schemes.

Parameters
  • space (str) – The space of the data, where space refers to a valid surface space, e.g., ‘fsaverage’.

  • hemi (str) – The hemisphere to find, can pass as ‘lh’ / ‘rh’ or some alternate formats, e.g., ‘left’ / ‘right’.

  • key (str or list of str) – The identifying key of the surface to load. If passing a list of keys, will try to find the correct surface in the order of the passed list, e.g., you should pass the most specific option first, then potentially more general options.

Returns

path – The path of the saved requested surface file, or None if not found.

Return type

str or None

abcd.load_from_csv#

neurotools.loading.abcd.load_from_csv(cols, csv_loc, eventname='baseline_year_1_arm_1', drop_nan=False, encode_cat_as='ordinal', verbose=0, **cache_args)#

Special ABCD Study specific helper utility to load specific columns from a csv saved version of the DEAP release RDS file, or a similar ABCD specific csv dataset.

Parameters
  • cols (str or list-like) –

    Either a single str with the column name to load, or a list / list-like with the names of multiple columns to load.

    If any variable passed is wrapped in ‘C(variable_name)’ that variable will be ordinalized (or whatever option is specified with the encode_cat_as option) and saved under the base variable name (i.e., with C() wrapper removed).

  • csv_loc (str) – The str location of the csv saved version of the DEAP release RDS file for the ABCD Study - or any other comma separated dataset with an eventname column.

  • eventname (str, list-like or None, optional) –

    The single eventname as a str, or multiple eventnames, to filter results by. If passed as None then all available data will be kept.

    If a single eventname is specified then the eventname column will be dropped, otherwise it will be kept.

    default = 'baseline_year_1_arm_1'
    

  • drop_nan (bool, optional) –

    If True, then drop any rows / subjects data with missing values in any of the requested columns.

    Note: Any values encoded as [‘777’, 999, ‘999’, 777] will be treated as NaN. This is a special ABCD specific consideration.

    default = False
    

  • encode_cat_as ({'ordinal', 'one hot', 'dummy'}, optional) –

    The way in which categorical vars, any wrapped in C(), should be categorically encoded.

    • ’ordinal’ :

      The variable is encoded sequentially in one column with the original name, with values 0 to k-1, where k is the number of unique categorical values. This method uses OrdinalEncoder.

    • ’one hot’ :

      The variable is one hot encoded, adding a column for each unique value. This method uses the function pandas.get_dummies().

    • ’dummy’ :

      Same as ‘one hot’, except one of the columns is then dropped.
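The ‘ordinal’ scheme maps a column's k unique values to the integers 0 through k-1; the same effect can be sketched with numpy alone:

```python
import numpy as np

col = np.array(['b', 'a', 'c', 'a'])     # a hypothetical categorical column
uniques, codes = np.unique(col, return_inverse=True)

assert list(uniques) == ['a', 'b', 'c']  # k = 3 unique values
assert list(codes) == [1, 0, 2, 0]       # each value replaced by 0..k-1
```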

  • cache_args (keyword arguments) –

    There are a number of optional cache arguments that can be set via kwargs, as listed below.

    • cache_dr : str or None

      The location of where to cache the results of this function, for faster loading in the future.

      If None, do not cache. The default if not set is ‘default’, which will cache in a location defined by the function name, in a folder called neurotools_cache in the user’s home directory.

    • cache_max_sz : str or int

      This parameter defines the maximum size of the cache directory. If saving a new cached function call would exceed this max size, previously saved caches (oldest first, by last use) are deleted, ensuring the cache directory remains under this size.

      Can either pass the size in bytes directly as a number, or as a str with a byte marker, e.g., ‘14G’ for 14 gigabytes, or ‘10 KB’ for 10 kilobytes.

      The default if not set is ‘30G’.

    • use_base_name : bool

      Controls how file path arguments are keyed in the cache: by their full file path if use_base_name is False, or by just the file name itself if True, e.g., /some/path/location vs. just location. The default if not set is True, which assumes that a file with the same name in another location is the same file.
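The use_base_name distinction is simply which form of the path gets used as the cache key (a sketch using os.path):

```python
import os

full_path = '/some/path/location'

# use_base_name=True keys the cache on just the file name...
assert os.path.basename(full_path) == 'location'

# ...while use_base_name=False keys it on the full path,
# so identically named files in different folders cache separately
assert full_path != os.path.basename(full_path)
```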

Returns

df – Will return a pandas.DataFrame as indexed by column src_subject_id within the original csv.

Return type

pandas.DataFrame

abcd.load_family_block_structure#

neurotools.loading.abcd.load_family_block_structure(csv_loc, subjects=None, eventname='baseline_year_1_arm_1', add_neg_ones=False, cache_dr='default', cache_max_sz='30G', verbose=0)#

This helper utility loads PALM-style exchangeability blocks for ABCD Study specific data, according to (for right now) a fixed set of rules:

  • Families of the same type can be shuffled (i.e., same number of members + of same status)

  • Siblings of the same type can be shuffled

  • Treat DZ as ordinary sibs (i.e., just treat MZ separately)

Parameters
  • csv_loc (str / file path) – The location of the csv saved version of the DEAP release RDS file for the ABCD Study. This can also just be any other csv as long as it has columns: ‘rel_family_id’, ‘rel_relationship’, ‘genetic_zygosity_status_1’

  • subjects (None or array-like, optional) –

    Can optionally specify that the block structure be created on a subset of subjects, though if any missing values are present in rel_relationship or rel_family_id within this subset, then those will be further dropped.

    If passed as non-null, this should be a valid array-like or pandas.Index style set of subjects.

    default = None
    

  • eventname (str, array-like or None, optional) –

    A single eventname as a str to filter data by.

    For now, this method only supports loading data at a single time point across subjects.

    default = 'baseline_year_1_arm_1'
    

  • add_neg_ones (bool, optional) –

    If True, add a left-most column with all negative ones representing that swaps should occur within group at the outermost level. Note that if using a permutation function through neurotools that accepts this style of blocks, this outer layer is assumed by default, so this parameter can be left as False.

    default = False
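The effect of add_neg_ones is just to prepend a left-most column of -1’s to the block structure; a numpy sketch with hypothetical block columns:

```python
import numpy as np

# Hypothetical block structure columns (e.g., family_type, rel_family_id)
blocks = np.array([[1, 10],
                   [1, 10],
                   [2, 11]])

with_neg_ones = np.column_stack([np.full(len(blocks), -1), blocks])

assert with_neg_ones.shape == (3, 3)
assert list(with_neg_ones[:, 0]) == [-1, -1, -1]
```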
    

Returns

block_structure – The loaded block structure as indexed by src_subject_id with columns: ‘neg_ones’, ‘family_type’, ‘rel_family_id’ and ‘rel_relationship’.

Any subjects with missing values in a key column have been dropped from this returned structure.

Return type

pandas.DataFrame

outliers.drop_top_x_outliers#

neurotools.loading.outliers.drop_top_x_outliers(data, subjects=None, top=50)#

Drop a fixed number of outliers from a set of loaded subjects data, based on the absolute mean of each subject’s data.

Parameters
  • data (numpy array) – Loaded 2D numpy array with shape # of subjects x data points, representing the loaded data in which to drop outliers from.

  • subjects (numpy array or None, optional) –

    A corresponding list or array-like of subjects with the same length as the first dimension of the passed data. If passed, a modified subject array will also be returned, representing the new subject list with dropped outliers excluded. If not passed, i.e., left as the default None, only the modified data is returned.

    default = None
    

  • top (int, optional) –

    The number of subjects to drop.

    default = 50
    

Returns

  • data (numpy array) – The loaded 2D data, but with the top subjects’ data removed.

  • subjects (numpy array) – If passed originally as None, only data will be returned, otherwise this will represent the new list of kept subjects, with top removed.
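The described criterion, dropping the subjects with the largest absolute mean, can be sketched with plain numpy (an illustration of the idea, not the library's code):

```python
import numpy as np

data = np.array([[0.0, 0.1],
                 [5.0, 5.0],   # clear outlier by absolute mean
                 [0.2, 0.0]])
subjects = np.array(['subj1', 'subj2', 'subj3'])
top = 1

# Rank subjects by the absolute mean of their data, drop the `top` largest
scores = np.mean(np.abs(data), axis=1)
keep = np.sort(np.argsort(scores)[:-top])

data_kept, subjects_kept = data[keep], subjects[keep]
assert list(subjects_kept) == ['subj1', 'subj3']
```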

parcels.load_32k_fs_LR_concat#

neurotools.loading.parcels.load_32k_fs_LR_concat(parcel_name)#

Dedicated loader function for saved parcels as generated in the parc_scaling project. These parcellations are all in space 32k_fs_LR, and are left / right hemisphere concatenated. The parcel will be downloaded to the parcels directory in the default data dr.

Parameters

parcel_name (str) – The name of the parcel to load, see https://github.com/sahahn/parc_scaling/tree/main/parcels for valid options.

Returns

parcel – The loaded concat LR parcellation as a numpy array is returned.

Return type

numpy array

space.get_space_options#

neurotools.loading.space.get_space_options()#

Simple utility designed to return the available spaces based on the downloaded data.

Returns

spaces – A list of the currently downloaded / available spaces.

Return type

list