loading#
load#
- neurotools.loading.load(f, index_slice=None, dtype=None)#
Smart loading function for neuro-imaging type data.
- Parameters
f (str, array or other) – The location / file path of the data to load, passed as a str, or directly a numpy array or object from nibabel, etc.
index_slice (slices, tuple of slices, or None, optional) –
You may optionally pass index slicing here. The typical benefit of index slicing over masking in a later step is when the data to load will be read with nibabel (e.g., saved as .nii or .mgz), in which case an index_slice can be used to load in only the data of interest.
For example, say the data on disk has shape 5 x 20, but you only need the first row of 20 values, i.e., data[0] with traditional slicing. With a mask you would have to load the full data from disk and then slice it, but if you pass slice(0) here, the same thing is accomplished while loading only the requested slice from disk.
You must use the python keyword slice for complex slicing. In the example above you may pass either 0 or slice(0), but for something more complex, e.g., the equivalent of my_array[1:5:2, ::3], you should pass (slice(1, 5, 2), slice(None, None, 3)) here.
default = None
dtype (str or None, optional) –
The data type to cast the loaded data to. If left as None, the original data type is kept; otherwise pass the string name of the data type to cast to, for example 'float32' for 32-bit floating point precision.
default = None
- Returns
data – The requested data is returned as a numpy array, or if passed None, then None will be returned. The shape of the returned data will be the shape of the data array portion of whatever is passed, with extra singular dimensions removed (i.e., np.squeeze is applied). If passed an index slice, only the portion of the data specified by the index slice will be loaded and returned.
Note: The data type of the returned data will depend on how the data to be loaded was saved.
- Return type
numpy.ndarray or None
Examples
Data can be loaded by simply passing a file location to load. For example, we can load just an array of ones:

In [1]: from neurotools.loading import load

In [2]: data = load('data/ones.nii.gz')

In [3]: data.shape
Out[3]: (10,)
This function is likewise robust to being passed non-file paths, for example passing an already defined array to load:

In [4]: import numpy as np

In [5]: data = np.ones((5, 3, 1))

In [6]: load(data).shape
Out[6]: (5, 3)
Note here that load also applies a squeeze function to any loaded data, getting rid of any extra singular dimensions.
Next, let’s look at an example where we pass a specific index slice.
In [7]: data = np.random.random((5, 3))

# Let's say we want to select columns 0 and 1
In [8]: data[:, 0:2]
Out[8]:
array([[0.72985059, 0.00102751],
       [0.98263662, 0.34106399],
       [0.57454001, 0.23615256],
       [0.84292275, 0.74310495],
       [0.06495864, 0.60732639]])

# It's a bit more awkward, but we pass:
In [9]: load(data, index_slice=(slice(None), slice(0, 2)))
Out[9]:
array([[0.72985059, 0.00102751],
       [0.98263662, 0.34106399],
       [0.57454001, 0.23615256],
       [0.84292275, 0.74310495],
       [0.06495864, 0.60732639]])
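The dtype parameter composes with any of the above. A minimal sketch, assuming the same 'data/ones.nii.gz' file from the first example:

data = load('data/ones.nii.gz', dtype='float32')  # cast to 32-bit floats on load
data.dtype  # dtype('float32')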
load_data#
- neurotools.loading.load_data(subjects, template_path, contrast=None, mask=None, index_slice=None, zero_as_nan=False, nan_as_zero=False, dtype=None, n_jobs=1, verbose=1, _print=None, **cache_args)#
This method is designed to load data saved in a particular way, specifically where each subject / participant's data is saved separately.
- Parameters
subjects (array-like) – A list or array-like with the names of the subjects to load, where the names correspond in some way to how each subject's data is saved. This correspondence is specified in template_path.
template_path (str or func) –
A str indicating the template form for how a single subject's data should be loaded, where SUBJECT will be replaced with that subject's name, and optionally CONTRAST will be replaced with the contrast name.
For example, to load subject X's contrast Y saved under 'some_loc/X_Y.nii.gz', the template_path would be: 'some_loc/SUBJECT_CONTRAST.nii.gz'.
You may alternatively pass template_path as a python function, whose first argument accepts the subject and whose optional second argument accepts the contrast. In this case, the function should return the correct path for that subject / contrast pair when given one or both of these arguments.
Note: The use of CONTRAST within the template path is optional, but this parameter is not.
contrast (str or None, optional) –
The name of the contrast, used along with the template path to define where to load data. If passed None, then it is assumed that CONTRAST is not present in the template_path and it will be ignored.
default = None
mask (str, array or None, optional) –
After data is loaded, it can optionally be masked according to a specific array.
If None, and the data to be loaded is multi-dimensional, then the data will be flattened by default; e.g., if each data point originally has shape 10 x 2, the flattened shape will be 20.
If passed a str, it is assumed to be the file location of a mask to load, where the shape of the mask should match the data.
When passing a mask, either by location or by array directly, values set to 1 or True indicate that the value should be kept, whereas values of 0 or False indicate that the values at that location should be discarded.
If a mask is used, see the function funcs.reverse_mask_data() for reversing masked data (e.g., an eventual statistical output) back into its pre-masked state.
default = None
index_slice (slices, tuple of slices, or None, optional) –
You may optionally pass index slicing here. The typical benefit of index slicing over masking is when the data to load will be read with nibabel (e.g., saved as .nii or .mgz), in which case an index_slice can be used to load in only the data of interest.
For example, say the data on disk has shape 5 x 20, but you only need the first row of 20 values, i.e., data[0] with traditional slicing. With a mask you would have to load the full data from disk and then slice it, but if you pass slice(0) here, the same thing is accomplished while loading only the requested slice from disk.
You must use the python keyword slice for complex slicing. In the example above you may pass either 0 or slice(0), but for something more complex, e.g., the equivalent of my_array[1:5:2, ::3], you should pass (slice(1, 5, 2), slice(None, None, 3)) here.
If passed None (the default), this option is ignored and the full available data is loaded.
default = None
zero_as_nan (bool, optional) –
Often in neuroimaging data, NaN values are encoded as 0’s (I’m not sure why either). If this flag is set to True, then any 0’s found in the loaded data will be replaced with NaN.
default = False
nan_as_zero (bool, optional) –
As an alternative to zero_as_nan, any NaNs found can instead be set to zero. Note: if this is True, then zero_as_nan cannot also be True.
default = False
dtype (str or None, optional) –
The data type to cast the loaded data to. If left as None, the original data type is kept; otherwise pass the string name of the data type to cast to, for example 'float32' for 32-bit floating point precision.
default = None
n_jobs (int, optional) –
The number of threads to use when loading data.
Note: Often n_jobs refers to the number of separate processors, but in this case it refers to threads, which are not limited by the number of cores available.
default = 1
verbose (int, optional) –
This parameter controls the verbosity of this function.
If -1, then no message at all will be printed.
If 0, only warnings will be printed.
If 1, general status updates will be printed.
If >= 2, full verbosity will be enabled.
default = 1
cache_args (keyword arguments) –
There are a number of optional cache arguments that can be set via kwargs, as listed below.
cache_dr : str or None
The location of where to cache the results of this function, for faster loading in the future.
If None, do not cache. The default if not set is 'default', which will use caching in a location defined by the function name, in a folder called neurotools_cache in the user's home directory.
cache_max_sz : str or int
This parameter defines the maximum size of the cache directory. The idea is that if saving a new cached function call would exceed this cache max size, previously saved caches (oldest first, in terms of last use) will be deleted, ensuring the cache directory remains under this size.
Can either pass in terms of bytes directly as a number, or in terms of a str w/ byte marker, e.g., ‘14G’ for 14 gigabytes, or ‘10 KB’ for 10 kilobytes.
The default if not set is ‘30G’.
use_base_name : bool
Controls how file path arguments are represented when caching: if use_base_name is False, by their full file path, and if True, by just the file name itself (e.g., /some/path/location vs. just location). The default if not set is True, which assumes that a file with the same name in a different location is the same file.
- Returns
data – The loaded data across all specified subjects is returned as a 2D numpy array (shape = subjects x data), where the first dimension is subjects and the second is a single-dimension representation of each subject's data.
- Return type
numpy.ndarray
Examples
Consider the following simplified example where we are interested in loading and concatenating the data from three subjects: 'subj1', 'subj2' and 'subj3'.
In [1]: import os

In [2]: from neurotools.loading import load, load_data

# Data are saved in separate folders
In [3]: subjs = os.listdir('data/ex1')

In [4]: subjs
Out[4]: ['subj3', 'subj1', 'subj2']

# Within each subject's directory is a single data file
# E.g., loading the first subject's data
In [5]: load('data/ex1/subj1/data.nii.gz')
Out[5]: array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

# Now we can load all subjects with the load_data method
In [6]: all_data = load_data(subjects=subjs,
   ...:                      template_path='data/ex1/SUBJECT/data.nii.gz')
   ...:
No existing cache found, loading directly
Saving loaded data to cache_loc: /home/runner/neurotools_cache/_base_load_data/819d62c6a327bffd76a4294b40916b17
Current saved size is: 368 / 32212254720 (cache_max_sz)
Loaded data with shape: (3, 10) in 0.0044171810150146484 seconds.

In [7]: all_data.shape
Out[7]: (3, 10)
Additional arguments can of course be added allowing for more flexibility. Consider an additional case below:
# We can add a mask, keeping only
# the first and last data point
In [8]: mask = np.zeros(10)

In [9]: mask[0], mask[9] = 1, 1

In [10]: all_data = load_data(subjects=subjs,
   ....:                      template_path='data/ex1/SUBJECT/data.nii.gz',
   ....:                      mask=mask)
   ....:
No existing cache found, loading directly
Saving loaded data to cache_loc: /home/runner/neurotools_cache/_base_load_data/7cec15fac7eec08a38d4a389b22ca11c
Current saved size is: 544 / 32212254720 (cache_max_sz)
Loaded data with shape: (3, 2) in 0.00432586669921875 seconds.
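The remaining options compose in the same call. A hypothetical sketch, reusing the subjects and template from above, with illustrative values for the other parameters:

all_data = load_data(subjects=subjs,
                     template_path='data/ex1/SUBJECT/data.nii.gz',
                     index_slice=slice(0, 5),  # read only the first 5 values from disk
                     dtype='float32',          # cast on load
                     n_jobs=2,                 # load with 2 threads
                     cache_dr=None)            # disable caching for this call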
reverse_mask_data#
- neurotools.loading.reverse_mask_data(data, mask)#
This function is used to reverse the application of a mask to a data array.
- Parameters
data (numpy.ndarray or list of) – Some data to transform back to its original pre-masking shape, passed as either a single-dimensional array of data, or a 2D array of data with shape = subjects x data. Instead of a 2D array, a list of single arrays may be passed as well.
mask (loc or numpy.ndarray) – The location of, or an array mask (1/True = keep, 0/False = discard) whose application is to be reversed. If the passed mask is either a Nifti1Image with an affine, or saved as a nifti, then as part of the reverse mask process the data will be converted into a Nifti1Image.
- Returns
reversed_data – In the case that the mask is passed in a format that provides an affine, the data will be returned in Nifti1Image format. If instead the mask is passed as an array, or the location of a saved data array without affine information, then the data will be returned as a numpy array.
If either a list of data arrays is passed, or data is passed as a 2D array with shape = subjects x data, then the returned reversed data will be a list, with each element corresponding to each passed subject's reversed data.
- Return type
numpy.ndarray, Nifti1Image or list of
Examples
Consider the following brief example:
In [1]: import numpy as np

In [2]: from neurotools.loading import reverse_mask_data

In [3]: data = np.random.random((4, 2))

In [4]: mask = np.array([1, 0, 0, 1])

In [5]: reversed_data = reverse_mask_data(data, mask)

In [6]: reversed_data
Out[6]:
[array([0.36550292, 0.        , 0.        , 0.48926144]),
 array([0.06822381, 0.        , 0.        , 0.56173037]),
 array([0.86999783, 0.        , 0.        , 0.01415011]),
 array([0.78179444, 0.        , 0.        , 0.90679894])]

In [7]: reversed_data[0].shape
Out[7]: (4,)
Or in the case where the mask is a Nifti1Image:

In [8]: import nibabel as nib

# Convert mask to nifti
In [9]: mask = nib.Nifti1Image(mask, affine=np.eye(4))

In [10]: reversed_data = reverse_mask_data(data, mask)

In [11]: reversed_data
Out[11]:
[<nibabel.nifti1.Nifti1Image at 0x7fde79325220>,
 <nibabel.nifti1.Nifti1Image at 0x7fde79325580>,
 <nibabel.nifti1.Nifti1Image at 0x7fde79325fa0>,
 <nibabel.nifti1.Nifti1Image at 0x7fde79337070>]

# Closer look
In [12]: reversed_data[0].get_fdata()
Out[12]: array([0.36550292, 0.        , 0.        , 0.48926144])

In [13]: reversed_data[0].affine
Out[13]:
array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
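Since the reversed data here are ordinary Nifti1Image objects, they can be written to disk with nibabel. A minimal sketch, with an illustrative output filename:

# Save the first subject's reversed image (filename is illustrative)
nib.save(reversed_data[0], 'subj0_reversed.nii.gz')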
get_overlap_subjects#
- neurotools.loading.get_overlap_subjects(subjs, template_path=None, contrast=None, data_df=None, verbose=1, _print=None)#
- Helper function to be used when working with template_path style saved data, in order to compute the overlapping set of subjects between a dataframe and a template path, or alternatively between two dataframes.
- subjs : list-like or pandas.DataFrame
Either a list-like, or a pandas.DataFrame indexed by the names of the subjects to overlap, where the names correspond in some way to how each subject's data is saved.
- template_path : str, list of str, optional
A str indicating the template form for how a single subject's data should be loaded (or in this case located), where SUBJECT will be replaced with that subject's name, and optionally CONTRAST will be replaced with the contrast name.
For example, to load subject X's contrast Y saved under 'some_loc/X_Y.nii.gz', the template_path would be: 'some_loc/SUBJECT_CONTRAST.nii.gz'.
Note that the use of a CONTRAST argument is optional.
You can also pass a list of template_paths to overlap across.
Note that if this parameter is passed, then data_df will be ignored!
default = None
- contrast : str, optional
The name of the contrast, used along with the template path to define where to load data.
If passed None, then it is assumed that CONTRAST is not present in the template_path and it will be ignored.
Note that if this parameter is passed, then data_df will be ignored! This parameter is used only with the template_path option.
default = None
- data_df : pandas.DataFrame or None, optional
Optionally specify a pandas.DataFrame, indexed by subject, to overlap with the subjs dataframe, INSTEAD of a template_path and / or contrast set of arguments. Explicitly, if specifying a data_df, a template_path should not be passed!
default = None
- verbose : int, optional
This parameter controls the verbosity of this function.
If -1, then no message at all will be printed. If 0, only warnings will be printed. If 1, general status updates will be printed. If >= 2, full verbosity will be enabled.
default = 1
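A hypothetical usage sketch (the DataFrame df and the template path below are assumptions for illustration):

from neurotools.loading import get_overlap_subjects

# df: a pandas DataFrame indexed by subject name (assumed)
overlap = get_overlap_subjects(subjs=df,
                               template_path='data/ex1/SUBJECT/data.nii.gz')

# overlap should hold the subjects present both in df's index
# and as saved files under the template path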
from_data.get_surf_loc#
- neurotools.loading.from_data.get_surf_loc(space, hemi, key)#
Find the saved surface file based on space, hemi and key, with tolerance to different naming schemes.
- Parameters
space (str) – The space of the data, where space refers to a valid surface space, e.g., ‘fsaverage’.
hemi (str) – The hemisphere to find, can pass as ‘lh’ / ‘rh’ or some alternate formats, e.g., ‘left’ / ‘right’.
key (str or list of str) – The identifying key of the surface to load. If passing a list of keys, will try to find the correct surface in the order of the passed list, e.g., you should pass the most specific option first, then potentially more general options.
- Returns
path – The path of the saved requested surface file, or None if not found.
- Return type
str or None
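A hypothetical usage sketch (the key names 'pial' and 'inflated' are assumed examples):

from neurotools.loading.from_data import get_surf_loc

# Try the more specific key first, then a fallback
path = get_surf_loc(space='fsaverage', hemi='left', key=['pial', 'inflated'])

# path is the location of the matching surface file, or None if not found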
abcd.load_from_csv#
- neurotools.loading.abcd.load_from_csv(cols, csv_loc, eventname='baseline_year_1_arm_1', drop_nan=False, encode_cat_as='ordinal', verbose=0, **cache_args)#
Special ABCD Study specific helper utility to load specific columns from a csv saved version of the DEAP release RDS file, or a similar ABCD-specific csv dataset.
- Parameters
cols (str or list-like) –
Either a single str with the column name to load, or a list / list-like with the names of multiple columns to load.
If any variable passed is wrapped in 'C(variable_name)', that variable will be ordinalized (or encoded with whatever option is specified by encode_cat_as) and saved under the base variable name (i.e., with the C() wrapper removed).
csv_loc (str) – The str location of the csv saved version of the DEAP release RDS file for the ABCD Study - or any other comma separated dataset with an eventname column.
eventname (str, list-like or None, optional) –
The single eventname as a str, or multiple eventnames, to filter results by. If passed as None, then all available data will be kept.
If a single eventname is specified then the eventname column will be dropped, otherwise it will be kept.
default = 'baseline_year_1_arm_1'
drop_nan (bool, optional) –
If True, then drop any rows / subjects data with missing values in any of the requested columns.
Note: Any values encoded as [‘777’, 999, ‘999’, 777] will be treated as NaN. This is a special ABCD specific consideration.
default = False
encode_cat_as ({'ordinal', 'one hot', 'dummy'}, optional) –
The way in which categorical variables, i.e., any wrapped in C(), should be categorically encoded.
'ordinal' :
The variable is encoded sequentially in one column with the original name, with values 0 to k-1, where k is the number of unique categorical values. This method uses OrdinalEncoder.
'one hot' :
The variable is one hot encoded, adding a column for each unique value. This method uses the function pandas.get_dummies().
'dummy' :
Same as 'one hot', except one of the columns is then dropped.
cache_args (keyword arguments) –
There are a number of optional cache arguments that can be set via kwargs, as listed below.
cache_dr : str or None
The location of where to cache the results of this function, for faster loading in the future.
If None, do not cache. The default if not set is 'default', which will use caching in a location defined by the function name, in a folder called neurotools_cache in the user's home directory.
cache_max_sz : str or int
This parameter defines the maximum size of the cache directory. The idea is that if saving a new cached function call would exceed this cache max size, previously saved caches (oldest first, in terms of last use) will be deleted, ensuring the cache directory remains under this size.
Can either pass in terms of bytes directly as a number, or in terms of a str w/ byte marker, e.g., ‘14G’ for 14 gigabytes, or ‘10 KB’ for 10 kilobytes.
The default if not set is ‘30G’.
use_base_name : bool
Controls how file path arguments are represented when caching: if use_base_name is False, by their full file path, and if True, by just the file name itself (e.g., /some/path/location vs. just location). The default if not set is True, which assumes that a file with the same name in a different location is the same file.
- Returns
df – A pandas.DataFrame is returned, indexed by the column src_subject_id from the original csv.
- Return type
pandas.DataFrame
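A hypothetical usage sketch (the csv location and column names below are assumptions for illustration):

from neurotools.loading.abcd import load_from_csv

df = load_from_csv(cols=['C(sex)', 'interview_age'],
                   csv_loc='data/nda_rds.csv',
                   eventname='baseline_year_1_arm_1',
                   drop_nan=True)

# df is indexed by src_subject_id; 'sex' is ordinally encoded
# under its base name, per the C() wrapper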
abcd.load_family_block_structure#
- neurotools.loading.abcd.load_family_block_structure(csv_loc, subjects=None, eventname='baseline_year_1_arm_1', add_neg_ones=False, cache_dr='default', cache_max_sz='30G', verbose=0)#
This helper utility loads PALM-style exchangeability blocks for ABCD Study specific data, according to a currently fixed set of rules:
Families of the same type can be shuffled (i.e., same number of members + of same status)
Siblings of the same type can be shuffled
Treat DZ as ordinary siblings (i.e., only treat MZ separately)
- Parameters
csv_loc (str / file path) – The location of the csv saved version of the DEAP release RDS file for the ABCD Study. This can also just be any other csv as long as it has columns: ‘rel_family_id’, ‘rel_relationship’, ‘genetic_zygosity_status_1’
subjects (None or array-like, optional) –
Can optionally specify that the block structure be created on a subset of subjects, though if any missing values are present in rel_relationship or rel_family_id within this subset, then those will be further dropped.
If passed as non-null, this should be a valid array-like or pandas.Index style set of subjects.
default = None
eventname (str, array-like or None, optional) –
A single eventname, as a str, to filter data by.
For now, this method only supports loading data at a single time point across subjects.
default = 'baseline_year_1_arm_1'
add_neg_ones (bool, optional) –
If True, add a left-most column with all negative ones representing that swaps should occur within group at the outermost level. Note that if using a permutation function through neurotools that accepts this style of blocks, this outer layer is assumed by default, so this parameter can be left as False.
default = False
- Returns
block_structure – The loaded block structure as indexed by src_subject_id with columns: ‘neg_ones’, ‘family_type’, ‘rel_family_id’ and ‘rel_relationship’.
Any subjects with missing values in a key column have been dropped from this returned structure.
- Return type
pandas.DataFrame
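A hypothetical usage sketch (the csv location below is an assumption for illustration):

from neurotools.loading.abcd import load_family_block_structure

blocks = load_family_block_structure(csv_loc='data/nda_rds.csv',
                                     subjects=None,  # or restrict to a subset
                                     eventname='baseline_year_1_arm_1')

# blocks is indexed by src_subject_id, with columns
# 'neg_ones', 'family_type', 'rel_family_id' and 'rel_relationship'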
outliers.drop_top_x_outliers#
- neurotools.loading.outliers.drop_top_x_outliers(data, subjects=None, top=50)#
Drop a fixed number of outliers from a set of loaded subjects data, based on the absolute mean of each subject’s data.
- Parameters
data (numpy array) – Loaded 2D numpy array with shape # of subjects x data points, representing the loaded data from which to drop outliers.
subjects (numpy array or None, optional) –
A corresponding list or array-like of subjects, with the same length as the first dimension of the passed data. If passed, a modified subject array will also be returned, representing the new subject list with dropped outliers excluded. If not passed, i.e., left as the default None, only the modified data will be returned.
default = None
top (int, optional) –
The number of subjects to drop.
default = 50
- Returns
data (numpy array) – The loaded 2D data, but with the top subjects' data removed.
subjects (numpy array) – If subjects was originally passed as None, only data will be returned; otherwise this will represent the new list of kept subjects, with the top outliers removed.
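A minimal sketch with random data:

import numpy as np
from neurotools.loading.outliers import drop_top_x_outliers

data = np.random.random((10, 5))                       # 10 subjects x 5 data points
subjects = np.array(['subj' + str(i) for i in range(10)])

# Drop the 2 subjects whose data has the largest absolute mean
data, subjects = drop_top_x_outliers(data, subjects=subjects, top=2)

data.shape     # (8, 5)
len(subjects)  # 8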
parcels.load_32k_fs_LR_concat#
- neurotools.loading.parcels.load_32k_fs_LR_concat(parcel_name)#
Dedicated loader function for saved parcels as generated in the parc_scaling project. These parcellations are all in the 32k_fs_LR space, and are left / right hemisphere concatenated. The parcel will be downloaded to the parcels directory under the default data dr.
- Parameters
parcel_name (str) – The name of the parcel to load, see https://github.com/sahahn/parc_scaling/tree/main/parcels for valid options.
- Returns
parcel – The loaded concatenated LR parcellation is returned as a numpy array.
- Return type
numpy array
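A hypothetical usage sketch (the parcel name below is an assumption; see the linked repository for the valid options):

from neurotools.loading.parcels import load_32k_fs_LR_concat

parcel = load_32k_fs_LR_concat('schaefer-200')  # hypothetical parcel name

# parcel is a 1D numpy array with one label per vertex across the
# concatenated left + right hemisphere 32k_fs_LR surfaces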