Review load_data #107

Merged: 1 commit, Nov 16, 2023

Conversation

@JoeZiminski (Member) commented on Sep 13, 2023

This PR is to review the load_data code from spikewrap.

The PR system is a little strange, but I think it should (hopefully) work well. The aim of this system is to avoid having to review the entire codebase at once, or review since-reverted commits, without changing the dev / main branch git history. reviewed_code was created by branching from dev and removing all commits. rev_load_data was created by branching from dev and removing all of the code unrelated to loading data. Now, merging rev_load_data > reviewed_code should isolate these commits of interest. For new features, I will just add them from dev to here. Once the main bulk of the code is reviewed, we can revert to reviewing in the normal way. This is a very weird system :'D but it should work out OK and allow tailored commits to be reviewed.

This PR includes all the general package-related files (e.g. pyproject.toml) as well as:

- examples/load_data.py: an example of how to use the code to load data into spikewrap.
- pipeline/load_data.py: wrapper code to load the raw data into a class data structure. Under the hood, the data is loaded with spikeinterface. Note that all data objects in spikeinterface are 'lazy' and do not actually contain the data, so these objects can be easily stored in classes (see the sketch after this list).
- data_classes/base.py and data_classes/preprocessing.py: these data classes hold the raw data objects. At the moment they don't do anything with them, but they become more important when preprocessing is added.
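
For illustration, a minimal sketch of this lazy behaviour with spikeinterface (assuming SpikeGLX data; the folder path is hypothetical):

```python
import spikeinterface.extractors as se

# Reading the recording only parses metadata; no traces are loaded
# into memory, so the object is cheap to hold inside a data class.
recording = se.read_spikeglx("/path/to/spikeglx_run")  # hypothetical path

print(recording.get_sampling_frequency())

# Traces are only read from disk when explicitly requested:
traces = recording.get_traces(start_frame=0, end_frame=1000)
```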

Some notes:
1) The dataclass inheritance reduces boilerplate but is quite confusing; I am considering reverting to normal classes just for readability. The way it works is that the class attributes are declared at the start of the class body. When subclassing, only the class attributes new to the subclass are declared. Then, when instantiating the class, you must pass all required arguments (the superclass attributes first, then the subclass attributes).

Because this only works for attributes, any logic must be put in a __post_init__ method.
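
A minimal sketch of the pattern, with illustrative class and attribute names (not the actual spikewrap ones):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExampleBase:
    # Class attributes are declared at the top of the class body.
    base_path: Path
    sub_name: str

    def __post_init__(self):
        # The dataclass machinery only handles attributes, so any
        # logic (validation, conversion) must live in __post_init__.
        self.base_path = Path(self.base_path)


@dataclass
class ExampleSub(ExampleBase):
    # Only the attributes new to the subclass are declared here.
    pp_steps: dict

    def __post_init__(self):
        super().__post_init__()
        assert self.pp_steps, "pp_steps cannot be empty."


# Superclass arguments come first, then subclass arguments:
data = ExampleSub("/path/to/project", "sub-1119617", {"1": "bandpass_filter"})
```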

2) The idea is that the user will have data stored in SWCBlueprint style, which consists of subject, session and ephys levels; typically, within each session there are different runs. The user must indicate the sessions and runs to load in the sessions_and_runs dictionary (e.g. examples/load_data.py). The filesystem is then traversed based on these names to load the specified sessions and runs into the PreprocessingData class. This is a UserDict (basically a dictionary with additional functions) in which the loaded data objects are stored. Later (in a future PR) these objects are operated on during preprocessing.
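
A rough sketch of the intended usage, based on the description above (the load_data signature and the session/run names are assumptions; see examples/load_data.py for the real example):

```python
from pathlib import Path

from spikewrap.pipeline.load_data import load_data  # signature assumed

base_path = Path("/path/to/my_project")  # hypothetical path
sub_name = "sub-1119617"

# Map each session to the runs that should be loaded from it
# (session / run names here are illustrative).
sessions_and_runs = {
    "ses-001": ["run-001", "run-002"],
    "ses-002": ["run-001"],
}

# Traverses the SWCBlueprint-style folder tree and returns a
# PreprocessingData UserDict holding the lazy recording objects.
data = load_data(base_path, sub_name, sessions_and_runs)
```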

Currently the test data is at:
"/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/test_data/steve_multi_run/1119617/time-short-multises"

Please let me know if you have any trouble accessing it! Currently I am running off a mounted drive, but let me know if you have trouble with the path on macOS and I can update the usage instructions for the cluster (sorry, these are currently unfinished).

@JoeZiminski changed the title from "Add load_data." to "Review load_data" on Sep 13, 2023
@lauraporta (Member) left a comment

Interesting work, Joe!
I left some comments on parts of the code that are not clear to me. Happy to discuss them when you have time!

spikewrap/data_classes/preprocessing.py (resolved review thread)
)

sub_name = "sub-1119617"
sessions_and_runs = {

Beyond this example, will the user need to scrape these names from the filesystem?

spikewrap/pipeline/load_data.py (two resolved review threads)
spikewrap/examples/load_data.py (resolved review thread)

return list(zip(ordered_ses_names, ordered_run_names))

def _validate_inputs(

I like this set of assert statements; it seems like a clean way to hold all the validations together.
Do you think you will define separate methods to get the folders? For instance, will get_sub_path change depending on the implementation of load_data()?
If so, it makes sense to inject the Callables (see the sketch below).
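
For example, a hedged sketch of the injection idea (get_sub_path and _validate_inputs appear in the diff; the signatures and folder layout here are assumptions for illustration only):

```python
from pathlib import Path
from typing import Callable, Dict, List


def default_get_sub_path(base_path: Path, sub_name: str) -> Path:
    # Assumed SWCBlueprint-style layout: base / rawdata / subject.
    return base_path / "rawdata" / sub_name


def _validate_inputs(
    base_path: Path,
    sub_name: str,
    sessions_and_runs: Dict[str, List[str]],
    get_sub_path: Callable[[Path, str], Path] = default_get_sub_path,
) -> None:
    # Because the path-building logic is injected, a different
    # load_data() implementation can swap in its own path scheme
    # without touching the validation itself.
    sub_path = get_sub_path(base_path, sub_name)
    assert sub_path.is_dir(), f"No subject folder found at {sub_path}"
    for ses_name, run_names in sessions_and_runs.items():
        assert run_names, f"No runs specified for session {ses_name}"
```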

spikewrap/data_classes/base.py (two resolved review threads)
spikewrap/data_classes/preprocessing.py (two resolved review threads)