Review load_data #107

Merged: 1 commit, Nov 16, 2023

Conversation

@JoeZiminski (Member) commented on Sep 13, 2023

This PR is to review the load_data code from spikewrap.

The PR system is a little strange, but I think it should (hopefully) work well. The aim of this system is to avoid having to review the entire codebase at once, or review since-reverted commits, without changing the dev / main branch git history. reviewed_code was created by branching from dev and removing all commits. rev_load_data was created by branching from dev and removing all of the code unrelated to loading data. Now, merging rev_load_data > reviewed_code should isolate these commits of interest. For new features, I will just add them from dev to here. Once the main bulk of the code is reviewed, we can revert to reviewing in the normal way. This is a very weird system :'D but it should work out OK and allow tailored commits to be reviewed.

This PR includes all the general package-related files (e.g. pyproject.toml) as well as:

- examples/load_data.py: an example of how to use the code to load data into spikewrap.
- pipeline/load_data.py: wrapper code to load the raw data into a class data structure. Under the hood, the data is loaded with spikeinterface. Note that all data objects in spikeinterface are 'lazy' and do not actually contain the data, so these objects can be easily stored in classes (see the sketch after this list).
- data_classes/base.py and data_classes/preprocessing.py: these data classes hold the raw data objects. At the moment they don't do anything with them, but they become more important when preprocessing is added.
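
For illustration, a minimal sketch of this lazy behaviour with spikeinterface (assuming SpikeGLX data; the folder path is hypothetical):

```python
import spikeinterface.extractors as se

# Reading the recording only parses metadata; no traces are loaded
# into memory, so the object is cheap to hold inside a data class.
recording = se.read_spikeglx("/path/to/spikeglx_run")  # hypothetical path

print(recording.get_sampling_frequency())

# Traces are only read from disk when explicitly requested:
traces = recording.get_traces(start_frame=0, end_frame=1000)
```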

Some notes:
1) The dataclass inheritance reduces boilerplate but is quite confusing; I am considering reverting to normal classes just for readability. The way it works is that the class attributes are declared at the start of the class body. When subclassing, only the class attributes new to the subclass are declared. Then, when instantiating the class, you must pass all required arguments (the superclass attributes first, then the subclass attributes).

Because this only works for attributes, any logic must be put in a __post_init__ method.
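
A minimal sketch of the pattern, with illustrative class and attribute names (not the actual spikewrap ones):

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExampleBase:
    # Class attributes are declared at the top of the class body.
    base_path: Path
    sub_name: str

    def __post_init__(self):
        # The dataclass machinery only handles attributes, so any
        # logic (validation, conversion) must live in __post_init__.
        self.base_path = Path(self.base_path)


@dataclass
class ExampleSub(ExampleBase):
    # Only the attributes new to the subclass are declared here.
    pp_steps: dict

    def __post_init__(self):
        super().__post_init__()
        assert self.pp_steps, "pp_steps cannot be empty."


# Superclass arguments come first, then subclass arguments:
data = ExampleSub("/path/to/project", "sub-1119617", {"1": "bandpass_filter"})
```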

2) The idea is that the user will have data stored in SWCBlueprint style, which consists of subject, session and ephys levels; typically, within each session there are different runs. The user must indicate the sessions and runs to load in the sessions_and_runs dictionary (e.g. examples/load_data.py). The filesystem is then traversed based on these names to load the specified sessions and runs into the PreprocessingData class. This is a UserDict (basically a dictionary with additional functions) in which the loaded data objects are stored. Later (in a future PR) these objects are operated on during preprocessing.
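
A rough sketch of the intended usage, based on the description above (the load_data signature and the session/run names are assumptions; see examples/load_data.py for the real example):

```python
from pathlib import Path

from spikewrap.pipeline.load_data import load_data  # signature assumed

base_path = Path("/path/to/my_project")  # hypothetical path
sub_name = "sub-1119617"

# Map each session to the runs that should be loaded from it
# (session / run names here are illustrative).
sessions_and_runs = {
    "ses-001": ["run-001", "run-002"],
    "ses-002": ["run-001"],
}

# Traverses the SWCBlueprint-style folder tree and returns a
# PreprocessingData UserDict holding the lazy recording objects.
data = load_data(base_path, sub_name, sessions_and_runs)
```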

Currently the test data is at:
"/ceph/neuroinformatics/neuroinformatics/scratch/jziminski/ephys/test_data/steve_multi_run/1119617/time-short-multises"

Please let me know if you have any trouble accessing it! Currently I am running off a mounted drive, but let me know if you have trouble with the path on macOS and I can update the usage instructions for the cluster (sorry, these are currently unfinished).

@JoeZiminski changed the title from "Add load_data." to "Review load_data" on Sep 13, 2023
@lauraporta (Member) left a comment

Interesting work, Joe!
I left some comments on parts of the code that are not clear to me. Happy to discuss them when you have time!

spikewrap/data_classes/preprocessing.py (resolved review thread)
)

sub_name = "sub-1119617"
sessions_and_runs = {

Beyond this example, will the user need to scrape these names from the filesystem?

spikewrap/pipeline/load_data.py (two resolved review threads)
spikewrap/examples/load_data.py (resolved review thread)

return list(zip(ordered_ses_names, ordered_run_names))

def _validate_inputs(

I like this set of assert statements; it seems like a clean way to hold all the validations together.
Do you think you will define separate methods to get the folders? For instance, will get_sub_path change depending on the implementation of load_data()?
If so, it makes sense to inject the Callables (see the sketch below).
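
For example, a hedged sketch of the injection idea (get_sub_path and _validate_inputs appear in the diff; the signatures and folder layout here are assumptions for illustration only):

```python
from pathlib import Path
from typing import Callable, Dict, List


def default_get_sub_path(base_path: Path, sub_name: str) -> Path:
    # Assumed SWCBlueprint-style layout: base / rawdata / subject.
    return base_path / "rawdata" / sub_name


def _validate_inputs(
    base_path: Path,
    sub_name: str,
    sessions_and_runs: Dict[str, List[str]],
    get_sub_path: Callable[[Path, str], Path] = default_get_sub_path,
) -> None:
    # Because the path-building logic is injected, a different
    # load_data() implementation can swap in its own path scheme
    # without touching the validation itself.
    sub_path = get_sub_path(base_path, sub_name)
    assert sub_path.is_dir(), f"No subject folder found at {sub_path}"
    for ses_name, run_names in sessions_and_runs.items():
        assert run_names, f"No runs specified for session {ses_name}"
```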

spikewrap/data_classes/base.py (two resolved review threads)
spikewrap/data_classes/preprocessing.py (two resolved review threads)