ticket/ecephys/33a/metadata/writer #2407

danielsf · 2022-05-02T23:45:25Z

This PR adds a module to write the metadata files for the 2022 VBN release.

Some tangentially related changes are

Adding properties to BehaviorStimulus file allowing us to access date_of_acquisition and session_type directly from the pickle file when they are absent or wrong in MTRAIN or LIMS
Moving some generic LIMS querying functions out of behavior/behavior_project_cache and into allensdk/internal/api/queries, where they are more accessible to more packages within the SDK.

This PR does not adequately unit-test the code in allensdk/brain_observatory/vbn_2022/lims_queries.py. Writing a mock LIMS database at test time is pretty straightforward but will take a lot of personnel hours. I will open a separate ticket (#2408) to add those tests. For now, our principal means of testing the LIMS queries is to use this code to generate metadata tables and present them to Corbett for validation.

danielsf · 2022-05-02T23:47:49Z

I'm attaching an input_json that can be run through allensdk/brain_observatory/vbn_2022/metadata_writer to produce my current best approximation to the metadata files that we want for the release.
production_json.txt

aamster

Overall notes:

This design does not make use of the existing design for the BehaviorProjectMetadataWriter. These are essentially 2 different strategies for metadata writing. Is it possible to make this design share more in common with that design? If not, why not?
There may be potentially duplicated/slightly modified logic around ecephys LIMS queries. There already exists EcephysProjectLimsApi which I think does most of what we need. Are you able to reuse this class/share code in any way?
Some of the code in the vbn_2022 seems borderline generally applicable. But if it is in a package called vbn_2022 it will not be reusable in the future. Please only place code within this package that is truly specific to this release.
allensdk.brain_observatory.ecephys.ecephys_project_api seems like an already existing suitable home for project-level code relating to ecephys projects
Please indent SQL queries when the statement runs to the next line. I.e instead of

SELECT 
foo,
bar
JOIN bar
ON bar.bar = foo.bar
WHERE
foo = 3

it is much easier to read as

SELECT 
  foo,
  bar
JOIN bar ON bar.bar = foo.bar
WHERE
  foo = 3

On a related note, I find it easier to read queries if the line doesn't start with a comma.

...k/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py

allensdk/brain_observatory/behavior/data_files/stimulus_file.py

allensdk/brain_observatory/vbn_2022/metadata_writer/lims_queries.py

aamster · 2022-05-03T12:56:54Z

allensdk/brain_observatory/vbn_2022/metadata_writer/lims_queries.py

@@ -0,0 +1,758 @@
+from typing import List, Tuple, Dict, Any, Optional


It looks like some of these functions should be private. Unable to tell what functions are supposed to be exposed to outside consumers

aamster · 2022-05-03T15:54:10Z

allensdk/brain_observatory/vbn_2022/metadata_writer/dataframe_manipulations.py

+        FileIDGenerator)
+
+
+def add_file_paths_to_session_table(


This is different strategy than https://github.com/AllenInstitute/AllenSDK/blob/master/allensdk/brain_observatory/behavior/behavior_project_cache/project_apis/data_io/behavior_project_lims_api.py#L523

Yes. The VBO strategy relied on the files being released being well known files that were tracked in LIMS. Given that we are not making NWB generation a LIMS strategy, the files we are releasing are not going to be well known files tracked in LIMS, so we are going to need a different way to list the files for release. We can discuss this if you want to revisit that decision, but the only way to reuse the VBO strategy is to make sure the VBN NWB files are tracked in the well_known_files table on LIMS.

That makes sense. Ideally we would have uniformity for these data releases. I know that the old strategy of using LIMS to generate a well known file wasn't great. If we prefer the new strategy, I'd like to update the old strategy to use the new one.

move to allensdk/data_releases/metadata_writing

aamster · 2022-05-03T15:57:11Z

allensdk/brain_observatory/vbn_2022/metadata_writer/dataframe_manipulations.py

+# the VBN 2022 metadata dataframes as they are directly queried
+# from LIMS.
+
+import pandas as pd


Unsure why this module doesn't make use of prior_exposure_processing.py. Is it possible to share code?

The prior_exposure_processing.py code used in VBO is all very VBO-specified, for instance

https://github.com/AllenInstitute/AllenSDK/blob/master/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py#L81-L115

which, based on the docstring, sounds like a VBO-specific bug or

https://github.com/AllenInstitute/AllenSDK/blob/master/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py#L124-L127

which assumes that these are ophys sessions.

Talking to Corbett, his attitude towards omissions was "there were only omissions in the ecephys sessions, so really, prior exposures to omissions is just a question of 'has the mouse seen an ecephys session, yet" which is different than what we did for VBO, but also very VBN specific (which is why I placed it here, where it is unlikely to get reused for future data releases).

I do reuse prior_exposure_processing.py where appropriate in vbn_2022/metadata_writer/lims_queries.py. Prior exposures to omissions was a weird edge case for this release.

The VBO code does something like (ignoring the habituation session bug)

Select all ophys sessions
calculate prior exposure to omissions for these sessions using __get_prior_exposure_count
merge these values into the other table

Isn't this essentially the same algorithm? The only difference being select all ecephys sessions. The function I think should be used is __get_prior_exposure_count (making it public and moving it if needed). I'm unsure why you have such complicated logic to fill in the behavior session values. Wouldn't a merge from one table to the other work, similar to how we did it for VBO?

allensdk/test/brain_observatory/vbn_2020/metadata_writer/conftest.py

allensdk/brain_observatory/vbn_2022/metadata_writer/lims_queries.py

danielsf · 2022-05-04T22:22:57Z

I haven't really thought through what this means with respect to the VCN LIMS API, but I am summarizing a Teams conversation I just had with Corbett so that we know why we did what we ended up doing:

Scott:

Hi Corbett,

I apologize for the very elementary nature of this question, but:

Looking back over the legacy VCN code, it looks like most of the queries to lims are filtered with

WHERE
NOT ecephys_sessions.habituation
AND ecephys_channels.valid_data
AND ecephys_sessions.workflow_state != 'failed'
AND ecephys_probes.workflow_state != 'failed'

I just checked the sessions, probes, and channels we are slated for release. Most of these requirements are honored, except that there are 8066 of 354048 channels with ecephys_channels.valid_data==False. Should we be filtering these channels out? I ask because I don't know where or how this flag got set and wasn't sure if you wanted to honor it in this release.

I can, of course, provide more details if the answer is not a simple "yes" or "no"

After a brief back-and-forth, the decision was:

Release all the data (even valid_data==False) but add a valid_data column to the metadata tables (presumably the channels.csv table) so that users know what they are getting.

I'm not sure if this means we can/should reuse the existing EcephysProjectLimsApi to generate metadata tables, as it all assumes you only want valid_data==True. I will investigate adding functionality that API to turn that filter off.

danielsf · 2022-05-04T22:35:32Z

@aamster Given the above comment about the WHERE clauses that the EcephysProjectLimsApi assumes, how would you feel about leaving the LIMS queries in this ticket as is and then opening a new ticket to make the VBN metadata writer depend on the EcephysProjectLimsApi making the necessary changes to the EcephysProjectLimsApi along the way?

My reasoning for this is that, once this ticket is merged, we can start working on the issue Corbett opened regarding session_type, which unblocks us for release. We are then free to take as long as we need to updating the EcephysProjectLimsApi, knowing that we have something (the metadata tables generated by this ticket) to validate against.

To be explicit, the new ticket we are opening would involve

adding probe_ids_to_skip to the EcephysProjectLimsApi
adding a kwarg to turn off filtering on ec.valid_data in EcephysLimsProjectApi
adding dataframe manipulation methods to join the tables produced by EcephysLimsProjectApi (there are some columns, like channel_id, that appear in more than one VBO metadata table)
modifing VBN metadata_writer to depend on EcephysLimsProjectApi
only accept the new ticket if the metadata tables produced before and after merging are identical

aamster · 2022-05-04T22:45:13Z

@danielsf I'm fine with pushing the reuse of EcephysProjectLimsApi to the future, if we're confident we'll be able to take it on in the future.

danielsf · 2022-05-05T23:49:00Z

#2411 is the ticket to make this work rely on the legacy ecephys LIMS API

danielsf · 2022-05-05T23:56:48Z

#2412 is to address a misalignment in date_of_acquisition discovered in the course of this work

danielsf · 2022-05-06T00:04:54Z

#2414 has been opened to add functionality to the metadata writer to create the input.json that will ultimately be fed into the informatics_data_release_tool

danielsf · 2022-05-06T00:08:34Z

@aamster I think this is ready for another look. The method for processing prior_exposures_to_omissions is still ugly (it has to create a combined history from the ecephys_sessions table and the behavior_sessions_table), but it relies on the prior_exposure_processing module from VBO (and now actually does what Corbett wanted, as I understand it).

I haven't done much to address your "what should be private and what should be public" comment. The reason is, "I don't know." As I'm sure you've noticed, I've become much less sanguine than you of late that code from any particular release can be generalized to releases past or present. I think the best we can do is keep the data munging methods siloed and documented so that we can borrow things when we want them and don't accidentally borrow them when we don't want them.

I'd like to get this merged so we can move forward with the release at hand.

If you want to have a larger organizational conversation about the VBN release code, I'm open to that in the next few weeks, once we know that we are on our way to actually getting the data into S3.

danielsf · 2022-05-06T21:53:54Z

@aamster I think, modulo the cases where we have tentatively agreed to disagree (and whatever we are calling the situation with ecephys_project_lims_api.py) I have actually responded to your review.

Talk to you about this on Monday.

aamster

Looks good. I do think there needs to be more cohesion between this metadata writer and the previous VBO one, but as already discussed doing so would require some additional time and effort. Things I am referring to are reuse of ProjectTable, common base class between the existing BehaviorProjectMetadataWriter and VBN2022MetadataWriterClass, and reorganization of data release packages for different projects, as you've already pointed out. I don't know how much of the above we want to take on in the future. We've also pointed out that much of database queries can share with the VCN queries, which I think we plan on taking on in the future.

These methods are going to be useful for the VBN metadata.

…usFile

- move functions for getting behavior -> session_utils - move VBN utils into internal/api/queries where appropriate

split up methods that are principally LIMS queries from methods that are local dataframe manipulations

danielsf mentioned this pull request May 2, 2022

Add unit test for VBN metadata LIMS queries #2408

Open

1 task

aamster requested changes May 3, 2022

View reviewed changes

danielsf force-pushed the ticket/ecephys/33a/metadata/writer branch 3 times, most recently from 3c9a07a to 95c6446 Compare May 5, 2022 23:38

danielsf mentioned this pull request May 5, 2022

Try to merge VBN 2022 and VCN 2019 LIMS APIs #2411

Open

5 tasks

danielsf mentioned this pull request May 6, 2022

Create python tool to generate VBN NWB input JSONs #2400

Closed

3 tasks

danielsf force-pushed the ticket/ecephys/33a/metadata/writer branch 3 times, most recently from 4d5ebef to 6e53840 Compare May 6, 2022 21:30

danielsf mentioned this pull request May 6, 2022

Read NWB date_of_acquisition from the stimulus pickle file #2416

Closed

1 task

aamster approved these changes May 9, 2022

View reviewed changes

danielsf added 9 commits May 9, 2022 09:05

fix docstring error in timestamps_processing

5c849a2

factor lims_query methods out of behavior_project_cache space

353cf02

These methods are going to be useful for the VBN metadata.

add session_type and date_of_acquisition properties to BehaviorStimul…

68c1a32

…usFile

start implementing VBN metadata writer

4ef4148

add method to lookup pickle paths from a list of behavior session IDs

22bbfe1

reorganize metadata utility methods

dc80384

- move functions for getting behavior -> session_utils - move VBN utils into internal/api/queries where appropriate

add docstrings to lims_queries.py

2862b41

add method to get behavior_sessions table from ecephys_session_ids

0f6ab97

finish functional implementation of VBN_2022 metadata writer

9e76d7c

danielsf added 26 commits May 9, 2022 09:05

reorganize metadata writer utils

6c67e6f

split up methods that are principally LIMS queries from methods that are local dataframe manipulations

fix _add_prior_omissions_to_behavior

0f91f73

finish adding dataframe_manipulations unit test

44474fb

add class to generate unique file IDs when metadata tables are written

97b95be

add method to add file_id and file_path to session_table

62cf764

connect add_file_path... CLI metadata writer

20acde6

add on-prem smoke test for VBN metadata writer CLI

168434e

move _add_images_from_behavior to dataframe_manipulations

8f7b6a2

pep8 change

b227a30

fix test to pass in windows

751ad8b

simplify return statement in BehaviorStimulusFile.session_type

065769a

rename test package

120fcf0

reformat SQL queries

8f4c1a5

remove parentheses from return statement

a705f50

fix docstrings in lims_queries.py

5484608

remove unused array

673c81d

add valid_data to channels and units table

b54485a

use existing get_prior_exposures_to_omissions method

183cf8e

_add_prior_omissions uses code from VBO tables/utils

8230824

indent queries in allensdk/internal/api/queries/lims_queries.py

41f28d1

factor out different flavors of lims queries

53eb5d2

privatize methods in lims_queries.py

c2814c4

move fileID generator infrastructure to data_release_utils

b02d213

fix docstrings

433de7c

move get_image_set

f08e511

pep8 changes

e8be731

danielsf force-pushed the ticket/ecephys/33a/metadata/writer branch from 6e53840 to e8be731 Compare May 9, 2022 16:05

danielsf merged commit fe68a15 into vbn_2022_dev May 9, 2022

danielsf deleted the ticket/ecephys/33a/metadata/writer branch June 8, 2022 16:45

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ticket/ecephys/33a/metadata/writer #2407

ticket/ecephys/33a/metadata/writer #2407

danielsf commented May 2, 2022 •

edited

Loading

danielsf commented May 2, 2022

aamster left a comment

aamster May 3, 2022

danielsf May 6, 2022

aamster May 3, 2022

danielsf May 4, 2022

aamster May 4, 2022

danielsf May 6, 2022

aamster May 3, 2022

danielsf May 4, 2022

aamster May 4, 2022 •

edited

Loading

danielsf commented May 4, 2022

danielsf commented May 4, 2022

aamster commented May 4, 2022

danielsf commented May 5, 2022

danielsf commented May 5, 2022

danielsf commented May 6, 2022

danielsf commented May 6, 2022

danielsf commented May 6, 2022

aamster left a comment

		@@ -0,0 +1,758 @@
		from typing import List, Tuple, Dict, Any, Optional

ticket/ecephys/33a/metadata/writer #2407

ticket/ecephys/33a/metadata/writer #2407

Conversation

danielsf commented May 2, 2022 • edited Loading

danielsf commented May 2, 2022

aamster left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aamster May 4, 2022 • edited Loading

Choose a reason for hiding this comment

danielsf commented May 4, 2022

danielsf commented May 4, 2022

aamster commented May 4, 2022

danielsf commented May 5, 2022

danielsf commented May 5, 2022

danielsf commented May 6, 2022

danielsf commented May 6, 2022

danielsf commented May 6, 2022

aamster left a comment

Choose a reason for hiding this comment

danielsf commented May 2, 2022 •

edited

Loading

aamster May 4, 2022 •

edited

Loading