Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ticket/ecephys/33a/metadata/writer #2407

Merged
merged 35 commits into from
May 9, 2022

Conversation

danielsf
Copy link
Contributor

@danielsf danielsf commented May 2, 2022

This PR adds a module to write the metadata files for the 2022 VBN release.

Some tangentially related changes are

  • Adding properties to BehaviorStimulus file allowing us to access date_of_acquisition and session_type directly from the pickle file when they are absent or wrong in MTRAIN or LIMS
  • Moving some generic LIMS querying functions out of behavior/behavior_project_cache and into allensdk/internal/api/queries, where they are more accessible to more packages within the SDK.

This PR does not adequately unit-test the code in allensdk/brain_observatory/vbn_2022/lims_queries.py. Writing a mock LIMS database at test time is pretty straightforward but will take a lot of personnel hours. I will open a separate ticket (#2408) to add those tests. For now, our principal means of testing the LIMS queries is to use this code to generate metadata tables and present them to Corbett for validation.

@danielsf
Copy link
Contributor Author

danielsf commented May 2, 2022

I'm attaching an input_json that can be run through allensdk/brain_observatory/vbn_2022/metadata_writer to produce my current best approximation to the metadata files that we want for the release.
production_json.txt

Copy link
Contributor

@aamster aamster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall notes:

  1. This design does not make use of the existing design for the BehaviorProjectMetadataWriter. These are essentially 2 different strategies for metadata writing. Is it possible to make this design share more in common with that design? If not, why not?
  2. There may be potentially duplicated/slightly modified logic around ecephys LIMS queries. There already exists EcephysProjectLimsApi which I think does most of what we need. Are you able to reuse this class/share code in any way?
  3. Some of the code in the vbn_2022 seems borderline generally applicable. But if it is in a package called vbn_2022 it will not be reusable in the future. Please only place code within this package that is truly specific to this release.
  4. allensdk.brain_observatory.ecephys.ecephys_project_api seems like an already existing suitable home for project-level code relating to ecephys projects
  5. Please indent SQL queries when the statement runs to the next line. I.e instead of
SELECT 
foo,
bar
JOIN bar
ON bar.bar = foo.bar
WHERE
foo = 3

it is much easier to read as

SELECT 
  foo,
  bar
JOIN bar ON bar.bar = foo.bar
WHERE
  foo = 3

On a related note, I find it easier to read queries if the line doesn't start with a comma.

@@ -0,0 +1,758 @@
from typing import List, Tuple, Dict, Any, Optional
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like some of these functions should be private. Unable to tell what functions are supposed to be exposed to outside consumers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

FileIDGenerator)


def add_file_paths_to_session_table(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The VBO strategy relied on the files being released being well known files that were tracked in LIMS. Given that we are not making NWB generation a LIMS strategy, the files we are releasing are not going to be well known files tracked in LIMS, so we are going to need a different way to list the files for release. We can discuss this if you want to revisit that decision, but the only way to reuse the VBO strategy is to make sure the VBN NWB files are tracked in the well_known_files table on LIMS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Ideally we would have uniformity for these data releases. I know that the old strategy of using LIMS to generate a well known file wasn't great. If we prefer the new strategy, I'd like to update the old strategy to use the new one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to allensdk/data_releases/metadata_writing

# the VBN 2022 metadata dataframes as they are directly queried
# from LIMS.

import pandas as pd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unsure why this module doesn't make use of prior_exposure_processing.py. Is it possible to share code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prior_exposure_processing.py code used in VBO is all very VBO-specified, for instance

https://github.com/AllenInstitute/AllenSDK/blob/master/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py#L81-L115

which, based on the docstring, sounds like a VBO-specific bug or

https://github.com/AllenInstitute/AllenSDK/blob/master/allensdk/brain_observatory/behavior/behavior_project_cache/tables/util/prior_exposure_processing.py#L124-L127

which assumes that these are ophys sessions.

Talking to Corbett, his attitude towards omissions was "there were only omissions in the ecephys sessions, so really, prior exposures to omissions is just a question of 'has the mouse seen an ecephys session, yet" which is different than what we did for VBO, but also very VBN specific (which is why I placed it here, where it is unlikely to get reused for future data releases).

I do reuse prior_exposure_processing.py where appropriate in vbn_2022/metadata_writer/lims_queries.py. Prior exposures to omissions was a weird edge case for this release.

Copy link
Contributor

@aamster aamster May 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VBO code does something like (ignoring the habituation session bug)

Select all ophys sessions
calculate prior exposure to omissions for these sessions using __get_prior_exposure_count
merge these values into the other table

Isn't this essentially the same algorithm? The only difference being select all ecephys sessions. The function I think should be used is __get_prior_exposure_count (making it public and moving it if needed). I'm unsure why you have such complicated logic to fill in the behavior session values. Wouldn't a merge from one table to the other work, similar to how we did it for VBO?

@danielsf
Copy link
Contributor Author

danielsf commented May 4, 2022

I haven't really thought through what this means with respect to the VCN LIMS API, but I am summarizing a Teams conversation I just had with Corbett so that we know why we did what we ended up doing:

Scott:

Hi Corbett,

I apologize for the very elementary nature of this question, but:

Looking back over the legacy VCN code, it looks like most of the queries to lims are filtered with

WHERE
NOT ecephys_sessions.habituation
AND ecephys_channels.valid_data
AND ecephys_sessions.workflow_state != 'failed'
AND ecephys_probes.workflow_state != 'failed'

I just checked the sessions, probes, and channels we are slated for release. Most of these requirements are honored, except that there are 8066 of 354048 channels with ecephys_channels.valid_data==False. Should we be filtering these channels out? I ask because I don't know where or how this flag got set and wasn't sure if you wanted to honor it in this release.

I can, of course, provide more details if the answer is not a simple "yes" or "no"

After a brief back-and-forth, the decision was:

Release all the data (even valid_data==False) but add a valid_data column to the metadata tables (presumably the channels.csv table) so that users know what they are getting.

I'm not sure if this means we can/should reuse the existing EcephysProjectLimsApi to generate metadata tables, as it all assumes you only want valid_data==True. I will investigate adding functionality that API to turn that filter off.

@danielsf
Copy link
Contributor Author

danielsf commented May 4, 2022

@aamster Given the above comment about the WHERE clauses that the EcephysProjectLimsApi assumes, how would you feel about leaving the LIMS queries in this ticket as is and then opening a new ticket to make the VBN metadata writer depend on the EcephysProjectLimsApi making the necessary changes to the EcephysProjectLimsApi along the way?

My reasoning for this is that, once this ticket is merged, we can start working on the issue Corbett opened regarding session_type, which unblocks us for release. We are then free to take as long as we need to updating the EcephysProjectLimsApi, knowing that we have something (the metadata tables generated by this ticket) to validate against.

To be explicit, the new ticket we are opening would involve

  • adding probe_ids_to_skip to the EcephysProjectLimsApi
  • adding a kwarg to turn off filtering on ec.valid_data in EcephysLimsProjectApi
  • adding dataframe manipulation methods to join the tables produced by EcephysLimsProjectApi (there are some columns, like channel_id, that appear in more than one VBO metadata table)
  • modifing VBN metadata_writer to depend on EcephysLimsProjectApi
  • only accept the new ticket if the metadata tables produced before and after merging are identical

@aamster
Copy link
Contributor

aamster commented May 4, 2022

@danielsf I'm fine with pushing the reuse of EcephysProjectLimsApi to the future, if we're confident we'll be able to take it on in the future.

@danielsf danielsf force-pushed the ticket/ecephys/33a/metadata/writer branch 3 times, most recently from 3c9a07a to 95c6446 Compare May 5, 2022 23:38
@danielsf
Copy link
Contributor Author

danielsf commented May 5, 2022

#2411 is the ticket to make this work rely on the legacy ecephys LIMS API

@danielsf
Copy link
Contributor Author

danielsf commented May 5, 2022

#2412 is to address a misalignment in date_of_acquisition discovered in the course of this work

@danielsf
Copy link
Contributor Author

danielsf commented May 6, 2022

#2414 has been opened to add functionality to the metadata writer to create the input.json that will ultimately be fed into the informatics_data_release_tool

@danielsf
Copy link
Contributor Author

danielsf commented May 6, 2022

@aamster I think this is ready for another look. The method for processing prior_exposures_to_omissions is still ugly (it has to create a combined history from the ecephys_sessions table and the behavior_sessions_table), but it relies on the prior_exposure_processing module from VBO (and now actually does what Corbett wanted, as I understand it).

I haven't done much to address your "what should be private and what should be public" comment. The reason is, "I don't know." As I'm sure you've noticed, I've become much less sanguine than you of late that code from any particular release can be generalized to releases past or present. I think the best we can do is keep the data munging methods siloed and documented so that we can borrow things when we want them and don't accidentally borrow them when we don't want them.

I'd like to get this merged so we can move forward with the release at hand.

If you want to have a larger organizational conversation about the VBN release code, I'm open to that in the next few weeks, once we know that we are on our way to actually getting the data into S3.

@danielsf danielsf force-pushed the ticket/ecephys/33a/metadata/writer branch 3 times, most recently from 4d5ebef to 6e53840 Compare May 6, 2022 21:30
@danielsf
Copy link
Contributor Author

danielsf commented May 6, 2022

@aamster I think, modulo the cases where we have tentatively agreed to disagree (and whatever we are calling the situation with ecephys_project_lims_api.py) I have actually responded to your review.

Talk to you about this on Monday.

Copy link
Contributor

@aamster aamster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I do think there needs to be more cohesion between this metadata writer and the previous VBO one, but as already discussed doing so would require some additional time and effort. Things I am referring to are reuse of ProjectTable, common base class between the existing BehaviorProjectMetadataWriter and VBN2022MetadataWriterClass, and reorganization of data release packages for different projects, as you've already pointed out. I don't know how much of the above we want to take on in the future. We've also pointed out that much of database queries can share with the VCN queries, which I think we plan on taking on in the future.

@danielsf danielsf force-pushed the ticket/ecephys/33a/metadata/writer branch from 6e53840 to e8be731 Compare May 9, 2022 16:05
@danielsf danielsf merged commit fe68a15 into vbn_2022_dev May 9, 2022
@danielsf danielsf deleted the ticket/ecephys/33a/metadata/writer branch June 8, 2022 16:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants