Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cellxgene loader #10

Merged
merged 16 commits into from
Nov 3, 2020
Merged

Cellxgene loader #10

merged 16 commits into from
Nov 3, 2020

Conversation

davidsebfischer
Copy link
Contributor

  • automatised data loader on streamlined data structure
  • refactoring of code to ease 3rd party integration

@davidsebfischer davidsebfischer self-assigned this Oct 27, 2020
@davidsebfischer davidsebfischer marked this pull request as draft October 27, 2020 12:21
adata = anndata.read(fn)
adata.X = adata.raw.X

self.adata.uns["lab"] = adata.uns["contributors"]["name"]
Copy link

@ambrosejcarr ambrosejcarr Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this loader evaded the ADATA_IDS refactor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ambrosejcarr, indeed this was the case, I fixed that and introduced a separate constants container for cellxgene objects!

Copy link

@ambrosejcarr ambrosejcarr Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this implementation! I made my comment because I assumed (incorrectly) that you would replace our schema names with those in ADATA_IDS, making our data conform to your internal schema.

Checking my understanding: What you've done instead is added cellxgene's schema as an alternative schema. Is that correct?

If going this route, I'd recommend you inherit both ADATA_IDS and ADATA_IDS_CELLXGENE from an abstract base class that defines the properties that sfaira requires. I imagine this includes but may not be limited to the "lazy loading metadata". If you did it this way, a future PR could modularize the schema; a user could choose to swap between different schema based on their selection of which set of constants to use with the data loader.

Someone who wanted to use sfaira to work in the HTAN or HuBMAP schema (examples) would simply need to write a new set of schema that meets the requirements of the sfaira ABC, and they could work in their chosen ecosystem's schema...

Copy link
Contributor Author

@davidsebfischer davidsebfischer Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking my understanding: What you've done instead is added cellxgene's schema as an alternative schema. Is that correct?

yes! I also refactored the classes to be derived of base classes that reflect what we require for other data bases now! this closesly ties in with the meta data as well, correct! Thanks for the suggestion!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended this - all relevant properties for lazy loading are now exactly those that are also read from meta data files. So in this case, we just need the matched meta data files which are then queried during lazy loading. These match a subset of all generally required properties of the object, essentially the most important dataset-wise (as oppose to cell-wise) properties. If it s the case that a data set has cells from different organs for example, the organ entry of this meta data file would just be a list of contained organs, but not cell-wise.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On 2nd thoughts, I dont think it s really necessary that a 3rd party also maintains the meta files - if the adata objects are streamlined as this is outlined here now, these relevant attributes can be easily read from a loaded adata object and an up-to-date library of meta files can be built locally with write_meta() once before lazy loading usage. This is not very costly (this is how we maintain our meta data file library for example) and reduces overhead of interfacing 3rd parties a lot, let me know if somebody disagrees.

Copy link

@ambrosejcarr ambrosejcarr Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. I don't understand what this means, however:

[A]ll relevant properties for lazy loading are now exactly those that are also read from meta data files. So in this case, we just need the matched meta data files which are then queried during lazy loading

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, meta data files were not directly linked to lazy loading as we largely set properties for lazy loading in datasest constructors. Initially we only used meta data files to generate data base statistics actually. I streamlined this now so that we dont have different versions of meta data files.

@davidsebfischer davidsebfischer changed the base branch from master to dev November 3, 2020 10:02
@davidsebfischer
Copy link
Contributor Author

addresses #15

@davidsebfischer davidsebfischer marked this pull request as ready for review November 3, 2020 10:40
@davidsebfischer davidsebfischer merged commit 365739c into dev Nov 3, 2020
@Zethson Zethson deleted the cellxgene_loader branch April 27, 2021 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants