expanding database selection methods: metadata #2180

Open
bluegenes opened this issue Aug 5, 2022 · 3 comments

@bluegenes
Contributor

bluegenes commented Aug 5, 2022

Discussion in #2178 reminded me of some things @ctb and I talked about a while back, which seem like far less of a leap now.

With the new selection and subsetting functionality (sig grep, tax grep, sig extract, tax extract, etc.) becoming increasingly fleshed out and useful, we could more generally enable a manifest-style metadata file (METADATA.csv/sql?) for signatures and support subsetting across it, e.g. by generating picklists.

Current use case example:

When we run MAGsearch, we postprocess the results to link matches with their SRA metadata. We could instead (or in addition) build a lineages-style sqldb for SRA runinfo metadata as a complementary manifest.

This would allow us to do:

  • metadata selection, e.g. "seawater metagenome", to enable SRA search/MAGsearch on just the samples with that metadata. This could be really handy when we don't want to search the entire database -- assuming picklists make it into SRA search, I guess. It would be extra neat if metadata categories were hierarchical so that we could use extract to scale up, but afaik that's not how the info is organized, so this is more of a dream than a concrete use case.
  • tax annotate-style annotation (or perhaps a separate metadata annotate?)

As with the current functions, we use the metadata to select the identifiers we want, which we then use to select signatures for output/search/etc.
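A minimal sketch of that select-identifiers-then-picklist flow, using only the Python standard library. The table name `runinfo`, the column names `ident` and `scientific_name`, and the accessions are all made up for illustration; a real SRA runinfo table would have its own schema.

```python
import csv
import io
import sqlite3

# Hypothetical SRA runinfo rows: (run accession, metadata label).
ROWS = [
    ("SRR0000001", "seawater metagenome"),
    ("SRR0000002", "gut metagenome"),
    ("SRR0000003", "seawater metagenome"),
]


def build_metadata_db(rows):
    """Load metadata rows into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runinfo (ident TEXT PRIMARY KEY, scientific_name TEXT)"
    )
    conn.executemany("INSERT INTO runinfo VALUES (?, ?)", rows)
    conn.commit()
    return conn


def select_idents(conn, label):
    """Return the identifiers whose metadata label matches exactly."""
    cur = conn.execute(
        "SELECT ident FROM runinfo WHERE scientific_name = ? ORDER BY ident",
        (label,),
    )
    return [row[0] for row in cur]


def write_picklist(idents, handle):
    """Write a one-column CSV of identifiers, usable as a picklist file."""
    writer = csv.writer(handle)
    writer.writerow(["ident"])
    for ident in idents:
        writer.writerow([ident])


conn = build_metadata_db(ROWS)
idents = select_idents(conn, "seawater metagenome")

buf = io.StringIO()
write_picklist(idents, buf)
picklist_text = buf.getvalue()
print(picklist_text)
```

Written to a file such as `seawater.csv`, the output could then be handed to commands that already accept picklists, with something like `--picklist seawater.csv:ident:ident`.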

The most proximal use case is MAGsearch, but I think this could also be really useful for reference databases if there were additional metadata worth subsetting on -- e.g. quality, completeness, contamination, database source.

Ok this part is far less well-defined:
I've been thinking a bit about LIN groups and taxonomies that do not fit our current standard hierarchy. I wonder if we could allow these in the metadata file, with a corresponding json or similar that defines any (optional) hierarchical nature of the categories.
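One way the optional hierarchy file might look, as a rough sketch: a JSON object mapping each category to its direct children, plus a helper that expands a category to itself and everything below it. The JSON layout, the category labels, and `expand_category` are assumptions for illustration only, not an existing format or function.

```python
import json

# Hypothetical hierarchy file: each key maps a category to its direct children.
HIERARCHY_JSON = """
{
  "aquatic metagenome": ["seawater metagenome", "freshwater metagenome"],
  "seawater metagenome": ["marine sediment metagenome"]
}
"""


def expand_category(hierarchy, category):
    """Return the category plus all of its descendants in the hierarchy."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against accidental cycles in the file
        seen.add(current)
        # Categories with no entry in the file are treated as leaves.
        stack.extend(hierarchy.get(current, []))
    return seen


hierarchy = json.loads(HIERARCHY_JSON)
selected = expand_category(hierarchy, "aquatic metagenome")
print(sorted(selected))
```

Selecting on a parent category would then mean matching any label in the expanded set, while categories that do not appear in the file simply match themselves, so the hierarchy stays optional.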

I guess the way I'm thinking about it is that taxonomy is a specific case, but metadata could be more flexible. @ctb there was a specific sort of tagging you suggested we could tie into when we talked about this (...last year??), but I can't remember the details.

@ctb
Contributor

ctb commented Aug 7, 2022

see folksonomies in particular, mentioned in #1916 and #268 (comment)

@ctb
Contributor

ctb commented Aug 7, 2022

continuing that thought - sig grep seems like the place to do this, or perhaps something specific to manifests where we can link signature identifiers/names to generic metadata.
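As a rough sketch of what linking manifest entries to generic metadata could look like: join on the identifier, taken here to be the first whitespace-separated token of the signature name. The manifest below is deliberately simplified to two columns, and the metadata column `biome` and the `annotate` helper are hypothetical.

```python
import csv
import io

# Simplified, hypothetical manifest: real manifests carry more columns.
MANIFEST_CSV = """md5,name
abc123,SRR0000001 sample one
def456,SRR0000002 sample two
"""

# Hypothetical generic metadata keyed by identifier.
METADATA_CSV = """ident,biome
SRR0000001,seawater metagenome
"""


def annotate(manifest_handle, metadata_handle):
    """Attach metadata to manifest rows, joining on the identifier."""
    metadata = {row["ident"]: row for row in csv.DictReader(metadata_handle)}
    annotated = []
    for row in csv.DictReader(manifest_handle):
        # Assumption: the identifier is the first token of the signature name.
        ident = row["name"].split(" ", 1)[0]
        extra = metadata.get(ident, {})
        merged = dict(row)
        merged["ident"] = ident
        # Rows without metadata get an empty value rather than being dropped.
        merged["biome"] = extra.get("biome", "")
        annotated.append(merged)
    return annotated


annotated = annotate(io.StringIO(MANIFEST_CSV), io.StringIO(METADATA_CSV))
for row in annotated:
    print(row["ident"], repr(row["biome"]))
```

Keeping unmatched rows with an empty value (rather than dropping them) makes missing metadata visible, which seems useful if the metadata file is maintained separately from the manifest.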

@ctb
Contributor

ctb commented Aug 15, 2022

I think expanding standalone manifests to support this kind of thing is the way to go - explicit shoutout to #1916.
