provide a mechanism for tracking the source of genomes in SBTs #230

Closed
ctb opened this issue May 17, 2017 · 4 comments

Comments

@ctb
Contributor

ctb commented May 17, 2017

@meren asks:

> I have one quick logistics question though: is there a mechanism to keep track of the "source" of a given set of genomes in the db that are used for classification? The reason I am asking is follows. I am sure there is a way to add more genomes (incrementally maybe? or by re-computing the entire thing?) to the database. For instance, all the genomes you have is coming from RefSeq. Fine. Let's say I have additional 1,000 that is nowhere to be found, but essential for some researchers to utilize. If I could add these genomes with some sort of source identifier, I could find out if I have hits there that I don't see in RefSeq, etc. This could truly improve the utility of the tool tremendously in the long run, since more and more labs will start putting together novel genomic collections (with or without taxonomic affiliations) and others will want to track them.

@ctb
Contributor Author

ctb commented May 17, 2017

A few quick thoughts --

* as in #229, allow multiple SBTs to be passed into gather, search, watch, and categorize. Then you can "just" group signatures into SBTs by their source.

* provide an 'originator' or 'provider' flag.
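
To illustrate the first idea (one collection per source, all searched together), here is a rough sketch in plain Python. It does not use the sourmash API: each "collection" is just a hypothetical dict mapping a signature name to a set of hash values, plain Jaccard similarity stands in for a real SBT search, and the names `search_by_source`, `refseq`, and `lab_collection` are made up for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of hash values."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search_by_source(query, collections, threshold=0.1):
    """Search every collection and label each hit with its source.

    `collections` maps a source label to {signature name: set of hashes}.
    This is a stand-in for passing several SBTs to one search command.
    """
    hits = []
    for source, signatures in collections.items():
        for name, hashes in signatures.items():
            similarity = jaccard(query, hashes)
            if similarity >= threshold:
                hits.append((similarity, source, name))
    # Best match first, regardless of which source it came from.
    return sorted(hits, reverse=True)


# Hypothetical toy data: one public collection, one lab-specific collection.
collections = {
    "refseq": {"genome_a": {1, 2, 3, 4}, "genome_b": {10, 11, 12}},
    "lab_collection": {"novel_1": {3, 4, 5, 6}},
}
query = {3, 4, 5, 6, 7}

hits = search_by_source(query, collections)
for similarity, source, name in hits:
    print(f"{similarity:.2f}\t{source}\t{name}")
```

Because each hit carries its source label, it is easy to see when the best match comes from the lab collection rather than RefSeq, which is the case @meren describes.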

Note, some of this was discussed in http://ivory.idyll.org/blog/2016-sourmash-signatures-metadata.html.

Probably the simplest thing is to provide two fields in the metadata: a 'provider' that is generic to a collection of signatures (e.g. 'NCBI') and a 'provider accession' that is supposed to be unique to the provider. Then we also need to provide a command line way of setting, updating, and retrieving that info.
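
A minimal sketch of what those two fields could look like, assuming signature metadata is a plain JSON-style dict. The key names `provider` and `provider_accession`, the helper functions, and the accession value `ACC-0001` are all hypothetical; none of this reflects the actual sourmash signature format or command line.

```python
import json


def set_provider(record, provider, accession):
    """Return a copy of a metadata record with provider info set or updated."""
    updated = dict(record)
    updated["provider"] = provider
    updated["provider_accession"] = accession
    return updated


def get_provider(record):
    """Return (provider, accession); either may be None if unset."""
    return record.get("provider"), record.get("provider_accession")


def find_duplicate_accessions(records):
    """Report (provider, accession) pairs that occur more than once.

    The accession is only meant to be unique within a provider, so the
    pair is used as the key rather than the accession alone.
    """
    seen, duplicates = set(), set()
    for record in records:
        key = get_provider(record)
        if None in key:
            continue  # untagged record, nothing to check
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


# Hypothetical record; the field values are placeholders.
record = {"name": "genome A", "filename": "a.fa"}
tagged = set_provider(record, "NCBI", "ACC-0001")
print(json.dumps(tagged, indent=2))
```

A command line wrapper for setting, updating, and retrieving would then only need to load a signature's metadata, call one of these helpers, and write the result back.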

@ctb
Contributor Author

ctb commented May 17, 2017

hmm, we may also want to revisit 'type' (e.g. our current mrnaseq tag).

@ctb
Contributor Author

ctb commented Feb 25, 2018

I think this will be punted past 2.0.

@ctb
Contributor Author

ctb commented May 4, 2022

we have decided to do this using an (incrementally more standardized) "identifier" scheme that is not exactly principled but seems to work fine.

@ctb ctb closed this as completed May 4, 2022