[MRG] implement tax grep to produce identifier picklists from taxonomies #2178
@bluegenes curious if there is good default functionality that belongs in
Codecov Report
@@            Coverage Diff             @@
##           latest    #2178      +/-   ##
==========================================
+ Coverage   84.45%   84.53%   +0.07%
==========================================
  Files         130      131       +1
  Lines       15392    15458      +66
  Branches     2192     2207      +15
==========================================
+ Hits        13000    13067      +67
  Misses       2092     2092
+ Partials      300      299       -1
My ideal end point would be allowing tax grep to (optionally) work directly with signature files, assuming appropriate taxonomy file is provided (either internally or via
Simple use case: I know the
More complicated use case that would be really neat to enable: run prefetch against, e.g., a genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc.) as any match.
Actually, even if you were to keep
I have thoughts and questions!
There are two aspects of
this is on top of the general advertisement of the functionality, which is generally useful. I note in passing that I just realized that we can already use taxonomy files as picklists on ident, which I hadn't quite realized. 😂 So I think
Right... hmm. Brainstorming response - This doesn't belong in
I wonder... does this maybe belong in picklists?
or, as you allude to above, maybe once we can add default taxonomies to zipfiles, then
So this comes back around to the "special hooks" idea above a little bit, I think - you don't want to grep on a single value, you want to use a list of lineages as a picklist and extract at some specific rank.
Right... is this kind of an inverse operation? You would want to take a list of lineages (perhaps from a prefetch or gather file - note that
For example, you might:
This strikes me as a pretty useful taxonomic utility, and points at functionality that is lacking -
Taking a step back - My main concern is that I think some of the above is too fuzzy and experimental to implement now in the CLI - I would rather we write a few scripts to do it when we need it, and then once we develop more specific use cases, we can more easily figure out where it belongs in the CLI. If we implement it now, we'll get it wrong and then have a bunch of not-that-useful functionality lying around.
There are also some design principles forming in my head around the above. Thoughts -
Circling back around - maybe in addition to
Just roughing it out,
would take the matches in gather.csv, use the taxonomy in gtdb.csv, pull them back to genus level, and output a picklist. |
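The operation described above (take gather matches, look up their lineages, pull back to genus, and emit every identifier in those genera) can be sketched in plain Python. This is only an illustration, not the PR's implementation: the function names are hypothetical, the taxonomy CSV is assumed to have an `ident` column plus one column per rank, the gather CSV is assumed to carry the identifier as the first space-separated token of its `name` column, and the sample data is made up.

```python
import csv
import io

# Assumption: taxonomy CSV has an "ident" column plus one column per rank.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_prefix(row, rank):
    """Return the lineage of one taxonomy row, truncated at `rank`, as a tuple."""
    upto = RANKS.index(rank) + 1
    return tuple(row[r] for r in RANKS[:upto])


def expand_matches(gather_fp, taxonomy_fp, rank="genus"):
    """List all taxonomy idents sharing a `rank`-level lineage with any gather match."""
    tax_rows = list(csv.DictReader(taxonomy_fp))
    by_ident = {row["ident"]: row for row in tax_rows}

    wanted = set()
    for match in csv.DictReader(gather_fp):
        # Assumption: the identifier is the first token of the "name" column.
        ident = match["name"].split(" ")[0]
        if ident in by_ident:
            wanted.add(lineage_prefix(by_ident[ident], rank))

    return [row["ident"] for row in tax_rows if lineage_prefix(row, rank) in wanted]


def write_picklist(idents, out_fp):
    """Write a one-column CSV with an "ident" header."""
    writer = csv.writer(out_fp, lineterminator="\n")
    writer.writerow(["ident"])
    writer.writerows([i] for i in idents)


# Made-up sample data, for illustration only.
TAXONOMY = """\
ident,superkingdom,phylum,class,order,family,genus,species
G1,d__Bacteria,p__P,c__C,o__O,f__F,g__X,s__X one
G2,d__Bacteria,p__P,c__C,o__O,f__F,g__X,s__X two
G3,d__Bacteria,p__P,c__C,o__O,f__F,g__Y,s__Y one
"""
GATHER = """\
name,f_unique_weighted
G1 made-up genome description,0.5
"""

out = io.StringIO()
write_picklist(expand_matches(io.StringIO(GATHER), io.StringIO(TAXONOMY)), out)
print(out.getvalue())
```

Comparing whole lineage prefixes rather than the genus name alone avoids merging two taxa that happen to share a name under different parents.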
yep, this is all I meant by that! grep the taxonomy file for the desired taxon (all names should be unique, right?) --> desired tax picklist, just re-add the header. But I think
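The "grep the taxonomy file, then re-add the header" idea can be sketched with the standard library alone. This is a rough illustration under assumptions, not the code in this PR: `tax_grep_rows` and its parameters are hypothetical names, and the taxonomy is assumed to be a CSV with a header row.

```python
import csv
import io
import re


def tax_grep_rows(pattern, taxonomy_fp, out_fp, ignore_case=False, invert=False):
    """Copy the header plus every row with a field matching `pattern`; return the row count."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    reader = csv.DictReader(taxonomy_fp)
    writer = csv.DictWriter(out_fp, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()  # the header is always kept, so the output is a usable CSV

    kept = 0
    for row in reader:
        hit = any(regex.search(v) for v in row.values() if isinstance(v, str))
        if hit != invert:  # invert=True keeps only the non-matching rows
            writer.writerow(row)
            kept += 1
    return kept


# Made-up sample data, for illustration only.
TAXONOMY = """\
ident,superkingdom,phylum,class,order,family,genus,species
G1,d__Bacteria,p__P,c__C,o__O,f__F,g__X,s__X one
G2,d__Bacteria,p__P,c__C,o__O,f__F,g__X,s__X two
G3,d__Bacteria,p__P,c__C,o__O,f__F,g__Y,s__Y one
"""

picklist = io.StringIO()
n_kept = tax_grep_rows(r"g__X$", io.StringIO(TAXONOMY), picklist)
print(n_kept)
print(picklist.getvalue())
```

Because `regex.search` matches substrings, a bare name can hit more than one taxon; anchoring the pattern (as with `$` here) is one way to keep the match exact.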
Note -- If we're providing the taxonomy file to
Additional use case: use these picklists with
agreed -- this would be useful across the board. Expanding the
I agree that the genome identifiers are most important -- it's just helpful to use their associated information to subselect the database. But I do think we still have some problems with genome identifiers: first, we don't currently match across
Ready for review @bluegenes, but no hurry ;)
🎉
Wider question: do we want to support gzipped csv all across the board? e.g. picklists, etc?
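If gzipped CSV were supported everywhere, one possible approach is a single open helper that every CSV reader and writer goes through. The sketch below is an assumption about how that could look, not how sourmash does it: `open_csv` is a hypothetical name, and the choice is made purely on the `.gz` filename suffix.

```python
import csv
import gzip
import os
import tempfile


def open_csv(path, mode="r"):
    """Open a CSV in text mode for reading or writing, via gzip if the name ends in .gz."""
    if mode not in ("r", "w"):
        raise ValueError(f"unsupported mode: {mode!r}")
    if str(path).endswith(".gz"):
        # gzip.open accepts text modes ("rt"/"wt") with encoding and newline arguments.
        return gzip.open(path, mode + "t", newline="", encoding="utf-8")
    return open(path, mode, newline="", encoding="utf-8")


# Round-trip the same rows through a plain and a gzipped file.
rows = [["ident"], ["G1"], ["G2"]]
roundtrip = {}
with tempfile.TemporaryDirectory() as tmp:
    for name in ("picklist.csv", "picklist.csv.gz"):
        path = os.path.join(tmp, name)
        with open_csv(path, "w") as fp:
            csv.writer(fp).writerows(rows)
        with open_csv(path) as fp:
            roundtrip[name] = list(csv.reader(fp))

print(roundtrip)
```

Dispatching on the suffix keeps callers unchanged; sniffing the gzip magic bytes instead would also handle misnamed files, at the cost of a slightly more involved read path.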
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Implements sourmash tax grep. See new documentation section from this PR.
Fixes #1592
Fixes #1868
Implements taxonomy csv.gz support in loading and saving #2012
TODO:
Example:
where out.csv contains: