create utility (tax extract?) to work with taxonomic annotation/output => picklists #2187

Open
ctb opened this issue Aug 7, 2022 · 0 comments

ctb commented Aug 7, 2022

From #2178,

@bluegenes:

A more complicated use case that would be really neat to enable: run prefetch against, e.g., a genus-level representative database. Then run gather and use the prefetch output CSV as a picklist, but select all signatures in the same genera (or families, etc.) as any match.

Actually, even if you were to keep tax grep as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.

I responded:

Right... is this kind of an inverse operation?

You would want to take a list of lineages (perhaps from a prefetch or gather file - note that sourmash tax only deals with gather files for now) and then build a taxonomy? or a picklist? that expands those matches to another level.

For example, you might:

* run gather

* annotate gather results with taxonomy using `sourmash tax annotate` => strain level

* 🪄 somehow 🪄 go from the lineages in the annotated gather file to a more general set of lineages at (say) the genus level
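The "somehow" step could be as simple as truncating each lineage at the target rank. Here is a minimal sketch, assuming lineages are semicolon-separated strings ordered from superkingdom down (GTDB-style). The `truncate_lineage` helper and the `RANKS` tuple are hypothetical names for illustration, not part of sourmash:

```python
# Hypothetical helper, not a sourmash API: cut a lineage string at a given rank.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def truncate_lineage(lineage, rank, ranks=RANKS):
    """Return the lineage truncated at `rank` (inclusive)."""
    names = lineage.split(";")
    keep = ranks.index(rank) + 1  # raises ValueError if the rank is unknown
    if len(names) < keep:
        raise ValueError(f"lineage {lineage!r} has no {rank} entry")
    return ";".join(names[:keep])


# Two example lineages that differ only at the species level.
matched_lineages = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;"
    "f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli",
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;"
    "f__Enterobacteriaceae;g__Escherichia;s__Escherichia fergusonii",
]

# Using a set collapses lineages that become identical after truncation.
genus_level = {truncate_lineage(lin, "genus") for lin in matched_lineages}
```

Collecting the truncated lineages in a set gives the "more general set of lineages" directly, since matches within the same genus collapse to a single entry.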

This strikes me as a pretty useful taxonomic utility, and it points at functionality that is lacking:

* we don't really have anything that parses the annotated gather file, other than the metacoder example in #2041
* we don't have any way to manipulate a taxonomy file in bulk, e.g. "give me all of the lineages from taxfile1 that match at the genus level to the genomes/lineages in taxfile2."

... elided ...

maybe in addition to tax grep which works on a single match, we want a bulk matching function that takes in some format that links identifiers and taxonomies (annotated gather file? and/or taxonomy file?) as well as a taxonomy database, and then outputs picklists. "Promote these matches from strain to genus level" is one specific example here.

Just roughing it out,

`sourmash tax extract -g gather.csv -t gtdb.csv -r genus -o picklist.csv`

This would take the matches in gather.csv, look them up in the taxonomy in gtdb.csv, pull them back to genus level, and output a picklist.
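To make the idea concrete, here is a rough sketch of what such a command might do internally. Everything here is an assumption for illustration: `tax extract` does not exist yet, `extract_picklist` and `write_picklist` are hypothetical names, the taxonomy CSV is assumed to have an `ident` column plus one column per rank, the gather CSV is assumed to have a `name` column that starts with the identifier, and the accessions are made up:

```python
import csv
import io


def extract_picklist(gather_fp, taxonomy_fp, rank):
    """Return idents of all taxonomy entries that share `rank` with any gather match."""
    # Map ident -> taxonomy row (assumed columns: ident plus one per rank).
    taxonomy = {row["ident"]: row for row in csv.DictReader(taxonomy_fp)}

    # Collect the rank values (e.g. genus names) seen among the gather matches.
    wanted = set()
    for row in csv.DictReader(gather_fp):
        ident = row["name"].split(" ")[0]  # assumed: name starts with the identifier
        tax = taxonomy.get(ident)
        if tax and tax[rank]:  # skip unknown idents and empty rank values
            wanted.add(tax[rank])

    # Expand: every database entry whose rank value was seen above.
    return sorted(i for i, tax in taxonomy.items() if tax[rank] in wanted)


def write_picklist(idents, out_fp):
    """Write a one-column CSV with an `ident` header."""
    writer = csv.writer(out_fp)
    writer.writerow(["ident"])
    writer.writerows([i] for i in idents)


# Toy inputs with made-up accessions; real files have many more columns.
taxonomy_csv = io.StringIO(
    "ident,family,genus,species\n"
    "GCF_000000001.1,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli\n"
    "GCF_000000002.1,f__Enterobacteriaceae,g__Escherichia,s__Escherichia fergusonii\n"
    "GCF_000000003.1,f__Enterobacteriaceae,g__Salmonella,s__Salmonella enterica\n"
)
gather_csv = io.StringIO("name\nGCF_000000001.1 Escherichia coli example genome\n")

picks = extract_picklist(gather_csv, taxonomy_csv, "genus")
out = io.StringIO()
write_picklist(picks, out)
```

In this toy example the single *E. coli* match is promoted to both *Escherichia* entries, while the *Salmonella* entry is left out. Matching on the rank column alone assumes names are unique at that rank; comparing the full lineage down to that rank would be safer. The output could presumably be used with something like `--picklist picklist.csv:ident:ident`.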

@bluegenes:

I think tax extract would be very useful and get us to the second use case!!!

1. to select all members of specific family: `tax grep family_name` --> picklist

2. to promote prefetch matches to genus level: `tax annotate` --> `tax extract` --> picklist

Note -- If we're providing the taxonomy file to tax extract, we could even just do the tax annotate step internally to avoid needing to run an extra step.

Additional use case: using these picklists with exclude would let us easily exclude entire taxonomic groups from a search, e.g. for testing taxonomic classification.
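As a small illustration of the include/exclude distinction, here is a sketch of the selection logic on plain lists of identifiers. The `apply_picklist` function is hypothetical and only mimics the idea; it is not how sourmash implements picklists, and the identifiers are made up:

```python
def apply_picklist(idents, picklist, pickstyle="include"):
    """Filter `idents` by a picklist, keeping or dropping the listed entries."""
    picked = set(picklist)
    if pickstyle == "include":
        return [i for i in idents if i in picked]
    if pickstyle == "exclude":
        return [i for i in idents if i not in picked]
    raise ValueError(f"unknown pickstyle: {pickstyle!r}")


# Made-up identifiers: a database of three genomes, two of them in one genus.
database = ["GCF_000000001.1", "GCF_000000002.1", "GCF_000000003.1"]
genus_picklist = ["GCF_000000001.1", "GCF_000000002.1"]

# Excluding the whole genus leaves only the genome outside it.
remaining = apply_picklist(database, genus_picklist, pickstyle="exclude")
```

For a classification test, one could then search a query from the excluded genus against only the remaining entries and check what rank it is assigned to.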
