[MRG] implement tax grep to produce identifier picklists from taxonomies #2178

Merged: 19 commits, Aug 8, 2022


@ctb ctb commented Aug 4, 2022

Implements sourmash tax grep.

See new documentation section from this PR.

Fixes #1592
Fixes #1868
Implements taxonomy csv.gz support in loading and saving #2012

TODO:

  • update/condense description; create new issue(s)
  • provide link to rtd
  • implement '-v/--invert-match' to invert search
  • implement '-i/--ignore-case' for case insensitive
  • implement '-c/--count' for count only

Example:

% sourmash tax grep Shew -t tests/test-data/tax/test.taxonomy.csv  -o out.csv

== This is sourmash version 4.4.4.dev4+gceb590ce. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

searching 1 taxonomy files for 'Shew'
found 2 matches; saved identifiers to picklist file 'out.csv'

where out.csv contains:

ident,superkingdom,phylum,class,order,family,genus,species
GCF_000017325.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
GCF_000021665.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
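For readers who want to see the idea in code, here is a minimal sketch of a taxonomy "grep" that writes a picklist. It is not the implementation in this PR: the function names (`grep_taxonomy`, `write_picklist`) are made up for illustration, the sketch searches every lineage column but not `ident` (an assumption), and the `MADE_UP_1` row is invented so that there is something that does not match.

```python
import csv
import io
import re

# Two rows taken from the example output above, plus one invented row
# (MADE_UP_1) that should not match 'Shew'.
TAXONOMY_CSV = """\
ident,superkingdom,phylum,class,order,family,genus,species
GCF_000017325.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
GCF_000021665.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
MADE_UP_1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli
"""


def grep_taxonomy(rows, pattern):
    """Yield rows in which any lineage column matches the regex `pattern`."""
    regex = re.compile(pattern)
    for row in rows:
        # Assumption: search the lineage columns only, not the identifier.
        if any(regex.search(value) for rank, value in row.items() if rank != "ident"):
            yield row


def write_picklist(rows, fieldnames, out):
    """Write rows as CSV with a header; return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


reader = csv.DictReader(io.StringIO(TAXONOMY_CSV))
picklist = io.StringIO()
n_matches = write_picklist(grep_taxonomy(reader, "Shew"), reader.fieldnames, picklist)
print(f"found {n_matches} matches")
```

Because the output keeps the `ident` column and the original header, it has the same shape as the input taxonomy file.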


ctb commented Aug 4, 2022

@bluegenes curious if there is good default functionality that belongs in tax grep! It's already useful 😆


codecov bot commented Aug 4, 2022

Codecov Report

Merging #2178 (8ce3be4) into latest (6ac4862) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2178      +/-   ##
==========================================
+ Coverage   84.45%   84.53%   +0.07%     
==========================================
  Files         130      131       +1     
  Lines       15392    15458      +66     
  Branches     2192     2207      +15     
==========================================
+ Hits        13000    13067      +67     
  Misses       2092     2092              
+ Partials      300      299       -1     
Flag     Coverage Δ
python   91.88% <100.00%> (+0.05%) ⬆️
rust     65.29% <ø> (ø)


Impacted Files                      Coverage Δ
src/sourmash/cli/tax/__init__.py    100.00% <100.00%> (ø)
src/sourmash/cli/tax/grep.py        100.00% <100.00%> (ø)
src/sourmash/tax/__main__.py        90.15% <100.00%> (+1.67%) ⬆️
src/sourmash/tax/tax_utils.py       98.27% <100.00%> (+0.18%) ⬆️



bluegenes commented Aug 4, 2022

My ideal end point would be allowing tax grep to (optionally) work directly with signature files, assuming an appropriate taxonomy file is provided (either internally or via -t). This would allow subsetting a database directly (skip picklist) for gather, etc. Otherwise tax grep just seems like a slightly more specific version of regular grep? Still very useful, though.

it would be really neat to allow signature selection on the taxonomy, e.g. --include s__Phaeobacter or --tax-include ... Yes, we can currently do this with sig grep or picklists using the taxonomy file, but seamless integration would be great.
#2154 (comment)

Simple use case: I know the family of my organism and just want to get genbank/gtdb matches to anything in that family.

More complicated use case that would be really neat to enable: run prefetch against, e.g. genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc) as any match.

Actually, even if you were to keep tax grep as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.


ctb commented Aug 5, 2022

I have thoughts and questions!

My ideal end point would be allowing tax grep to (optionally) work directly with signature files, assuming an appropriate taxonomy file is provided (either internally or via -t). This would allow subsetting a database directly (skip picklist) for gather, etc. Otherwise tax grep just seems like a slightly more specific version of regular grep? Still very useful, though.

There are two aspects of tax grep as a "slightly more specific version of regular grep" that I wanted to highlight -

  • the ability to restrict to a specific rank is useful!
  • it works on SQLite tax dbs
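To make the two points above concrete, here is a rough sketch of a rank-restricted match against a SQLite table. The table name `taxonomy`, its column layout, and the function `idents_matching_rank` are assumptions for illustration only; they are not sourmash's actual SQLite schema or API. The `MADE_UP_1` row is invented.

```python
import sqlite3

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def make_demo_db():
    """Build an in-memory database with a hypothetical `taxonomy` table."""
    conn = sqlite3.connect(":memory:")
    # Rank names are quoted because "order" is an SQL keyword.
    columns = ", ".join(f'"{rank}" TEXT' for rank in RANKS)
    conn.execute(f"CREATE TABLE taxonomy (ident TEXT PRIMARY KEY, {columns})")
    rows = [
        ("GCF_000017325.1", "d__Bacteria", "p__Proteobacteria",
         "c__Gammaproteobacteria", "o__Enterobacterales",
         "f__Shewanellaceae", "g__Shewanella", "s__Shewanella baltica"),
        # Invented row: 'Shew' appears only in the species name.
        ("MADE_UP_1", "d__Bacteria", "p__Madeup", "c__Madeup", "o__Madeup",
         "f__Madeup", "g__Madeup", "s__Madeup Shewoid"),
    ]
    conn.executemany("INSERT INTO taxonomy VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def idents_matching_rank(conn, rank, substring):
    """Return identifiers whose lineage contains `substring` at `rank` only."""
    if rank not in RANKS:
        # Column names cannot be bound as parameters, so validate them first.
        raise ValueError(f"unknown rank: {rank!r}")
    query = f'SELECT ident FROM taxonomy WHERE instr("{rank}", ?) > 0 ORDER BY ident'
    return [row[0] for row in conn.execute(query, (substring,))]


conn = make_demo_db()
genus_hits = idents_matching_rank(conn, "genus", "Shew")
species_hits = idents_matching_rank(conn, "species", "Shew")
```

Restricting to the genus column skips the invented row, whereas a plain text grep over the whole line would pick it up; that is the practical difference from regular grep.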

This is on top of simply advertising the functionality, which is useful in general.

I note in passing that I just realized that we can already use taxonomy files as picklists on ident, which I hadn't quite realized. 😂

So I think tax grep as currently envisioned is useful :).

it would be really neat to allow signature selection on the taxonomy, e.g. --include s__Phaeobacter or --tax-include ... Yes, we can currently do this with sig grep or picklists using the taxonomy file, but seamless integration would be great.
#2154 (comment)

Simple use case: I know the family of my organism and just want to get genbank/gtdb matches to anything in that family.

Right... hmm. Brainstorming response -

This doesn't belong in sourmash tax which is specific to taxonomy manipulations - where this really belongs is somewhere generic to all the other sourmash subcommands, as with picklists and --include/--exclude. We want the rest of sourmash to "understand" taxonomies when they are present.

I wonder... does this maybe belong in picklists?

  • taxonomy CSV files can be picklists already, and we could add something to recognize SQLite databases here;
  • you could imagine adding special hooks of some kind to the picklist API... this is janky, but something like --picklist tax.csv::taxonomy. (But this kind of perverts the semantics of picklists, where tax.csv is a list of all the things you want to select. 🤔 )

or, as you allude to above, maybe once we can add default taxonomies to zipfiles, then include/exclude can just recognize that they're there and search taxonomies too.

More complicated use case that would be really neat to enable: run prefetch against, e.g. genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc) as any match.

So this comes back around to the "special hooks" idea above a little bit, I think - you don't want to grep on a single value, you want to use a list of lineages as a picklist and extract at some specific rank.

Actually, even if you were to keep tax grep as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.

Right... is this kind of an inverse operation?

You would want to take a list of lineages (perhaps from a prefetch or gather file - note that sourmash tax only deals with gather files for now) and then build a taxonomy? or a picklist? that expands those matches to another level.

For example, you might:

  • run gather
  • annotate gather results with taxonomy using sourmash tax annotate => strain level
  • 🪄 somehow 🪄 go from the lineages in the annotated gather file to a more general set of lineages at (say) the genus level

This strikes me as a pretty useful taxonomic utility, and points at functionality that is lacking -

  • we don't really have anything that parses the annotated gather file, other than the metacoder example in visualizing the sourmash gather output with metacoder #2041
  • we don't have any way to manipulate a "bulk" taxonomy file in bulk ways, e.g. "give me all of the lineages from taxfile1 that match at the genus level to the genomes/lineages in taxfile2".

Taking a step back -

My main concern is that I think some of the above is too fuzzy and experimental to implement now in the CLI - I would rather we write a few scripts to do it when we need it, and then once we develop more specific use cases, we can more easily figure out where it belongs in the CLI. If we implement it now we'll get it wrong and then have a bunch of not-that-useful functionality lying around.

There are also some design principles forming in my head around the above. Thoughts -

  • most of our use of taxonomy files and databases so far has been taxonomy-wide - like "here's the GTDB taxonomy! All 320,000 entries!" - and I, at least, haven't been thinking about subsetting or subsampling them. I think a lot of the ideas above come from breaking through from this view into one where subsetting the taxonomy files starts happening.
  • there's also a distinction developing for me - IMO we want to retain our handle on the genome identifiers, and not pay too much attention to just the taxonomic lineages. The genome identifiers are actually pretty useful!

Circling back around -

maybe in addition to tax grep which works on a single match, we want a bulk matching function that takes in some format that links identifiers and taxonomies (annotated gather file? and/or taxonomy file?) as well as a taxonomy database, and then outputs picklists. "Promote these matches from strain to genus level" is one specific example here.

Just roughing it out,

sourmash tax extract -g gather.csv -t gtdb.csv -r genus -o picklist.csv

would take the matches in gather.csv, use the taxonomy in gtdb.csv, pull them back to genus level, and output a picklist.
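As a rough sketch of what "pull them back to genus level" could mean: collect the genus of every matched identifier, then select every identifier in the taxonomy that shares one of those genera. `sourmash tax extract` is only a proposal in this thread, so the function name `promote_to_rank` and the in-memory dict layout below are purely hypothetical; `MADE_UP_1` is an invented entry.

```python
def promote_to_rank(match_idents, taxonomy, rank):
    """Expand matched identifiers to all identifiers sharing their `rank`.

    `taxonomy` maps ident -> {rank name: lineage name}. Matches that are
    missing from the taxonomy are silently skipped (an assumption).
    """
    wanted = {taxonomy[ident][rank] for ident in match_idents if ident in taxonomy}
    return sorted(ident for ident, lineage in taxonomy.items() if lineage[rank] in wanted)


# The first two entries come from the example taxonomy in this PR; the
# third is invented.
taxonomy = {
    "GCF_000017325.1": {"family": "f__Shewanellaceae", "genus": "g__Shewanella"},
    "GCF_000021665.1": {"family": "f__Shewanellaceae", "genus": "g__Shewanella"},
    "MADE_UP_1": {"family": "f__Enterobacteriaceae", "genus": "g__Escherichia"},
}

# Pretend gather matched a single Shewanella genome.
picklist_idents = promote_to_rank(["GCF_000017325.1"], taxonomy, "genus")
```

The result is still a list of genome identifiers rather than lineages, which fits the point above about keeping a handle on the identifiers.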


bluegenes commented Aug 5, 2022

I note in passing that I just realized that we can already use taxonomy files as picklists on ident, which I hadn't quite realized. 😂

yep, this is all I meant by that! grep the taxonomy file for the desired taxon (all names should be unique, right?) --> desired tax picklist, just re-add the header. But tax grep is easier and safer, and you make a great point about this working on sqldb, which will be very handy going forward when we include taxonomy in databases by default 🙂 .

I think tax extract would be very useful and get us to the second use case!!!

  1. to select all members of specific family: tax grep family_name --> picklist
  2. to promote prefetch matches to genus level: tax annotate --> tax extract --> picklist

Note -- If we're providing the taxonomy file to tax extract, we could even just do the tax annotate step internally to avoid needing to run an extra step.

Additional use case: using these picklists with exclude allows us to easily exclude entire taxonomic groups from search, e.g. for testing taxonomic classification.

This doesn't belong in sourmash tax which is specific to taxonomy manipulations - where this really belongs is somewhere generic to all the other sourmash subcommands, as with picklists and --include/--exclude. We want the rest of sourmash to "understand" taxonomies when they are present.

agreed -- this would be useful across the board. Expanding the --include to search taxonomy could be really convenient, but not needed for now. A combination of tax grep and tax extract with picklists can get us to both use cases.

I agree that the genome identifiers are most important -- it's just helpful to use their associated information to subselect the database. But I do think we still have some problems with genome identifiers: first, we don't currently match across GCA and GCF, and second, we don't have explicitly written rules for identifiers for custom databases (e.g. spaces and periods have specific meanings and will change how we parse your identifier). I should perhaps throw these in another issue... (edit: see #2181).
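To illustrate the two identifier problems, here is a small sketch. The parsing rules in it are assumptions, not sourmash's documented behaviour: the identifier is taken to be the first space-delimited token of a name, the version is whatever follows the first period, and a GCA/GCF pair is treated as "the same assembly" when the part after the prefix is equal. Both function names are hypothetical.

```python
def split_ident(name):
    """Split a signature name into (accession, version).

    Assumed rules: the identifier ends at the first space, and the
    version is the part after the first period in the identifier.
    """
    ident = name.split(" ", 1)[0]
    accession, _, version = ident.partition(".")
    return accession, version


def same_assembly(name_a, name_b):
    """Compare two names while ignoring the GCA_/GCF_ prefix and version."""
    def core(accession):
        if accession.startswith(("GCA_", "GCF_")):
            return accession[4:]
        return accession

    return core(split_ident(name_a)[0]) == core(split_ident(name_b)[0])


parsed = split_ident("GCF_000017325.1 Shewanella baltica")
cross_match = same_assembly("GCA_000017325.1", "GCF_000017325.1 Shewanella baltica")
```

Under these rules a custom identifier such as `my sample.v2` would be parsed as accession `my` with an empty version, which is the kind of surprise that written rules would help avoid.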

@ctb ctb changed the title [WIP] implement tax grep to produce identifier picklists from taxonomies [MRG] implement tax grep to produce identifier picklists from taxonomies Aug 7, 2022

ctb commented Aug 7, 2022

Ready for review @bluegenes, but no hurry ;)


@bluegenes bluegenes left a comment

🎉

Wider question: do we want to support gzipped csv all across the board? e.g. picklists, etc?
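One way to get gzip support "across the board" is a single helper that every CSV reader goes through. A minimal sketch, assuming detection by the gzip magic bytes rather than by file extension; `open_csv_maybe_gzipped` is a hypothetical name, not an existing sourmash function:

```python
import csv
import gzip
import os
import tempfile


def open_csv_maybe_gzipped(path):
    """Open `path` for text reading, decompressing if it is gzipped."""
    with open(path, "rb") as fp:
        magic = fp.read(2)
    if magic == b"\x1f\x8b":  # gzip files start with these two bytes
        return gzip.open(path, "rt", newline="", encoding="utf-8")
    return open(path, "r", newline="", encoding="utf-8")


def read_idents(path):
    """Return the `ident` column of a plain or gzipped CSV file."""
    with open_csv_maybe_gzipped(path) as fp:
        return [row["ident"] for row in csv.DictReader(fp)]


# Demo: write the same tiny picklist plain and gzipped, then read both.
content = "ident\nGCF_000017325.1\nGCF_000021665.1\n"
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "picklist.csv")
gz_path = os.path.join(tmpdir, "picklist.csv.gz")
with open(plain_path, "w", newline="", encoding="utf-8") as fp:
    fp.write(content)
with gzip.open(gz_path, "wt", newline="", encoding="utf-8") as fp:
    fp.write(content)

plain_idents = read_idents(plain_path)
gz_idents = read_idents(gz_path)
```

Sniffing the content means a mis-named file (gzipped but without `.gz`) still loads, at the cost of opening the file twice.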

doc/command-line.md (outdated review comment, resolved)
Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>
Successfully merging this pull request may close these issues.

add a tax grep command? create a taxonomy module function to output picklists
2 participants