[MRG] implement `tax grep` to produce identifier picklists from taxonomies #2178

ctb · 2022-08-04T13:06:24Z

Implements sourmash tax grep.

See new documentation section from this PR.

Fixes #1592
Fixes #1868
Implements taxonomy csv.gz support in loading and saving #2012

TODO:

update/condense description; create new issue(s)
provide link to rtd
implement '-v/--invert-match' to invert search
implement '-i/--ignore-case' for case insensitive
implement '-c/--count' for count only

Example:

% sourmash tax grep Shew -t tests/test-data/tax/test.taxonomy.csv  -o out.csv

== This is sourmash version 4.4.4.dev4+gceb590ce. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

searching 1 taxonomy files for 'Shew'
found 2 matches; saved identifiers to picklist file 'out.csv'

where out.csv contains:

ident,superkingdom,phylum,class,order,family,genus,species
GCF_000017325.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
GCF_000021665.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
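
For readers who want to see the idea in code, here is a minimal sketch of the kind of search shown in the example. It is not the sourmash implementation: the function names `grep_taxonomy` and `write_picklist` are made up for illustration, the third taxonomy row is invented so that something fails to match, and the `invert` and `ignore_case` parameters only mimic the `-v` and `-i` options listed in the TODO.

```python
import csv
import io
import re

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

# The first two rows are the out.csv rows from the example; the third row is
# invented purely so that there is something that does NOT match.
TAXONOMY_CSV = """\
ident,superkingdom,phylum,class,order,family,genus,species
GCF_000017325.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
GCF_000021665.1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Shewanellaceae,g__Shewanella,s__Shewanella baltica
EXAMPLE_0001.1,d__Bacteria,p__Firmicutes,c__Bacilli,o__Bacillales,f__Bacillaceae,g__Bacillus,s__Bacillus subtilis
"""


def grep_taxonomy(rows, pattern, ranks=None, ignore_case=False, invert=False):
    """Return rows whose lineage matches `pattern` at any of the given ranks."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    ranks = ranks or RANKS
    selected = []
    for row in rows:
        found = any(regex.search(row.get(rank) or "") for rank in ranks)
        # With invert=True, keep only the rows that did not match.
        if found != invert:
            selected.append(row)
    return selected


def write_picklist(rows, fp):
    """Write matching rows as a CSV with an 'ident' column plus the lineage."""
    writer = csv.DictWriter(fp, fieldnames=["ident"] + RANKS)
    writer.writeheader()
    writer.writerows(rows)


rows = list(csv.DictReader(io.StringIO(TAXONOMY_CSV)))
matches = grep_taxonomy(rows, "Shew")

out = io.StringIO()
write_picklist(matches, out)
print(f"found {len(matches)} matches")
print(out.getvalue())
```

A count-only mode would simply report `len(matches)` instead of writing the file, and passing `ranks=["genus"]` restricts the search to a single rank.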

ctb · 2022-08-04T13:08:26Z

@bluegenes curious if there is good default functionality that belongs in tax grep! It's already useful 😆

codecov · 2022-08-04T13:13:51Z

Codecov Report

Merging #2178 (8ce3be4) into latest (6ac4862) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2178      +/-   ##
==========================================
+ Coverage   84.45%   84.53%   +0.07%     
==========================================
  Files         130      131       +1     
  Lines       15392    15458      +66     
  Branches     2192     2207      +15     
==========================================
+ Hits        13000    13067      +67     
  Misses       2092     2092              
+ Partials      300      299       -1

| Flag | Coverage Δ |
| --- | --- |
| python | `91.88% <100.00%> (+0.05%)` ⬆️ |
| rust | `65.29% <ø> (ø)` |


| Impacted Files | Coverage Δ |
| --- | --- |
| src/sourmash/cli/tax/__init__.py | `100.00% <100.00%> (ø)` |
| src/sourmash/cli/tax/grep.py | `100.00% <100.00%> (ø)` |
| src/sourmash/tax/__main__.py | `90.15% <100.00%> (+1.67%)` ⬆️ |
| src/sourmash/tax/tax_utils.py | `98.27% <100.00%> (+0.18%)` ⬆️ |


bluegenes · 2022-08-04T15:50:11Z

My ideal end point would be allowing tax grep to (optionally) work directly with signature files, assuming appropriate taxonomy file is provided (either internally or via -t). This would allow subsetting a database directly (skip picklist) for gather, etc. Otherwise tax grep just seems like a slightly more specific version of regular grep? Still very useful, though.

> it would be really neat to allow signature selection on the taxonomy, e.g. --include s__Phaeobacter or --tax-include ... Yes, we can currently do this with sig grep or picklists using the taxonomy file, but seamless integration would be great.
> #2154 (comment)

Simple use case: I know the family of my organism and just want to get genbank/gtdb matches to anything in that family.

More complicated use case that would be really neat to enable: run prefetch against, e.g. genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc) as any match.

Actually, even if you were to keep tax grep as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.

ctb · 2022-08-05T10:41:58Z

I have thoughts and questions!

> My ideal end point would be allowing tax grep to (optionally) work directly with signature files, assuming appropriate taxonomy file is provided (either internally or via -t). This would allow subsetting a database directly (skip picklist) for gather, etc. Otherwise tax grep just seems like a slightly more specific version of regular grep? Still very useful, though.

There are two aspects of tax grep as a "slightly more specific version of regular grep" that I wanted to highlight -

the ability to restrict to a specific rank is useful!
it works on SQLite tax dbs

this is on top of the general advertisement of the functionality, which is generally useful.

I note in passing that I just realized that we can already use taxonomy files as picklists on ident, which I hadn't quite realized. 😂

So I think tax grep as currently envisioned is useful :).
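
To make the two points above concrete, here is a rough sketch of a rank-restricted search against a SQLite table. Everything about the schema is an assumption made for illustration: the table name `taxonomy`, its columns, and the helper `grep_sqlite_taxonomy` are hypothetical and are not taken from sourmash's SQLite taxonomy format. The two rows are the ones from the example at the top of this PR.

```python
import sqlite3

RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")

# Hypothetical schema: one row per identifier, one column per rank.
# Column names are quoted because "order" is a reserved word in SQL.
conn = sqlite3.connect(":memory:")
columns = ", ".join(f'"{rank}" TEXT' for rank in RANKS)
conn.execute(f"CREATE TABLE taxonomy (ident TEXT PRIMARY KEY, {columns})")

lineage = (
    "d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria",
    "o__Enterobacterales", "f__Shewanellaceae", "g__Shewanella",
    "s__Shewanella baltica",
)
conn.executemany(
    "INSERT INTO taxonomy VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [("GCF_000017325.1", *lineage), ("GCF_000021665.1", *lineage)],
)


def grep_sqlite_taxonomy(conn, text, rank):
    """Return identifiers whose name at `rank` contains the substring `text`."""
    # The rank is checked against a fixed list before being placed in the
    # query, since column names cannot be passed as SQL parameters.
    if rank not in RANKS:
        raise ValueError(f"unknown rank: {rank!r}")
    query = f'SELECT ident FROM taxonomy WHERE instr("{rank}", ?) > 0 ORDER BY ident'
    return [row[0] for row in conn.execute(query, (text,))]


print(grep_sqlite_taxonomy(conn, "Shew", "genus"))
print(grep_sqlite_taxonomy(conn, "Entero", "order"))
print(grep_sqlite_taxonomy(conn, "Entero", "genus"))
```

The last call shows why restricting to a rank matters: "Entero" appears in the order name of both genomes but in neither genus name, so a plain text grep over the CSV would return rows that a genus-level search should not.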

> it would be really neat to allow signature selection on the taxonomy, e.g. --include s__Phaeobacter or --tax-include ... Yes, we can currently do this with sig grep or picklists using the taxonomy file, but seamless integration would be great.
> #2154 (comment)

> Simple use case: I know the family of my organism and just want to get genbank/gtdb matches to anything in that family.

Right... hmm. Brainstorming response -

This doesn't belong in sourmash tax, which is specific to taxonomy manipulations - where this really belongs is somewhere generic to all the other sourmash subcommands, as with picklists and --include/--exclude. We want the rest of sourmash to "understand" taxonomies when they are present.

I wonder... does this maybe belong in picklists?

taxonomy CSV files can be picklists already, and we could add something to recognize SQLite databases here;
you could imagine adding special hooks of some kind to the picklist API... this is janky, but something like --picklist tax.csv::taxonomy. (But this kind of perverts the semantics of picklists, where tax.csv is a list of all the things you want to select. 🤔 )

or, as you allude to above, maybe once we can add default taxonomies to zipfiles, then include/exclude can just recognize that they're there and search taxonomies too.

> More complicated use case that would be really neat to enable: run prefetch against, e.g. genus-level representative database. Then run gather and use the prefetch output csv as a picklist, but select all signatures in the same genera (or family, etc) as any match.

So this comes back around to the "special hooks" idea above a little bit, I think - you don't want to grep on a single value, you want to use a list of lineages as a picklist and extract at some specific rank.

> Actually, even if you were to keep tax grep as just a picklist utility, being able to scale up from matches to all members of the taxonomic group could be pretty neat.

Right... is this kind of an inverse operation?

You would want to take a list of lineages (perhaps from a prefetch or gather file - note that sourmash tax only deals with gather files for now) and then build a taxonomy? or a picklist? that expands those matches to another level.

For example, you might:

run gather
annotate gather results with taxonomy using sourmash tax annotate => strain level
🪄 somehow 🪄 go from the lineages in the annotated gather file to a more general set of lineages at (say) the genus level

This strikes me as a pretty useful taxonomic utility, and points at functionality that is lacking -

we don't really have anything that parses the annotated gather file, other than the metacoder example in visualizing the sourmash gather output with metacoder #2041
we don't have any way to manipulate a "bulk" taxonomy file in bulk ways, e.g. "give me all of the lineages from taxfile1 that match at the genus level to the genomes/lineages in taxfile2."

Taking a step back -

My main concern is that I think some of the above is too fuzzy and experimental to implement now in the CLI - I would rather we write a few scripts to do it when we need it, and then once we develop more specific use cases, we can more easily figure out where it belongs in the CLI. If we implement it now, we'll get it wrong and then have a bunch of not-that-useful functionality lying around.

There are also some design principles forming in my head around the above. Thoughts -

most of our use of taxonomy files and databases so far has been taxonomy-wide - like "here's the GTDB taxonomy! All 320,000 entries!" - and I, at least, haven't been thinking about subsetting or subsampling them. I think a lot of the ideas above come from breaking through from this view into one where subsetting the taxonomy files starts happening.
there's also a distinction developing for me - IMO we want to retain our handle on the genome identifiers, and not pay too much attention to just the taxonomic lineages. The genome identifiers are actually pretty useful!

Circling back around -

maybe in addition to tax grep which works on a single match, we want a bulk matching function that takes in some format that links identifiers and taxonomies (annotated gather file? and/or taxonomy file?) as well as a taxonomy database, and then outputs picklists. "Promote these matches from strain to genus level" is one specific example here.

Just roughing it out,

sourmash tax extract -g gather.csv -t gtdb.csv -r genus -o picklist.csv

would take the matches in gather.csv, use the taxonomy in gtdb.csv, pull them back to genus level, and output a picklist.
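
As a rough illustration of the "promote matches to genus level" idea, the sketch below takes a set of matched identifiers, looks up their name at a chosen rank, and returns every identifier in the taxonomy that shares one of those names. `sourmash tax extract` is only a proposal in this comment, so nothing here reflects an existing command; the function name `promote_to_rank`, the in-memory dictionary layout, and the third taxonomy entry are all invented for the example.

```python
def promote_to_rank(match_idents, taxonomy, rank):
    """Expand matched identifiers to all identifiers sharing their `rank` name.

    `taxonomy` maps identifier -> {rank name: lineage name}.
    """
    # Collect the names (e.g. genera) seen among the matches, skipping
    # identifiers missing from the taxonomy and empty rank names.
    wanted = {
        taxonomy[ident][rank]
        for ident in match_idents
        if ident in taxonomy and taxonomy[ident].get(rank)
    }
    # Select every identifier in the taxonomy that falls under those names.
    return sorted(
        ident for ident, lineage in taxonomy.items()
        if lineage.get(rank) in wanted
    )


# Two entries from the example at the top of the PR, plus one invented entry.
taxonomy = {
    "GCF_000017325.1": {"family": "f__Shewanellaceae", "genus": "g__Shewanella"},
    "GCF_000021665.1": {"family": "f__Shewanellaceae", "genus": "g__Shewanella"},
    "EXAMPLE_0001.1": {"family": "f__Bacillaceae", "genus": "g__Bacillus"},
}

# Pretend gather matched only one of the two Shewanella genomes.
gather_matches = ["GCF_000017325.1"]

picklist = promote_to_rank(gather_matches, taxonomy, "genus")
print(picklist)
```

Because the result is a list of identifiers rather than lineages, it keeps the handle on genome identifiers discussed above and could be written out as a picklist.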

bluegenes · 2022-08-05T16:41:55Z

> I note in passing that I just realized that we can already use taxonomy files as picklists on ident, which I hadn't quite realized. 😂

yep, this is all I meant by that! grep the taxonomy file for the desired taxon (all names should be unique, right?) --> desired tax picklist, just re-add the header. But tax grep is easier and safer, and you make a great point about this working on sqldb, which will be very handy going forward when we include taxonomy in databases by default 🙂 .

I think tax extract would be very useful and get us to the second use case!!!

to select all members of specific family: tax grep family_name --> picklist
to promote prefetch matches to genus level: tax annotate --> tax extract --> picklist

Note -- If we're providing the taxonomy file to tax extract, we could even just do the tax annotate step internally to avoid needing to run an extra step.

Additional use case: using these picklists with exclude lets us easily exclude entire taxonomic groups from search, e.g. for testing taxonomic classification.
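
A small sketch of the include/exclude idea, independent of how sourmash actually applies picklists: given the identifiers from a picklist, keep or drop signatures by identifier. The helper names are hypothetical, the signature names are made up, and treating the first space-delimited token of a name as its identifier is an assumption made only for this example.

```python
def ident_of(name):
    """Assumption for this sketch: the identifier is the first
    space-delimited token of the signature name."""
    return name.split(" ")[0]


def apply_picklist(names, picklist_idents, exclude=False):
    """Keep names whose identifier is in the picklist, or drop them if exclude=True."""
    picklist_idents = set(picklist_idents)
    return [
        name for name in names
        if (ident_of(name) in picklist_idents) != exclude
    ]


# Made-up signature names; only the identifiers come from the example output.
names = [
    "GCF_000017325.1 example genome A",
    "GCF_000021665.1 example genome B",
    "EXAMPLE_0001.1 example genome C",
]
shewanella_picklist = ["GCF_000017325.1", "GCF_000021665.1"]

print(apply_picklist(names, shewanella_picklist))
print(apply_picklist(names, shewanella_picklist, exclude=True))
```

With `exclude=True`, the whole taxonomic group captured by the picklist is removed, which is the shape of the classification-testing use case.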

> This doesn't belong in sourmash tax which is specific to taxonomy manipulations - where this really belongs is somewhere generic to all the other sourmash subcommands, as with picklists and --include/--exclude. We want the rest of sourmash to "understand" taxonomies when they are present.

agreed -- this would be useful across the board. Expanding the --include to search taxonomy could be really convenient, but not needed for now. A combination of tax grep and tax extract with picklists can get us to both use cases.

I agree that the genome identifiers are most important -- it's just helpful to use their associated information to subselect the database. But I do think we still have some problems with genome identifiers: first, we don't currently match across GCA and GCF, and second, we don't have explicitly written rules for identifiers for custom databases (e.g. spaces and periods have specific meanings and will change how we parse your identifier). I should perhaps throw these in another issue... (edit: see #2181).


ctb · 2022-08-07T13:33:59Z

Ready for review @bluegenes, but no hurry ;)

bluegenes

🎉

Wider question: do we want to support gzipped csv all across the board? e.g. picklists, etc?
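
One generic way to handle both plain and gzipped CSVs is to look at the first two bytes of the file rather than trust the extension. The sketch below shows that approach under stated assumptions; it is not how sourmash loads files, and `open_csv_text` and `read_rows` are hypothetical helper names.

```python
import csv
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of any gzip stream


def open_csv_text(path):
    """Open a plain or gzipped CSV in text mode, detected by magic bytes."""
    with open(path, "rb") as fp:
        is_gzipped = fp.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzipped else open
    # newline="" is what the csv module expects for text-mode files.
    return opener(path, "rt", newline="", encoding="utf-8")


def read_rows(path):
    with open_csv_text(path) as fp:
        return list(csv.DictReader(fp))


# Write the same small CSV twice, once plain and once gzipped, and read both.
text = "ident,genus\nGCF_000017325.1,g__Shewanella\n"
with tempfile.TemporaryDirectory() as tmpdir:
    plain_path = os.path.join(tmpdir, "tax.csv")
    gz_path = os.path.join(tmpdir, "tax.csv.gz")
    with open(plain_path, "w", newline="", encoding="utf-8") as fp:
        fp.write(text)
    with gzip.open(gz_path, "wt", newline="", encoding="utf-8") as fp:
        fp.write(text)
    plain_rows = read_rows(plain_path)
    gz_rows = read_rows(gz_path)

print(plain_rows == gz_rows)
```

Sniffing the content means a gzipped file with a `.csv` name still loads, at the cost of one extra open per file.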

doc/command-line.md

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

ctb added 2 commits August 4, 2022 08:51

add basics of sourmash tax grep command

acce43b

initial implementation

6d23702

This was referenced Aug 4, 2022

create a taxonomy module function to output picklists #1592

Closed

add a tax grep command? #1868

Closed

fixed bug; added various args

15ff2c1

bluegenes mentioned this pull request Aug 5, 2022

expanding database selection methods: metadata #2180

Open

ctb added 10 commits August 6, 2022 12:39

more tests

66b0103

support gzipped taxonomy files

9a4a0b7

let 'prepare' save gzip csv files, too

d6391c7

switch to outputting lineage; refactor

26b00cf

fix and test invert match

34bfb60

more tests

de6cfa9

add --count

b56305a

comment

96533b0

upd docs

2262822

add test for tax prepare combining

b593f53

ctb mentioned this pull request Aug 7, 2022

document 'ident' properties for databases and taxonomy #2181

Open

ctb added 2 commits August 7, 2022 05:42

test multiple, duplicate tax

f13f3ec

finish? tests

1d78637

ctb added 3 commits August 7, 2022 06:25

Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…

b721680

…o tax_grep

implement --force

0f31c40

update usage string

a3914f5

ctb changed the title ~~[WIP] implement tax grep to produce identifier picklists from taxonomies~~ [MRG] implement tax grep to produce identifier picklists from taxonomies Aug 7, 2022

ctb mentioned this pull request Aug 7, 2022

support gzip and/or zipped taxonomy CSVs? #2012

Closed

bluegenes approved these changes Aug 8, 2022


Update doc/command-line.md

8ce3be4

Co-authored-by: Tessa Pierce Ward <bluegenes@users.noreply.github.com>

ctb mentioned this pull request Aug 8, 2022

should we support gzipped CSVs for all of sourmash? #2188

Closed

ctb merged commit 010718c into latest Aug 8, 2022

ctb deleted the tax_grep branch August 8, 2022 14:34

ctb mentioned this pull request Aug 12, 2022

[MRG] add generic support for gzipped and zipfile CSVs #2195

Merged

4 tasks

ctb mentioned this pull request Aug 29, 2022

draft release notes for v4.5.0 #2241

Closed
