add tax summarize #2333

Merged
ctb merged 17 commits into latest from add/tax_summarize
Nov 14, 2022

Conversation


@ctb ctb commented Oct 15, 2022

This PR adds a tax summarize command per #2212.

It also:

Tackles #2212
Tackles #2185
Tackles parts of #2326

TODO

Example output

Running on a traditional taxonomy file:

% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers
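
For readers who want to sanity-check the per-rank numbers, here is a minimal sketch of one way to count them. It is not the code in this PR; it assumes that "distinct identifiers" means distinct lineage prefixes at each rank, and `count_distinct_per_rank` plus the toy rows are made up for illustration.

```python
import csv
import io

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def count_distinct_per_rank(rows, ranks=RANKS):
    """Count distinct lineage prefixes at each rank (hypothetical helper).

    rows: iterable of dicts with one key per rank name.
    """
    seen = {rank: set() for rank in ranks}
    for row in rows:
        prefix = []
        for rank in ranks:
            name = row.get(rank, "")
            if not name:
                break  # lineage is not assigned below this rank
            prefix.append(name)
            seen[rank].add(tuple(prefix))
    return {rank: len(seen[rank]) for rank in ranks}


# Toy taxonomy CSV; the rows are illustrative, not taken from a GTDB release.
SAMPLE = """ident,superkingdom,phylum,class,order,family,genus,species
id1,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia coli
id2,d__Bacteria,p__Proteobacteria,c__Gammaproteobacteria,o__Enterobacterales,f__Enterobacteriaceae,g__Escherichia,s__Escherichia albertii
id3,d__Bacteria,p__Firmicutes,c__Bacilli,o__Lactobacillales,f__Lactobacillaceae,g__Lactobacillus,s__Lactobacillus acidophilus
"""

counts = count_distinct_per_rank(csv.DictReader(io.StringIO(SAMPLE)))
for rank in RANKS:
    print(f"rank {rank + ':':<20} {counts[rank]} distinct identifiers")
```

Counting full prefixes rather than bare names keeps two identically named taxa under different parents separate; whether the command does the same is an assumption here.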

On a gather-with-lineages file:

% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers

On the bad CSV file from #2326:

% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
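
The `\ufeff` in front of `ident` is a UTF-8 byte-order mark that ended up inside the first column name. The sketch below reproduces that effect with Python's standard `csv` module and shows that decoding with `utf-8-sig` removes it. It only illustrates the failure mode on made-up file contents; it does not describe how sourmash reads or fixes such files.

```python
import csv
import io

# Hypothetical file contents: a UTF-8 BOM followed by a CSV header and one row.
raw = "\ufeffident,taxid,superkingdom\nGCF_000001,562,d__Bacteria\n".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as part of the first column name,
# so a lookup for an 'ident' column fails.
plain = csv.DictReader(io.StringIO(raw.decode("utf-8")))
print(plain.fieldnames)

# Decoding as 'utf-8-sig' strips a leading BOM if one is present.
cleaned = csv.DictReader(io.StringIO(raw.decode("utf-8-sig")))
print(cleaned.fieldnames)
```

Re-saving the spreadsheet export without a BOM, or opening it with `encoding="utf-8-sig"` in a pre-processing script, should give a header that starts with a plain `ident`.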

CSV output of per-rank information

With CSV output,

% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'

and aaa.csv looks like:

|  | rank | count | lineage |
| --- | --- | --- | --- |
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
| 4 | family | 63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
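
A rough sketch of how per-lineage counts like these could be produced: every identifier contributes one count to each prefix of its lineage, and the rows are then sorted by count. This is an illustration under assumptions, not the PR's implementation; `lineage_counts`, the toy lineages, and the tie-breaking order are hypothetical, and only the `rank`, `count`, `lineage` column names come from the table above.

```python
import csv
import io
from collections import Counter

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_counts(lineages):
    """Count how many lineages fall under each lineage prefix (hypothetical helper)."""
    counts = Counter()
    for names in lineages:
        for depth in range(1, len(names) + 1):
            counts[tuple(names[:depth])] += 1
    return counts


# Toy lineages, one per identifier; not real database contents.
lineages = [
    ["d__Bacteria", "p__Proteobacteria", "c__Gammaproteobacteria"],
    ["d__Bacteria", "p__Proteobacteria"],
    ["d__Bacteria", "p__Firmicutes", "c__Bacilli"],
]
counts = lineage_counts(lineages)

# Write rows sorted by descending count; ties are broken by lineage name.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["rank", "count", "lineage"])
for prefix, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    writer.writerow([RANKS[len(prefix) - 1], n, ";".join(prefix)])
print(out.getvalue())
```

Because each prefix is counted separately, a parent lineage always has a count at least as large as any of its children, which matches the ordering seen in the table.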


codecov bot commented Oct 15, 2022

Codecov Report

Merging #2333 (43e094d) into latest (502f668) will increase coverage by 0.08%.
The diff coverage is 95.83%.

@@            Coverage Diff             @@
##           latest    #2333      +/-   ##
==========================================
+ Coverage   83.98%   84.06%   +0.08%     
==========================================
  Files         129      130       +1     
  Lines       14969    15059      +90     
  Branches     2192     2212      +20     
==========================================
+ Hits        12572    12660      +88     
  Misses       2103     2103              
- Partials      294      296       +2     
| Flag | Coverage Δ |
| --- | --- |
| python | 92.18% <95.83%> (+0.04%) ⬆️ |
| rust | 57.73% <ø> (ø) |


| Impacted Files | Coverage Δ |
| --- | --- |
| src/sourmash/tax/tax_utils.py | 98.11% <91.80%> (-0.21%) ⬇️ |
| src/sourmash/cli/tax/__init__.py | 100.00% <100.00%> (ø) |
| src/sourmash/cli/tax/summarize.py | 100.00% <100.00%> (ø) |
| src/sourmash/tax/__main__.py | 93.64% <100.00%> (+1.00%) ⬆️ |


@ctb ctb changed the title from "[WIP] add tax summarize" to "add tax summarize" Nov 13, 2022

ctb commented Nov 13, 2022

ready for review @bluegenes !


@bluegenes bluegenes left a comment

lgtm!!

@ctb ctb merged commit 04a9dac into latest Nov 14, 2022
@ctb ctb deleted the add/tax_summarize branch November 14, 2022 13:56