add `tax crosscheck` to compare databases and lineages for correctness #2361
ctb added a commit that referenced this issue on Nov 14, 2022
This PR adds a `tax summarize` command per #2212. It also:
* tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (#2185)
* improves error reporting output for wonky unicode formatted tax CSV files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to #2361
- [x] maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches? - punted to #2361

## Example output

Running on a traditional taxonomy file:
```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom: 2 distinct identifiers
rank phylum: 169 distinct identifiers
rank class: 419 distinct identifiers
rank order: 1312 distinct identifiers
rank family: 3264 distinct identifiers
rank genus: 12888 distinct identifiers
rank species: 47894 distinct identifiers
```

On a gather-with-lineages file:
```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom: 2 distinct identifiers
rank phylum: 25 distinct identifiers
rank class: 32 distinct identifiers
rank order: 42 distinct identifiers
rank family: 52 distinct identifiers
rank genus: 60 distinct identifiers
rank species: 84 distinct identifiers
```

On the bad CSV file from #2326 -
```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```

## CSV output of per-rank information

With CSV output,
```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom: 2 distinct identifiers
rank phylum: 189 distinct identifiers
rank class: 481 distinct identifiers
rank order: 1593 distinct identifiers
rank family: 4107 distinct identifiers
rank genus: 16686 distinct identifiers
rank species: 65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```

and `aaa.csv` looks like:

|    | rank         |   count | lineage |
|---:|:-------------|--------:|:--------|
|  0 | superkingdom |  311480 | d__Bacteria |
|  1 | phylum       |  141114 | d__Bacteria;p__Proteobacteria |
|  2 | class        |  121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
|  3 | order        |   74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
|  4 | family       |   63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
|  5 | phylum       |   61795 | d__Bacteria;p__Firmicutes |
|  6 | class        |   61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
|  7 | order        |   32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
|  8 | phylum       |   28532 | d__Bacteria;p__Actinobacteriota |
|  9 | genus        |   27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
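
For context on the #2326 error message: the first header is reported as `'\ufeffident'` because `\ufeff` is a UTF-8 byte-order mark (BOM) that stayed attached to the first column name. Below is a minimal Python sketch of that effect using only the standard library. It is not sourmash code and makes no claim about how sourmash handles such files; the CSV content and the `read_header` helper are made up for illustration.

```python
import csv
import io

# A tiny taxonomy-style CSV encoded with a leading UTF-8 BOM, as some
# spreadsheet programs write it. The row content is hypothetical.
raw = "\ufeffident,superkingdom,phylum\nEXAMPLE_1,d__Bacteria,p__Firmicutes\n".encode("utf-8")


def read_header(data, encoding):
    """Return the CSV header row of `data` after decoding with `encoding`."""
    text = io.StringIO(data.decode(encoding), newline="")
    return next(csv.reader(text))


# Plain 'utf-8' keeps the BOM glued to the first column name.
header_plain = read_header(raw, "utf-8")

# 'utf-8-sig' strips a leading BOM while decoding.
header_sig = read_header(raw, "utf-8-sig")

print(header_plain)
print(header_sig)
```

With plain `utf-8` decoding, a lookup for a column named `ident` fails even though the file looks correct in a spreadsheet, which matches the "No taxonomic identifiers found" symptom shown above.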
extracted from #2212 (comment) -
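
The proposed crosscheck amounts to two set differences: identifiers present in the database but absent from the taxonomy, and taxonomy entries with no matching sketch. A minimal sketch in plain Python, assuming both sides have already been reduced to lists of identifier strings; the `crosscheck` function and the sample identifiers are hypothetical and do not correspond to any existing sourmash API.

```python
def crosscheck(db_idents, tax_idents):
    """Compare database identifiers against taxonomy identifiers.

    Returns a tuple (missing_taxonomy, missing_sketch):
      - missing_taxonomy: identifiers in the database with no taxonomy entry
      - missing_sketch:   taxonomy entries with no sketch in the database
    Both are returned as sorted lists for stable reporting.
    """
    db_set = set(db_idents)
    tax_set = set(tax_idents)
    missing_taxonomy = sorted(db_set - tax_set)
    missing_sketch = sorted(tax_set - db_set)
    return missing_taxonomy, missing_sketch


# Hypothetical identifiers, for illustration only.
db_idents = ["ID_A", "ID_B", "ID_C"]
tax_idents = ["ID_B", "ID_C", "ID_D"]

missing_taxonomy, missing_sketch = crosscheck(db_idents, tax_idents)
print("in database, no taxonomy:", missing_taxonomy)
print("in taxonomy, no sketch:", missing_sketch)
```

How identifiers are extracted from a database or manifest and from a taxonomy file is left out on purpose, since that depends on the file formats involved.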