provide taxonomy operations that work on semicolon-separated lineages #2185
(this came up during the discussion of |
Implemented in #2333 - so, for example, the new summarize command would print out:
and
works as well.
This PR adds a `tax summarize` command per #2212. It also:

* tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (#2185)
* improves error reporting output for wonky unicode formatted tax CSV files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to #2361
- [x] maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches? - punted to #2361

## Example output

Running on a traditional taxonomy file:

```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom: 2 distinct identifiers
rank phylum: 169 distinct identifiers
rank class: 419 distinct identifiers
rank order: 1312 distinct identifiers
rank family: 3264 distinct identifiers
rank genus: 12888 distinct identifiers
rank species: 47894 distinct identifiers
```

On a gather-with-lineages file:

```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom: 2 distinct identifiers
rank phylum: 25 distinct identifiers
rank class: 32 distinct identifiers
rank order: 42 distinct identifiers
rank family: 52 distinct identifiers
rank genus: 60 distinct identifiers
rank species: 84 distinct identifiers
```

On the bad CSV file from #2326 -

```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```

## CSV output of per-rank information

With CSV output,

```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv
== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom: 2 distinct identifiers
rank phylum: 189 distinct identifiers
rank class: 481 distinct identifiers
rank order: 1593 distinct identifiers
rank family: 4107 distinct identifiers
rank genus: 16686 distinct identifiers
rank species: 65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```

and `aaa.csv` looks like:

|    | rank         |   count | lineage |
|---:|:-------------|--------:|:--------|
|  0 | superkingdom |  311480 | d__Bacteria |
|  1 | phylum       |  141114 | d__Bacteria;p__Proteobacteria |
|  2 | class        |  121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
|  3 | order        |   74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
|  4 | family       |   63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
|  5 | phylum       |   61795 | d__Bacteria;p__Firmicutes |
|  6 | class        |   61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
|  7 | order        |   32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
|  8 | phylum       |   28532 | d__Bacteria;p__Actinobacteriota |
|  9 | genus        |   27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
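For readers who want to see what the per-rank lineage counts amount to, here is a minimal Python sketch. It is not sourmash's implementation: the function names `count_lineage_prefixes` and `distinct_per_rank` are hypothetical, the rank list is copied from the output above, and the input is toy data. It counts how many identifiers fall under each lineage prefix, which is roughly the shape of the `rank`/`count`/`lineage` rows shown in `aaa.csv`.

```python
from collections import Counter

# Rank names as they appear in the summarize output above.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def count_lineage_prefixes(lineages, ranks=RANKS):
    """Count identifiers under every (rank, lineage-prefix) pair.

    `lineages` is an iterable of semicolon-separated lineage strings,
    one per identifier. Hypothetical helper, for illustration only.
    """
    counts = Counter()
    for lineage in lineages:
        if not lineage:
            continue  # skip identifiers with no lineage at all
        names = lineage.split(";")
        depth = min(len(names), len(ranks))
        for i in range(1, depth + 1):
            # e.g. ("phylum", "d__Bacteria;p__Firmicutes")
            counts[(ranks[i - 1], ";".join(names[:i]))] += 1
    return counts


def distinct_per_rank(counts):
    """Number of distinct lineage prefixes seen at each rank."""
    return Counter(rank for rank, _prefix in counts)


# Toy input: three identifiers, two phyla.
toy = [
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Proteobacteria",
]
counts = count_lineage_prefixes(toy)
for (rank, prefix), n in counts.most_common():
    print(rank, n, prefix)
```

Sorting by count, as `most_common()` does here, gives the same "biggest groups first" ordering that the table above appears to use.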
semicolon-separated lineages and gather |
when we use `sourmash tax annotate` on gather results, we produce a column with semicolon-separated lineages in it. we don't have many (any?) sourmash subcommands that natively ingest that format, although we do have some parsing code here #2041 for metacoder.

might be nice to think about tooling that easily interconverts between semicolon-separated lineages and comma-separated lineages, or something.
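As a rough illustration of the interconversion being asked for, here is a minimal Python sketch. It assumes the seven standard rank columns seen in the taxonomy CSV headers in this thread; the helper names `split_lineage` and `join_lineage` are hypothetical and are not part of sourmash.

```python
# Rank columns assumed from the taxonomy CSV headers quoted in this thread.
RANKS = ["superkingdom", "phylum", "class", "order",
         "family", "genus", "species"]


def split_lineage(lineage, ranks=RANKS):
    """'d__A;p__B' -> {'superkingdom': 'd__A', 'phylum': 'p__B', ...}.

    Ranks missing from the lineage are filled with empty strings.
    Hypothetical helper, for illustration only.
    """
    names = [n.strip() for n in lineage.split(";")] if lineage else []
    if len(names) > len(ranks):
        raise ValueError(f"lineage has more than {len(ranks)} ranks: {lineage!r}")
    names += [""] * (len(ranks) - len(names))
    return dict(zip(ranks, names))


def join_lineage(row, ranks=RANKS):
    """Inverse of split_lineage: rank columns -> 'd__A;p__B'.

    Trailing empty ranks are dropped so the string stops at the
    lowest assigned rank.
    """
    names = [row.get(rank, "") for rank in ranks]
    while names and not names[-1]:
        names.pop()
    return ";".join(names)


row = split_lineage("d__Bacteria;p__Firmicutes;c__Bacilli")
print(row["class"])
print(join_lineage(row))
```

Padding with empty strings keeps every row the same width, so the result can be written straight out with `csv.DictWriter` using `RANKS` (plus an identifier column) as the field names.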