-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add lineages/metadata summarization utility? #2212
Comments
side note, we could also let folks input a lineages gather file into this utility (the requisite columns exist!) to produce a nice human-readable summary. Not quite as nice as |
also ref #2185, deal with annotate-style output (re "let folks input a lineages gather file") |
Initial output:
hey @bluegenes @taylorreiter what else should this output? :) |
see also comment in #2333: maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. That seems ...useful... |
brainstorm: will add a |
I've added CSV output as follows -
and
|
I can't currently think of any additional formats taht would be useful over what is included here. I might recommend a more descriptive column name than |
changed to |
This PR adds a `tax summarize` command per #2212. It also: * tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (#2185) * improves error reporting output for wonky unicode formatted tax CSV files for #2326 Tackles #2212 Tackles #2185 Tackles parts of #2326 ## TODO - [x] tests! - [x] docs! - [x] check desired output format against christy e-mail - [x] provide "linting" style output? - punted to #2361 - [x] maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches? - punted to #2361 ## Example output Running on a traditional taxonomy file: ``` % sourmash tax summarize gtdb-rs202.taxonomy.v2.db == This is sourmash version 4.5.0. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loading taxonomies... ...loaded 258406 entries. num idents: 258406 rank superkingdom: 2 distinct identifiers rank phylum: 169 distinct identifiers rank class: 419 distinct identifiers rank order: 1312 distinct identifiers rank family: 3264 distinct identifiers rank genus: 12888 distinct identifiers rank species: 47894 distinct identifiers ``` On a gather-with-lineages file: ``` % sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv == This is sourmash version 4.5.0. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loading taxonomies... ...loaded 84 entries. num idents: 84 rank superkingdom: 2 distinct identifiers rank phylum: 25 distinct identifiers rank class: 32 distinct identifiers rank order: 42 distinct identifiers rank family: 52 distinct identifiers rank genus: 60 distinct identifiers rank species: 84 distinct identifiers ``` On the bad CSV file from #2326 - ``` % sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv == This is sourmash version 4.5.0. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loading taxonomies... ERROR while loading taxonomies! cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain' ``` ## CSV output of per-rank information With CSV output, ``` % sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv == This is sourmash version 4.5.0. == == Please cite Brown and Irber (2016), doi:10.21105/joss.00027. == loading taxonomies... ...loaded 317542 entries. num idents: 317542 rank superkingdom: 2 distinct identifiers rank phylum: 189 distinct identifiers rank class: 481 distinct identifiers rank order: 1593 distinct identifiers rank family: 4107 distinct identifiers rank genus: 16686 distinct identifiers rank species: 65703 distinct identifiers now calculating detailed lineage counts... ...done! saved 88761 lineage counts to 'aaa.csv' ``` and `aaa.csv` looks like: | | rank | count | lineage | |---:|:-------------|--------:|:--------------------------------------------------------------------------------------------------------------| | 0 | superkingdom | 311480 | d__Bacteria | | 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria | | 2 | class | 121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria | | 3 | order | 74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales | | 4 | family | 63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae | | 5 | phylum | 61795 | d__Bacteria;p__Firmicutes | | 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli | | 7 | order | 32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales | | 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota | | 9 | genus | 27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
Fixed in #2333 |
With #2211, I think database preparation is going ok in one of the two scenarios, but we don't have a good
sourmash sig summarize
-style utility to take in a lineages database and report information back to me, e.g. how many unique identifiers, etc.It would be really nice to have this utility so we can verify that all database identifiers got incorporated into the taxonomy database.
If/as taxonomy information gets integrated into databases, it may be useful to report information on the lineages as part of
sig summarize
output. But I do think folks may often want custom lineages file, so it'd be nice if this were a standalone tax utility as well.The text was updated successfully, but these errors were encountered: