`sourmash tax prepare` fails with `No taxonomic identifiers found.`
#2326
Comments
Probably should have tagged @bluegenes in this!
some sort of weird formatting issue that affects the The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it! python code to reproduce:
tl;dr open, save as CSV, try again.
ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!
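For anyone who would rather do that round trip in Python than in R, here is a minimal sketch. It assumes the culprit is a leading UTF-8 byte order mark (identified further down this thread); `rewrite_without_bom`, the file names, and the `genomeA` row are made up for illustration and are not part of sourmash.

```python
import csv
import os
import tempfile

BOM = b"\xef\xbb\xbf"  # the UTF-8 byte order mark, as raw bytes


def rewrite_without_bom(src_path, dst_path):
    """Copy a CSV file, dropping a leading UTF-8 BOM and DOS line endings."""
    # 'utf-8-sig' decodes UTF-8 and skips a leading BOM if one is present.
    with open(src_path, newline="", encoding="utf-8-sig") as src, \
            open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        writer.writerows(csv.reader(src))


# Build a small stand-in for the problem file (dummy data, DOS line endings).
tmpdir = tempfile.mkdtemp()
bad = os.path.join(tmpdir, "bad.lineages.csv")
good = os.path.join(tmpdir, "good.lineages.csv")
with open(bad, "wb") as fp:
    fp.write(BOM + b"ident,taxid,superkingdom\r\ngenomeA,1,d__Bacteria\r\n")

rewrite_without_bom(bad, good)

with open(good, "rb") as fp:
    cleaned = fp.read()
print(cleaned.startswith(BOM))
```

The rewritten file is plain UTF-8 with Unix line endings, so the first header comes back as a bare `ident`.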
leave this open and I'll add something to the error output listing the headers that WERE found...
🪄 🌟 thank you!
ah-hah! figured it out: the file starts with a UTF-8 "byte order mark (BOM)", which ends up glued onto the first header name when the CSV is read. See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code. I'm not sure what the right move is here but at least I know what it is now!
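To see the effect in isolation, here is a small sketch using only the standard library. The bytes in `raw` are dummy data shaped like the start of a lineages CSV, and `read_headers` is a hypothetical helper, not a sourmash function.

```python
import csv
import io

# Dummy file contents: UTF-8 BOM, then a header row and one data row.
raw = b"\xef\xbb\xbfident,taxid,superkingdom\r\ngenomeA,1,d__Bacteria\r\n"


def read_headers(data, encoding):
    """Decode bytes with the given encoding and return the CSV header names."""
    text = io.StringIO(data.decode(encoding), newline="")
    return csv.DictReader(text).fieldnames


plain = read_headers(raw, "utf-8")         # BOM survives as '\ufeff'
stripped = read_headers(raw, "utf-8-sig")  # BOM is skipped while decoding

print(plain)
print(stripped)
print("ident" in plain, "ident" in stripped)  # False True
```

With plain `utf-8` the first header is `'\ufeffident'`, so a lookup for `ident` fails even though every editor shows the column as `ident`. Decoding with `utf-8-sig` is one possible way to tolerate such files; whether that is the right move for sourmash is a separate question.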
PR #2333 adds the following output:
Note the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the
asking question here:
This Arrow PR adds support for BOM: apache/arrow#11892
This PR adds a `tax summarize` command per #2212. It also:
* tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (#2185)
* improves error reporting output for wonky unicode formatted tax CSV files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to #2361
- [x] maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches? - punted to #2361

## Example output

Running on a traditional taxonomy file:
```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom: 2 distinct identifiers
rank phylum: 169 distinct identifiers
rank class: 419 distinct identifiers
rank order: 1312 distinct identifiers
rank family: 3264 distinct identifiers
rank genus: 12888 distinct identifiers
rank species: 47894 distinct identifiers
```

On a gather-with-lineages file:
```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom: 2 distinct identifiers
rank phylum: 25 distinct identifiers
rank class: 32 distinct identifiers
rank order: 42 distinct identifiers
rank family: 52 distinct identifiers
rank genus: 60 distinct identifiers
rank species: 84 distinct identifiers
```

On the bad CSV file from #2326 -
```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```

## CSV output of per-rank information

With CSV output,
```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom: 2 distinct identifiers
rank phylum: 189 distinct identifiers
rank class: 481 distinct identifiers
rank order: 1593 distinct identifiers
rank family: 4107 distinct identifiers
rank genus: 16686 distinct identifiers
rank species: 65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```

and `aaa.csv` looks like:

| | rank | count | lineage |
|---:|:-------------|--------:|:-----------|
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
| 4 | family | 63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
Clearer error message added in #2333
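As a rough illustration of that kind of error reporting (not the actual code from #2333), a loader can include the headers it did find when the identifier column is missing. Everything here is an assumption made for the sketch: `check_headers`, the `IDENT_COLUMNS` tuple, and the dummy CSV text. Only the message wording is modeled on the output quoted in this thread.

```python
import csv
import io

# Hypothetical: header names accepted as the identifier column.
IDENT_COLUMNS = ("ident",)


def check_headers(fp):
    """Return a DictReader, or raise with the headers that WERE found."""
    reader = csv.DictReader(fp)
    headers = reader.fieldnames or []
    if not any(col in headers for col in IDENT_COLUMNS):
        # repr() makes invisible characters such as '\ufeff' show up.
        found = ",".join(repr(h) for h in headers)
        raise ValueError(
            f"No taxonomic identifiers found; headers are {found}"
        )
    return reader


# Dummy CSV text whose first header still carries a BOM character.
bad_text = "\ufeffident,taxid,superkingdom\ngenomeA,1,d__Bacteria\n"
try:
    check_headers(io.StringIO(bad_text, newline=""))
    message = ""
except ValueError as exc:
    message = str(exc)
print(message)
```

Printing the headers with `repr()` is what turns an otherwise invisible character into a visible `\ufeff` escape, which is the clue that gave the problem away here.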
Command and output pasted below. Lineages csv attached and reproduced!
cheesegenomes.lineages.csv:
I can't think what would be causing this...I tried to essentially copy the genbank lineage formats.