sourmash tax prepare fails with No taxonomic identifiers found. #2326

Open
taylorreiter opened this issue Oct 12, 2022 · 10 comments

@taylorreiter
Contributor

Command and output pasted below. Lineages csv attached and reproduced!

sourmash tax prepare --taxonomy-csv inputs/sourmash_databases/cheesegenomes.lineages.csv -o tmp.sqldb

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from 'inputs/sourmash_databases/cheesegenomes.lineages.csv': No taxonomic identifiers found.

cheesegenomes.lineages.csv:

ident,taxid,superkingdom,phylum,class,order,family,genus,species,strain
pcamembertiSAM3_3runs.flye.diamond_microbeProteome922.fs_corrected.pilon,5075,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium camemberti,SAM3_3
pen12.pilon,2720512,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,12
rs17.pilon,5081,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium sp.,RS-17
geo.pilon,1173061,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Dipodascaceae,Geotrichum,Geotrichum candidum,geo
JBC_canu.pilon,229535,Eukaryota,Ascomycota,Eurotiomycetes,Eurotiales,Aspergillaceae,Penicillium,Penicillium nordicum,JBC
JB370.pilon,40374,Eukaryota,Ascomycota,Sordariomycetes,Microascales,Microascaceae,Scopulariopsis,Scopulariopsis sp.,JB370
135e.pilon,45537,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,,Diutina,Diutina catenulata,135e
135B.pilon,4959,Eukaryota,Ascomycota,Saccharomycetes,Saccharomycetales,Debaryomycetaceae,Debaryomyces,Debaryomyces hansenii,135B

I can't think what would be causing this...I tried to essentially copy the genbank lineage formats.

@taylorreiter
Contributor Author

Probably should have tagged @bluegenes in this!

@ctb
Contributor

ctb commented Oct 13, 2022

some sort of weird formatting issue that affects the csv module but not pandas.read_csv.

(screenshot: Screen Shot 2022-10-13 at 9 57 33 AM)

The file is in DOS format but ... weird. Nothing (vi, emacs, Mac OS Numbers) has a problem with it!

python code to reproduce:

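import csv

filename = 'cheesegenomes.lineages.csv'  # path to the attached file
# newline='' lets the csv module handle the DOS (CRLF) line endings itself
with open(filename, newline='') as fp:
    r = csv.reader(fp)
    for row in r:
        print(row)  # header row only
        break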

tl;dr open, save as CSV, try again.

@taylorreiter
Contributor Author

taylorreiter commented Oct 13, 2022

ya that's deeply annoying and the solution. I read it into R and wrote it out again and the problems were fixed. Doing so in vim or excel did not fix it. le sigh. thank you for your help!!!!

@ctb
Contributor

ctb commented Oct 13, 2022

leave this open and I'll add something to the error output listing the headers that WERE found...
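A rough sketch of what listing the found headers could look like. This is not sourmash's actual code: `find_ident_column` and the `IDENT_COLUMNS` set are hypothetical names made up for illustration. Running the headers through `repr()` should make otherwise invisible characters show up in the message.

```python
import csv
import io

# Hypothetical set of accepted identifier column names (an assumption,
# not taken from sourmash).
IDENT_COLUMNS = {"ident", "identifiers", "accession"}


def find_ident_column(fp):
    """Return the identifier column name, or raise listing the headers seen."""
    reader = csv.DictReader(fp)
    headers = reader.fieldnames or []
    for name in headers:
        if name in IDENT_COLUMNS:
            return name
    # repr() escapes non-printable characters so they are visible in the error.
    found = ",".join(repr(h) for h in headers)
    raise ValueError(f"No taxonomic identifiers found; headers are {found}")


good = io.StringIO("ident,taxid\nrs17.pilon,5081\n")
# Same header, but with an invisible U+FEFF character in front of 'ident'.
bad = io.StringIO("\ufeffident,taxid\nrs17.pilon,5081\n")

print(find_ident_column(good))

message = None
try:
    find_ident_column(bad)
except ValueError as exc:
    message = str(exc)
    print(message)
```

The point of the design is only that the message shows exactly what the parser saw, so a header that looks right in an editor but is not byte-for-byte `ident` stands out.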

@taylorreiter
Contributor Author

🪄 🌟 thank you!

@ctb
Contributor

ctb commented Oct 13, 2022

ah-hah! figured it out:

(screenshot: Screen Shot 2022-10-13 at 2 53 25 PM)

this is the "byte order mark (BOM)", which marks the file as UTF-8 encoded; when the file is read as plain UTF-8 it ends up glued to the front of the first header name. See https://stackoverflow.com/questions/50130605/python-2-7-csv-file-read-write-xef-xbb-xbf-code.

I'm not sure what the right move is here but at least I know what it is now!
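For anyone landing here with the same problem, a small self-contained demonstration of the effect, using a throwaway file rather than the attached CSV. It relies on Python's standard `utf-8-sig` codec, which skips a leading BOM while decoding; `read_header` is just a helper name for this sketch, and whether sourmash should open files this way is a separate question.

```python
import csv
import os
import tempfile

# Write a tiny CSV that starts with the UTF-8 BOM bytes (EF BB BF)
# and uses DOS line endings, like the file in this issue.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(b"\xef\xbb\xbfident,taxid\r\nrs17.pilon,5081\r\n")
    path = tmp.name


def read_header(path, encoding):
    """Return the first CSV row of `path`, decoded with `encoding`."""
    with open(path, newline="", encoding=encoding) as fp:
        return next(csv.reader(fp))


plain = read_header(path, "utf-8")     # BOM survives as '\ufeff' on the first header
sig = read_header(path, "utf-8-sig")   # BOM is dropped during decoding
os.remove(path)

print(plain)
print(sig)
```

With `utf-8` the first header comes back as `'\ufeffident'`, which does not match `ident`; with `utf-8-sig` it comes back as plain `'ident'`. `utf-8-sig` also reads files without a BOM normally, so it is a fairly safe default for reading.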

@ctb ctb mentioned this issue Oct 15, 2022
5 tasks
@ctb
Contributor

ctb commented Oct 15, 2022

PR #2333 adds the following output:

% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'

Note that the error output ("headers are") will be standard across all CSV-loading attempts; this is just an example using the tax summarize command (also new in #2333).
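If the headers in that message start with `'\ufeff'`, one possible cleanup is to rewrite the file without the leading BOM bytes. A minimal sketch, assuming the BOM is the only problem; `strip_utf8_bom` is a hypothetical helper and not part of sourmash, and the demo runs on a throwaway file.

```python
import codecs
import tempfile
from pathlib import Path


def strip_utf8_bom(src, dst):
    """Copy src to dst, dropping a leading UTF-8 BOM if present.

    Returns True if a BOM was found and removed, False otherwise.
    """
    data = Path(src).read_bytes()
    had_bom = data.startswith(codecs.BOM_UTF8)
    if had_bom:
        data = data[len(codecs.BOM_UTF8):]
    Path(dst).write_bytes(data)
    return had_bom


# Demo on a temporary file with a BOM in front of the header row.
with tempfile.TemporaryDirectory() as tmpdir:
    src = Path(tmpdir) / "with_bom.csv"
    dst = Path(tmpdir) / "no_bom.csv"
    src.write_bytes(codecs.BOM_UTF8 + b"ident,taxid\r\nrs17.pilon,5081\r\n")
    removed = strip_utf8_bom(src, dst)
    cleaned = dst.read_bytes()

print(removed)
print(cleaned)
```

Working on bytes leaves everything else in the file untouched, including the DOS line endings, so the only difference between input and output is the three BOM bytes.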

@ctb
Contributor

ctb commented Oct 16, 2022

This Arrow PR adds support for BOM: apache/arrow#11892

ctb added a commit that referenced this issue Nov 14, 2022
This PR adds a `tax summarize` command per #2212.

It also:
* tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (#2185)
* improves error reporting output for wonky unicode formatted tax CSV files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to #2361
- [x] maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches? - punted to #2361

## Example output

Running on a traditional taxonomy file:
```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers
```

On a gather-with-lineages file:
```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers
```

On the bad CSV file from #2326 -

```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```

## CSV output of per-rank information

With CSV output,
```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```
and `aaa.csv` looks like:

| | rank | count | lineage |
|---:|:-------------|--------:|:--------------------------------------------------------------------------------------------------------------|
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
| 4 | family | 63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |

@ctb
Contributor

ctb commented Nov 14, 2022

Clearer error message added in #2333
