Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add tax crosscheck to compare databases and lineages for correctness #2361

Open
ctb opened this issue Nov 13, 2022 · 0 comments · May be fixed by #2362
Open

add tax crosscheck to compare databases and lineages for correctness #2361

ctb opened this issue Nov 13, 2022 · 0 comments · May be fixed by #2362
Labels

Comments

@ctb
Copy link
Contributor

ctb commented Nov 13, 2022

extracted from #2212 (comment) -

maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. tax crosscheck --db db --taxonomy that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches?

@ctb ctb added the taxonomy label Nov 13, 2022
@ctb ctb mentioned this issue Nov 13, 2022
5 tasks
ctb added a commit that referenced this issue Nov 14, 2022
This PR adds a `tax summarize` command per #2212.

It also:
* tackles native loading of with-lineages files produced by `tax
annotate` as taxonomy spreadsheets
(#2185)
* improves error reporting output for wonky unicode formatted tax CSV
files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to
#2361
- [x] maybe we want to use this command, or a separate command, to
compare b/t a set of signatures (or a manifest...) and a set of
taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that
will tell us which identifiers don't have taxonomy, and which taxonomy
entries don't have sketches? - punted to
#2361

## Example output

Running on a traditional taxonomy file:
```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers
```

On a gather-with-lineages file:
```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers
```

On the bad CSV file from #2326 -

```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```

## CSV output of per-rank information

With CSV output,
```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```
and `aaa.csv` looks like:

| | rank | count | lineage |

|---:|:-------------|--------:|:--------------------------------------------------------------------------------------------------------------|
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales
|
| 4 | family | 63971 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae
|
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 |
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia
|
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant