provide taxonomy operations that work on semicolon-separated lineages #2185

Closed
ctb opened this issue Aug 7, 2022 · 3 comments
ctb commented Aug 7, 2022

when we use `sourmash tax annotate` on gather results, we produce a column with semicolon-separated lineages in it. we don't have many (any?) sourmash subcommands that natively ingest that format, although we do have some parsing code for metacoder in #2041.

might be nice to think about tooling that easily interconverts between semicolon-separated lineages and comma-separated lineages, or something similar.
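
To make the interconversion idea concrete, here is a minimal sketch in plain Python. It is not sourmash code: the helper names (`split_lineage`, `join_lineage`, `to_rank_csv`), the `ident` column, and the fixed rank list are assumptions for illustration, with the rank names taken from the `tax summarize` output in this thread.

```python
import csv
import io

# Assumed rank order, matching the ranks printed by `tax summarize` in this thread.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def split_lineage(lineage):
    """Turn 'a;b;c' into a {rank: name} dict, padding missing ranks with ''."""
    names = [n.strip() for n in lineage.split(";")] if lineage else []
    if len(names) > len(RANKS):
        raise ValueError(f"too many ranks in lineage: {lineage!r}")
    names += [""] * (len(RANKS) - len(names))
    return dict(zip(RANKS, names))


def join_lineage(row):
    """Inverse of split_lineage: per-rank columns back to 'a;b;c'."""
    names = [row.get(rank, "") for rank in RANKS]
    while names and not names[-1]:
        names.pop()  # drop trailing empty ranks
    return ";".join(names)


def to_rank_csv(pairs):
    """Write (ident, lineage) pairs as CSV text with one column per rank."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["ident"] + RANKS, lineterminator="\n")
    writer.writeheader()
    for ident, lineage in pairs:
        writer.writerow({"ident": ident, **split_lineage(lineage)})
    return out.getvalue()


# "example_ident" is a hypothetical identifier, not a real accession.
print(to_rank_csv([("example_ident", "d__Bacteria;p__Firmicutes;c__Bacilli")]))
```

The two helpers round-trip, so `join_lineage(split_lineage(x)) == x` for any lineage with no more ranks than the list above.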

@ctb ctb added the taxonomy label Aug 7, 2022

ctb commented Aug 7, 2022

(this came up during the discussion of tax grep over in #2178 (comment), and also seems relevant to some of the bigger select-on-metadata ideas out there e.g. #2180)


ctb commented Oct 15, 2022

Implemented in #2333 - so, for example, the new summarize command would print out:

% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers

and

% sourmash tax prepare -t SRR606249-k31.x.gtdb.gather.with-lineages.csv -o zzz.csv -F csv

works as well.

ctb added a commit that referenced this issue Nov 14, 2022
This PR adds a `tax summarize` command per #2212.

It also:
* tackles native loading of with-lineages files produced by `tax annotate` as taxonomy spreadsheets (#2185)
* improves error reporting output for wonky unicode formatted tax CSV files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to #2361
- [x] maybe we want to use this command, or a separate command, to compare between a set of signatures (or a manifest...) and a set of taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches? - punted to #2361

## Example output

Running on a traditional taxonomy file:
```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers
```

On a gather-with-lineages file:
```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers
```
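
For readers wondering what the per-rank numbers mean, one plausible reading is "how many distinct lineage prefixes exist at each rank". The sketch below computes that from a list of semicolon-separated lineages. It is an illustration under that assumption, not the code from #2333; the function name `distinct_per_rank` and the sample lineages are made up for the example.

```python
# Assumed rank order, matching the ranks printed above.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def distinct_per_rank(lineages):
    """Count distinct lineage prefixes at each rank."""
    seen = {rank: set() for rank in RANKS}
    for lineage in lineages:
        names = lineage.split(";")
        for depth, (rank, name) in enumerate(zip(RANKS, names), start=1):
            if not name:
                break  # lineage stops here; deeper ranks are unassigned
            # Use the full prefix so same-named taxa under different parents stay distinct.
            seen[rank].add(";".join(names[:depth]))
    return {rank: len(prefixes) for rank, prefixes in seen.items()}


sample = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales",
]
for rank, n in distinct_per_rank(sample).items():
    print(f"rank {rank}: {n} distinct identifiers")
```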

On the bad CSV file from #2326:

```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```
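
The `\ufeff` in front of `ident` is a Unicode byte order mark (BOM), which some spreadsheet programs write at the start of a CSV file. The sketch below only demonstrates the effect with Python's standard library: decoding with `utf-8` leaves the BOM glued to the first header, while `utf-8-sig` strips it. The `read_headers` helper is hypothetical, and this does not claim to be how sourmash reads taxonomy files.

```python
import csv
import io


def read_headers(raw_bytes, encoding):
    """Return the first CSV row of raw_bytes, decoded with the given encoding."""
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline="")
    return next(csv.reader(text))


# A header line that starts with a BOM, like the file in the error above.
raw = "\ufeffident,taxid,superkingdom\n".encode("utf-8")

print(read_headers(raw, "utf-8"))      # first header still carries the BOM
print(read_headers(raw, "utf-8-sig"))  # BOM removed, first header is 'ident'
```

Re-saving the file without a BOM would be another way to get a clean `ident` header.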

## CSV output of per-rank information

With CSV output,
```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```
and `aaa.csv` looks like:

| | rank | count | lineage |
|---:|:---|---:|:---|
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales |
| 4 | family | 63971 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae |
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia |
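
A table shaped like this one (rank, count, lineage, sorted by count) can be approximated by counting how many identifiers fall under each lineage prefix. The following is a rough sketch of that idea, not the implementation behind `-o`; `lineage_counts` and the sample input are hypothetical.

```python
from collections import Counter

# Assumed rank order, matching the ranks printed by `tax summarize`.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def lineage_counts(lineages):
    """Return (rank, count, lineage_prefix) rows, largest count first."""
    counts = Counter()
    for lineage in lineages:
        names = lineage.split(";")
        for depth, (rank, name) in enumerate(zip(RANKS, names), start=1):
            if not name:
                break  # no assignment at this rank or below
            counts[(rank, ";".join(names[:depth]))] += 1
    # most_common() orders entries by descending count.
    return [(rank, n, prefix) for (rank, prefix), n in counts.most_common()]


sample = [
    "d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria",
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales",
]
for rank, count, prefix in lineage_counts(sample):
    print(rank, count, prefix, sep="\t")
```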

ctb commented Nov 14, 2022

semicolon-separated lineages and gather with-lineages output are now natively supported as taxonomy spreadsheets and can be used with all `tax` commands per #2333 🎉

@ctb ctb closed this as completed Nov 14, 2022