Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add lineages/metadata summarization utility? #2212

Closed
bluegenes opened this issue Aug 15, 2022 · 9 comments
Closed

add lineages/metadata summarization utility? #2212

bluegenes opened this issue Aug 15, 2022 · 9 comments

Comments

@bluegenes
Copy link
Contributor

With #2211, I think database preparation is going ok in one of the two scenarios, but we don't have a good sourmash sig summarize-style utility to take in a lineages database and report information back to me, e.g. how many unique identifiers, etc.

It would be really nice to have this utility so we can verify that all database identifiers got incorporated into the taxonomy database.

If/as taxonomy information gets integrated into databases, it may be useful to report information on the lineages as part of sig summarize output. But I do think folks may often want custom lineages file, so it'd be nice if this were a standalone tax utility as well.

@bluegenes
Copy link
Contributor Author

side note, we could also let folks input a lineages gather file into this utility (the requisite columns exist!) to produce a nice human-readable summary. Not quite as nice as write_human_summary from tax annotate, but handy if you already have the annotated file sitting around. Perhaps an "off-label" use case 🤷🏻‍♀️.

@ctb
Copy link
Contributor

ctb commented Aug 27, 2022

also ref #2185, deal with annotate-style output (re "let folks input a lineages gather file")

@ctb ctb mentioned this issue Oct 15, 2022
5 tasks
@ctb
Copy link
Contributor

ctb commented Oct 15, 2022

Initial output:

% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers

hey @bluegenes @taylorreiter what else should this output? :)

@ctb
Copy link
Contributor

ctb commented Oct 15, 2022

see also comment in #2333:

maybe we want to use this command, or a separate command, to compare b/t a set of signatures (or a manifest...) and a set of taxonomies? e.g. tax crosscheck --db db --taxonomy <taxonomy> that will tell us which identifiers don't have taxonomy, and which taxonomy entries don't have sketches?

That seems ...useful...

@ctb
Copy link
Contributor

ctb commented Oct 15, 2022

brainstorm: will add a --csv option that outputs counts for each aggregate named rank.

@ctb
Copy link
Contributor

ctb commented Oct 16, 2022

I've added CSV output as follows -

% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'

and aaa.csv looks like:

rank count lineage
0 superkingdom 311480 d__Bacteria
1 phylum 141114 d__Bacteria;p__Proteobacteria
2 class 121804 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria
3 order 74108 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales
4 family 63971 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae
5 phylum 61795 d__Bacteria;p__Firmicutes
6 class 61794 d__Bacteria;p__Firmicutes;c__Bacilli
7 order 32177 d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales
8 phylum 28532 d__Bacteria;p__Actinobacteriota
9 genus 27205 d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia

@taylorreiter
Copy link
Contributor

I can't currently think of any additional formats taht would be useful over what is included here. I might recommend a more descriptive column name than count though -- could get confusing with num lineages vs. num hashes vs num est bp etc.

@ctb
Copy link
Contributor

ctb commented Nov 13, 2022

I can't currently think of any additional formats taht would be useful over what is included here. I might recommend a more descriptive column name than count though -- could get confusing with num lineages vs. num hashes vs num est bp etc.

changed to lineage_count!

ctb added a commit that referenced this issue Nov 14, 2022
This PR adds a `tax summarize` command per #2212.

It also:
* tackles native loading of with-lineages files produced by `tax
annotate` as taxonomy spreadsheets
(#2185)
* improves error reporting output for wonky unicode formatted tax CSV
files for #2326

Tackles #2212
Tackles #2185
Tackles parts of #2326

## TODO

- [x] tests!
- [x] docs!
- [x] check desired output format against christy e-mail
- [x] provide "linting" style output? - punted to
#2361
- [x] maybe we want to use this command, or a separate command, to
compare b/t a set of signatures (or a manifest...) and a set of
taxonomies? e.g. `tax crosscheck --db db --taxonomy <taxonomy>` that
will tell us which identifiers don't have taxonomy, and which taxonomy
entries don't have sketches? - punted to
#2361

## Example output

Running on a traditional taxonomy file:
```
% sourmash tax summarize gtdb-rs202.taxonomy.v2.db                      

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 258406 entries.
num idents: 258406
rank superkingdom:        2 distinct identifiers
rank phylum:              169 distinct identifiers
rank class:               419 distinct identifiers
rank order:               1312 distinct identifiers
rank family:              3264 distinct identifiers
rank genus:               12888 distinct identifiers
rank species:             47894 distinct identifiers
```

On a gather-with-lineages file:
```
% sourmash tax summarize SRR606249-k31.x.gtdb.gather.with-lineages.csv                   

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 84 entries.
num idents: 84
rank superkingdom:        2 distinct identifiers
rank phylum:              25 distinct identifiers
rank class:               32 distinct identifiers
rank order:               42 distinct identifiers
rank family:              52 distinct identifiers
rank genus:               60 distinct identifiers
rank species:             84 distinct identifiers
```

On the bad CSV file from #2326 -

```
% sourmash tax summarize /Users/t/Downloads/cheesegenomes.lineages.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
ERROR while loading taxonomies!
cannot read taxonomy assignments from '/Users/t/Downloads/cheesegenomes.lineages.csv': No taxonomic identifiers found; headers are '\ufeffident','taxid','superkingdom','phylum','class','order','family','genus','species','strain'
```

## CSV output of per-rank information

With CSV output,
```
% sourmash tax summarize gtdb-rs207.taxonomy.sqldb -o aaa.csv

== This is sourmash version 4.5.0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

loading taxonomies...
...loaded 317542 entries.
num idents: 317542
rank superkingdom:        2 distinct identifiers
rank phylum:              189 distinct identifiers
rank class:               481 distinct identifiers
rank order:               1593 distinct identifiers
rank family:              4107 distinct identifiers
rank genus:               16686 distinct identifiers
rank species:             65703 distinct identifiers
now calculating detailed lineage counts...
...done!
saved 88761 lineage counts to 'aaa.csv'
```
and `aaa.csv` looks like:

| | rank | count | lineage |

|---:|:-------------|--------:|:--------------------------------------------------------------------------------------------------------------|
| 0 | superkingdom | 311480 | d__Bacteria |
| 1 | phylum | 141114 | d__Bacteria;p__Proteobacteria |
| 2 | class | 121804 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria |
| 3 | order | 74108 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales
|
| 4 | family | 63971 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae
|
| 5 | phylum | 61795 | d__Bacteria;p__Firmicutes |
| 6 | class | 61794 | d__Bacteria;p__Firmicutes;c__Bacilli |
| 7 | order | 32177 |
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales |
| 8 | phylum | 28532 | d__Bacteria;p__Actinobacteriota |
| 9 | genus | 27205 |
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia
|
@ctb
Copy link
Contributor

ctb commented Nov 14, 2022

Fixed in #2333

@ctb ctb closed this as completed Nov 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants