Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] add column 3 to kreport #2306

Merged
merged 5 commits into from
Oct 6, 2022
Merged

[MRG] add column 3 to kreport #2306

merged 5 commits into from
Oct 6, 2022

Conversation

bluegenes
Copy link
Contributor

Fixes #2305

Updates kreport to include information in column 3, which is the bp "assigned" to a particular taxon. Since we make all assignments to the genome level and then apply taxonomy later, this column just contains the species-level assignments from column 2. I think this best mimics the intention of the output format / outputs from other tools.

Updated documentation with this and the recommendation to generate sketches with abundance.

This is basically a bug fix, since this is closer to what the format should have been to begin with.

@bluegenes
Copy link
Contributor Author

bluegenes commented Oct 3, 2022

I think this best mimics the intention of the output format / outputs from other tools.

My only thought here is that while sourmash assigns k-mers to genomes, the actual level of our resolution depends on alphabet and k-mer size. We don't do taxonomic assignments any differently, but I wouldn't necessarily recommend using, e.g. protein k7, to get species-level assignments. Perhaps I should add a sentence about this in the doc? e.g.

Note: sourmash makes all assignments to genomes, then integrates taxonomy information and
uses LCA-style summarization to build assignments. For species-level specificity, our current
recommendation is to use use our default k-mer size of 31.

I'm not really sure though, because 1. this is really just general sourmash info/recommendation, and 2. there are lots of reasons to try other ksizes.

@codecov
Copy link

codecov bot commented Oct 3, 2022

Codecov Report

Merging #2306 (66416a6) into latest (e615ce0) will increase coverage by 7.30%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2306      +/-   ##
==========================================
+ Coverage   84.84%   92.14%   +7.30%     
==========================================
  Files         131      100      -31     
  Lines       15687    11416    -4271     
  Branches     2189     2190       +1     
==========================================
- Hits        13309    10519    -2790     
+ Misses       2083      602    -1481     
  Partials      295      295              
Flag Coverage Δ
python 92.14% <100.00%> (+<0.01%) ⬆️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/tax/tax_utils.py 98.32% <100.00%> (+<0.01%) ⬆️
src/core/src/index/search.rs
src/core/src/signature.rs
src/core/src/ffi/index/revindex.rs
src/core/src/lib.rs
src/core/src/ffi/storage.rs
src/core/tests/test.rs
src/core/src/index/mod.rs
src/core/src/index/sbt/mod.rs
src/core/src/index/linear.rs
... and 22 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@bluegenes
Copy link
Contributor Author

ok @ctb @sourmash-bio/devs ready for review

@ctb
Copy link
Contributor

ctb commented Oct 6, 2022

I think this best mimics the intention of the output format / outputs from other tools.

My only thought here is that while sourmash assigns k-mers to genomes, the actual level of our resolution depends on alphabet and k-mer size. We don't do taxonomic assignments any differently, but I wouldn't necessarily recommend using, e.g. protein k7, to get species-level assignments. Perhaps I should add a sentence about this in the doc? e.g.

Note: sourmash makes all assignments to genomes, then integrates taxonomy information and
uses LCA-style summarization to build assignments. For species-level specificity, our current
recommendation is to use use our default k-mer size of 31.

I'm not really sure though, because 1. this is really just general sourmash info/recommendation, and 2. there are lots of reasons to try other ksizes.

I like the idea of adding this text, actually. We're seeing more and more evidence that people are reading, sometimes scouring, the text for details - let's give it to 'em!

The one addition I would make is to say "sourmash gather makes all assignments to genomes, and then sourmash tax integrates taxonomy information...".

@bluegenes bluegenes merged commit 5518c6f into latest Oct 6, 2022
@bluegenes bluegenes deleted the kreport-col3 branch October 6, 2022 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fix kreport format: third column
2 participants