[MRG] Divide non-DNA MinHash ksize by 3 for external consumption. #1277
Co-authored-by: C. Titus Brown <titus@idyll.org>
Codecov Report
Coverage Diff

| | latest | #1277 | +/- |
|---|---|---|---|
| Coverage | 88.67% | 88.69% | +0.02% |
| Files | 125 | 125 | |
| Lines | 18231 | 18281 | +50 |
| Branches | 1434 | 1440 | +6 |
| Hits | 16167 | 16215 | +48 |
| Misses | 1818 | 1819 | +1 |
| Partials | 246 | 247 | +1 |
ready for review @luizirber @bluegenes
I'll leave the
Should these be moved into Rust, then? (not asking you to do it, but since both operations are 'fixing' what is being sent/retrieved from the Rust layer...)
Before I wrote this PR, I would have said "no" or "maybe" to moving it to the Rust layer. Now I'm on "maybe" or "yes" - it's not that big a change, in practice! (I'm impressed with how clean it turned out to be.)
No comment on mv to rust layer, but
This is wonderful. If I'm in protein space and request k = 9, I want sourmash to give me 9 amino acids. It's so so confusing when that is not the case. If @bluegenes has a different preference/opinion, hers is probably better than mine :)
So we still have to do the weird 3 thing...but only in part of sourmash?
yes, and that part is getting deprecated in 4.0 and removed in 5.0. The reason to keep it working the same is so that we don't mess up people who are using 2.x and 3.x
As I understand it:
I'm 100% on board with these! I do think that a deprecation warning for
This PR gets close to not needing to worry about protein ksize conversion!
Main question -- What is happening with selectors? It's not intuitive to sketch at
yes!
yes!
actually, no - Python internals use "correct" ksizes!!
absolutely. I'm working on that in documentation PRs and will also put in a deprecation here (and think about backporting it).
I think, for you, you would simply never need to worry about it, yes! Rust users might, but that's an army of 1.6, for now (standard 1.5x multiplier for Luiz, plus 0.1 for me).
Selectors work with the divide-by-3 number, so it's all consistent. The only odd-command-out is
I will take all of this as license to proceed towards making this a merge-able PR - thanks, all!
Ready for review and merge!
(I'll do a new PR for moving the checks into Rust, but easier to have this one merged and then use the tests to move the code around =])
yay w00t!
This is a "light touch" fix to #1271, which makes it so that the entire Python (and CLI) layer of sourmash does protein/hp/dayhoff ksizes correctly: DNA k-mer sizes are the same, protein k-mer sizes are the lengths of the actual amino acid k-mers being used, and translated DNA k-mers are translated into the correct protein k-mer size.
This PR changes almost everything on the command-line and Python API side to match this - in particular, command line selectors are now different.
The one exception is `sourmash compute`, which still uses the "old" meaning of ksize for backwards compatibility reasons.

No changes apply to existing signature or database formats, so this is forwards and backwards compatible with 3.5!
The key changes are in the `MinHash` wrapper class in `src/sourmash/minhash.py`: the `ksize` property divides the ksize by 3 for non-DNA minhashes.

Along the way, I also had to fix the LCA database to save/load its database correctly, with the adjusted ksize for non-DNA databases.
Most of the rest of the changes are minor fixes to command-line parameters or output, except that `sourmash compute` ksizes for non-DNA signatures have to be multiples of 3.

Fixes #1271.
Includes #1019
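
As a rough illustration of the multiple-of-3 constraint on old-style `sourmash compute` ksizes, the sketch below converts such a value to an amino acid k-mer length and rejects values that do not divide evenly. The function name `compute_ksize_to_residues` and its `moltype` parameter are hypothetical helpers for this example, not part of the sourmash API.

```python
def compute_ksize_to_residues(ksize, moltype="protein"):
    """Hypothetical helper: map an old-style compute ksize to the k-mer
    length in residues for the given molecule type."""
    if moltype == "DNA":
        # DNA k-mer sizes mean the same thing everywhere.
        return ksize
    if ksize % 3 != 0:
        # Old-style non-DNA ksizes are in nucleotide space,
        # so they must divide evenly into codons.
        raise ValueError(
            f"ksize {ksize} is not a multiple of 3 for moltype {moltype}"
        )
    return ksize // 3


print(compute_ksize_to_residues(21))          # 7
print(compute_ksize_to_residues(31, "DNA"))   # 31
```

Raising an error instead of silently rounding keeps a mistyped ksize from producing a sketch at an unintended k-mer length.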
TODO:
- `sourmash compute` for 5.0 removal
- `sourmash compute` in 5.0

Checklist
- `make test` Did it pass the tests?
- `make coverage` Is the new code covered?
- without a major version increment. Changing file formats also requires a major version number increment.
- changes were made?