Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[EXP] minhash unique covered bp #2027

Open
wants to merge 7 commits into
base: latest
Choose a base branch
from
Open

[EXP] minhash unique covered bp #2027

wants to merge 7 commits into from

Conversation

bluegenes
Copy link
Contributor

@bluegenes bluegenes commented May 3, 2022

In ANI work, I dropped our rough scaled * num_hashes estimate for bp into minhash.py as the property covered_bp. Thanks to @dkoslicki for pointing out that 1. this is a bad name, bc we're not taking abundances into account, and 2. we should really be adding (ksize -1).

In this PR, I'm updating the property and renaming to unique_covered_bp

We could also add covered_bp that does take into account abundances if desired...

@codecov
Copy link

codecov bot commented May 3, 2022

Codecov Report

Merging #2027 (f4f5935) into latest (8136258) will increase coverage by 7.51%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2027      +/-   ##
==========================================
+ Coverage   84.15%   91.66%   +7.51%     
==========================================
  Files         129       98      -31     
  Lines       15087    10807    -4280     
  Branches     2119     2119              
==========================================
- Hits        12696     9906    -2790     
+ Misses       2095      604    -1491     
- Partials      296      297       +1     
Flag Coverage Δ
python 91.66% <100.00%> (-0.01%) ⬇️
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/minhash.py 93.98% <100.00%> (ø)
src/sourmash/search.py 97.68% <100.00%> (-0.22%) ⬇️
src/sourmash/sketchcomparison.py 100.00% <100.00%> (ø)
src/core/src/storage.rs
src/core/src/ffi/hyperloglog.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/index/sbt/mod.rs
src/core/src/ffi/index/revindex.rs
src/core/src/cmd.rs
... and 25 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8136258...f4f5935. Read the comment docs.

@bluegenes
Copy link
Contributor Author

bluegenes commented May 4, 2022

@ctb I think I need your thoughts on this.

test_gather_abund_10_1 is the test that's currently failing, but I think it's useful to think about what we want with for bp vs hashes.

I think .unique_covered_bp is ok -- we just add (ksize +1 ) to get the total number of hashes.

The issue is with intersect_bp -- if we add (ksize+1) to each intersection, then the total number of intersected bp ends up being more base pairs than we originally have in the query.

For the test, I think this is causing the test prod (=intersect_bp*average_abund) to be (very slightly) different from the f_unique_weighted we're calculating within gather, which uses the hash intersection, not bp.

@dkoslicki
Copy link
Collaborator

@bluegenes Don't know if this helps your test pass, and/or if I had a typo in my previous email, but since the number of k-mers is L-k+1, if you want to recover L (the number of unique k-mers), you should take num_kmers + k -1. The +1 -> -1 may affect things (and apologies if I led you astray!)

@bluegenes
Copy link
Contributor Author

bluegenes commented May 4, 2022

@dkoslicki excellent, thanks -- @ctb and I puzzled over that yesterday, and then I just left it as k+1 while trying to puzzle through the other part of this 😂

will fix - thanks!

@ctb
Copy link
Contributor

ctb commented May 5, 2022

I had trouble thinking this through.

So I wrote some code:

import sourmash
from sourmash.sourmash_args import SaveSignaturesToLocation

# sourmash sketch dna tests/test-data/short.fa -p scaled=1,k=21 -o short.fa.sig

ss = sourmash.load_one_signature('short.fa.sig')

# pull 'short.fa.sig' into individual k-mers => signatures

siglist = []
for n, hashval in enumerate(ss.minhash.hashes):
    mh = ss.minhash.copy_and_clear()
    mh.add_hash(hashval)
    subsig = sourmash.SourmashSignature(mh, name=f"hash{n}")
    siglist.append(subsig)

with SaveSignaturesToLocation('short.hashes.zip') as savesig:
    for sig in siglist:
        savesig.add(sig)

# then:
# sourmash gather short.fa.sig short.hashes.zip -o xxx.csv --threshold-bp=0

the idea I have is that our gather results should work for scaled=1, and that can maybe guide our results for scaled=1000.

note, here short.fa is an exactly 1kb sequence, with a leading 'N' in it.

I don't know if this helps think through the numbers but it at least makes sense to me :). I'll revisit later...

@ctb
Copy link
Contributor

ctb commented May 8, 2022

more reasoned take number 1:

(1) gather fundamentally works on hashes/k-mers, so we should recognize that it is not providing bp estimates and adjust internal comments/code and external documentation appropriately. For now, just make another issue and leave the gather outputs as they were before this PR.

take number 2, if you don't like number 1:

(2) gather is an iterative algorithm, so we should only to add the k - 1 once - presumably at the very beginning.

@bluegenes bluegenes changed the title [WIP] minhash unique covered bp [EXP] minhash unique covered bp May 12, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants