-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MRG] circumvent a very slow MinHash.remove_many(...)
call in sourmash gather
#2123
Conversation
Codecov Report
@@ Coverage Diff @@
## latest #2123 +/- ##
==========================================
+ Coverage 84.30% 91.70% +7.39%
==========================================
Files 130 99 -31
Lines 15278 11017 -4261
Branches 2166 2167 +1
==========================================
- Hits 12880 10103 -2777
+ Misses 2095 611 -1484
Partials 303 303
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
MinHash.remove_many(...)
call in sourmash gather
MinHash.remove_many(...)
call in sourmash gather
ready for review & merge @sourmash-bio/devs |
@ctb if you happen to have the py-spy flamegraph for this run, it might be nice to have in here so we can visually compare with #1771 (comment) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other than adding the viz if you have it, this looks good to me!
added viz in PR description - thanks for the nudge! |
Nice! |
In #1771, we find that gather on a ~90% unidentified query has a very slow call to
MinHash.remove_many(...)
inGatherDatabases.__init__
. Here theremove_many
call is used to remove hashes without any overlaps in the prefetch signatures.An alternative approach would be to build a set of all hashes with some overlap, i.e. use the union of intersections.
And that's what this PR does - it replaces the single (slow) call to
remove_many
with many calls to intersection and then union. This is encapsulated in theCounterGather
class via new property,union_found
.Ref #1771 (comment)
Benchmarking
Using the data set mentioned here, on my laptop, I ran:
sudo time py-spy record -o latest.svg -- sourmash gather SRR10988543.10k.zip bins.10k.zip
and I see:
the
latest
branch: 214.63sthis branch: 65.9s
so that's much faster, yah. 😄
py-spy flamegraphs
from
latest
branch:from this PR: