[MRG] circumvent a very slow MinHash.remove_many(...) call in sourmash gather #2123

Merged: ctb merged 3 commits into latest from avoid_remove_many on Jul 18, 2022

ctb
Contributor

@ctb ctb commented Jul 17, 2022

In #1771, we found that running gather on a ~90% unidentified query spends a long time in a single call to MinHash.remove_many(...) in GatherDatabases.__init__. That call removes the query hashes that have no overlap with any of the prefetch signatures.

An alternative approach is to build the set of all hashes that do have some overlap, i.e. to take the union of the intersections.

That's what this PR does: it replaces the single (slow) call to remove_many with many calls to intersection followed by a union. This is encapsulated in the CounterGather class via a new property, union_found.
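
To illustrate why the two approaches give the same result, here is a minimal sketch that uses plain Python sets in place of MinHash objects. The function names are hypothetical and are not part of the sourmash API; the actual change lives in CounterGather and may differ in detail.

```python
# Illustrative only: plain Python sets stand in for MinHash hash collections.
# The function names are hypothetical, not sourmash APIs.

def found_by_removal(query_hashes, match_hash_sets):
    """Removal-style approach: collect the query hashes with no overlap,
    then remove them from the query in one large step."""
    all_match_hashes = set().union(*match_hash_sets)
    unmatched = query_hashes - all_match_hashes
    # This subtraction plays the role of the single remove_many-style call.
    return query_hashes - unmatched


def found_by_union_of_intersections(query_hashes, match_hash_sets):
    """Alternative approach: intersect the query with each match and
    accumulate the union of those intersections."""
    found = set()
    for match_hashes in match_hash_sets:
        found |= query_hashes & match_hashes
    return found


query = set(range(1, 11))                  # hashes 1..10
matches = [{1, 2, 50}, {2, 3, 60}, {9, 70}]

by_removal = found_by_removal(query, matches)
by_union = found_by_union_of_intersections(query, matches)
assert by_removal == by_union
```

In this toy example, the removal-style version has to handle six unmatched hashes to keep four, while the union version only ever touches the four that overlap. Presumably the same asymmetry is what matters for a mostly unidentified query, where the set to remove is much larger than the set to keep.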

Ref #1771 (comment)

Benchmarking

Using the data set mentioned here, on my laptop, I ran:

sudo time py-spy record -o latest.svg -- sourmash gather SRR10988543.10k.zip bins.10k.zip

and I see:

the latest branch: 214.63s
this branch: 65.9s

so that's much faster, yah. 😄
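
For a rough sense of scale, the two timings above work out to about a 3.3x speedup. A small sketch of the arithmetic, using only the numbers reported above (the variable names are illustrative, and these are single runs on one laptop, not a controlled benchmark):

```python
# Wall-clock timings reported above, in seconds.
latest_seconds = 214.63
branch_seconds = 65.9

speedup = latest_seconds / branch_seconds
fraction_saved = 1 - branch_seconds / latest_seconds

print(f"speedup: {speedup:.2f}x, time saved: {fraction_saved:.0%}")
```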

py-spy flamegraphs

from latest branch:

[Screenshot: py-spy flamegraph from the latest branch, taken 2022-07-17]

from this PR:

[Screenshot: py-spy flamegraph from this PR, taken 2022-07-18]

@codecov

codecov bot commented Jul 17, 2022

Codecov Report

Merging #2123 (d8cb973) into latest (401ba48) will increase coverage by 7.39%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2123      +/-   ##
==========================================
+ Coverage   84.30%   91.70%   +7.39%     
==========================================
  Files         130       99      -31     
  Lines       15278    11017    -4261     
  Branches     2166     2167       +1     
==========================================
- Hits        12880    10103    -2777     
+ Misses       2095      611    -1484     
  Partials      303      303              
Flag Coverage Δ
python 91.70% <100.00%> (+0.01%) ⬆️
rust ?

Impacted Files Coverage Δ
src/sourmash/commands.py 88.95% <100.00%> (+0.06%) ⬆️
src/sourmash/index/__init__.py 96.71% <100.00%> (+0.05%) ⬆️
src/sourmash/search.py 97.93% <100.00%> (+<0.01%) ⬆️
src/core/src/index/search.rs
src/core/src/errors.rs
src/core/src/index/sbt/mod.rs
src/core/src/signature.rs
src/core/src/ffi/mod.rs
src/core/src/lib.rs
src/core/src/ffi/index/mod.rs
... and 24 more

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 401ba48...d8cb973. Read the comment docs.

@ctb ctb changed the title from "[EXP] circument a very slow MinHash.remove_many(...) call in sourmash gather" to "[MRG] circumvent a very slow MinHash.remove_many(...) call in sourmash gather" Jul 17, 2022
@ctb
Contributor Author

ctb commented Jul 18, 2022

ready for review & merge @sourmash-bio/devs

@bluegenes
Contributor

Using the data set mentioned #1771 (comment), on my laptop, I ran:
sudo time py-spy record -o latest.svg -- sourmash gather SRR10988543.10k.zip bins.10k.zip
and I see:
the latest branch: 214.63s
this branch: 65.9s
so that's much faster, yah. 😄

@ctb if you happen to have the py-spy flamegraph for this run, it might be nice to have in here so we can visually compare with #1771 (comment)

Contributor

@bluegenes bluegenes left a comment

other than adding the viz if you have it, this looks good to me!

@ctb
Contributor Author

ctb commented Jul 18, 2022

added viz in PR description - thanks for the nudge!

@ctb ctb merged commit 526f785 into latest Jul 18, 2022
@ctb ctb deleted the avoid_remove_many branch July 18, 2022 16:29
@mr-eyes
Member

mr-eyes commented Jul 18, 2022

Nice!

3 participants