Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] speed up prefetch by adjusting calls to use len(minhash) #2132

Merged
merged 1 commit into from
Jul 20, 2022

Conversation

ctb
Copy link
Contributor

@ctb ctb commented Jul 20, 2022

Several comparison dataclasses were introduced in #1955 and #2050, to facilitate gather/prefetch output and ANI calculation. Some of the methods in these dataclasses use len(minhash.hashes) which creates a new Python object and is typically quite slow.

This PR replaces len(minhash.hashes) in several key locations with len(minhash), which returns the same result but does so directly in Rust; on at least one benchmark, the speedup is about 10x (see numbers below).

Fixes #2124

timings

Using the benchmark from #1771 (comment) and also used here, we find:

This PR:

49.20 real        46.12 user         3.23 sys

latest branch / v4.4.2 -

469.38 real       419.82 user        47.47 sys

flamegraph profile for this PR

Screen Shot 2022-07-20 at 12 28 46 PM

flamegraph profile for latest branch (equiv v4.4.2)

Screen Shot 2022-07-20 at 1 22 01 PM

@ctb ctb changed the title [WIP] speed up ANI calculations on prefetch by adjusting calls to use len(MinHash) [MRG] speed up ANI calculations on prefetch by adjusting calls to use len(MinHash) Jul 20, 2022
@ctb ctb changed the title [MRG] speed up ANI calculations on prefetch by adjusting calls to use len(MinHash) [MRG] speed up prefetch by adjusting calls to use len(minhash) Jul 20, 2022
@codecov
Copy link

codecov bot commented Jul 20, 2022

Codecov Report

Merging #2132 (0d94cde) into latest (2054391) will increase coverage by 7.38%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           latest    #2132      +/-   ##
==========================================
+ Coverage   84.31%   91.70%   +7.38%     
==========================================
  Files         130       99      -31     
  Lines       15293    11017    -4276     
  Branches     2167     2167              
==========================================
- Hits        12895    10103    -2792     
+ Misses       2095      611    -1484     
  Partials      303      303              
Flag Coverage Δ
python 91.70% <100.00%> (ø)
rust ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/sourmash/minhash.py 94.02% <100.00%> (ø)
src/sourmash/search.py 97.93% <100.00%> (ø)
src/core/src/ffi/nodegraph.rs
src/core/src/index/bigsi.rs
src/core/src/index/search.rs
src/core/src/cmd.rs
src/core/src/index/linear.rs
src/core/src/ffi/cmd/compute.rs
src/core/src/ffi/mod.rs
src/core/tests/test.rs
... and 23 more

Copy link
Contributor

@bluegenes bluegenes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat, did not realize that was the distinction between the two.

Looks great, thx!

@ctb
Copy link
Contributor Author

ctb commented Jul 20, 2022

Neat, did not realize that was the distinction between the two.

it took me a while but the flamegraph/benchmarking made it really clear!

(I knew that len(...) shouldn't be slow. Figuring out that it was slow was the important part ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sourmash prefetch is ~10 times slower than sourmash gather?!
2 participants