Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] speed up SeqToHashes translate #1946

Merged
merged 3 commits into from
Apr 13, 2022
Merged

[MRG] speed up SeqToHashes translate #1946

merged 3 commits into from
Apr 13, 2022

Conversation

mr-eyes
Copy link
Member

@mr-eyes mr-eyes commented Apr 12, 2022

Vectors are very good in insertions, not in deletion. In #1938 I replaced Vec with VecDeque because I was popping out a hash per iteration. This deletion is very slow in Vec because it shifts all the elements in memory, while it's very fast in VecDeque.

In this PR I am reverting back to Vec and replacing all the parts that require deletion or popping by smoother code. Now there's no need for using VecDeque.

Resolves #1945

@codecov
Copy link

codecov bot commented Apr 12, 2022

Codecov Report

Merging #1946 (671382b) into latest (01119a2) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           latest    #1946   +/-   ##
=======================================
  Coverage   82.93%   82.94%           
=======================================
  Files         125      125           
  Lines       13755    13759    +4     
  Branches     1877     1877           
=======================================
+ Hits        11408    11412    +4     
  Misses       2075     2075           
  Partials      272      272           
Flag Coverage Δ
python 90.96% <ø> (ø)
rust 65.16% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/core/src/signature.rs 68.34% <100.00%> (+0.35%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 01119a2...671382b. Read the comment docs.

@mr-eyes
Copy link
Member Author

mr-eyes commented Apr 12, 2022

Ready for review @ctb @luizirber
I believe it's better now :)

before and after timing in #1945

Copy link
Contributor

@ctb ctb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! But I have to admit I don't understand why the new code is so much faster 😭 . Could you maybe add some comments?

@mr-eyes
Copy link
Member Author

mr-eyes commented Apr 12, 2022

@ctb sure!

Here I am explaining the tradeoff of using Vec vs VecDeque and why I reverted back to Vec.

Vectors are very good in insertions, not in deletion. In #1938 I replaced Vec with VecDeque because I was popping out a hash per iteration. This deletion is very slow in Vec because it shifts all the elements in memory, while it's very fast in VecDeque.

In this PR I am reverting back to Vec and replacing all the parts that require deletion or popping by smoother code. Now there's no need for using VecDeque.


SeqToHashes is mainly an iterator (like a Python generator); In every iteration it hashes and return a kmer. Only in the translate mode, I hash all the kmers for all frames at once, and then consume a hash per iteration. The consuming/yelding part was the problem, I was popping an item in every yield until the vector of hashes is empty (this is very slow in vectors). In this PR I replaced the popping out with an vector indeces iterator and it will keep moving through the vector until it reaches the last element. Hope I well explained it :)

translate_iter_step: 0,

if self.translate_iter_step == self.hashes_buffer.len() {
self.hashes_buffer.clear();
self.kmer_index = self.max_index;
return Some(Ok(0));
}
let curr_idx = self.translate_iter_step;
self.translate_iter_step += 1;
Some(Ok(self.hashes_buffer[curr_idx]))

@ctb
Copy link
Contributor

ctb commented Apr 12, 2022

ahh, got it! but - is this something that is worth adding in comments in the actual code, do you think?

@ctb
Copy link
Contributor

ctb commented Apr 12, 2022

(I tend to think that something that required the amount of discussion this issue had is worth documenting in the code.)

@mr-eyes
Copy link
Member Author

mr-eyes commented Apr 12, 2022

I spent sometime to understand what I was doing in the code, so, yes I will add some comments.

@mr-eyes
Copy link
Member Author

mr-eyes commented Apr 13, 2022

@ctb I think it's ready now.

@ctb
Copy link
Contributor

ctb commented Apr 13, 2022

cool! I mean, it's been approved for a while, but I can do the button push ;)

@ctb ctb merged commit 3fcf2db into latest Apr 13, 2022
@ctb ctb deleted the mo/speedup_seqToHashes branch April 13, 2022 21:06
@mr-eyes
Copy link
Member Author

mr-eyes commented Apr 13, 2022

@ctb oh, ok :D next time I will merge when approved :) Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SeqToHashes needs more optimization for super long sequences
2 participants