Optimized keccak implementation #8262
Conversation
Here I can see that you are unrolling a lot of loops and substituting raw values in place of memory references. Is this where most of the speedup comes from? Just curious how it was sped up so much, because that's very impressive. Good job!
Unrolling the loops makes it possible to remove the keccakf_rotc and keccakf_piln tables and substitute their values at compile time. This removes a lot of memory reads and pointer arithmetic from the loop. Unrolling also creates a lot of independent operations, allowing the compiler to schedule instructions however it sees best for a superscalar CPU to execute in parallel.
@SChernykh Have you tried benchmarking unrolling the outer loop?
The outer loop is just 2-3 instructions per iteration that could be saved, and unrolling it makes the code size too big (it doesn't fit into the micro-op cache on most CPUs). It gets slower.
LGTM - nice spot!!
My comments are minor
```diff
 uint64_t t, bc[5];

-for (round = 0; round < rounds; round++) {
+for (round = 0; round < rounds; ++round) {
```
Good explanation of why prefix increment is preferred over postfix in places like this, for anyone else curious:

> If you're the kind who worries about efficiency, you probably broke into a sweat when you first saw the postfix increment function. That function has to create a temporary object for its return value and the implementation above also creates an explicit temporary object that has to be constructed and destructed. The prefix increment function has no such temporaries...
I think in practice the compiler can optimize this for simple types. However, I do agree it should be pre-increment as a matter of code cleanliness.
> I think in practice the compiler can optimize this for simple types
It most definitely does, and has, for a very, very long time. That said, it's not as though the change makes the code less readable, and, when optimizing code, it can help to write the expressions as close to the desired operations as possible.
```c
st[17] = ROTL64(st[11], 10);
st[11] = ROTL64(st[ 7],  6);
st[ 7] = ROTL64(st[10],  3);
st[10] = ROTL64(t,  1);
```
(I explicitly tested that this section yields values equivalent to the old approach for all elements of st, just to check my sanity.)
All tests were conducted on the same PC (Ryzen 5 5600X running at a fixed 4.65 GHz).

Before:
```
test_cn_fast_hash<32> (100000 calls) - OK: 1 us/call
test_cn_fast_hash<16384> (1000 calls) - OK: 164 us/call
```

After:
```
test_cn_fast_hash<32> (100000 calls) - OK: 0 us/call
test_cn_fast_hash<16384> (1000 calls) - OK: 31 us/call
```

More than 5 times speedup for cn_fast_hash. Also noticed a consistent 1-2% improvement in test_construct_tx results.

@j-berman this should also speed up view tags #8061