
Optimized keccak implementation #8262

Merged (1 commit into monero-project:master on May 10, 2022)

Conversation

@SChernykh (Contributor)
All tests were conducted on the same PC (Ryzen 5 5600X running at fixed 4.65 GHz).

Before:

test_cn_fast_hash<32> (100000 calls) - OK: 1 us/call
test_cn_fast_hash<16384> (1000 calls) - OK: 164 us/call

After:

test_cn_fast_hash<32> (100000 calls) - OK: 0 us/call
test_cn_fast_hash<16384> (1000 calls) - OK: 31 us/call

More than a 5x speedup for cn_fast_hash.

Also noticed consistent 1-2% improvement in test_construct_tx results. @j-berman this should also speed up view tags #8061

@jeffro256 (Contributor)

Here I can see that you are unrolling a lot of loops and substituting raw values in place of memory references. Is this where most of the speedup comes from? Just curious how it was sped up so much, because that's very impressive. Good job!

@SChernykh (Contributor, Author)

Unrolling the loops makes it possible to remove the keccakf_rotc and keccakf_piln tables and substitute their values at compile time. That eliminates a lot of memory reads and pointer arithmetic from the loop. Unrolling also produces many independent operations, which lets the compiler shuffle the instructions however it sees best for a superscalar CPU to execute them in parallel.

@jeffro256 (Contributor)

@SChernykh Have you tried benchmarking unrolling the outer loop, for (round = 0; round < rounds; ++round)? Since it's a larger loop, it may not provide much benefit, but it would be interesting to see.

@SChernykh (Contributor, Author)

The outer loop only costs 2-3 instructions per iteration that could be saved, and unrolling it makes the code size too big (it no longer fits into the micro-op cache on most CPUs). It gets slower.

@j-berman (Collaborator) left a comment

LGTM - nice spot!!

My comments are minor

  uint64_t t, bc[5];

- for (round = 0; round < rounds; round++) {
+ for (round = 0; round < rounds; ++round) {
Collaborator:

Good explanation why prefix increment is preferred over postfix in places like this for anyone else curious:

If you're the kind who worries about efficiency, you probably broke into a sweat when you first saw the postfix increment function. That function has to create a temporary object for its return value and the implementation above also creates an explicit temporary object that has to be constructed and destructed. The prefix increment function has no such temporaries...

Contributor:

I think in practice the compiler can optimize this for simple types. However, I do agree it should be pre-increment as a matter of code cleanliness.

"I think in practice the compiler can optimize this for simple types"

It most definitely does, and has, for a very, very long time. That said, it's not as though the change makes the code less readable, and, when optimizing code, it can help to write the expressions as close to the desired operations as possible.

src/crypto/keccak.c (outdated, resolved):
st[17] = ROTL64(st[11], 10);
st[11] = ROTL64(st[ 7], 6);
st[ 7] = ROTL64(st[10], 3);
st[10] = ROTL64(t, 1);
Collaborator:
(I explicitly tested that this section yields values equivalent to the old approach for all elements of st, just to check my sanity.)

src/crypto/keccak.c (outdated, resolved)
@luigi1111 luigi1111 merged commit 1561513 into monero-project:master May 10, 2022
@SChernykh SChernykh deleted the keccak-opt branch May 14, 2022 15:15