
Optimized keccak implementation #8262

Merged (1 commit into monero-project:master on May 10, 2022)

Conversation

@SChernykh (Contributor)
All tests were conducted on the same PC (Ryzen 5 5600X running at fixed 4.65 GHz).

Before:

test_cn_fast_hash<32> (100000 calls) - OK: 1 us/call
test_cn_fast_hash<16384> (1000 calls) - OK: 164 us/call

After:

test_cn_fast_hash<32> (100000 calls) - OK: 0 us/call
test_cn_fast_hash<16384> (1000 calls) - OK: 31 us/call

More than a 5x speedup for cn_fast_hash.

Also noticed consistent 1-2% improvement in test_construct_tx results. @j-berman this should also speed up view tags #8061

@jeffro256 (Contributor)

Here I can see that you are unrolling a lot of loops and substituting raw values in place of memory references. Is this where most of the speedup comes from? Just curious how it was sped up so much, because that's very impressive. Good job!

@SChernykh (Contributor, Author)

Unrolling the loops makes it possible to remove the keccakf_rotc and keccakf_piln tables and substitute their values at compile time. That eliminates a lot of memory reads and pointer arithmetic from the loop. Unrolling also produces many independent operations, which lets the compiler shuffle the instructions however it sees best for a superscalar CPU to execute them in parallel.

@jeffro256 (Contributor)

@SChernykh Have you tried benchmarking unrolling the outer loop, for (round = 0; round < rounds; ++round)? Since it's a larger loop, it may not provide much benefit, but it would be interesting to see.

@SChernykh (Contributor, Author)

The outer loop only costs 2-3 instructions per iteration that could be saved, and unrolling it makes the code size too big (it no longer fits into the micro-op cache on most CPUs). It gets slower.

@j-berman (Collaborator) left a comment

LGTM - nice spot!!

My comments are minor

  uint64_t t, bc[5];

- for (round = 0; round < rounds; round++) {
+ for (round = 0; round < rounds; ++round) {
Collaborator:

Good explanation why prefix increment is preferred over postfix in places like this for anyone else curious:

If you're the kind who worries about efficiency, you probably broke into a sweat when you first saw the postfix increment function. That function has to create a temporary object for its return value and the implementation above also creates an explicit temporary object that has to be constructed and destructed. The prefix increment function has no such temporaries...

Contributor:

I think in practice the compiler can optimize this for simple types. However, I do agree it should be pre-increment as a matter of code cleanliness.

"I think in practice the compiler can optimize this for simple types"

It most definitely does, and has, for a very, very long time. That said, it's not as though the change makes the code less readable, and, when optimizing code, it can help to write the expressions as close to the desired operations as possible.

src/crypto/keccak.c (outdated, resolved):
st[17] = ROTL64(st[11], 10);
st[11] = ROTL64(st[ 7], 6);
st[ 7] = ROTL64(st[10], 3);
st[10] = ROTL64(t, 1);
Collaborator:
(I explicitly tested that this section yields values equivalent to the old approach for all elements of st, just to check my sanity.)

src/crypto/keccak.c (outdated, resolved)
@luigi1111 luigi1111 merged commit 1561513 into monero-project:master May 10, 2022
@SChernykh SChernykh deleted the keccak-opt branch May 14, 2022 15:15