
add more inlines for better efficiency #999

Merged · 7 commits · Dec 14, 2023

Conversation

@altkdf (Contributor) commented on Dec 5, 2023

If I add more inlines, performance improves in most of the current benchmarks on my machine (AMD EPYC 7302). Benchmarks were run with `RUSTFLAGS='-C target-cpu=native' cargo bench --features expose-field`.

    Running benches/ecdsa.rs
ecdsa/try_sign_prehashed
                        time:   [48.866 µs 48.901 µs 48.944 µs]
                        change: [-10.073% -8.5345% -6.8744%] (p = 0.00 < 0.05)
ecdsa/verify_prehashed  
                        time:   [103.95 µs 104.43 µs 105.00 µs]
                        change: [-12.366% -11.807% -11.312%] (p = 0.00 < 0.05)

     Running benches/field.rs
field element operations/normalize_weak
                        time:   [3.1100 ns 3.1228 ns 3.1391 ns]
                        change: [-44.824% -44.193% -43.486%] (p = 0.00 < 0.05)
field element operations/normalize
                        time:   [10.600 ns 10.665 ns 10.757 ns]
                        change: [-25.955% -25.278% -24.669%] (p = 0.00 < 0.05)
field element operations/mul
                        time:   [22.809 ns 22.943 ns 23.120 ns]
                        change: [-22.698% -22.031% -21.383%] (p = 0.00 < 0.05)
field element operations/square
                        time:   [17.109 ns 17.158 ns 17.223 ns]
                        change: [-24.579% -24.168% -23.815%] (p = 0.00 < 0.05)
field element operations/invert
                        time:   [4.8184 µs 4.8271 µs 4.8384 µs]
                        change: [-26.792% -26.425% -26.113%] (p = 0.00 < 0.05)
field element operations/sqrt
                        time:   [4.9200 µs 4.9255 µs 4.9332 µs]
                        change: [-24.270% -23.922% -23.657%] (p = 0.00 < 0.05)

     Running benches/scalar.rs
high-level operations/point-scalar mul
                        time:   [56.610 µs 56.936 µs 57.360 µs]
                        change: [-11.544% -10.793% -10.106%] (p = 0.00 < 0.05)
high-level operations/mul_by_generator naive
                        time:   [56.326 µs 56.422 µs 56.535 µs]
                        change: [-11.644% -11.084% -10.507%] (p = 0.00 < 0.05)
high-level operations/mul_by_generator precomputed
                        time:   [27.987 µs 28.020 µs 28.066 µs]
                        change: [-9.8668% -8.9146% -8.1681%] (p = 0.00 < 0.05)
high-level operations/lincomb via mul+add
                        time:   [112.67 µs 113.36 µs 114.37 µs]
                        change: [-9.3266% -8.7609% -8.0640%] (p = 0.00 < 0.05)
high-level operations/lincomb()
                        time:   [89.727 µs 90.030 µs 90.401 µs]
                        change: [-9.4576% -8.5924% -7.8728%] (p = 0.00 < 0.05)
scalar operations/sub   
                        time:   [6.2334 ns 6.2527 ns 6.2782 ns]
                        change: [-18.510% -17.243% -16.189%] (p = 0.00 < 0.05)
scalar operations/add   
                        time:   [8.1355 ns 8.1517 ns 8.1712 ns]
                        change: [-20.691% -19.968% -19.352%] (p = 0.00 < 0.05)
scalar operations/mul   
                        time:   [46.323 ns 46.443 ns 46.596 ns]
                        change: [-0.6997% -0.1909% +0.2646%] (p = 0.46 > 0.05)
scalar operations/negate
                        time:   [7.6674 ns 7.6991 ns 7.7444 ns]
                        change: [-19.449% -19.121% -18.801%] (p = 0.00 < 0.05)
scalar operations/invert
                        time:   [15.156 µs 15.167 µs 15.180 µs]
                        change: [-1.0203% -0.4949% -0.1341%] (p = 0.01 < 0.05)

This PR also wraps the benchmark input variables in criterion::black_box() so that they are not optimized away.
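For illustration, a minimal sketch of that pattern, using a hypothetical stand-in function rather than the actual k256 benchmark code:

```rust
use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Hypothetical stand-in for a field operation; the real benchmarks call into k256.
fn mul(a: u64, b: u64) -> u64 {
    a.wrapping_mul(b)
}

fn bench_mul(c: &mut Criterion) {
    let (a, b) = (0x1234_5678_9abc_def0_u64, 0x0fed_cba9_8765_4321_u64);
    c.bench_function("mul", |bencher| {
        // black_box hides the inputs from the optimizer so the call
        // cannot be constant-folded away.
        bencher.iter(|| mul(black_box(a), black_box(b)))
    });
}

criterion_group!(benches, bench_mul);
criterion_main!(benches);
```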

@altkdf (Contributor, Author) commented on Dec 5, 2023

On my Mac (M1), the numbers look similar. It would be nice to also benchmark this on an ordinary x86 machine.

     Running benches/ecdsa.rs
ecdsa/try_sign_prehashed
                        time:   [33.505 µs 33.559 µs 33.626 µs]
                        change: [-9.5527% -9.4414% -9.3221%] (p = 0.00 < 0.05)
ecdsa/verify_prehashed  
                        time:   [64.628 µs 64.705 µs 64.825 µs]
                        change: [-12.094% -11.987% -11.868%] (p = 0.00 < 0.05)

     Running benches/field.rs
field element operations/normalize_weak
                        time:   [2.0719 ns 2.0731 ns 2.0749 ns]
                        change: [-66.954% -65.750% -64.414%] (p = 0.00 < 0.05)
field element operations/normalize
                        time:   [10.731 ns 10.733 ns 10.736 ns]
                        change: [-4.0784% -3.0586% -1.9188%] (p = 0.00 < 0.05)
field element operations/mul
                        time:   [14.358 ns 14.364 ns 14.371 ns]
                        change: [-13.244% -13.133% -13.035%] (p = 0.00 < 0.05)
field element operations/square
                        time:   [10.713 ns 10.719 ns 10.726 ns]
                        change: [-14.165% -14.095% -14.026%] (p = 0.00 < 0.05)
field element operations/invert
                        time:   [4.0922 µs 4.0970 µs 4.1036 µs]
                        change: [-24.712% -24.549% -24.393%] (p = 0.00 < 0.05)
field element operations/sqrt
                        time:   [4.0430 µs 4.0470 µs 4.0525 µs]
                        change: [-24.461% -24.355% -24.224%] (p = 0.00 < 0.05)

     Running benches/scalar.rs
high-level operations/point-scalar mul
                        time:   [35.023 µs 35.036 µs 35.054 µs]
                        change: [-12.498% -12.413% -12.336%] (p = 0.00 < 0.05)
high-level operations/mul_by_generator naive
                        time:   [35.023 µs 35.031 µs 35.041 µs]
                        change: [-12.529% -12.444% -12.369%] (p = 0.00 < 0.05)
high-level operations/mul_by_generator precomputed
                        time:   [16.422 µs 16.425 µs 16.429 µs]
                        change: [-11.650% -11.576% -11.502%] (p = 0.00 < 0.05)
high-level operations/lincomb via mul+add
                        time:   [70.348 µs 70.637 µs 70.986 µs]
                        change: [-12.356% -12.186% -11.959%] (p = 0.00 < 0.05)
high-level operations/lincomb()
                        time:   [54.505 µs 54.580 µs 54.678 µs]
                        change: [-12.645% -12.500% -12.362%] (p = 0.00 < 0.05)
scalar operations/sub   
                        time:   [2.5533 ns 2.5546 ns 2.5564 ns]
                        change: [-26.584% -26.510% -26.441%] (p = 0.00 < 0.05)
scalar operations/add   
                        time:   [3.8494 ns 3.8515 ns 3.8541 ns]
                        change: [-22.241% -22.157% -22.077%] (p = 0.00 < 0.05)
scalar operations/mul   
                        time:   [27.433 ns 27.447 ns 27.466 ns]
                        change: [-1.2385% -1.0699% -0.9246%] (p = 0.00 < 0.05)
scalar operations/negate
                        time:   [4.7519 ns 4.7537 ns 4.7563 ns]
                        change: [+1.8517% +1.9516% +2.0527%] (p = 0.00 < 0.05)
scalar operations/invert
                        time:   [12.537 µs 12.545 µs 12.557 µs]
                        change: [-0.1999% -0.1031% -0.0059%] (p = 0.03 < 0.05)

@altkdf (Contributor, Author) commented on Dec 5, 2023

I likely overdid it with the inlines and some could probably be removed if that's important.

@altkdf marked this pull request as ready for review on December 6, 2023, 13:59
@tarcieri (Member) commented on Dec 6, 2023

That's a lot of inline(always) specifically. We've had various reports of impacts on compile times in the past. I'm curious what the impact here would be.

I agree there are definitely measurable places it helps, but it would be nice to check each one's impact versus just an #[inline]
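For reference, the distinction in question, sketched on a hypothetical helper (not actual k256 code): `#[inline]` is a hint that also makes the function eligible for inlining across crates, while `#[inline(always)]` asks the compiler to inline at every call site, which tends to cost compile time.

```rust
// Hint: the compiler may or may not inline this, but it becomes a
// candidate for cross-crate inlining.
#[inline]
pub fn mul_hint(a: u64, b: u64) -> u64 {
    a.wrapping_mul(b)
}

// Strong request: inline at every call site; can speed up hot paths but
// increases compile times and code size.
#[inline(always)]
pub fn mul_forced(a: u64, b: u64) -> u64 {
    a.wrapping_mul(b)
}
```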

@altkdf (Contributor, Author) commented on Dec 6, 2023

> That's a lot of inline(always) specifically. We've had various reports of impacts on compile times in the past. I'm curious what the impact here would be.
>
> I agree there are definitely measurable places it helps, but it would be nice to check each one's impact versus just an #[inline]

I agree. Would this be something that could be merged if I reduce the inlines to the minimum subset for the improvement?

However, in general it might make sense to inline some calls up into the caller so that the compiler at least has the theoretical possibility of applying optimizations such as vectorization or elimination of redundant computation, e.g. when several multiplications/squarings happen in the same function. Deciding whether that actually pays off would of course require additional benchmarks. If it turns out that we do need many inlines, a feature flag for aggressive inlining could be an option: performance-critical applications enable it, while everyone else keeps the shorter compile times (see the sketch after the timings below).

without inlines
RUSTFLAGS='-C target-cpu=native' cargo rebuild --release 20.68s user 8.13s system 444% cpu 6.478 total
with inlines
RUSTFLAGS='-C target-cpu=native' cargo rebuild --release 27.51s user 8.90s system 342% cpu 10.637 total
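A rough sketch of the feature-flag idea mentioned above, assuming a hypothetical inline-aggressive Cargo feature (this is not part of the PR):

```rust
// Hypothetical: gate forced inlining behind a crate feature so that only
// performance-critical builds pay the extra compile time.
#[cfg_attr(feature = "inline-aggressive", inline(always))]
#[cfg_attr(not(feature = "inline-aggressive"), inline)]
pub fn square(a: u64) -> u64 {
    a.wrapping_mul(a)
}
```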

@tarcieri (Member) commented on Dec 6, 2023

I would feel a lot better about merging the ones where you saw measurable improvements.

@altkdf (Contributor, Author) commented on Dec 13, 2023

The smallest subset of operations that improves the performance of point-scalar multiplication (the operation we are most interested in) seems to be just 5 inlines.

    Running benches/ecdsa.rs
ecdsa/try_sign_prehashed
                        time:   [49.439 µs 49.443 µs 49.448 µs]
                        change: [-7.7134% -7.6561% -7.6101%] (p = 0.00 < 0.05)
ecdsa/verify_prehashed
                        time:   [104.36 µs 104.40 µs 104.44 µs]
                        change: [-9.0640% -8.9887% -8.9015%] (p = 0.00 < 0.05)

     Running benches/field.rs
field element operations/normalize_weak
                        time:   [5.4543 ns 5.4585 ns 5.4639 ns]
                        change: [+2.5803% +2.7429% +2.8993%] (p = 0.00 < 0.05)
field element operations/normalize
                        time:   [13.891 ns 13.896 ns 13.904 ns]
                        change: [+2.3087% +2.3631% +2.4168%] (p = 0.00 < 0.05)
field element operations/mul
                        time:   [28.076 ns 28.077 ns 28.079 ns]
                        change: [-5.5033% -5.4694% -5.4385%] (p = 0.00 < 0.05)
field element operations/square
                        time:   [23.221 ns 23.249 ns 23.284 ns]
                        change: [+1.4764% +1.5799% +1.7015%] (p = 0.00 < 0.05)
field element operations/invert
                        time:   [6.4245 µs 6.4256 µs 6.4267 µs]
                        change: [-0.4908% -0.4317% -0.3845%] (p = 0.00 < 0.05)
field element operations/sqrt
                        time:   [6.3270 µs 6.3284 µs 6.3302 µs]
                        change: [-1.3848% -1.3421% -1.2839%] (p = 0.00 < 0.05)

     Running benches/scalar.rs
high-level operations/point-scalar mul
                        time:   [56.298 µs 56.336 µs 56.403 µs]
                        change: [-11.006% -10.954% -10.888%] (p = 0.00 < 0.05)
high-level operations/point mul with different scalars in same fn
                        time:   [112.06 µs 112.09 µs 112.11 µs]
                        change: [-10.924% -10.851% -10.795%] (p = 0.00 < 0.05)
high-level operations/mul_by_generator naive
                        time:   [56.303 µs 56.312 µs 56.320 µs]
                        change: [-10.932% -10.887% -10.842%] (p = 0.00 < 0.05)
high-level operations/mul_by_generator precomputed
                        time:   [28.330 µs 28.344 µs 28.370 µs]
                        change: [-8.9773% -8.9514% -8.9196%] (p = 0.00 < 0.05)
high-level operations/lincomb via mul+add
                        time:   [113.25 µs 113.26 µs 113.28 µs]
                        change: [-10.719% -10.684% -10.652%] (p = 0.00 < 0.05)
high-level operations/lincomb()
                        time:   [89.162 µs 89.192 µs 89.222 µs]
                        change: [-11.048% -10.989% -10.933%] (p = 0.00 < 0.05)
scalar operations/sub
                        time:   [7.3243 ns 7.3518 ns 7.3823 ns]
                        change: [-0.9813% -0.5357% -0.0583%] (p = 0.02 < 0.05)
scalar operations/add
                        time:   [9.4150 ns 9.4291 ns 9.4490 ns]
                        change: [-1.9041% -1.4095% -0.8967%] (p = 0.00 < 0.05)
scalar operations/mul
                        time:   [46.885 ns 46.906 ns 46.928 ns]
                        change: [-0.2431% -0.1699% -0.0957%] (p = 0.00 < 0.05)
scalar operations/negate
                        time:   [9.8076 ns 9.8120 ns 9.8197 ns]
                        change: [+1.7637% +1.8514% +1.9584%] (p = 0.00 < 0.05)
scalar operations/invert
                        time:   [14.804 µs 14.808 µs 14.813 µs]
                        change: [-0.6235% -0.5937% -0.5657%] (p = 0.00 < 0.05)

@altkdf (Contributor, Author) commented on Dec 13, 2023

I also manually checked that removing (always) from 3 out of 5 inlines doesn't hurt point-multiplication performance.

@tarcieri merged commit 0ed00f7 into RustCrypto:master on Dec 14, 2023
19 checks passed
@tarcieri mentioned this pull request on Jan 8, 2024