
Optimized linear combination of points #380

Merged: 2 commits merged into RustCrypto:master on Jul 18, 2021
Conversation

fjarri
Contributor

@fjarri fjarri commented Jul 14, 2021

The goal of this PR is to optimize a commonly used operation (x * k + y * l) where x and y are curve points, and k and l are scalars. It is currently used in verify.rs and recoverable.rs.

On my machine it speeds up ECDSA verification from 176us to 145us (and the newly added benchmark shows speed-up from 157us to 124us for linear combination alone).

Currently the PR contains a lincomb_generic() that takes arrays, and it has two aliases: mul() (internal, used in operator traits) and lincomb() (public, for two-point linear combinations).

@codecov-commenter

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #380 (30ab376) into master (2c12fc5) will decrease coverage by 0.37%.
The diff coverage is 40.35%.


@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
- Coverage   57.92%   57.54%   -0.38%     
==========================================
  Files          29       29              
  Lines        4036     4146     +110     
==========================================
+ Hits         2338     2386      +48     
- Misses       1698     1760      +62     
Impacted Files Coverage Δ
k256/src/arithmetic.rs 100.00% <ø> (ø)
k256/src/arithmetic/mul.rs 59.41% <40.18%> (-29.66%) ⬇️
k256/src/ecdsa/verify.rs 43.39% <42.85%> (+0.53%) ⬆️
k256/src/arithmetic/projective.rs 77.04% <0.00%> (+0.77%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@tarcieri tarcieri left a comment


I think something like this is fine to get started.

  • how should we expose it to the user (and should we expose it at all?). As a function or as a "LinearCombination" trait of ProjectivePoint? In a submodule? Gated by a feature?

I think initially it's fine to just have an internal implementation specific to this use case.

It would be interesting to expose eventually, but raises a lot of questions. For example: should a trait like LinearCombination go in the group crate?

I would prefer to cross that bridge when we get there.

  • a generalization for n-term linear combination (even just joining mul_windowed() and lincomb() would be nice, they have pretty much identical code)

The latter seems worth exploring, if it can be done without impacting performance.

  • in verify.rs and recoverable.rs one of the points of the linear combination is a generator. So it can be sped up even more by pre-calculating generator powers. Should it be done in this PR or in another one?

I'd prefer to look at that in a separate PR. It definitely sounds like a good idea, but that's an area where it might make sense to feature gate it, as I expect there will be a tradeoff in code size which would impact embedded users who might prefer smaller code size over better performance.

@hdevalence

Two traits that could be useful as design inspiration are the curve25519-dalek MultiscalarMul and VartimeMultiscalarMul.

IME there's pretty minimal cost to making an arbitrary-length multiscalar multiplication API over a double-base scalar multiplication API. Basically the only downside is that the lookup tables have to be heap-allocated, but this is a pretty minimal cost even for the double-base case.

This is just my opinion, but I think it's preferable to have only the arbitrary-length version, for reduced API surface. If you're going to have a specialized double-base scmul function, it's useful to have it hardcode one of the points (so it can use a bigger, statically precomputed lookup table).
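The arbitrary-length shape with heap-allocated lookup tables that @hdevalence describes can be sketched like this (again over i64 as a toy group; names and the windowing choice are illustrative, not curve25519-dalek's actual API):

```rust
/// Toy Straus/windowed multiscalar multiplication: one heap-allocated
/// 4-bit lookup table per point, with the doubling chain shared by all
/// terms. Variable-time: the table indices depend on the scalars.
fn multiscalar_toy(scalars: &[u64], points: &[i64]) -> i64 {
    assert_eq!(scalars.len(), points.len());
    // tables[i][j] = j * points[i] for j in 0..16
    let tables: Vec<[i64; 16]> = points
        .iter()
        .map(|&p| core::array::from_fn(|j| j as i64 * p))
        .collect();
    let mut acc = 0i64;
    for w in (0..16).rev() {
        // four shared doublings per 4-bit window
        for _ in 0..4 {
            acc *= 2;
        }
        // one table lookup and one addition per term per window
        for (k, table) in scalars.iter().zip(&tables) {
            acc += table[((k >> (4 * w)) & 0xf) as usize];
        }
    }
    acc
}
```

This makes the "minimal cost" point concrete: the doubling chain is paid once regardless of how many terms there are, and only the per-term tables (here a Vec) grow with the input length.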

@tarcieri tarcieri marked this pull request as ready for review July 15, 2021 23:31
@tarcieri
Member

@fjarri hope you don't mind but I flipped this PR out of draft

I think it's a reasonable self-contained first step.

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

I just wanted to emphasize that it shouldn't be merged yet :)

So, I added a generic version of lincomb that could potentially replace mul_windowed and lincomb. The performance is the following:

high-level operations/lincomb via mul+add [157.31 us 157.39 us 157.48 us]
high-level operations/lincomb optimized [124.71 us 125.17 us 125.64 us]
high-level operations/lincomb optimized (generic) [122.25 us 122.31 us 122.37 us]
high-level operations/mul_windowed [80.548 us 80.767 us 80.993 us]
high-level operations/mul_windowed (generic) [80.870 us 81.071 us 81.269 us]

A strange effect is that ecdsa/verify_prehashed benchmark is ~5us faster with a non-generic version (152 vs 147us), while the benchmark of lincomb itself shows that the generic version is slightly faster.

Do you think the generic version overcomplicates things? (I wish Rust had maps and zips defined for static arrays, but alas...)

@tarcieri
Member

I wouldn't worry too much about a modicum of performance lost to the generic version.

I wish Rust had maps and zips defined for static arrays...

It should retain all of the information necessary to optimize those as part of slice::Iter, which is an ExactSizeIterator over a fixed-size structure. Though I have seen LLVM optimize those poorly in the past, and we've often used for loops over a static range instead.

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

It should retain all of the information necessary to optimize those as part of slice::Iter, which is an ExactSizeIterator over a fix-sized structure.

Yes, but as far as I know it is not possible to collect() into an array. I guess I can try to rewrite it in lazy maps, but this will result in some freaky expressions.

and we've often used for loops over a static range instead.

Well, that's basically what's happening in static_map()
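A static_map-style helper (the signature here is a hypothetical sketch, not necessarily the PR's exact code) is essentially just that for loop, filling a fixed-size output array because collect() into [U; N] isn't available:

```rust
/// Map a function over a fixed-size array with a plain indexed loop,
/// which LLVM tends to optimize well. Hypothetical helper shape; the
/// Default bound is just a convenient way to pre-fill the output.
fn static_map<T: Copy, U: Copy + Default, const N: usize>(
    f: impl Fn(T) -> U,
    xs: &[T; N],
) -> [U; N] {
    let mut out = [U::default(); N];
    for i in 0..N {
        out[i] = f(xs[i]);
    }
    out
}
```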

@tarcieri
Member

Yes, but as far as I know it is not possible to collect() into an array.

Aah yes, that's been long desired, and it continues to be an ongoing source of controversy as to how to implement it. I think this is the most up-to-date tracking issue on where things are at with that:

rust-lang/rust#81615

@tarcieri
Member

tarcieri commented Jul 16, 2021

One somewhat dirty option which could unlock those sorts of use cases would be copying and pasting the existing (unstable) functionality from libcore which provides this into some shared crate:

https://github.com/rust-lang/rust/blob/afaf33dcafe9c7068b63eb997df221aa08db7c29/library/core/src/array/mod.rs#L563-L628

We do have the crypto-common crate which might be a relatively OK place to put that sort of thing.

I wouldn't suggest blocking this PR on it, but it is definitely something we could use in a lot of other places (particularly symmetric cryptography). Then eventually we could migrate to the similarly-shaped libcore functionality when it's stabilized.

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

So, do you think it's ok to use the current approach with static_map, noting that they should be replaced with standard tools when those are available? I made an attempt at rewriting the function in terms of maps and folds, but it's pretty hard to do without re-calculating intermediate values. And even if I manage to do it, the result will be quite obscure - I'd rather keep the imperative logic flow.

@tarcieri
Member

Absolutely, that's fine. As I said earlier, we use more imperative patterns in a lot of other places because LLVM seems to do a better job optimizing them.

Hopefully that situation changes as const generics improve, but for now I think it's much more straightforward (and also friendlier to C programmers who may not be familiar with more functional patterns)

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

Got it, I'm going to clean it up then.

@fjarri fjarri force-pushed the lincomb branch 2 times, most recently from 2a6865f to 108218f on July 16, 2021 23:16
@fjarri
Contributor Author

fjarri commented Jul 16, 2021

Ok, done. I just exposed it publicly as lincomb() (because benchmarks need it). Ideally, I think, the feature expose-field should be renamed to something like expose-internals and used to gate stuff like that to expose it to benchmarks. But I'm not sure if I should do that in this PR.

As for a trait that @hdevalence suggested, I think it's a good idea, but I don't know how I feel about using iterators instead of fixed-size arrays. Maybe we can leave this PR at just a function and add a trait next (sometime before the next release).

@tarcieri tarcieri merged commit 937a924 into RustCrypto:master Jul 18, 2021
@tarcieri
Member

Ideally, I think, the feature expose-field should be renamed to something like expose-internals and used to gate stuff like that to expose it to benchmarks. But I'm not sure if I should do that in this PR.

There are various and somewhat inconsistent ways that low-level internals are exposed across crates in the RustCrypto org.

  • Some crates have a hazmat feature (e.g. aes)
  • chacha20 and salsa20 have expose-core

It might be good to come up with a single naming convention for this, but doing so is a breaking change. I can try to open a sort of meta issue on this so we can pick a single name for this sort of thing and use it consistently.

@tarcieri
Member

@fjarri thanks for implementing this!

Would definitely be interested in precomputed powers for the generator point if you're interested in implementing that as well:

in verify.rs and recoverable.rs one of the points of the linear combination is a generator. So it can be sped up even more by pre-calculating generator powers.

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

I thought about it, and the biggest problem here is precomputing and storing the required table. Rust is pretty rigid when it comes to constants; one way would be to precompute and hardcode the byte representations of required points, but that will increase the binary size, as you mentioned, and in general it's a rather roundabout way.

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

Another performance improvement (perhaps not very significant, but still) is possible if one of the scalars in the linear combination is not secret. Then we can use regular table lookup instead of constant-time lookup.
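The difference between the two lookup styles can be sketched like this (toy i64 table entries; a real implementation would use dedicated conditional-select primitives, e.g. from the subtle crate, rather than a plain `==` for the mask):

```rust
/// Constant-time lookup: read every table entry and select the wanted
/// one with a mask, so the memory access pattern doesn't depend on
/// `index`. (A production version would also derive the mask with a
/// constant-time equality check instead of `==`.)
fn ct_lookup(table: &[i64; 16], index: usize) -> i64 {
    let mut out = 0;
    for (i, &entry) in table.iter().enumerate() {
        let mask = -((i == index) as i64); // all-ones iff i == index
        out |= entry & mask;
    }
    out
}

/// Variable-time lookup: a direct index, which is fine when the
/// scalar (and hence the index) is public, as in signature verification.
fn vt_lookup(table: &[i64; 16], index: usize) -> i64 {
    table[index]
}
```

The constant-time version touches all 16 entries per lookup, so skipping it for public scalars saves a constant factor on every window of the scalar multiplication.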

@tarcieri
Member

the biggest problem here is precomputing and storing the required table.

Any thoughts on const fn (and possibly const generics) maturity for computing them at compile time with miri?

We could feature gate it so it doesn't impact embedded targets.

...we can use regular table lookup instead of constant-time lookup.

That'd be great for ECDSA. Also generally I've wondered if wNAF would be nice here, although I'm not sure if the benefits are a wash given secp256k1 endomorphisms. Are you familiar with what bitcoin-core/libsecp256k1 does here? (I am not, perhaps I should look)
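For reference, the idea behind (w)NAF is to recode the scalar into signed digits so fewer digits are nonzero, cutting the number of additions in a variable-time multiply. A minimal width-2 NAF sketch (illustrative only, not libsecp256k1's implementation):

```rust
/// Non-adjacent form: signed binary digits in {-1, 0, +1} with no two
/// adjacent nonzero digits. On average only ~1/3 of the digits are
/// nonzero (versus 1/2 for plain binary), so a variable-time
/// double-and-add needs fewer point additions; subtracting a point is
/// as cheap as adding one, since negating an EC point is nearly free.
fn naf(mut k: i64) -> Vec<i8> {
    let mut digits = Vec::new();
    while k != 0 {
        let d = if k & 1 == 1 {
            let d = 2 - (k & 3) as i8; // +1 if k ≡ 1 (mod 4), -1 if k ≡ 3
            k -= i64::from(d);
            d
        } else {
            0
        };
        digits.push(d);
        k >>= 1;
    }
    digits // least-significant digit first
}
```

Whether this wins over the windowed approach once the GLV endomorphism has already split the scalars is exactly the "wash" question raised above.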

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

Any thoughts on const fn (and possibly const generics) maturity for computing them at compile time with miri?

In stable Rust it won't work because trait methods will be involved (e.g. for arithmetic), and they can't be const. Maybe in a few minor versions...

Another possibility is a build macro, but that will be hard to write and hard to maintain.

That'd be great for ECDSA. Also generally I've wondered if wNAF would be nice here, although I'm not sure if the benefits are a wash given secp256k1 endomorphisms. Are you familiar with what bitcoin-core/libsecp256k1 does here? (I am not, perhaps I should look)

So, to make sure I understand it correctly, in the places where we currently use lincomb():

  • in verify_prehashed() u1 is secret, u2 isn't;
  • in recover_verify_key_from_digest_bytes() both u1 and u2 are public.

Right?

I actually have a rough port of libsecp256k1 implementation sitting in a branch in my repo, back from when I implemented the endomorphism. I think it was slower than the current algorithm (from Ristretto, I think), but perhaps it's faster for non-secret scalars.

@tarcieri
Member

in verify_prehashed() u1 is secret, u2 isn't

...do you mean "secret" as in could leak the message digest, which is potentially preimagable and could therefore leak the message, which could be bad if the message is secret?

I guess this is a debatable tradeoff, but the risk is leaking low-entropy secrets, as otherwise they wouldn't be preimagable from the digest via brute force.

Otherwise yes, I'd say all factors involved in verification can be considered public.

I actually have a rough port of libsecp256k1 implementation sitting in a branch in my repo, back from when I implemented the endomorphism. I think it was slower than the current algorithm (from Ristretto, I think), but perhaps it's faster for non-secret scalars.

Interesting

@tarcieri
Member

In the stable Rust it won't work because trait methods will be involved (e.g. for arithmetic), and they can't be const. Maybe in a few minor versions...

One way to address this for now is to hoist those trait methods out into const fn and have the traits call the const fns
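That hoisting pattern might look roughly like this (illustrative types and a toy modulus, not k256's actual field code):

```rust
/// Hypothetical field element: the arithmetic lives in inherent const
/// fns so it can run in const contexts, while the trait impls forward
/// to them (trait methods can't be const on stable Rust).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fe(u64);

const MODULUS: u64 = 1_000_003; // toy prime, illustrative only

impl Fe {
    /// const-evaluable addition modulo the toy prime
    const fn add_inner(self, rhs: Self) -> Self {
        Fe((self.0 + rhs.0) % MODULUS)
    }
}

impl core::ops::Add for Fe {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        self.add_inner(rhs) // the trait just calls the const fn
    }
}

// Precomputed tables can now be built at compile time via the const fn:
const TWO: Fe = Fe(1).add_inner(Fe(1));
```

With this shape the precomputed generator table could be a `const` built from `add_inner`-style functions, at the cost of duplicating each operation as an inherent const fn.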

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

...do you mean "secret" as in could leak the message digest, which is potentially preimagable and could therefore leak the message, which could be bad if the message is secret?

You're right, I think I got a wrong impression. I suppose it's not secret either (although I'd rather not make a decision on that myself, my cryptography knowledge is very superficial).

In that case the current lincomb implementation might have been an overkill. Is there any use for a linear combination where both scalars are secret?

One way to address this for now is to hoist those trait methods out into const fn and have the traits call the const fns

I guess, but that seems like a lot of work. I think I'll see first how much speedup is actually possible with a precomputed table.

@tarcieri
Member

You're right, I think I got a wrong impression.

There's perhaps a case to be made there, especially for overall sidechannel resistance. But if the inputs are high-entropy secrets they aren't preimagable, and a constant-time hash function suffices, so perhaps this is a tradeoff which should initially target performance.

In that case the current lincomb implementation might have been an overkill. Is there any use for a linear combination where both scalars are secret?

I think right now all of the cases that matter are non-secret. Concretely those are ECDSA verification, public key recovery from an ECDSA signature, and perhaps things like Schnorr/Taproot (#381).

@XuJiandong
Contributor

When will this PR be applied to P256?

@tarcieri
Member

tarcieri commented Aug 7, 2023

There aren't currently any plans to apply this optimization to other curves.

Currently the trait impl used by p256 lives in the primeorder crate, so an optimized implementation would either need to be generic, or we'd need to hoist the trait impls out of primeorder, which would be a breaking change:

```rust
impl<C> LinearCombination for ProjectivePoint<C>
where
    Self: Double,
    C: PrimeCurveParams,
{
}
```

@ycscaly
Contributor

ycscaly commented Aug 20, 2023

In that case the current lincomb implementation might have been an overkill. Is there any use for a linear combination where both scalars are secret?

Pedersen commitments are one such use case
