
Optimized linear combination of points #380

Merged: 2 commits merged into RustCrypto:master on Jul 18, 2021
Conversation

fjarri
Contributor

@fjarri fjarri commented Jul 14, 2021

The goal of this PR is to optimize a commonly used operation (x * k + y * l) where x and y are curve points, and k and l are scalars. It is currently used in verify.rs and recoverable.rs.

On my machine it speeds up ECDSA verification from 176us to 145us (and the newly added benchmark shows speed-up from 157us to 124us for linear combination alone).

Currently the PR contains a lincomb_generic() that takes arrays, and it has two aliases: mul() (internal, used in operator traits) and lincomb() (public, for two-point linear combinations).

@codecov-commenter

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #380 (30ab376) into master (2c12fc5) will decrease coverage by 0.37%.
The diff coverage is 40.35%.


@@            Coverage Diff             @@
##           master     #380      +/-   ##
==========================================
- Coverage   57.92%   57.54%   -0.38%     
==========================================
  Files          29       29              
  Lines        4036     4146     +110     
==========================================
+ Hits         2338     2386      +48     
- Misses       1698     1760      +62     
Impacted Files Coverage Δ
k256/src/arithmetic.rs 100.00% <ø> (ø)
k256/src/arithmetic/mul.rs 59.41% <40.18%> (-29.66%) ⬇️
k256/src/ecdsa/verify.rs 43.39% <42.85%> (+0.53%) ⬆️
k256/src/arithmetic/projective.rs 77.04% <0.00%> (+0.77%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@tarcieri tarcieri left a comment


I think something like this is fine to get started.

  • how should we expose it to the user (and should we expose it at all?). As a function or as a "LinearCombination" trait of ProjectivePoint? In a submodule? Gated by a feature?

I think initially it's fine to just have an internal implementation specific to this use case.

It would be interesting to expose eventually, but raises a lot of questions. For example: should a trait like LinearCombination go in the group crate?

I would prefer to cross that bridge when we get there.

  • a generalization for n-term linear combination (even just joining mul_windowed() and lincomb() would be nice, they have pretty much identical code)

The latter seems worth exploring, if it can be done without impacting performance.

  • in verify.rs and recoverable.rs one of the points of the linear combination is a generator. So it can be sped up even more by pre-calculating generator powers. Should it be done in this PR or in another one?

I'd prefer to look at that in a separate PR. It definitely sounds like a good idea, but that's an area where it might make sense to feature gate it, as I expect there will be a tradeoff in code size which would impact embedded users who might prefer smaller code size over better performance.

@hdevalence

Two traits that could be useful as design inspiration are the curve25519-dalek MultiscalarMul and VartimeMultiscalarMul.

IME there's pretty minimal cost to making an arbitrary-length multiscalar multiplication API over a double-base scalar multiplication API. Basically the only downside is that the lookup tables have to be heap-allocated, but this is a pretty minimal cost even for the double-base case.

This is just my opinion, but I think it's preferable to have only the arbitrary-length version, for reduced API surface. If you're going to have a specialized double-base scmul function, it's useful to have it hardcode one of the points (so it can use a bigger, statically precomputed lookup table).
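The arbitrary-length shape with heap-allocated lookup tables that @hdevalence describes can be sketched like this (again over i64 as a toy group; names and the windowing choice are illustrative, not curve25519-dalek's actual API):

```rust
/// Toy Straus/windowed multiscalar multiplication: one heap-allocated
/// 4-bit lookup table per point, with the doubling chain shared by all
/// terms. Variable-time: the table indices depend on the scalars.
fn multiscalar_toy(scalars: &[u64], points: &[i64]) -> i64 {
    assert_eq!(scalars.len(), points.len());
    // tables[i][j] = j * points[i] for j in 0..16
    let tables: Vec<[i64; 16]> = points
        .iter()
        .map(|&p| core::array::from_fn(|j| j as i64 * p))
        .collect();
    let mut acc = 0i64;
    for w in (0..16).rev() {
        // four shared doublings per 4-bit window
        for _ in 0..4 {
            acc *= 2;
        }
        // one table lookup and one addition per term per window
        for (k, table) in scalars.iter().zip(&tables) {
            acc += table[((k >> (4 * w)) & 0xf) as usize];
        }
    }
    acc
}
```

This makes the "minimal cost" point concrete: the doubling chain is paid once regardless of how many terms there are, and only the per-term tables (here a Vec) grow with the input length.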

@tarcieri tarcieri marked this pull request as ready for review July 15, 2021 23:31
@tarcieri
Member

@fjarri hope you don't mind but I flipped this PR out of draft

I think it's a reasonable self-contained first step.

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

I just wanted to emphasize that it shouldn't be merged yet :)

So, I added a generic version of lincomb that could potentially replace mul_windowed and lincomb. The performance is the following:

high-level operations/lincomb via mul+add [157.31 us 157.39 us 157.48 us]
high-level operations/lincomb optimized [124.71 us 125.17 us 125.64 us]
high-level operations/lincomb optimized (generic) [122.25 us 122.31 us 122.37 us]
high-level operations/mul_windowed [80.548 us 80.767 us 80.993 us]
high-level operations/mul_windowed (generic) [80.870 us 81.071 us 81.269 us]

A strange effect is that ecdsa/verify_prehashed benchmark is ~5us faster with a non-generic version (152 vs 147us), while the benchmark of lincomb itself shows that the generic version is slightly faster.

Do you think the generic version overcomplicates things? (I wish Rust had maps and zips defined for static arrays, but alas...)

@tarcieri
Member

I wouldn't worry too much about a modicum of performance lost to the generic version.

I wish Rust had maps and zips defined for static arrays...

It should retain all of the information necessary to optimize those as part of slice::Iter, which is an ExactSizeIterator over a fixed-size structure. Though I have seen LLVM optimize those poorly in the past, and we've often used for loops over a static range instead.

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

It should retain all of the information necessary to optimize those as part of slice::Iter, which is an ExactSizeIterator over a fix-sized structure.

Yes, but as far as I know it is not possible to collect() into an array. I guess I can try to rewrite it in lazy maps, but this will result in some freaky expressions.

and we've often used for loops over a static range instead.

Well, that's basically what's happening in static_map()
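A static_map-style helper (the signature here is a hypothetical sketch, not necessarily the PR's exact code) is essentially just that for loop, filling a fixed-size output array because collect() into [U; N] isn't available:

```rust
/// Map a function over a fixed-size array with a plain indexed loop,
/// which LLVM tends to optimize well. Hypothetical helper shape; the
/// Default bound is just a convenient way to pre-fill the output.
fn static_map<T: Copy, U: Copy + Default, const N: usize>(
    f: impl Fn(T) -> U,
    xs: &[T; N],
) -> [U; N] {
    let mut out = [U::default(); N];
    for i in 0..N {
        out[i] = f(xs[i]);
    }
    out
}
```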

@tarcieri
Member

Yes, but as far as I know it is not possible to collect() into an array.

Aah yes, that's been long desired, and it continues to be an ongoing source of controversy as to how to implement it. I think this is the most up-to-date tracking issue on where things are at with that:

rust-lang/rust#81615

@tarcieri
Member

tarcieri commented Jul 16, 2021

One somewhat dirty option which could unlock those sorts of use cases would be copying and pasting the existing (unstable) functionality from libcore which provides this into some shared crate:

https://github.com/rust-lang/rust/blob/afaf33dcafe9c7068b63eb997df221aa08db7c29/library/core/src/array/mod.rs#L563-L628

We do have the crypto-common crate which might be a relatively OK place to put that sort of thing.

I wouldn't suggest blocking this PR on it, but it is definitely something we could use in a lot of other places (particularly symmetric cryptography). Then eventually we could migrate to the similarly-shaped libcore functionality when it's stabilized.

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

So, do you think it's ok to use the current approach with static_map, noting that they should be replaced with standard tools when those are available? I made an attempt at rewriting the function in terms of maps and folds, but it's pretty hard to do without re-calculating intermediate values. And even if I manage to do it, the result will be quite obscure - I'd rather keep the imperative logic flow.

@tarcieri
Member

Absolutely, that's fine. As I said earlier, we use more imperative patterns in a lot of other places because LLVM seems to do a better job optimizing them.

Hopefully that situation changes as const generics improve, but for now I think it's much more straightforward (and also friendlier to C programmers who may not be familiar with more functional patterns)

@fjarri
Contributor Author

fjarri commented Jul 16, 2021

Got it, I'm going to clean it up then.

@fjarri fjarri force-pushed the lincomb branch 2 times, most recently from 2a6865f to 108218f on July 16, 2021 23:16
@fjarri
Contributor Author

fjarri commented Jul 16, 2021

Ok, done. I just exposed it publicly as lincomb() (because benchmarks need it). Ideally, I think, the feature expose-field should be renamed to something like expose-internals and used to gate stuff like that to expose it to benchmarks. But I'm not sure if I should do that in this PR.

As for a trait that @hdevalence suggested, I think it's a good idea, but I don't know how I feel about using iterators instead of fixed-size arrays. Maybe we can leave this PR at just a function and add a trait next (sometime before the next release).

@tarcieri tarcieri merged commit 937a924 into RustCrypto:master Jul 18, 2021
@tarcieri
Member

Ideally, I think, the feature expose-field should be renamed to something like expose-internals and used to gate stuff like that to expose it to benchmarks. But I'm not sure if I should do that in this PR.

There are various and somewhat inconsistent ways that low-level internals are exposed across crates in the RustCrypto org.

  • Some crates have a hazmat feature (e.g. aes)
  • chacha20 and salsa20 have expose-core

It might be good to come up with a single naming convention for this, but doing so is a breaking change. I can try to open a sort of meta issue on this so we can pick a single name for this sort of thing and use it consistently.

@tarcieri
Member

@fjarri thanks for implementing this!

Would definitely be interested in precomputed powers for the generator point if you're interested in implementing that as well:

in verify.rs and recoverable.rs one of the points of the linear combination is a generator. So it can be sped up even more by pre-calculating generator powers.

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

I thought about it, and the biggest problem here is precomputing and storing the required table. Rust is pretty rigid when it comes to constants; one way would be to precompute and hardcode the byte representations of required points, but that will increase the binary size, as you mentioned, and in general it's a rather roundabout way.

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

Another performance improvement (perhaps not very significant, but still) is possible if one of the scalars in the linear combination is not secret. Then we can use regular table lookup instead of constant-time lookup.
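The difference between the two lookup styles can be sketched like this (toy i64 table entries; a real implementation would use dedicated conditional-select primitives, e.g. from the subtle crate, rather than a plain `==` for the mask):

```rust
/// Constant-time lookup: read every table entry and select the wanted
/// one with a mask, so the memory access pattern doesn't depend on
/// `index`. (A production version would also derive the mask with a
/// constant-time equality check instead of `==`.)
fn ct_lookup(table: &[i64; 16], index: usize) -> i64 {
    let mut out = 0;
    for (i, &entry) in table.iter().enumerate() {
        let mask = -((i == index) as i64); // all-ones iff i == index
        out |= entry & mask;
    }
    out
}

/// Variable-time lookup: a direct index, which is fine when the
/// scalar (and hence the index) is public, as in signature verification.
fn vt_lookup(table: &[i64; 16], index: usize) -> i64 {
    table[index]
}
```

The constant-time version touches all 16 entries per lookup, so skipping it for public scalars saves a constant factor on every window of the scalar multiplication.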

@tarcieri
Member

the biggest problem here is precomputing and storing the required table.

Any thoughts on const fn (and possibly const generics) maturity for computing them at compile time with miri?

We could feature gate it so it doesn't impact embedded targets.

...we can use regular table lookup instead of constant-time lookup.

That'd be great for ECDSA. Also generally I've wondered if wNAF would be nice here, although I'm not sure if the benefits are a wash given secp256k1 endomorphisms. Are you familiar with what bitcoin-core/libsecp256k1 does here? (I am not, perhaps I should look)
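For reference, the idea behind (w)NAF is to recode the scalar into signed digits so fewer digits are nonzero, cutting the number of additions in a variable-time multiply. A minimal width-2 NAF sketch (illustrative only, not libsecp256k1's implementation):

```rust
/// Non-adjacent form: signed binary digits in {-1, 0, +1} with no two
/// adjacent nonzero digits. On average only ~1/3 of the digits are
/// nonzero (versus 1/2 for plain binary), so a variable-time
/// double-and-add needs fewer point additions; subtracting a point is
/// as cheap as adding one, since negating an EC point is nearly free.
fn naf(mut k: i64) -> Vec<i8> {
    let mut digits = Vec::new();
    while k != 0 {
        let d = if k & 1 == 1 {
            let d = 2 - (k & 3) as i8; // +1 if k ≡ 1 (mod 4), -1 if k ≡ 3
            k -= i64::from(d);
            d
        } else {
            0
        };
        digits.push(d);
        k >>= 1;
    }
    digits // least-significant digit first
}
```

Whether this wins over the windowed approach once the GLV endomorphism has already split the scalars is exactly the "wash" question raised above.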

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

Any thoughts on const fn (and possibly const generics) maturity for computing them at compile time with miri?

In stable Rust it won't work because trait methods will be involved (e.g. for arithmetic), and they can't be const. Maybe in a few minor versions...

Another possibility is a build macro, but that will be hard to write and hard to maintain.

That'd be great for ECDSA. Also generally I've wondered if wNAF would be nice here, although I'm not sure if the benefits are a wash given secp256k1 endomorphisms. Are you familiar with what bitcoin-core/libsecp256k1 does here? (I am not, perhaps I should look)

So, to make sure I understand it correctly, in the places where we currently use lincomb():

  • in verify_prehashed() u1 is secret, u2 isn't;
  • in recover_verify_key_from_digest_bytes() both u1 and u2 are public.

Right?

I actually have a rough port of libsecp256k1 implementation sitting in a branch in my repo, back from when I implemented the endomorphism. I think it was slower than the current algorithm (from Ristretto, I think), but perhaps it's faster for non-secret scalars.

@tarcieri
Member

in verify_prehashed() u1 is secret, u2 isn't

...do you mean "secret" as in could leak the message digest, which is potentially preimagable and could therefore leak the message, which could be bad if the message is secret?

I guess this is a debatable tradeoff, but the risk is leaking low-entropy secrets, as otherwise they wouldn't be preimagable from the digest via brute force.

Otherwise yes, I'd say all factors involved in verification can be considered public.

I actually have a rough port of libsecp256k1 implementation sitting in a branch in my repo, back from when I implemented the endomorphism. I think it was slower than the current algorithm (from Ristretto, I think), but perhaps it's faster for non-secret scalars.

Interesting

@tarcieri
Member

In the stable Rust it won't work because trait methods will be involved (e.g. for arithmetic), and they can't be const. Maybe in a few minor versions...

One way to address this for now is to hoist those trait methods out into const fn and have the traits call the const fns
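That hoisting pattern might look roughly like this (illustrative types and a toy modulus, not k256's actual field code):

```rust
/// Hypothetical field element: the arithmetic lives in inherent const
/// fns so it can run in const contexts, while the trait impls forward
/// to them (trait methods can't be const on stable Rust).
#[derive(Clone, Copy, Debug, PartialEq)]
struct Fe(u64);

const MODULUS: u64 = 1_000_003; // toy prime, illustrative only

impl Fe {
    /// const-evaluable addition modulo the toy prime
    const fn add_inner(self, rhs: Self) -> Self {
        Fe((self.0 + rhs.0) % MODULUS)
    }
}

impl core::ops::Add for Fe {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        self.add_inner(rhs) // the trait just calls the const fn
    }
}

// Precomputed tables can now be built at compile time via the const fn:
const TWO: Fe = Fe(1).add_inner(Fe(1));
```

With this shape the precomputed generator table could be a `const` built from `add_inner`-style functions, at the cost of duplicating each operation as an inherent const fn.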

@fjarri
Contributor Author

fjarri commented Jul 18, 2021

...do you mean "secret" as in could leak the message digest, which is potentially preimagable and could therefore leak the message, which could be bad if the message is secret?

You're right, I think I got a wrong impression. I suppose it's not secret either (although I'd rather not make a decision on that myself, my cryptography knowledge is very superficial).

In that case the current lincomb implementation might have been an overkill. Is there any use for a linear combination where both scalars are secret?

One way to address this for now is to hoist those trait methods out into const fn and have the traits call the const fns

I guess, but that seems like a lot of work. I think I'll see first how much speedup is actually possible with a precomputed table.

@tarcieri
Member

You're right, I think I got a wrong impression.

There's perhaps a case to be made there, especially for overall sidechannel resistance. But if the inputs are high-entropy secrets they aren't preimagable, and a constant-time hash function suffices, so perhaps this is a tradeoff which should initially target performance.

In that case the current lincomb implementation might have been an overkill. Is there any use for a linear combination where both scalars are secret?

I think right now all of the cases that matter are non-secret. Concretely those are ECDSA verification, public key recovery from an ECDSA signature, and perhaps things like Schnorr/Taproot (#381).

@XuJiandong
Contributor

When will this PR be applied to P256?

@tarcieri
Member

tarcieri commented Aug 7, 2023

There aren't currently any plans to apply this optimization to other curves.

Currently the trait impl used by p256 lives in the primeorder crate, so an optimized implementation would either need to be generic, or we'd need to hoist the trait impls out of primeorder, which would be a breaking change:

```rust
impl<C> LinearCombination for ProjectivePoint<C>
where
    Self: Double,
    C: PrimeCurveParams,
{
}
```

@ycscaly
Contributor

ycscaly commented Aug 20, 2023

In that case the current lincomb implementation might have been an overkill. Is there any use for a linear combination where both scalars are secret?

Pedersen commitments are one such use case
