multiproof: improve performance of prover and verifier #63
Merged
jsign commented Oct 18, 2023
@kevaundray, I've tested this in Kaustinen and all is looking fine, so I guess I can't get a better signal to make this ready for review. I guess the ultimate signal will be when more clients (not using go-ipa) use Kaustinen, but that can take a while. And we'll eventually find out anyway.
kevaundray reviewed Oct 21, 2023
jsign commented Oct 23, 2023
kevaundray approved these changes Oct 23, 2023
This PR improves the performance of the proof generator and verifier. There were many changes, which I'll describe below.
What are the changes?
Optimizations that improved the performance of both the prover and verifier:

- Transcript labels were strings, which eventually needed to escape to the heap since the Hasher requires []byte. Also, the labels are fixed (e.g. C, w, t, L, R, and the domain separators), so it doesn't even make sense for them to escape once per transcript write (e.g. for 100k points, we'd be escaping constant strings to the heap). This could be easily fixed by defining constants and not requiring strings, since the hasher doesn't mutate their values. (See the sketch below.)

The first point massively reduced the number of allocations, creating less memory garbage and thus less GC pressure for the clients. This is good for second-order effects (i.e. the GC runs less often and thus burns less CPU).
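As a minimal sketch of that fix (illustrative names only, not go-ipa's actual transcript API): declaring the fixed labels once as []byte values means transcript writes stop converting strings on every call, which is what caused the per-write heap escapes.

```go
// Sketch only: fixed transcript labels as package-level []byte values, so a
// transcript write reuses the same backing array instead of doing a
// string -> []byte conversion (which allocates) on every call.
package transcript

import (
	"crypto/sha256"
	"hash"
)

// Illustrative labels; the real transcript has its own label set
// (C, w, t, L, R, domain separators, ...).
var (
	labelC = []byte("C")
	labelW = []byte("w")
	labelT = []byte("t")
)

// Transcript is a minimal Fiat-Shamir transcript stand-in.
type Transcript struct {
	h hash.Hash
}

func New() *Transcript {
	return &Transcript{h: sha256.New()}
}

// AppendBytes absorbs a fixed label followed by a serialized value. Since the
// label is already a []byte and the hasher never mutates it, nothing escapes
// to the heap per call.
func (t *Transcript) AppendBytes(label, value []byte) {
	t.h.Write(label)
	t.h.Write(value)
}
```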
Optimizations that improved the prover (apart from the above ones):

- Many of the points to be opened arrive in projective form with Z != 1. This is the case since there were operations at the Verkle Tree data structure layer (e.g. updating commitments after tree writes). When including these points in the FS transcript we need to transform them to affine. Despite already batching all the inverses (e.g. for transforming 10k openings we do one inversion, not 10k), the Montgomery trick still requires a decent amount of multiplications. All of these are independent of each other, so this is now parallelized (see the sketch below).
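A rough sketch of the parallelization idea (hypothetical helper names; a gnark-crypto field element is used only to keep the snippet self-contained, go-ipa has its own field types): the Montgomery trick pays one inversion plus roughly 3n multiplications, and since chunks of Z coordinates are independent, those multiplications can be spread across cores.

```go
// Sketch only: parallel Montgomery batch inversion of the Z coordinates that
// need to be turned into affine form before being absorbed by the transcript.
package multiproof

import (
	"runtime"
	"sync"

	"github.com/consensys/gnark-crypto/ecc/bls12-381/fr"
)

// batchInvert inverts every element of zs in place using a single field
// inversion plus ~3n multiplications (the Montgomery trick). It assumes no
// element is zero.
func batchInvert(zs []fr.Element) {
	prefix := make([]fr.Element, len(zs))
	var acc fr.Element
	acc.SetOne()
	for i := range zs {
		prefix[i].Set(&acc) // prefix[i] = zs[0] * ... * zs[i-1]
		acc.Mul(&acc, &zs[i])
	}
	var inv fr.Element
	inv.Inverse(&acc) // the only expensive inversion
	for i := len(zs) - 1; i >= 0; i-- {
		var tmp fr.Element
		tmp.Mul(&inv, &prefix[i]) // = 1 / zs[i]
		inv.Mul(&inv, &zs[i])     // drop zs[i] from the running inverse
		zs[i].Set(&tmp)
	}
}

// parallelBatchInvert splits the slice across CPUs. Each goroutine pays one
// inversion, but the multiplications (the dominant cost at 100k+ openings)
// are spread over all cores.
func parallelBatchInvert(zs []fr.Element) {
	workers := runtime.NumCPU()
	chunk := (len(zs) + workers - 1) / workers
	var wg sync.WaitGroup
	for start := 0; start < len(zs); start += chunk {
		end := start + chunk
		if end > len(zs) {
			end = len(zs)
		}
		wg.Add(1)
		go func(part []fr.Element) {
			defer wg.Done()
			batchInvert(part)
		}(zs[start:end])
	}
	wg.Wait()
}
```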
Prover benchmarks

(Note: 128k openings is more than double the worst-case scenario estimates, I think? We'll have a better idea after some Kaustinen inspection, probably. In any case, I included this case as a wild upper bound.)
Here, I show benchmarks (before/after) for the prover in two setups:
AMD Ryzen 7 3800XT prover:
Notes:
- time/op got a massive speedup, as expected, due to parallelization and the FS-buffering (plus probably fewer allocations and less GC pressure).
- alloc (MB)/op (i.e. total memory allocated) got a bump due to parallelization, since each goroutine requires memory at the same time. In relative terms it looks like a decent bump, but in absolute terms I don't think it is a big deal. Clients could limit this if we let them configure the amount of parallelism, but that's a tradeoff between speed and memory usage. Note that the above CPU has 16 virtual cores (e.g. see the Rock5B benchmark below: since it has 8 cores instead of 16, it uses less memory).
- alloc (count)/op got a massive reduction. We're doing fewer allocations, meaning the GC must clean up less garbage. Note that we're generating less garbage but using more memory; this is fine in the sense that GC overhead is related to the number of allocations done, not to their size.

Note: We could push further to reduce allocations, which is always an interesting dance in Go... but that could mean introducing extra complexity, which I'm not convinced is justified by what we might gain. So let's try to get some signal that it's worth doing first.
Rock5B prover:
Verifier benchmarks
AMD Ryzen 7 3800XT verifier:
Rock5B verifier:
Tangent: verifier vs prover speed?
I had a personal feeling that the verifier isn't that much faster than the prover. It's faster by a double-digit percentage, but not incredibly so. For the Rock5B case the difference is a bit bigger: e.g. for 64k openings the Rock5B prover takes 890ms and the verifier 617ms (1.44x), while on my CPU the prover takes 160ms and the verifier 119ms (1.34x).
Taking a further look at, for example, the 100k-openings case, more than half of that time is spent in an MSM of length 100k (we need this to compute E, i.e. the linear combination of the Cs with powers of r). This is parallelized by gnark-crypto, but it's still a 100k MSM, which is quite massive, so I guess it makes sense. A big part of the rest is appending 100k elements to the FS transcript (i.e. for each opening we have to append C, y and z, so that's 32 bytes * 3 * 100k, which is a decent amount of stuff to serialize and hash; see the sketch below).

The prover is quite fast since, apart from all the tricks and now the parallelization, it can do most of the work in evaluation form thanks to the "grouping by evaluation point" that we do, so no MSM is required there. (It still has to append the 100k (C, y, z) triples too, so that part is the same for both.)
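To put that serialization/hashing cost in numbers, here is a back-of-the-envelope sketch (stdlib only, hypothetical helper names, assuming the 32-byte encodings for C, y and z stated above): 100k openings mean roughly 9.6 MB pushed through sha256 before any curve arithmetic happens.

```go
// Sketch only: the Fiat-Shamir absorption cost scales with the number of
// openings, independently of the MSM.
package multiproof

import "crypto/sha256"

// transcriptBytes estimates how many bytes the transcript absorbs for the
// openings alone: C, y and z, each 32 bytes, per opening.
// For numOpenings = 100_000 this is 9_600_000 bytes (~9.6 MB).
func transcriptBytes(numOpenings int) int {
	return numOpenings * 3 * 32
}

// hashOpenings simulates that absorption with zeroed placeholder encodings,
// which is roughly the hashing work the paragraph above refers to.
func hashOpenings(numOpenings int) [32]byte {
	h := sha256.New()
	placeholder := make([]byte, 3*32) // stand-in for one (C, y, z) encoding
	for i := 0; i < numOpenings; i++ {
		h.Write(placeholder)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}
```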
Anyway, this is just a comment in case this was surprising to some other reader. The verifier's "IPA verification" part is very fast and constant (as expected, <10ms), so most of the overhead comes from the multiproof part, which depends on the number of openings.
One last note: the Go standard library implementation of sha256 doesn't leverage SIMD instructions even when they're available in the CPU. I think this is planned for the next version of Go. I did a test with a Go library that does use them, and the FS part gets a decent speedup; but quite honestly, I'd prefer to stick to the standard library sha256, since it's quite a delicate dependency to swap just to save a dozen ms (tested on my CPU).
TODOs
I'll keep this as a draft until: