perf(fft): introduce cache efficient bit reverse shuffling #446

gbotrel · 2023-09-14T19:29:16Z

Description

The naive bit-reversal permutation does not scale well: it accesses elements with strides that are powers of 2, which causes poor memory access patterns and cache-associativity conflicts.
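For reference, a minimal sketch of what such a naive in-place bit reversal typically looks like. This is an illustration, not the repository's code: `bitReverseNaive` is a hypothetical name and `uint64` stands in for the field element type. Each index `i` is swapped with the index obtained by reversing its low log2(n) bits, so consecutive iterations touch addresses that are far apart.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitReverseNaive permutes v in place so that the element at index i
// ends up at index rev(i), where rev reverses the low log2(len(v)) bits.
// len(v) must be a power of two. uint64 is a stand-in for a field element.
func bitReverseNaive(v []uint64) {
	n := uint64(len(v))
	if n < 2 {
		return
	}
	// Reverse64 reverses all 64 bits; shifting right keeps only log2(n) bits.
	shift := uint(64 - bits.TrailingZeros64(n))
	for i := uint64(0); i < n; i++ {
		j := bits.Reverse64(i) >> shift
		if j > i { // swap each pair exactly once
			v[i], v[j] = v[j], v[i]
		}
	}
}

func main() {
	v := []uint64{0, 1, 2, 3, 4, 5, 6, 7}
	bitReverseNaive(v)
	fmt.Println(v) // [0 4 2 6 1 5 3 7]
}
```

The reads at `i` are sequential, but the partner index `j` jumps by roughly n/2, n/4, ... between iterations, which is the power-of-2 stride pattern described above.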

For the PlonK prover, which needs to do a couple of these permutations on potentially large domains (> 2**25), this is very noticeable.

This PR introduces a "COBRA" bit shuffle permutation, derived from:

Towards an Optimal Bit-Reversal Permutation Program
Larry Carter and Kang Su Gatlin, 1998
https://csaws.cs.technion.ac.il/~itai/Courses/Cache/bit.pdf
Practically efficient methods for performing bit-reversed
permutation in C++11 on the x86-64 architecture
Knauth, Adas, Whitfield, Wang, Ickler, Conrad, Serang, 2017
https://arxiv.org/pdf/1708.01873.pdf
and more specifically, the constantine implementation:
https://github.com/mratsim/constantine/blob/d51699248db04e29c7b1ad97e0bafa1499db00b5/constantine/math/polynomials/fft.nim#L205
by Mamy Ratsimbazafy (@mratsim).

See the code for details and the benchmark section of this PR for numbers. In practice, this is efficient on x86 architectures for permutations over slices of more than 2M field elements.
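To illustrate the idea behind COBRA, here is a minimal sketch of the out-of-place variant described by Carter and Gatlin. It is not the generated in-place code from this PR: the function names, the tile parameter `q`, and the use of `uint64` in place of a field element are all assumptions made for the example. Each index is split into three bit fields `a | b | c`, with `a` and `c` being `q` bits wide; a small tile buffer that fits in cache absorbs the strided accesses.

```go
package main

import (
	"fmt"
	"math/bits"
)

// rev reverses the low `width` bits of x.
func rev(x uint64, width uint) uint64 {
	return bits.Reverse64(x) >> (64 - width)
}

// bitReverseCobraOutOfPlace writes the bit-reversal permutation of src into
// dst, i.e. dst[rev(i)] = src[i]. len(src) == len(dst) must be a power of two.
// q is the (hypothetical) tile width in bits; the tile holds 2^(2q) elements.
func bitReverseCobraOutOfPlace(dst, src []uint64, q uint) {
	n := uint64(len(src))
	logN := uint(bits.TrailingZeros64(n))
	if logN < 2*q { // too small to tile: fall back to the direct method
		for i := uint64(0); i < n; i++ {
			dst[rev(i, logN)] = src[i]
		}
		return
	}
	logB := logN - 2*q // width of the middle field b
	side := uint64(1) << q
	tile := make([]uint64, side*side)

	for b := uint64(0); b < uint64(1)<<logB; b++ {
		bRev := rev(b, logB)
		// Gather: read runs of consecutive c, store into the tile at (rev(a), c).
		for a := uint64(0); a < side; a++ {
			aRev := rev(a, q)
			for c := uint64(0); c < side; c++ {
				tile[aRev<<q|c] = src[a<<(logB+q)|b<<q|c]
			}
		}
		// Scatter: write runs of consecutive rev(a) into dst at rev(c)|rev(b)|rev(a).
		for c := uint64(0); c < side; c++ {
			cRev := rev(c, q)
			for aRev := uint64(0); aRev < side; aRev++ {
				dst[cRev<<(logB+q)|bRev<<q|aRev] = tile[aRev<<q|c]
			}
		}
	}
}

func main() {
	src := make([]uint64, 16)
	for i := range src {
		src[i] = uint64(i)
	}
	dst := make([]uint64, len(src))
	bitReverseCobraOutOfPlace(dst, src, 2)
	fmt.Println(dst) // [0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15]
}
```

Both the reads from `src` and the writes to `dst` now proceed in contiguous runs of 2^q elements, and the transposition-like shuffle happens inside the tile. The right value of `q` depends on cache and element sizes, which is presumably why it is tuned per target.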

Type of change

New feature (non-breaking change which adds functionality)

How has this been tested?

See bitreverse_test.go in each curve-specific package.

How has this been benchmarked?

Benchmarks on a MacBook Pro M1 and on AWS Graviton (arm) are inconclusive; for the arm64 target we still use the "naive" method.
Benchmarks on hpc6a and on a consumer-grade AMD chip are good: up to 70% faster for large sizes.

On the hpc6a for sizes going up to 2^28: [benchmark chart]

Similarly on my AMD-based desktop: [benchmark chart]

On the M1 MacBook: [benchmark chart]

Checklist:

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have added tests that prove my fix is effective or that my feature works
I did not modify files generated from templates
golangci-lint does not output errors locally
New and existing unit tests pass locally with my changes

github-actions · 2023-09-14T19:59:09Z

Summary

✅ Passed: 5581
❌ Failed: 0
🚧 Skipped: 5

🚧 Skipped

TestReference (github.com/consensys/gnark-crypto/ecc/bn254/fr/sis)
TestLimbDecomposition (github.com/consensys/gnark-crypto/ecc/bn254/fr/sis)
TestAppend (github.com/consensys/gnark-crypto/ecc/bn254/fr/tensor-commitment)
TestAppendSis (github.com/consensys/gnark-crypto/ecc/bn254/fr/tensor-commitment)
TestCommitmentSis (github.com/consensys/gnark-crypto/ecc/bn254/fr/tensor-commitment)

mratsim · 2023-09-14T20:51:01Z

Wow that's a huge speedup.

Regarding Mac M1s or M2s, the absence of a speedup is probably explained by the Macs' extremely high memory bandwidth. I'll have to check the numbers, as I don't know them off the top of my head.

Regarding PlonK, I'm curious about the cases where it's needed: by picking decimation-in-time or decimation-in-frequency, you can choose whether you start or end in the canonical or the bit-reversed domain. This repo has 3 variants out of 4 (no bit-reversed to bit-reversed): https://github.com/kwantam/fffft

gbotrel · 2023-09-15T14:44:13Z

Yep, for the M1 that was my conclusion too; plus, the cache sizes and cache lines are significantly bigger.
But it also happens on AWS Graviton machines (arm), so there's more to it; maybe the extra registers, or the Go compiler adding some weird stuff. I didn't investigate much (not an important target at the moment).

For PlonK, most of the time yes, we avoid bit reversal altogether, but in one or two spots we need to convert a polynomial to canonical regular form. I didn't bench the impact yet (probably no more than 5% of total PlonK prover performance), but it did stand out in profiling traces.

yelhousni

LGTM 👍
I still don't get the arm64 peculiarities and the optimal vs. practical choices for the tile size, but I get the overall logic.

yelhousni · 2023-10-02T15:44:58Z

ecc/bls12-377/fr/fft/bitreverse.go

+}
+
+func bitReverseCobra(v []fr.Element) {
+	switch len(v) {


Are these empirical results?

gbotrel · 2023-10-02

In a sense: these methods are just generated with constant-sized arrays and offsets for specific sizes.
Below a certain threshold, the naive version performs well, and beyond the upper bound (2**27) I just didn't bother generating the methods, since I don't think it's very common to bit-reverse vectors of 270M+ elements (but maybe...).

gbotrel added 6 commits September 13, 2023 09:19

feat: add cobra bit reverse

1ad338d

experiment: generate code for CobraInPlace

f936bd6

unroll not good

1444769

fix previous commit

af69977

refactor: simplify some expressions in bitReverse

a79916c

style: cleaning up the PR

6af3f7f

gbotrel added perf zk-evm labels Sep 14, 2023

gbotrel requested review from yelhousni and ThomasPiellard September 14, 2023 19:29

gbotrel added 2 commits September 14, 2023 14:34

build: gofmt stuff

5f5cc26

Merge branch 'master' into perf/fft

89c7b9e

ThomasPiellard approved these changes Sep 29, 2023


yelhousni approved these changes Oct 2, 2023


gbotrel merged commit 95e674b into master Oct 2, 2023
7 checks passed

gbotrel deleted the perf/fft branch October 2, 2023 21:16
