Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(fft): introduce cache efficient bit reverse shuffling #446

Merged
merged 8 commits into from
Oct 2, 2023
Merged

Conversation

gbotrel
Copy link
Collaborator

@gbotrel gbotrel commented Sep 14, 2023

Description

Bit reversal permutation naive implementation does not scale very well due to its memory access patterns and cache associativity issues (since we are accessing elements by strides of powers of 2).

For PlonK prover that needs to do couple of these permutations on potentially large domains (> 2**25) this is very noticeable.

This PR introduces a "COBRA" bit shuffle permutation, derived from:

See code for details and benchmark section of this PR for numbers; but in practice, this is efficient for permutations over slices of field element > 2M, and on x86 architectures.

Type of change

  • New feature (non-breaking change which adds functionality)

How has this been tested?

  • see bitreverese_test.go in each curve specific package.

How has this been benchmarked?

  • benchmarks on Macbook pro M1, AWS Graviton (arm) are inconclusive; for arm64 target we still use the "naive" method.
  • benchmarks on hpc6a and consumer grade amd chip are good; up to 70% faster for large sizes.

On the hpc6a for sizes going up to 2^28:
BenchmarkBitReverse_hpc

Similarly on my amd-based desktop:

BenchmarkBitReverse_ubuntu_home

On the M1 macbook:

BenchmarkBitReverse

Checklist:

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have added tests that prove my fix is effective or that my feature works
  • I did not modify files generated from templates
  • golangci-lint does not output errors locally
  • New and existing unit tests pass locally with my changes

@github-actions
Copy link

Summary

✅ Passed: 5581
❌ Failed: 0
🚧 Skipped: 5

🚧 Skipped

  • TestReference (github.com/consensys/gnark-crypto/ecc/bn254/fr/sis)
  • TestLimbDecomposition (github.com/consensys/gnark-crypto/ecc/bn254/fr/sis)
  • TestAppend (github.com/consensys/gnark-crypto/ecc/bn254/fr/tensor-commitment)
  • TestAppendSis (github.com/consensys/gnark-crypto/ecc/bn254/fr/tensor-commitment)
  • TestCommitmentSis (github.com/consensys/gnark-crypto/ecc/bn254/fr/tensor-commitment)

@mratsim
Copy link

mratsim commented Sep 14, 2023

Wow that's a huge speedup.

Regarding Mac M1s or M2s, the absence of speedup is probably explained by the extremely high memory bandwidth of the Macs. Have to check the numbers as I don't know them from the top of my head.

Regarding Plonk, I'm curious about the cases where it's needed, by picking decimation in-time or decimation in frequency you can choose whether you start or end in canonical or bit-reversed domain. This repo has 3 variants out of 4 (no bit-reversed to bit-reversed) https://github.com/kwantam/fffft

@gbotrel
Copy link
Collaborator Author

gbotrel commented Sep 15, 2023

Yep for the M1, that was my conclusion too, plus the cache sizes + cache lines are significantly bigger.
But ... also happens with AWS Graviton machines (arm) so there's more to that; maybe the extra number of registers or the go compiler that adds some weird stuff; didn't investigate much (not an important target at the moment).

For PlonK, most of the time yes, we avoid bit reverse all together, but in one or two spots, we need to convert a polynomial to canonical regular form -- didn't bench the impact yet (probably no more than 5% total perf impact on PlonK prover) but it did stand out on profiling traces.

Copy link
Collaborator

@yelhousni yelhousni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍
I still don't get the arm64 peculiarities and the optimal vs. practical choices for the tile size, but I get the overall logic.

}

func bitReverseCobra(v []fr.Element) {
switch len(v) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these empirical results?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in a sense; these methods are just generated with constant sized arrays and offsets for specific sizes.
Below a certain threshold, the naive version performs well, and after that threshold (2**27) I just didn't bother generating the methods since I don't think it's very common to bit reverse 270M+ vectors (but maybe...)

@gbotrel gbotrel merged commit 95e674b into master Oct 2, 2023
7 checks passed
@gbotrel gbotrel deleted the perf/fft branch October 2, 2023 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants