Cache alignment for serial and parallel FFT and IFFT #245

Merged
merged 25 commits into from
Apr 8, 2021

jon-chuang
Contributor

@jon-chuang jon-chuang commented Mar 24, 2021

Description

closes: #242
Results:
FFT Parallel 16C32T
2^20 - 11%
2^21 - 17%
2^22 - 25%

IFFT Parallel 16C32T
2^20 - 11%
2^21 - 23%
2^22 - 17%


Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.

  • Targeted PR against correct branch (master)
  • Linked to Github issue with discussion and accepted design OR have an explanation in the PR that describes this work.
  • Wrote unit tests
  • Updated relevant documentation in the code
  • Added a relevant changelog entry to the Pending section in CHANGELOG.md
  • Re-reviewed Files changed in the GitHub PR explorer

@jon-chuang jon-chuang requested review from ValarDragon and Pratyush and removed request for ValarDragon March 24, 2021 12:24
@Pratyush
Member

Thanks for the PR @jon-chuang! Could you also benchmark the difference at lower thread counts and at smaller sizes? IIRC, when @ValarDragon tried the realignment strategy for the parallel case, there was a big slowdown for smaller sizes.

@ValarDragon
Member

ValarDragon commented Mar 24, 2021

I tried this idea for the parallel case before, and was getting 20%+ slowdowns at small FFTs (2^15, 2^16). I'll check again for this PR in particular to see how it performs.

In the parallel case, it's not super clear to me that taking 50% extra memory is a great trade-off to be making across the board.

@jon-chuang
Contributor Author

jon-chuang commented Mar 25, 2021

@ValarDragon Yes, I was getting small slowdowns (~10%) too, for both the parallel and non-parallel cases.
It's hard to know in advance which of the two would work better for any given curve and number of threads.

With regard to the 50% extra memory, I think it is smaller than that; it's more like 25% of the length of the FFT array.
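
To make the memory estimate concrete, here is a minimal sketch of the kind of compaction under discussion. It is not taken from this PR's diff: it assumes a twiddle table of n/2 entries and a compacted copy that keeps every second entry, which gives n/4 extra elements. The names `compact_strided` and `scratch_fraction` are hypothetical, and `u64` stands in for a field element.

```rust
/// Gather every `stride`-th entry of `table` into a fresh contiguous buffer,
/// so that later passes read adjacent memory instead of jumping by `stride`.
/// Hypothetical helper for illustration; `u64` stands in for a field element.
fn compact_strided(table: &[u64], stride: usize) -> Vec<u64> {
    assert!(stride > 0, "stride must be non-zero");
    table.iter().step_by(stride).copied().collect()
}

/// Extra scratch memory as a fraction of the FFT input length `n`, under the
/// assumed layout: a twiddle table of n/2 entries, compacted with stride 2.
fn scratch_fraction(n: usize) -> f64 {
    let table: Vec<u64> = (0..(n / 2) as u64).collect();
    let compacted = compact_strided(&table, 2);
    compacted.len() as f64 / n as f64
}

fn main() {
    let table: Vec<u64> = (0..8).collect();
    println!("{:?}", compact_strided(&table, 2)); // [0, 2, 4, 6]
    println!("{}", scratch_fraction(1 << 10)); // 0.25
}
```

Under a different layout (for example, a scratch copy of the full n/2 table) the fraction would be 50% instead, so the figure depends on which buffer is duplicated.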

But currently, the parallelisation speedup I'm getting is pretty terrible: 4x on an 8C16T machine. There seems to be a lot of room for improvement. Thread utilisation is about 50%. Pretty crap.
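
For reference, the 4x figure can be set against the core count as a parallel-efficiency ratio (speedup divided by workers). This is only a sketch: the function name is hypothetical, and whether to count the 8 physical cores or the 16 hardware threads as workers is an assumption.

```rust
/// Parallel efficiency: observed speedup over the single-threaded run,
/// divided by the number of workers assumed to be available.
fn parallel_efficiency(speedup: f64, workers: u32) -> f64 {
    assert!(workers > 0, "need at least one worker");
    speedup / f64::from(workers)
}

fn main() {
    // 4x speedup on 8 physical cores (8C16T), as reported above.
    println!("{:.2}", parallel_efficiency(4.0, 8)); // 0.50
    // Measured against all 16 hardware threads, the same speedup is 0.25.
    println!("{:.2}", parallel_efficiency(4.0, 16));
}
```

Counting physical cores, a 4x speedup corresponds to 50% efficiency, which is in line with the utilisation quoted above.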

@jon-chuang
Contributor Author

Here are the terrible sad results:
WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.

Gnuplot not found, using plotters backend
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Collecting 100 samples in estimated 5.1728 s (1700 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/32768
                        time:   [2.9688 ms 3.0300 ms 3.0938 ms]
                        change: [+29.440% +33.263% +37.093%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Collecting 100 samples in estimated 5.5117 s (1000 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/65536
                        time:   [5.4641 ms 5.6596 ms 5.8604 ms]
                        change: [+23.748% +28.742% +34.456%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Collecting 100 samples in estimated 6.1528 s (500 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/131072
                        time:   [10.995 ms 11.526 ms 12.092 ms]
                        change: [+22.798% +29.560% +35.644%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Collecting 100 samples in estimated 6.9858 s (300 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/262144
                        time:   [24.550 ms 25.577 ms 26.668 ms]
                        change: [+16.490% +21.934% +27.285%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s, or reduce sample count to 90.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Collecting 100 samples in estimated 5.4361 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/524288
                        time:   [50.259 ms 52.124 ms 54.128 ms]
                        change: [+5.6750% +9.6124% +14.538%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 13 outliers among 100 measurements (13.00%)
  11 (11.00%) high mild
  2 (2.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, or reduce sample count to 50.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Collecting 100 samples in estimated 9.1722 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/1048576
                        time:   [92.486 ms 93.790 ms 95.188 ms]
                        change: [-10.309% -7.5650% -4.9447%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 20.1s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Collecting 100 samples in estimated 20.147 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/2097152
                        time:   [194.98 ms 197.81 ms 200.96 ms]
                        change: [-8.4606% -6.8893% -5.1352%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.1s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304: Collecting 100 samples in estimated 38.052 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/4194304
                        time:   [382.67 ms 384.74 ms 386.91 ms]
                        change: [-17.897% -16.662% -15.519%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Collecting 100 samples in estimated 5.0437 s (2100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/32768
                        time:   [2.6042 ms 2.6729 ms 2.7477 ms]
                        change: [+6.9566% +10.969% +15.304%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Collecting 100 samples in estimated 5.0822 s (800 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/65536
                        time:   [5.2793 ms 5.4697 ms 5.6789 ms]
                        change: [+16.143% +22.205% +28.317%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Collecting 100 samples in estimated 5.1647 s (500 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/131072
                        time:   [9.9436 ms 10.117 ms 10.294 ms]
                        change: [+6.8750% +9.9687% +13.350%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Collecting 100 samples in estimated 7.0024 s (300 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/262144
                        time:   [22.617 ms 22.989 ms 23.383 ms]
                        change: [+4.4927% +7.6459% +10.659%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Collecting 100 samples in estimated 9.6489 s (200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/524288
                        time:   [47.961 ms 48.583 ms 49.270 ms]
                        change: [-1.1671% +1.3471% +3.8213%] (p = 0.29 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.4s, or reduce sample count to 50.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Collecting 100 samples in estimated 9.4081 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/1048576
                        time:   [92.923 ms 93.911 ms 94.995 ms]
                        change: [-6.0794% -4.1757% -2.3586%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 19.3s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152: Collecting 100 samples in estimated 19.337 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/2097152
                        time:   [193.46 ms 194.86 ms 196.31 ms]
                        change: [-10.583% -9.2857% -8.0497%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 39.5s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304: Collecting 100 samples in estimated 39.478 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/4194304
                        time:   [409.03 ms 414.09 ms 419.59 ms]
                        change: [-9.6244% -8.4244% -7.0350%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe

Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Collecting 100 samples in estimated 5.2216 s (1700 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/32768
                        time:   [3.1329 ms 3.2375 ms 3.3454 ms]
                        change: [+18.493% +22.983% +28.159%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Collecting 100 samples in estimated 5.5730 s (900 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/65536
                        time:   [5.2982 ms 5.4014 ms 5.5130 ms]
                        change: [+2.2447% +4.6582% +7.2173%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Collecting 100 samples in estimated 6.0598 s (500 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/131072
                        time:   [11.114 ms 11.369 ms 11.640 ms]
                        change: [+3.9870% +6.8438% +10.094%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Collecting 100 samples in estimated 5.0236 s (200 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/262144
                        time:   [24.164 ms 24.654 ms 25.213 ms]
                        change: [-4.6415% -0.6099% +3.3769%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288: Collecting 100 samples in estimated 9.9466 s (200 iterations)

@jon-chuang
Contributor Author

jon-chuang commented Mar 25, 2021

This much more favourable result is achieved when the subchunks are only used for very large gaps:
WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.

Gnuplot not found, using plotters backend
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Collecting 100 samples in estimated 5.0861 s (2000 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/32768
                        time:   [2.5101 ms 2.5442 ms 2.5819 ms]
                        change: [+4.4667% +7.5754% +10.604%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Collecting 100 samples in estimated 5.2808 s (1200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/65536
                        time:   [4.3433 ms 4.4042 ms 4.4675 ms]
                        change: [-9.5771% -5.5606% -1.7496%] (p = 0.01 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Collecting 100 samples in estimated 5.7595 s (600 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/131072
                        time:   [9.4923 ms 9.6384 ms 9.7949 ms]
                        change: [-5.7956% -2.7784% +0.1900%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Collecting 100 samples in estimated 6.5137 s (300 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/262144
                        time:   [21.344 ms 21.634 ms 21.931 ms]
                        change: [-19.379% -16.122% -12.667%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Collecting 100 samples in estimated 9.4797 s (200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/524288
                        time:   [47.809 ms 48.882 ms 50.051 ms]
                        change: [-11.130% -8.0677% -4.7549%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.3s, or reduce sample count to 50.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Collecting 100 samples in estimated 9.3142 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/1048576
                        time:   [91.174 ms 92.364 ms 93.596 ms]
                        change: [-10.431% -8.5880% -6.7047%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 20.2s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Collecting 100 samples in estimated 20.218 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/2097152
                        time:   [192.43 ms 194.44 ms 196.56 ms]
                        change: [-10.448% -9.1806% -7.8372%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.1s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304: Collecting 100 samples in estimated 38.124 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/4194304: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/4194304
                        time:   [379.72 ms 381.95 ms 384.38 ms]
                        change: [-15.891% -15.261% -14.629%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Collecting 100 samples in estimated 5.0694 s (2100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/32768
                        time:   [2.4197 ms 2.4658 ms 2.5184 ms]
                        change: [+5.6856% +8.6309% +11.566%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Collecting 100 samples in estimated 5.2877 s (1200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/65536
                        time:   [4.4157 ms 4.5204 ms 4.6347 ms]
                        change: [-2.8322% +0.5157% +4.0209%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Collecting 100 samples in estimated 5.4422 s (600 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/131072
                        time:   [9.0741 ms 9.2183 ms 9.3674 ms]
                        change: [-1.7590% +1.2289% +4.0510%] (p = 0.41 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Collecting 100 samples in estimated 6.5420 s (300 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/262144
                        time:   [21.510 ms 21.792 ms 22.091 ms]
                        change: [+0.6440% +3.0114% +5.3517%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Collecting 100 samples in estimated 9.6532 s (200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/524288
                        time:   [47.697 ms 48.500 ms 49.346 ms]
                        change: [-5.0135% -2.0695% +0.7938%] (p = 0.17 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.2s, or reduce sample count to 50.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Collecting 100 samples in estimated 9.2140 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/1048576
                        time:   [92.493 ms 93.593 ms 94.730 ms]
                        change: [-10.478% -7.9674% -5.6798%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 19.0s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152: Collecting 100 samples in estimated 18.992 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/2097152: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/2097152
                        time:   [190.88 ms 192.36 ms 193.91 ms]
                        change: [-12.597% -11.378% -10.180%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 39.3s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304: Collecting 100 samples in estimated 39.289 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/4194304: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/4194304
                        time:   [397.76 ms 400.90 ms 404.22 ms]
                        change: [-13.641% -12.583% -11.499%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Collecting 100 samples in estimated 5.1805 s (1900 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/32768
                        time:   [2.7607 ms 2.8130 ms 2.8719 ms]
                        change: [+9.3930% +11.864% +14.808%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Collecting 100 samples in estimated 5.2517 s (1000 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/65536
                        time:   [4.9596 ms 5.0647 ms 5.1799 ms]
                        change: [+3.9856% +6.3693% +8.9648%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Collecting 100 samples in estimated 5.1420 s (500 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/131072
                        time:   [10.354 ms 10.525 ms 10.709 ms]
                        change: [-20.452% -16.308% -11.850%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Collecting 100 samples in estimated 7.4327 s (300 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/262144
                        time:   [22.631 ms 22.892 ms 23.160 ms]
                        change: [-9.1506% -6.2321% -3.4109%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Details: 8C16T.
I think this makes it good enough. I will now bench smaller sizes in the non-parallel case.

@Pratyush @ValarDragon Do help corroborate the results as I saw that the results fluctuated a lot between runs.

There is no change in results for MNT. Not sure if this is to be expected.

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Mar 25, 2021

Here is the data for the serial case:

WARNING: HTML report generation will become a non-default optional feature in Criterion.rs 0.4.0.
This feature is being moved to cargo-criterion (https://github.com/bheisler/cargo-criterion) and will be optional in a future version of Criterion.rs. To silence this warning, either switch to cargo-criterion or enable the 'html_reports' feature in your Cargo.toml.

Gnuplot not found, using plotters backend
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Collecting 100 samples in estimated 5.8394 s (600 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/32768: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/32768
                        time:   [9.7242 ms 9.7499 ms 9.7808 ms]
                        change: [+8.1595% +8.9352% +9.6607%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Collecting 100 samples in estimated 6.2526 s (300 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/65536: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/65536
                        time:   [20.658 ms 20.737 ms 20.827 ms]
                        change: [+3.1853% +4.1361% +5.0873%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Collecting 100 samples in estimated 8.9820 s (200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/131072: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/131072
                        time:   [44.775 ms 44.910 ms 45.046 ms]
                        change: [+4.9416% +5.7576% +6.5064%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.8s, or reduce sample count to 50.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Collecting 100 samples in estimated 9.7545 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/262144: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/262144
                        time:   [96.474 ms 96.656 ms 96.845 ms]
                        change: [-0.6048% -0.1526% +0.2655%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 20.6s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Collecting 100 samples in estimated 20.568 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/524288
                        time:   [205.17 ms 205.65 ms 206.14 ms]
                        change: [-3.5415% -2.9891% -2.4759%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 43.5s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Collecting 100 samples in estimated 43.523 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Analyzing
"bls12_381 - radix2" - subgroup_fft_in_place/1048576
                        time:   [432.84 ms 433.57 ms 434.31 ms]
                        change: [-10.785% -10.349% -9.9450%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Collecting 100 samples in estimated 5.0249 s (500 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/32768: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/32768
                        time:   [9.9924 ms 10.022 ms 10.056 ms]
                        change: [+4.1057% +4.6042% +5.1030%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Collecting 100 samples in estimated 6.3489 s (300 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/65536: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/65536
                        time:   [21.442 ms 21.525 ms 21.626 ms]
                        change: [+2.3683% +3.0082% +3.7269%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Collecting 100 samples in estimated 9.2127 s (200 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/131072: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/131072
                        time:   [45.954 ms 46.041 ms 46.136 ms]
                        change: [+0.2819% +1.0548% +1.7724%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.9s, or reduce sample count to 50.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Collecting 100 samples in estimated 9.8725 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/262144: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/262144
                        time:   [99.226 ms 99.470 ms 99.726 ms]
                        change: [-2.2848% -1.6243% -1.0141%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 20.7s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Collecting 100 samples in estimated 20.672 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/524288: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/524288
                        time:   [206.77 ms 207.12 ms 207.48 ms]
                        change: [-8.2831% -7.6417% -7.0368%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 43.9s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Collecting 100 samples in estimated 43.944 s (100 iterations)
Benchmarking "bls12_381 - radix2" - subgroup_ifft_in_place/1048576: Analyzing
"bls12_381 - radix2" - subgroup_ifft_in_place/1048576
                        time:   [439.90 ms 440.62 ms 441.34 ms]
                        change: [-12.364% -11.914% -11.519%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Collecting 100 samples in estimated 5.4242 s (500 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/32768: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/32768
                        time:   [10.933 ms 10.960 ms 10.987 ms]
                        change: [+9.6970% +10.177% +10.644%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Collecting 100 samples in estimated 6.9663 s (300 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/65536: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/65536
                        time:   [23.195 ms 23.261 ms 23.328 ms]
                        change: [+8.1506% +8.7586% +9.3377%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Collecting 100 samples in estimated 9.8730 s (200 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/131072: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/131072
                        time:   [49.664 ms 49.792 ms 49.921 ms]
                        change: [+7.0807% +7.5261% +7.9696%] (p = 0.00 < 0.05)
                        Performance has regressed.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.7s, or reduce sample count to 40.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Collecting 100 samples in estimated 10.745 s (100 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/262144: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/262144
                        time:   [106.18 ms 106.47 ms 106.78 ms]
                        change: [+0.2334% +0.7263% +1.1819%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 22.4s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288: Collecting 100 samples in estimated 22.385 s (100 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/524288: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/524288
                        time:   [222.00 ms 222.47 ms 222.97 ms]
                        change: [-4.5002% -4.0088% -3.5321%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/1048576
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 47.5s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/1048576: Collecting 100 samples in estimated 47.524 s (100 iterations)
Benchmarking "bls12_381 - radix2" - coset_fft_in_place/1048576: Analyzing
"bls12_381 - radix2" - coset_fft_in_place/1048576
                        time:   [470.46 ms 471.29 ms 472.14 ms]
                        change: [-8.8260% -8.4579% -8.1081%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/32768
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/32768: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/32768: Collecting 100 samples in estimated 5.3178 s (500 iterations)
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/32768: Analyzing
"bls12_381 - radix2" - coset_ifft_in_place/32768
                        time:   [10.651 ms 10.712 ms 10.783 ms]
                        change: [+4.0475% +4.8278% +5.6070%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/65536
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/65536: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/65536: Collecting 100 samples in estimated 6.8066 s (300 iterations)
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/65536: Analyzing
"bls12_381 - radix2" - coset_ifft_in_place/65536
                        time:   [22.396 ms 22.472 ms 22.559 ms]
                        change: [+2.2767% +2.8382% +3.4491%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/131072
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/131072: Warming up for 3.0000 s
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/131072: Collecting 100 samples in estimated 9.7565 s (200 iterations)
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/131072: Analyzing
"bls12_381 - radix2" - coset_ifft_in_place/131072
                        time:   [48.583 ms 48.761 ms 48.952 ms]
                        change: [+1.7305% +2.5542% +3.3143%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/262144
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/262144: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 10.4s, or reduce sample count to 40.
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/262144: Collecting 100 samples in estimated 10.438 s (100 iterations)
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/262144: Analyzing
"bls12_381 - radix2" - coset_ifft_in_place/262144
                        time:   [102.48 ms 102.67 ms 102.86 ms]
                        change: [-1.9921% -1.5773% -1.1758%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/524288
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/524288: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 21.6s, or reduce sample count to 20.
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/524288: Collecting 100 samples in estimated 21.610 s (100 iterations)
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/524288: Analyzing
"bls12_381 - radix2" - coset_ifft_in_place/524288
                        time:   [216.67 ms 217.11 ms 217.58 ms]
                        change: [-5.3576% -4.7837% -4.2304%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/1048576
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/1048576: Warming up for 3.0000 s

Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 45.9s, or reduce sample count to 10.
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/1048576: Collecting 100 samples in estimated 45.889 s (100 iterations)
Benchmarking "bls12_381 - radix2" - coset_ifft_in_place/1048576: Analyzing
"bls12_381 - radix2" - coset_ifft_in_place/1048576
                        time:   [456.91 ms 457.48 ms 458.09 ms]
                        change: [-12.902% -12.442% -12.016%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

I think the compaction copies could be avoided in the cases when the problem size is small enough, but this comes at the risk of working for one set of params and not for others. Can't optimise everything perfectly, I guess.
More extensive benching should be done to include MNT etc.

I can't decide if I am more in favour of optimising the small FFTs or the big ones... the sad fact is that unless one has some instrumentation to do live AB testing, it's impossible to autotune everything.

I suppose that one can create an optimisation config file, which drives conditional compilation from a single file. The unfortunate, ugly fact is that currently one must manage this with features. Although searching over such a large optimisation space is hard, there exist many techniques in the literature, like Bayesian optimisation, which can quickly determine locally-optimal parameters from expensive data generation, i.e. benching of downstream functions. Ideally, one would be able to compile individual highly optimised functions in a binary with separate configs, but this itself is hard.
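A minimal sketch of what managing such a tuning parameter with features could look like. The feature names `tune-small-fft` and `tune-large-fft`, the function names, and the threshold values are all hypothetical and chosen only for illustration; they are not features or constants of this crate.

```rust
/// Size threshold above which roots compaction would be enabled.
/// The feature names and values below are hypothetical placeholders.
const fn compaction_threshold() -> usize {
    if cfg!(feature = "tune-large-fft") {
        // Only compact for very large problems.
        1 << 20
    } else if cfg!(feature = "tune-small-fft") {
        // Never pay for the compaction copy.
        usize::MAX
    } else {
        // Default used when neither hypothetical feature is enabled.
        1 << 18
    }
}

/// Decides whether an FFT of size `n` should compact its roots.
fn should_compact(n: usize) -> bool {
    n >= compaction_threshold()
}
```

With neither feature enabled, `should_compact(1 << 17)` is false and `should_compact(1 << 18)` is true. An autotuner would still need to bench each configuration separately, since the choice is fixed at compile time.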

If only there were a way to directly access LLVM's optimisation infrastructure from ordinary rust code, that allows for such an optimisation process.

This would be a truly beautiful feature in my mind, e.g. https://www.cl.cam.ac.uk/~ey204/pubs/MPHIL/2017_SZYMON.pdf writ large.

@jon-chuang
Copy link
Contributor Author

jon-chuang commented Mar 25, 2021

Btw @ValarDragon , you do know that your benches were wrong, right? They did not include the cost of the roots compaction, due to the bug. So all those improvements that were stated are not proven. In fact, in the above benches, it is shown that it is much harder to find a suitable tradeoff.

@Pratyush
Copy link
Member

Pratyush commented Mar 25, 2021

I think the numbers for parallel realignment that @ValarDragon reported were from his old PR #177, which didn't have the bug.

@jon-chuang
Copy link
Contributor Author

I think the numbers for parallel realignment that @ValarDragon reported were from his old PR #177, which didn't have the bug.

I'm not sure if I believe this. Could you confirm @ValarDragon ?

Co-authored-by: Pratyush Mishra <pratyushmishra@berkeley.edu>
@jon-chuang
Copy link
Contributor Author

All I can think, however, is that since reasonable circuits are of size at least 2^18, one should disregard the benchmarks for 2^17 and below; on that basis the current PR is doing very well. In particular, one would expect even better improvements for large N, e.g. 2^25.

@Pratyush
Copy link
Member

Pratyush commented Mar 25, 2021

It seems like the FFT code is faster even at small sizes for a small number of cores?

4 threads
"bls12_381 - radix2" - subgroup_fft_in_place/32768                                                                             
                        time:   [4.9159 ms 4.9797 ms 5.0551 ms]
                        change: [-10.802% -8.3784% -5.8054%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
"bls12_381 - radix2" - subgroup_fft_in_place/65536                                                                            
                        time:   [10.724 ms 10.784 ms 10.851 ms]
                        change: [-6.0406% -5.4021% -4.7377%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe
"bls12_381 - radix2" - subgroup_fft_in_place/131072                                                                            
                        time:   [22.814 ms 22.908 ms 23.003 ms]
                        change: [-5.7885% -5.2133% -4.5943%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
"bls12_381 - radix2" - subgroup_fft_in_place/262144                                                                            
                        time:   [46.696 ms 47.291 ms 47.836 ms]
                        change: [-13.601% -12.393% -11.245%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  14 (14.00%) low severe
  1 (1.00%) low mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/524288: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.7s, or reduce sample count to 50.
"bls12_381 - radix2" - subgroup_fft_in_place/524288                                                                            
                        time:   [94.261 ms 94.586 ms 94.937 ms]
                        change: [-12.658% -12.106% -11.563%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/1048576: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 18.5s, or reduce sample count to 20.
"bls12_381 - radix2" - subgroup_fft_in_place/1048576                                                                            
                        time:   [184.78 ms 185.32 ms 185.85 ms]
                        change: [-15.254% -14.912% -14.566%] (p = 0.00 < 0.05)
                        Performance has improved.
Benchmarking "bls12_381 - radix2" - subgroup_fft_in_place/2097152: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 42.3s, or reduce sample count to 10.
"bls12_381 - radix2" - subgroup_fft_in_place/2097152                                                                            
                        time:   [435.73 ms 438.96 ms 442.23 ms]
                        change: [-25.799% -25.038% -24.289%] (p = 0.00 < 0.05)
                        Performance has improved.

@jon-chuang
Copy link
Contributor Author

@Pratyush , yes, that is expected, as the code worries less about partitioning the data into subsets of reasonable size.

Comment on lines 210 to 213
let compaction_size = core::cmp::min(
    roots_cache.len() / 2,
    roots_cache.len() / MIN_COMPACTION_CHUNKS,
);
Copy link
Member

What happens when roots_cache.len() is 2 or less than MIN_COMPACTION_CHUNKS? Could you amend the tests to check that as well? Thanks!

Copy link
Contributor Author

@jon-chuang jon-chuang Mar 30, 2021

Hmm the compaction wouldn't happen. So we don't have to worry about it. Notice that cmp::min is only necessary for MIN_COMPACTION_CHUNKS = 1, since chunks > 0.

If roots_cache.len() < MIN_COMPACTION_CHUNKS, then chunks <= xi.len() / 2 = roots_cache.len() < MIN_COMPACTION_CHUNKS
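The argument above can be checked with a small sketch. The value of `MIN_COMPACTION_CHUNKS` and the guard in `would_compact` are assumptions made for illustration (the thread does not show them); only the `compaction_size` expression is taken from the reviewed lines.

```rust
/// Hypothetical value; the real constant is defined in the PR.
const MIN_COMPACTION_CHUNKS: usize = 1 << 7;

/// The expression under review, as a standalone function.
fn compaction_size(roots_cache_len: usize) -> usize {
    core::cmp::min(
        roots_cache_len / 2,
        roots_cache_len / MIN_COMPACTION_CHUNKS,
    )
}

/// Upper bound on the number of chunks: xi.len() / 2, which equals
/// roots_cache.len() according to the comment above.
fn max_chunks(roots_cache_len: usize) -> usize {
    roots_cache_len
}

/// Assumed guard: compaction is only attempted when there are at least
/// MIN_COMPACTION_CHUNKS chunks.
fn would_compact(num_chunks: usize) -> bool {
    num_chunks >= MIN_COMPACTION_CHUNKS
}
```

Under these assumptions, a roots cache shorter than `MIN_COMPACTION_CHUNKS` yields a compaction size of 0, but the guard can never pass for it, so the zero size is never used.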

Copy link
Member

Ok sounds good, a comment to that effect would be great.

@Pratyush
Copy link
Member

Pratyush commented Mar 30, 2021

This looks great. In terms of refactoring, there's still some common code between io_helper and oi_helper, namely

cfg_chunks_mut!(xi, chunk_size).for_each(|cxi| {
    let (lo, hi) = cxi.split_at_mut(gap);
    // If the chunk is sufficiently big that parallelism helps,
    // we parallelize the butterfly operation within the chunk.
    if gap > MIN_PROBLEM_SIZE && num_chunks < max_threads {
        cfg_iter_mut!(lo)
            .zip(cfg_iter_mut!(hi))
            .zip(cfg_iter!(roots).step_by(step))
            .for_each(Self::butterfly_fn_io);
    } else {
        lo.iter_mut()
            .zip(hi)
            .zip(roots.iter().step_by(step))
            .for_each(Self::butterfly_fn_io);
    }
});
and
cfg_chunks_mut!(xi, chunk_size).for_each(|cxi| {
    let (lo, hi) = cxi.split_at_mut(gap);
    // If the chunk is sufficiently big that parallelism helps,
    // we parallelize the butterfly operation within the chunk.
    if gap > MIN_PROBLEM_SIZE && num_chunks < max_threads {
        cfg_iter_mut!(lo)
            .zip(cfg_iter_mut!(hi))
            .zip(cfg_iter!(roots).step_by(step))
            .for_each(Self::butterfly_fn_oi);
    } else {
        lo.iter_mut()
            .zip(hi)
            .zip(roots.iter().step_by(step))
            .for_each(Self::butterfly_fn_oi);
    }
});

I think we should extract these into a common method as well, so that really the only thing different between the two is in the root compaction.
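For readers unfamiliar with the term, the root compaction mentioned here can be pictured roughly as follows. This is a simplified sketch over `u64` values rather than field elements, and `compact_roots` is a hypothetical name, not the helper used in the PR.

```rust
/// Copies every `step`-th root into a contiguous buffer, so that a later
/// butterfly pass can read roots sequentially instead of with a stride.
/// Panics if `step` is 0, as `Iterator::step_by` does.
fn compact_roots(roots: &[u64], step: usize) -> Vec<u64> {
    roots.iter().step_by(step).copied().collect()
}
```

After compaction, the butterfly loop can zip against the compacted slice with a step of 1, which is friendlier to the cache at the cost of the extra copy.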

@jon-chuang
Copy link
Contributor Author

This looks great. In terms of refactoring, there's still some common code between io_helper and oi_helper, namely

cfg_chunks_mut!(xi, chunk_size).for_each(|cxi| {

to

and

cfg_chunks_mut!(xi, chunk_size).for_each(|cxi| {

to

I think we should extract these into a common method as well, so that really the only thing different between the two is in the root compaction.

Not sure about passing a function pointer, though: then one instantiates for both anyway, rather than inlining, sadly.

@Pratyush
Copy link
Member

I think the method should be inlined anyway? As long as you use generics
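A rough sketch of the generic approach, using toy arithmetic modulo a small prime instead of field elements and a serial loop instead of the `cfg_*` macros. The names `apply_butterfly`, `butterfly_io`, `butterfly_oi` and the modulus `P` are illustrative assumptions, not the code merged in this PR.

```rust
/// Toy prime modulus, chosen for illustration only.
const P: u64 = 17;

/// In-order to out-of-order style butterfly on toy values.
fn butterfly_io(((lo, hi), root): ((&mut u64, &mut u64), &u64)) {
    let neg = (*lo + P - *hi) % P;
    *lo = (*lo + *hi) % P;
    *hi = neg * *root % P;
}

/// Out-of-order to in-order style butterfly on toy values.
fn butterfly_oi(((lo, hi), root): ((&mut u64, &mut u64), &u64)) {
    *hi = *hi * *root % P;
    let neg = (*lo + P - *hi) % P;
    *lo = (*lo + *hi) % P;
    *hi = neg;
}

/// Shared loop. `G` is a generic parameter, so each call site is
/// monomorphized for its butterfly and the compiler is free to inline it,
/// unlike a `fn` pointer argument, which is called indirectly.
/// Assumes `xi.len()` is a multiple of `2 * gap` and `step > 0`.
fn apply_butterfly<G>(xi: &mut [u64], roots: &[u64], step: usize, gap: usize, g: G)
where
    G: Fn(((&mut u64, &mut u64), &u64)) + Copy,
{
    for chunk in xi.chunks_mut(2 * gap) {
        let (lo, hi) = chunk.split_at_mut(gap);
        lo.iter_mut()
            .zip(hi)
            .zip(roots.iter().step_by(step))
            .for_each(g);
    }
}
```

Calling `apply_butterfly(&mut xi, &roots, 1, 2, butterfly_io)` and the same with `butterfly_oi` produces two separate instantiations of the loop, each specialized to its butterfly.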

@jon-chuang
Copy link
Contributor Author

Oh yes, let me do that

@ValarDragon
Copy link
Member

ValarDragon commented Mar 30, 2021

Just benchmarked the FFT on a 48 logical core (24 physical core) machine. It was a 1-2% slowdown until an FFT of size 2^18, after which it provided speedups.

I'm going to try to benchmark on a clean 8 core laptop instance tomorrow, just to see how it performs for laptop provers' cache settings

@jon-chuang
Copy link
Contributor Author

@ValarDragon cool! May I know if this was against the master or by setting the compaction threshold to be impossibly large?
Were the speedups pronounced for large N? Or did you stop measuring at 2^21?

I find that even on an 8C16T machine the speedup increases further for 2^23. I was hoping to get data on very large N and many cores.

@ValarDragon
Copy link
Member

ValarDragon commented Mar 30, 2021

That was measured against master, starting from 2^12 sized FFTs, and ranging until 2^22 sized. I'll measure up to higher sizes

@Pratyush
Copy link
Member

This LGTM modulo the last two comments above, and pending @ValarDragon's benchmark

@Pratyush Pratyush changed the title Cache Alignment for Serial and Parallel FFT & IFFT Cache alignment for serial and parallel FFT and IFFT Apr 8, 2021
@Pratyush Pratyush merged commit e504bda into master Apr 8, 2021
@Pratyush Pratyush deleted the jonch/fft-opt branch April 8, 2021 16:03
@ValarDragon
Copy link
Member

ValarDragon commented Apr 8, 2021

Confirmed I got speedups on an 8 core laptop as well for the relevant size range (2^12 up to 2^21). Glad this speedup got in!
