Cache alignment for serial and parallel FFT and IFFT #245
Conversation
Thanks for the PR @jon-chuang! Could you also benchmark the difference at lower thread counts and at smaller sizes? IIRC when @ValarDragon tried the realignment strategy for the parallel case, there was a big slowdown for smaller sizes.
I tried this idea for the parallel case before, and was getting 20%+ slowdowns at small FFTs (2^15, 2^16). I'll check again for this PR in particular to see how it performs. In the parallel case, it's not super clear to me that taking 50% extra memory is a great trade-off to be making across the board.
@ValarDragon Yes, I was getting small slowdowns (~10%) too, both for the parallel and non-parallel case. With regards to 50% extra memory, I think this is smaller than that; it's more like 25% of the length of the FFT array. But currently, the parallelisation speedup I'm getting is pretty terrible: 4x on an 8C16T machine, with thread utilisation at only about 50%. There seems to be a lot of room for improvement.
Here are the terrible sad results:
This much more favourable result is achieved when the subchunks are only utilised for very large gaps.
Details: 8C16T. @Pratyush @ValarDragon Please help corroborate the results, as I saw that they fluctuated a lot between runs. There is no change in results for MNT. Not sure if this is to be expected.
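For context, here is a minimal sketch of the idea being benchmarked, assuming the usual radix-2 layout; the function names, the min_gap_for_compaction parameter, and the exact loop structure are illustrative, not the PR's actual code. When the gap of a butterfly pass is large, the strided twiddle factors are first copied into a contiguous scratch buffer so the inner loop reads cache-friendly memory; for small gaps the copy is skipped, which is the trade-off driving the benchmarks above.

// Hypothetical sketch of stride-based root compaction, not the PR's code.
// `xi` is one radix-2 chunk of length 2 * gap; `roots` is the full twiddle table.
use core::ops::{Add, Mul, Sub};

fn butterfly_pass<F>(xi: &mut [F], roots: &[F], gap: usize, min_gap_for_compaction: usize)
where
    F: Copy + Mul<Output = F> + Add<Output = F> + Sub<Output = F>,
{
    debug_assert_eq!(xi.len(), 2 * gap);
    // This pass only touches every `stride`-th root (layout assumed here; gap <= roots.len()).
    let stride = roots.len() / gap;

    if gap >= min_gap_for_compaction {
        // Copy the strided roots into a contiguous scratch buffer (the extra
        // memory discussed above) so the hot loop walks memory sequentially.
        let compacted: Vec<F> = roots.iter().step_by(stride).copied().collect();
        apply_butterflies(xi, &compacted, gap, 1);
    } else {
        // Small gaps: the strided accesses are already cache-friendly; skip the copy.
        apply_butterflies(xi, roots, gap, stride);
    }
}

fn apply_butterflies<F>(xi: &mut [F], roots: &[F], gap: usize, root_stride: usize)
where
    F: Copy + Mul<Output = F> + Add<Output = F> + Sub<Output = F>,
{
    let (lo, hi) = xi.split_at_mut(gap);
    for (i, (a, b)) in lo.iter_mut().zip(hi.iter_mut()).enumerate() {
        // Standard radix-2 butterfly: (a, b) -> (a + w*b, a - w*b).
        let t = roots[i * root_stride] * *b;
        *b = *a - t;
        *a = *a + t;
    }
}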
Here is the data for the serial case:
I think the compaction copies could be avoided in the cases where the problem size is small enough, but this comes at the risk of working for one set of params and not for others; one can't optimise everything perfectly, I guess. I can't decide if I am more in favour of optimising the small FFTs or the big ones. The sad fact is that unless one has some instrumentation to do live A/B testing, it's impossible to autotune everything. I suppose one could create an optimisation config file that drives conditional compilation from a single place; the unfortunate, ugly fact is that currently one must manage this with features. Although searching over such a large optimisation space is hard, there exist many techniques in the literature, like Bayesian optimisation, which can quickly determine locally-optimal parameters from expensive data generation, i.e. benchmarking of downstream functions. Ideally, one would be able to compile individual, highly optimised functions in a binary with separate configs, but this itself is hard. If only there were a way to directly access LLVM's optimisation infrastructure from ordinary Rust code to support such an optimisation process; that would be a truly beautiful feature in my mind. See e.g. https://www.cl.cam.ac.uk/~ey204/pubs/MPHIL/2017_SZYMON.pdf writ large.
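As a small illustration of the feature-based configuration mentioned above (the feature name and the constant values are hypothetical, only the MIN_COMPACTION_CHUNKS name comes from the PR), a tuning parameter such as the compaction threshold could be switched at compile time:

// Hypothetical example of feature-gated tuning; values are illustrative only.

/// Minimum number of chunks before root compaction pays for the extra copy,
/// with a separate profile favouring small FFTs.
#[cfg(feature = "tune-small-fft")]
pub const MIN_COMPACTION_CHUNKS: usize = 1 << 14;

#[cfg(not(feature = "tune-small-fft"))]
pub const MIN_COMPACTION_CHUNKS: usize = 1 << 7;

This is exactly the coarse, manual knob-turning the comment laments: the choice is baked in at compile time rather than discovered by any autotuning loop.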
Btw @ValarDragon, you do know that your benches were wrong, right? They did not include the cost of the roots compaction, due to the bug. So all those improvements that were stated are not proven. In fact, the above benches show that it is much harder to find a suitable trade-off.
I think the numbers for parallel realignment that @ValarDragon reported were from his old PR #177, which didn't have the bug.
I'm not sure if I believe this. Could you confirm, @ValarDragon?
Co-authored-by: Pratyush Mishra <pratyushmishra@berkeley.edu>
All I can think, however, is that since reasonable circuits have size at least about 2^18, one should disregard the benchmarks for 2^17 and below; the current PR is doing very well there. In particular, one would expect even better improvements for large N, i.e. 2^25.
It seems like the FFT code is faster even at small sizes for a small number of cores? (4 threads)
@Pratyush, yes, that is expected, as the code worries less about partitioning the data into subsets of reasonable size.
poly/src/domain/radix2/fft.rs (outdated diff):
let compaction_size = core::cmp::min(
    roots_cache.len() / 2,
    roots_cache.len() / MIN_COMPACTION_CHUNKS,
);
What happens when roots_cache.len() is 2, or less than MIN_COMPACTION_CHUNKS? Could you amend the tests to check that as well? Thanks!
Hmm, the compaction wouldn't happen, so we don't have to worry about it. Notice that cmp::min is only necessary for MIN_COMPACTION_CHUNKS = 1, since chunks > 0.
If roots_cache.len() < MIN_COMPACTION_CHUNKS, then chunks <= xi.len() / 2 = roots_cache.len() < MIN_COMPACTION_CHUNKS.
Ok sounds good, a comment to that effect would be great.
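A sketch of what such a comment could look like (written as a standalone function for illustration; the constant's value and the free-function form are not the PR's actual code):

// Illustrative only; MIN_COMPACTION_CHUNKS value is a placeholder.
const MIN_COMPACTION_CHUNKS: usize = 1 << 7;

fn compaction_size(roots_cache_len: usize) -> usize {
    // If roots_cache_len < MIN_COMPACTION_CHUNKS, then the chunk count
    // (bounded by xi.len() / 2 = roots_cache_len) is also below
    // MIN_COMPACTION_CHUNKS, so the compaction path never runs and this value
    // is unused. The `min` is only needed for MIN_COMPACTION_CHUNKS == 1,
    // since the chunk count is always > 0.
    core::cmp::min(
        roots_cache_len / 2,
        roots_cache_len / MIN_COMPACTION_CHUNKS,
    )
}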
This looks great. In terms of refactoring, there's still some common code between algebra/poly/src/domain/radix2/fft.rs lines 186 to 202 (at a03c4a0) and algebra/poly/src/domain/radix2/fft.rs lines 239 to 255 (at a03c4a0). I think we should extract these into a common method as well, so that really the only thing different between the two is the root compaction.
Not sure about passing a function pointer though... then one instantiates it for both anyway, rather than inlining, sadly.
I think the method should be inlined anyway, as long as you use generics.
Oh yes, let me do that.
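For illustration, a sketch of the kind of extraction being discussed (all names here are made up, not the PR's code): taking the per-chunk body as a generic closure parameter, rather than a function pointer, lets the compiler monomorphise and inline a separate copy for each call site, so the compacted and non-compacted paths differ only in how the i-th root is looked up.

// Hypothetical sketch: generic closure parameter instead of a function pointer,
// so each caller gets its own monomorphised, inlinable copy of the loop.
#[inline]
fn apply_to_chunks<F, G>(xi: &mut [F], gap: usize, mut butterfly: G)
where
    G: FnMut(&mut F, &mut F, usize),
{
    // Assumes gap > 0 and xi.len() is a multiple of 2 * gap (power-of-two FFT sizes).
    for chunk in xi.chunks_mut(2 * gap) {
        let (lo, hi) = chunk.split_at_mut(gap);
        for (i, (a, b)) in lo.iter_mut().zip(hi.iter_mut()).enumerate() {
            butterfly(a, b, i);
        }
    }
}

The compacted and uncompacted variants would then each pass their own closure capturing the appropriate roots slice and stride.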
Just benchmarked the FFT on a 48 logical core (24 physical core) machine. It was a 1-2% slowdown until an FFT of size ... I'm going to try to benchmark on a clean 8-core laptop instance tomorrow, just to see how it performs for laptop provers' cache settings.
@ValarDragon Cool! May I know if this was against master, or by setting the compaction threshold to be impossibly large? I find that even on an 8C16T machine the speedup increases further for 2^23. I was hoping to get data on very large N and many cores.
That was measured against master, starting from 2^12-sized FFTs and ranging up to 2^22. I'll measure up to higher sizes.
This LGTM modulo the last two comments above, and pending @ValarDragon's benchmark.
Confirmed I got speedups on an 8-core laptop as well for the relevant size range (2^12 up to 2^21). Glad this speedup got in!
Description
closes: #242
Results:
FFT Parallel 16C32T
2^20 - 11%
2^21 - 17%
2^22 - 25%
IFFT Parallel 16C32T
2^20 - 11%
2^21 - 23%
2^22 - 17%
Before we can merge this PR, please make sure that all the following items have been
checked off. If any of the checklist items are not applicable, please leave them but
write a little note why.
Added an entry to the Pending section in CHANGELOG.md
Re-reviewed Files changed in the Github PR explorer