Estimating Planner #37
Comments
I just made a PR for my first prototype estimating planner, mostly to show it and get some feedback. Benchmarking seems tricky. We need many, many measurement points for this, so we can't average for too long on each one. I tried a few different ways but it's always noisy. One problem seems to be the turbo frequency scaling of modern CPUs, which makes the speed vary in strange ways.
FFTW automated the process of generating benchmarks based on the requested parameters, and it would automatically search among the benchmarks for the best implementation on the user's target platform. I'm no expert, so I don't know if this would be useful for RustFFT, but it seems like a clever approach.
Yes, FFTW can run some measurements to make sure it picks the best implementation. The downside is that this is relatively time consuming (I think it was half a second or something last time I tried it). It can also make a very large set of measurements and store the results in a "wisdom" database, which can then be used instead of making new measurements.
I definitely think a measuring planner is an interesting direction to go long-term. I'd call it out of scope for at least a couple of years though, because yeah, I think we can get users to the 95% level with just estimation.
I was searching for info on more accurate measurements, and I found this: https://hackmd.io/sH315lO2RuicY-SEt7ynGA It's pretty long, but the gist of it is that this crate has an unreleased (i.e. it's in master but not on crates.io) feature to measure performance based on the CPU cycle counter. To use it, you'd have to:
Based on the results in that article, this may give us measurements with significantly less noise. @HEnquist Are you interested in investigating this? If not I can do it after I've wrapped up real FFT work.
Very interesting! I'll take a look when I'm done with adding SSE/AVX to my resampling lib (got inspired by the big gains AVX gives in RustFFT). In less than a week hopefully.
https://github.com/bheisler/iai released today
That is nice! Looks much easier to use than measureme. I'll give it a try!
I played a little with iai just now, looks very promising!
Great! It seems like the next step, until they support some form of procedural testing, is to make a Rust script that generates test cases, similar to my "compare_2nem_strategies" benchmark. How long did it take to run? To benchmark something like MixedRadix with this, we'd need a really high volume of tests.
Those 32 tests (or really 16 since I need to run each one twice) ran in 7 seconds on my Ryzen laptop. Leaving it overnight should produce a nice big set of results. I'm thinking of using macros to generate the test cases, and then making a Python script to read and analyze the results. There is an iai_macro crate that should make things a little easier, but I haven't figured out how to use it yet.
One thing I observed while benchmarking radix4 is that it's faster at small-to-medium sizes, but at large sizes, a mixed radix with two roughly even inner sizes was faster. As a quick check on whether this tool gives real-world-applicable results, it would be useful to plot the radix4 results against the results of putting the power-of-two FFT inside a MixedRadix, with radix4 as the two inner FFTs. I.e. for 4096, do a MixedRadix with two inner size-64 Radix4 instances; for 8192, do a MixedRadix with inner size-64 and size-128 Radix4 instances. If it's working, we should see radix4 be faster at first, but get overtaken by MixedRadix.
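To make the "roughly even" split concrete, here is a minimal sketch of how the two inner sizes could be picked for a power-of-two length. The function name `split_power_of_two` is hypothetical and is not part of RustFFT; it only reproduces the 4096 and 8192 examples above.

```rust
/// Split a power-of-two FFT length into two roughly even power-of-two
/// factors, smaller one first. Hypothetical helper for illustration only;
/// this is not a RustFFT API.
fn split_power_of_two(len: usize) -> (usize, usize) {
    assert!(
        len.is_power_of_two() && len >= 4,
        "expected a power of two that is at least 4"
    );
    // len == 2^exponent
    let exponent = len.trailing_zeros();
    // Give the smaller factor half of the exponent, rounded down.
    let small = 1usize << (exponent / 2);
    let large = len / small;
    (small, large)
}
```

With this sketch, `split_power_of_two(4096)` gives `(64, 64)` and `split_power_of_two(8192)` gives `(64, 128)`, matching the two cases described above.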
Actually, that seems to have changed after this: 6ab56f9#diff-07bb71e908a04b41bccc8c3665eae0e78f9d562af2e4425df3c442c2337f5bdc That "slightly faster" seems to grow with length, so I don't really see mixed radix winning any more (at length 4194304 and below at least). In my previous benchmarks at 4194304, I get that mixed radix is 17% slower than radix4. Using iai, I get mixed radix 28% slower instead.
Oh dang. Well, as long as iai reflects reality, that's a good sign haha
Right now, the planner uses hardcoded heuristics to plan FFTs. We benchmark with a bunch of different FFT setups, use those benchmarks to build an intuition about what's fastest, and then hardcode that intuition into the planner's decision-making process.
For example, in FftPlannerScalar::design_prime(), we check if any of the inner FFT's prime factors are greater than some constant. If they are, we switch to Bluestein's Algorithm. If they're all smaller, we use Rader's Algorithm. That's presumably because, through benchmarking, it was determined that large prime factors slow down Rader's Algorithm.
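As a rough illustration of what such a hardcoded heuristic looks like, here is a minimal sketch. It is not the actual body of `design_prime()`: the names `PrimeStrategy`, `choose_prime_strategy`, and `largest_prime_factor` are hypothetical, and the threshold value is a placeholder rather than the planner's real constant. The one thing it relies on is that Rader's Algorithm for a prime length `len` needs an inner FFT of length `len - 1`.

```rust
/// Placeholder threshold. This is an assumption for the sketch, not
/// necessarily the constant the real planner uses.
const MAX_RADER_PRIME_FACTOR: usize = 23;

#[derive(Debug, PartialEq)]
enum PrimeStrategy {
    Raders,
    Bluesteins,
}

/// Largest prime factor of `n` by trial division (returns 1 for n <= 1).
fn largest_prime_factor(mut n: usize) -> usize {
    let mut largest = 1;
    let mut p = 2;
    while p * p <= n {
        while n % p == 0 {
            largest = p;
            n /= p;
        }
        p += 1;
    }
    // Whatever remains above 1 is a prime larger than every factor found so far.
    if n > 1 {
        largest = n;
    }
    largest
}

/// Hypothetical stand-in for the hardcoded decision: `len` is assumed to be
/// prime, and Rader's Algorithm would need an inner FFT of length `len - 1`.
fn choose_prime_strategy(len: usize) -> PrimeStrategy {
    let inner_len = len - 1;
    if largest_prime_factor(inner_len) > MAX_RADER_PRIME_FACTOR {
        PrimeStrategy::Bluesteins
    } else {
        PrimeStrategy::Raders
    }
}
```

The point of the sketch is that the decision about Rader's Algorithm is made by inspecting the child FFT's length, which is exactly the coupling the estimating approach tries to remove.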
This approach results in generally fast FFTs, but it has some limitations:
@HEnquist had the idea of changing our approach to instead build an estimate of how each individual FFT algorithm performs, and compose those estimates together to estimate how a chain of FFT algorithms would perform.
This approach is appealing because it limits the scope of any individual heuristic. Instead of having to measure how Rader's Algorithm responds to its child FFT's prime factors, we only need to measure how Rader's Algorithm itself performs, independently of its child FFT. Then, when choosing an FFT algorithm, we can compose our estimate of Rader's Algorithm with our independently determined estimate of Rader's child FFT.
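One way to picture the composition is a recursive cost estimate over a chain of algorithms. The sketch below is only an assumption about how it could be modelled: the `Plan` type is hypothetical, and treating each wrapping algorithm as a pure multiplier on its child's cost is a simplification chosen for illustration, not a settled design.

```rust
/// Hypothetical plan tree for illustration; not a RustFFT type.
enum Plan {
    /// A base FFT whose cost was measured directly (arbitrary units).
    Base { cost: f64 },
    /// An algorithm that wraps one inner FFT. Its own overhead is modelled,
    /// as a simplifying assumption, as a multiplier on the inner FFT's cost.
    Wrapper { multiplier: f64, inner: Box<Plan> },
}

impl Plan {
    /// Compose the independently determined estimates down the chain.
    fn estimated_cost(&self) -> f64 {
        match self {
            Plan::Base { cost } => *cost,
            Plan::Wrapper { multiplier, inner } => multiplier * inner.estimated_cost(),
        }
    }
}
```

Each `Wrapper` only knows its own multiplier, so changing the estimate for one algorithm never requires re-measuring the algorithms around it.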
As a proof of concept, I forked @HEnquist's crate for generating performance charts (https://github.com/ejmahler/fftbenching) and changed it to measure the performance of MixedRadix2xnAvx, 3xnAvx, 4xnAvx etc., divided by the performance of its inner FFT, and I got this chart:
Based on eyeballing the chart, we can estimate a few different approaches to computing an FFT of size 480k. One option is to use MixedRadix12xnAvx with an inner FFT of size 40k. According to the chart, this would add a 15x multiplier to whatever the performance of the 40k FFT is.
Estimating other strategies for going from 480k to 40k:
So based on these rough estimates, 12xn is much faster. This matches the hardcoded heuristics of the AVX planner, which strongly prefers 12xn over any other algorithm. If we could find a way to automate this estimation process, we could have a planner whose heuristics are much more self-contained and reliable.
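A minimal sketch of what automating this comparison might look like is below. Everything in it is hypothetical: the `Candidate` type and the `cheapest` function are illustrative names, and a candidate that takes several steps to get from the outer size to the inner size is assumed to cost the product of its per-step multipliers. The only number taken from this post is the 15x multiplier for 12xn; any other multiplier you feed it would have to come from real measurements.

```rust
/// One way of getting from the outer FFT size down to a shared inner size.
/// Hypothetical type for illustration; not a RustFFT API.
struct Candidate {
    name: &'static str,
    /// Multiplier contributed by each step of the chain, as read off a chart.
    step_multipliers: Vec<f64>,
}

impl Candidate {
    /// Assumption: multipliers of chained steps combine by multiplication.
    fn total_multiplier(&self) -> f64 {
        self.step_multipliers.iter().product()
    }
}

/// Return the candidate with the lowest estimated cost, together with that
/// cost, given the estimated cost of the shared inner FFT.
fn cheapest(candidates: &[Candidate], inner_cost: f64) -> Option<(&Candidate, f64)> {
    candidates
        .iter()
        .map(|c| (c, c.total_multiplier() * inner_cost))
        .min_by(|a, b| a.1.total_cmp(&b.1))
}
```

For example, a single-step candidate with `step_multipliers: vec![15.0]` beats any multi-step candidate whose multipliers multiply out to more than 15, whatever the inner FFT's cost is.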
Some unanswered questions:
I wrote all this out to collect my thoughts on the issue and start a centralized discussion on it. I can see this shipping with RustFFT 5.1 or something if everything goes smoothly - or maybe this is a yearlong project.