Feature Detection "is faster" #7
Comments
Yeah, in a world where we expose "is_fast" kinds of features, we would have to leave it up to the individual engine to decide whether a feature "is fast" or not, since we can't standardize a meaning for that. You're right that speed might change between hardware, but the engine could still make a best-effort attempt to usefully hint to an application about whether an instruction set is "fast" or not. I think in practice this could end up being very useful.
Thomas, you brought up some good points I hadn't thought of about leaving it to the engine, rather than the "application", to determine what is fast. I also agree this is a very useful idea if it can be pulled off. The problem is that, without running the exact code or something very representative, I still think it may be too much work for an engine to determine whether something "is fast". "Is fast" has so many dependencies, like cache size, cache line fetches, and data alignment issues. An example on older Intel hardware: if the SIMD data was unaligned and/or crossed cache line boundaries, you could take significant performance hits, and depending on what the code did, non-SIMD would be at parity with SIMD. (Not sure I ever saw it regress, but I guess it is possible.)
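To make the cache-line concern concrete, here is a minimal sketch in C (not from the thread) of the arithmetic for deciding whether a SIMD access straddles a cache-line boundary. The helper name `crosses_cache_line` and the 64-byte line size used in the note below are assumptions for illustration; real line sizes vary by CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns true if an access of `width` bytes starting at
 * address `addr` touches more than one cache line.
 * `line_size` must be a non-zero power of two. */
bool crosses_cache_line(uintptr_t addr, size_t width, size_t line_size) {
  assert(line_size != 0 && (line_size & (line_size - 1)) == 0);
  assert(width != 0);
  /* Index of the line holding the first byte and the line holding the last byte. */
  uintptr_t first_line = addr / line_size;
  uintptr_t last_line = (addr + width - 1) / line_size;
  return first_line != last_line;
}
```

With an assumed 64-byte line, a 16-byte (128-bit) load at offset 48 stays within one line, while the same load at offset 56 spans two. This is exactly the kind of data-dependent detail an engine cannot know ahead of time.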
True, if the hardware itself doesn't even get much speedup from SIMD or has a lot of performance pitfalls, then perhaps the best thing for the engine to do in that case is conservatively report "not fast." Assuming reasonably recent hardware, though, I expect the engine will mostly base the distinction on whether it polyfills/scalarizes SIMD or actually lowers it to native SIMD instructions. Again, it doesn't need to be perfect, just helpful on average.
In my mind, "is_fast" amounts to "the engine+cpu have native support for this" rather than it being emulated expensively in software. For example, in the case of SIMD, that means having native 128-bit registers and the majority of the spec'd instructions implemented in hardware, rather than emulated with scalar code by the engine. Of course microprocessor generations are going to vary on the exact cost of some instructions, but that's not what I had taken "is_fast" to mean.
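The rule of thumb from the last two comments could be sketched roughly as follows. This is only an illustration in C of the decision logic, not any engine's actual API: `lowering_t`, `feature_info_t`, `known_slow_hardware`, and `feature_is_fast` are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: how an engine compiles a given feature on the current CPU. */
typedef enum {
  LOWERING_UNSUPPORTED, /* feature is not available at all */
  LOWERING_SCALARIZED,  /* polyfilled/emulated with scalar code */
  LOWERING_NATIVE       /* lowered to native SIMD instructions */
} lowering_t;

typedef struct {
  lowering_t lowering;
  /* Hypothetical flag: the engine knows this hardware has serious
   * performance pitfalls for the feature, even with native lowering. */
  bool known_slow_hardware;
} feature_info_t;

/* Best-effort hint: "fast" means natively lowered, unless the engine has a
 * specific reason to be conservative and report "not fast". */
bool feature_is_fast(const feature_info_t* info) {
  assert(info != NULL);
  return info->lowering == LOWERING_NATIVE && !info->known_slow_hardware;
}
```

The hint deliberately says nothing about per-instruction cost across CPU generations; it only separates native lowering from emulation.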
There is a lot of gray area within "native support" - some instructions can be substantially more expensive on one platform than on another. And by substantial I don't mean 20x, but enough to make their use in SIMD-enabled algorithms tricky. This isn't a purely theoretical problem. For example, here is some popular code that tests for x86 vs Arm to pick different microkernel implementations: https://github.com/google/XNNPACK/blob/852f70d3157ff847a316ae9321bc142be77cee87/src/init.c#L80-L85 Given that this has been known for a long time and we have standardized SIMD, I think we are comfortable with this kind of code. I am wondering if this can be standardized in some form (probably not exactly). I think @conrad-watt asked about code that does this kind of testing in practice in the 2022-04-12 meeting; apologies if I am wrong.
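For readers who don't follow the link, the kind of check being described can be sketched like this. This is my reading of the technique, written from scratch rather than quoted from XNNPACK, and the function name is hypothetical. The assumption is that x86 hardware sets the sign bit on a NaN produced from non-NaN inputs, while most other architectures leave it clear.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Hypothetical probe: returns true if a NaN generated by the hardware
 * (from non-NaN inputs) has its sign bit set. Under the assumption stated
 * above, a true result suggests an x86 host. */
bool generated_nan_has_sign_bit(void) {
  /* volatile keeps the compiler from folding inf - inf at compile time,
   * so the subtraction is performed by the actual hardware. */
  static const volatile float inf = INFINITY;
  const float generated_nan = inf - inf;
  /* IEEE 754 requires inf - inf to produce a NaN. */
  assert(isnan(generated_nan));
  return signbit(generated_nan) != 0;
}
```

The result is then used only to pick between microkernel implementations, so a wrong guess costs performance, not correctness.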
@penzn that's an incredible example, thanks for bringing it up! I've not previously been exposed to code in the wild testing NaN bits in this way. The more extreme version of my question was whether, in the presence of relaxed SIMD, any code would attempt to rely on platform-specific semantic differences beyond just making performance decisions (in particular, IIRC, testing for and relying on single-rounding FMA). But it actually looks like this question was also brought up and discussed previously in WebAssembly/relaxed-simd#44.
@conrad-watt, you are welcome! This is very similar to how native numerical libraries query CPU features; in fact, the same source file does native init in XNNPACK. The only real difference is the Edit: what I meant to say is that this interesting check is used in lieu of FMA discussion wasn't conclusive, but yes, it can be an example of a situation where detecting a CPU feature might be necessary at some level.
I have a problem with switching from feature detection as originally proposed to an "is faster" approach: hardware changes, and "is faster" means nothing from generation to generation. You will still have to detect the exact hardware to determine whether something "is faster". Feature detection should probably be used for code path detection and not performance detection, as that should be left up to the application IMO.