This repository has been archived by the owner on Dec 7, 2024. It is now read-only.

Feature Detection "is faster" #7

Open
rrwinterton opened this issue Mar 15, 2022 · 7 comments

Comments

@rrwinterton

I have a problem with switching from feature detection, as originally proposed, to an "is faster" approach: hardware changes, and what "is faster" on one generation means nothing on the next. You would still have to detect the exact hardware to determine whether something "is faster". Feature detection should probably be used for code-path selection, not performance detection; performance decisions should be left to the application, IMO.
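One way an application could keep performance decisions on its own side, as suggested above, is to time candidate implementations on representative data and pick the winner at startup. The sketch below is hypothetical (the kernel names and `pickFaster` helper are illustrative, not from any proposal); the two kernels stand in for real scalar vs. SIMD code paths.

```javascript
// Hypothetical sketch: the application measures instead of asking the
// engine what "is faster". The kernels are stand-ins for real scalar
// vs. SIMD implementations.
function sumScalar(arr) {
  let s = 0;
  for (let i = 0; i < arr.length; i++) s += arr[i];
  return s;
}

function sumUnrolled(arr) {
  let s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  const n = arr.length - (arr.length % 4);
  for (let i = 0; i < n; i += 4) {
    s0 += arr[i]; s1 += arr[i + 1]; s2 += arr[i + 2]; s3 += arr[i + 3];
  }
  let s = s0 + s1 + s2 + s3;
  for (let i = n; i < arr.length; i++) s += arr[i];
  return s;
}

// Time each candidate on representative input and keep the fastest.
function pickFaster(candidates, input, reps = 50) {
  let best = null, bestTime = Infinity;
  for (const fn of candidates) {
    const t0 = performance.now();
    for (let r = 0; r < reps; r++) fn(input);
    const elapsed = performance.now() - t0;
    if (elapsed < bestTime) { bestTime = elapsed; best = fn; }
  }
  return best;
}

const data = new Float32Array(1 << 16).fill(1);
const kernel = pickFaster([sumScalar, sumUnrolled], data);
console.log(kernel(data)); // both kernels agree: 65536
```

The downside, of course, is exactly the cost rrwinterton alludes to: the benchmark has to be representative of the real workload, or the selection can be wrong.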

@tlively
Member

tlively commented Mar 15, 2022

Yeah, in a world where we expose "is_fast" kinds of features, we would have to leave it up to the individual engine to decide whether a feature "is fast", since we can't standardize a meaning for that. You're right that speed can change between hardware generations, but the engine could still make a best-effort attempt to usefully hint to an application about whether an instruction set is "fast". I think in practice this could end up being very useful.

@rrwinterton
Author

Thomas, you brought up some good points I hadn't considered about leaving it to the engine, rather than the application, to determine what is fast. I also agree this is a very useful idea if it can be pulled off. The problem is that without running the exact code, or something very representative of it, I still think it may be too much work for an engine to determine whether something "is fast". There are so many dependencies: cache size, cache-line fetches, data alignment issues. For example, on older Intel hardware, if SIMD data was unaligned and/or crossed cache-line boundaries you could take significant performance hits, and depending on the code, non-SIMD would be at parity with SIMD. (I'm not sure I ever saw it regress, but I guess it is possible.)

@tlively
Member

tlively commented Mar 15, 2022

True, if the hardware itself doesn't even get much speedup from SIMD or has a lot of performance pitfalls, then perhaps the best thing for the engine to do in that case is conservatively report "not fast." Assuming reasonably recent hardware, though, I expect the engine will mostly base the distinction on whether it polyfills/scalarizes SIMD or actually lowers it to native SIMD instructions. Again, it doesn't need to be perfect, just helpful on average.
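To make the "best-effort hint" idea concrete: an application consuming such a hint would probably treat it as advisory, not authoritative. The `simdIsFast` parameter below is entirely hypothetical — no such API exists; this just sketches how presence detection and a fast/slow hint might compose.

```javascript
// Sketch of consuming a hypothetical "is_fast" hint. `simdIsFast`
// stands in for whatever the engine would expose; nothing like it
// exists today.
function pickKernel({ simdSupported, simdIsFast }) {
  if (!simdSupported) return "scalar";
  // Treat the hint as best-effort: undefined means "no opinion",
  // so prefer SIMD unless the engine explicitly reports it as slow.
  return simdIsFast === false ? "scalar" : "simd";
}

console.log(pickKernel({ simdSupported: true, simdIsFast: true }));  // "simd"
console.log(pickKernel({ simdSupported: true, simdIsFast: false })); // "scalar"
console.log(pickKernel({ simdSupported: false }));                   // "scalar"
```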

@titzer

titzer commented Mar 15, 2022

In my mind, "is_fast" amounts to "the engine and CPU have native support for this," as opposed to emulating it expensively in software. For example, in the case of SIMD: having native 128-bit registers and the majority of the spec'd instructions implemented in hardware, rather than emulated with scalar code by the engine.

Of course microprocessor generations are going to vary on the exact cost of some instructions, but that's not what I had taken "is_fast" to mean.
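For contrast, here is what presence detection (as opposed to "is_fast") looks like today: `WebAssembly.validate` tells you whether the engine can decode a module using a SIMD instruction, but says nothing about whether SIMD is lowered to native instructions or scalarized — which is exactly the gap "is_fast" would fill. The module bytes below follow the approach of the wasm-feature-detect library.

```javascript
// Presence detection: ask the engine whether it accepts a module that
// uses a SIMD instruction. This answers "does the engine support SIMD?",
// not whether SIMD is native-speed. The module is:
//   (func (result v128) i32.const 0 i8x16.splat i8x16.popcnt)
const simdModule = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d,                   // "\0asm" magic
  0x01, 0x00, 0x00, 0x00,                   // version 1
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7b, // type section: () -> v128
  0x03, 0x02, 0x01, 0x00,                   // one function of that type
  0x0a, 0x0a, 0x01, 0x08, 0x00,             // code section, one body
  0x41, 0x00,                               // i32.const 0
  0xfd, 0x0f,                               // i8x16.splat
  0xfd, 0x62,                               // i8x16.popcnt
  0x0b,                                     // end
]);

const simdSupported = WebAssembly.validate(simdModule);
console.log("SIMD decodes:", simdSupported);
```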

@penzn
Contributor

penzn commented Apr 27, 2022

There is a lot of gray area within "native support" - some instructions can be substantially more expensive on one platform than on the other. And by substantial I don't mean 20x, but enough to make their use in SIMD-enabled algorithms tricky.

This isn't a purely theoretical problem. For example, here is a popular library testing for x86 vs Arm to pick different microkernel implementations: https://github.com/google/XNNPACK/blob/852f70d3157ff847a316ae9321bc142be77cee87/src/init.c#L80-L85

Given that this has been known for a long time and we have standardized SIMD anyway, I think we are comfortable with this kind of code. I am wondering whether it can be standardized in some form (probably not exactly is_wasm_x86; maybe something like is_swizzle_fast). I feel that relying on behavior not documented by the spec can lead to compatibility issues down the road: imagine an engine that decided to quiet the sign bit for whatever reason, or somebody modifying the check without understanding what it is supposed to do.
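For readers who haven't seen the trick: the linked XNNPACK check inspects the sign bit of a NaN produced by SIMD arithmetic, because x86 and Arm hardware generate default NaNs with different sign bits. The plain-JS analogue below is illustrative only (the real check runs inside Wasm SIMD, where lane bits are fully observable); notably, JS engines are allowed to canonicalize NaN bit patterns when storing to typed arrays, which is precisely the kind of undocumented behavior being warned about here.

```javascript
// Illustrative only: extract the sign bit of a NaN produced by
// arithmetic. On x86 hardware, generated NaNs typically have the sign
// bit set; on Arm they typically do not. JS engines may canonicalize
// the bit pattern, so the result here is engine- and platform-dependent.
const view = new DataView(new ArrayBuffer(8));
view.setFloat64(0, 0 * Infinity); // arithmetic NaN
const signBit = view.getUint8(0) >>> 7;
console.log("NaN sign bit:", signBit); // prints 0 or 1 depending on engine/platform
```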

I think @conrad-watt asked about code that does this kind of testing in practice in the 2022-04-12 meeting; apologies if I am wrong.

@conrad-watt
Contributor

@penzn that's an incredible example, thanks for bringing it up! I've not previously been exposed to code in the wild testing NaN bits in this way.

The more extreme version of my question was whether, in the presence of relaxed SIMD, any code would attempt to rely on platform-specific semantic differences beyond just making performance decisions (in particular, IIRC, testing for and relying on single-rounding FMA). But it actually looks like this question was also brought up and discussed previously WebAssembly/relaxed-simd#44.

@penzn
Contributor

penzn commented Apr 28, 2022

@conrad-watt, you are welcome! This is very similar to how native numerical libraries query CPU features; in fact, the same source file does the native init in XNNPACK. The only real difference is the cpuinfo_has_... calls, which are the native feature-detection API.

Edit: what I meant to say is that this interesting check is used in lieu of cpuinfo API calls. Some form of this kind of testing is necessary given the nature of SIMD algorithms, since inefficient SIMD code can be self-defeating. If we had started with first-class vector operations or first-class data arrays instead of fixed-width SIMD, this problem might never have arisen, but on the other hand we would not have been able to port many native algorithms.

The FMA discussion wasn't conclusive, but yes, it can serve as an example of a situation where detecting a CPU feature might be necessary at some level.
