Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033
Comments
FYI. @CarolEidt, @fiigii |
We may need more scenarios and profiling data to support this statement... |
It may be worth noting that MSVC, GCC, and Clang all do this. They do it for both I do agree we should do the appropriate profiling/etc, but the native compilers doing it provide a good precedent. |
It's beneficial for both and is done by all native compilers. Shuffle is cheap but broadcast mostly not. |
There are many things that native compilers do that the JIT doesn't. We need to justify this kind of work with actual scenarios. |
@CarolEidt, this is basically the same as: https://github.com/dotnet/coreclr/issues/17256 (just replace I can come up with a bunch of scenarios showing the bad codegen here and actual numbers if it's needed. |
The literals of |
Right, so for
Could you elaborate? None of integer SIMD load instructions that I can find ( |
Yes, I meant the codegen of "evaluating the integer constants" rather than the SIMD instruction 😄
Yes, we have to store each constant vector in a memory slot. If a program has many distinct constant vectors, they would consume a lot of memory. So, there is a trade-off... |
I doubt there can be so many distinct constant vectors in a program that it would matter. The real problem is that each method is getting its own data section and thus its own constants. And even inside the same data section constants are not deduplicated so that the same constant is emitted multiple times. Also add the lack of proper packing into the mix and things start to look a bit risky. All this already happens today for float/double constants but at least those are much smaller... |
I'm not so sure on this part. It looks like it generally takes more bytes to do the insert/shift code than it does to store the raw bytes and read from memory.
; This takes ~38 bytes of code, plus 16 bytes of storage
vmovss xmm0, dword ptr [rip+0x00]
vinsertps xmm0, xmm0, dword ptr [reg+0x04], 0x10
vinsertps xmm0, xmm0, dword ptr [reg+0x08], 0x20
vinsertps xmm0, xmm0, dword ptr [reg+0x0C], 0x30
; This takes ~8 bytes of code, plus 16 bytes of storage
vmovups xmm0, xmmword ptr [reg+0x00]
So, even with "perfect" deduping of float constants (which we don't have), we still only have a code savings of ~2 bytes. |
Hmm, right, |
I've modified this to track |
It looks like dotnet/coreclr#27909 gives us a scenario to motivate this. |
Note that what I did with |
I also have a bunch of constants that could use this. I managed to turn these into hand-written ReadOnlySpan property getters + LoadDquVector256(). While the use of This mostly alleviates the pain of Vector256.Create() with constants, which is quite considerable. |
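For readers unfamiliar with the workaround described in the comment above, here is a minimal sketch of the ReadOnlySpan property-getter pattern. The names `VectorConstants`, `IndexBytes`, and `Indices` are hypothetical, and the sketch swaps in `Unsafe.ReadUnaligned` for the `LoadDquVector256()` call mentioned in the comment so that it needs neither unsafe code nor AVX support; it is an illustration under those assumptions, not the commenter's actual code.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class VectorConstants
{
    // Hypothetical constant data. When a byte array literal is returned
    // directly as ReadOnlySpan<byte>, the C# compiler stores the bytes in
    // the assembly's static data instead of allocating an array per call.
    private static ReadOnlySpan<byte> IndexBytes => new byte[]
    {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    };

    // Reads the 16 bytes as one vector. Unsafe.ReadUnaligned is used here
    // in place of an explicit load intrinsic (an assumption of this sketch).
    public static Vector128<byte> Indices =>
        Unsafe.ReadUnaligned<Vector128<byte>>(
            ref MemoryMarshal.GetReference(IndexBytes));
}
```

A caller would use `VectorConstants.Indices` wherever it would otherwise write `Vector128.Create(...)` with the same sixteen literals. The 256-bit case follows the same shape with 32 bytes of data.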
I'd like to see if I can get this done as it should help codegen in a few places. Would we want to try and mimic dotnet/coreclr#14393, which basically just handles this in lowering by replacing the node with a CC? @CarolEidt, @dotnet/jit-contrib |
Resolved with #35857 |
For the case where the Vector64.Create, Vector128.Create, and Vector256.Create helper functions are called with all constant arguments, we should support emitting a constant which can be loaded from memory, rather than emitting a chain of shuffle or insert calls.

We may also find some benefit in doing the same for partial constants, as a partial constant with several inserts can still be faster than treating it as non-constant.
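To make the two cases concrete, the following sketch shows an all-constant call and a partial-constant call. The class and method names are hypothetical, and the code the JIT actually emits for either method depends on the runtime version; the sketch only illustrates the call shapes the issue is about.

```csharp
using System;
using System.Runtime.Intrinsics;

static class CreateExamples
{
    // Every argument is a compile-time constant, so the whole vector could
    // in principle be emitted as 16 bytes of data and loaded in one go.
    public static Vector128<float> AllConstant() =>
        Vector128.Create(1.0f, 2.0f, 3.0f, 4.0f);

    // Only lane 2 is unknown until run time: a "partial constant". The
    // constant lanes could be loaded from memory and x inserted afterwards.
    public static Vector128<float> PartialConstant(float x) =>
        Vector128.Create(1.0f, 2.0f, x, 4.0f);
}
```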
category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium