Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033

tannergooding · 2018-03-26T16:20:55Z

For the case where the Vector64.Create, Vector128.Create, and Vector256.Create helper functions are called with all constant arguments, we should support emitting a constant which can be loaded from memory, rather than emitting a chain of shuffle or insert calls.

We may also find some benefit in doing the same for partial constants as a partial constant with several inserts can still be faster than treating it as non-constant.

category:cq
theme:vector-codegen
skill-level:intermediate
cost:medium

The text was updated successfully, but these errors were encountered:

tannergooding · 2018-03-26T16:21:03Z

FYI. @CarolEidt, @fiigii

fiigii · 2018-03-26T16:24:16Z

rather than emitting a chain of shuffle or insert calls.

We may need more scenarios and profiling data to support this statement...

tannergooding · 2018-03-26T16:42:14Z

We may need more scenarios and profiling data to support this statement...

It may be worth noting that MSVC, GCC, and Clang all do this.

They do it for both _mm_set1 and _mm_set. However, I imagine it is only beneficial for the latter (_mm_set), since the former (_mm_set1) is generally a single broadcast or shuffle.

I do agree we should do the appropriate profiling/etc, but the native compilers doing it provide a good precedent.

4creators · 2018-03-26T21:17:27Z

it is only beneficial for the latter (_mm_set), since the former (_mm_set1) is generally a single broadcast or shuffle

It's beneficial for both and is done by all native compilers. Shuffle is cheap but broadcast mostly not.

CarolEidt · 2018-03-27T21:38:30Z

I do agree we should do the appropriate profiling/etc, but the native compilers doing it provide a good precedent.

There are many things that native compilers do that the JIT doesn't. We need to justify this kind of work with actual scenarios.

tannergooding · 2018-03-27T21:52:06Z

@CarolEidt, this is basically the same as: https://github.com/dotnet/coreclr/issues/17256 (just replace = new Vector4(1, 2, 3, 4) with = SetVector128(1, 2, 3, 4))

I can come up with a bunch of scenarios showing the bad codegen here and actual numbers if it's needed.

fiigii · 2018-03-28T00:20:01Z

this is basically the same as: dotnet/coreclr#17256 (just replace = new Vector4(1, 2, 3, 4) with = SetVector128(1, 2, 3, 4))

The literals of float/double have to be read from memory, but integers can be encoded in the instructions. So, SetVector128/256 should have different codegen heuristics from Vector4.

tannergooding · 2018-03-28T02:49:52Z

The literals of float/double have to be read from memory

Right, so for float/double, it should be more efficient to do a single read, when all four values are constant.

but integers can be encoded in the instructions

Could you elaborate?

None of integer SIMD load instructions that I can find (movd, movq, movdqa, movdqu, etc) support immediates. So at best, you will be emitting a mov reg, imm64, pinsrq xmm, reg, imm per 8-bytes (on new hardware, with more complex code involving shifting on older hardware).

fiigii · 2018-03-28T05:18:17Z

Could you elaborate?
None of integer SIMD load instructions that
you will be emitting a mov reg, imm64

Yes, I meant the codegen of "evaluating the integer constants" rather than the SIMD instruction 😄

it should be more efficient to do a single read, when all four values are constant.

Yes, we have to store each constant vector in a memory slot. If a program has many distinct constant vectors, it would spend much memory. So, there is tread-off...

mikedn · 2018-03-28T05:27:00Z

If a program has many distinct constant vectors, it would spend much memory. There is tread-off...

I doubt there can be so many distinct constant vectors in a program that it would matter. The real problem is that each method is getting its own data section and thus its own constants. And even inside the same data section constants are not deduplicated so that the same constant is emitted multiple times. Also add the lack of proper packing into the mix and things start to look a bit risky.

All this already happens today for float/double constants but at least those are much smaller...

tannergooding · 2018-03-28T05:30:34Z

If a program has many distinct constant vectors, it would spend much memory.

I'm not so sure on this part. It looks like it generally takes more bytes to do the insert/shift code than it does to store the raw bytes and read from memory.

; This takes ~38 bytes of code, plus 16-bytes of storage
vmovss xmm0, dword ptr [rip+0x00]
vinsertps xmm0, xmm0, dword ptr [reg+0x04], 0x10
vinsertps xmm0, xmm0, dword ptr [reg+0x08], 0x20
vinsertps xmm0, xmm0, dword ptr [reg+0x12], 0x30

; This takes ~8 bytes of code, plus 16-bytes of storage
vmovups xmm0, xmmword ptr [reg+0x00]

So, even with "perfect" deduping of float constants (which we don't have), we still only have a code savings of ~2-bytes.

mikedn · 2018-03-28T05:35:59Z

Hmm, right, SetVector tends to generate so much code that either way things aren't that great in terms of size. The concerns about data size would be more relevant to SetAll, that one generates much less code and yet native compilers tends to also use memory constants for that as well. But then native compilers have the luxury of deduplicating constants...

tannergooding · 2019-11-21T16:22:42Z

I've modified this to track Vector*.Create as the SetVector* methods no longer exist and were changed to the above instead.

CarolEidt · 2019-11-21T17:32:07Z

It looks like dotnet/coreclr#27909 gives us a scenario to motivate this.

mikedn · 2019-11-22T14:54:07Z

Note that what I did with Vector4 & co. may be insufficient/too limited. The proper solution might require adding GT_CNS_SIMD.

damageboy · 2019-12-01T20:27:53Z

I also have a bunch of constants that could use this.

I managed to these into hand-written ReadOnlySpan proeperty getters + LoadDquVector256().

While the use of ReadOnlySpan<byte> initially looks risky, given that these intrinsics are x86 only anyway, and the constant are obviously x86 only as well, it's an ugliness I'm willing to bear.

This mostly alleviates the pain of Vector256.Create() with constants, which is quite considerable.

tannergooding · 2020-04-05T04:36:40Z

I'd like to see if I can get this done as it should help codegen in a few places.

Would we want to try and mimic dotnet/coreclr#14393, which basically just handles this in lowering by replacing the node with a GT_CLS_VAR_ADDR and an emitted constant or would we want to introduce a new GT_CNS_* node type (GT_CNS_SIMD, as suggested above, or GT_CNS_VEC might be good names) and handle this earlier (maybe in importation or morph)?

CC. @CarolEidt, @dotnet/jit-contrib

tannergooding · 2020-05-08T16:10:30Z

Resolved with #35857

tannergooding changed the title ~~Support emitting a constant for certain SetVector hardware intrinsic helpers~~ Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods Nov 21, 2019

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

This was referenced Apr 30, 2020

Adding basic support for recognizing and handling SIMD intrinsics as HW intrinsics #35421

Merged

Optimize Vector128 and Vector256.Create methods #35857

Merged

tannergooding closed this as completed May 8, 2020

ghost locked as resolved and limited conversation to collaborators Dec 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033

Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033

tannergooding commented Mar 26, 2018

tannergooding commented Mar 26, 2018

fiigii commented Mar 26, 2018

tannergooding commented Mar 26, 2018

4creators commented Mar 26, 2018

CarolEidt commented Mar 27, 2018

tannergooding commented Mar 27, 2018 •

edited

Loading

fiigii commented Mar 28, 2018

tannergooding commented Mar 28, 2018

fiigii commented Mar 28, 2018 •

edited

Loading

mikedn commented Mar 28, 2018

tannergooding commented Mar 28, 2018

mikedn commented Mar 28, 2018

tannergooding commented Nov 21, 2019

CarolEidt commented Nov 21, 2019

mikedn commented Nov 22, 2019

damageboy commented Dec 1, 2019

tannergooding commented Apr 5, 2020

tannergooding commented May 8, 2020

Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033

Support emitting a constant for the Vector64/128/256.Create hardware intrinsic methods #10033

Comments

tannergooding commented Mar 26, 2018

tannergooding commented Mar 26, 2018

fiigii commented Mar 26, 2018

tannergooding commented Mar 26, 2018

4creators commented Mar 26, 2018

CarolEidt commented Mar 27, 2018

tannergooding commented Mar 27, 2018 • edited Loading

fiigii commented Mar 28, 2018

tannergooding commented Mar 28, 2018

fiigii commented Mar 28, 2018 • edited Loading

mikedn commented Mar 28, 2018

tannergooding commented Mar 28, 2018

mikedn commented Mar 28, 2018

tannergooding commented Nov 21, 2019

CarolEidt commented Nov 21, 2019

mikedn commented Nov 22, 2019

damageboy commented Dec 1, 2019

tannergooding commented Apr 5, 2020

tannergooding commented May 8, 2020

tannergooding commented Mar 27, 2018 •

edited

Loading

fiigii commented Mar 28, 2018 •

edited

Loading