Enable AVX512 embedded masking for most other intrinsics #101886

Merged: 22 commits, May 9, 2024

@tannergooding (Member) commented May 5, 2024

This is a continuation of #97675 and almost finishes out #87097

In particular, it enables the embedded masking support for all intrinsics except for the various load, store, move, and broadcast intrinsics that explicitly deal with memory operations.

As part of this, the PR explicitly marks intrinsics which should never appear as the intrinsicId of a node to help ensure the relevant intrinsics are being properly handled.
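The flag-based marking described above can be pictured with a small sketch. This is not the JIT's actual code: the enum values, the table, and the `isValidNodeId` helper are hypothetical names chosen for illustration, assuming only that each intrinsic carries a set of flag bits that later phases can check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bits; the real JIT defines its own HW_Flag_* values.
enum IntrinsicFlags : uint32_t
{
    Flag_None                 = 0,
    Flag_EmbMaskingCompatible = 1u << 0,
    Flag_InvalidNodeId        = 1u << 1,
};

// Hypothetical intrinsic ids used only for this illustration.
enum IntrinsicId
{
    Id_Add,
    Id_AddHelper, // assumed to always be rewritten to another id before a node exists
    Id_Count
};

// One flag word per intrinsic, indexed by id.
static const uint32_t kFlags[Id_Count] = {
    Flag_EmbMaskingCompatible, // Id_Add
    Flag_InvalidNodeId,        // Id_AddHelper
};

// Returns true when the id is allowed to appear as the id of a node.
inline bool isValidNodeId(IntrinsicId id)
{
    return (kFlags[id] & Flag_InvalidNodeId) == 0;
}
```

A check like `assert(isValidNodeId(id))` at node-creation time would then catch an intrinsic that was expected to be rewritten earlier but was not.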

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 5, 2024
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Comment on lines +9483 to +9491
bool canUseEmbeddedBroadcast() const
{
    return JitConfig.EnableEmbeddedBroadcast();
}

bool canUseEmbeddedMasking() const
{
    return JitConfig.EnableEmbeddedMasking();
}
@tannergooding (Member Author) commented:

The embedded broadcast/masking support for both AVX512 and SVE is fairly complex in places. Having a knob to disable it helps validate the perf/size wins of the feature and lets users work around any issues that are found.
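As a rough sketch of what such a knob amounts to, the following reads an on/off switch from the environment. It is an assumption-laden illustration only: `parseKnob`, `isFeatureEnabled`, and the "default on, `0` disables" convention are hypothetical and do not describe how `JitConfig` is actually implemented.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical convention: the feature is on by default and is turned off
// only when the knob is explicitly set to "0".
inline bool parseKnob(const char* value)
{
    return (value == nullptr) || (std::strcmp(value, "0") != 0);
}

// Reads the knob from the named environment variable (name chosen by the caller).
inline bool isFeatureEnabled(const char* envName)
{
    return parseKnob(std::getenv(envName));
}
```

Keeping the parsing separate from the environment lookup makes the default-on behavior easy to test in isolation.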

@kunalspathak (Member) left a comment:

I skimmed through the changes in hwintrinsiclistxarch.h and instrsxarch.h and they looked OK to me. If there are any specific changes other than adding HW_Flag_InvalidNodeId or INS_Flags_EmbeddedBroadcastSupported, let me know.

Overall the changes look good. It seems that in multiple places, giving insOpts a default value would save us from passing INS_OPTS_NONE around.

Waiting for superpmi-diff results.

src/coreclr/jit/hwintrinsiclistarm64.h (resolved)
src/coreclr/jit/emitxarch.h (resolved)
src/coreclr/jit/codegen.h (resolved)
src/coreclr/jit/hwintrinsic.h (resolved)
src/coreclr/jit/hwintrinsiccodegenxarch.cpp (resolved)
src/coreclr/jit/hwintrinsicxarch.cpp (resolved)
src/coreclr/jit/lowerxarch.cpp (resolved)
@tannergooding (Member Author) commented:

> Overall the changes look good. It seems that in multiple places, giving insOpts a default value would save us from passing INS_OPTS_NONE around.

I opted to give it a default value for the more generally used emitIns_* APIs but to require it for emitIns_SIMD_*. The latter needs to be more explicit, and I wanted to ensure that every call site passes insOpts through. Where INS_OPTS_NONE is passed explicitly, the caller has had to consider whether broadcasting, masking, or rounding needs to be handled.
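The trade-off between a defaulted and a required options parameter can be sketched as follows. The types and function names here (`InsOpts`, `EmittedIns`, `emitGeneral`, `emitSimd`) are hypothetical stand-ins, not the emitter's real signatures.

```cpp
#include <cassert>

// Hypothetical option values standing in for the emitter's instruction options.
enum InsOpts
{
    OPTS_NONE               = 0,
    OPTS_EMBEDDED_BROADCAST = 1,
    OPTS_EMBEDDED_MASK      = 2,
};

// What the sketch "emits": just a record of the instruction and its options.
struct EmittedIns
{
    int     ins;
    InsOpts opts;
};

// General-purpose helper: options default to none, so most callers stay terse.
inline EmittedIns emitGeneral(int ins, InsOpts opts = OPTS_NONE)
{
    return EmittedIns{ins, opts};
}

// SIMD helper: no default, so every call site must state its options and
// a forgotten mask or broadcast becomes a compile error instead of a silent bug.
inline EmittedIns emitSimd(int ins, InsOpts opts)
{
    return EmittedIns{ins, opts};
}
```

With this shape, `emitSimd(7)` does not compile, which is the point: the author of each SIMD call site has to decide explicitly that `OPTS_NONE` is correct.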

@tannergooding tannergooding marked this pull request as ready for review May 8, 2024 17:13
@kunalspathak (Member) commented:

I didn't quite understand the changes made in f83162d... can you elaborate? Other than that, things look good.

@tannergooding (Member Author) commented:

> I didn't quite understand the changes made in https://github.com/dotnet/runtime/commit/f83162d11baa9b139d6497cbdf61f50779e7d5bd ... can you elaborate? Other than that, things look good.

For SSE-SSE41 there isn't actually an instruction to do floating-point CompareGreaterThan or CompareGreaterThanOrEqual. Conversely, for SSE-AVX2 there isn't actually an instruction to do integer CompareLessThan. Instead, these were emulated by swapping the operands in an early phase (import or lowering). For example, given float we'd have CGT x, y; in lowering we'd change it to CGT y, x and just have codegen emit it as CLT y, x. This was originally done to simplify various other bits and because we never needed to make observations about these intrinsics from that point on.

With AVX512 and the ability to do embedded masking, we want to emit CompareGreaterThanMask instead, but only when it's part of a ConditionalSelect (and we know the mask register will be used directly). Because we were swapping from CGT x, y to CGT y, x, we had no way to know that it now actually meant CLT y, x and thus should become CompareLessThanMask instead.

So the change just ensures that we stop lying about the operation being done when the operands are swapped. Thus CGT x, y becomes CLT y, x instead, and later operations can correctly introspect the operation and do the right thing.
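The invariant described above, that swapping operands must also flip the recorded operation, can be sketched in a few lines. The `CmpNode` type and helpers are hypothetical and far simpler than the JIT's node representation; they only illustrate why a later phase can trust `op` after the swap.

```cpp
#include <cassert>
#include <utility>

enum class CmpOp
{
    GreaterThan,
    LessThan,
};

// Hypothetical comparison node: an operation plus two operand values.
struct CmpNode
{
    CmpOp op;
    float lhs;
    float rhs;
};

// The operation that means the same thing once the operands trade places.
inline CmpOp swappedForm(CmpOp op)
{
    return (op == CmpOp::GreaterThan) ? CmpOp::LessThan : CmpOp::GreaterThan;
}

// Swap the operands AND update the operation, so the node never "lies".
inline void swapOperands(CmpNode& node)
{
    std::swap(node.lhs, node.rhs);
    node.op = swappedForm(node.op);
}

inline bool evaluate(const CmpNode& node)
{
    return (node.op == CmpOp::GreaterThan) ? (node.lhs > node.rhs) : (node.lhs < node.rhs);
}

// Checks that CGT x, y and its swapped form CLT y, x agree.
inline bool swapPreservesMeaning(float x, float y)
{
    CmpNode node{CmpOp::GreaterThan, x, y};
    const bool before = evaluate(node);
    swapOperands(node);
    return (node.op == CmpOp::LessThan) && (evaluate(node) == before);
}
```

Had `swapOperands` left `op` as `GreaterThan`, a later phase choosing between a greater-than and a less-than mask form would pick the wrong one.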

@kunalspathak (Member) left a comment:

LGTM

@tannergooding (Member Author) commented:

linux x64

Diffs are based on 2,304,731 contexts (997,292 MinOpts, 1,307,439 FullOpts).

MISSED contexts: 7 (0.00%)

Overall (-21,300 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
|---|---|---|---|
| benchmarks.run.linux.x64.checked.mch | 15,697,916 | -360 | -1.24% |
| benchmarks.run_pgo.linux.x64.checked.mch | 70,177,184 | -1,995 | -1.17% |
| benchmarks.run_tiered.linux.x64.checked.mch | 15,151,406 | -491 | -1.16% |
| coreclr_tests.run.linux.x64.checked.mch | 416,332,755 | -10,909 | -1.87% |
| libraries.pmi.linux.x64.checked.mch | 61,114,110 | -87 | +1.66% |
| libraries_tests.run.linux.x64.Release.mch | 354,860,299 | -5,100 | -0.50% |
| libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch | 133,696,745 | -1,982 | -1.12% |
| realworld.run.linux.x64.checked.mch | 13,650,157 | -255 | -0.37% |
| smoke_tests.nativeaot.linux.x64.checked.mch | 4,210,219 | -121 | -1.45% |

MinOpts (-8,946 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
|---|---|---|---|
| benchmarks.run_pgo.linux.x64.checked.mch | 23,161,983 | -522 | -0.91% |
| benchmarks.run_tiered.linux.x64.checked.mch | 11,562,115 | -393 | -0.81% |
| coreclr_tests.run.linux.x64.checked.mch | 289,839,696 | -6,365 | -1.89% |
| libraries_tests.run.linux.x64.Release.mch | 193,849,452 | -1,666 | -0.66% |

FullOpts (-12,354 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
|---|---|---|---|
| benchmarks.run.linux.x64.checked.mch | 15,323,854 | -360 | -1.24% |
| benchmarks.run_pgo.linux.x64.checked.mch | 47,015,201 | -1,473 | -1.28% |
| benchmarks.run_tiered.linux.x64.checked.mch | 3,589,291 | -98 | -2.11% |
| coreclr_tests.run.linux.x64.checked.mch | 126,493,059 | -4,544 | -1.83% |
| libraries.pmi.linux.x64.checked.mch | 61,000,849 | -87 | +1.66% |
| libraries_tests.run.linux.x64.Release.mch | 161,010,847 | -3,434 | -0.43% |
| libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch | 123,001,625 | -1,982 | -1.12% |
| realworld.run.linux.x64.checked.mch | 13,242,834 | -255 | -0.37% |
| smoke_tests.nativeaot.linux.x64.checked.mch | 4,209,172 | -121 | -1.45% |

windows x64

Diffs are based on 2,615,190 contexts (1,040,939 MinOpts, 1,574,251 FullOpts).

MISSED contexts: 4 (0.00%)

Overall (-23,037 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
|---|---|---|---|
| aspnet.run.windows.x64.checked.mch | 63,103,761 | -2,979 | -1.08% |
| benchmarks.run.windows.x64.checked.mch | 8,728,529 | -395 | -1.21% |
| benchmarks.run_pgo.windows.x64.checked.mch | 35,373,663 | -1,881 | -1.13% |
| benchmarks.run_tiered.windows.x64.checked.mch | 13,034,809 | -441 | -0.99% |
| coreclr_tests.run.windows.x64.checked.mch | 404,903,405 | -10,624 | -1.92% |
| libraries.crossgen2.windows.x64.checked.mch | 45,226,392 | -6 | -0.29% |
| libraries.pmi.windows.x64.checked.mch | 62,293,489 | -465 | +0.46% |
| libraries_tests.run.windows.x64.Release.mch | 301,873,279 | -4,482 | -1.55% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 138,326,612 | -1,729 | -1.14% |
| realworld.run.windows.x64.checked.mch | 13,553,568 | +91 | -0.43% |
| smoke_tests.nativeaot.windows.x64.checked.mch | 5,016,989 | -126 | -1.32% |

MinOpts (-9,043 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
|---|---|---|---|
| aspnet.run.windows.x64.checked.mch | 29,168,185 | -445 | -1.16% |
| benchmarks.run_pgo.windows.x64.checked.mch | 14,311,838 | -507 | -0.94% |
| benchmarks.run_tiered.windows.x64.checked.mch | 9,747,978 | -393 | -0.84% |
| coreclr_tests.run.windows.x64.checked.mch | 282,314,302 | -6,281 | -1.90% |
| libraries_tests.run.windows.x64.Release.mch | 186,524,539 | -1,417 | -0.46% |

FullOpts (-13,994 bytes)

| Collection | Base size (bytes) | Diff size (bytes) | PerfScore in Diffs |
|---|---|---|---|
| aspnet.run.windows.x64.checked.mch | 33,935,576 | -2,534 | -1.06% |
| benchmarks.run.windows.x64.checked.mch | 8,728,100 | -395 | -1.21% |
| benchmarks.run_pgo.windows.x64.checked.mch | 21,061,825 | -1,374 | -1.23% |
| benchmarks.run_tiered.windows.x64.checked.mch | 3,286,831 | -48 | -1.40% |
| coreclr_tests.run.windows.x64.checked.mch | 122,589,103 | -4,343 | -1.96% |
| libraries.crossgen2.windows.x64.checked.mch | 45,224,679 | -6 | -0.29% |
| libraries.pmi.windows.x64.checked.mch | 62,179,538 | -465 | +0.46% |
| libraries_tests.run.windows.x64.Release.mch | 115,348,740 | -3,065 | -2.12% |
| libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 127,486,318 | -1,729 | -1.14% |
| realworld.run.windows.x64.checked.mch | 13,147,847 | +91 | -0.43% |
| smoke_tests.nativeaot.windows.x64.checked.mch | 5,015,942 | -126 | -1.32% |

@tannergooding (Member Author) commented:

Diffs generally look similar to the following:

-9 (-23.08%) : 69294.dasm - System.Buffers.IndexOfAnyAsciiSearcher+Ssse3AndWasmHandleZeroInNeedle:PackSources(System.Runtime.Intrinsics.Vector128`1[ushort],System.Runtime.Intrinsics.Vector128`1[ushort]):System.Runtime.Intrinsics.Vector128`1[ubyte] (Tier1)

@@ -15,7 +15,7 @@
 ;* V03 loc0 [V03 ] ( 0, 0 ) simd16 -> zero-ref <System.Runtime.Intrinsics.Vector128`1[short]>
 ;* V04 loc1 [V04 ] ( 0, 0 ) simd16 -> zero-ref <System.Runtime.Intrinsics.Vector128`1[short]>
 ;# V05 OutArgs [V05 ] ( 1, 1 ) struct ( 0) [rsp+0x00] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V06 cse0 [V06,T03] ( 3, 3 ) simd16 -> mm1 "CSE #01: aggressive"
+; V06 cse0 [V06,T03] ( 3, 3 ) simd16 -> mm0 "CSE #01: aggressive"
 ;
 ; Lcl frame size = 0
@@ -23,23 +23,21 @@
 G_M3343_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref,
        ;; size=0 bbWeight=1 PerfScore 0.00
 G_M3343_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0106 {rcx rdx r8}, byref
        ; byrRegs +[rcx rdx r8]
-       vmovups  xmm0, xmmword ptr [rdx]
-       vmovups  xmm1, xmmword ptr [reloc @RWD00]
-       vpminuw  xmm0, xmm0, xmm1
-       vmovups  xmm2, xmmword ptr [r8]
-       vpminuw  xmm1, xmm2, xmm1
-       vpackuswb xmm0, xmm0, xmm1
+       vmovups  xmm0, xmmword ptr [reloc @RWD00]
+       vpminuw  xmm1, xmm0, xmmword ptr [rdx]
+       vpminuw  xmm0, xmm0, xmmword ptr [r8]
+       vpackuswb xmm0, xmm1, xmm0
        vmovups  xmmword ptr [rcx], xmm0
        mov      rax, rcx
        ; byrRegs +[rax]
-       ;; size=38 bbWeight=1 PerfScore 15.25
+       ;; size=29 bbWeight=1 PerfScore 12.25
 G_M3343_IG03: ; bbWeight=1, epilog, nogc, extend
        ret
        ;; size=1 bbWeight=1 PerfScore 1.00
 RWD00 dq 00FF00FF00FF00FFh, 00FF00FF00FF00FFh
-; Total bytes of code 39, prolog size 0, PerfScore 16.25, instruction count 9, allocated bytes for code 39 (MethodHash=940ff2f0) for method System.Buffers.IndexOfAnyAsciiSearcher+Ssse3AndWasmHandleZeroInNeedle:PackSources(System.Runtime.Intrinsics.Vector128`1[ushort],System.Runtime.Intrinsics.Vector128`1[ushort]):System.Runtime.Intrinsics.Vector128`1[ubyte] (Tier1)
+; Total bytes of code 30, prolog size 0, PerfScore 13.25, instruction count 7, allocated bytes for code 30 (MethodHash=940ff2f0) for method System.Buffers.IndexOfAnyAsciiSearcher+Ssse3AndWasmHandleZeroInNeedle:PackSources(System.Runtime.Intrinsics.Vector128`1[ushort],System.Runtime.Intrinsics.Vector128`1[ushort]):System.Runtime.Intrinsics.Vector128`1[ubyte] (Tier1)
 ; ============================================================
 Unwind Info:

-3 (-4.69%) : 4801.dasm - System.Diagnostics.Stopwatch:GetElapsedTime(long,long):System.TimeSpan (Tier1)

@@ -26,22 +26,21 @@
 G_M44428_IG02: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
        vxorps   xmm0, xmm0, xmm0
        vcvtsi2sd xmm0, xmm0, rdx
        vfixupimmsd xmm0, xmm0, xmmword ptr [reloc @RWD00], 0
-       vmovups  xmm1, xmmword ptr [reloc @RWD16]
-       vcmppd   xmm2, xmm0, xmmword ptr [reloc @RWD32], 13
-       vcvttsd2si rax, xmm0
-       vpbroadcastq xmm0, rax
-       vpternlogq xmm2, xmm1, xmm0, -54
-       vmovd    rax, xmm2
-       ;; size=63 bbWeight=1 PerfScore 27.08
+       vcmppd   k1, xmm0, xmmword ptr [reloc @RWD16], 13
+       vcvttsd2si rax, xmm0
+       vpbroadcastq xmm0, rax
+       vpblendmq xmm0 {k1}, xmm0, xmmword ptr [reloc @RWD32]
+       vmovd    rax, xmm0
+       ;; size=60 bbWeight=1 PerfScore 25.58
 G_M44428_IG03: ; bbWeight=1, epilog, nogc, extend
        ret
        ;; size=1 bbWeight=1 PerfScore 1.00
 RWD00 dq 0000000000000088h, 0000000000000000h
-RWD16 dq 7FFFFFFFFFFFFFFFh, 7FFFFFFFFFFFFFFFh
-RWD32 dq 43E0000000000000h, 43E0000000000000h
+RWD16 dq 43E0000000000000h, 43E0000000000000h
+RWD32 dq 7FFFFFFFFFFFFFFFh, 7FFFFFFFFFFFFFFFh
-; Total bytes of code 64, prolog size 0, PerfScore 28.08, instruction count 11, allocated bytes for code 64 (MethodHash=d8da5273) for method System.Diagnostics.Stopwatch:GetElapsedTime(long,long):System.TimeSpan (Tier1)
+; Total bytes of code 61, prolog size 0, PerfScore 26.58, instruction count 10, allocated bytes for code 62 (MethodHash=d8da5273) for method System.Diagnostics.Stopwatch:GetElapsedTime(long,long):System.TimeSpan (Tier1)
 ; ============================================================
 Unwind Info:

@tannergooding (Member Author) commented:

In the optimal case, like seen in some of the tests, we can convert something like:

Vector512.ConditionalSelect(mask, x + Vector512.Create(cns), Vector512<T>.Zero)

into

vaddps zmm0 {k1}{z}, zmm0, dword ptr [rax] {1to16}

The few regressions that do show up tend to come from using the EVEX encoding, but that's to be expected since we're using larger instructions that have lower cost. There are some further improvements that could still be made around containment and commutativity (such as inverting the mask or special-casing some types of blending), but those are longer-term goals.
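To make the semantics of the fused instruction above concrete, here is a scalar emulation of a zero-masked add with a broadcast scalar operand over 16 float lanes. It is only a model of the lane-wise behavior, not JIT code; `Vec`, `splat`, and `maskedAddBroadcastZero` are names invented for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLanes = 16; // 16 x 32-bit floats, as in a 512-bit vector
using Vec = std::array<float, kLanes>;

// Builds a vector with every lane set to the same value.
inline Vec splat(float value)
{
    Vec v;
    v.fill(value);
    return v;
}

// Models: ConditionalSelect(mask, x + Create(scalar), Zero)
// - the scalar is broadcast to every lane (the {1to16} part),
// - lanes whose mask bit is set receive x[i] + scalar (the {k1} part),
// - lanes whose mask bit is clear are zeroed (the {z} part).
inline Vec maskedAddBroadcastZero(std::uint16_t mask, const Vec& x, float scalar)
{
    Vec result{}; // value-initialized: all lanes start at 0.0f
    for (std::size_t i = 0; i < kLanes; i++)
    {
        if (((mask >> i) & 1u) != 0)
        {
            result[i] = x[i] + scalar;
        }
    }
    return result;
}
```

The three separate steps in the source expression (broadcast, add, select) collapse into one loop here, which mirrors how a single EVEX instruction can carry all three.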

@tannergooding (Member Author) commented:

Some other prominent diffs look like:

-       vcmppd   xmm4, xmm0, xmm0, 0
-       vcmppd   xmm5, xmm3, xmm3, 0
-       vcmppd   xmm6, xmm1, xmm2, 0
-       vxorps   xmm7, xmm7, xmm7
-       vpcmpgtq xmm7, xmm7, xmm0
-       vpternlogq xmm7, xmm0, xmm3, -54
-       vcmppd   xmm1, xmm2, xmm1, 1
-       vpternlogq xmm1, xmm0, xmm3, -54
-       vpternlogq xmm6, xmm7, xmm1, -54
-       vpternlogq xmm5, xmm6, xmm3, -54
-       vpternlogq xmm4, xmm5, xmm0, -54
-       vmovups  xmmword ptr [rcx], xmm4
+       vcmppd   k1, xmm0, xmm0, 0
+       vcmppd   k2, xmm3, xmm3, 0
+       vcmppd   k3, xmm1, xmm2, 0
+       vxorps   xmm4, xmm4, xmm4
+       vpcmpgtq k4, xmm4, xmm0
+       vblendmpd xmm4 {k4}, xmm3, xmm0
+       vcmppd   k4, xmm2, xmm1, 1
+       vblendmpd xmm1 {k4}, xmm3, xmm0
+       vblendmpd xmm1 {k3}, xmm1, xmm4
+       vblendmpd xmm1 {k2}, xmm3, xmm1
+       vblendmpd xmm0 {k1}, xmm0, xmm1
+       vmovups  xmmword ptr [rcx], xmm0

and

-       vpcmpgtd ymm1, ymm1, ymm0
-       vxorps   ymm2, ymm2, ymm2
-       vpsubd   ymm2, ymm2, ymm0
-       vpternlogd ymm1, ymm2, ymm0, -54
-       vxorps   ymm2, ymm2, ymm2
-       vpcmpgtd ymm1, ymm2, ymm1
+       vpcmpgtd k1, ymm1, ymm0
+       vmovaps  ymm2, ymm0
+       vpsubd   ymm2 {k1}, ymm1, ymm0
+       vpcmpgtd ymm1, ymm1, ymm2
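For readers unfamiliar with the `vpternlogq ..., -54` lines being removed in these diffs: the immediate is an 8-entry truth table, and -54 as an unsigned byte is 0xCA, which encodes a bitwise select (`a ? b : c`). That is why those instructions can be replaced by mask-register blends. The sketch below models the truth-table lookup on 64-bit values; the function names are illustrative and the operand order follows the generic `(a, b, c)` convention rather than any particular assembler syntax.

```cpp
#include <cassert>
#include <cstdint>

// Evaluates a three-input bitwise function given as an 8-bit truth table.
// For each bit position, the bits of a, b, and c form an index (a is the
// high bit) and the corresponding bit of imm8 is the result.
inline std::uint64_t ternaryLogic(std::uint64_t a, std::uint64_t b, std::uint64_t c, std::uint8_t imm8)
{
    std::uint64_t result = 0;
    for (int bit = 0; bit < 64; bit++)
    {
        const unsigned index = static_cast<unsigned>((((a >> bit) & 1) << 2) |
                                                     (((b >> bit) & 1) << 1) |
                                                     ((c >> bit) & 1));
        result |= static_cast<std::uint64_t>((imm8 >> index) & 1) << bit;
    }
    return result;
}

// Reference implementation of a bitwise select: take b where the selector
// bit is 1 and c where it is 0.
inline std::uint64_t bitwiseSelect(std::uint64_t selector, std::uint64_t b, std::uint64_t c)
{
    return (selector & b) | (~selector & c);
}
```

With a vector register as the selector, every bit participates; with a mask register, one bit per lane does the same job, which is the form the new code uses.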

The throughput (TP) regression peaks at around +0.06% in MinOpts. I had tried to avoid doing this in MinOpts in one of the early PRs, and that turned out to be closer to a +1.1% regression because the register allocator had to do more work overall in the typical scenario. So this ends up being a good balance across the entirety of the code.

@tannergooding (Member Author) commented:

CC @fanyang-mono: there seems to be a Mono LLVMAOT failure of the form:

  /__w/1/s/artifacts/bin/mono/linux.x64.Release/opt: mono_aot_dFNFkp/temp.bc: error: Invalid record (Producer: 'LLVM16.0.5' Reader: 'LLVM 16.0.5')
  AOT of image /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_X86_r/X86_Sse2_r.dll failed.
  Mono Ahead of Time compiler - compiling assembly /__w/1/s/artifacts/tests/coreclr/linux.x64.Release/JIT/HardwareIntrinsics/HardwareIntrinsics_X86_r/X86_Sse2_r.dll

I'm guessing this has something to do with the Mono V128 acceleration for x64, but it's not clear what in the tests would be causing it. From what I can tell it only fails for Sse2_r and Sse2_ro, while the additional tests exist more broadly and are replicated across other projects too. The actual test additions are also fairly simple: they just add 4 new methods that cover embedded broadcast and embedded masking patterns using the xplat APIs, which should already be well supported or fall back to the software implementation for Mono.

@tannergooding (Member Author) commented:

I've logged #102037 to track the general issue.

@tannergooding tannergooding merged commit 5fdb133 into dotnet:main May 9, 2024
117 of 120 checks passed
@tannergooding tannergooding deleted the avx512-embed-mask branch May 9, 2024 17:06
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
* Remove HW_Flag_MultiIns in favor of using HW_Flag_SpecialCodeGen

* Add a new flag HW_Flag_InvalidNodeId

* Change HW_Flag_EmbMaskingIncompatible to be HW_Flag_EmbMaskingCompatible

* Mark various compare intrinsics with HW_Flag_NoEvexSemantics

* Marking various intrinsics as EmbBroadcastCompatible, EmbMaskingCompatible, or Commutative

* Applying formatting patch

* Ensure WithLower/WithUpper are not marked as InvalidNodeId

* Ensure that instOptions are being passed down all relevant hwintrinsic code paths

* Ensure the insOpts are plumbed through for EVEX instructions

* Ensure EVEX instructions are properly annotated with EmbeddedBroadcastSupported

* Ensure that embedded broadcast/masking is displayed in the disassembly

* Applying formatting patch

* Updating the hwintrinsic tests to cover embedded broadcast/masking

* Fix some handling in the JIT related to embedded broadcast/masking

* Fixup some tests where validating embedded masking is non-trivial

* Cleanup some cases found by SPMI

* Ensure that CompareLessThan has its operands swapped back if it's being converted to the AVX512 form

* Don't regress a scenario around op_Equality and TYP_MASK

* Adjusting hardware intrinsic tests to test non-zero masks

* Avoid some messiness around operand swapping

* Ensure embedded masks mark TYP_SIMD16 and TYP_SIMD32 instructions as needing EVEX

* Mark Sse2_r/Sse2_ro as AotIncompatible due to runtime/102037
@github-actions github-actions bot locked and limited conversation to collaborators Jun 9, 2024