Improve the handling of SIMD comparisons #104944

tannergooding · 2024-07-16T06:43:10Z

No description provided.

dotnet-policy-service · 2024-07-16T06:43:46Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding · 2024-07-16T21:14:55Z

Got some good diffs (https://dev.azure.com/dnceng-public/public/_build/results?buildId=743416&view=ms.vss-build-web.run-extensions-tab) for Arm64 and x64.

windows arm64

Overall (-3,548 bytes)
MinOpts (-36 bytes)
FullOpts (-3,512 bytes)

windows x64

Overall (-11,220 bytes)
MinOpts (-1,610 bytes)
FullOpts (-9,610 bytes)

For arm64 this is just from allowing op_Equality and op_Inequality to fold. For example:

-            mvni    v16.4s, #0
-            mvni    v17.4s, #0
-            cmeq    v16.4s, v16.4s, v17.4s
-            uminp   v16.4s, v16.4s, v16.4s
-            umov    x0, v16.d[0]
-            cmn     x0, #1
-            cset    x0, eq
-                        ;; size=28 bbWeight=1 PerfScore 5.00
+            mov     w0, #1
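
The fold in the arm64 diff above can be sketched in portable C++. This is only an illustration under stated assumptions, not the JIT's implementation: `VecOperand`, `tryFoldOpEquality`, and `tryFoldOpInequality` are hypothetical names, the operand is modeled as four 32-bit integer lanes, and floating-point lanes (which need NaN handling) are deliberately left out.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a 128-bit SIMD operand in a compiler's IR:
// either a compile-time constant (four 32-bit integer lanes) or an
// unknown runtime value.
struct VecOperand
{
    std::optional<std::array<uint32_t, 4>> constant; // empty => not a constant
};

// Tries to fold `left == right`, where op_Equality is true only if every
// lane matches. Returns the folded result when both operands are constants,
// otherwise std::nullopt (a real comparison would have to be emitted).
std::optional<bool> tryFoldOpEquality(const VecOperand& left, const VecOperand& right)
{
    if (!left.constant || !right.constant)
    {
        return std::nullopt;
    }
    return *left.constant == *right.constant;
}

// op_Inequality is the logical negation of op_Equality.
std::optional<bool> tryFoldOpInequality(const VecOperand& left, const VecOperand& right)
{
    std::optional<bool> equal = tryFoldOpEquality(left, right);
    if (!equal)
    {
        return std::nullopt;
    }
    return !*equal;
}
```

With two all-ones constants (what the two `mvni` instructions materialize), `tryFoldOpEquality` yields `true`, which corresponds to the single `mov w0, #1` in the diff.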

For x64 we get all the same scenarios, but we also further optimize comparisons against AllBitsSet. For example:

        vpcmpeqd xmm1, xmm1, xmm1
-       vpcmpd   k1, xmm1, xmm0, 4
-       kortestb k1, k1
-       sete     al
+       vptest   xmm0, xmm1
+       setb     al
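
To see why `vptest` followed by `setb` answers "is this vector all bits set?", here is a portable scalar model of the flag logic. It is a sketch for illustration only: `Vec128`, `ptestCarry`, and `isAllBitsSet` are hypothetical names, and the model relies on the documented behavior that `vptest a, b` sets the carry flag when `(~a & b)` is all zeros.

```cpp
#include <cassert>
#include <cstdint>

// Portable model of a 128-bit register as two 64-bit halves.
struct Vec128
{
    uint64_t lo;
    uint64_t hi;
};

// Models the carry flag produced by `vptest a, b`: CF = ((~a & b) == 0).
inline bool ptestCarry(Vec128 a, Vec128 b)
{
    return ((~a.lo & b.lo) | (~a.hi & b.hi)) == 0;
}

// `x == AllBitsSet` expressed through the carry flag, mirroring
// `vptest xmm0, xmm1` + `setb al` where xmm1 holds all ones.
inline bool isAllBitsSet(Vec128 x)
{
    const Vec128 allBitsSet{~0ULL, ~0ULL}; // what `vpcmpeqd xmm1, xmm1, xmm1` produces
    return ptestCarry(x, allBitsSet);
}
```

Because the second operand is all ones, the carry flag reduces to `~x == 0`, so no mask register or `kortestb` is needed to produce the boolean.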

xtqqczze · 2024-07-17T20:37:04Z

@MihuBot -dependsOn 104944

MihuBot/runtime-utils#537

Doesn't seem to have helped with regressions in #104488 (compare with MihuBot/runtime-utils#519).

tannergooding · 2024-07-17T20:47:40Z

The "regressions" in that PR are minor size regressions; they aren't necessarily perf regressions.

I plan on handling those separately, as it's a bit more complex to solve given that it requires looking across two levels. What's happening is that `or` is used by `and`, and the `and` is used in `op_Equality`. We only see that `or` is used by `and`, so we decide to fold it to ternarylogic.

It's a tradeoff between a minor size increase and additional complexity in the JIT, where the same fold that causes a minor size increase in this edge case gives a significant size reduction in other cases.

vpor is 1 cycle with 0.33 reciprocal throughput; vptest is <4 cycles with 1 cycle reciprocal throughput.

vmovaps for register to register is 0-1 cycles (typically 0, as it's handled by the register renamer on most CPUs), vpternlogd is 1 cycle with 0.33 reciprocal throughput, and vptest remains the same.

So it's essentially a 1-to-1 trade here that isn't as meaningful to "fix" immediately
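
For readers unfamiliar with the ternary-logic fold discussed above, the following sketch shows how an `or` feeding an `and` collapses into one `vpternlogd` control byte. It is an illustration, not the JIT's code: the names `kMaskA`, `kMaskB`, `kMaskC`, `kOrThenAnd`, and `ternaryLogic` are hypothetical, and the emulation works on a single 32-bit value rather than a full vector. The masks 0xF0, 0xCC, and 0xAA are the conventional truth-table columns for the three inputs.

```cpp
#include <cassert>
#include <cstdint>

// Truth-table columns for the three vpternlog inputs. Evaluating a boolean
// expression over these masks yields the 8-bit control byte for it.
constexpr uint8_t kMaskA = 0xF0;
constexpr uint8_t kMaskB = 0xCC;
constexpr uint8_t kMaskC = 0xAA;

// Control byte for (a | b) & c, i.e. an `or` whose result is used by an `and`.
constexpr uint8_t kOrThenAnd = static_cast<uint8_t>((kMaskA | kMaskB) & kMaskC);

// Bit-by-bit emulation of a ternary-logic operation on one 32-bit value:
// each output bit is looked up in `control` using the three input bits.
inline uint32_t ternaryLogic(uint32_t a, uint32_t b, uint32_t c, uint8_t control)
{
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit)
    {
        uint32_t index = (((a >> bit) & 1u) << 2) | (((b >> bit) & 1u) << 1) | ((c >> bit) & 1u);
        result |= static_cast<uint32_t>((control >> index) & 1u) << bit;
    }
    return result;
}
```

Calling `ternaryLogic(a, b, c, kOrThenAnd)` gives the same bits as `(a | b) & c`, which is why the two bitwise instructions can be replaced by one. The thread's point is that when the `and` in turn feeds `op_Equality`, that replacement is roughly cost-neutral rather than a win.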

xtqqczze · 2024-07-17T21:02:59Z

> So it's essentially a 1-to-1 trade here that isn't as meaningful to "fix" immediately

Replace use of target dependent TestZ intrinsic #104488 (comment) yeah that diff is 1:1
~~Replace use of target dependent TestZ intrinsic #104488 (comment) this diff is 1.47:1 (on latency, single iteration)~~ nvm this isn't a loop

In any case, these are not diffs from this PR.

tannergooding · 2024-07-17T21:47:47Z

> this diff is 1.47:1 (on latency, single iteration)

You're putting a lot of reliance on mca, which itself isn't a fully accurate tool and is giving a best potential estimate of the performance for a very particular microarchitecture. It's not necessarily representative of actual execution performance, latencies, or other behaviors of real world application code.

We do not microtune to that level in the BCL because it is not relevant to most real world workloads (and is far too microarchitecture dependent). We instead find a reasonable balance between overall readability, maintainability, and performance across the entire application. That may include trading a couple cycles in one piece of code to win back cycles in a lot of other more common code.

The specific case you've called out is something that will be addressed, but it should not block the PR you've linked and is not related to the improvements being done in this PR. It's simply an additional improvement that could be made on top, to handle the specific case where a load-bottlenecked vpternlog can be worse than two independent bitwise operations.

EgorBo · 2024-07-19T18:25:18Z

src/coreclr/jit/gentree.cpp

+            case NI_Vector512_op_Equality:
+#endif // !TARGET_ARM64 && !TARGET_XARCH
+            {
+                if (varTypeIsFloating(simdBaseType))


How is this path different from `case NI_Vector128_op_Equality:` above? If it's only for floats, then why isn't it an assert?

Ah, it's when only one of the operands is constant?

Right, it’s for when one operand is all NaN, which is an optimization we can do for float/double.
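
The reasoning behind the all-NaN case can be checked with a small scalar model. This is a sketch under assumptions, not the code in gentree.cpp: `Float4`, `opEquality`, `opInequality`, and `isAllNaN` are hypothetical names, and four `float` lanes stand in for any float/double vector. Since NaN never compares equal to anything, one all-NaN operand decides the result regardless of the other operand.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>

using Float4 = std::array<float, 4>;

// op_Equality semantics: true only if every lane compares equal.
inline bool opEquality(const Float4& left, const Float4& right)
{
    for (std::size_t i = 0; i < left.size(); ++i)
    {
        if (!(left[i] == right[i]))
        {
            return false;
        }
    }
    return true;
}

// op_Inequality semantics: true if any lane differs.
inline bool opInequality(const Float4& left, const Float4& right)
{
    return !opEquality(left, right);
}

// True when every lane of a constant operand is NaN. In that case
// `x == constant` is always false and `x != constant` is always true,
// whatever the non-constant operand `x` holds.
inline bool isAllNaN(const Float4& value)
{
    for (float lane : value)
    {
        if (!std::isnan(lane))
        {
            return false;
        }
    }
    return true;
}
```

A folding pass that sees `isAllNaN(constant)` could therefore replace the comparison with a constant `false` (for `==`) or `true` (for `!=`) even though the other operand is unknown, which is why this path is not restricted to the constant-versus-constant case.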

EgorBo

LGTM, although the last time I tried to constant fold vector comparisons I had no diffs, so it's very likely we have no test coverage there (I presume your diffs aren't from <cns_vec> ==/!= <cns_vec>).

EgorBo · 2024-07-19T18:35:51Z

src/coreclr/jit/gentree.cpp

+                case NI_Vector64_op_Equality:
+#elif defined(TARGET_XARCH)
+                case NI_Vector256_op_Equality:
+                case NI_Vector512_op_Equality:


There is also NI_VectorX_EqualsAll (unless they're normalized to op_Equality somewhere). Btw, the last time I tried to constant fold these, you told me that it is odd to cover only the EQ/NE relational operators ;-) #85584 (comment)

We’ve already added support for all the other comparisons (elementwise eq/ge/gt/le/lt/ne); what remained was the ==/!= operators, which this PR covers.

EqualsAll/Any and the other All/Any APIs are then imported as elementwise compare + op ==/!=, so this covers the full set
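
The decomposition described above, an elementwise compare followed by `==`/`!=`, can be sketched as follows. Treat it as an assumption-laden illustration: the thread does not say which constants the result is compared against, so comparing the lane mask with all-bits-set for "All" and with zero for "Any" is my guess at the shape, and every name here (`Int4`, `Mask4`, `equalsElementwise`, `equalsAll`, `equalsAny`) is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Int4  = std::array<int32_t, 4>;
using Mask4 = std::array<uint32_t, 4>;

const Mask4 kAllBitsSet = {0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu};
const Mask4 kZero       = {0u, 0u, 0u, 0u};

// Elementwise equality: each lane becomes all ones when the inputs match
// and zero otherwise, like a SIMD compare instruction.
inline Mask4 equalsElementwise(const Int4& left, const Int4& right)
{
    Mask4 mask{};
    for (std::size_t i = 0; i < left.size(); ++i)
    {
        mask[i] = (left[i] == right[i]) ? 0xFFFFFFFFu : 0u;
    }
    return mask;
}

// EqualsAll as "elementwise compare, then op_Equality against all ones".
inline bool equalsAll(const Int4& left, const Int4& right)
{
    return equalsElementwise(left, right) == kAllBitsSet;
}

// EqualsAny as "elementwise compare, then op_Inequality against zero".
inline bool equalsAny(const Int4& left, const Int4& right)
{
    return equalsElementwise(left, right) != kZero;
}
```

Under this shape, once both the elementwise compare and the trailing `==`/`!=` can be constant folded, the All/Any APIs fold as well without needing their own cases.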

tannergooding · 2024-07-19T18:52:23Z

> (I presume your diffs aren't from <cns_vec> ==/!= <cns_vec>)

That’s what most of the diffs are from. We have a lot of coverage now, especially since Vector2/3/4 are in managed code and Matrix4x4 was accelerated. There are also some cases in other tests/code where we’re getting optimizations from the switch to more xplat APIs and centralized helpers.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 16, 2024

dotnet-policy-service bot assigned tannergooding Jul 16, 2024

tannergooding force-pushed the simd_equality branch 2 times, most recently from e3f71b1 to f45c339 Compare July 16, 2024 13:35

tannergooding added 2 commits July 16, 2024 10:32

Ensure that we can constant fold op_Equality and op_Inequality for SIMD

cb75246

Optimize comparisons against AllBitsSet on pre-AVX512 hardware

33d95dd

tannergooding force-pushed the simd_equality branch from 06323f6 to 33d95dd Compare July 16, 2024 17:32

xtqqczze mentioned this pull request Jul 16, 2024

Replace use of target dependent TestZ intrinsic #104488

Merged

tannergooding marked this pull request as ready for review July 16, 2024 21:07

kunalspathak requested a review from jakobbotsch July 16, 2024 21:19

MihuBot mentioned this pull request Jul 17, 2024

[JitDiff X64] [tannergooding] Improve the handling of SIMD comparisons MihuBot/runtime-utils#538

Open

EgorBo reviewed Jul 19, 2024


EgorBo approved these changes Jul 19, 2024


tannergooding merged commit e0ecd1f into dotnet:main Jul 19, 2024
107 checks passed

tannergooding deleted the simd_equality branch July 19, 2024 18:55

LoopedBard3 mentioned this pull request Jul 25, 2024

[Perf] Linux/arm64: 9 Improvements on 7/21/2024 1:40:36 PM dotnet/perf-autofiling-issues#38911

Closed

github-actions bot locked and limited conversation to collaborators Aug 19, 2024
