Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the handling of SIMD comparisons #104944

Merged
merged 2 commits into from
Jul 19, 2024

Conversation

tannergooding
Copy link
Member

No description provided.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 16, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@tannergooding
Copy link
Member Author

Got some good diffs (https://dev.azure.com/dnceng-public/public/_build/results?buildId=743416&view=ms.vss-build-web.run-extensions-tab) for Arm64 and x64.

windows arm64

Overall (-3,548 bytes)
MinOpts (-36 bytes)
FullOpts (-3,512 bytes)

windows x64

Overall (-11,220 bytes)
MinOpts (-1,610 bytes)
FullOpts (-9,610 bytes)

For arm64 this is just from allowing op_Equality and op_Inequality to fold. For example:

-            mvni    v16.4s, #0
-            mvni    v17.4s, #0
-            cmeq    v16.4s, v16.4s, v17.4s
-            uminp   v16.4s, v16.4s, v16.4s
-            umov    x0, v16.d[0]
-            cmn     x0, #1
-            cset    x0, eq
-						;; size=28 bbWeight=1 PerfScore 5.00
+            mov     w0, #1

For x64 we get all the same scenarios, but we also further optimize comparisons against AllBitsSet. For example:

        vpcmpeqd xmm1, xmm1, xmm1
-       vpcmpd   k1, xmm1, xmm0, 4
-       kortestb k1, k1
-       sete     al
+       vptest   xmm0, xmm1
+       setb     al

@xtqqczze
Copy link
Contributor

@MihuBot -dependsOn 104944

MihuBot/runtime-utils#537

Doesn't seem to have helped with regressions in #104488 (compare with MihuBot/runtime-utils#519).

@tannergooding
Copy link
Member Author

The "regresssions" in that PR are minor size regressions, they aren't necessarily perf regressions.

I plan on handling those separately as its a bit more complex to solve given it requires looking across two levels. What's happening is that or is used by and and the and is used in op_Equality. We only see that or is used by and so we decide to fold it to ternarylogic.

It's a tradeoff in a minor size increase vs additional complexity in the JIT, where the minor size increase in this edge case is representative of significant size reduction in other cases.


vpor is 1 cycle; 0.33 reciprocal throughput. vptest is <4 cycles; 1 cycle reciprocal throughput

vmovaps for register to register is 0-1 cycle (typically 0 as its handled by the register renamer on most CPUs, vpternlogd is 1 cycle; 0.33 reciprocal throughput, and vptest` remains the same

So it's essentially a 1-to-1 trade here that isn't as meaningful to "fix" immediately

@xtqqczze
Copy link
Contributor

xtqqczze commented Jul 17, 2024

So it's essentially a 1-to-1 trade here that isn't as meaningful to "fix" immediately

In case, these are not diffs from this PR.

@tannergooding
Copy link
Member Author

this diff is 1.47:1 (on latency, single iteration)

You're putting a lot of reliance on mca, which itself isn't a fully accurate tool and is giving a best potential estimate of the performance for a very particular microarchitecture. It's not necessarily representative of actual execution performance, latencies, or other behaviors of real world application code.

We do not microtune to that level in the BCL because it is not relevant to most real world workloads (and is far too microarchitecture dependent). We instead find a reasonable balance between overall readability, maintainability, and performance across the entire application. That may include trading a couple cycles in one piece of code to win back cycles in a lot of other more common code.

The specific case you've called out is something that will be addressed, it is not relevant to block the PR you've linked on and is not related to the improvements that are being done in this PR. It's simply additional improvements that could be had on top to ensure that a specific case of load bottlenecked vpternlog can be worse than two independent bitwise operations.

case NI_Vector512_op_Equality:
#endif // !TARGET_ARM64 && !TARGET_XARCH
{
if (varTypeIsFloating(simdBaseType))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is this path different from case NI_Vector128_op_Equality: above? if it's only for floats, then why it's not an assert?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, it's when only one of the operand is constant?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it’s for when one is all nan, which is an optimization we can do for float/double

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although, the last time I tried to constant fold vector comparisons I had no diffs, so it's very likely we have no test coverage there (I presume your diffs aren't from <cns_vec> ==/!= <cns_vec>)

case NI_Vector64_op_Equality:
#elif defined(TARGET_XARCH)
case NI_Vector256_op_Equality:
case NI_Vector512_op_Equality:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also NI_VectorX_EqualsAll (unless they're normalized to op_Equality somewhere). Btw the last time I tried to constant fold these, you told me that is odd to cover only EQ/NE relation operators ;-) #85584 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We’ve already added support for all the other comparisons (elementwise eq/ge/gt/le/lt/ne), what was remaining was the ==/!= operators, which this pr covers.

EqualsAll/Any and the other All/Any APIs are then imported as elementwise compare + op ==/!=, so this covers the full set

@tannergooding
Copy link
Member Author

(I presume your diffs aren't from <cns_vec> ==/!= <cns_vec>)

That’s what most of the diffs are from, we have a lot of coverage now, especially since vector2/3/4 are in managed and matrix4x4 was accelerated. There’s also some cases in other test/code where we’re getting opts from the switch to use more xplat APIs and centralized helpers

@tannergooding tannergooding merged commit e0ecd1f into dotnet:main Jul 19, 2024
107 checks passed
@tannergooding tannergooding deleted the simd_equality branch July 19, 2024 18:55
@github-actions github-actions bot locked and limited conversation to collaborators Aug 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants