Improve the handling of SIMD comparisons #104944
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Got some good diffs (https://dev.azure.com/dnceng-public/public/_build/results?buildId=743416&view=ms.vss-build-web.run-extensions-tab) for Arm64 and x64.

For arm64 this is just from allowing:

```diff
- mvni    v16.4s, #0
- mvni    v17.4s, #0
- cmeq    v16.4s, v16.4s, v17.4s
- uminp   v16.4s, v16.4s, v16.4s
- umov    x0, v16.d[0]
- cmn     x0, #1
- cset    x0, eq
- ;; size=28 bbWeight=1 PerfScore 5.00
+ mov     w0, #1
```

For x64 we get all the same scenarios, but we also further optimize comparisons against AllBitsSet. For example:

```diff
  vpcmpeqd xmm1, xmm1, xmm1
- vpcmpd   k1, xmm1, xmm0, 4
- kortestb k1, k1
- sete     al
+ vptest   xmm0, xmm1
+ setb     al
```
Doesn't seem to have helped with regressions in #104488 (compare with MihuBot/runtime-utils#519).
The "regresssions" in that PR are minor size regressions, they aren't necessarily perf regressions. I plan on handling those separately as its a bit more complex to solve given it requires looking across two levels. What's happening is that It's a tradeoff in a minor size increase vs additional complexity in the JIT, where the minor size increase in this edge case is representative of significant size reduction in other cases.
So it's essentially a 1-to-1 trade here that isn't as meaningful to "fix" immediately |
In any case, these are not diffs from this PR.
You're putting a lot of reliance on mca, which itself isn't a fully accurate tool and gives a best-case estimate of performance for one particular microarchitecture. It's not necessarily representative of actual execution performance, latencies, or other behaviors of real-world application code.

We do not microtune to that level in the BCL because it is not relevant to most real-world workloads (and is far too microarchitecture dependent). We instead find a reasonable balance between overall readability, maintainability, and performance across the entire application. That may include trading a couple cycles in one piece of code to win back cycles in a lot of other, more common code.

The specific case you've called out is something that will be addressed, but it is not relevant enough to block the PR you've linked on and is not related to the improvements being done in this PR. It's simply an additional improvement that could be made on top, accounting for the fact that a load-bottlenecked vpternlog can be worse than two independent bitwise operations.
```cpp
            case NI_Vector512_op_Equality:
#endif // !TARGET_ARM64 && !TARGET_XARCH
            {
                if (varTypeIsFloating(simdBaseType))
```
How is this path different from `case NI_Vector128_op_Equality:` above? If it's only for floats, then why isn't it an assert?
Ah, it's when only one of the operands is constant?
Right, it's for when one is all NaN, which is an optimization we can do for float/double.
LGTM. Although, the last time I tried to constant fold vector comparisons I had no diffs, so it's very likely we have no test coverage there (I presume your diffs aren't from `<cns_vec> ==/!= <cns_vec>`).
```cpp
            case NI_Vector64_op_Equality:
#elif defined(TARGET_XARCH)
            case NI_Vector256_op_Equality:
            case NI_Vector512_op_Equality:
```
There is also `NI_VectorX_EqualsAll` (unless they're normalized to `op_Equality` somewhere). Btw, the last time I tried to constant fold these, you told me that it's odd to cover only the EQ/NE relational operators ;-) #85584 (comment)
We've already added support for all the other comparisons (elementwise eq/ge/gt/le/lt/ne); what was remaining was the ==/!= operators, which this PR covers.

EqualsAll/Any and the other All/Any APIs are then imported as an elementwise compare + op ==/!=, so this covers the full set.

That's what most of the diffs are from; we have a lot of coverage now, especially since Vector2/3/4 are in managed and Matrix4x4 was accelerated. There are also some cases in other tests/code where we're getting opts from the switch to use more xplat APIs and centralized helpers.