-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrading SpanHelpers with Vector512 #86655
Upgrading SpanHelpers with Vector512 #86655
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsThis is a draft PR created to get some feedback from @tannergooding re some specific questions I have
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had just a quick view -- same applies to the char
variant.
// Find the last unique (which is not equal to ch1) byte | ||
// the algorithm is fine if both are equal, just a little bit less efficient | ||
byte ch2Val = Unsafe.Add(ref value, valueTailLength); | ||
nint ch1ch2Distance = valueTailLength; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valueTailLength
is of type int
, so to avoid the sign-extending move use
nint ch1ch2Distance = valueTailLength; | |
nint ch1ch2Distance = (nint)(uint)valueTailLength; |
See codegen
goto CANDIDATE_FOUND; | ||
} | ||
|
||
LOOP_FOOTER: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codegen-wise: is it good to have these labels in the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is sort of hand-written PGO, less relevant with Dynamic PGO being enabled by default, but we have other targets.
Would be nice to see benchmarks for small inputs (which are more commonly used in these perf-sensitive primitives), maybe it even makes sense to wrap avx-512 path into a call?.. |
4bcdb12
to
aa1e03d
Compare
aa1e03d
to
bda9c39
Compare
ed3f25f
to
f7ed104
Compare
src/coreclr/jit/gentree.cpp
Outdated
else if (elementSize == 1) | ||
{ | ||
for (uint32_t i = 0; i < elementCount; i++) | ||
{ | ||
vecCns.u8[i] = (uint8_t)(vecCns.u8[i * elementSize] / elementSize); | ||
} | ||
|
||
op2 = gtNewVconNode(type); | ||
op2->AsVecCon()->gtSimdVal = vecCns; | ||
|
||
// swap the operands to match the encoding requirements | ||
retNode = gtNewSimdHWIntrinsicNode(type, op2, op1, NI_AVX512VBMI_PermuteVar64x8, simdBaseJitType, simdSize); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to extract this acceleration into its own PR and also cover Vector256 at the same time, if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to separate PR
src/coreclr/jit/hwintrinsicxarch.cpp
Outdated
@@ -2361,7 +2361,7 @@ GenTree* Compiler::impSpecialIntrinsic(NamedIntrinsic intrinsic, | |||
} | |||
else if (simdSize == 64) | |||
{ | |||
if (varTypeIsByte(simdBaseType)) | |||
if (varTypeIsByte(simdBaseType) && (!compExactlyDependsOn(InstructionSet_AVX512VBMI))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Unnecessary parentheses and we want an opportunistic check, since the fallback behavior is identical, just slower:
if (varTypeIsByte(simdBaseType) && (!compExactlyDependsOn(InstructionSet_AVX512VBMI))) | |
if (varTypeIsByte(simdBaseType) && !compOpportunisticallyDependsOn(InstructionSet_AVX512VBMI)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of different PR
byte ch2Val = Unsafe.Add(ref value, valueTailLength); | ||
nint ch1ch2Distance = (nint)(uint)valueTailLength; | ||
while (ch2Val == value && ch1ch2Distance > 1) | ||
ch2Val = Unsafe.Add(ref value, --ch1ch2Distance); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing for you to do here, just calling out that this kind of duplication between Vector512/256/128
is another reason why having an ISimdVector
would be nice: #76423
It might be feasible for us to define/expose for internal use only for the time being. Interested in @stephentoub's thoughts on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly worth experimenting with internally at first.
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
The changes overall LGTM, there's just quite a lot of essentially duplicated code that we can't currently share between Vector512/256/128. I think we want to also consider how to avoid this negatively impacting 1st gen AVX-512 hardware and so I proposed an internal only property to hang off |
3abe035
to
8903c22
Compare
d3c2331
to
bc62579
Compare
3b99276
to
6a2ec3a
Compare
@tannergooding Let me know if ther are any other changes you want me to do for this? |
This should be ready for merge, just would be good to get a secondary sign-off. There's no perf diff for hardware without AVX-512 support since the entire code path is treated as dead code. Numbers generally look good for hardware with AVX-512 support with the additional branch causing some very minor TP hits for < 64 bytes of data. The code is effectively the V256 code path, but duplicated. We can reduce this duplication longer term with the |
src/libraries/System.Private.CoreLib/src/System/Numerics/BitOperations.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.T.cs
Outdated
Show resolved
Hide resolved
6a2ec3a
to
2f6889a
Compare
… Vector512Throttling issues
@stephentoub @tannergooding Bump. Anything else required from me here? |
@stephentoub @tannergooding Do you have any further requests or comments on this PR? E.g., any request for additional testing or performance analysis? Or is it ready to be merged? |
Nice wins thanks @DeepakRajendrakumaran ! |
What this PR does:
Span
implementation by enabling acceleration using AVX512 instructions when possible. This is achieved using. This is achieved by addingVector512
paths in relevantSpanHelper
implementations.Vector512.IsHardwareAccelerated
implementation so that it now returnsTRUE
only on targets where there is no throttling for AVX512(this is based on discussion with Tanner)Performance testing for the PR
Ran Microbenchmark( --filter "System.Memory.*") and compared with main branch : This was with following thresholds for ResultComparator on ICX( --threshold 5% --noise 1ns). I have re-run the ones showing regression locally and DO NOT see any which are consistently slower