-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Light up core ASCII.Utility methods with Vector256/Vector512 code paths. #88532
Conversation
Tagging subscribers to this area: @dotnet/area-system-numerics Issue DetailsThis PR lights up some code path in For the We are open to adjusting the implementation style for either path. Perf numbers coming soon.
|
A rough summary of the results...
|
0e56a90
to
d1153c8
Compare
@dotnet/avx512-contrib can we get a review on this? I checked the failures which look like they are related to a CPUID test, I don't think the changes would impact? |
CpuId failures are known and have a PR up to resolve them #88623 |
/// Uses double instead of long to get a single instruction instead of storing temps on general porpose register (or stack) | ||
/// </remarks> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
internal static void StoreLowerUnsafe<T>(this Vector256<T> source, ref T destination, nuint elementOffset = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this instead of source.GetLower().StoreUnsafe(ref destination, elementOffset)
?
{ | ||
uint SizeOfVector512InBytes = (uint)Vector512<byte>.Count; // JIT will make this a const | ||
|
||
if (Unsafe.ReadUnaligned<Vector512<byte>>(pBuffer).ExtractMostSignificantBits() == 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not this instead?
if (Unsafe.ReadUnaligned<Vector512<byte>>(pBuffer).ExtractMostSignificantBits() == 0) | |
if (Vector512.Load(pBuffer).ExtractMostSignificantBits() == 0) |
|
||
if (Vector512.IsHardwareAccelerated && bufferLength >= 2 * (uint)Vector512<byte>.Count) | ||
{ | ||
uint SizeOfVector512InBytes = (uint)Vector512<byte>.Count; // JIT will make this a const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually have an internal Vector512.Size
that is a const for just these types of cases.
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
with Vector512 and Vector256 APIs.
579459b
to
ea913db
Compare
uint SizeOfVectorInChars = (uint)Vector<ushort>.Count; // JIT will make this a const | ||
uint SizeOfVectorInBytes = (uint)Vector<byte>.Count; // JIT will make this a const | ||
uint SizeOfVector512InChars = (uint)Vector512<ushort>.Count; // JIT will make this a const | ||
uint SizeOfVector512InBytes = (uint)Vector512.Size; // JIT will make this a const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should be able to use Vecto512.Size
directly and have C# keep it a const instead.
It will make things a bit more readable and allow C# to constant fold some of the simple cases (e.g. ~(nuint)(SizeOfVector512InBytes - 1)
itself will become constant foldable rather than forcing IL to emit ldloc; sub; conv.i; sub
and then requiring the JIT to optimize that down.
/// Uses double instead of long to get a single instruction instead of storing temps on general porpose register (or stack) | ||
/// </remarks> | ||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | ||
internal static void StoreLowerUnsafe<T>(this Vector512<T> source, ref T destination, nuint elementOffset = 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same general question here as on the other. Why this helper rather than simply doing source.GetLower().StoreUnsafe(ref destination, elementOffset)
It looks like the right stuff happens for the former already and it avoids the JIT needing to do any inlining for it.
|
||
if (Vector512.IsHardwareAccelerated && bufferLength >= 2 * (uint)Vector512<byte>.Count) | ||
{ | ||
uint SizeOfVector512InBytes = (uint)Vector512.Size; // JIT will make this a const |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here and other places. If we use Vector512.Size
directly, we get much simply and constant foldable IL, allowing the JIT to do less work.
Since Vector512.Size
is itself a constant, we shouldn't need to insert extra casts anywhere. But if we did and it was a non-trivial number, it would be better to declare this as const uint SizeOfVector512InBytes
instead so the same constant folding could still happen.
@@ -1407,6 +1867,20 @@ private static bool VectorContainsNonAsciiChar(Vector128<byte> asciiVector) | |||
} | |||
} | |||
|
|||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
private static bool VectorContainsNonAsciiChar(Vector256<byte> asciiVector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need wrapper methods for this simple case?
Is there any reason it can't just be asciiVector != Vector256<byte>.Zero
instead which allows ptest; jcc
rather than movmsk, cmp; jcc
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was doing this for code reuse/readability.
Would you prefer to just inline the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think inlining it manually in this case would be better.
The abstractions like this are generally helpful when we have more complex logic differing between platforms or when there is a high likelihood to need to identify and change the pattern for the same pattern repeatedly in the future.
In cases like this, where we're really just doing a trivial comparison check, I don't think the helper buys us much in terms of readability/maintainability and it does have drawbacks in the form of forcing the JIT to do more work to inline/optimize the code.
// Narrows two vectors of words [ w7 w6 w5 w4 w3 w2 w1 w0 ] and [ w7' w6' w5' w4' w3' w2' w1' w0' ] | ||
// to a vector of bytes [ b7 ... b0 b7' ... b0']. | ||
|
||
// prefer architecture specific intrinsic as they don't perform additional AND like Vector512.Narrow does |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is out of sync. We're using the non architecture specific Narrow below.
Likewise, given we're just calling the xplat narrow, do we need this helper method or can we just call Narrow directly and avoid the need to inline?
uint SizeOfVector256 = (uint)Vector256<byte>.Count; | ||
nuint MaskOfAllBitsInVector256 = (nuint)(SizeOfVector256 - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These can both be const
, using Vector256.Size
directly and declaring const nuint MaskOfAllBitsInVector256
for the latter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the equivalent case in Intrinsified_512
1. turn some variables into explicitly specified const. 2. removed some helper functions and inlined them.
Debug.Assert((nuint)pBuffer % Vector512.Size == 0, "Vector read should be aligned."); | ||
if (Vector512.LoadAligned(pBuffer).ExtractMostSignificantBits() != 0) | ||
{ | ||
break; // found non-ASCII data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're within the vectorized code path and see non-ASCII data, since we already have the return value of ExtractMostSignificantBits
in a register somewhere, I wonder if it would make sense to tzcnt the result and return immediately rather than falling down the drain code path.
Not important for this review, just a random musing.
|
||
do | ||
{ | ||
Debug.Assert((nuint)pBuffer % SizeOfVector512InChars == 0, "Vector read should be aligned."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should read:
Debug.Assert((nuint)pBuffer % SizeOfVector512InBytes == 0, "Vector read should be aligned.");
(Looks like this is a bug in the original GetIndexOfFirstNonAsciiChar_Default
method as well.)
{ | ||
const uint SizeOfVector512InChars = Vector512.Size / sizeof(ushort); | ||
|
||
Vector512<ushort> asciiMask = Vector512.Create((ushort) 0xFF80); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused local?
// We're going to get the best performance when we have aligned writes, so we'll take the | ||
// hit of potentially unaligned reads in order to hit this sweet spot. | ||
|
||
// pAsciiBuffer points to the start of the destination buffer, immediately before where we wrote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Para should be updated: 0x10
bit, &pAsciiBuffer[SizeOfVector256 / 2]
, 16-byte write.
Looking back over the original comment this was copied from, I realize now what I originally wrote was word salad. 🙂 My comment wasn't intended to refer to the value stored at the referenced address, but rather the address itself. Basically, if you're trying to become 16-byte aligned, then one of the following must hold: (a) you can from your current position back up 7 or fewer bytes to achieve 16-byte alignment; or (b) you can write 8 bytes, bump the pointer, then back up 7 or fewer bytes to achieve 16-byte alignment.
For this method, since you're trying to become 32-byte aligned, those clauses become "15 or fewer bytes" and "you can write 16 bytes."
goto Finish; | ||
} | ||
|
||
// Turn the 32 ASCII chars we just read into 32 ASCII bytes, then copy it to the destination. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
16, not 32.
|
||
// First part was all ASCII, narrow and aligned write. Note we're only filling in the low half of the vector. | ||
|
||
Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned."); | |
Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % Vector128.Size == 0, "Destination should be 128-bit-aligned."); |
// We're going to get the best performance when we have aligned writes, so we'll take the | ||
// hit of potentially unaligned reads in order to hit this sweet spot. | ||
|
||
// pAsciiBuffer points to the start of the destination buffer, immediately before where we wrote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same feedback as earlier: fix up comments.
|
||
// First part was all ASCII, narrow and aligned write. Note we're only filling in the low half of the vector. | ||
|
||
Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % sizeof(ulong) == 0, "Destination should be ulong-aligned."); | |
Debug.Assert(((nuint)pAsciiBuffer + currentOffsetInElements) % Vector256.Size == 0, "Destination should be 256-bit-aligned."); |
break; | ||
} | ||
|
||
(Vector512<ushort> low, Vector512<ushort> upper) = Vector512.Widen(asciiVector); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: keep the same local variable names as the other blocks in this method.
CC. @GrabYourPitchforks, looks like all your feedback has been addressed. |
if (Vector512.IsHardwareAccelerated || Vector256.IsHardwareAccelerated) | ||
{ | ||
return GetIndexOfFirstNonAsciiByte_Vector(pBuffer, bufferLength); | ||
} | ||
else if (Sse2.IsSupported || (AdvSimd.IsSupported && BitConverter.IsLittleEndian)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GetIndexOfFirstNonAsciiByte_Vector has a Vector128.IsHardwareAccelerated code path. We can't just rely on that and delete GetIndexOfFirstNonAsciiByte_Intrinsified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Vector128
fallback path is slower than the (more complex) Intrinsified path (see #88532 (comment)).
This PR lights up some code path in
ASCII.Utility
with Vector256/Vector512 code, namely,NarrowUtf16ToAscii
,WidenAsciiToUtf16
,GetIndexOfFirstNonAsciiChar
, andGetIndexOfFirstNonAsciiByte
.For the
GetIndexOfMethods
, we have implemented the simpler, existing "default" code path but with the explicty VectorXX apis; for theNarrow
/Widen
methods, we have implemented the more complex SSE/Vector256 path but with the Vector256/Vector512 APIs. Right now, both are a slight tradeoff in terms of code complexity/performance.We are open to adjusting the implementation style for either path. Perf numbers coming soon.
Perf
Please see the next section for the raw data collected from the
utf8
case forSystem.Text.Tests.Perf_Encoding
. I have formatted it a bit here to make it slightly easier to draw conclusions. Essentially, I have run with a base, run withEnable_AVX512F=1
for thevector512
path, and withEnable_AVX512F=0
for thevector256
path.The micro is run with...
I have added some additional data sizes and an additional micro
GetCharCount
which simply invokesenc.GetCharCount
analogous toGetByteCount
. I call out some speedup with green, and some slowdown with orange.GetBytes
GetChars
GetByteCount
GetCharCount
Raw Results from Two Runs
With
DOTNET_AVX512F=1
With
DOTNET_EnableAVX512F=0