-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch to iSimdVector and Align WidenAsciiToUtf16 #99982
Conversation
76ecff2
to
0772e00
Compare
@tannergooding @dotnet/avx512-contrib Can you please review this? |
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128_1.cs
Outdated
Show resolved
Hide resolved
d8bc357
to
a7bb34a
Compare
a7bb34a
to
39cb56a
Compare
39cb56a
to
39da22a
Compare
private static unsafe bool HasMatch<TVectorByte>(TVectorByte vector) | ||
where TVectorByte : unmanaged, ISimdVector<TVectorByte, byte> | ||
{ | ||
return !(vector & TVectorByte.Create((byte)0x80)).Equals(TVectorByte.Zero); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not (vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I basically ran into a weird issue where perf degraded significantly for specific cases on my ICX when I had '(vector & TVectorByte.Create((byte)0x80)) != TVectorByte.Zero' and the issue went away with what I have now. It was pretty consistent. I had some trouble narrowing down the exact why with VTune though.
I decided to go with the performant version for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was there a codegen difference between them? I'd expect them to generate the same code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a quick check on godbolt and they all look the same - https://godbolt.org/z/P87dPdTGa
Now I'm curious if it was just something off when I ran it locally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a strong preference for any of these patterns?
I can look into the why this is happening if it's important. For now, I'm just keeping the fast version
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding I have created an issue detailing this as discussed: #100493
Please let me know if there is anything else needed for this PR
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Text/Ascii.Utility.cs
Outdated
Show resolved
Hide resolved
if (!HasMatch<TVectorByte>(asciiVector)) | ||
{ | ||
(TVectorUShort utf16LowVector, TVectorUShort utf16HighVector) = Widen<TVectorByte, TVectorUShort>(asciiVector); | ||
utf16LowVector.Store(pCurrentWriteAddress); | ||
utf16HighVector.Store(pCurrentWriteAddress + TVectorUShort.Count); | ||
pCurrentWriteAddress += (nuint)(TVectorUShort.Count * 2); | ||
if (((int)pCurrentWriteAddress & 1) == 0) | ||
{ | ||
// Bump write buffer up to the next aligned boundary | ||
pCurrentWriteAddress = (ushort*)((nuint)pCurrentWriteAddress & ~(nuint)(TVectorUShort.Alignment - 1)); | ||
nuint numBytesWritten = (nuint)pCurrentWriteAddress - (nuint)pUtf16Buffer; | ||
currentOffset += (nuint)numBytesWritten / 2; | ||
} | ||
else | ||
{ | ||
// If input isn't char aligned, we won't be able to align it to a Vector | ||
currentOffset += (nuint)TVectorByte.Count; | ||
} | ||
while (currentOffset <= finalOffsetWhereCanRunLoop) | ||
{ | ||
asciiVector = TVectorByte.Load(pAsciiBuffer + currentOffset); | ||
if (HasMatch<TVectorByte>(asciiVector)) | ||
{ | ||
break; | ||
} | ||
(utf16LowVector, utf16HighVector) = Widen<TVectorByte, TVectorUShort>(asciiVector); | ||
utf16LowVector.StoreAligned(pCurrentWriteAddress); | ||
utf16HighVector.StoreAligned(pCurrentWriteAddress + TVectorUShort.Count); | ||
currentOffset += (nuint)TVectorByte.Count; | ||
pCurrentWriteAddress += (nuint)(TVectorUShort.Count * 2); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code here looks generally good, and I don't expect anything to be changed for this PR based on what I'm about to comment.
However, I would like to refer to how we set up everything for TensorPrimitives
as it works very well and allows a lot of code sharing (noting it isn't using ISimdVector
yet since it's out of band, but could easily do so in the future): https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,fdab74764af40a1e
In general it tries to do the pre-checks up front and only ever execute 1 vector path (so it doesn't have to fallthrough from Vector512->Vector256->Vector128 as remainders exist). To achieve this, it has the Vectorized###
helpers (which are all identical, except for the size they operate on, this is what would eventually use ISimdVector
) and then a shared VectorizedSmall
which is simply a jump table designed to handle any data that is less than a full vector using a single branch.
The core logic for the vectorized algorithm (https://source.dot.net/#System.Numerics.Tensors/System/Numerics/Tensors/netcore/Common/TensorPrimitives.IUnaryOperator.cs,ef9adce4e9561b04) then basically has a path to handle the main loop (which is currently unrolled by a factor of 8) and otherwise hits a jump table to handle remaining blocks (so they can likewise be handled with a single branch).
To help optimize, it preloads the beginning and ending vectors. In the worst case this will result in double processing of some inputs for very small sizes, but its ultimately only 2 main operations which is fine.
The main loop then attempts to align and has an optimized path for extremely large inputs for non-temporal data if alignment could be achieved. Smaller inputs just do regular unaligned stores since the actual address will have been aligned if that was feasible.
This general approach is done because it allows all paths, but particularly the smallest inputs, to minimize the total number of branches done (no more than 2 branches for non-vectorized data and no more 3 to hit the main loop for vectorized code). It also allows us to separate the "algorithm logic" from the "vectorization logic" and share that vectorization logic between multiple vectorized algorithms.
This general setup has worked so well and provided very stable perf numbers for all sizes, such that we opened #93217 as a means of investigating if we could make it more general purpose and public. Long term, it'd probably be desirable to move algorithms like this WidenAsciiToUtf16
to follow the same approach so that we get the best perf, with the least overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really useful. And something I can add on in future. Thanks for the detailed comment
825bf75
to
bdc5b3f
Compare
@@ -692,6 +692,11 @@ private string ToString([StringSyntax(StringSyntaxAttribute.NumericFormat)] stri | |||
// New Surface Area | |||
// | |||
|
|||
static bool ISimdVector<Vector128<T>, T>.AnyMatches(Vector128<T> vector) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did review/approve this (#98055 (comment)) and settled on bool Any(Vector128<T> vector, T value)
and bool AnyWhereAllBitsSet(Vector<T> vector)
It would be nice to fix this to follow that. The PR otherwise looks good and should be mergeable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, the way I understand is we do not need AnyMatches()
anymore
And Any
and AnyWhereAllBitsSet
would look something like follows
static bool ISimdVector<Vector512<T>, T>.AnyWhereAllBitsSet(Vector512<T> vector)
{
return (vector.EqualsAny(Vector512<T>.AllBitsSet));
}
static bool ISimdVector<Vector512<T>, T>.Any(Vector512<T> vector, T value)
{
return (vector.EqualsAny(Vector512.Create((T)value)));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, essentially, with the ability for the JIT to recognize these as intrinsic and optimize them more in appropriate scenarios (but that's not necessary for this PR)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done - took out AnyMatches()
and added AnyWhereAllBitsSet(
) and Any()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tannergooding Does this look correct? Anything else you'd like me to fix?
bdc5b3f
to
424c52a
Compare
@@ -692,6 +692,16 @@ private string ToString([StringSyntax(StringSyntaxAttribute.NumericFormat)] stri | |||
// New Surface Area | |||
// | |||
|
|||
static bool ISimdVector<Vector128<T>, T>.AnyWhereAllBitsSet(Vector128<T> vector) | |||
{ | |||
return (Vector128.EqualsAny(vector, Vector128<T>.AllBitsSet)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary parens here and in Any
If you could fix that in a follow up PR, that'd be great (going to merge this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! Will put it up later today.
Regressions |
Possible Mono regressions:
|
mono wasm aot regression dotnet/perf-autofiling-issues#32601 |
* Add AnyMatches() to iSimdVector interface * Switch to iSimdVector and Align WidenAsciiToUtf16. * Fixing perf * Addressing Review Comments. * Mirroring API change : dotnet#98055 (comment)
This is an updated version of this PR(#89892). It does the following
AnyMatches
' support for iSimdVectorWidenAsciiToUtf16
' implementationPerf Results
Ran the following tests(sizes : 16, 512, 1024, 5120, 10240) on EMR: (base = main branch, diff = with change)https://github.com/dotnet/performance/blob/47d21ee9571164a8e3f8088d8709ca4061d96827/src/benchmarks/micro/libraries/System.Text.Encoding/Perf.Encoding.cs
On EMR
On ICX - Not much diff