-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Double IndexOf throughput for chars #78861
Conversation
Tagging subscribers to this area: @dotnet/area-system-memory Issue DetailsWhen searching through strings, it's very common to have single-byte values (think ASCII). The In this POC PR, I added implementations that do such packing for If we're happy with the direction, I can add implementations for Benchmark numbers
|
cc: @EgorBo @stephentoub |
1318d5b
to
34ee7e0
Compare
Any thoughts on this approach @dotnet/area-system-memory? |
@MihaZupan Sorry for the delay, I'll take a look at this early next week. |
@MihaZupan I'm new to this area and I think I'm missing some context to properly review this. Can you give me a high-level summary of the intentions behind the changes in each file? |
Sure. The main idea behind this change is the observation that the char values we commonly search for are ASCII, in which case half of their UTF16 representation will always be 0. When doing vectorized searches, this means we're mostly ignoring half of the input and half of the result of each comparison.
|
To confirm, this is because any value that needs the full char to be represented will saturate to 255 when narrowed to a byte, correct? |
That's right. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Left a few clarifying questions.
For testing, I assume all these paths already had sufficient coverage, right?
@@ -558,7 +558,7 @@ public static ReadOnlyMemory<char> AsMemory(this string? text, Range range) | |||
[MethodImpl(MethodImplOptions.AggressiveInlining)] | |||
public static unsafe int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, T value) where T : IEquatable<T>? | |||
{ | |||
if (SpanHelpers.CanVectorizeAndBenefit<T>(span.Length)) | |||
if (RuntimeHelpers.IsBitwiseEquatable<T>()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this update?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamsitnik I see you added this in #73768, can you please clarify what the intent was?
As far as I can tell, this is just adding a redundant length check given that all the SpanHelpers
implementations we're calling into also do the length check and have code for handling short inputs. Am I missing something (I didn't see any discussion about this on your PR)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MihaZupan, @adamsitnik, was this ever answered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe so. Is the changed version causing issues somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but it sounded like there might be extra work happening unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that was the case before this change (we would inline a length check and 2 calls to the worker methods), now it should just be a call to the 1 worker method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, I misunderstood the comment
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.Packed.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.Packed.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.Packed.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.Packed.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Char.Packed.cs
Outdated
Show resolved
Hide resolved
Yes. I'll double-check that we don't accidentally end up losing substantial coverage of the existing (NonPacked) code paths if we're mostly testing with ASCII values.
I'll add the |
34ee7e0
to
15afc4c
Compare
Added Updated perf numbers
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAny2CharValues.cs
Show resolved
Hide resolved
/azp run runtime-libraries-coreclr outerloop |
Azure Pipelines successfully started running 1 pipeline(s). |
It appears that the packing is noticeably more expensive on ARM in comparison. ARM64 benchmarks (no match)
ARM64 benchmarks (the first character matches)
|
15afc4c
to
a59cff5
Compare
@tannergooding any concerns about using this sort of approach only on X86, given that it's not profitable on ARM? |
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Packed.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Packed.cs
Outdated
Show resolved
Hide resolved
All failures are known according to build-analysis |
Improvements on arm64:
Improvements on x64: |
This should avoids the size regression on WebAssembly and possibly other platforms without Sse2. The regression is side effect of dotnet#78861 which uses `PackedSpanHelpers.CanUsePackedIndexOf (!!T)` and TShouldUsePacked.Value to guard the usage of PackedSpanHelpers. Because these involve generics, illinker is unable to link the PackedSpanHelpers type away and that pulls other parts in, like System.Runtime.Intrinsics.X86.* types. See https://gist.github.com/radekdoulik/c0b52247d472f69bcf983ade78a924ea for more complete list. This change gets us back 9,216 bytes in the case of app used to repro the regression. ... - Type System.PackedSpanHelpers - Type System.Runtime.Intrinsics.X86.X86Base - Type System.Runtime.Intrinsics.X86.Sse - Type System.Runtime.Intrinsics.X86.Sse2 Summary: - 9,216 File size -0.76% (of 1,215,488) - 2,744 Metadata size -0.43% (of 636,264) - 4 Types count
* Use PackedIndexOfIsSupported checks in more places This should avoids the size regression on WebAssembly and possibly other platforms without Sse2. The regression is side effect of #78861 which uses `PackedSpanHelpers.CanUsePackedIndexOf (!!T)` and TShouldUsePacked.Value to guard the usage of PackedSpanHelpers. Because these involve generics, illinker is unable to link the PackedSpanHelpers type away and that pulls other parts in, like System.Runtime.Intrinsics.X86.* types. See https://gist.github.com/radekdoulik/c0b52247d472f69bcf983ade78a924ea for more complete list. This change gets us back 9,216 bytes in the case of app used to repro the regression. ... - Type System.PackedSpanHelpers - Type System.Runtime.Intrinsics.X86.X86Base - Type System.Runtime.Intrinsics.X86.Sse - Type System.Runtime.Intrinsics.X86.Sse2 Summary: - 9,216 File size -0.76% (of 1,215,488) - 2,744 Metadata size -0.43% (of 636,264) - 4 Types count * Update src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyValues.cs Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com> * Update src/libraries/System.Private.CoreLib/src/System/IndexOfAnyValues/IndexOfAnyValues.cs Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com> * Feedback Co-authored-by: Miha Zupan <mihazupan.zupan1@gmail.com>
When searching through strings, it's very common to have single-byte values (think ASCII).
As long as the value falls within an appropriate range (
[1, 254]
on X86 or[0, 254]
on ARM), we can speed up the search by packing two input vectors together before comparing the value.The
IndexOfAnyAsciiSearcher
implementation I added in #78093 is already using this trick, but it applies to regularIndexOf
as well.In this PR, I added implementations that do such packing for
Contains(char)
,IndexOf(char)
,IndexOfAny(char, char)
,IndexOfAny(char, char, char)
, andIndexOfAnyInRange(char, char)
, roughly doubling the throughput for long inputs.Do we want to do the same for the
Last-
variants as well?I don't think specialized
IndexOf(4/5 values)
would be useful. For ASCII values, usingIndexOfAnyValues
is already very close in throughput (and things like Regex will use that).Benchmark numbers
This is generally a slight regression if a match is found at the start
If the first character matches