-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use intrinsics for SequenceEqual<byte> vectorization to emit at R2R #32371
Conversation
fd38233
to
7a7e391
Compare
Redoing this on top of @ahsonkhan's change #32364 as that outperformed this in various areas |
@benaadams, can you run the following benchmark with what's in master (with my recent change) vs. what's in this PR to measure/validate the small buffer perf? I am asking because I noticed not using the actually built [BenchmarkCategory(Categories.CoreFX, Categories.JSON)]
[DisassemblyDiagnoser(printPrologAndEpilog: true, recursiveDepth: 5)]
public class SequenceEqualThreshold
{
private byte[] _input;
private byte[] _expected;
[Params(0, 1, 2, 3, 4)]
public int Length;
[GlobalSetup]
public void Setup()
{
var builder = new StringBuilder();
for (int i = 0; i < Length; i++)
{
builder.Append("a");
}
string input = builder.ToString();
_input = Encoding.UTF8.GetBytes(input);
_expected = _input;
_expected = Encoding.UTF8.GetBytes(input);
Console.WriteLine(typeof(Span<byte>).AssemblyQualifiedName);
Console.WriteLine(typeof(Span<byte>).Assembly.Location);
}
[Benchmark]
public bool SequenceEqual()
{
return _input.AsSpan().SequenceEqual(_expected);
}
} This is what I did here: #32363 Using the dotnet/performance repo (see https://github.com/dotnet/performance/blob/ca80d8e2886b583d0a69635740b188248d3d6fdd/src/benchmarks/micro/README.md#private-runtime-builds):
Here are the dlls I copy/override: If you have another, easier way to do it, please do that (and share) :) I probably made things more complicated than needed, so there gotta be a better way to do the perf measurements to speed up inner dev loop. Btw, @adamsitnik - the workflow instructions need to be updated. The testhost\corerun folder doesn't contain the latest built Also, we may want to see whether removing multiple return statements in the main public method helps (also apparently, the if-branch is the special case, so putting the common code in the else branch or outside the if might be better for perf too, so inverted the condition). Maybe you can find ways to optimize that as well in different ways :) [MethodImpl(MethodImplOptions.AggressiveInlining)]
public static bool SequenceEqual<T>(this Span<T> span, ReadOnlySpan<T> other) where T : IEquatable<T>
{
int length = span.Length;
bool result = length == other.Length;
if (!RuntimeHelpers.IsBitwiseEquatable<T>())
{
result = result &&
SpanHelpers.SequenceEqual(ref MemoryMarshal.GetReference(span), ref MemoryMarshal.GetReference(other), length);
}
else
{
nuint size = (nuint)Unsafe.SizeOf<T>();
result = result &&
SpanHelpers.SequenceEqual(
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(span)),
ref Unsafe.As<T, byte>(ref MemoryMarshal.GetReference(other)),
((nuint)length) * size); // If this multiplication overflows, the Span we got overflows the entire address range. There's no happy outcome for this api in such a case so we choose not to take the overhead of checking.
}
return result;
} |
{ | ||
while (offset < lengthToExamine) | ||
{ | ||
if (Unsafe.Add(ref first, (IntPtr)offset) != (Unsafe.Add(ref second, (IntPtr)offset))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nint
-> IntPtr
casts are a performance trap. I believe that it will go to 64-bit long first on 32-platforms, and the 64-bit long then gets down-casted using checked cast to 32-bit again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uses the explicit operator IntPtr(int value)
-> IntPtr(int value)
so should be ok? (rather than nuint
which would go via long
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI what Jan said is the reason the UTF-8 transcoding logic uses void*
as an intermediary when converting between IntPtr and (whatever integral type).
runtime/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf8Utility.Transcoding.cs
Line 114 in a42c583
uint remainingInputBytes = (uint)(void*)Unsafe.ByteOffset(ref *pInputBuffer, ref *pFinalPosWhereCanReadDWordFromInputBuffer) + 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. This should be fine. I thought there is unsigned/signed conversion too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But as @GrabYourPitchforks noted it is very easy to miss the cases where it is not fine. We had number of 32-bit specific perf bugs because of that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I previously had an issue because pointers are unsigned so my less than zero tests always went the wrong way, which I hadn't expected :(
@ahsonkhan I was using local copies of Doing it this way for a couple reasons.
public bool SequenceEqual()
{
return _input.AsSpan().SequenceEqual(_expected);
}
Should branch eliminate to only 1 return? |
I assume so since the check is an intrinsic. How can we test/verify that is indeed the case? Is there a way to observe that in the disassembly? I will re-run the benchmark tomorrow to verify that it has no perf impact. |
06a3478
to
99ca277
Compare
Span passing costs seem very high? e.g. passing 2 Spans from one method to another is more expensive than comparing the whole 4096 byte spans for equality? (Windows)
// Passing as params
[Benchmark]
[MethodImpl(MethodImplOptions.NoInlining)]
public bool UseSequenceEqualPR()
{
return SequenceEqualPR(_input.Span, _expected.Span);
}
[MethodImpl(MethodImplOptions.NoInlining)]
private static bool SequenceEqualPR(Span<byte> input, Span<byte> expected)
{
return input.Length == expected.Length && SequenceEqualPR(
ref MemoryMarshal.GetReference(input),
ref MemoryMarshal.GetReference(expected),
(nuint)input.Length);
}
// Using direct
[Benchmark]
[MethodImpl(MethodImplOptions.NoInlining)]
public bool UseSequenceEqualPRDirect()
{
return SequenceEqualPRDirect();
}
[MethodImpl(MethodImplOptions.NoInlining)]
private bool SequenceEqualPRDirect()
{
Span<byte> input = _input.Span;
Span<byte> expected = _expected.Span;
return input.Length == expected.Length && SequenceEqualPR(
ref MemoryMarshal.GetReference(input),
ref MemoryMarshal.GetReference(expected),
(nuint)input.Length);
} |
Updated the benchmark to show the Span passing cost https://gist.github.com/benaadams/bf85405a5eae4c750cf6470a5506fd8d can make the |
Raised issue for the |
a03a876
to
2c19a63
Compare
cfd2b32
to
822cdc3
Compare
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
822cdc3
to
8dfa95c
Compare
@@ -1309,85 +1311,191 @@ public static unsafe int LastIndexOfAny(ref byte searchSpace, byte value0, byte | |||
|
|||
// Optimized byte-based SequenceEquals. The "length" parameter for this one is declared a nuint rather than int as we also use it for types other than byte | |||
// where the length can exceed 2Gb once scaled by sizeof(T). | |||
[MethodImpl(MethodImplOptions.AggressiveOptimization)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it now has Sse2 intrinisics, removed the AggressiveOptimization
which prevents them from being emitted at R2R.
Pure Vector<T>
methods are blocked from R2R so need AggressiveOptimization
to bypass the inline restrictions at Tier0.
Note this is a regression on Arm as it will run Tier0 code; however I couldn't find a #if
to put it behind; and it should get picked up by #33308
R2R version ; Assembly listing for method SpanHelpers:SequenceEqual(byref,byref,long):bool
; Emitting BLENDED_CODE for X64 CPU with SSE2 - Windows
; ReadyToRun compilation
; optimized code
; rsp based frame
; fully interruptible
; Final local variable assignments
;
; V00 arg0 [V00,T01] ( 11, 10 ) byref -> rcx
; ...
;* V47 tmp27 [V47 ] ( 0, 0 ) byref -> zero-ref "Inlining Arg"
;
; Lcl frame size = 0
G_M37173_IG01:
;; bbWeight=1 PerfScore 0.00
G_M37173_IG02:
cmp r8, 8
jae SHORT G_M37173_IG07
;; bbWeight=1 PerfScore 1.25
G_M37173_IG03:
cmp r8, 4
jae SHORT G_M37173_IG06
xor eax, eax
mov r9, r8
and r9, 2
test r9, r9
je SHORT G_M37173_IG04
movzx rax, word ptr [rcx]
movzx r10, word ptr [rdx]
sub eax, r10d
;; bbWeight=0.50 PerfScore 3.75
G_M37173_IG04:
test r8b, 1
je SHORT G_M37173_IG05
movzx rcx, byte ptr [rcx+r9]
movzx rdx, byte ptr [rdx+r9]
sub ecx, edx
or ecx, eax
mov eax, ecx
;; bbWeight=0.50 PerfScore 3.00
G_M37173_IG05:
test eax, eax
sete al
movzx rax, al
jmp SHORT G_M37173_IG08
;; bbWeight=0.50 PerfScore 1.75
G_M37173_IG06:
add r8, -4
mov eax, dword ptr [rcx]
sub eax, dword ptr [rdx]
mov ecx, dword ptr [rcx+r8]
sub ecx, dword ptr [rdx+r8]
or eax, ecx
test eax, eax
sete al
movzx rax, al
jmp SHORT G_M37173_IG08
;; bbWeight=0.50 PerfScore 6.00
G_M37173_IG07:
cmp rcx, rdx
je SHORT G_M37173_IG09
jmp SHORT G_M37173_IG11
;; bbWeight=0.50 PerfScore 1.63
G_M37173_IG08:
ret
;; bbWeight=0.50 PerfScore 0.50
G_M37173_IG09:
mov eax, 1
;; bbWeight=0.50 PerfScore 0.13
G_M37173_IG10:
ret
;; bbWeight=0.50 PerfScore 0.50
G_M37173_IG11:
cmp r8, 16
jb SHORT G_M37173_IG14
xor rax, rax
add r8, -16
test r8, r8
je SHORT G_M37173_IG13
;; bbWeight=0.50 PerfScore 1.50
G_M37173_IG12:
movups xmm0, xmmword ptr [rcx+rax]
movups xmm1, xmmword ptr [rdx+rax]
pcmpeqb xmm0, xmm1
pmovmskb r9d, xmm0
cmp r9d, 0xFFFF
jne SHORT G_M37173_IG15
add rax, 16
cmp r8, rax
ja SHORT G_M37173_IG12
;; bbWeight=4 PerfScore 49.00
G_M37173_IG13:
movups xmm0, xmmword ptr [rcx+r8]
movups xmm1, xmmword ptr [rdx+r8]
pcmpeqb xmm0, xmm1
pmovmskb ecx, xmm0
cmp ecx, 0xFFFF
jne SHORT G_M37173_IG15
jmp SHORT G_M37173_IG09
;; bbWeight=0.50 PerfScore 6.38
G_M37173_IG14:
lea rax, [r8-8]
mov r8, qword ptr [rcx]
sub r8, qword ptr [rdx]
mov rcx, qword ptr [rcx+rax]
sub rcx, qword ptr [rdx+rax]
or r8, rcx
test r8, r8
sete al
movzx rax, al
jmp SHORT G_M37173_IG08
;; bbWeight=0.50 PerfScore 6.13
G_M37173_IG15:
xor eax, eax
;; bbWeight=0.50 PerfScore 0.13
G_M37173_IG16:
ret
;; bbWeight=0.50 PerfScore 0.50
; Total bytes of code 225, prolog size 0, PerfScore 104.63, (MethodHash=63986eca) for method SpanHelpers:SequenceEqual(byref,byref,long):bool
; ============================================================ |
858ef64
to
8111936
Compare
Please excuse me for the late response. Both the benchmarking and profiling docs have been updated some time ago and now they are up to date: https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md Please let me know if something does not work as expected. |
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/SpanHelpers.Byte.cs
Outdated
Show resolved
Hide resolved
9ecbc0f
to
6b3970a
Compare
/cc @GrabYourPitchforks any more to do here? |
@tannergooding do you have time to help get this reviewed? I know @GrabYourPitchforks is fully occupied with something critical. At least one other PR is blocked on this one. |
@danmosemsft did you mean to comment on a different PR? This one is merged. |
Doh. My goal was to unblock @benadams |
Need to minimise code churn/merge clashes between the PRs, have done a cleanup PR to make it easier #35765 |
Bit more stable across sizes; but mostly similar. The biggest win I'd highlight is that it's now emitted at R2R rather than always requiring JIT.
gist Benchamark+Results
Resolves #32363
/cc @ahsonkhan