OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613
Comments
It could be very beneficial for headers parsing in web servers. |
Somewhat related to #11484. If the JIT has knowledge that one of the parameters is constant, it can tell us to go down a separate code path, or it can even redirect the call to a different method. |
@GrabYourPitchforks Nice idea, it reminds me of this LLVM intrinsic: https://llvm.org/docs/LangRef.html#llvm-is-constant-intrinsic
Thanks to #37226 and #1378, the following code is computed as a constant! 🎉
public ulong Test() => ConstStringTo64BitInt("True");

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong ConstStringTo64BitInt(string str)
{
    Debug.Assert(str.Length <= 4);
    ulong v = 0;
    if (str.Length == 4) v |= ((ulong)str[3] << 48);
    if (str.Length >= 3) v |= ((ulong)str[2] << 32);
    if (str.Length >= 2) v |= ((ulong)str[1] << 16);
    if (str.Length >= 1) v |= ((ulong)str[0]);
    if (str.Length == 0) return 0;
    return v;
}
Codegen for
48B85400720075006500 mov rax, 0x65007500720054
C3 ret
A sort of |
There are many APIs that can be replaced with a more efficient, call-site-specific equivalent when the input has a particular shape. Would source generators be a better way to handle that? It would be a much more powerful solution than IsConstAsciiString. |
Does that mean the optimized path in the language-agnostic BCL could only be used from C#? |
I think at least for the
string.Equals(str, "cns", StringComparison.OrdinalIgnoreCase); // Also, for just `Ordinal`
for
Inspired by LLVM: https://godbolt.org/z/z3MKa9
UPD: example:
public static bool Test(string str)
{
    return string.Equals(str, "true", StringComparison.Ordinal);
}
is optimized into:
public static bool Test(string str)
{
    return object.ReferenceEquals(str, "true") || // <-- perhaps we can omit this one since the comparison is fast anyway
        (str != null &&
         str.Length == 4 &&
         Unsafe.ReadUnaligned<ulong>(
             ref Unsafe.As<char, byte>(ref str._firstChar)) == 0x65007500720054);
}
Codegen:
48B8B8310090AA010000 mov rax, 0x1AA900031B8
483B08 cmp rcx, gword ptr [rax]
7426 je SHORT G_M59561_IG07
4885C9 test rcx, rcx
741E je SHORT G_M59561_IG05
83790804 cmp dword ptr [rcx+8], 4
7518 jne SHORT G_M59561_IG05
4883C10C add rcx, 12
48B85400720075006500 mov rax, 0x65007500720054
483901 cmp qword ptr [rcx], rax
0F94C0 sete al
0FB6C0 movzx rax, al
C3 ret
G_M59561_IG05:
33C0 xor eax, eax
C3 ret
G_M59561_IG07:
B801000000 mov eax, 1
C3 ret
; Total bytes of code 59
Without
4885C9 test rcx, rcx
741E je SHORT G_M59561_IG05
83790804 cmp dword ptr [rcx+8], 4
7518 jne SHORT G_M59561_IG05
4883C10C add rcx, 12
48B85400720075006500 mov rax, 0x65007500720054
483901 cmp qword ptr [rcx], rax
0F94C0 sete al
0FB6C0 movzx rax, al
C3 ret
G_M59561_IG05:
33C0 xor eax, eax
C3 ret
; Total bytes of code 38
master:
4883EC28 sub rsp, 40
48BAF831FFAE39010000 mov rdx, 0x139AEFF31F8
488B12 mov rdx, gword ptr [rdx]
41B805000000 mov r8d, 5
E81CA1FDFF call System.String:Equals(System.String,System.String,int):bool
90 nop
4883C428 add rsp, 40
C3 ret
; Total bytes of code: 34 |
Prototype for ^ in JIT: EgorBo@5883362 |
Could you please share links to the webserver code where you think this would be very beneficial?
I think it makes sense to start with the simpler memcpy-like and memcmp-like cases before going to more esoteric cases like globalization. Also, we need to maintain parity between strings and Spans (the same optimization has to be implemented for both). We have been promoting Spans as the high-performance solution. We do not want to be in a situation where strings have extra optimizations that are missing for Spans. |
I work on the Azure Gateway team (a reverse proxy for everything in Azure), and our header parsing code takes 3.5% of CPU time.
private static bool ShouldIgnore(string headerName)
{
    switch (headerName.Length)
    {
        case 13:
            return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
        case 17:
            return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
        case 22:
            return
                headerName.Equals("foo", StringComparison.OrdinalIgnoreCase) ||
                headerName.Equals("foo", StringComparison.OrdinalIgnoreCase) ||
                headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
        case 25:
            return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
        case 28:
            return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
        default:
            return false;
    }
}
I'm pretty sure the ASP.NET team would also be pleased to optimize header parsing. Example2 (this is basically codegen) |
The first customer of such a source-gen could be System.Text.Json's property matching. I use pretty much the same approach (without AVX and only ordinal comparison, not case-insensitive) in SpanJson, and up to ~32 bytes it's a lot faster than string.Equals. |
Yep, this can make naively written parsers better. There are other tricks that high-performance parsers use as the ASP.NET example shows and that this won't handle. I doubt that ASP.NET will want to replace their fine-tuned parser even if this optimization gets implemented. |
...and a lot of fast parsers operate on UTF-8 or ASCII bytes and try to avoid creating strings as much as possible. Maybe also consider Equals with ROS in this (general) deliberation. |
memcpy/memcmp primitives are exposed via a number of different methods: CopyTo, TryCopyTo, Equals, SequenceEqual, StartsWith, ... .
Example for memcpy:
This would be more naturally written as
Example for memcmp: runtime/src/libraries/System.Net.Http/src/System/Net/Http/Headers/AltSvcHeaderParser.cs Line 144 in f9a8abc
I agree with the idea of enabling people to write natural code. I would add that the optimizations in this space need to work well with our other best practices for writing high-performance code, which means Spans. Optimizations that only kick in for code that is likely slow for other reasons (e.g. it allocates a lot of strings) won't move the needle. |
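The parity point above could be illustrated with a small sketch: the same ordinal checks written once against string and once against ReadOnlySpan<char>. The HeaderChecks class and its method names are hypothetical, chosen only for illustration; the calls themselves (string.Equals, MemoryExtensions.Equals and MemoryExtensions.StartsWith) are existing BCL APIs. Any constant-string optimization would presumably need to recognize both shapes.

```csharp
using System;

// Hypothetical helper type; not part of the runtime or of any proposal in this thread.
static class HeaderChecks
{
    // String-based shape: the pattern the issue title is about.
    public static bool IsProxyAuthenticateString(string name) =>
        string.Equals(name, "Proxy-Authenticate", StringComparison.OrdinalIgnoreCase);

    // Span-based shape: same semantics, no string required for the input.
    // Resolves to MemoryExtensions.Equals(ReadOnlySpan<char>, ReadOnlySpan<char>, StringComparison).
    public static bool IsProxyAuthenticateSpan(ReadOnlySpan<char> name) =>
        name.Equals("Proxy-Authenticate", StringComparison.OrdinalIgnoreCase);

    // memcmp-like prefix check on a span (ordinal, case-sensitive).
    public static bool StartsWithTest(ReadOnlySpan<char> span) =>
        span.StartsWith("test");
}
```

Both IsProxyAuthenticate variants should return the same result for the same characters; only the input type differs.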
@jkotas Thanks for the examples!
static bool Test(ReadOnlySpan<char> span)
{
    return span.StartsWith("test");
}
Emits a freaking forest of GenTree 😢 https://gist.github.com/EgorBo/7dd7537b8a8cb4341b47bc9499f60e7a
So the only way is to intrinsify each of them separately, I guess. Or rely on that source-gen approach. |
Or apply the optimization later in the pipeline once this forest of GenTrees gets simplified. |
Still contains a lot of temps - https://gist.github.com/EgorBo/c6eb8edd8de19934506a3a9f3c859f46
ref char p = ref MemoryMarshal.GetReference("hello".AsSpan());
// after inlining should be optimized into just
ref char p = ref "hello"._firstChar;
Perhaps spans should have some special operators in the JIT's IR or something like that. |
Ah, I didn't notice it was assigned |
Don't worry, my task was just to do a further triage, I have not started the implementation :-) |
It's quite common to compare a string against a predefined one using an Ordinal/OrdinalIgnoreCase comparison (I found plenty of cases in the dotnet/aspnetcore and dotnet/runtime repos). E.g. patterns like:
So Roslyn/ILLink/JIT (perhaps it should be the JIT, since it can handle more cases after inlining, and Roslyn/ILLink know nothing about the target arch) could optimize such comparisons by inlining a more optimized check. E.g. here is what we can do for small strings (1-4 chars), keeping in mind that strings are 8-byte aligned (for simplicity I only cared about the AMD64 arch):
We can also inline SIMD code for longer strings, e.g. here I check that an HTTP header is "Proxy-Authenticate" (18 chars) using two AVX2 vectors: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44
When the input string is, let's say, 30 bytes, the results are even better:
So for [0..32] chars (string.Length) we can emit an inlined super-fast comparison:
[0..4]: using a single 64bit GP register
[5..8]: using two 64bit GP registers
[9..16]: using two 128bit vectors
[17..32]: using two 256bit vectors
[33...]: leave as is.
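As a rough illustration of the first bucket in the list above, here is a minimal sketch of an OrdinalIgnoreCase check against the 4-char constant "true" using a single 64-bit load. It is not the JIT prototype; the ConstAsciiCompare class and EqualsTrueIgnoreCase method are hypothetical names. The sketch assumes a little-endian target, a constant made only of ASCII letters (so OR-ing 0x20 into every 16-bit lane folds upper case to lower case), and that no non-ASCII char compares equal to an ASCII letter under OrdinalIgnoreCase.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper; a sketch of the idea, not the actual JIT transformation.
static class ConstAsciiCompare
{
    // "true" as four UTF-16 chars packed little-endian: 't'=0x74, 'r'=0x72, 'u'=0x75, 'e'=0x65.
    private const ulong LowerTrue = 0x0065_0075_0072_0074;

    // Bit 5 (0x20) in each 16-bit lane is the ASCII upper/lower case bit.
    private const ulong CaseMask = 0x0020_0020_0020_0020;

    public static bool EqualsTrueIgnoreCase(string str)
    {
        // The length check guarantees that exactly 8 bytes of char data are available.
        if (str is null || str.Length != 4)
            return false;

        // One 64-bit load of the four chars (native byte order; little-endian is assumed).
        ulong bits = MemoryMarshal.Read<ulong>(MemoryMarshal.AsBytes(str.AsSpan()));

        // Setting the case bit maps 'A'..'Z' onto 'a'..'z'. Because every char of the
        // constant is a lower-case ASCII letter, each lane can only match that letter
        // or its upper-case form.
        return (bits | CaseMask) == LowerTrue;
    }
}
```

For constants containing non-letter characters such as '-' or digits, the case bit would have to be left out of the mask for those lanes; otherwise a different character could fold onto the constant and produce a false positive.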
/cc @GrabYourPitchforks @benaadams @jkotas @stephentoub