OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

EgorBo · 2020-12-04T21:33:47Z

It's a quite popular case when we compare a string against some predefined one using Ordinal/OrdinalIgnoreCase case (I found plenty of cases in dotnet/aspnetcore and dotnet/runtime repos). E.g. patterns like:

* string.Equals(str, "cns", StringComparison.OrdinalIgnoreCase); // or just Ordinal

* str.Equals("cns", StringComparison.OrdinalIgnoreCase); // or just Ordinal

* StringComparer.OrdinalIgnoreCase.Equals(str, "cns"); // or just Ordinal

So Roslyn/ILLink/JIT (perhaps, it should be JIT to handle more cases after inlining + Roslyn/ILLink know nothing about target arch) could optimize such comparisons by inlining a more optimized check, e.g. here is what we can do for small strings (1-4 chars) keeping in mind that strings are 8-bytes aligned (for simplicity I only cared about AMD64 arch):

[Benchmark(Baseline = true)]
[Arguments("true")]
[Arguments("TRUE")]
public bool StringEquals(string str)
{
    return string.Equals(str, "True", StringComparison.OrdinalIgnoreCase);
}

[Benchmark]
[Arguments("true")]
[Arguments("TRUE")]
public bool Inlined(string str)
{
    return object.ReferenceEquals(str, "True") || 
           (str.Length == 4 &&

            // string's content fits into 64bit register, the following code looks awful
            // but it's a simple 'or + cmp' in the codegen
            (Unsafe.ReadUnaligned<ulong>(ref Unsafe.As<char, byte>(
                   ref MemoryMarshal.GetReference(str.AsSpan()))) | 0x0020002000200020 /*'to upper' since the const arg doesn't contain anything but [a..Z]*/) == 0x65007500720074);
}

|       Method |  str |      Mean |     Error |    StdDev | Ratio |
|------------- |----- |----------:|----------:|----------:|------:|
| StringEquals | TRUE | 4.1723 ns | 0.0023 ns | 0.0020 ns |  1.00 |
|      Inlined | TRUE | 0.3416 ns | 0.0024 ns | 0.0022 ns |  0.08 |
|              |      |           |           |           |       |
| StringEquals | true | 4.1718 ns | 0.0023 ns | 0.0021 ns |  1.00 |
|      Inlined | true | 0.3406 ns | 0.0019 ns | 0.0016 ns |  0.08 |

We also can inline SIMD stuff for longer strings, e.g. here I check that an http header is "Proxy-Authenticate" (18 chars) using two AVX2 vectors: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44

|           Method |         headerName |      Mean |     Error |    StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------- |------------------- |----------:|----------:|----------:|------:|------:|------:|------:|----------:|
|     StringEqauls | PROXY-AUTHENTICATE | 10.296 ns | 0.0087 ns | 0.0078 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | PROXY-AUTHENTICATE |  2.560 ns | 0.0042 ns | 0.0038 ns |  0.25 |     - |     - |     - |         - |
|                  |                    |           |           |           |       |       |       |       |           |
|     StringEqauls | proxy-authenticate | 10.298 ns | 0.0078 ns | 0.0065 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | proxy-authenticate |  2.563 ns | 0.0071 ns | 0.0066 ns |  0.25 |     - |     - |     - |         - |

when the input string is let's say 30 bytes the results are even better:

|           Method |           headerName |      Mean |     Error |    StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|----------------- |--------------------- |----------:|----------:|----------:|------:|------:|------:|------:|----------:|
|     StringEqauls | PROXY(...)WORLD [30] | 15.396 ns | 0.0141 ns | 0.0117 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | PROXY(...)WORLD [30] |  2.558 ns | 0.0010 ns | 0.0008 ns |  0.17 |     - |     - |     - |         - |
|                  |                      |           |           |           |       |       |       |       |           |
|     StringEqauls | proxy(...)World [30] | 14.997 ns | 0.0089 ns | 0.0079 ns |  1.00 |     - |     - |     - |         - |
| StringEqauls_AVX | proxy(...)World [30] |  2.550 ns | 0.0038 ns | 0.0030 ns |  0.17 |     - |     - |     - |         - |

So for [0..32] chars (string.Length) we can emit an inlined super-fast comparison:
[0..4]: using a single 64bit GP register
[5..8]: using two 64bit GP registers
[9..16]: using two 128bit vectors
[17..32]: using two 256bit vectors
[33...]: leave as is.

/cc @GrabYourPitchforks @benaadams @jkotas @stephentoub

The text was updated successfully, but these errors were encountered:

Dotnet-GitSync-Bot · 2020-12-04T21:33:50Z

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

Szer · 2020-12-04T21:43:54Z

It could be very beneficial for headers parsing in web servers.

GrabYourPitchforks · 2020-12-04T22:13:34Z

Somewhat related to #11484. If the JIT has knowledge that one of the parameters is constant, it can tell us to go down a separate code path, or it can even redirect the call to a different method.

EgorBo · 2020-12-04T22:39:42Z

@GrabYourPitchforks Nice idea, reminds me this LLVM intrinsic: https://llvm.org/docs/LangRef.html#llvm-is-constant-intrinsic

Thanks to #37226 and #1378 the following code is computed as a const! 🎉

public ulong Test() => ConstStringTo64BitInt("True");

[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static ulong ConstStringTo64BitInt(string str)
{
    Debug.Assert(str.Length <= 4);
    ulong v = 0;
    if (str.Length == 4) v |= ((ulong)str[3] << 48);
    if (str.Length >= 3) v |= ((ulong)str[2] << 32);
    if (str.Length >= 2) v |= ((ulong)str[1] << 16);
    if (str.Length >= 1) v |= ((ulong)str[0]);
    if (str.Length == 0) return 0;
    return v;
}

Codegen for Test:

       48B85400720075006500 mov      rax, 0x65007500720054
       C3                   ret

A sort of JitHelpers.IsConstString() should be easy to implement. I can make a prototype.

EgorBo · 2020-12-04T23:07:41Z

bool JitHelpers.IsConstAsciiString(string) prototype: EgorBo@86ca130

jkotas · 2020-12-05T02:22:13Z

There are many APIs that can be replaced with more efficient call-site specific equivalent when the input is in a particular shape. Would source generators be better way to handle that? It would be much more powerful solution than IsConstAsciiString.

Szer · 2020-12-05T10:31:19Z

Would source generators be better way to handle that? It would be much more powerful solution than IsConstAsciiString

Does it mean that optimized path in language agnostic BCL could only be used in C#?

EgorBo · 2020-12-05T10:37:22Z

I think at least for the

string.Equals(str, "cns", StringComparison.OrdinalIgnoreCase); // Also, for just `Ordinal`

for str.Length <= 4 (or 2 in case of 32bit) small strings we can introduce the optimization directly in the JIT, it shouldn't be difficult to implement - a couple of GT_QMARK, GT_EQ and an indirect load. We can also rely on PGO and don't do it in the cold blocks. But I'd honestly also implement if for longer strings (up to 32 symbols) using SSE/AVX, it shouldn't be a lot of code as well. So the existing code and F# could benefit from it (and handle more cases after inlining) in contrast with the SourceGen-based approach

Inspired by LLVM: https://godbolt.org/z/z3MKa9

UPD: example:

public static bool Test(string str)
{
    return string.Equals(str, "true", StringComparison.Ordinal);
}

is optimized into:

public static bool Test(string str)
{
    return object.ReferenceEquals(str, "true") || // <-- perhaps we can omit this one since the comparison is fast any way
            (str != null &&
            str.Length == 4 &&
            Unsafe.ReadUnaligned<ulong>(
                ref Unsafe.As<char, byte>(ref str._firstChar)) == 0x65007500720054);
}

Codegen:

       48B8B8310090AA010000 mov      rax, 0x1AA900031B8
       483B08               cmp      rcx, gword ptr [rax]
       7426                 je       SHORT G_M59561_IG07
       4885C9               test     rcx, rcx
       741E                 je       SHORT G_M59561_IG05
       83790804             cmp      dword ptr [rcx+8], 4
       7518                 jne      SHORT G_M59561_IG05
       4883C10C             add      rcx, 12
       48B85400720075006500 mov      rax, 0x65007500720054
       483901               cmp      qword ptr [rcx], rax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
       C3                   ret
G_M59561_IG05:              
       33C0                 xor      eax, eax
       C3                   ret
G_M59561_IG07:              
       B801000000           mov      eax, 1
       C3                   ret

; Total bytes of code 59

Without object.ReferenceEquals:

       4885C9               test     rcx, rcx
       741E                 je       SHORT G_M59561_IG05
       83790804             cmp      dword ptr [rcx+8], 4
       7518                 jne      SHORT G_M59561_IG05
       4883C10C             add      rcx, 12
       48B85400720075006500 mov      rax, 0x65007500720054
       483901               cmp      qword ptr [rcx], rax
       0F94C0               sete     al
       0FB6C0               movzx    rax, al
       C3                   ret
G_M59561_IG05:              
       33C0                 xor      eax, eax
       C3                   ret

; Total bytes of code 38

\--*  NE        int
   +--*  LCL_VAR   ref    V00 arg0
   \--*  CNS_INT   ref    null

qmark 

\--*  EQ        int
   +--*  ARR_LENGTH int
   |  \--*  LCL_VAR   ref    V00 arg0
   \--*  CNS_INT   int    4

qmark

\--*  EQ        int
   +--*  IND       long
       \--*  ADD       byref
          +--*  LCL_VAR   ref    V00
    \--*  CNS_INT   long   0x65007500720054

else

\--*  RETURN       int
   \--*  CNS_INT   int    0

master:

       4883EC28             sub      rsp, 40
       48BAF831FFAE39010000 mov      rdx, 0x139AEFF31F8
       488B12               mov      rdx, gword ptr [rdx]
       41B805000000         mov      r8d, 5
       E81CA1FDFF           call     System.String:Equals(System.String,System.String,int):bool
       90                   nop      
       4883C428             add      rsp, 40
       C3                   ret      

; Total bytes of code: 34

EgorBo · 2020-12-05T14:27:42Z

Prototype for ^ in JIT: EgorBo@5883362

jkotas · 2020-12-05T14:53:12Z

It could be very beneficial for headers parsing in web servers.

Could you please share links to the webserver code where you think this would be very beneficial?

Prototype

I think it make sense to start with the simpler memcpy-like and memcmp-like cases before going to more esoteric cases like globalization.

Also, we need to maintain parity between string and Spans (the same optimization has to be implemented for both). We have been promoting Spans as the high performance solution. We do not want to be in situation where strings have extra optimizations that are missing for Spans.

Szer · 2020-12-05T15:08:51Z

Could you please share links to the webserver code where you think this would be very beneficial?

I'm working in Azure Gateway team (reverse proxy for everything in Azure), our header parsing code taking 3.5% CPU time.
Code looks like this (header names were removed):

        private static bool ShouldIgnore(string headerName)
        {
            switch (headerName.Length)
            {
                case 13:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 17:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 22:
                    return
                        headerName.Equals("foo", StringComparison.OrdinalIgnoreCase) ||
                        headerName.Equals("foo", StringComparison.OrdinalIgnoreCase) ||
                        headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 25:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                case 28:
                    return headerName.Equals("foo", StringComparison.OrdinalIgnoreCase);
                default:
                    return false;
            }
        }

I'm pretty sure ASP Net team also will be pleased to optimize header parsing.
Example 1

Example2 (this is basically codegen)

Tornhoof · 2020-12-05T17:27:28Z

There are many APIs that can be replaced with more efficient call-site specific equivalent when the input is in a particular shape. Would source generators be better way to handle that? It would be much more powerful solution than IsConstAsciiString.

The first customer of such a source-gen could be System.Text.Json's property matching, I use pretty much the same approach (without avx and only ordinal comparison, not case-insensitive) in SpanJson and up to ~32 bytes it's a lot faster than string.Equals.

jkotas · 2020-12-05T18:17:18Z

Yep, this can make naively written parsers better. There are other tricks that high-performance parsers use as the ASP.NET example shows and that this won't handle. I doubt that ASP.NET will want to replace their fine-tuned parser even if this optimization gets implemented.

gfoidl · 2020-12-05T19:22:27Z

...and a lot of fast parsers operate with utf-8 or ascii bytes and try to avoid creating strings as much as possible.

Maybe also consider Equals with ROS in this (general) deliberation.

EgorBo · 2020-12-05T21:26:31Z

@jkotas

I think it make sense to start with the simpler memcpy-like and memcmp-like cases before going to more esoteric cases like globalization.

I'm not sure I follow, can you please post an example of these memcpy/memcmp usages?

My idea was to optimize cases when we compare a string against some predefined one in the most convenient way. E.g. even a simple search shows many cases in dotnet/runtime:

and it doesn't even cover cases when the constant string is a named constant.

I agree it'd be nice to also handle ReadOnlySpan as well but it's more complicated to implement because of that const string -> to Span conversion in the middle, it definitely won't be a single statement in the import phase.

jkotas · 2020-12-06T14:18:01Z

I'm not sure I follow, can you please post an example of these memcpy/memcmp usages?

memcpy/memcmp primitives are exposed via number of different methods: CopyTo, TryCopyTo, Equals, SequenceEquals, StartsWith, ... .

Example for memcpy:

https://github.com/dotnet/runtime/blob/master/src/libraries/System.Private.CoreLib/src/System/Boolean.cs#L102

This would be more naturaly written as "True".AsSpan().CopyTo(destination) and compiled into a sigle instruction that writes all 8 bytes in one shot.

Example for memcmp:

runtime/src/libraries/System.Net.Http/src/System/Net/Http/Headers/AltSvcHeaderParser.cs

Line 144 in f9a8abc

else if (value.AsSpan(idx).StartsWith("persist="))

My idea was to optimize cases when we compare a string against some predefined one in the most convenient way

I agree with the idea of enabling people to write natural code. I would add that the optimizations in this space need to work well with our other best practices for writing high-performance code, that means Spans. Optimizations that only kick in for code that is likely slow for other reason (e.g. allocates a lot of strings) won't move the needle.

EgorBo · 2020-12-06T14:48:35Z

@jkotas Thanks for the examples!
Looks like thse StartsWith/EndsWith/SequenceCompareTo methods for Spans have AggressiveInlining on them and all three use SpanHelpers.SequenceEqual under the hood so in theory it should be beneficial to try to intrinsify it for const args. The problem that even this simple method:

static bool Test(ReadOnlySpan<char> span)
{
    return span.StartsWith("test");
}

Emits a freaking forest of GenTree 😢 https://gist.github.com/EgorBo/7dd7537b8a8cb4341b47bc9499f60e7a
(26 basic-blocks!)

So the only way is to intrinsify each of them separately I guess. Or rely on that source-gen approach.

jkotas · 2020-12-06T15:04:14Z

Or apply the optimization later in the pipeline once this forest of GenTrees gets simplified.

EgorBo · 2020-12-06T19:53:57Z

Or apply the optimization later in the pipeline once this forest of GenTrees gets simplified.

Still contains a lot of temps - https://gist.github.com/EgorBo/c6eb8edd8de19934506a3a9f3c859f46
Ideally this code:

ref char p = ref MemoryMarshal.GetReference("hello".AsSpan()); // after inlining

should be optimized into just

ref char p = ref "hello"._firstChar;

Perhaps, spans should have some special operators in JIT's IR or something like that

EgorBo · 2020-12-25T11:58:24Z

Ah, I didn't notice it was assigned

sandreenko · 2020-12-25T18:33:43Z

Ah, I didn't notice it was assigned

Don't worry, my task was just to do a further triage, I have not started the implementation :-)

EgorBo added the tenet-performance Performance related issue label Dec 4, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Dec 4, 2020

GrabYourPitchforks added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 4, 2020

JulieLeeMSFT assigned sandreenko Dec 6, 2020

JulieLeeMSFT added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed untriaged New issue has not been triaged by the area owner labels Dec 6, 2020

EgorBo mentioned this issue Dec 25, 2020

[RyuJIT] Unroll StartsWith/SequenceEqual for ([ReadOnly]Span<char>, const-string) #46392

Closed

sandreenko removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Dec 25, 2020

sandreenko assigned EgorBo Dec 25, 2020

sandreenko removed their assignment Dec 25, 2020

sandreenko added this to the 6.0.0 milestone Dec 25, 2020

EgorBo modified the milestones: 6.0.0, Future Mar 8, 2021

EgorBo mentioned this issue Feb 4, 2022

Streamline bool.TryParse/Format #64782

Merged

EgorBo closed this as completed Sep 6, 2022

ghost locked as resolved and limited conversation to collaborators Oct 7, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

EgorBo commented Dec 4, 2020 •

edited

Loading

Dotnet-GitSync-Bot commented Dec 4, 2020

Szer commented Dec 4, 2020

GrabYourPitchforks commented Dec 4, 2020

EgorBo commented Dec 4, 2020 •

edited

Loading

EgorBo commented Dec 4, 2020

jkotas commented Dec 5, 2020

Szer commented Dec 5, 2020

EgorBo commented Dec 5, 2020 •

edited

Loading

EgorBo commented Dec 5, 2020

jkotas commented Dec 5, 2020

Szer commented Dec 5, 2020 •

edited

Loading

Tornhoof commented Dec 5, 2020

jkotas commented Dec 5, 2020

gfoidl commented Dec 5, 2020

EgorBo commented Dec 5, 2020

jkotas commented Dec 6, 2020 •

edited

Loading

EgorBo commented Dec 6, 2020

jkotas commented Dec 6, 2020

EgorBo commented Dec 6, 2020

EgorBo commented Dec 25, 2020

sandreenko commented Dec 25, 2020

OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

OrdinalIgnoreCase could be faster when one of the args is a const ASCII string #45613

Comments

EgorBo commented Dec 4, 2020 • edited Loading

Dotnet-GitSync-Bot commented Dec 4, 2020

Szer commented Dec 4, 2020

GrabYourPitchforks commented Dec 4, 2020

EgorBo commented Dec 4, 2020 • edited Loading

EgorBo commented Dec 4, 2020

jkotas commented Dec 5, 2020

Szer commented Dec 5, 2020

EgorBo commented Dec 5, 2020 • edited Loading

EgorBo commented Dec 5, 2020

jkotas commented Dec 5, 2020

Szer commented Dec 5, 2020 • edited Loading

Tornhoof commented Dec 5, 2020

jkotas commented Dec 5, 2020

gfoidl commented Dec 5, 2020

EgorBo commented Dec 5, 2020

jkotas commented Dec 6, 2020 • edited Loading

EgorBo commented Dec 6, 2020

jkotas commented Dec 6, 2020

EgorBo commented Dec 6, 2020

EgorBo commented Dec 25, 2020

sandreenko commented Dec 25, 2020

EgorBo commented Dec 4, 2020 •

edited

Loading

EgorBo commented Dec 4, 2020 •

edited

Loading

EgorBo commented Dec 5, 2020 •

edited

Loading

Szer commented Dec 5, 2020 •

edited

Loading

jkotas commented Dec 6, 2020 •

edited

Loading