Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster IndexOfAny for IgnoreCase Ascii letters #96588

Merged
merged 8 commits into from
Feb 17, 2024

Conversation

MihaZupan
Copy link
Member

@MihaZupan MihaZupan commented Jan 7, 2024

The idea is that if we're searching for both cases of an Ascii letter, we can do it by masking off the bit where they differ and only search for one. E.g. (input | 0x20) is 'a' or 'b' instead of input is 'a' or 'A' or 'b' or 'B'.

This PR adds a variant of packed IndexOfAny(char) and IndexOfAny(char, char) that does the | 0x20 input transformation and teaches SearchValues to use it.
I've also updated the RegexCompiler to use SearchValues for sets of 4/5 values like the source generator does, so that it can make use of this change.

public class SearchValuesPackedIgnoreCase
{
    private static readonly SearchValues<char> s_asciiTwoLetters = SearchValues.Create("Aa");
    private static readonly SearchValues<char> s_asciiFourLetters = SearchValues.Create("AaBb");
    private static readonly string s_text = new string('\n', 20_000);

    [Benchmark] public int IndexOfOneIgnoreCase() => s_text.AsSpan().IndexOfAny(s_asciiTwoLetters);
    [Benchmark] public int IndexOfTwoIgnoreCase() => s_text.AsSpan().IndexOfAny(s_asciiFourLetters);
}
Method Toolchain Mean Error Ratio
IndexOfOneIgnoreCase main 583.5 ns 3.56 ns 1.00
IndexOfOneIgnoreCase pr 499.5 ns 1.71 ns 0.86
IndexOfTwoIgnoreCase main 742.5 ns 4.02 ns 1.00
IndexOfTwoIgnoreCase pr 642.2 ns 5.54 ns 0.87
Method Toolchain Pattern Options Mean Error Ratio
Count main (?i)Sherlock|Holmes Compiled 676.61 us 1.858 us 1.00
Count pr (?i)Sherlock|Holmes Compiled 623.75 us 1.762 us 0.92
Count main [Zz] Compiled 32.16 us 0.126 us 1.00
Count pr [Zz] Compiled 30.39 us 0.112 us 0.95
Count main [ZzQq] Compiled 63.99 us 0.289 us 1.00
Count pr [ZzQq] Compiled 46.14 us 0.131 us 0.72

The bigger difference for [ZzQq] is switching from a regular 4-value IndexOfAny to a 2-value packed one.

I tested the same approach for IndexOfAny(char, char, char) and IndexOfAnyInRange, but the throughput there is pretty much on par with or slower than IndexOfAnyAsciiSearcher, so I avoided adding more code for it.

@MihaZupan MihaZupan added this to the 9.0.0 milestone Jan 7, 2024
@MihaZupan MihaZupan self-assigned this Jan 7, 2024
@ghost
Copy link

ghost commented Jan 7, 2024

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

Issue Details

The idea is that if we're searching for both cases of an Ascii letter, we can do it by masking off the bit where they differ and only search for one. E.g. (input | 0x20) is 'a' or 'b' instead of input is 'a' or 'A' or 'b' or 'B'.

This PR adds a variant of packed IndexOfAny(char) and IndexOfAny(char, char) that does the | 0x20 input transformation and teaches SearchValues to use it.
I've also updated the RegexCompiler to use SearchValues for sets of 4/5 values like the source generator does, so that it can make use of this change.

public class SearchValuesPackedIgnoreCase
{
    private static readonly SearchValues<char> s_asciiTwoLetters = SearchValues.Create("Aa");
    private static readonly SearchValues<char> s_asciiFourLetters = SearchValues.Create("AaBb");
    private static readonly string s_text = new string('\n', 20_000);

    [Benchmark] public int IndexOfOneIgnoreCase() => s_text.AsSpan().IndexOfAny(s_asciiTwoLetters);
    [Benchmark] public int IndexOfTwoIgnoreCase() => s_text.AsSpan().IndexOfAny(s_asciiFourLetters);
}
Method Toolchain Mean Error Ratio
IndexOfOneIgnoreCase main 583.5 ns 3.56 ns 1.00
IndexOfOneIgnoreCase pr 499.5 ns 1.71 ns 0.86
IndexOfTwoIgnoreCase main 742.5 ns 4.02 ns 1.00
IndexOfTwoIgnoreCase pr 642.2 ns 5.54 ns 0.87
Method Toolchain Pattern Options Mean Error Ratio
Count main (?i)Sherlock Compiled 57.02 us 0.121 us 1.00
Count pr (?i)Sherlock Compiled 57.75 us 0.132 us 1.01
Count main (?i)Sherlock|Holmes Compiled 676.61 us 1.858 us 1.00
Count pr (?i)Sherlock|Holmes Compiled 623.75 us 1.762 us 0.92
Count main [Zz] Compiled 32.16 us 0.126 us 1.00
Count pr [Zz] Compiled 30.39 us 0.112 us 0.95
Count main [ZzQq] Compiled 63.99 us 0.289 us 1.00
Count pr [ZzQq] Compiled 46.14 us 0.131 us 0.72

The bigger difference for [ZzQq] is switching from a regular 4-value IndexOfAny to a 2-value packed one.

Author: MihaZupan
Assignees: MihaZupan
Labels:

area-System.Buffers

Milestone: 9.0.0

@tannergooding
Copy link
Member

Is this pending anything or can it be merged?

Saw a couple questions and an ask for a comment to be added from Stephen, but its unclear if you want to handle that in a follow up or not.

@MihaZupan
Copy link
Member Author

This one is waiting on me, I still have to follow up on Stephen's feedback

@tannergooding tannergooding added the needs-author-action An issue or pull request that requires more info or actions from the author. label Feb 1, 2024
@ghost ghost added the no-recent-activity label Feb 15, 2024
@ghost
Copy link

ghost commented Feb 15, 2024

This pull request has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

@MihaZupan MihaZupan removed no-recent-activity needs-author-action An issue or pull request that requires more info or actions from the author. labels Feb 17, 2024
@MihaZupan MihaZupan merged commit 432397d into dotnet:main Feb 17, 2024
175 of 178 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Mar 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants