Add SearchValues<char> implementation for two sets of 128 chars #103216

Merged: 4 commits merged into dotnet:main on Jul 22, 2024

Member

@MihaZupan MihaZupan commented Jun 10, 2024

#101001 significantly improved the performance of the non-vectorized -Except paths of non-ASCII SearchValues<char>.
However, they are still not vectorized, and this PR changes that for values where the non-ASCII part can fit into a 128-bit bitmap.

This adds an implementation that's almost the same as the AsciiCharSearchValues, but where the core lookup checks against two 128-bit bitmaps, with the second one at a variable offset:

-Vector128<byte> source = TOptimizations.PackSources(source0.AsUInt16(), source1.AsUInt16());
-Vector128<byte> result = IndexOfAnyLookupCore(source, bitmapLookup);
-return TNegator.NegateIfNeeded(result);
+Vector128<byte> packed0 = TOptimizations.PackSources(source0.AsUInt16(), source1.AsUInt16());
+Vector128<byte> packed1 = Default.PackSources(source0.AsUInt16() - offset, source1.AsUInt16() - offset);
+Vector128<byte> result0 = IndexOfAnyLookupCore(packed0, bitmapLookup0);
+Vector128<byte> result1 = IndexOfAnyLookupCore(packed1, bitmapLookup1);
+return TNegator.NegateIfNeeded(result0 | result1);
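
A scalar model may make the vectorized diff above easier to follow. This is only a sketch under stated assumptions: `TwoBitmapCharSet`, `TryCreate`, and `Contains` are hypothetical names, and the eligibility rule (all non-ASCII values fall within a 128-wide window starting at the smallest one) is inferred from the description rather than taken from the runtime's code.

```csharp
using System;

// Hypothetical scalar model; the real implementation is vectorized and lives in the runtime.
public sealed class TwoBitmapCharSet
{
    private readonly ulong[] _asciiBitmap = new ulong[2];   // bit c set => char c (0..127) is in the set
    private readonly ulong[] _secondBitmap = new ulong[2];  // bit (c - _offset) set => char c is in the set
    private int _offset;                                    // smallest non-ASCII value in the set

    private TwoBitmapCharSet() { }

    // Assumption: a set qualifies when its non-ASCII values span fewer than 128 code units.
    public static bool TryCreate(string values, out TwoBitmapCharSet set)
    {
        set = null;
        int min = int.MaxValue, max = -1;
        foreach (char c in values)
        {
            if (c >= 128) { min = Math.Min(min, c); max = Math.Max(max, c); }
        }
        // No non-ASCII part (plain ASCII case) or a non-ASCII part that is too wide.
        if (max < 0 || max - min >= 128) return false;

        var result = new TwoBitmapCharSet { _offset = min };
        foreach (char c in values)
        {
            if (c < 128) SetBit(result._asciiBitmap, c);
            else SetBit(result._secondBitmap, c - min);
        }
        set = result;
        return true;
    }

    // Mirrors "result0 | result1" from the diff: either bitmap may report a match.
    public bool Contains(char c) =>
        TestBit(_asciiBitmap, c) | TestBit(_secondBitmap, c - _offset);

    private static void SetBit(ulong[] bitmap, int index) =>
        bitmap[index >> 6] |= 1UL << (index & 63);

    // Indices outside 0..127 never match, much like out-of-range values in the packed lookup.
    private static bool TestBit(ulong[] bitmap, int index) =>
        (uint)index < 128 && (bitmap[index >> 6] & (1UL << (index & 63))) != 0;
}
```

For example, `TwoBitmapCharSet.TryCreate("kK\u212A", out var set)` would qualify under this rule because the only non-ASCII value is U+212A, while a set mixing 'ä' with a CJK character would not.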

(All the numbers below are using the Avx2 path of the new implementation -- measured before #103710)

Avx2 where the text is mostly non-ASCII (previous 'Mixed' would use the prob map / scalar fallback)

| Method | Toolchain | Length | Mean | Error | Ratio |
|---|---|---|---|---|---|
| Ascii | main | 1000 | 38.13 ns | 0.129 ns | 1.00 |
| Ascii | pr | 1000 | 37.32 ns | 0.160 ns | 0.98 |
| Mixed | main | 1000 | 127.47 ns | 0.429 ns | 1.00 |
| Mixed | pr | 1000 | 51.20 ns | 0.095 ns | 0.40 |
| AsciiLast | main | 1000 | 35.82 ns | 0.264 ns | 1.00 |
| AsciiLast | pr | 1000 | 35.81 ns | 0.079 ns | 1.00 |
| MixedLast | main | 1000 | 138.47 ns | 0.335 ns | 1.00 |
| MixedLast | pr | 1000 | 52.36 ns | 0.138 ns | 0.38 |
| AsciiExcept | main | 1000 | 35.40 ns | 0.138 ns | 1.00 |
| AsciiExcept | pr | 1000 | 35.47 ns | 0.078 ns | 1.00 |
| MixedExcept | main | 1000 | 592.08 ns | 1.005 ns | 1.00 |
| MixedExcept | pr | 1000 | 51.00 ns | 0.133 ns | 0.09 |
| AsciiLastExcept | main | 1000 | 36.77 ns | 0.535 ns | 1.00 |
| AsciiLastExcept | pr | 1000 | 36.34 ns | 0.086 ns | 0.99 |
| MixedLastExcept | main | 1000 | 682.31 ns | 8.073 ns | 1.00 |
| MixedLastExcept | pr | 1000 | 53.97 ns | 1.372 ns | 0.08 |

Avx512 machine (the probabilistic map is a lot faster than on Avx2)

| Method | Toolchain | Length | Mean | Error | Ratio |
|---|---|---|---|---|---|
| Ascii | main | 1000 | 37.44 ns | 0.006 ns | 1.00 |
| Ascii | pr | 1000 | 37.73 ns | 0.005 ns | 1.01 |
| Mixed | main | 1000 | 86.86 ns | 0.111 ns | 1.00 |
| Mixed | pr | 1000 | 60.36 ns | 0.010 ns | 0.69 |
| AsciiLast | main | 1000 | 38.02 ns | 0.013 ns | 1.00 |
| AsciiLast | pr | 1000 | 38.28 ns | 0.003 ns | 1.01 |
| MixedLast | main | 1000 | 95.96 ns | 0.023 ns | 1.00 |
| MixedLast | pr | 1000 | 61.16 ns | 0.009 ns | 0.64 |
| AsciiExcept | main | 1000 | 40.17 ns | 0.005 ns | 1.00 |
| AsciiExcept | pr | 1000 | 39.81 ns | 0.005 ns | 0.99 |
| MixedExcept | main | 1000 | 585.99 ns | 0.183 ns | 1.00 |
| MixedExcept | pr | 1000 | 63.22 ns | 0.010 ns | 0.11 |
| AsciiLastExcept | main | 1000 | 41.54 ns | 0.003 ns | 1.00 |
| AsciiLastExcept | pr | 1000 | 41.54 ns | 0.005 ns | 1.00 |
| MixedLastExcept | main | 1000 | 855.18 ns | 0.059 ns | 1.00 |
| MixedLastExcept | pr | 1000 | 64.65 ns | 0.010 ns | 0.08 |

Early matches

| Method | Toolchain | InputContainsNonAscii | Mean | Error | Ratio |
|---|---|---|---|---|---|
| Mixed | main | False | 5.130 ns | 0.0374 ns | 1.00 |
| Mixed | pr | False | 3.425 ns | 0.0303 ns | 0.67 |
| MixedLast | main | False | 4.830 ns | 0.0182 ns | 1.00 |
| MixedLast | pr | False | 4.179 ns | 0.0750 ns | 0.87 |
| Mixed | main | True | 8.440 ns | 0.0780 ns | 1.00 |
| Mixed | pr | True | 3.383 ns | 0.0163 ns | 0.40 |
| MixedLast | main | True | 9.011 ns | 0.2175 ns | 1.00 |
| MixedLast | pr | True | 4.073 ns | 0.0087 ns | 0.45 |

From that, we can see that the Ascii-only search is at ~1.5x the throughput of the two bitmaps impl.
The two bitmaps have ~1.5x the throughput of the probabilistic map on Avx512 and ~2.5x on Avx2.
In other words, this change is a throughput regression for ProbabilisticWithAsciiCharSearchValues if the text is all ASCII, and an improvement otherwise. It's always cheaper for early matches though.
For the -Except paths where ProbabilisticWithAsciiCharSearchValues uses a scalar fallback, the two bitmaps approach is obviously a lot (10x+) faster.
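
The approximate multipliers quoted above can be re-derived from the Length = 1000 means in the tables. The snippet below is just that arithmetic; `ThroughputCheck` and `Speedup` are hypothetical names, and pairing "Ascii pr" with "Mixed pr" as the ASCII-only versus two-bitmap comparison is my reading of the tables, not something the PR states explicitly.

```csharp
using System;

public static class ThroughputCheck
{
    // How many times faster the "fast" measurement is than the "slow" one.
    public static double Speedup(double slowNs, double fastNs) => slowNs / fastNs;

    public static void PrintAll()
    {
        // Mean times in ns, copied from the Length = 1000 rows of the tables above.
        Console.WriteLine($"Avx2   ASCII-only vs two bitmaps: {Speedup(51.20, 37.32):F2}x");
        Console.WriteLine($"Avx512 ASCII-only vs two bitmaps: {Speedup(60.36, 37.73):F2}x");
        Console.WriteLine($"Avx2   two bitmaps vs prob map:   {Speedup(127.47, 51.20):F2}x");
        Console.WriteLine($"Avx512 two bitmaps vs prob map:   {Speedup(86.86, 60.36):F2}x");
        Console.WriteLine($"Avx2   MixedExcept vs scalar:     {Speedup(592.08, 51.00):F2}x");
        Console.WriteLine($"Avx512 MixedExcept vs scalar:     {Speedup(585.99, 63.22):F2}x");
    }
}
```

Calling `ThroughputCheck.PrintAll()` prints the six ratios, which can then be compared against the rounded figures in the text.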

The change doesn't seem to impact existing Regex benchmarks much, likely because they're very focused on ASCII.


I also tried different implementation approaches to try and reduce code duplication between the existing Ascii and "ascii with second set" implementations:

  • c6f1495, combining the second set into the existing AsciiState, which does save some duplication, but increases the memory consumption of all existing Ascii-only SearchValues.
  • 32f9cf1, that goes all-in with generics, but the JIT can't quite deal with having the vector state be completely generic.
    public static int IndexOfAny<TNegator, TOptimizations>(ref short searchSpace, int searchSpaceLength, ref AsciiWithSecondSetState state)
        where TNegator : struct, INegator
        where TOptimizations : struct, IOptimizations =>
        IndexOfAnyCore<int, TNegator, IndexOfAnyResultMapper<short>, AsciiWithSecondSetLookup<TOptimizations>, AsciiWithSecondSetState, (Vector128<byte> AsciiBitmap, Vector128<byte> SecondBitmap, Vector128<ushort> Offset), (Vector256<byte> AsciiBitmap, Vector256<byte> SecondBitmap, Vector256<ushort> Offset)>(ref searchSpace, searchSpaceLength, ref state);
    
    private static TResult IndexOfAnyCore<TResult, TNegator, TResultMapper, TLookup, TState, TVector128State, TVector256State>(ref short searchSpace, int searchSpaceLength, ref TState state)
        where TResult : struct
        where TNegator : struct, INegator
        where TResultMapper : struct, IResultMapper<short, TResult>
        where TLookup : struct, ILookup<TState, TVector128State, TVector256State>
        where TState : struct
        where TVector128State : struct
        where TVector256State : struct

Looking at patterns from Regex_RealWorldPatterns.json, ~75% of non-ASCII sets would use the new implementation over the Ascii+ProbMap, most of them because of the Kelvin sign (U+212A), which IgnoreCase sets pull in alongside 'K' and 'k'.

Number of sources: 18886
numberOfSearchValues: 5771
numberOfSearchValuesWithNonAscii: 1238
numberOfSearchValuesWithTwoSets: 960
numberOfSearchValuesWhereNonAsciiIsKelvin: 684
numberOfPatternsWithSearchValues: 5093
numberOfPatternsWithSearchValuesWithNonAscii: 1038
numberOfPatternsWithSearchValuesWithTwoSets: 765

Parsed data for the above:
Regex_RealWorldPatterns.SearchValues.json

@MihaZupan MihaZupan added this to the 9.0.0 milestone Jun 10, 2024
@MihaZupan MihaZupan self-assigned this Jun 10, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-buffers
See info in area-owners.md if you want to be subscribed.

@danmoseley
Member

The change doesn't seem to impact existing Regex benchmarks much, likely because they're very focused on ASCII.

Is it worth adding one or two more non-ASCII benchmarks?

@MihaZupan
Member Author

@MihuBot benchmark RustLang_Sherlock https://github.com/MihaZupan/performance/tree/compiled-regex-only -medium

@MihaZupan
Member Author

@MihuBot fuzz SearchValues

@EgorBo
Member

EgorBo commented Jun 22, 2024

@MihaZupan you need to either omit Run<> or pass args to Run, so in your case:

BenchmarkRunner.Run<Bench>(args);

Otherwise --corerun /base/corerun /diff/corerun args will be ignored 🙂

@MihaZupan
Member Author

MihaZupan commented Jun 22, 2024

Aaah right, thanks. I also forgot to remove the ShortRunJob.

@MihaZupan
Member Author

@EgorBot -intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Buffers;

#nullable disable

public class Bench
{
    private static readonly SearchValues<char> _allowedAscii = SearchValues.Create("1234567890abcdefghijklmnopqrstuvwxyz");
    private static readonly SearchValues<char> _allowedMixed = SearchValues.Create("äöü1234567890abcdefghijklmnopqrstuvw");

    private string _asciiInput;
    private string _mixedInput;

    [Params(16, 100, 10_000)]
    public int Length;

    [GlobalSetup]
    public void Setup()
    {
        _asciiInput = new string('a', Length);
        _mixedInput = 'ä' + new string('a', Length - 1);
    }

    [Benchmark]
    public bool ContainsOnlyAscii() => !_asciiInput.AsSpan().ContainsAnyExcept(_allowedAscii);

    [Benchmark]
    public bool ContainsOnlyMixed() => !_mixedInput.AsSpan().ContainsAnyExcept(_allowedMixed);
}

@EgorBot

EgorBot commented Jun 22, 2024

Benchmark results on Intel
BenchmarkDotNet v0.13.12, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 16 logical and 8 physical cores
  Job-SZXNIX : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  Job-DHGYYU : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

| Method | Toolchain | Length | Mean | Error | Ratio |
|---|---|---|---|---|---|
| ContainsOnlyAscii | Main | 16 | 2.434 ns | 0.0115 ns | 1.00 |
| ContainsOnlyAscii | PR | 16 | 2.434 ns | 0.0063 ns | 1.00 |
| ContainsOnlyMixed | Main | 16 | 12.207 ns | 0.0035 ns | 1.00 |
| ContainsOnlyMixed | PR | 16 | 3.184 ns | 0.0004 ns | 0.26 |
| ContainsOnlyAscii | Main | 100 | 4.251 ns | 0.0005 ns | 1.00 |
| ContainsOnlyAscii | PR | 100 | 4.251 ns | 0.0004 ns | 1.00 |
| ContainsOnlyMixed | Main | 100 | 62.381 ns | 0.0320 ns | 1.00 |
| ContainsOnlyMixed | PR | 100 | 5.499 ns | 0.0020 ns | 0.09 |
| ContainsOnlyAscii | Main | 10000 | 208.463 ns | 0.0200 ns | 1.00 |
| ContainsOnlyAscii | PR | 10000 | 208.571 ns | 0.0259 ns | 1.00 |
| ContainsOnlyMixed | Main | 10000 | 5,783.502 ns | 1.4382 ns | 1.00 |
| ContainsOnlyMixed | PR | 10000 | 391.050 ns | 0.0232 ns | 0.07 |

BDN_Artifacts.zip

@MihuBot

MihuBot commented Jun 22, 2024

System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock
BenchmarkDotNet v0.13.13-nightly.20240311.145, Ubuntu 22.04.4 LTS (Jammy Jellyfish)
AMD EPYC 9V74, 1 CPU, 8 logical and 4 physical cores
MediumRun : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
Job=MediumRun  OutlierMode=DontRemove  IterationCount=15
LaunchCount=2  MemoryRandomization=True  WarmupCount=10
Method Toolchain Pattern Mean Error Ratio Allocated Alloc Ratio
Count Main .* 577,117.72 ns 1,898.483 ns 1.00 2 B 1.00
Count PR .* 609,940.38 ns 5,541.168 ns 1.06 2 B 1.00
Count Main (?i)Holmes 53,669.80 ns 163.206 ns 1.00 - NA
Count PR (?i)Holmes 53,751.77 ns 95.378 ns 1.00 - NA
Count Main (?i)Sher[a-z]+|Hol[a-z]+ 96,797.28 ns 7,445.773 ns 1.01 - NA
Count PR (?i)Sher[a-z]+|Hol[a-z]+ 101,675.49 ns 7,778.427 ns 1.06 - NA
Count Main (?i)Sherlock 45,540.09 ns 158.021 ns 1.00 - NA
Count PR (?i)Sherlock 45,699.65 ns 198.254 ns 1.00 - NA
Count Main (?i)Sherlock Holmes 45,299.62 ns 140.891 ns 1.00 - NA
Count PR (?i)Sherlock Holmes 45,322.99 ns 60.249 ns 1.00 - NA
Count Main (?i)Sherlock|Holmes|Watson 98,222.98 ns 9,160.202 ns 1.02 - NA
Count PR (?i)Sherlock|Holmes|Watson 98,387.61 ns 9,077.154 ns 1.02 - NA
Count Main (?i)Sherlock|(...)er|John|Baker [49] 210,972.18 ns 25,319.086 ns 1.03 1 B 1.00
Count PR (?i)Sherlock|(...)er|John|Baker [49] 216,838.46 ns 27,873.222 ns 1.06 1 B 1.00
Count Main (?i)the 249,601.71 ns 10,074.303 ns 1.00 1 B 1.00
Count PR (?i)the 245,957.03 ns 10,337.071 ns 0.99 1 B 1.00
Count Main (?m)^Sherlock(...)rlock Holmes$ [37] 58,240.97 ns 2,237.819 ns 1.00 - NA
Count PR (?m)^Sherlock(...)rlock Holmes$ [37] 59,147.93 ns 2,749.661 ns 1.02 - NA
Count Main (?s).* 39.26 ns 0.088 ns 1.00 - NA
Count PR (?s).* 41.27 ns 1.728 ns 1.05 - NA
Count Main [^\\n]* 576,131.95 ns 2,377.308 ns 1.00 2 B 1.00
Count PR [^\\n]* 577,114.20 ns 4,299.926 ns 1.00 2 B 1.00
Count Main [a-q][^u-z]{13}x 23,158.57 ns 113.019 ns 1.00 - NA
Count PR [a-q][^u-z]{13}x 23,147.72 ns 90.268 ns 1.00 - NA
Count Main [a-zA-Z]+ing 4,112,765.60 ns 6,996.218 ns 1.00 19 B 1.00
Count PR [a-zA-Z]+ing 4,195,122.65 ns 50,241.289 ns 1.02 21 B 1.11
Count Main \b\w+n\b 8,324,707.71 ns 18,654.897 ns 1.00 44 B 1.00
Count PR \b\w+n\b 8,419,875.57 ns 67,893.821 ns 1.01 44 B 1.00
Count Main \p{L} 10,252,616.17 ns 258,692.197 ns 1.00 35 B 1.00
Count PR \p{L} 10,178,460.27 ns 120,522.988 ns 0.99 35 B 1.00
Count Main \p{Ll} 10,218,675.98 ns 78,494.108 ns 1.00 35 B 1.00
Count PR \p{Ll} 11,001,606.18 ns 501,912.621 ns 1.08 35 B 1.00
Count Main \p{Lu} 355,667.38 ns 7,908.522 ns 1.00 1 B 1.00
Count PR \p{Lu} 348,108.98 ns 3,458.353 ns 0.98 1 B 1.00
Count Main \s[a-zA-Z]{0,12}ing\s 4,387,091.94 ns 11,384.876 ns 1.00 24 B 1.00
Count PR \s[a-zA-Z]{0,12}ing\s 4,404,995.88 ns 9,088.263 ns 1.00 24 B 1.00
Count Main \w+ 4,712,257.79 ns 27,825.642 ns 1.00 18 B 1.00
Count PR \w+ 4,671,552.17 ns 7,259.567 ns 0.99 21 B 1.17
Count Main \w+\s+Holmes 3,340,654.11 ns 10,688.067 ns 1.00 11 B 1.00
Count PR \w+\s+Holmes 3,355,149.72 ns 15,672.577 ns 1.00 10 B 0.91
Count Main \w+\s+Holmes\s+\w+ 3,609,649.22 ns 65,235.529 ns 1.00 10 B 1.00
Count PR \w+\s+Holmes\s+\w+ 3,502,413.34 ns 60,104.536 ns 0.97 12 B 1.20
Count Main aei 38,764.39 ns 528.311 ns 1.00 - NA
Count PR aei 38,671.05 ns 532.363 ns 1.00 - NA
Count Main aqj 38,552.12 ns 579.654 ns 1.00 - NA
Count PR aqj 38,708.25 ns 516.758 ns 1.00 - NA
Count Main Holmes 50,202.09 ns 78.982 ns 1.00 - NA
Count PR Holmes 50,229.34 ns 115.917 ns 1.00 - NA
Count Main Holmes.{0,25}(...).{0,25}Holmes [39] 44,351.33 ns 90.615 ns 1.00 - NA
Count PR Holmes.{0,25}(...).{0,25}Holmes [39] 44,416.66 ns 119.101 ns 1.00 - NA
Count Main Sher[a-z]+|Hol[a-z]+ 48,711.02 ns 113.935 ns 1.00 - NA
Count PR Sher[a-z]+|Hol[a-z]+ 48,852.79 ns 227.029 ns 1.00 - NA
Count Main Sherlock 58,662.58 ns 2,239.132 ns 1.00 - NA
Count PR Sherlock 59,492.49 ns 2,839.858 ns 1.02 - NA
Count Main Sherlock Holmes 59,536.37 ns 2,887.051 ns 1.01 - NA
Count PR Sherlock Holmes 59,734.87 ns 2,760.671 ns 1.01 - NA
Count Main Sherlock\s+Holmes 59,934.57 ns 2,381.314 ns 1.00 - NA
Count PR Sherlock\s+Holmes 60,715.04 ns 3,024.286 ns 1.02 - NA
Count Main Sherlock|Holmes 44,782.94 ns 106.657 ns 1.00 - NA
Count PR Sherlock|Holmes 44,866.33 ns 114.894 ns 1.00 - NA
Count Main Sherlock|Holmes|Watson 58,630.95 ns 77.246 ns 1.00 - NA
Count PR Sherlock|Holmes|Watson 59,046.70 ns 110.269 ns 1.01 - NA
Count Main Sherlock|Holm(...)er|John|Baker [45] 109,855.32 ns 133.155 ns 1.00 - NA
Count PR Sherlock|Holm(...)er|John|Baker [45] 109,887.13 ns 89.847 ns 1.00 - NA
Count Main Sherlock|Street 25,047.25 ns 62.195 ns 1.00 - NA
Count PR Sherlock|Street 25,004.97 ns 85.354 ns 1.00 - NA
Count Main the 179,288.74 ns 632.033 ns 1.00 1 B 1.00
Count PR the 179,042.42 ns 643.054 ns 1.00 1 B 1.00
Count Main The 54,675.24 ns 84.398 ns 1.00 - NA
Count PR The 54,504.36 ns 156.081 ns 1.00 - NA
Count Main the\s+\w+ 282,854.54 ns 12,723.264 ns 1.00 1 B 1.00
Count PR the\s+\w+ 288,927.85 ns 11,924.621 ns 1.03 1 B 1.00
Count Main zqj 38,776.08 ns 523.611 ns 1.00 - NA
Count PR zqj 38,694.38 ns 541.346 ns 1.00 - NA

Vector512<byte> secondBitmap512 = state.SecondBitmap512;
Vector512<ushort> offset512 = Vector512.Create(state.Offset);

if (searchSpaceLength > 2 * Vector512<short>.Count)
Member


@GrabYourPitchforks, in the UTF8 experiment you shared with me, you had code that, after having validated that Vector256 was hardware accelerated, aligned an address and then read a full vector, without concern for whether that read went past either end of the target region. Is that safe to do on all platforms? It seems we could avoid some branching with similar techniques in many of our implementations that use vectorization.

Member


Is that safe to do on all platforms?

As long as the data is pinned. Otherwise the GC can interrupt at any point, move the data, and break the alignment assumption.

@MihaZupan MihaZupan force-pushed the searchvalues-asciiWithSecondSet branch 2 times, most recently from f378aca to e747fd0 Compare July 10, 2024 22:48
@stephentoub
Member

the Ascii-only search is at ~1.5x the throughput of the two bitmaps impl.

Does this mean someone searching mostly ASCII text with IgnoreCase for [a-z] will see a 50% regression?

@MihaZupan
Member Author

MihaZupan commented Jul 11, 2024

Does this mean someone searching mostly ASCII text with IgnoreCase for [a-z] will see a 50% regression?

If you had a long run of text that was just ASCII and didn't match, yes, it would be slower.
But if you were stopping on matches along the way (even if they were ASCII), the two sets can be cheaper due to the lower overhead on matches (mainly since there's an extra method involved in the ascii+probmap implementation).

I reran Sherlock, and the throughput difference doesn't seem to be affecting it: MihuBot/runtime-utils#505 (comment)

We could also choose to use the two bitmaps approach only on -Except paths to avoid the scalar fallbacks.

@MihaZupan
Member Author

Any concerns with merging this one as-is and seeing if benchmarks complain?

@stephentoub
Member

Ok, let's give it a try but be ready to back it out if any meaningful regressions pop up.

@stephentoub stephentoub merged commit 13abb44 into dotnet:main Jul 22, 2024
144 of 147 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Aug 22, 2024
6 participants