Add a SearchValues implementation for values with unique low nibbles #106900

Merged (5 commits) on Sep 10, 2024

@MihaZupan (Member) commented Aug 23, 2024

Based on http://0x80.pl/articles/simd-byte-lookup.html#special-case-3-unique-lower-and-higher-nibbles

If all of the values have a different low nibble, we can use a faster search that takes advantage of that fact.
For example, this applies to the "Sherlock|Holmes|Watson|Irene|Adler|John|Baker" regex pattern which uses SearchValues.Create("ABHIJSW").
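The eligibility condition can be sketched as a small check. This is a minimal illustration rather than the code in this PR; `LowNibbleCheck` and `HasUniqueLowNibbles` are hypothetical names, and the input is assumed to contain no duplicate values.

```csharp
using System;

// Hypothetical helper for illustration; not the check used in the PR.
public static class LowNibbleCheck
{
    // Returns true if no two values share the same low 4 bits.
    // Assumes the values themselves are distinct (a repeated value counts as a collision).
    public static bool HasUniqueLowNibbles(ReadOnlySpan<byte> values)
    {
        int seen = 0; // bit i is set once a value with low nibble i has been seen

        foreach (byte value in values)
        {
            int bit = 1 << (value & 0xF);
            if ((seen & bit) != 0)
            {
                return false;
            }
            seen |= bit;
        }

        return true;
    }
}
```

For "ABHIJSW" the low nibbles are 1, 2, 8, 9, A, 3 and 7, so `LowNibbleCheck.HasUniqueLowNibbles("ABHIJSW"u8)` returns true. A set such as "AQ" (0x41 and 0x51) shares low nibble 1 and does not qualify.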

For comparison, the current core lookup for an ASCII set on AVX2 uses 2 ANDs, 1 shift, and 2 shuffles:

Vector256<byte> Lookup(Vector256<byte> source, Vector256<byte> bitmapLookup)
{
    Vector256<byte> highNibbles = (source.AsInt32() >>> 4).AsByte() & Vector256.Create((byte)0xF);
    Vector256<byte> bitMask = Avx2.Shuffle(bitmapLookup, source);
    Vector256<byte> bitPositions = Avx2.Shuffle(Vector256.Create(0x8040201008040201).AsByte(), highNibbles);
    return bitMask & bitPositions;
}
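As a rough scalar model of the bitmap lookup above (one byte at a time instead of 32), the table is indexed by the low nibble and each entry holds one bit per high nibble. This is my reading of the snippet, not code from the runtime; `AsciiBitmapModel`, `BuildBitmap` and `Contains` are hypothetical names.

```csharp
using System;

// Scalar model of the bitmap lookup, for illustration only (hypothetical names).
public static class AsciiBitmapModel
{
    // bitmap[lowNibble] has bit number (highNibble) set for every value in the set.
    public static byte[] BuildBitmap(ReadOnlySpan<byte> asciiValues)
    {
        var bitmap = new byte[16];
        foreach (byte value in asciiValues)
        {
            if (value >= 0x80) throw new ArgumentException("ASCII only", nameof(asciiValues));
            bitmap[value & 0xF] |= (byte)(1 << (value >> 4));
        }
        return bitmap;
    }

    public static bool Contains(byte[] bitmap, byte b)
    {
        // Models the first shuffle producing 0 for bytes with the high bit set.
        if (b >= 0x80) return false;

        // bitMask & bitPositions from the vectorized version.
        return (bitmap[b & 0xF] & (1 << (b >> 4))) != 0;
    }
}
```

Because one entry can carry several high-nibble bits, this scheme handles any ASCII set, at the cost of the extra shift, shuffle and ANDs.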

By contrast, the core lookup for values with unique low nibbles uses 1 comparison and 1 shuffle:

Vector256<byte> Lookup(Vector256<byte> source, Vector256<byte> valuesByLowNibble)
{
    Vector256<byte> values = Avx2.Shuffle(valuesByLowNibble, source);
    return Vector256.Equals(source, values);
}

(Code-wise, most of the implementation in this PR is a copy-paste of the existing ASCII logic, with this core lookup routine swapped in.)
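To show how the lookup fits into a search loop, here is a portable `Vector128` sketch. It is not the PR's implementation: `UniqueLowNibbleSketch` is a hypothetical name, and the explicit `& 0xF` is my addition because the cross-platform `Vector128.Shuffle` zeroes out-of-range indices rather than using only the low 4 bits as `Avx2.Shuffle` does for ASCII bytes. The table is assumed to be prepared so that an input byte of 0 cannot produce a false positive.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Portable sketch of the unique-low-nibble search (hypothetical names, not the PR's code).
public static class UniqueLowNibbleSketch
{
    public static int IndexOfAny(ReadOnlySpan<byte> source, Vector128<byte> valuesByLowNibble)
    {
        Vector128<byte> lowNibbleMask = Vector128.Create((byte)0xF);
        int i = 0;

        for (; i <= source.Length - Vector128<byte>.Count; i += Vector128<byte>.Count)
        {
            Vector128<byte> input = Vector128.Create(source.Slice(i, Vector128<byte>.Count));

            // Pick the candidate for each byte by its low nibble, then confirm it.
            Vector128<byte> candidates = Vector128.Shuffle(valuesByLowNibble, input & lowNibbleMask);
            uint matches = Vector128.Equals(input, candidates).ExtractMostSignificantBits();

            if (matches != 0)
            {
                return i + BitOperations.TrailingZeroCount(matches);
            }
        }

        // Scalar tail using the same table.
        for (; i < source.Length; i++)
        {
            if (valuesByLowNibble.GetElement(source[i] & 0xF) == source[i])
            {
                return i;
            }
        }

        return -1;
    }
}
```

For the HTML set used in the benchmark below, the table can be filled with `table[v & 0xF] = v` for each value, since that set contains 0 itself and slot 0 therefore holds a real search value.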


Consider a benchmark inspired by @lemire's https://lemire.me/blog/2024/07/05/scan-html-faster-with-simd-instructions-net-c-edition/
In this case, we're scanning UTF8 input for bytes relevant to HTML (<, &, \r and \0).
Previously, SearchValues would pick the same implementation as span.IndexOfAny(4 values).
The blog post highlights that a hand-written approach can beat SearchValues in this case -- not anymore :)

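using System;
using System.Buffers;
using System.Text;
using BenchmarkDotNet.Attributes;

public class Bench
{
    // The four bytes relevant to HTML scanning; their low nibbles (0x0, 0xD, 0x6, 0xC) are all different.
    private static readonly SearchValues<byte> s_searchValues = SearchValues.Create("\0\r&<"u8);
    // 10,000 bytes without a match, so every call scans the whole input.
    private static readonly byte[] s_bytes = Encoding.ASCII.GetBytes(new string('x', 10_000));

    [Benchmark] public int FindHtmlChar() => s_bytes.AsSpan().IndexOfAny(s_searchValues);
}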

This approach doubles the searching performance on my AVX2 CPU (Ryzen 1700).
On ARM, it's a 1.6X improvement.

| Method | Toolchain | Mean | Ratio |
|---|---|---|---|
| FindHtmlChar | main | 605.7 ns | 1.00 |
| FindHtmlChar | pr | 304.5 ns | 0.50 |

Compared to the implementation for an arbitrary ASCII set, this improves throughput between 1.2x and 1.5x depending on the hardware (see more numbers below).


The UniqueLowNibble approach could be used a lot more aggressively (see benchmarks below).
I conservatively placed it between 3 and 4 values to minimize the risk of regressions for now.
In practice, we're currently only using SearchValues with 4 or more values across runtime/aspnet.

As a follow up, I plan on changing our heuristics around which approach we pick in SearchValues depending on the platform.
After that, we may want to consider using it even with fewer values (e.g. 2 or 3).

We should also consider using PackedSpanHelpers on ARM.
Because we currently don't, searching chars for any subset of ASCII is faster than a basic IndexOf('a') on M1 hardware.


Timings for scanning through 10k elements (10k bytes or 10k chars); lower is better.
In each table, byte methods are listed first, then char methods. Within each group, rows are ordered roughly from fastest to slowest.

ARM (Apple M1)

| Method | Mean | Error |
|---|---|---|
| IndexOfAny1Byte | 233.2 ns | 0.24 ns |
| IndexOfAnyByteInRange | 253.7 ns | 0.24 ns |
| IndexOfAny2Byte | 274.4 ns | 0.14 ns |
| IndexOfAnyUniqueLowNibbleByte | 275.7 ns | 0.83 ns |
| IndexOfAnyAsciiByte | 346.5 ns | 0.02 ns |
| IndexOfAny3Byte | 346.8 ns | 0.11 ns |
| IndexOfAny4Byte | 444.5 ns | 0.03 ns |
| IndexOfAnyByte | 541.7 ns | 0.05 ns |
| IndexOfAny5Byte | 542.2 ns | 0.05 ns |
| IndexOfAnyUniqueLowNibbleChar | 351.4 ns | 0.49 ns |
| IndexOfAnyAsciiChar | 448.2 ns | 0.41 ns |
| IndexOfAny1Char | 453.2 ns | 0.23 ns |
| IndexOfAnyInRange | 497.6 ns | 0.31 ns |
| IndexOfAny2Chars | 543.2 ns | 0.37 ns |
| IndexOfAny3Chars | 688.8 ns | 0.19 ns |
| IndexOfAny4Chars | 884.2 ns | 0.21 ns |
| IndexOfAny5Chars | 1,079.5 ns | 0.10 ns |

ARM (Azure D8plsv5 VM)

| Method | Mean | Error |
|---|---|---|
| IndexOfAny1Byte | 493.0 ns | 0.04 ns |
| IndexOfAnyByteInRange | 544.2 ns | 2.92 ns |
| IndexOfAny2Byte | 636.7 ns | 6.03 ns |
| IndexOfAnyUniqueLowNibbleByte | 664.5 ns | 4.58 ns |
| IndexOfAny3Byte | 851.6 ns | 7.29 ns |
| IndexOfAnyAsciiByte | 853.6 ns | 5.32 ns |
| IndexOfAny4Byte | 1,067.7 ns | 8.85 ns |
| IndexOfAny5Byte | 1,292.5 ns | 11.32 ns |
| IndexOfAnyByte | 1,309.2 ns | 10.58 ns |
| IndexOfAny1Char | 979.7 ns | 0.08 ns |
| IndexOfAnyInRange | 1,075.4 ns | 4.08 ns |
| IndexOfAnyUniqueLowNibbleChar | 1,088.8 ns | 53.19 ns |
| IndexOfAny2Chars | 1,279.2 ns | 13.17 ns |
| IndexOfAnyAsciiChar | 1,316.2 ns | 0.91 ns |
| IndexOfAny3Chars | 1,702.1 ns | 14.53 ns |
| IndexOfAny4Chars | 2,135.7 ns | 17.84 ns |
| IndexOfAny5Chars | 2,578.2 ns | 21.88 ns |

X64 with Vector256 (i9-10900X - no full Avx512)

| Method | Mean | Error |
|---|---|---|
| IndexOfAny1Byte | 164.1 ns | 2.56 ns |
| IndexOfAnyUniqueLowNibbleByte | 163.8 ns | 0.53 ns |
| IndexOfAnyByteInRange | 200.0 ns | 1.26 ns |
| IndexOfAny2Byte | 214.8 ns | 2.16 ns |
| IndexOfAny3Byte | 216.4 ns | 1.80 ns |
| IndexOfAny4Byte | 227.1 ns | 1.27 ns |
| IndexOfAnyAsciiByte | 248.0 ns | 2.47 ns |
| IndexOfAny5Byte | 252.0 ns | 0.75 ns |
| IndexOfAnyByte | 361.8 ns | 1.34 ns |
| IndexOfAny1PackedChar | 209.1 ns | 0.23 ns |
| IndexOfLetterIgnoreCase | 199.4 ns | 1.92 ns |
| IndexOfAnyUniqueLowNibbleChar | 218.3 ns | 0.25 ns |
| IndexOfAny2PackedChars | 231.7 ns | 2.57 ns |
| IndexOfTwoLettersIgnoreCase | 243.4 ns | 2.00 ns |
| IndexOfAny3PackedChars | 248.4 ns | 2.82 ns |
| IndexOfAnyInRangePacked | 248.7 ns | 2.49 ns |
| IndexOfAnyAsciiChar | 287.2 ns | 0.38 ns |
| IndexOfAny1Char | 304.3 ns | 3.55 ns |
| IndexOfAnyInRange | 395.7 ns | 3.20 ns |
| IndexOfAny2Chars | 416.4 ns | 5.64 ns |
| IndexOfAny3Chars | 410.5 ns | 3.66 ns |
| IndexOfAny4Chars | 440.0 ns | 3.07 ns |
| IndexOfAny5Chars | 496.1 ns | 1.85 ns |

X64 with Vector256 (Ryzen 1700)

| Method | Mean | Error |
|---|---|---|
| IndexOfAny1Byte | 241.3 ns | 1.51 ns |
| IndexOfAnyUniqueLowNibbleByte | 279.0 ns | 1.54 ns |
| IndexOfAnyByteInRange | 368.5 ns | 1.80 ns |
| IndexOfAny2Byte | 369.9 ns | 1.89 ns |
| IndexOfAny3Byte | 447.2 ns | 2.03 ns |
| IndexOfAnyAsciiByte | 455.7 ns | 2.62 ns |
| IndexOfAny4Byte | 557.9 ns | 1.79 ns |
| IndexOfAny5Byte | 640.4 ns | 3.19 ns |
| IndexOfAnyByte | 655.3 ns | 3.58 ns |
| IndexOfAny1PackedChar | 280.7 ns | 1.48 ns |
| IndexOfAnyUniqueLowNibbleChar | 363.0 ns | 1.94 ns |
| IndexOfAny2PackedChars | 365.7 ns | 1.99 ns |
| IndexOfLetterIgnoreCase | 369.2 ns | 1.98 ns |
| IndexOfAnyInRangePacked | 375.1 ns | 1.27 ns |
| IndexOfAny3PackedChars | 448.3 ns | 2.02 ns |
| IndexOfTwoLettersIgnoreCase | 459.5 ns | 2.24 ns |
| IndexOfAny1Char | 461.1 ns | 1.85 ns |
| IndexOfAnyAsciiChar | 545.5 ns | 18.02 ns |
| IndexOfAnyInRange | 718.8 ns | 4.14 ns |
| IndexOfAny2Chars | 734.8 ns | 2.81 ns |
| IndexOfAny3Chars | 922.0 ns | 2.75 ns |
| IndexOfAny4Chars | 1,091.1 ns | 5.75 ns |
| IndexOfAny5Chars | 1,254.2 ns | 7.16 ns |

X64 with Vector512 (Xeon Platinum 8370C)

| Method | Mean | Error |
|---|---|---|
| IndexOfAny1Byte | 99.20 ns | 0.811 ns |
| IndexOfAny2Byte | 186.23 ns | 0.157 ns |
| IndexOfAny3Byte | 236.63 ns | 0.228 ns |
| IndexOfAnyByteInRange | 253.85 ns | 0.279 ns |
| IndexOfAnyUniqueLowNibbleByte | 273.06 ns | 3.011 ns |
| IndexOfAny4Byte | 312.36 ns | 0.102 ns |
| IndexOfAnyAsciiByte | 346.18 ns | 2.557 ns |
| IndexOfAny5Byte | 363.69 ns | 0.160 ns |
| IndexOfAnyByte | 422.75 ns | 1.270 ns |
| IndexOfAny1PackedChar | 165.53 ns | 3.280 ns |
| IndexOfAnyInRangePacked | 168.52 ns | 2.998 ns |
| IndexOfLetterIgnoreCase | 170.30 ns | 2.769 ns |
| IndexOfAny1Char | 194.79 ns | 0.097 ns |
| IndexOfAny2PackedChars | 217.14 ns | 0.150 ns |
| IndexOfTwoLettersIgnoreCase | 239.79 ns | 0.205 ns |
| IndexOfAny3PackedChars | 268.56 ns | 0.217 ns |
| IndexOfAnyUniqueLowNibbleChar | 271.63 ns | 1.392 ns |
| IndexOfAnyAsciiChar | 327.17 ns | 1.021 ns |
| IndexOfAny2Chars | 366.79 ns | 0.087 ns |
| IndexOfAny3Chars | 468.80 ns | 0.093 ns |
| IndexOfAnyInRange | 500.94 ns | 0.521 ns |
| IndexOfAny4Chars | 621.20 ns | 0.209 ns |
| IndexOfAny5Chars | 723.78 ns | 0.212 ns |

@MihaZupan MihaZupan added this to the 10.0.0 milestone Aug 23, 2024
@MihaZupan MihaZupan self-assigned this Aug 23, 2024
Contributor

Tagging subscribers to this area: @dotnet/area-system-memory
See info in area-owners.md if you want to be subscribed.

{
    // Avoid false positives for the zero character if no other character has a low nibble of zero.
    // We can replace it with any other byte that has a non-zero low nibble.
    valuesByLowNibble.SetElementUnsafe(0, (byte)1);
Member

I didn't fully grok this. Why don't we need to check if 1 is already being used?

Member Author

@MihaZupan MihaZupan Sep 5, 2024

All vector elements start out as 0, and not all of them may be initialized.

We map every input character to an element based on its lower nibble.

 0, 16, 32 ... => valuesByLowNibble[0]
 1, 17, 33 ... => valuesByLowNibble[1]
15, 31, 47 ... => valuesByLowNibble[15] 

The search works by first picking a potential match based on the low nibble (Shuffle) and then confirming it (Equals).

This means that input characters with a given low nibble only care about the element of valuesByLowNibble for that nibble. Values like 1 or 2 don't care about what the value of valuesByLowNibble[7] is since they'll never be mapped to it.

This also means that it's okay for valuesByLowNibble to be left uninitialized at 0.
The Equals could only match for an input character 0, but those will always be mapped to valuesByLowNibble[0] by the shuffle instead.

The edge case is the 0th nibble since the character 0 could be a false positive there.
But it'll only be a false positive if we don't have the character 0 in our values.
That's the valuesByLowNibble.GetElement(0) == 0 && !lookup.Contains(0) check above.

To avoid false positives for 0, we can use the same trick of setting the element to some "unreachable" value.
We can use any value with a non-zero nibble, as the shuffle will map any inputs with those values to a different element. 1 is just an arbitrary choice.
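A scalar model of the reasoning above may make the edge case easier to follow. This is a simplified sketch with hypothetical names (`LowNibbleTable`, `Build`, `Contains`), not the code in the PR, and it assumes ASCII values with pairwise distinct low nibbles.

```csharp
using System;

// Scalar model of the table construction discussed above (hypothetical names, not the PR's code).
public static class LowNibbleTable
{
    // Assumes ASCII values with pairwise distinct low nibbles.
    public static byte[] Build(ReadOnlySpan<byte> values)
    {
        var table = new byte[16]; // unused slots start at 0
        bool containsZero = false;

        foreach (byte value in values)
        {
            table[value & 0xF] = value;
            containsZero |= value == 0;
        }

        // Slot 0 is the only slot an input byte of 0 can reach. If it still holds 0
        // but 0 is not a search value, park an unreachable value there instead.
        if (table[0] == 0 && !containsZero)
        {
            table[0] = 1; // low nibble 1, so an input of 1 is compared to slot 1, never slot 0
        }

        return table;
    }

    // Shuffle (pick by low nibble) followed by Equals (confirm).
    public static bool Contains(byte[] table, byte b) => table[b & 0xF] == b;
}
```

With the values `\r`, `&` and `<`, slot 0 would otherwise stay 0 and an input byte of 0 would be reported as a match. After the replacement, `Contains(table, 0)` compares 0 against 1 and `Contains(table, 1)` compares 1 against the untouched slot 1, so neither matches.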

Edit: I tweaked the comment a bit; hopefully it's decipherable now.

@MihaZupan MihaZupan merged commit b06d5e2 into dotnet:main Sep 10, 2024
146 of 148 checks passed
jtschuster pushed a commit to jtschuster/runtime that referenced this pull request Sep 17, 2024
…otnet#106900)

* Add SearchValues implementation for values with unique low nibbles

* More generics

* Tweak comment

* Remove extra empty line

* Update comment