-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve vectorization of String.Split #64899
Conversation
- Implement Vector265 for longer strings - Simplify the Vector code and use new cross-platform intrinsic API - Use ref _firstChar instead of ref MemoryMarshal.GetReference(this.AsSpan()); - Use unsigned check for separators.Length so that two redundant range checks are optimized away
I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label. |
@@ -1637,27 +1637,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild | |||
} | |||
|
|||
// Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator. | |||
else if (separators.Length <= 3) | |||
else if (separators.Length <= 3u) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it affect codegen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it got rid of redundant range checks for separators
, doing (uint)separators.Length <= (uint)3
is one movsxd less, but I personally thought this was cleaner. However, I can see it being too obscure with it's intent.
Vector256<ushort> vector = Vector256.LoadUnsafe(ref source, (uint)i); | ||
Vector256<ushort> cmp = Vector256.Equals(vector, v1) | Vector256.Equals(vector, v2) | Vector256.Equals(vector, v3); | ||
|
||
uint mask = cmp.AsByte().ExtractMostSignificantBits() & 0b0101010101010101; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be a good idea to also use TestZ for faster out, e.g.
if (cmp == Vector256<ushort>.Zero)
continue;
it's faster than movmsk
{ | ||
sepListBuilder.Append(idx); | ||
sepListBuilder.Append(i + BitOperations.TrailingZeroCount(mask) / 2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{ | ||
if ((lowBits & 0xF) != 0) | ||
Vector256<ushort> vector = Vector256.LoadUnsafe(ref source, (uint)i); | ||
Vector256<ushort> cmp = Vector256.Equals(vector, v1) | Vector256.Equals(vector, v2) | Vector256.Equals(vector, v3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider splitting this to temps for better pipelining so all compare instructions will be next to each other and so are ORs
|
||
for (int idx = i; lowBits != 0; idx++) | ||
int vector256ShortCount = Vector256<ushort>.Count; | ||
for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider processing trailing elements via overlapping instead of scalar fallback
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a risk though that the code will start getting a bit complicated, I wanted to keep the code easy to follow since it's only used for a specific scenario. If you still think it's worth it, I can definitely look into it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handling trailing elements in the same loop (or via a spilled iteration) shows nice improvements for small-medium sized inputs, in theory it only adds an additional check inside the loop, feel free to keep it as is, we can then follow up
|
||
for (int idx = i; lowBits != 0; idx++) | ||
int vector256ShortCount = Vector256<ushort>.Count; | ||
for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(i + vector256ShortCount) <= Length
might overflow, it should be
i <= Length - vector256ShortCount
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides that the i <= len - count
version can keep the len - count
in a register, whilst i + count
needs a repeated addition.
Also local vector256ShortCount
isn't needed, as JIT will treat Vector256<ushort>.Count
as constant.
src/libraries/System.Private.CoreLib/src/System/Collections/Generic/ValueListBuilder.cs
Show resolved
Hide resolved
Vector128<ushort> v3 = Vector128.Create((ushort)c3); | ||
|
||
ref char c0 = ref MemoryMarshal.GetReference(this.AsSpan()); | ||
int cond = Length & -Vector128<ushort>.Count; | ||
int i = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int
-> nint
, it will help to avoid redundant sign extensions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same variable is used as index to the scalar/non vectorized version at the bottom. I'll see if I can find a middle-way
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can always cast it to signed just once before the scalar version
|
||
for (int idx = i; lowBits != 0; idx++) | ||
int vector256ShortCount = Vector256<ushort>.Count; | ||
for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides that the i <= len - count
version can keep the len - count
in a register, whilst i + count
needs a repeated addition.
Also local vector256ShortCount
isn't needed, as JIT will treat Vector256<ushort>.Count
as constant.
int vector128ShortCount = Vector128<ushort>.Count; | ||
for (; (i + vector128ShortCount) <= Length; i += vector128ShortCount) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
int vector128ShortCount = Vector128<ushort>.Count; | |
for (; (i + vector128ShortCount) <= Length; i += vector128ShortCount) | |
for (; i <= Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count) |
When i
is of type nint
just check if the comparison doesn't introduce any sign extensions -- please double check to be on the safe side.
@EgorBo @gfoidl Thanks for the good tips and feedback, I updated the pull request accordingly. The benchmark numbers for 256 bit is much more realistic now. It looks to be much closer to the 128 bit version now. Please let me know your opinion, and sorry for the mistake |
It's the current numbers in the PR's description? |
@gfoidl Yes those are the latest numbers. I can remove the 256bit path it if you want |
Are any of these tests for really long inputs containing very few separators? |
… in ValueListBuilder. Improve sequential iteration
Vector256<byte> cmp = (vector1 | vector2 | vector3).AsByte(); | ||
Vector256<ushort> v1 = Vector256.Create((ushort)c); | ||
Vector256<ushort> v2 = Vector256.Create((ushort)c2); | ||
Vector256<ushort> v3 = Vector256.Create((ushort)c3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, am just curios if our guidelines allow to use var
here, the type of vector should be pretty obvious from the expression on the right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The guidelines say var should only be used for a ctor or explicit cast. While it's arguable that Create is equivalent to a ctor, there's nothing that requires it to return the same type it's declared on, and in fact there are cases where Create methods don't, e.g. File.Create
.
Sorry for my delay, I decided to rewrite the 256 bit spilling and made some improvements for the scalar loop. Here's a gist of a much bigger bechmark suite: https://gist.github.com/yesmey/2e7a7868bb10043553b78d77cbc3f2b8 benchmark code for gistpublic class Benchmarks
{
private static string _testStr;
private static System.Text.StringBuilder st;
private static char[][] _testChar = new char[3][];
static Benchmarks()
{
st = new System.Text.StringBuilder(5_000_000);
_testChar[0] = new char[1] { ' ' };
_testChar[1] = new char[2] { ' ', 't' };
_testChar[2] = new char[3] { ' ', 't', 'f' };
}
private static string BuildStr(char c, int stringLength, int sepFreq, char sep)
{
for (int i = 0; i < stringLength; i++)
{
if (i % sepFreq == 0)
{
st.Append(sep);
}
else { st.Append(c); }
}
string t = st.ToString();
st.Clear();
return t;
}
[GlobalSetup]
public void Init()
{
_testStr = BuildStr('a', Size, SepFreq, _testChar[2][SplitCount - 1]);
}
[Params(16, 64, 200, 1000, 10000)]
public int Size { get; set; }
[Params(1, 2, 5, 200)]
public int SepFreq { get; set; }
[Params(1, 2, 3)]
public int SplitCount { get; set; }
[Benchmark]
public string[] Split()
{
return _testStr.Split(_testChar[SplitCount - 1]);
}
} Updated numbers from previous benchmarks: csv + dotnet/performanceBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
[Host] : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
Job-AJDBJE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-XTZHCY : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
[Host] : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
Job-AJDBJE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-XTZHCY : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
There seems to be regressions on the strings with no split chars in them |
Tagging subscribers to this area: @dotnet/area-system-runtime Issue DetailsThis pull request aims to simplify and improve upon the current vectorized fast path of string.Split. Changes include:
For benchmark testing I tried to use both the csv parsing in #38001 and the benchmark referenced in #51259 BenchmarksBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1466 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
[Host] : .NET 7.0.0 (7.0.21.56701), X64 RyuJIT
Job-OHGYOD : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Toolchain=CoreRun
Benchmark code[DisassemblyDiagnoser]
public class CsvBenchmarks
{
private string[] _strings;
public IEnumerable<string> CorpusList()
{
yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
}
[ParamsSource("CorpusList")]
public string CorpusUri { get; set; }
[GlobalSetup]
public void Setup()
{
_strings = GetStringsFromCorpus().GetAwaiter().GetResult();
}
private async Task<string[]> GetStringsFromCorpus()
{
using var client = new HttpClient();
using var response = await client.GetAsync(CorpusUri);
response.EnsureSuccessStatusCode();
var body = await response.Content.ReadAsStringAsync();
List<string> lines = new();
StringReader reader = new StringReader(body);
string? line;
while ((line = reader.ReadLine()) != null)
{
lines.Add(line);
}
return lines.ToArray();
}
[Benchmark]
public string[]? SplitCsv()
{
string[]? split = null;
string[] lines = _strings;
for (int i = 0; i < lines.Length; i++)
{
split = lines[i].Split(',');
}
return split;
}
}
[DisassemblyDiagnoser]
public class RegressionBenchmark
{
[Benchmark]
[Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", ' ')]
[Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ' ')]
public string[] SplitArray(string s, char chr)
=> s.Split(chr);
}
public class Program
{
public static void Main(string[] args)
{
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}
} Related to #51259
|
Status update: I can't get the 256 bit version to perform well on lower-mid ranges because of the saving/restore overhead of registers due to the nested calls inside ValueListBuilder.Append. The 256 bit assembly currently looks like this: https://gist.github.com/yesmey/7786c102927cf8e9abf966cf44a35484, and as you can tell there's a lot of initial vmovaps just for the potential call of Grow in AddWithResize. Just to prove my point, I commented out Grow inside AddWithResize for comparison here. So since I'm not getting any further there, I'm thinking maybe giving up on the 256 bit and keep the 128 bit version, which is on par in performance, just to have an implementation for arm |
that's ok, we try to use AVX only where it's definitely profitable. |
benchmarks for commit dcadf05 CSV parsing benchmarksBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
[Host] : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
dotnet/performance benchmarksBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
[Host] : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
partial 38001 issue benchmark suiteBenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
[Host] : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
benchmark sourceusing BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
public class CsvBenchmarks
{
private string[] _strings;
public IEnumerable<string> CorpusList()
{
// only these three urls still return any result
yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
}
[ParamsSource("CorpusList")]
public string CorpusUri { get; set; }
[GlobalSetup]
public void Setup()
{
_strings = GetStringsFromCorpus().GetAwaiter().GetResult();
}
private async Task<string[]> GetStringsFromCorpus()
{
using var client = new HttpClient();
using var response = await client.GetAsync(CorpusUri);
response.EnsureSuccessStatusCode();
var body = await response.Content.ReadAsStringAsync();
List<string> lines = new();
StringReader reader = new StringReader(body);
string? line;
while ((line = reader.ReadLine()) != null)
{
lines.Add(line);
}
return lines.ToArray();
}
[Benchmark]
public string[]? SplitCsv()
{
string[]? split = null;
string[] lines = _strings;
for (int i = 0; i < lines.Length; i++)
{
split = lines[i].Split(',');
}
return split;
}
}
public class RegressionBenchmark
{
[Benchmark]
[Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", ' ')]
[Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ' ')]
public string[] SplitChar(string s, char chr)
=> s.Split(chr);
[Benchmark]
[Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", new char[] { ' ' }, StringSplitOptions.None)]
[Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)]
[Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", new char[]{' '}, StringSplitOptions.None)]
[Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", new char[]{' '}, StringSplitOptions.RemoveEmptyEntries)]
public string[] Split(string s, char[] arr, StringSplitOptions options)
=> s.Split(arr, options);
}
public class Benchmarks
{
private static string _testStr;
private static System.Text.StringBuilder st;
private static char[][] _testChar = new char[3][];
static Benchmarks()
{
st = new System.Text.StringBuilder(5_000_000);
_testChar[0] = new char[1] { ' ' };
_testChar[1] = new char[3] { ' ', 't', 'f' };
}
private static string BuildStr(char c, int stringLength, int sepFreq, char sep)
{
for (int i = 0; i < stringLength; i++)
{
if (i % sepFreq == 0)
{
st.Append(sep);
}
else { st.Append(c); }
}
string t = st.ToString();
st.Clear();
return t;
}
[GlobalSetup]
public void Init()
{
_testStr = BuildStr('a', Size, SepFreq, _testChar[1][SplitCount - 1]);
}
[Params(16, 200, 1000, 10000)]
public int Size { get; set; }
[Params(1, 5, 200)]
public int SepFreq { get; set; }
[Params(1, 2)]
public int SplitCount { get; set; }
[Benchmark]
public string[] Split()
{
return _testStr.Split(_testChar[SplitCount - 1]);
}
}
public class Program
{
public static void Main(string[] args)
{
BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}
} |
@EgorBo @stephentoub @gfoidl is your feedback addressed ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danmoseley I had another look, when these points are addressed I'm happy with the PR 😄.
@@ -1609,14 +1609,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild | |||
} | |||
|
|||
// Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator. | |||
else if (separators.Length <= 3) | |||
else if ((uint)separators.Length <= (uint)3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cast still needed?
AFAIK JIT recognizes this now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gfoidl do you mean we no longer need the pattern if ((uint)index > (uint)array.Length)
that we have everywhere in the tree? If so we should have an issue to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great! Glad it got fixed, I'll get rid of it
// Redundant test so we won't prejit remainder of this method | ||
// on platforms without SSE. | ||
if (!Sse41.IsSupported) | ||
if (!Vector128.IsHardwareAccelerated) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the comment from the previous version (left side of comparison) to make it clear that this check is needed to avoid prejit.
Otherwise a Debug.Assert(Vector128.IsHardwareAccelerated)
could do it too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll reintroduce the comment with a small text change since it's not limited to only SSE anymore
int i = 0; | ||
|
||
for (; i < cond; i += Vector128<ushort>.Count) | ||
while (offset <= lengthToExamine - (nuint)Vector128<ushort>.Count) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Above is L1618 we guard by Vector128<ushort>.Count * 2
, so when reaching this point, we know that there are for sure enough elements available. This this check isn't need at this point. So you could change the loop to da do-while
loop. Thus the first iteration is without any (further) pre-condition, and after the iteration the check for more available elements is done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I'll add a Debug.Assert on entry to make it a little more obvious to the reader that its a precondition
{ | ||
char curr = Unsafe.Add(ref c0, (IntPtr)(uint)i); | ||
char curr = (char)Unsafe.Add(ref source, (nint)offset); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
char curr = (char)Unsafe.Add(ref source, (nint)offset); | |
char curr = (char)Unsafe.Add(ref source, offset); |
Not needed, there's an overload for nuint
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I must've missed that it got added, thanks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one question -- otherwise LGTM.
@@ -1615,8 +1615,7 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild | |||
sep0 = separators[0]; | |||
sep1 = separators.Length > 1 ? separators[1] : sep0; | |||
sep2 = separators.Length > 2 ? separators[2] : sep1; | |||
|
|||
if (Length >= 16 && Sse41.IsSupported) | |||
if (Vector128.IsHardwareAccelerated && Length >= Vector128<ushort>.Count * 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to double-check: the * 2
is intentional as perf-numbers showed that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes exactly, smaller strings doesn't perform as well
Improvement on win-x64 dotnet/perf-autofiling-issues#4291 |
Nice drop in that graph @yesmey . Do you plan to do more of this kind of work? |
This pull request aims to simplify and improve upon the current vectorized fast path of string.Split.
Changes include:
For benchmark testing I tried to use both the csv parsing in #38001 and the benchmark referenced in #51259
The benchmarks include both 256 bit and 128 bit versions (sse/avx). Unfortunately I have not been able to benchmark any other platforms than x86_64
Benchmarks
Benchmark code
Related to #51259