Improve vectorization of String.Split #64899

yesmey · 2022-02-07T12:46:13Z

This pull request aims to simplify and improve upon the current vectorized fast path of string.Split.

Changes include:

Replace specialized SSE4.1 instructions with the new cross-platform intrinsic API
Add 256-bit instructions for longer strings
Improve the common path of Append in ValueListBuilder
- I haven't made an explicit benchmark for this, but you can compare the assembly output here: before after

For benchmarking, I tried to use both the CSV parsing in #38001 and the benchmark referenced in #51259.
The benchmarks cover both the 128-bit and 256-bit versions (SSE/AVX). Unfortunately, I have not been able to benchmark any platform other than x86_64.

Benchmarks

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1466 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-alpha.1.21568.2
  [Host]     : .NET 7.0.0 (7.0.21.56701), X64 RyuJIT
  Job-OHGYOD : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Toolchain=CoreRun

Method	CorpusUri	Mean	Error	StdDev
SplitCsv main	http(...).csv [107]	11.71 μs	0.224 μs	0.210 μs
SplitCsv Vector128	http(...).csv [107]	9.915 μs	0.0225 μs	0.0175 μs
SplitCsv Vector256	http(...).csv [107]	9.768 μs	0.1745 μs	0.2612 μs
SplitCsv main	https(...)e.csv [50]	69.22 μs	0.182 μs	0.170 μs
SplitCsv Vector128	https(...)e.csv [50]	65.784 μs	0.0870 μs	0.0772 μs
SplitCsv Vector256	https(...)e.csv [50]	58.383 μs	0.3115 μs	0.2914 μs
SplitCsv main	https(...)A.csv [77]	354.65 μs	2.050 μs	1.712 μs
SplitCsv Vector128	https(...)A.csv [77]	311.968 μs	0.8166 μs	0.6376 μs
SplitCsv Vector256	https(...)A.csv [77]	319.919 μs	2.0044 μs	1.8749 μs

Method	s	chr	Mean	Error	StdDev
SplitArray main	A B C(...)X Y Z [51]	' '	291.20 ns	5.879 ns	12.655 ns
SplitArray Vector128	A B C(...)X Y Z [51]	' '	270.11 ns	2.206 ns	2.063 ns
SplitArray Vector256	A B C(...)X Y Z [51]	' '	271.53 ns	3.810 ns	3.377 ns
SplitArray main	ABCDE(...)VWXYZ [26]	' '	19.57 ns	0.180 ns	0.151 ns
SplitArray Vector128	ABCDE(...)VWXYZ [26]	' '	18.82 ns	0.082 ns	0.077 ns
SplitArray Vector256	ABCDE(...)VWXYZ [26]	' '	19.35 ns	0.059 ns	0.052 ns

Benchmark code

[DisassemblyDiagnoser]
public class CsvBenchmarks
{
    private string[] _strings;

    public IEnumerable<string> CorpusList()
    {
        yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
        yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
        yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
    }

    [ParamsSource("CorpusList")]
    public string CorpusUri { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _strings = GetStringsFromCorpus().GetAwaiter().GetResult();
    }

    private async Task<string[]> GetStringsFromCorpus()
    {
        using var client = new HttpClient();
        using var response = await client.GetAsync(CorpusUri);
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadAsStringAsync();

        List<string> lines = new();

        StringReader reader = new StringReader(body);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }

        return lines.ToArray();
    }

    [Benchmark]
    public string[]? SplitCsv()
    {
        string[]? split = null;
        string[] lines = _strings;
        for (int i = 0; i < lines.Length; i++)
        {
            split = lines[i].Split(',');
        }
        return split;
    }   
}

[DisassemblyDiagnoser]
public class RegressionBenchmark
{
    [Benchmark]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", ' ')]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ' ')]
    public string[] SplitArray(string s, char chr)
        => s.Split(chr);
}

public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }
}

Related to #51259

- Implement Vector256 for longer strings
- Simplify the Vector code and use new cross-platform intrinsic API
- Use ref _firstChar instead of ref MemoryMarshal.GetReference(this.AsSpan());
- Use unsigned check for separators.Length so that two redundant range checks are optimized away

dotnet-issue-labeler · 2022-02-07T12:46:19Z

I couldn't figure out the best area label to add to this PR. If you have write-permissions please help me learn by adding exactly one area label.

EgorBo · 2022-02-07T13:07:23Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

@@ -1637,27 +1637,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
            }

            // Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator.
-            else if (separators.Length <= 3)
+            else if (separators.Length <= 3u)


does it affect codegen?

Yes, it got rid of the redundant range checks on separators. Writing (uint)separators.Length <= (uint)3 saves one more movsxd, but I personally thought this form was cleaner. However, I can see it being too obscure in its intent.
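For readers following the codegen discussion, here is a minimal sketch of the three comparison forms mentioned above. The class and method names are hypothetical and nothing here is taken from the PR; it only illustrates how C# types each comparison. Note that the mixed int/uint form widens both operands to long, so it behaves differently from the explicit cast for negative inputs (which a span length never produces).

```csharp
using System;

public static class RangeCheckSketch
{
    // Two-sided check: both bounds are tested explicitly.
    public static bool InRangeSigned(int length) => length >= 0 && length <= 3;

    // Casting to uint maps negative values to values above int.MaxValue,
    // so a single unsigned comparison covers both bounds.
    public static bool InRangeUnsigned(int length) => (uint)length <= 3u;

    // Comparing an int with a uint literal: C# promotes both operands to long.
    // That widening is the likely source of the extra movsxd mentioned above.
    // Unlike the explicit cast, this form accepts negative inputs.
    public static bool InRangeMixed(int length) => length <= 3u;
}
```

Whether the JIT actually drops range checks for a given form is best confirmed by inspecting the disassembly, as was done in this thread.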

EgorBo · 2022-02-07T13:15:34Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

+                    Vector256<ushort> vector = Vector256.LoadUnsafe(ref source, (uint)i);
+                    Vector256<ushort> cmp = Vector256.Equals(vector, v1) | Vector256.Equals(vector, v2) | Vector256.Equals(vector, v3);
+
+                    uint mask = cmp.AsByte().ExtractMostSignificantBits() & 0b0101010101010101;


It might be a good idea to also use TestZ for a faster early out, e.g.

if (cmp == Vector256<ushort>.Zero) continue;

it's faster than movmsk
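A minimal sketch of the early-out idea, assuming the .NET 7 cross-platform Vector128 API. This is not the PR's code: the class name, the single-separator signature, and the use of a List<int> are all simplifications made for illustration, and it extracts one bit per 16-bit lane instead of the byte-wise mask used in the diff above.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class EarlyOutSketch
{
    // Returns the index of every occurrence of `separator` in `text`.
    public static List<int> FindSeparators(ReadOnlySpan<char> text, char separator)
    {
        var result = new List<int>();
        ReadOnlySpan<ushort> source = MemoryMarshal.Cast<char, ushort>(text);
        Vector128<ushort> needle = Vector128.Create((ushort)separator);

        int i = 0;
        if (Vector128.IsHardwareAccelerated)
        {
            for (; i <= source.Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)
            {
                Vector128<ushort> block = Vector128.Create(source.Slice(i, Vector128<ushort>.Count));
                Vector128<ushort> cmp = Vector128.Equals(block, needle);

                // Early out: skip the mask extraction when no lane matched.
                if (cmp == Vector128<ushort>.Zero)
                {
                    continue;
                }

                // One bit per 16-bit lane; bit k set means source[i + k] matched.
                uint mask = cmp.ExtractMostSignificantBits();
                while (mask != 0)
                {
                    result.Add(i + BitOperations.TrailingZeroCount(mask));
                    mask &= mask - 1; // clear the lowest set bit
                }
            }
        }

        // Scalar loop for whatever the vector loop did not cover.
        for (; i < source.Length; i++)
        {
            if (source[i] == separator)
            {
                result.Add(i);
            }
        }

        return result;
    }
}
```

The early out pays off mostly when separators are sparse; for inputs where nearly every block contains a match it adds a branch without saving work, so it is worth measuring both cases.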

EgorBo · 2022-02-07T13:18:43Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

                    {
-                        sepListBuilder.Append(idx);
+                        sepListBuilder.Append(i + BitOperations.TrailingZeroCount(mask) / 2);


use cast to uint here, e.g. https://sharplab.io/#v2:EYLgtghglgdgNAFxBAzmAPgAQEwEYCwAUJgAwAEmuAdAHICuYApgE5QDGKA3EUZgMwVsZAMJEA3kTJSKA2AjIBZXAAo6csgA8AlJOkTC0wxQDsZAEJQEAeQAOLCAigB7GCioAVZtAA2sAOYAWixOwk50MAjK2mQA9GTY3AbSAL66UmkyZOoK2Krq2hn6RtKYpspyWsp5EVoW1nZeji5unj7+QcwhYRFRWrHxWomGqYTJQA==
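A small sketch of what the suggested cast changes, with hypothetical names. Signed division by two has to round toward zero for negative inputs, so the JIT may emit a sign fix-up unless it can prove the value is non-negative; the unsigned form can be a plain shift. Both forms return the same value here because TrailingZeroCount never returns a negative number.

```csharp
using System;
using System.Numerics;

public static class HalvingSketch
{
    // Signed division: may need extra instructions to round toward zero.
    public static int OffsetSigned(uint mask)
        => BitOperations.TrailingZeroCount(mask) / 2;

    // Unsigned division by two: can be emitted as a single shift.
    public static int OffsetUnsigned(uint mask)
        => (int)((uint)BitOperations.TrailingZeroCount(mask) / 2);
}
```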

EgorBo · 2022-02-07T13:20:24Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

                {
-                    if ((lowBits & 0xF) != 0)
+                    Vector256<ushort> vector = Vector256.LoadUnsafe(ref source, (uint)i);
+                    Vector256<ushort> cmp = Vector256.Equals(vector, v1) | Vector256.Equals(vector, v2) | Vector256.Equals(vector, v3);


Consider splitting this into temporaries for better pipelining, so that all the compare instructions sit next to each other, and likewise the ORs.

EgorBo · 2022-02-07T13:21:28Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs


-                for (int idx = i; lowBits != 0; idx++)
+                int vector256ShortCount = Vector256<ushort>.Count;
+                for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount)


Consider processing trailing elements via overlapping instead of scalar fallback

There's a risk, though, that the code will start getting a bit complicated. I wanted to keep the code easy to follow since it's only used for a specific scenario. If you still think it's worth it, I can definitely look into it.

Handling trailing elements in the same loop (or via a spilled iteration) shows nice improvements for small to medium-sized inputs. In theory it only adds one additional check inside the loop. Feel free to keep it as is; we can follow up later.
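One possible shape for the overlapping approach, as a sketch only. It assumes the .NET 7 Vector128 API, and the class, the helper, and the single-separator signature are hypothetical. The last full vector is re-read so that it ends exactly at the end of the input, and the lanes that the main loop already reported are masked off to avoid duplicate indices.

```csharp
using System;
using System.Collections.Generic;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

public static class OverlapTailSketch
{
    public static List<int> FindSeparators(ReadOnlySpan<char> text, char separator)
    {
        var result = new List<int>();
        ReadOnlySpan<ushort> source = MemoryMarshal.Cast<char, ushort>(text);
        Vector128<ushort> needle = Vector128.Create((ushort)separator);
        int count = Vector128<ushort>.Count;

        int i = 0;
        if (Vector128.IsHardwareAccelerated && source.Length >= count)
        {
            int lastStart = source.Length - count;
            for (; i <= lastStart; i += count)
            {
                Collect(source, i, 0, needle, result);
            }

            if (i < source.Length)
            {
                // Re-read the final full vector; its first `overlap` lanes
                // were already handled by the loop above.
                int overlap = i - lastStart;
                Collect(source, lastStart, overlap, needle, result);
                i = source.Length;
            }
        }

        // Only inputs shorter than one vector reach this loop with work left.
        for (; i < source.Length; i++)
        {
            if (source[i] == separator)
            {
                result.Add(i);
            }
        }

        return result;
    }

    private static void Collect(ReadOnlySpan<ushort> source, int start, int skipLanes,
        Vector128<ushort> needle, List<int> result)
    {
        Vector128<ushort> block = Vector128.Create(source.Slice(start, Vector128<ushort>.Count));
        uint mask = Vector128.Equals(block, needle).ExtractMostSignificantBits();
        mask &= uint.MaxValue << skipLanes; // drop lanes that were already reported

        while (mask != 0)
        {
            result.Add(start + BitOperations.TrailingZeroCount(mask));
            mask &= mask - 1; // clear the lowest set bit
        }
    }
}
```

The extra branch and the masking are the complexity cost raised in the reply above; whether it is worth it depends on how often inputs leave a tail shorter than one vector.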

EgorBo · 2022-02-07T13:21:54Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs


-                for (int idx = i; lowBits != 0; idx++)
+                int vector256ShortCount = Vector256<ushort>.Count;
+                for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount)


(i + vector256ShortCount) <= Length might overflow, it should be
i <= Length - vector256ShortCount

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/ValueListBuilder.cs

EgorBo · 2022-02-07T13:24:01Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

-            Vector128<ushort> v3 = Vector128.Create((ushort)c3);
-
-            ref char c0 = ref MemoryMarshal.GetReference(this.AsSpan());
-            int cond = Length & -Vector128<ushort>.Count;
            int i = 0;


int -> nint, it will help to avoid redundant sign extensions

The same variable is used as the index in the scalar (non-vectorized) loop at the bottom. I'll see if I can find a middle way.

you can always cast it to signed just once before the scalar version

gfoidl · 2022-02-07T15:28:20Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs


-                for (int idx = i; lowBits != 0; idx++)
+                int vector256ShortCount = Vector256<ushort>.Count;
+                for (; (i + vector256ShortCount) <= Length; i += vector256ShortCount)


Besides that the i <= len - count version can keep the len - count in a register, whilst i + count needs a repeated addition.

Also local vector256ShortCount isn't needed, as JIT will treat Vector256<ushort>.Count as constant.

gfoidl · 2022-02-07T15:34:44Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

+                int vector128ShortCount = Vector128<ushort>.Count;
+                for (; (i + vector128ShortCount) <= Length; i += vector128ShortCount)


Suggested change

-                int vector128ShortCount = Vector128<ushort>.Count;
-                for (; (i + vector128ShortCount) <= Length; i += vector128ShortCount)
+                for (; i <= Length - Vector128<ushort>.Count; i += Vector128<ushort>.Count)

When i is of type nint, check that the comparison doesn't introduce any sign extensions. Please double-check to be on the safe side.
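To make the overflow concern concrete, here is a small sketch comparing the two loop conditions on plain integers. The names are hypothetical and the values are chosen only to trigger the wrap-around; real string lengths may never get close to them.

```csharp
using System;

public static class LoopBoundSketch
{
    // `i + blockSize <= length`: the addition can wrap to a negative number
    // when i is close to int.MaxValue, making the condition wrongly true.
    public static bool AdditiveBoundHolds(int i, int blockSize, int length)
        => unchecked(i + blockSize) <= length;

    // `i <= length - blockSize`: the right-hand side is loop-invariant and
    // cannot overflow for non-negative length and positive blockSize.
    public static bool SubtractiveBoundHolds(int i, int blockSize, int length)
        => i <= length - blockSize;
}
```

With i = int.MaxValue - 3, blockSize = 16 and length = int.MaxValue, the additive form wraps around and reports that a full block still fits, while the subtractive form correctly reports that it does not. For a short input, length - blockSize is simply negative, which is harmless with signed types but would itself wrap if the operands were unsigned.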

yesmey · 2022-02-07T23:50:12Z

@EgorBo @gfoidl Thanks for the good tips and feedback; I have updated the pull request accordingly.
Unfortunately the 256-bit code had a bug: I was masking the movmskb result to keep every other bit, but had accidentally copied the mask from the 128-bit code, where the result is only 16 bits wide. No string in the test suite covered it.

The benchmark numbers for the 256-bit version are much more realistic now, and they look much closer to the 128-bit version. Please let me know your opinion, and sorry for the mistake.
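The mask bug can be reproduced with plain integer arithmetic. In this sketch the class and member names are hypothetical: a byte-wise movemask sets two adjacent bits for each matching 16-bit char, so a 256-bit vector yields a 32-bit result and needs a 32-bit "every other bit" mask.

```csharp
using System;

public static class MaskWidthSketch
{
    // Keeps one bit per char for a 128-bit vector (16 bytes, 8 chars).
    public const uint EveryOtherBit16 = 0b0101_0101_0101_0101;

    // Keeps one bit per char for a 256-bit vector (32 bytes, 16 chars).
    public const uint EveryOtherBit32 = 0x5555_5555;

    // Byte-wise movemask produced by a single match at `charIndex`:
    // each 16-bit lane contributes two adjacent bits.
    public static uint ByteMaskForMatch(int charIndex) => 0b11u << (charIndex * 2);
}
```

For a match at char index 10, ByteMaskForMatch(10) & EveryOtherBit16 is zero, so the match is silently dropped, while ByteMaskForMatch(10) & EveryOtherBit32 keeps bit 20. Only matches in the upper half of the vector are affected, which fits the report that no string in the test suite caught the bug.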

gfoidl · 2022-02-08T12:20:20Z

benchmark numbers for 256 bit is much more realistic now. It looks to be much closer to the 128 bit version now

Are these the current numbers in the PR description?
For Vector256 there's only a small gain, so is it worth having a dedicated code path for it? ARM won't support it anyway.

yesmey · 2022-02-08T12:30:20Z

@gfoidl Yes, those are the latest numbers. I can remove the 256-bit path if you want.

stephentoub · 2022-02-08T12:56:11Z

Yes, those are the latest numbers. I can remove the 256-bit path if you want.

Are any of these tests for really long inputs containing very few separators?

… in ValueListBuilder. Improve sequential iteration

EgorBo · 2022-02-09T12:22:39Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

-                    Vector256<byte> cmp = (vector1 | vector2 | vector3).AsByte();
+                    Vector256<ushort> v1 = Vector256.Create((ushort)c);
+                    Vector256<ushort> v2 = Vector256.Create((ushort)c2);
+                    Vector256<ushort> v3 = Vector256.Create((ushort)c3);


Unrelated to this PR, I'm just curious whether our guidelines allow using var here; the type of the vector should be pretty obvious from the expression on the right.

The guidelines say var should only be used for a ctor or explicit cast. While it's arguable that Create is equivalent to a ctor, there's nothing that requires it to return the same type it's declared on, and in fact there are cases where Create methods don't, e.g. File.Create.

yesmey · 2022-02-09T13:26:56Z

Sorry for the delay. I decided to rewrite the 256-bit spilling and made some improvements to the scalar loop.

Here's a gist of a much bigger benchmark suite: https://gist.github.com/yesmey/2e7a7868bb10043553b78d77cbc3f2b8
(note: the bold text is the baseline)

benchmark code for gist

public class Benchmarks
{
    private static string _testStr;
    private static System.Text.StringBuilder st;
    private static char[][] _testChar = new char[3][];

    static Benchmarks()
    {
        st = new System.Text.StringBuilder(5_000_000);
        _testChar[0] = new char[1] { ' ' };
        _testChar[1] = new char[2] { ' ', 't' };
        _testChar[2] = new char[3] { ' ', 't', 'f' };
    } 
    
    private static string BuildStr(char c, int stringLength, int sepFreq, char sep)
    {
        for (int i = 0; i < stringLength; i++)
        {
            if (i % sepFreq == 0)
            {
                st.Append(sep);
            }
            else { st.Append(c); }
        }
        string t =  st.ToString();
        st.Clear();
        return t;
    }

    [GlobalSetup]
    public void Init()
    {
        _testStr = BuildStr('a', Size, SepFreq, _testChar[2][SplitCount - 1]);
    }

    [Params(16, 64, 200, 1000, 10000)]
    public int Size { get; set; }
    
    [Params(1, 2, 5, 200)]
    public int SepFreq { get; set; }
    
    [Params(1, 2, 3)]
    public int SplitCount { get; set; }
    
    [Benchmark]
    public string[] Split()
    {
        return _testStr.Split(_testChar[SplitCount - 1]);
    }
}

Updated numbers from previous benchmarks:

csv + dotnet/performance

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-AJDBJE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XTZHCY : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Method	Job	Toolchain	CorpusUri	Mean	Error	StdDev	Ratio	RatioSD
SplitCsv	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	http(...).csv [107]	11.493 μs	0.1911 μs	0.1787 μs	1.19	0.02
SplitCsv	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	http(...).csv [107]	9.631 μs	0.1624 μs	0.1519 μs	1.00	0.00

SplitCsv	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)e.csv [50]	60.477 μs	0.2351 μs	0.2200 μs	0.96	0.01
SplitCsv	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)e.csv [50]	62.927 μs	0.9685 μs	1.5078 μs	1.00	0.00

SplitCsv	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)A.csv [77]	343.339 μs	3.5451 μs	3.3161 μs	1.19	0.01
SplitCsv	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)A.csv [77]	287.859 μs	2.7981 μs	2.6174 μs	1.00	0.00

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-AJDBJE : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-XTZHCY : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Method	Job	Toolchain	s	chr	arr	options	Mean	Error	StdDev	Ratio	RatioSD
SplitChar	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]		?	?	312.40 ns	3.786 ns	3.541 ns	1.08	0.11
SplitChar	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]		?	?	284.75 ns	8.865 ns	26.139 ns	1.00	0.00

Split	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	None	292.72 ns	2.842 ns	2.519 ns	1.17	0.02
Split	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	None	251.01 ns	4.483 ns	3.974 ns	1.00	0.00

Split	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	RemoveEmptyEntries	368.19 ns	5.179 ns	4.591 ns	1.09	0.02
Split	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	RemoveEmptyEntries	337.88 ns	0.900 ns	0.841 ns	1.00	0.00

SplitChar	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]		?	?	19.20 ns	0.035 ns	0.032 ns	0.76	0.00
SplitChar	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]		?	?	25.16 ns	0.044 ns	0.041 ns	1.00	0.00

Split	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	None	18.71 ns	0.028 ns	0.025 ns	0.63	0.00
Split	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	None	29.67 ns	0.052 ns	0.046 ns	1.00	0.00

Split	Job-AJDBJE	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	RemoveEmptyEntries	17.71 ns	0.026 ns	0.024 ns	0.55	0.00
Split	Job-XTZHCY	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	RemoveEmptyEntries	32.04 ns	0.050 ns	0.045 ns	1.00	0.00

There seem to be regressions on the strings that contain no separator characters.

ghost · 2022-02-12T17:32:15Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	yesmey
Assignees:	-
Labels:	`area-System.Runtime`, `community-contribution`
Milestone:	-

yesmey · 2022-02-13T19:53:17Z

Status update: I can't get the 256-bit version to perform well on low to mid-range input sizes because of the register save/restore overhead caused by the nested calls inside ValueListBuilder.Append. The 256-bit assembly currently looks like this: https://gist.github.com/yesmey/7786c102927cf8e9abf966cf44a35484, and as you can tell there are a lot of initial vmovaps instructions just for the potential call to Grow in AddWithResize. To prove my point, I commented out Grow inside AddWithResize for comparison here.

Since I'm not getting any further there, I'm thinking of giving up on the 256-bit path and keeping the 128-bit version, which is on par in performance, just to have an implementation for ARM.
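For context, the Append/AddWithResize split discussed here follows a common hot-path/cold-path pattern. The sketch below is a hypothetical, heavily simplified stand-in and not the runtime's ValueListBuilder<T>: the common case is a small inlinable method, and the rare growth case lives in a separate non-inlined method. It shows the shape of the pattern only; it does not reproduce the register save/restore cost described above.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical, simplified growable buffer for illustration only.
public sealed class IntListSketch
{
    private int[] _items = new int[4];
    private int _count;

    public int Count => _count;

    public int this[int index] => _items[index];

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Append(int value)
    {
        int[] items = _items;
        int count = _count;

        // Hot path: the unsigned compare also lets the JIT skip a separate
        // bounds check on the store below.
        if ((uint)count < (uint)items.Length)
        {
            items[count] = value;
            _count = count + 1;
        }
        else
        {
            AppendWithResize(value);
        }
    }

    // Cold path: kept out of line so that the hot path stays small.
    [MethodImpl(MethodImplOptions.NoInlining)]
    private void AppendWithResize(int value)
    {
        Array.Resize(ref _items, _items.Length * 2);
        _items[_count++] = value;
    }
}
```

Even with the growth code out of line, a caller that keeps vector values live across the possible call may still have to preserve those registers, which matches the vmovaps overhead reported above.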

EgorBo · 2022-02-13T20:05:08Z

So since I'm not getting any further there, I'm thinking maybe giving up on the 256 bit and keep the 128 bit version

that's ok, we try to use AVX only where it's definitely profitable.

yesmey · 2022-02-13T21:29:32Z

benchmarks for commit dcadf05

CSV parsing benchmarks

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Method	Job	Toolchain	CorpusUri	Mean	Error	StdDev	Ratio	RatioSD
SplitCsv	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	http(...).csv [107]	11.25 μs	0.104 μs	0.098 μs	1.00	0.00
SplitCsv	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	http(...).csv [107]	10.04 μs	0.011 μs	0.009 μs	0.89	0.01

SplitCsv	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)e.csv [50]	57.39 μs	1.093 μs	1.074 μs	1.00	0.00
SplitCsv	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)e.csv [50]	53.80 μs	0.305 μs	0.285 μs	0.94	0.02

SplitCsv	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)A.csv [77]	326.79 μs	1.039 μs	0.921 μs	1.00	0.00
SplitCsv	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	https(...)A.csv [77]	297.05 μs	5.757 μs	5.912 μs	0.91	0.02

dotnet/performance benchmarks

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Method	Job	Toolchain	s	chr	arr	options	Mean	Error	StdDev	Ratio	RatioSD
SplitChar	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]		?	?	297.36 ns	5.652 ns	6.282 ns	1.00	0.00
SplitChar	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]		?	?	291.99 ns	5.821 ns	12.022 ns	0.99	0.04

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	None	314.12 ns	9.964 ns	29.379 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	None	272.73 ns	3.769 ns	5.160 ns	0.88	0.08

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	RemoveEmptyEntries	374.66 ns	4.502 ns	3.759 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	A B C(...)X Y Z [51]	?	Char[1]	RemoveEmptyEntries	366.11 ns	0.924 ns	0.819 ns	0.98	0.01

SplitChar	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]		?	?	25.87 ns	0.030 ns	0.027 ns	1.00	0.00
SplitChar	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]		?	?	18.23 ns	0.016 ns	0.013 ns	0.70	0.00

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	None	18.19 ns	0.146 ns	0.122 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	None	18.39 ns	0.034 ns	0.030 ns	1.01	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	RemoveEmptyEntries	17.82 ns	0.017 ns	0.013 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	ABCDE(...)VWXYZ [26]	?	Char[1]	RemoveEmptyEntries	18.37 ns	0.066 ns	0.059 ns	1.03	0.00

partial 38001 issue benchmark suite

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19042.1526 (20H2/October2020Update)
AMD Ryzen 7 3700X, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.2.22108.4
  [Host]     : .NET 7.0.0 (7.0.22.10302), X64 RyuJIT
  Job-EPAGWH : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DBDUQW : .NET 7.0.0 (42.42.42.42424), X64 RyuJIT

Method	Job	Toolchain	Size	SepFreq	SplitCount	Mean	Error	StdDev	Median	Ratio	RatioSD
Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	1	1	90.66 ns	0.146 ns	0.129 ns	90.66 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	1	1	98.80 ns	0.351 ns	0.329 ns	98.83 ns	1.09	0.00

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	1	2	101.48 ns	1.119 ns	0.992 ns	101.72 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	1	2	110.29 ns	2.051 ns	1.919 ns	110.52 ns	1.08	0.02

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	5	1	62.59 ns	1.013 ns	0.846 ns	62.98 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	5	1	54.82 ns	0.617 ns	0.516 ns	54.83 ns	0.88	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	5	2	66.75 ns	1.380 ns	1.842 ns	65.78 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	5	2	54.66 ns	1.106 ns	0.924 ns	54.41 ns	0.82	0.03

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	200	1	33.08 ns	0.107 ns	0.095 ns	33.10 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	200	1	30.76 ns	0.220 ns	0.205 ns	30.80 ns	0.93	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	200	2	35.91 ns	0.745 ns	1.362 ns	35.77 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	16	200	2	31.82 ns	0.462 ns	0.432 ns	31.69 ns	0.89	0.06

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	1	1	1,043.18 ns	20.611 ns	36.099 ns	1,045.06 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	1	1	1,060.80 ns	20.757 ns	32.922 ns	1,052.50 ns	1.02	0.05

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	1	2	928.95 ns	18.445 ns	35.538 ns	937.08 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	1	2	1,040.02 ns	12.802 ns	9.995 ns	1,044.64 ns	1.09	0.03

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	5	1	546.27 ns	10.817 ns	18.073 ns	544.85 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	5	1	455.87 ns	9.008 ns	10.373 ns	453.83 ns	0.84	0.04

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	5	2	514.16 ns	5.528 ns	4.900 ns	513.00 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	5	2	481.95 ns	9.573 ns	18.444 ns	485.77 ns	0.92	0.03

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	200	1	65.77 ns	0.118 ns	0.098 ns	65.77 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	200	1	66.08 ns	0.448 ns	0.420 ns	66.05 ns	1.00	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	200	2	65.91 ns	0.365 ns	0.342 ns	65.75 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	200	200	2	67.15 ns	0.508 ns	0.424 ns	67.10 ns	1.02	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	1	1	4,139.27 ns	15.050 ns	11.750 ns	4,139.32 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	1	1	4,358.61 ns	9.762 ns	8.654 ns	4,354.70 ns	1.05	0.00

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	1	2	4,162.92 ns	52.087 ns	77.961 ns	4,133.73 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	1	2	4,639.01 ns	92.551 ns	197.234 ns	4,646.57 ns	1.11	0.05

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	5	1	2,348.05 ns	47.024 ns	112.666 ns	2,355.87 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	5	1	2,201.39 ns	53.919 ns	158.983 ns	2,145.12 ns	0.94	0.08

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	5	2	2,408.45 ns	49.325 ns	145.436 ns	2,465.65 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	5	2	2,160.78 ns	42.667 ns	85.210 ns	2,151.04 ns	0.89	0.07

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	200	1	248.99 ns	0.974 ns	0.911 ns	249.01 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	200	1	246.70 ns	1.574 ns	1.314 ns	247.05 ns	0.99	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	200	2	245.95 ns	1.117 ns	1.045 ns	245.52 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	1000	200	2	244.58 ns	1.302 ns	1.218 ns	244.91 ns	0.99	0.00

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	1	1	40,257.63 ns	395.206 ns	330.015 ns	40,114.94 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	1	1	43,103.42 ns	797.633 ns	622.740 ns	43,088.24 ns	1.07	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	1	2	40,372.55 ns	764.782 ns	715.378 ns	39,915.27 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	1	2	43,351.75 ns	857.048 ns	1,916.910 ns	42,166.56 ns	1.10	0.05

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	5	1	23,065.97 ns	419.810 ns	372.151 ns	22,920.42 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	5	1	21,142.54 ns	471.224 ns	1,389.414 ns	20,593.73 ns	0.94	0.08

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	5	2	25,646.14 ns	506.991 ns	1,204.917 ns	25,906.24 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	5	2	23,400.37 ns	466.163 ns	1,344.987 ns	23,348.94 ns	0.92	0.08

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	200	1	2,161.35 ns	3.897 ns	3.254 ns	2,159.56 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	200	1	2,182.22 ns	16.648 ns	15.573 ns	2,181.21 ns	1.01	0.01

Split	Job-EPAGWH	\runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	200	2	2,174.07 ns	3.713 ns	3.291 ns	2,173.86 ns	1.00	0.00
Split	Job-DBDUQW	\yesmey_runtime\artifacts\bin\testhost\net7.0-windows-Release-x64\shared\Microsoft.NETCore.App\7.0.0\CoreRun.exe	10000	200	2	2,130.82 ns	8.968 ns	7.949 ns	2,131.29 ns	0.98	0.00

benchmark source

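using System;
using System.Collections.Generic;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;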

public class CsvBenchmarks
{
    private string[] _strings;

    public IEnumerable<string> CorpusList()
    {
        // only these three urls still return any result
        yield return "https://www.census.gov/econ/bfs/csv/date_table.csv";
        yield return "https://www.sba.gov/sites/default/files/aboutsbaarticle/FY16_SBA_RAW_DATA.csv";
        yield return "https://wfmi.nifc.gov/fire_reporting/annual_dataset_archive/1972-2010/_WFMI_Big_Files/BOR_1972-2010_Gis.csv";
    }

    [ParamsSource(nameof(CorpusList))]
    public string CorpusUri { get; set; }

    [GlobalSetup]
    public void Setup()
    {
        _strings = GetStringsFromCorpus().GetAwaiter().GetResult();
    }

    private async Task<string[]> GetStringsFromCorpus()
    {
        using var client = new HttpClient();
        using var response = await client.GetAsync(CorpusUri);
        response.EnsureSuccessStatusCode();

        var body = await response.Content.ReadAsStringAsync();

        List<string> lines = new();

        StringReader reader = new StringReader(body);
        string? line;
        while ((line = reader.ReadLine()) != null)
        {
            lines.Add(line);
        }

        return lines.ToArray();
    }

    [Benchmark]
    public string[]? SplitCsv()
    {
        string[]? split = null;
        string[] lines = _strings;
        for (int i = 0; i < lines.Length; i++)
        {
            split = lines[i].Split(',');
        }
        return split;
    }   
}

public class RegressionBenchmark
{
    [Benchmark]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", ' ')]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", ' ')]
    public string[] SplitChar(string s, char chr)
        => s.Split(chr);

    [Benchmark]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", new char[] { ' ' }, StringSplitOptions.None)]
    [Arguments("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z", new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", new char[]{' '}, StringSplitOptions.None)]
    [Arguments("ABCDEFGHIJKLMNOPQRSTUVWXYZ", new char[]{' '}, StringSplitOptions.RemoveEmptyEntries)]
    public string[] Split(string s, char[] arr, StringSplitOptions options)
        => s.Split(arr, options);
}

public class Benchmarks
{
    private static string _testStr;
    private static System.Text.StringBuilder st;
    private static char[][] _testChar = new char[3][];

    static Benchmarks()
    {
        st = new System.Text.StringBuilder(5_000_000);
        _testChar[0] = new char[1] { ' ' };
        _testChar[1] = new char[3] { ' ', 't', 'f' };
    } 
    
    private static string BuildStr(char c, int stringLength, int sepFreq, char sep)
    {
        for (int i = 0; i < stringLength; i++)
        {
            if (i % sepFreq == 0)
            {
                st.Append(sep);
            }
            else { st.Append(c); }
        }
        string t =  st.ToString();
        st.Clear();
        return t;
    }

    [GlobalSetup]
    public void Init()
    {
        _testStr = BuildStr('a', Size, SepFreq, _testChar[1][SplitCount - 1]);
    }

    [Params(16, 200, 1000, 10000)]
    public int Size { get; set; }
    
    [Params(1, 5, 200)]
    public int SepFreq { get; set; }

    [Params(1, 2)]
    public int SplitCount { get; set; }
    
    [Benchmark]
    public string[] Split()
    {
        return _testStr.Split(_testChar[SplitCount - 1]);
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    }
}

danmoseley · 2022-03-23T04:13:46Z

@EgorBo @stephentoub @gfoidl is your feedback addressed?

gfoidl

@danmoseley I had another look, when these points are addressed I'm happy with the PR 😄.

gfoidl · 2022-03-23T10:31:59Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

@@ -1609,14 +1609,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
            }

            // Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator.
-            else if (separators.Length <= 3)
+            else if ((uint)separators.Length <= (uint)3)


Is this cast still needed?
AFAIK JIT recognizes this now.

@gfoidl do you mean we no longer need the pattern if ((uint)index > (uint)array.Length) that we have everywhere in the tree? If so we should have an issue to remove it.

#62864 is the PR for that change (got merged 26 days ago).

If so we should have an issue to remove it.

Filed #67044 for it.

That's great! Glad it got fixed, I'll get rid of it
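
For readers unfamiliar with the pattern discussed in this thread, here is a minimal sketch of why the unsigned cast folds two range comparisons into one. The `RangeCheck` class and its method names are hypothetical and exist only for illustration; the sketch shows that the two forms give the same answer, not what code the JIT emits for either.

```csharp
using System;

static class RangeCheck
{
    // Straightforward form: two comparisons.
    public static bool InRangeTwoCompares(int index, int length)
        => index >= 0 && index < length;

    // Cast form: a negative index reinterpreted as uint becomes a value
    // of at least 2^31, which is larger than any non-negative length,
    // so a single unsigned comparison rejects it.
    // Assumes length is non-negative, as array and span lengths are.
    public static bool InRangeUnsigned(int index, int length)
        => (uint)index < (uint)length;
}
```

Per the comments above, the JIT change in #62864 is what makes the explicit cast unnecessary in this code path.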

gfoidl · 2022-03-23T10:34:01Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

-            // Redundant test so we won't prejit remainder of this method
-            // on platforms without SSE.
-            if (!Sse41.IsSupported)
+            if (!Vector128.IsHardwareAccelerated)


Please use the comment from the previous version (left side of comparison) to make it clear that this check is needed to avoid prejit.
Otherwise a Debug.Assert(Vector128.IsHardwareAccelerated) could do it too.

I'll reintroduce the comment with a small text change since it's not limited to only SSE anymore

gfoidl · 2022-03-23T10:38:13Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

-            int i = 0;
-
-            for (; i < cond; i += Vector128<ushort>.Count)
+            while (offset <= lengthToExamine - (nuint)Vector128<ushort>.Count)


Above, in L1618, we guard by Vector128<ushort>.Count * 2, so when reaching this point we know for sure that enough elements are available. Thus this check isn't needed at this point, and you could change the loop to a do-while loop: the first iteration runs without any further precondition, and the check for more available elements is done after each iteration.

Thanks. I'll add a Debug.Assert on entry to make it a little more obvious to the reader that it's a precondition
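
A rough sketch of the loop shape suggested here, using the cross-platform Vector128 API. This is not the code from the PR: `DoWhileSketch.CountSeparator` is a hypothetical helper that only counts matches, chosen to keep the example short. The length guard mirrors the `Vector128<ushort>.Count * 2` check from the review, which is what makes the first iteration safe without a loop-entry test.

```csharp
using System;
using System.Diagnostics;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

static class DoWhileSketch
{
    // Hypothetical helper: counts occurrences of sep in text.
    public static int CountSeparator(ReadOnlySpan<char> text, char sep)
    {
        int count = 0;
        nuint offset = 0;
        nuint length = (nuint)text.Length;

        if (Vector128.IsHardwareAccelerated && text.Length >= Vector128<ushort>.Count * 2)
        {
            ref ushort source = ref Unsafe.As<char, ushort>(ref MemoryMarshal.GetReference(text));
            Vector128<ushort> sepVec = Vector128.Create((ushort)sep);

            // Precondition established by the guard above.
            Debug.Assert(length >= (nuint)Vector128<ushort>.Count);

            do
            {
                Vector128<ushort> chunk = Vector128.LoadUnsafe(ref source, offset);
                // One bit per 16-bit element that equals the separator.
                uint mask = Vector128.Equals(chunk, sepVec).ExtractMostSignificantBits();
                count += BitOperations.PopCount(mask);
                offset += (nuint)Vector128<ushort>.Count;
            } while (offset <= length - (nuint)Vector128<ushort>.Count);
        }

        // Scalar tail (and the whole input when the vector path is skipped).
        for (int i = (int)offset; i < text.Length; i++)
        {
            if (text[i] == sep) count++;
        }

        return count;
    }
}
```

The subtraction in the loop condition cannot underflow because the guard guarantees that `length` is at least two vectors wide.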

gfoidl · 2022-03-23T10:43:52Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

            {
-                char curr = Unsafe.Add(ref c0, (IntPtr)(uint)i);
+                char curr = (char)Unsafe.Add(ref source, (nint)offset);


Suggested change

-                char curr = (char)Unsafe.Add(ref source, (nint)offset);
+                char curr = (char)Unsafe.Add(ref source, offset);

Not needed, there's an overload for nuint.

I must've missed that it got added, thanks
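
A small illustration of the overload mentioned above, assuming a target framework where `Unsafe.Add<T>(ref T, nuint)` is available (.NET 6 or later, as far as I know). `UnsafeAddSketch.ReadAt` is a hypothetical wrapper, not part of the PR.

```csharp
using System.Runtime.CompilerServices;

static class UnsafeAddSketch
{
    // Reads the UTF-16 code unit at the given element offset.
    // No cast to nint is needed because Unsafe.Add accepts nuint directly.
    // The caller must ensure that offset is within bounds.
    public static char ReadAt(ref ushort source, nuint offset)
        => (char)Unsafe.Add(ref source, offset);
}
```

For example, with `ushort[] data = { 65, 66, 67 };`, the call `UnsafeAddSketch.ReadAt(ref data[0], 2)` reads the third element.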

gfoidl

Just one question -- otherwise LGTM.

gfoidl · 2022-03-23T21:12:02Z

src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs

@@ -1615,8 +1615,7 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
                sep0 = separators[0];
                sep1 = separators.Length > 1 ? separators[1] : sep0;
                sep2 = separators.Length > 2 ? separators[2] : sep1;
-
-                if (Length >= 16 && Sse41.IsSupported)
+                if (Vector128.IsHardwareAccelerated && Length >= Vector128<ushort>.Count * 2)


Just to double-check: the * 2 is intentional as perf-numbers showed that?

Yes, exactly; smaller strings don't perform as well

danmoseley · 2022-03-24T00:28:32Z

methodtable assert is #64544
FSW crash is #67071
JSON assert is #60962

EgorBo · 2022-04-07T17:06:46Z

Improvement on win-x64 dotnet/perf-autofiling-issues#4291

danmoseley · 2022-04-07T18:09:59Z

Nice drop in that graph, @yesmey. Do you plan to do more of this kind of work?

yesmey added 2 commits February 7, 2022 12:36

Improve performance of common path for Append in ValueListBuilder

e64d510

ghost added the community-contribution Indicates that the PR has been added by a community member label Feb 7, 2022

EgorBo reviewed Feb 7, 2022

src/libraries/System.Private.CoreLib/src/System/Collections/Generic/ValueListBuilder.cs

EgorBo reviewed Feb 7, 2022

gfoidl reviewed Feb 7, 2022

EgorBo mentioned this pull request Feb 7, 2022

Treat TZCNT/POPCNT/LZCNT as never negative #64909

Closed

runfoapp bot mentioned this pull request Feb 7, 2022

profiler.elt work item test failures in slowpatheltenter #60018

Closed

yesmey added 2 commits February 8, 2022 00:21

Address pr feedback

60d94e1

Add a longer string split test example

4c8c2b5

yesmey and others added 2 commits February 9, 2022 10:59

Merge branch 'dotnet:main' into main

0cf5ff2

Improve spilling for the Vector256 version. Remove temp span variable in ValueListBuilder. Improve sequential iteration

915ff82

EgorBo reviewed Feb 9, 2022

marek-safar added the area-System.Runtime label Feb 12, 2022

Revert the AVX version

dcadf05

danmoseley closed this Mar 23, 2022

danmoseley reopened this Mar 23, 2022

gfoidl reviewed Mar 23, 2022

gfoidl mentioned this pull request Mar 23, 2022

Remove JIT-workaround for eliminating bound check #67044

Open

EgorBo mentioned this pull request Mar 23, 2022

Optimized string.Replace(char, char) #67049

Merged

yesmey and others added 2 commits March 23, 2022 21:58

Merge branch 'dotnet:main' into main

8e7996c

Address feedback

51b102b

gfoidl approved these changes Mar 23, 2022

tannergooding approved these changes Mar 23, 2022

danmoseley merged commit b4e258a into dotnet:main Mar 24, 2022

radekdoulik pushed a commit to radekdoulik/runtime that referenced this pull request Mar 30, 2022

Improve vectorization of String.Split (dotnet#64899)

c27cc12

ghost locked as resolved and limited conversation to collaborators May 7, 2022

