Potential performance regression in OrdinalIgnoreCase string comparison #59087

adamsitnik · 2021-09-14T13:51:37Z

Originally detected by the bot in DrewScoggins/performance-2#4549 but not reported in dotnet/runtime (cc @DrewScoggins)

System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, OrdinalIgnoreCase))

Result	Base	Diff	Ratio	Alloc Delta	Modality	Operating System	Bit	Processor Name	Base V	Diff V
Same	978.46	1223.95	0.80	+0	bimodal	Windows 10.0.19043.1165	X64	AMD Ryzen Threadripper PRO 3945WX 12-Cores	5.0.921.35908	6.0.21.41701
Slower	1405.76	1696.68	0.83	+0		Windows 10.0.20348	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1390.04	1662.75	0.84	+0		Windows 10.0.20348	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1228.43	1642.76	0.75	+0		Windows 10.0.18363.1621	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Slower	1689.86	2838.86	0.60	+0		Windows 8.1	X64	Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge)	5.0.921.35908	6.0.21.45401
Slower	1366.24	1803.72	0.76	+0		Windows 10.0.19042.685	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	5.0.921.35908	6.0.21.41701
Slower	1014.33	1552.26	0.65	+0		Windows 10.0.19043.1165	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)	5.0.921.35908	6.0.21.41701
Slower	1355.82	1964.26	0.69	+0		Windows 10.0.22454	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)	5.0.921.35908	6.0.21.41701
Slower	858.70	1325.30	0.65	+0		Windows 10.0.22451	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)	5.0.921.35908	6.0.21.41701
Slower	984.06	1414.77	0.70	+0		Windows 10.0.19042.1165	X64	Intel Core i9-9900T CPU 2.10GHz	5.0.921.35908	6.0.21.41701
Slower	4173.22	6321.15	0.66	+0		Windows 7 SP1	X64	Intel Core2 Duo CPU T9600 2.80GHz	5.0.721.25508	6.0.21.41701
Slower	1394.92	1628.06	0.86	+0		centos 8	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1371.22	1630.47	0.84	+0		debian 10	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1402.69	1653.62	0.85	+0		rhel 7	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1401.21	1614.74	0.87	+0		sles 15	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1378.12	1623.93	0.85	+0		opensuse-leap 15.3	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	1020.15	1478.16	0.69	+0		ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Same	1744.39	2038.20	0.86	+0	bimodal	alpine 3.13	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)	5.0.921.35908	6.0.21.41701
Slower	3021.64	3715.93	0.81	+0		ubuntu 16.04	Arm64	Unknown processor	5.0.421.11614	6.0.21.41701
Slower	1809.60	2855.11	0.63	+0		Windows 10.0.19043.1165	Arm64	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Slower	1987.02	2863.98	0.69	+0		Windows 10.0.22000	Arm64	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Slower	989.54	1249.97	0.79	+0	several?	Windows 10.0.19043.1165	X86	AMD Ryzen Threadripper PRO 3945WX 12-Cores	5.0.921.35908	6.0.21.41701
Slower	1446.38	2158.57	0.67	+0		Windows 10.0.18363.1621	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Slower	1816.65	2931.48	0.62	+0		Windows 10.0.19043.1165	Arm	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Slower	1306.85	1927.69	0.68	+0		macOS Big Sur 11.5.2	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)	5.0.921.35908	6.0.21.41701
Slower	1161.13	1668.36	0.70	+0		macOS Big Sur 11.5.2	X64	Intel Core i7-4870HQ CPU 2.50GHz (Haswell)	5.0.921.35908	6.0.21.41701
Slower	1198.18	1747.52	0.69	+0		macOS Big Sur 11.4	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	5.0.921.35908	6.0.21.41701
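
The report does not define the Ratio column. Reading the rows, it appears to be Base divided by Diff, so values below 1.00 mean the 6.0 build is slower. A minimal sketch of that arithmetic, using the Windows 8.1 row above (the `RatioSketch` class and `Ratio` method are hypothetical names, and the Base/Diff interpretation is an inference from the data, not something the bot states):

```csharp
using System;
using System.Globalization;

public static class RatioSketch
{
    // Assumption: Ratio = Base / Diff, inferred from the table rows.
    public static double Ratio(double baseValue, double diffValue) => baseValue / diffValue;

    public static void Main()
    {
        // Windows 8.1, Ivy Bridge row: Base = 1689.86, Diff = 2838.86
        double ratio = Ratio(1689.86, 2838.86);
        Console.WriteLine(ratio.ToString("F2", CultureInfo.InvariantCulture)); // 0.60
    }
}
```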

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --filter System.Globalization.Tests.StringEquality.Compare_Same_Upper

A quick look at the historical data, followed by a zoomed-in view (charts not included here).

It might have been caused by PGO (cc @AndyAyersMS @kunalspathak):

f64246c...25f1800

category:performance
theme:benchmarks

ghost · 2021-09-14T13:51:40Z

Tagging subscribers to this area: @tarekgh, @safern
See info in area-owners.md if you want to be subscribed.

Author:	adamsitnik
Assignees:	-
Labels:	`area-System.Globalization`, `tenet-performance`
Milestone:	-

EgorBo · 2021-09-14T14:21:37Z

Standalone repro:

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace System.Globalization.Tests
{
    [Config(typeof(ConfigWithCustomEnvVars))]
    [DisassemblyDiagnoser(maxDepth: 6, exportDiff: true)]
    public class StringEquality
    {
        private class ConfigWithCustomEnvVars : ManualConfig
        {
            private const string JitNoInline = "COMPlus_JitNoInline";

            public ConfigWithCustomEnvVars()
            {
                AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithId("Default"));
                AddJob(Job.Default.WithRuntime(CoreRuntime.Core60)
                    .WithEnvironmentVariables(new EnvironmentVariable("DOTNET_JitDisablePgo", "1"))
                    .WithId("No PGO"));
            }
        }

        private string _value, _same, _sameUpper, _diffAtFirstChar;

        public static IEnumerable<(CultureInfo CultureInfo, CompareOptions CompareOptions)> GetOptions()
        {
            yield return (new CultureInfo("en-US"), CompareOptions.OrdinalIgnoreCase);
        }

        [ParamsSource(nameof(GetOptions))]
        public (CultureInfo CultureInfo, CompareOptions CompareOptions) Options;

        [Params(1024)] // single execution path = single test case
        public int Count;

        [GlobalSetup]
        public void Setup()
        {
            // we are using part of Alice's Adventures in Wonderland text as test data
            char[] characters = File.ReadAllText(@"path\to\alice29.txt").Take(Count).ToArray();
            _value = new string(characters);
            _same = new string(characters);
            _sameUpper = _same.ToUpper();
            char[] copy = characters.ToArray();
            copy[0] = (char)(copy[0] + 1);
            _diffAtFirstChar = new string(copy);
        }

        [Benchmark] // the most work to do for IgnoreCase: every char needs to be compared and uppercased
        public int Compare_Same_Upper() => Options.CultureInfo.CompareInfo.Compare(_value, _sameUpper, Options.CompareOptions);

        public static void Main() => BenchmarkRunner.Run<StringEquality>();
    }
}

Asmdiff (NoPGO is on the left): https://www.diffchecker.com/2guAUS5i

EgorBo · 2021-09-14T14:50:05Z

The most interesting part is that FullPGO is still slower than NoPGO mode cc @AndyAyersMS (I've not investigated why yet):


|             Method |              Job |         EnvironmentVariables | Count |              Options |     Mean |    Error |  StdDev |
|------------------- |----------------- |----------------------------- |------ |--------------------- |---------:|---------:|--------:|
| Compare_Same_Upper |          FullPGO | DOTNET_TC_QuickJitForLoops=1 |  1024 | (en-U(...)Case) [26] | 908.5 ns |  2.94 ns | 2.45 ns |
| Compare_Same_Upper |           No PGO |       DOTNET_JitDisablePgo=1 |  1024 | (en-U(...)Case) [26] | 869.2 ns |  9.12 ns | 8.53 ns |
| Compare_Same_Upper |          Default |                        Empty |  1024 | (en-U(...)Case) [26] | 948.0 ns | 10.51 ns | 9.83 ns |

adamsitnik · 2021-09-14T18:08:58Z

Another benchmark that was most likely affected by this change:

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Globalization.Tests.StringEquality.Compare_DifferentFirstChar(Count%3a%201024%2c%20Options%3a%20(en-US%2c%20Ordinal)).html

But in this particular case only Unix-like systems were affected:

System.Globalization.Tests.StringEquality.Compare_DifferentFirstChar(Count: 1024, Options: (en-US, Ordinal))

Result	Base	Diff	Ratio	Alloc Delta	Modality	Operating System	Bit	Processor Name	Base V	Diff V
Same	10.67	10.41	1.03	+0		Windows 10.0.19043.1165	X64	AMD Ryzen Threadripper PRO 3945WX 12-Cores	5.0.921.35908	6.0.21.41701
Same	13.61	13.19	1.03	+0		Windows 10.0.20348	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	13.59	13.19	1.03	+0		Windows 10.0.20348	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	10.84	9.88	1.10	+0		Windows 10.0.18363.1621	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Faster	17.97	15.89	1.13	+0		Windows 8.1	X64	Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge)	5.0.921.35908	6.0.21.45401
Same	11.84	11.12	1.07	+0		Windows 10.0.19042.685	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	5.0.921.35908	6.0.21.41701
Same	10.48	9.42	1.11	+0	several?	Windows 10.0.19043.1165	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)	5.0.921.35908	6.0.21.41701
Same	14.24	15.04	0.95	+0		Windows 10.0.22454	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)	5.0.921.35908	6.0.21.41701
Same	9.66	9.74	0.99	+0		Windows 10.0.22451	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)	5.0.921.35908	6.0.21.41701
Same	10.44	9.05	1.15	+0		Windows 10.0.19042.1165	X64	Intel Core i9-9900T CPU 2.10GHz	5.0.921.35908	6.0.21.41701
Slower	19.70	29.32	0.67	+0		Windows 7 SP1	X64	Intel Core2 Duo CPU T9600 2.80GHz	5.0.721.25508	6.0.21.41701
Slower	17.37	21.38	0.81	+0		centos 8	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	13.29	20.88	0.64	+0		debian 10	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	17.48	20.97	0.83	+0		rhel 7	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	13.56	21.21	0.64	+0		sles 15	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	13.25	22.69	0.58	+0		opensuse-leap 15.3	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	11.09	15.13	0.73	+0		ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Slower	17.81	23.91	0.74	+0		ubuntu 18.04	X64	Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge)	5.0.921.35908	6.0.21.41701
Slower	11.35	15.96	0.71	+0		alpine 3.13	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)	5.0.921.35908	6.0.21.41701
Slower	43.14	54.98	0.78	+0		ubuntu 16.04	Arm64	Unknown processor	5.0.421.11614	6.0.21.41701
Faster	13.65	11.51	1.19	+0		Windows 10.0.19043.1165	Arm64	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Faster	14.64	9.85	1.49	+0		Windows 10.0.22000	Arm64	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Same	11.37	10.14	1.12	+0		Windows 10.0.19043.1165	X86	AMD Ryzen Threadripper PRO 3945WX 12-Cores	5.0.921.35908	6.0.21.41701
Same	13.42	13.76	0.98	+0		Windows 10.0.18363.1621	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Slower	15.32	18.75	0.82	+0		Windows 10.0.19043.1165	Arm	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Slower	14.28	19.23	0.74	+0		macOS Big Sur 11.5.2	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)	5.0.921.35908	6.0.21.41701
Slower	12.55	16.66	0.75	+0		macOS Big Sur 11.5.2	X64	Intel Core i7-4870HQ CPU 2.50GHz (Haswell)	5.0.921.35908	6.0.21.41701
Slower	13.13	17.43	0.75	+0		macOS Big Sur 11.4	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	5.0.921.35908	6.0.21.41701

adamsitnik · 2021-09-14T18:31:16Z

One more Unix-specific benchmark that regressed at the same time:

System.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt64(value: 12345)

Result	Base	Diff	Ratio	Alloc Delta	Modality	Operating System	Bit	Processor Name	Base V	Diff V
Same	8.43	7.11	1.18	+0		Windows 10.0.19043.1165	X64	AMD Ryzen Threadripper PRO 3945WX 12-Cores	5.0.921.35908	6.0.21.41701
Same	10.14	9.51	1.07	+0		Windows 10.0.20348	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	10.16	9.45	1.08	+0		Windows 10.0.20348	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	10.38	10.14	1.02	+0		Windows 10.0.18363.1621	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Same	16.68	15.04	1.11	+0		Windows 8.1	X64	Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge)	5.0.921.35908	6.0.21.45401
Same	11.61	11.18	1.04	+0		Windows 10.0.19042.685	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	5.0.921.35908	6.0.21.41701
Same	9.83	8.78	1.12	+0		Windows 10.0.19043.1165	X64	Intel Core i7-6700 CPU 3.40GHz (Skylake)	5.0.921.35908	6.0.21.41701
Same	12.85	13.06	0.98	+0	several?	Windows 10.0.22454	X64	Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R)	5.0.921.35908	6.0.21.41701
Same	9.21	8.65	1.06	+0		Windows 10.0.22451	X64	Intel Core i7-8700 CPU 3.20GHz (Coffee Lake)	5.0.921.35908	6.0.21.41701
Same	10.19	9.32	1.09	+0		Windows 10.0.19042.1165	X64	Intel Core i9-9900T CPU 2.10GHz	5.0.921.35908	6.0.21.41701
Slower	21.23	36.56	0.58	+0		Windows 7 SP1	X64	Intel Core2 Duo CPU T9600 2.80GHz	5.0.721.25508	6.0.21.41701
Slower	9.41	12.35	0.76	+0		centos 8	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	10.37	11.61	0.89	+0		debian 10	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	9.55	12.51	0.76	+0	several?	rhel 7	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	10.13	11.76	0.86	+0		sles 15	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Slower	9.42	11.69	0.81	+0		opensuse-leap 15.3	X64	AMD EPYC 7452	5.0.921.35908	6.0.21.41701
Same	10.05	12.02	0.84	+0		ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Slower	14.97	19.07	0.79	+0		ubuntu 18.04	X64	Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge)	5.0.921.35908	6.0.21.41701
Slower	9.25	11.46	0.81	+0		alpine 3.13	X64	Intel Core i7-7700 CPU 3.60GHz (Kaby Lake)	5.0.921.35908	6.0.21.41701
Same	28.50	26.57	1.07	+0		ubuntu 16.04	Arm64	Unknown processor	5.0.421.11614	6.0.21.41701
Same	11.26	11.56	0.97	+0		Windows 10.0.19043.1165	Arm64	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Same	13.15	11.53	1.14	+0		Windows 10.0.22000	Arm64	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Same	8.34	9.77	0.85	+0		Windows 10.0.19043.1165	X86	AMD Ryzen Threadripper PRO 3945WX 12-Cores	5.0.921.35908	6.0.21.41701
Same	12.32	13.57	0.91	+0	several?	Windows 10.0.18363.1621	X86	Intel Xeon CPU E5-1650 v4 3.60GHz	5.0.921.35908	6.0.21.41701
Slower	35.63	48.83	0.73	+0		Windows 10.0.19043.1165	Arm	Microsoft SQ1 3.0 GHz	5.0.921.35908	6.0.21.41701
Slower	13.01	15.58	0.84	+0		macOS Big Sur 11.5.2	X64	Intel Core i5-4278U CPU 2.60GHz (Haswell)	5.0.921.35908	6.0.21.41701
Slower	11.19	13.47	0.83	+0		macOS Big Sur 11.5.2	X64	Intel Core i7-4870HQ CPU 2.50GHz (Haswell)	5.0.921.35908	6.0.21.41701
Slower	11.44	14.51	0.79	+0		macOS Big Sur 11.4	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	5.0.921.35908	6.0.21.41701

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt64(value%3a%2012345).html

adamsitnik · 2021-09-15T10:41:38Z

Another:

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_Windows%2010.0.18362%2fSystem.Tests.Perf_Int32.ToStringHex(value%3a%202147483647).html

adamsitnik · 2021-09-15T11:11:42Z

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Globalization.Tests.StringEquality.Compare_Same_Upper(Count%3a%201024%2c%20Options%3a%20(en-US%2c%20Ordinal)).html

ghost · 2021-09-16T02:03:17Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Author:	adamsitnik
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

AndyAyersMS · 2021-09-21T02:40:56Z

Looking at the first of these... will update with details.

AndyAyersMS · 2021-09-21T18:27:02Z

There seem to be several issues in Compare_Same_Upper.

First is that the PGO training data does not see many cases of case-insensitive comparison. So in the compound expression

runtime/src/libraries/System.Private.CoreLib/src/System/Globalization/Ordinal.cs

Lines 28 to 29 in 1b14c94

    
    ((charA | 0x20) == (charB | 0x20) &&
        (uint)((charA | 0x20) - 'a') <= (uint)('z' - 'a')))

The first clause is true with likelihood 0.97. This means that the CSE charA | 0x20 is not deemed profitable by the jit as the second use is considered quite infrequent.

In the benchmark however this code path is taken quite often and so PGO imposes an extra cost. One possible fix here is to broaden the set of inputs we see during PGO training, though it's also possible the likelihoods we see now are realistic.

We should arguably revisit the CSE costing heuristics as it does not seem like doing this extra CSE would actually cause any issues.
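
To make the CSE point concrete, here is a minimal sketch of the same ASCII case-insensitive check with the shared subexpression `charA | 0x20` hoisted into a local by hand. This is not the code in Ordinal.cs: the class name, the method name, and the leading `charA == charB` fast path are assumptions added so the example stands alone.

```csharp
using System;

public static class AsciiIgnoreCaseSketch
{
    // Hypothetical helper: same shape as the quoted condition, but the
    // value (charA | 0x20) is computed once and reused by both clauses.
    public static bool EqualsIgnoreCaseAscii(char charA, char charB)
    {
        if (charA == charB)
        {
            return true; // assumed fast path, not part of the quoted lines
        }

        int lowerA = charA | 0x20; // manual CSE of (charA | 0x20)

        // Equal after setting the 0x20 bit, and the result is in 'a'..'z'.
        return lowerA == (charB | 0x20)
            && (uint)(lowerA - 'a') <= (uint)('z' - 'a');
    }

    public static void Main()
    {
        Console.WriteLine(EqualsIgnoreCaseAscii('a', 'A')); // True
        Console.WriteLine(EqualsIgnoreCaseAscii('[', '{')); // False: differ only in bit 0x20, but not letters
        Console.WriteLine(EqualsIgnoreCaseAscii('a', 'b')); // False
    }
}
```

The range check is what rejects pairs such as `[` and `{`, which differ only in the 0x20 bit but are not letters. Whether hoisting by hand changes the generated code in the real method is not something this sketch demonstrates.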

Second is that PGO data leads to poor layout decisions. The jit's block layout algorithm is locally greedy and this leads to globally sub-optimal layouts. One such example happens early in the method where there is a control flow diamond; with PGO we see this diamond is biased with one block at 0.81 likelihood. Another happens later in the code where a PGO-rare path in a loop is moved and requires several jumps to rejoin the flow. In general the JIT is too aggressive in moving lower-frequency blocks out of line; doing so inhibits opportunities for jump elimination later on.

There is work anticipated here in .NET 7 but no easy fix in the meantime.

Third is that the early block reordering done by the JIT interferes with loop recognition, and the JIT mistakenly thinks there are two loops in the method (instead of one multi-exit loop). While concerning, it doesn't seem to cause problems here, as the loops are not currently optimizable. But the fact that the JIT does the reordering so early is indicative of two shortcomings: (1) loop recognition is too pattern sensitive; (2) optimizing block order should generally wait until later in the phase pipeline.

This is also something we've seen elsewhere and may try and address in .NET 7.

Compare_DifferentFirstChar hits these same code paths (though just one pass through, not many) and so likely has the same root cause.

AndyAyersMS · 2021-09-21T20:02:50Z

I can repro the FormatterInt64 regression on x64 Linux, but for some reason the disassembly diagnoser is failing for the default config, so I haven't been able to compare codegen yet...

Unhandled exception. System.NotSupportedException: Unknown Acknowledgment: 
   at BenchmarkDotNet.Engines.ConsoleHost.SendSignal(HostSignal hostSignal)
   at BenchmarkDotNet.Engines.HostExtensions.AfterAll(IHost host)
   at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[] args) in /home/andy/repos/performance/artifacts/bin/MicroBenchmarks/Release/net6.0/b1743b7d-d6d1-4487-b16d-dab5a6be6e52/b1743b7d-d6d1-4487-b16d-dab5a6be6e52.notcs:line 77
   at BenchmarkDotNet.Autogenerated.UniqueProgramName.Main(String[] args) in /home/andy/repos/performance/artifacts/bin/MicroBenchmarks/Release/net6.0/b1743b7d-d6d1-4487-b16d-dab5a6be6e52/b1743b7d-d6d1-4487-b16d-dab5a6be6e52.notcs:line 24

The other cases (longer values to format) also show small regressions.

AndyAyersMS · 2021-09-22T19:32:31Z

Can repro FormatterInt64 regression with corerun & checked jit dropped into release build.

Using COMPlus_JitDisablePgo=1, I can verify that the regression is related to the incorporation of PGO data.

The 3 key methods are Utf8FormatterTests:FormatterInt64, WriteDigits and CountDigits. PGO modifies codegen for all three, but the changes in WriteDigits and CountDigits do not seem to be related to the regressions seen here (via a modified setup where I can selectively suppress PGO per method).

So regression seems to be coming from changes in Utf8FormatterTests:FormatterInt64. This is a fairly large method and the resulting codegen is quite different. The root method has no static PGO and a loop so it bypasses tiering. A number of inlinees have PGO data, and there are a whole lot of aggressive inlines happening:

Inlines into 06001894 [via ExtendedDefaultPolicy] Utf8FormatterTests:FormatterInt64(long):bool:this
  [1 IL=0007 TR=000003 06001711] [below ALWAYS_INLINE size] Span`1:op_Implicit(ref):Span`1
    [2 IL=0001 TR=000029 06001707] [aggressive inline attribute] Span`1:.ctor(ref):this
      [3 IL=0058 TR=000046 06004B30] [aggressive inline attribute] MemoryMarshal:GetArrayDataReference(ref):byref
  [4 IL=0023 TR=000012 06002A9F] [below ALWAYS_INLINE size] Utf8Formatter:TryFormat(long,Span`1,byref,StandardFormat):bool
    [5 IL=0006 TR=000080 06002AA0] [aggressive inline attribute] Utf8Formatter:TryFormatInt64(long,long,Span`1,byref,StandardFormat):bool
      [6 IL=0002 TR=000096 06002A57] [profitable inline] StandardFormat:get_IsDefault():bool:this
      [7 IL=0012 TR=000272 06002AA2] [aggressive inline attribute] Utf8Formatter:TryFormatInt64Default(long,Span`1,byref):bool
        [8 IL=0010 TR=000332 06002AA8] [aggressive inline attribute] Utf8Formatter:TryFormatUInt32SingleDigit(int,Span`1,byref):bool
          [9 IL=0002 TR=000346 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
        [10 IL=0016 TR=000311 060012AA] [below ALWAYS_INLINE size] IntPtr:get_Size():int
        [11 IL=0025 TR=000319 06002AA3] [aggressive inline attribute] Utf8Formatter:TryFormatInt64MultipleDigits(long,Span`1,byref):bool
          [12 IL=0010 TR=000423 06002A7C] [aggressive inline attribute] FormattingHelpers:CountDigits(long):int
          [13 IL=0019 TR=000430 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
          [14 IL=0052 TR=000462 0600171E] [aggressive inline attribute] Span`1:Slice(int,int):Span`1:this
            [0 IL=0014 TR=000617 06001BEE] [FAILED: does not return] ThrowHelper:ThrowArgumentOutOfRangeException()
            [15 IL=0032 TR=000607 06006C1B] [aggressive inline attribute] Unsafe:Add(byref,long):byref
            [16 IL=0038 TR=000615 0600170A] [aggressive inline attribute] Span`1:.ctor(byref,int):this
          [17 IL=0057 TR=000464 06002A82] [aggressive inline attribute] FormattingHelpers:WriteDigits(long,Span`1)
            [18 IL=0002 TR=000650 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
          [19 IL=0067 TR=000409 06002AA9] [aggressive inline attribute] Utf8Formatter:TryFormatUInt64MultipleDigits(long,Span`1,byref):bool
            [0 IL=0001 TR=000731 06002A7C] [FAILED: inline exceeds budget] FormattingHelpers:CountDigits(long):int
            [20 IL=0010 TR=000738 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
            [21 IL=0030 TR=000752 0600171E] [aggressive inline attribute] Span`1:Slice(int,int):Span`1:this
              [0 IL=0014 TR=000812 06001BEE] [FAILED: does not return] ThrowHelper:ThrowArgumentOutOfRangeException()
              [22 IL=0032 TR=000802 06006C1B] [aggressive inline attribute] Unsafe:Add(byref,long):byref
              [23 IL=0038 TR=000810 0600170A] [aggressive inline attribute] Span`1:.ctor(byref,int):this
            [0 IL=0035 TR=000754 06002A82] [FAILED: inline exceeds budget] FormattingHelpers:WriteDigits(long,Span`1)
      [24 IL=0020 TR=000103 06002A54] [below ALWAYS_INLINE size] StandardFormat:get_Symbol():ushort:this
      [0 IL=0097 TR=000195 06002A56] [FAILED: unprofitable inline] StandardFormat:get_HasPrecision():bool:this
      [0 IL=0104 TR=000217 06001B75] [FAILED: has ldstr VM restriction] SR:get_Argument_GWithPrecisionNotSupported():String
      [0 IL=0109 TR=000224 0600143B] [FAILED: unprofitable inline] NotSupportedException:.ctor(String):this
      [25 IL=0118 TR=000203 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [26 IL=0125 TR=000208 06002AA1] [aggressive inline attribute] Utf8Formatter:TryFormatInt64D(long,ubyte,Span`1,byref):bool
        [0 IL=0018 TR=000868 06002AA6] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64D(long,ubyte,Span`1,bool,byref):bool
      [27 IL=0134 TR=000175 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [28 IL=0141 TR=000180 06002AA1] [aggressive inline attribute] Utf8Formatter:TryFormatInt64D(long,ubyte,Span`1,byref):bool
        [0 IL=0018 TR=000911 06002AA6] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64D(long,ubyte,Span`1,bool,byref):bool
      [29 IL=0150 TR=000121 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [30 IL=0157 TR=000126 06002AA4] [aggressive inline attribute] Utf8Formatter:TryFormatInt64N(long,ubyte,Span`1,byref):bool
      [31 IL=0168 TR=000144 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [0 IL=0176 TR=000150 06002AAB] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64X(long,ubyte,bool,Span`1,byref):bool
      [32 IL=0187 TR=000244 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [0 IL=0195 TR=000250 06002AAB] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64X(long,ubyte,bool,Span`1,byref):bool
      [33 IL=0202 TR=000162 06002A87] [below ALWAYS_INLINE size] FormattingHelpers:TryFormatThrowFormatException(byref):bool
        [0 IL=0003 TR=000994 06001C2D] [FAILED: does not return] ThrowHelper:ThrowFormatException_BadFormatSpecifier()

Note in both cases we fail to do some of the aggressive inlines because of budget checks. We should revisit this pattern of aggressive inlines, and/or figure out how to accommodate this as part of fixing #41692.

PGO/NoPGO have very similar inlines, save that with NoPGO we do one more inline which is likely not a factor here:

;; pgo
      [0 IL=0097 TR=000195 06002A56] [FAILED: unprofitable inline] StandardFormat:get_HasPrecision():bool:this

;; nopgo
      [25 IL=0097 TR=000195 06002A56] [profitable inline] StandardFormat:get_HasPrecision():bool:this

It is going to be difficult to pin down what exactly is leading to the perf loss here, given the size of the method, the substantial differences in generated code, and the lack of good tooling on unix, but I'll dig in and see if I can uncover anything.

AndyAyersMS · 2021-09-22T22:16:16Z

For the key inlinee Utf8Formatter:TryFormat we have virtually no profile data, with just 96 total calls and one path taken.

Have static profile data: 15 schema records (schema at 00007FA61001E558, data at 00007FA61001E520)
Profile summary: 4 runs, 0 block probes, 14 edge probes, 0 class profiles, 0 other records

Reconstructing block counts from sparse edge instrumentation
... adding known edge BB17 -> BB16: weight 0
... adding known edge BB25 -> BB40: weight 0
... adding known edge BB27 -> BB36: weight 0
... adding known edge BB28 -> BB33: weight 0
... adding known edge BB29 -> BB40: weight 0
... adding known edge BB30 -> BB37: weight 0
... adding known edge BB32 -> BB40: weight 0
... adding known edge BB34 -> BB16: weight 0
... adding known edge BB35 -> BB16: weight 0
... adding known edge BB36 -> BB16: weight 96
... adding known edge BB37 -> BB16: weight 0
... adding known edge BB38 -> BB16: weight 0
... adding known edge BB39 -> BB16: weight 0
... adding known edge BB40 -> BB16: weight 0

As a result we consider many of the blocks to be rarely executed.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB16 [0011]  1                           100    100    [000..009)-> BB18 ( cond )                     IBC 
BB17 [0012]  1                             0      0    [009..012)        (return)                     rare IBC 
BB18 [0013]  1                           100    100    [012..01F)-> BB26 ( cond )                     IBC 
BB19 [0014]  1                           100    100    [01F..024)-> BB23 ( cond )                     IBC 
BB20 [0015]  1                           100    100    [024..029)-> BB36 ( cond )                     IBC 
BB21 [0016]  1                             0      0    [029..02E)-> BB33 ( cond )                     rare IBC 
BB22 [0017]  1                             0      0    [02E..033)-> BB40 (always)                     rare IBC 
BB23 [0018]  1                             0      0    [033..038)-> BB37 ( cond )                     rare IBC 
BB24 [0019]  1                             0      0    [038..03D)-> BB39 ( cond )                     rare IBC 
BB25 [0020]  1                             0      0    [03D..042)-> BB40 (always)                     rare IBC 
BB26 [0021]  1                             0      0    [042..047)-> BB30 ( cond )                     rare IBC 
BB27 [0022]  1                             0      0    [047..04C)-> BB36 ( cond )                     rare IBC 
BB28 [0023]  1                             0      0    [04C..051)-> BB33 ( cond )                     rare IBC 
BB29 [0024]  1                             0      0    [051..053)-> BB40 (always)                     rare IBC 
BB30 [0025]  1                             0      0    [053..058)-> BB37 ( cond )                     rare IBC 
BB31 [0026]  1                             0      0    [058..05D)-> BB38 ( cond )                     rare IBC 
BB32 [0027]  1                             0      0    [05D..05F)-> BB40 (always)                     rare IBC 
BB33 [0028]  2                             0      0    [05F..068)-> BB35 ( cond )                     rare IBC 
BB34 [0029]  1                             0      0    [068..073)        (throw )                     rare IBC 
BB35 [0030]  1                             0      0    [073..083)        (return)                     rare IBC 
BB36 [0031]  2                           100    100    [083..093)        (return)                     IBC 
BB37 [0032]  2                             0      0    [093..0A3)        (return)                     rare IBC 
BB38 [0033]  1                             0      0    [0A3..0B6)        (return)                     rare IBC 
BB39 [0034]  1                             0      0    [0B6..0C9)        (return)                     rare IBC 
BB40 [0035]  4                             0      0    [0C9..0D0)        (return)                     rare IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

This likely explains the perf degradation.

This can be addressed in a variety of ways (none of which is easy to do in .NET 6):

- Expand the training set we use for static PGO to include more coverage of this code.
- Rely on synthetic profile data to fill in plausible details for lightly covered methods like this one.
- (Possibly) make the jit skeptical of PGO data that shows low total hit counts in complex methods. Note it is hard for the jit to know when profile data is really representative.
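
As a rough illustration of the last option, a "skepticism" check could compare total hit count against method size. The sketch below is purely hypothetical: the function name `profile_looks_thin` and the threshold of 30 hits per block are invented for illustration and do not reflect anything the jit actually does.

```python
def profile_looks_thin(total_hits, num_blocks, min_hits_per_block=30):
    """Hypothetical heuristic: flag a profile whose total hit count is small
    relative to the number of basic blocks in the method.

    min_hits_per_block is an arbitrary illustrative threshold, not a real
    jit setting.
    """
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    return total_hits < min_hits_per_block * num_blocks


# The inlinee above: 96 total hits spread over 25 blocks (BB16 through BB40).
inlinee_is_thin = profile_looks_thin(96, 25)

# A made-up contrast case: a tiny method with plenty of samples.
small_hot_method_is_thin = profile_looks_thin(10_000, 2)

print(inlinee_is_thin, small_hot_method_is_thin)
```

Even a check like this cannot tell whether the covered path is representative, which is the difficulty noted above.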

AndyAyersMS · 2021-09-22T22:20:23Z

I don't see anything here we can address at this point in .NET 6, so am going to move this out of the 6.0 milestone.

AndyAyersMS · 2021-09-22T23:03:55Z

A few more notes -- as with the case @EgorBo noted above

Method	Job	EnvironmentVariables	Count	Options	Mean	Error	StdDev
Compare_Same_Upper	Job-OOQIED	DOTNET_JitDisablePgo=1	1024	(en-U(...)Case) [26]	917.7 ns	13.85 ns	15.95 ns
Compare_Same_Upper	Job-LRXJIG	DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1	1024	(en-U(...)Case) [26]	1,509.1 ns	10.26 ns	10.98 ns
Compare_Same_Upper	Job-SQKDSE	Empty	1024	(en-U(...)Case) [26]	1,373.2 ns	14.82 ns	15.85 ns

Looking at full PGO for FormatterInt64, we see it helps the 12345 case but not the other two. So there are more mysteries here to sort out; even where we (presumably) have high-quality PGO data we don't seem to benefit.

Default

Method	value	Mean	Error	StdDev	Median	Min	Max	Allocated
FormatterInt64	-9223372036854775808	40.49 ns	0.316 ns	0.280 ns	40.41 ns	40.23 ns	41.20 ns	-
FormatterInt64	12345	18.96 ns	0.163 ns	0.152 ns	18.88 ns	18.80 ns	19.26 ns	-
FormatterInt64	9223372036854775807	49.86 ns	0.152 ns	0.127 ns	49.82 ns	49.74 ns	50.14 ns	-

No PGO

Method	value	Mean	Error	StdDev	Median	Min	Max	Allocated
FormatterInt64	-9223372036854775808	38.07 ns	0.104 ns	0.082 ns	38.04 ns	37.98 ns	38.22 ns	-
FormatterInt64	12345	16.38 ns	0.142 ns	0.126 ns	16.37 ns	16.15 ns	16.59 ns	-
FormatterInt64	9223372036854775807	48.11 ns	0.169 ns	0.141 ns	48.11 ns	47.91 ns	48.39 ns	-

Full PGO

Method	value	Mean	Error	StdDev	Median	Min	Max	Allocated
FormatterInt64	-9223372036854775808	42.35 ns	0.523 ns	0.489 ns	42.48 ns	41.27 ns	42.98 ns	-
FormatterInt64	12345	13.53 ns	0.723 ns	0.832 ns	13.68 ns	10.58 ns	14.48 ns	-
FormatterInt64	9223372036854775807	52.26 ns	1.132 ns	1.059 ns	51.95 ns	50.32 ns	53.92 ns	-
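
The relative effect of Full PGO can be read off the tables above with a little arithmetic. This is a minimal sketch using only the Mean columns from the Default and Full PGO tables; the helper name `percent_change` is an assumption for illustration.

```python
# Mean times in ns, copied from the Default and Full PGO tables above.
default_ns = {
    "-9223372036854775808": 40.49,
    "12345": 18.96,
    "9223372036854775807": 49.86,
}
full_pgo_ns = {
    "-9223372036854775808": 42.35,
    "12345": 13.53,
    "9223372036854775807": 52.26,
}


def percent_change(base, diff):
    """Percentage change from base to diff; negative means diff is faster."""
    return (diff - base) / base * 100.0


changes = {
    value: round(percent_change(default_ns[value], full_pgo_ns[value]), 1)
    for value in default_ns
}

for value, change in changes.items():
    print(f"{value}: {change:+.1f}%")
```

The 12345 case gets roughly 29% faster, while the two extreme values get roughly 5% slower, which matches the observation that Full PGO helps only the 12345 case.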

I wonder how much of this might be that BDN doesn't run the Tier0 code sufficient times or something similar...? Playing around with --warmupCount I get different results for Full PGO:

Full PGO + --warmupCount 100

Method	value	Mean	Error	StdDev	Median	Min	Max	Allocated
FormatterInt64	-9223372036854775808	36.472 ns	0.2137 ns	0.1895 ns	36.347 ns	36.289 ns	36.814 ns	-
FormatterInt64	12345	9.882 ns	0.0709 ns	0.0629 ns	9.878 ns	9.766 ns	9.994 ns	-
FormatterInt64	9223372036854775807	43.894 ns	0.1425 ns	0.1263 ns	43.893 ns	43.696 ns	44.157 ns	-

However, that trick doesn't help the Compare_Same_Upper performance:

Method	Job	EnvironmentVariables	WarmupCount	Count	Options	Mean	Error	StdDev
Compare_Same_Upper	Job-ESPZVK	DOTNET_JitDisablePgo=1	Default	1024	(en-U(...)Case) [26]	876.6 ns	9.96 ns	11.47 ns
Compare_Same_Upper	Job-MAFOCS	DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1	Default	1024	(en-U(...)Case) [26]	1,500.9 ns	5.40 ns	6.00 ns
Compare_Same_Upper	Job-TMUCWV	DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1	100	1024	(en-U(...)Case) [26]	1,510.0 ns	9.58 ns	11.03 ns
Compare_Same_Upper	Job-ZEOHZJ	Empty	Default	1024	(en-U(...)Case) [26]	1,321.5 ns	18.99 ns	21.11 ns

adamsitnik · 2021-10-14T15:54:34Z

I wonder how much of this might be that BDN doesn't run the Tier0 code sufficient times or something similar...?

Usually BDN invokes the code enough times for it to get promoted, but sometimes the background thread used by Tiered JIT does not get a chance to "kick in" and promote things to Tier 1.

#13069

AndyAyersMS · 2022-05-02T18:30:36Z

Performance of Compare_Same_Upper still showing regressions, but FormatterInt64 now back to where it was:

Runtime commit range for the perf drop was 489b034...250fda5 which doesn't show anything relevant.

Perf repo commit range was dotnet/performance@759f8b0...bff6fec which also doesn't show anything relevant.

So guessing FormatterInt64 was some microarchitectural issue.

AndyAyersMS · 2022-06-27T19:58:29Z

All the benchmarks other than Compare_Same_Upper seem to be at or near their best levels. This benchmark seems to be fairly volatile, but it rarely gets to the sub-1000 level that we had back before April 2021 (the one time it dipped down then up was a pair of PGO updates).

More recently there seems to have been a more or less sustained regression in early May 2022 with #68869 where we no longer dip below 1200. This was undone at the end of May via #70144 but we did not completely recover the old perf.

There does not seem to be any good explanation for the recent spike up to 1800 and then drop back down.

AndyAyersMS · 2022-07-23T15:38:54Z

Drilling into Compare_Same_Upper shows the comments above are still relevant (except that we no longer do early block reordering).

The late block reordering scrambles the loop body, interposing non-loop blocks. This sort of reordering is risky unless you have very high confidence in your profile.

This is something I hope we can improve on in .NET 8.
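
The layout problem described above (non-loop blocks interposed in a loop body) can be stated as a simple contiguity check. The sketch below is illustrative only: the block names and layouts are made up, and `loop_is_contiguous` is a hypothetical helper, not a function in the jit.

```python
def loop_is_contiguous(layout, loop_blocks):
    """Return True if the loop's blocks form one unbroken run in layout order.

    layout is a list of block names in emission order (assumed to contain no
    duplicates); loop_blocks is the set of blocks that belong to the loop.
    """
    positions = [i for i, block in enumerate(layout) if block in loop_blocks]
    if not positions:
        return True
    # Contiguous means the span from first to last holds only loop blocks.
    return positions[-1] - positions[0] + 1 == len(positions)


# Hypothetical loop made of three blocks.
loop = {"BB02", "BB03", "BB04"}

# Layout that keeps the loop body together.
compact_layout = ["BB01", "BB02", "BB03", "BB04", "BB05"]

# Layout where a non-loop block (BB05) lands inside the loop body.
scrambled_layout = ["BB01", "BB02", "BB05", "BB03", "BB04"]

print(loop_is_contiguous(compact_layout, loop))
print(loop_is_contiguous(scrambled_layout, loop))
```

In the scrambled layout every trip around the loop has to jump over the interposed block, which is why this kind of reordering only pays off when the profile is trustworthy.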

AndyAyersMS · 2023-04-22T15:09:10Z

Compare_Same_Upper performance is now back to where it was long ago; it looks like #85130 was the change responsible.

Seems like this might not have been spotted by auto filing, going to double-check.

Ah, it was dotnet/perf-autofiling-issues#12928 that we evidently never triaged.

adamsitnik added area-System.Globalization tenet-performance Performance related issue labels Sep 14, 2021

dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Sep 14, 2021

tarekgh added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-System.Globalization labels Sep 16, 2021

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Sep 16, 2021

JulieLeeMSFT assigned AndyAyersMS Sep 16, 2021

JulieLeeMSFT added this to the 6.0.0 milestone Sep 16, 2021

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Sep 16, 2021

adamsitnik mentioned this issue Sep 17, 2021

.NET 6.0 Microbenchmarks Performance Study Report #59272

Closed

19 tasks

AndyAyersMS removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Sep 22, 2021

AndyAyersMS modified the milestones: 6.0.0, 7.0.0 Sep 22, 2021

AndyAyersMS mentioned this issue May 2, 2022

JIT: PGO-based block reordering interferes with loop recognition #67318

Closed

AndyAyersMS added the Regression label Jun 10, 2022

AndyAyersMS modified the milestones: 7.0.0, Future Jul 23, 2022

jeffhandley removed the Regression label Dec 28, 2022

AndyAyersMS closed this as completed Apr 22, 2023

ghost locked as resolved and limited conversation to collaborators May 22, 2023

JulieLeeMSFT added this to .NET Core CodeGen Jun 5, 2024

JulieLeeMSFT moved this to Done in .NET Core CodeGen Jun 5, 2024
