Potential performance regression in OrdinalIgnoreCase string comparison #59087

Closed
adamsitnik opened this issue Sep 14, 2021 · 20 comments
Labels: area-CodeGen-coreclr, tenet-performance

@adamsitnik
Member

adamsitnik commented Sep 14, 2021

Originally detected by the bot in DrewScoggins/performance-2#4549 but not reported in dotnet/runtime (cc @DrewScoggins)

System.Globalization.Tests.StringEquality.Compare_Same_Upper(Count: 1024, Options: (en-US, OrdinalIgnoreCase))

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Same 978.46 1223.95 0.80 +0 bimodal Windows 10.0.19043.1165 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Slower 1405.76 1696.68 0.83 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1390.04 1662.75 0.84 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1228.43 1642.76 0.75 +0 Windows 10.0.18363.1621 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 1689.86 2838.86 0.60 +0 Windows 8.1 X64 Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) 5.0.921.35908 6.0.21.45401
Slower 1366.24 1803.72 0.76 +0 Windows 10.0.19042.685 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701
Slower 1014.33 1552.26 0.65 +0 Windows 10.0.19043.1165 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake) 5.0.921.35908 6.0.21.41701
Slower 1355.82 1964.26 0.69 +0 Windows 10.0.22454 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) 5.0.921.35908 6.0.21.41701
Slower 858.70 1325.30 0.65 +0 Windows 10.0.22451 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) 5.0.921.35908 6.0.21.41701
Slower 984.06 1414.77 0.70 +0 Windows 10.0.19042.1165 X64 Intel Core i9-9900T CPU 2.10GHz 5.0.921.35908 6.0.21.41701
Slower 4173.22 6321.15 0.66 +0 Windows 7 SP1 X64 Intel Core2 Duo CPU T9600 2.80GHz 5.0.721.25508 6.0.21.41701
Slower 1394.92 1628.06 0.86 +0 centos 8 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1371.22 1630.47 0.84 +0 debian 10 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1402.69 1653.62 0.85 +0 rhel 7 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1401.21 1614.74 0.87 +0 sles 15 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1378.12 1623.93 0.85 +0 opensuse-leap 15.3 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 1020.15 1478.16 0.69 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Same 1744.39 2038.20 0.86 +0 bimodal alpine 3.13 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) 5.0.921.35908 6.0.21.41701
Slower 3021.64 3715.93 0.81 +0 ubuntu 16.04 Arm64 Unknown processor 5.0.421.11614 6.0.21.41701
Slower 1809.60 2855.11 0.63 +0 Windows 10.0.19043.1165 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 1987.02 2863.98 0.69 +0 Windows 10.0.22000 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 989.54 1249.97 0.79 +0 several? Windows 10.0.19043.1165 X86 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Slower 1446.38 2158.57 0.67 +0 Windows 10.0.18363.1621 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 1816.65 2931.48 0.62 +0 Windows 10.0.19043.1165 Arm Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 1306.85 1927.69 0.68 +0 macOS Big Sur 11.5.2 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 1161.13 1668.36 0.70 +0 macOS Big Sur 11.5.2 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 1198.18 1747.52 0.69 +0 macOS Big Sur 11.4 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701

Repro:

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f net5.0 net6.0 --filter System.Globalization.Tests.StringEquality.Compare_Same_Upper

A quick look at the historical data

image

and zooming:

image

It might have been caused by PGO (cc @AndyAyersMS @kunalspathak):

f64246c...25f1800

category:performance
theme:benchmarks

dotnet-issue-labeler bot added the untriaged label Sep 14, 2021
@EgorBo
Member

EgorBo commented Sep 14, 2021

Standalone repro:

// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.Collections.Generic;
using System.IO;
using System.Linq;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Environments;
using BenchmarkDotNet.Jobs;

namespace System.Globalization.Tests
{
    [Config(typeof(ConfigWithCustomEnvVars))]
    [DisassemblyDiagnoser(maxDepth: 6, exportDiff: true)]
    public class StringEquality
    {
        private class ConfigWithCustomEnvVars : ManualConfig
        {
            private const string JitNoInline = "COMPlus_JitNoInline";

            public ConfigWithCustomEnvVars()
            {
                AddJob(Job.Default.WithRuntime(CoreRuntime.Core60).WithId("Default"));
                AddJob(Job.Default.WithRuntime(CoreRuntime.Core60)
                    .WithEnvironmentVariables(new EnvironmentVariable("DOTNET_JitDisablePgo", "1"))
                    .WithId("No PGO"));
            }
        }

        private string _value, _same, _sameUpper, _diffAtFirstChar;

        public static IEnumerable<(CultureInfo CultureInfo, CompareOptions CompareOptions)> GetOptions()
        {
            yield return (new CultureInfo("en-US"), CompareOptions.OrdinalIgnoreCase);
        }

        [ParamsSource(nameof(GetOptions))]
        public (CultureInfo CultureInfo, CompareOptions CompareOptions) Options;

        [Params(1024)] // single execution path = single test case
        public int Count;

        [GlobalSetup]
        public void Setup()
        {
            // we are using part of Alice's Adventures in Wonderland text as test data
            char[] characters = File.ReadAllText(@"path\to\alice29.txt").Take(Count).ToArray();
            _value = new string(characters);
            _same = new string(characters);
            _sameUpper = _same.ToUpper();
            char[] copy = characters.ToArray();
            copy[0] = (char)(copy[0] + 1);
            _diffAtFirstChar = new string(copy);
        }

        [Benchmark] // the most work to do for IgnoreCase: every char needs to be compared and uppercased
        public int Compare_Same_Upper() => Options.CultureInfo.CompareInfo.Compare(_value, _sameUpper, Options.CompareOptions);

        public static void Main() => BenchmarkRunner.Run<StringEquality>();
    }
}

Asmdiff (NoPGO is on the left): https://www.diffchecker.com/2guAUS5i

@EgorBo
Member

EgorBo commented Sep 14, 2021

The most interesting part is that FullPGO is still slower than NoPGO mode cc @AndyAyersMS (I've not investigated why yet):


|             Method |              Job |         EnvironmentVariables | Count |              Options |     Mean |    Error |  StdDev |
|------------------- |----------------- |----------------------------- |------ |--------------------- |---------:|---------:|--------:|
| Compare_Same_Upper |          FullPGO | DOTNET_TC_QuickJitForLoops=1 |  1024 | (en-U(...)Case) [26] | 908.5 ns |  2.94 ns | 2.45 ns |
| Compare_Same_Upper |           No PGO |       DOTNET_JitDisablePgo=1 |  1024 | (en-U(...)Case) [26] | 869.2 ns |  9.12 ns | 8.53 ns |
| Compare_Same_Upper |          Default |                        Empty |  1024 | (en-U(...)Case) [26] | 948.0 ns | 10.51 ns | 9.83 ns |

@adamsitnik
Member Author

Another benchmark that was most likely affected by this change:

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Globalization.Tests.StringEquality.Compare_DifferentFirstChar(Count%3a%201024%2c%20Options%3a%20(en-US%2c%20Ordinal)).html

image

But in this particular case it is mainly Unix-like systems that were affected:

System.Globalization.Tests.StringEquality.Compare_DifferentFirstChar(Count: 1024, Options: (en-US, Ordinal))

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Same 10.67 10.41 1.03 +0 Windows 10.0.19043.1165 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Same 13.61 13.19 1.03 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 13.59 13.19 1.03 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 10.84 9.88 1.10 +0 Windows 10.0.18363.1621 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Faster 17.97 15.89 1.13 +0 Windows 8.1 X64 Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) 5.0.921.35908 6.0.21.45401
Same 11.84 11.12 1.07 +0 Windows 10.0.19042.685 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701
Same 10.48 9.42 1.11 +0 several? Windows 10.0.19043.1165 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake) 5.0.921.35908 6.0.21.41701
Same 14.24 15.04 0.95 +0 Windows 10.0.22454 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) 5.0.921.35908 6.0.21.41701
Same 9.66 9.74 0.99 +0 Windows 10.0.22451 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) 5.0.921.35908 6.0.21.41701
Same 10.44 9.05 1.15 +0 Windows 10.0.19042.1165 X64 Intel Core i9-9900T CPU 2.10GHz 5.0.921.35908 6.0.21.41701
Slower 19.70 29.32 0.67 +0 Windows 7 SP1 X64 Intel Core2 Duo CPU T9600 2.80GHz 5.0.721.25508 6.0.21.41701
Slower 17.37 21.38 0.81 +0 centos 8 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 13.29 20.88 0.64 +0 debian 10 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 17.48 20.97 0.83 +0 rhel 7 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 13.56 21.21 0.64 +0 sles 15 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 13.25 22.69 0.58 +0 opensuse-leap 15.3 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 11.09 15.13 0.73 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 17.81 23.91 0.74 +0 ubuntu 18.04 X64 Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) 5.0.921.35908 6.0.21.41701
Slower 11.35 15.96 0.71 +0 alpine 3.13 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) 5.0.921.35908 6.0.21.41701
Slower 43.14 54.98 0.78 +0 ubuntu 16.04 Arm64 Unknown processor 5.0.421.11614 6.0.21.41701
Faster 13.65 11.51 1.19 +0 Windows 10.0.19043.1165 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Faster 14.64 9.85 1.49 +0 Windows 10.0.22000 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Same 11.37 10.14 1.12 +0 Windows 10.0.19043.1165 X86 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Same 13.42 13.76 0.98 +0 Windows 10.0.18363.1621 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 15.32 18.75 0.82 +0 Windows 10.0.19043.1165 Arm Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 14.28 19.23 0.74 +0 macOS Big Sur 11.5.2 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 12.55 16.66 0.75 +0 macOS Big Sur 11.5.2 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 13.13 17.43 0.75 +0 macOS Big Sur 11.4 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701

@adamsitnik
Member Author

One more Unix-specific benchmark that regressed at the same time:

System.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt64(value: 12345)

Result Base Diff Ratio Alloc Delta Modality Operating System Bit Processor Name Base V Diff V
Same 8.43 7.11 1.18 +0 Windows 10.0.19043.1165 X64 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Same 10.14 9.51 1.07 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 10.16 9.45 1.08 +0 Windows 10.0.20348 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 10.38 10.14 1.02 +0 Windows 10.0.18363.1621 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Same 16.68 15.04 1.11 +0 Windows 8.1 X64 Intel Core i7-3610QM CPU 2.30GHz (Ivy Bridge) 5.0.921.35908 6.0.21.45401
Same 11.61 11.18 1.04 +0 Windows 10.0.19042.685 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701
Same 9.83 8.78 1.12 +0 Windows 10.0.19043.1165 X64 Intel Core i7-6700 CPU 3.40GHz (Skylake) 5.0.921.35908 6.0.21.41701
Same 12.85 13.06 0.98 +0 several? Windows 10.0.22454 X64 Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R) 5.0.921.35908 6.0.21.41701
Same 9.21 8.65 1.06 +0 Windows 10.0.22451 X64 Intel Core i7-8700 CPU 3.20GHz (Coffee Lake) 5.0.921.35908 6.0.21.41701
Same 10.19 9.32 1.09 +0 Windows 10.0.19042.1165 X64 Intel Core i9-9900T CPU 2.10GHz 5.0.921.35908 6.0.21.41701
Slower 21.23 36.56 0.58 +0 Windows 7 SP1 X64 Intel Core2 Duo CPU T9600 2.80GHz 5.0.721.25508 6.0.21.41701
Slower 9.41 12.35 0.76 +0 centos 8 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 10.37 11.61 0.89 +0 debian 10 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 9.55 12.51 0.76 +0 several? rhel 7 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 10.13 11.76 0.86 +0 sles 15 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Slower 9.42 11.69 0.81 +0 opensuse-leap 15.3 X64 AMD EPYC 7452 5.0.921.35908 6.0.21.41701
Same 10.05 12.02 0.84 +0 ubuntu 18.04 X64 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 14.97 19.07 0.79 +0 ubuntu 18.04 X64 Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge) 5.0.921.35908 6.0.21.41701
Slower 9.25 11.46 0.81 +0 alpine 3.13 X64 Intel Core i7-7700 CPU 3.60GHz (Kaby Lake) 5.0.921.35908 6.0.21.41701
Same 28.50 26.57 1.07 +0 ubuntu 16.04 Arm64 Unknown processor 5.0.421.11614 6.0.21.41701
Same 11.26 11.56 0.97 +0 Windows 10.0.19043.1165 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Same 13.15 11.53 1.14 +0 Windows 10.0.22000 Arm64 Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Same 8.34 9.77 0.85 +0 Windows 10.0.19043.1165 X86 AMD Ryzen Threadripper PRO 3945WX 12-Cores 5.0.921.35908 6.0.21.41701
Same 12.32 13.57 0.91 +0 several? Windows 10.0.18363.1621 X86 Intel Xeon CPU E5-1650 v4 3.60GHz 5.0.921.35908 6.0.21.41701
Slower 35.63 48.83 0.73 +0 Windows 10.0.19043.1165 Arm Microsoft SQ1 3.0 GHz 5.0.921.35908 6.0.21.41701
Slower 13.01 15.58 0.84 +0 macOS Big Sur 11.5.2 X64 Intel Core i5-4278U CPU 2.60GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 11.19 13.47 0.83 +0 macOS Big Sur 11.5.2 X64 Intel Core i7-4870HQ CPU 2.50GHz (Haswell) 5.0.921.35908 6.0.21.41701
Slower 11.44 14.51 0.79 +0 macOS Big Sur 11.4 X64 Intel Core i7-5557U CPU 3.10GHz (Broadwell) 5.0.921.35908 6.0.21.41701

https://pvscmdupload.blob.core.windows.net/reports/allTestHistory%2frefs%2fheads%2fmain_x64_ubuntu%2018.04%2fSystem.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt64(value%3a%2012345).html

image

tarekgh added the area-CodeGen-coreclr label and removed the area-System.Globalization label Sep 16, 2021

JulieLeeMSFT removed the untriaged label Sep 16, 2021
JulieLeeMSFT added this to the 6.0.0 milestone Sep 16, 2021
JulieLeeMSFT added the needs-further-triage label Sep 16, 2021
@AndyAyersMS
Member

Looking at the first of these... will update with details.

@AndyAyersMS
Member

There seem to be several issues in Compare_Same_Upper.

First is that the PGO training data does not see many cases of case-insensitive comparison. So in the compound expression

((charA | 0x20) == (charB | 0x20) &&
(uint)((charA | 0x20) - 'a') <= (uint)('z' - 'a'))

The first clause is true with likelihood 0.97. This means that the CSE of charA | 0x20 is not deemed profitable by the JIT, as the second use is considered quite infrequent.

In the benchmark, however, this code path is taken quite often, and so PGO imposes an extra cost. One possible fix here is to broaden the set of inputs we see during PGO training, though it's also possible the likelihoods we see now are realistic.

We should arguably revisit the CSE costing heuristics as it does not seem like doing this extra CSE would actually cause any issues.
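
To make the CSE point concrete, here is a minimal C# sketch of the compound expression written two ways: once as quoted above, and once with `charA | 0x20` hoisted into a local by hand, which is roughly the effect the CSE would have. The class and method names are hypothetical and this is not the runtime's actual code; it only illustrates that both forms compute the same result, so the hoist is purely a costing question.

```csharp
using System;

// Hypothetical illustration; not the actual runtime implementation.
public static class CseSketch
{
    // As quoted in the thread: (charA | 0x20) appears twice, so the second
    // occurrence is a candidate for common subexpression elimination.
    public static bool EqualLettersIgnoreCase(char charA, char charB)
    {
        return (charA | 0x20) == (charB | 0x20) &&
               unchecked((uint)((charA | 0x20) - 'a')) <= (uint)('z' - 'a');
    }

    // The same check with the shared subexpression hoisted by hand.
    public static bool EqualLettersIgnoreCaseHoisted(char charA, char charB)
    {
        int lowerA = charA | 0x20; // computed once, used twice
        return lowerA == (charB | 0x20) &&
               unchecked((uint)(lowerA - 'a')) <= (uint)('z' - 'a');
    }
}
```

The second clause is what rejects non-letters: for example `'@' | 0x20` and `` '`' | 0x20 `` are equal, but the range check fails, so the pair is not treated as a case-insensitive match.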

Second is that PGO data leads to poor layout decisions. The JIT's block layout algorithm is locally greedy, and this leads to globally sub-optimal layouts. One such example happens early in the method where there is a control flow diamond; with PGO we see this diamond is biased with one block at 0.81 likelihood. Another happens later in the code where a PGO-rare path in a loop is moved and requires several jumps to rejoin the flow. In general the JIT is too aggressive in moving lower-frequency blocks out of line; doing so inhibits opportunities for jump elimination later on.

There is work anticipated here in .NET 7 but no easy fix in the meantime.

Third is that the early block reordering done by the JIT interferes with loop recognition, and the JIT mistakenly thinks there are two loops in the method (instead of one multi-exit loop). While concerning, it doesn't seem to cause problems here, as the loops are not currently optimizable. But the fact that the JIT does the reordering so early is indicative of two shortcomings: (1) loop recognition is too pattern sensitive; (2) optimizing block order should generally wait until later in the phase pipeline.

This is also something we've seen elsewhere and may try to address in .NET 7.

Compare_DifferentFirstChar hits these same code paths (though just one pass through, not many) and so likely has the same root cause.

@AndyAyersMS
Member

I can repro the FormatterInt64 regression on x64 Linux, but for some reason the disassembly diagnoser is failing for the default config, so I haven't been able to compare codegen yet...

Unhandled exception. System.NotSupportedException: Unknown Acknowledgment: 
   at BenchmarkDotNet.Engines.ConsoleHost.SendSignal(HostSignal hostSignal)
   at BenchmarkDotNet.Engines.HostExtensions.AfterAll(IHost host)
   at BenchmarkDotNet.Autogenerated.UniqueProgramName.AfterAssemblyLoadingAttached(String[] args) in /home/andy/repos/performance/artifacts/bin/MicroBenchmarks/Release/net6.0/b1743b7d-d6d1-4487-b16d-dab5a6be6e52/b1743b7d-d6d1-4487-b16d-dab5a6be6e52.notcs:line 77
   at BenchmarkDotNet.Autogenerated.UniqueProgramName.Main(String[] args) in /home/andy/repos/performance/artifacts/bin/MicroBenchmarks/Release/net6.0/b1743b7d-d6d1-4487-b16d-dab5a6be6e52/b1743b7d-d6d1-4487-b16d-dab5a6be6e52.notcs:line 24

The other cases (longer values to format) also show small regressions.

@AndyAyersMS
Member

Can repro FormatterInt64 regression with corerun & checked jit dropped into release build.

Using COMPlus_JitDisablePgo=1, I can verify that the regression is related to the incorporation of PGO data.

The 3 key methods are Utf8FormatterTests:FormatterInt64, WriteDigits and CountDigits. PGO modifies codegen for all three, but the changes in WriteDigits and CountDigits do not seem to be related to the regressions seen here (via a modified setup where I can selectively suppress PGO per method).

So regression seems to be coming from changes in Utf8FormatterTests:FormatterInt64. This is a fairly large method and the resulting codegen is quite different. The root method has no static PGO and a loop so it bypasses tiering. A number of inlinees have PGO data, and there are a whole lot of aggressive inlines happening:

Inlines into 06001894 [via ExtendedDefaultPolicy] Utf8FormatterTests:FormatterInt64(long):bool:this
  [1 IL=0007 TR=000003 06001711] [below ALWAYS_INLINE size] Span`1:op_Implicit(ref):Span`1
    [2 IL=0001 TR=000029 06001707] [aggressive inline attribute] Span`1:.ctor(ref):this
      [3 IL=0058 TR=000046 06004B30] [aggressive inline attribute] MemoryMarshal:GetArrayDataReference(ref):byref
  [4 IL=0023 TR=000012 06002A9F] [below ALWAYS_INLINE size] Utf8Formatter:TryFormat(long,Span`1,byref,StandardFormat):bool
    [5 IL=0006 TR=000080 06002AA0] [aggressive inline attribute] Utf8Formatter:TryFormatInt64(long,long,Span`1,byref,StandardFormat):bool
      [6 IL=0002 TR=000096 06002A57] [profitable inline] StandardFormat:get_IsDefault():bool:this
      [7 IL=0012 TR=000272 06002AA2] [aggressive inline attribute] Utf8Formatter:TryFormatInt64Default(long,Span`1,byref):bool
        [8 IL=0010 TR=000332 06002AA8] [aggressive inline attribute] Utf8Formatter:TryFormatUInt32SingleDigit(int,Span`1,byref):bool
          [9 IL=0002 TR=000346 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
        [10 IL=0016 TR=000311 060012AA] [below ALWAYS_INLINE size] IntPtr:get_Size():int
        [11 IL=0025 TR=000319 06002AA3] [aggressive inline attribute] Utf8Formatter:TryFormatInt64MultipleDigits(long,Span`1,byref):bool
          [12 IL=0010 TR=000423 06002A7C] [aggressive inline attribute] FormattingHelpers:CountDigits(long):int
          [13 IL=0019 TR=000430 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
          [14 IL=0052 TR=000462 0600171E] [aggressive inline attribute] Span`1:Slice(int,int):Span`1:this
            [0 IL=0014 TR=000617 06001BEE] [FAILED: does not return] ThrowHelper:ThrowArgumentOutOfRangeException()
            [15 IL=0032 TR=000607 06006C1B] [aggressive inline attribute] Unsafe:Add(byref,long):byref
            [16 IL=0038 TR=000615 0600170A] [aggressive inline attribute] Span`1:.ctor(byref,int):this
          [17 IL=0057 TR=000464 06002A82] [aggressive inline attribute] FormattingHelpers:WriteDigits(long,Span`1)
            [18 IL=0002 TR=000650 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
          [19 IL=0067 TR=000409 06002AA9] [aggressive inline attribute] Utf8Formatter:TryFormatUInt64MultipleDigits(long,Span`1,byref):bool
            [0 IL=0001 TR=000731 06002A7C] [FAILED: inline exceeds budget] FormattingHelpers:CountDigits(long):int
            [20 IL=0010 TR=000738 0600170C] [below ALWAYS_INLINE size] Span`1:get_Length():int:this
            [21 IL=0030 TR=000752 0600171E] [aggressive inline attribute] Span`1:Slice(int,int):Span`1:this
              [0 IL=0014 TR=000812 06001BEE] [FAILED: does not return] ThrowHelper:ThrowArgumentOutOfRangeException()
              [22 IL=0032 TR=000802 06006C1B] [aggressive inline attribute] Unsafe:Add(byref,long):byref
              [23 IL=0038 TR=000810 0600170A] [aggressive inline attribute] Span`1:.ctor(byref,int):this
            [0 IL=0035 TR=000754 06002A82] [FAILED: inline exceeds budget] FormattingHelpers:WriteDigits(long,Span`1)
      [24 IL=0020 TR=000103 06002A54] [below ALWAYS_INLINE size] StandardFormat:get_Symbol():ushort:this
      [0 IL=0097 TR=000195 06002A56] [FAILED: unprofitable inline] StandardFormat:get_HasPrecision():bool:this
      [0 IL=0104 TR=000217 06001B75] [FAILED: has ldstr VM restriction] SR:get_Argument_GWithPrecisionNotSupported():String
      [0 IL=0109 TR=000224 0600143B] [FAILED: unprofitable inline] NotSupportedException:.ctor(String):this
      [25 IL=0118 TR=000203 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [26 IL=0125 TR=000208 06002AA1] [aggressive inline attribute] Utf8Formatter:TryFormatInt64D(long,ubyte,Span`1,byref):bool
        [0 IL=0018 TR=000868 06002AA6] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64D(long,ubyte,Span`1,bool,byref):bool
      [27 IL=0134 TR=000175 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [28 IL=0141 TR=000180 06002AA1] [aggressive inline attribute] Utf8Formatter:TryFormatInt64D(long,ubyte,Span`1,byref):bool
        [0 IL=0018 TR=000911 06002AA6] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64D(long,ubyte,Span`1,bool,byref):bool
      [29 IL=0150 TR=000121 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [30 IL=0157 TR=000126 06002AA4] [aggressive inline attribute] Utf8Formatter:TryFormatInt64N(long,ubyte,Span`1,byref):bool
      [31 IL=0168 TR=000144 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [0 IL=0176 TR=000150 06002AAB] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64X(long,ubyte,bool,Span`1,byref):bool
      [32 IL=0187 TR=000244 06002A55] [below ALWAYS_INLINE size] StandardFormat:get_Precision():ubyte:this
      [0 IL=0195 TR=000250 06002AAB] [FAILED: inline exceeds budget] Utf8Formatter:TryFormatUInt64X(long,ubyte,bool,Span`1,byref):bool
      [33 IL=0202 TR=000162 06002A87] [below ALWAYS_INLINE size] FormattingHelpers:TryFormatThrowFormatException(byref):bool
        [0 IL=0003 TR=000994 06001C2D] [FAILED: does not return] ThrowHelper:ThrowFormatException_BadFormatSpecifier()

Note in both cases we fail to do some of the aggressive inlines because of budget checks. We should revisit this pattern of aggressive inlines, and/or figure out how to accommodate this as part of fixing #41692.
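
When comparing two inline dumps like the one above, it can help to tally the failure reasons rather than eyeball them. The following is a minimal sketch, assuming the log lines use the `[FAILED: reason]` format shown in the dump; `InlineLogSketch` and `CountFailures` are hypothetical names, not part of any JIT tooling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for reading a pasted inline dump; not part of the JIT.
public static class InlineLogSketch
{
    // Matches the bracketed failure note, e.g. "[FAILED: inline exceeds budget]".
    private static readonly Regex FailedReason = new Regex(@"\[FAILED: ([^\]]+)\]");

    // Returns how many times each failure reason appears in the given lines.
    public static Dictionary<string, int> CountFailures(IEnumerable<string> logLines)
    {
        var counts = new Dictionary<string, int>();
        foreach (string line in logLines)
        {
            Match m = FailedReason.Match(line);
            if (!m.Success) continue; // successful inlines carry no FAILED note

            string reason = m.Groups[1].Value;
            counts.TryGetValue(reason, out int seen); // seen stays 0 when absent
            counts[reason] = seen + 1;
        }
        return counts;
    }
}
```

Running it over the PGO and NoPGO dumps separately and diffing the two dictionaries would surface differences such as the `get_HasPrecision` case without reading every line.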

PGO/NoPGO have very similar inlines, save that with NoPGO we do one more inline which is likely not a factor here:

;; pgo
      [0 IL=0097 TR=000195 06002A56] [FAILED: unprofitable inline] StandardFormat:get_HasPrecision():bool:this

;; nopgo
      [25 IL=0097 TR=000195 06002A56] [profitable inline] StandardFormat:get_HasPrecision():bool:this

It is going to be difficult to pin down what exactly is leading to the perf loss here, given the size of the method, the substantial differences in generated code, and the lack of good tooling on unix, but I'll dig in and see if I can uncover anything.

@AndyAyersMS
Member

For the key inlinee Utf8Formatter:TryFormat we have virtually no profile data, with just 96 total calls and one path taken.

Have static profile data: 15 schema records (schema at 00007FA61001E558, data at 00007FA61001E520)
Profile summary: 4 runs, 0 block probes, 14 edge probes, 0 class profiles, 0 other records

Reconstructing block counts from sparse edge instrumentation
... adding known edge BB17 -> BB16: weight 0
... adding known edge BB25 -> BB40: weight 0
... adding known edge BB27 -> BB36: weight 0
... adding known edge BB28 -> BB33: weight 0
... adding known edge BB29 -> BB40: weight 0
... adding known edge BB30 -> BB37: weight 0
... adding known edge BB32 -> BB40: weight 0
... adding known edge BB34 -> BB16: weight 0
... adding known edge BB35 -> BB16: weight 0
... adding known edge BB36 -> BB16: weight 96
... adding known edge BB37 -> BB16: weight 0
... adding known edge BB38 -> BB16: weight 0
... adding known edge BB39 -> BB16: weight 0
... adding known edge BB40 -> BB16: weight 0
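
As a quick sanity check on the "96 total calls and one path taken" reading, the edge lines above can be tallied mechanically. This is a minimal sketch that only sums the weights printed in the dump; it is not how the JIT reconstructs block counts, and `EdgeProfileSketch` and `Summarize` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for reading a pasted JIT dump; not part of the JIT.
public static class EdgeProfileSketch
{
    // Matches lines such as "... adding known edge BB36 -> BB16: weight 96".
    private static readonly Regex EdgeLine =
        new Regex(@"adding known edge (BB\d+) -> (BB\d+): weight (\d+)");

    // Returns the number of edge lines, how many have non-zero weight,
    // and the sum of all printed weights.
    public static (int Edges, int NonZeroEdges, long TotalWeight) Summarize(IEnumerable<string> dumpLines)
    {
        int edges = 0;
        int nonZero = 0;
        long total = 0;
        foreach (string line in dumpLines)
        {
            Match m = EdgeLine.Match(line);
            if (!m.Success) continue; // skip anything that is not an edge line

            long weight = long.Parse(m.Groups[3].Value);
            edges++;
            total += weight;
            if (weight > 0) nonZero++;
        }
        return (edges, nonZero, total);
    }
}
```

For the dump above, a single non-zero edge carrying the whole weight is what one would expect if every recorded call took the same path.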

As a result we consider many of the blocks to be rarely executed.

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd                 weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB16 [0011]  1                           100    100    [000..009)-> BB18 ( cond )                     IBC 
BB17 [0012]  1                             0      0    [009..012)        (return)                     rare IBC 
BB18 [0013]  1                           100    100    [012..01F)-> BB26 ( cond )                     IBC 
BB19 [0014]  1                           100    100    [01F..024)-> BB23 ( cond )                     IBC 
BB20 [0015]  1                           100    100    [024..029)-> BB36 ( cond )                     IBC 
BB21 [0016]  1                             0      0    [029..02E)-> BB33 ( cond )                     rare IBC 
BB22 [0017]  1                             0      0    [02E..033)-> BB40 (always)                     rare IBC 
BB23 [0018]  1                             0      0    [033..038)-> BB37 ( cond )                     rare IBC 
BB24 [0019]  1                             0      0    [038..03D)-> BB39 ( cond )                     rare IBC 
BB25 [0020]  1                             0      0    [03D..042)-> BB40 (always)                     rare IBC 
BB26 [0021]  1                             0      0    [042..047)-> BB30 ( cond )                     rare IBC 
BB27 [0022]  1                             0      0    [047..04C)-> BB36 ( cond )                     rare IBC 
BB28 [0023]  1                             0      0    [04C..051)-> BB33 ( cond )                     rare IBC 
BB29 [0024]  1                             0      0    [051..053)-> BB40 (always)                     rare IBC 
BB30 [0025]  1                             0      0    [053..058)-> BB37 ( cond )                     rare IBC 
BB31 [0026]  1                             0      0    [058..05D)-> BB38 ( cond )                     rare IBC 
BB32 [0027]  1                             0      0    [05D..05F)-> BB40 (always)                     rare IBC 
BB33 [0028]  2                             0      0    [05F..068)-> BB35 ( cond )                     rare IBC 
BB34 [0029]  1                             0      0    [068..073)        (throw )                     rare IBC 
BB35 [0030]  1                             0      0    [073..083)        (return)                     rare IBC 
BB36 [0031]  2                           100    100    [083..093)        (return)                     IBC 
BB37 [0032]  2                             0      0    [093..0A3)        (return)                     rare IBC 
BB38 [0033]  1                             0      0    [0A3..0B6)        (return)                     rare IBC 
BB39 [0034]  1                             0      0    [0B6..0C9)        (return)                     rare IBC 
BB40 [0035]  4                             0      0    [0C9..0D0)        (return)                     rare IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

This likely explains the perf degradation.
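To make the dump above easier to read, here is a toy sketch (not the JIT's actual reconstruction algorithm) that tallies the known edge counts. It assumes that each edge ending at BB16 is a pseudo-edge from a return/throw block back to the inlinee entry, so the sum of those edges is the call count; the helper names are made up for illustration.

```python
# Toy model only: NOT the JIT's real block-count reconstruction.
# Edge weights are copied verbatim from the dump above.
# Assumption: every edge that targets BB16 is a pseudo-edge from an
# exit block (return/throw) back to the inlinee's entry block.
known_edges = {
    ("BB17", "BB16"): 0,
    ("BB25", "BB40"): 0,
    ("BB27", "BB36"): 0,
    ("BB28", "BB33"): 0,
    ("BB29", "BB40"): 0,
    ("BB30", "BB37"): 0,
    ("BB32", "BB40"): 0,
    ("BB34", "BB16"): 0,
    ("BB35", "BB16"): 0,
    ("BB36", "BB16"): 96,
    ("BB37", "BB16"): 0,
    ("BB38", "BB16"): 0,
    ("BB39", "BB16"): 0,
    ("BB40", "BB16"): 0,
}

ENTRY = "BB16"

# Weight leaving each exit block via its pseudo-edge to the entry.
exit_weights = {src: w for (src, dst), w in known_edges.items() if dst == ENTRY}

# Total observed calls = sum of all exit pseudo-edges.
entry_calls = sum(exit_weights.values())

# Exit blocks that were ever reached during training.
hot_exits = [block for block, w in exit_weights.items() if w > 0]

# Scale so that the entry block has weight 100, as in the block table.
normalized = {block: 100 * w / entry_calls for block, w in exit_weights.items()}

print(f"probes: {len(known_edges)}, calls: {entry_calls}, hot exits: {hot_exits}")
```

Under these assumptions only one of the eight exit paths ever carries weight, which is why every other exit block ends up marked rare.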

This can be addressed in a variety of ways (none of which are easy to do in .NET 6).

  • Expand the training set we use for static PGO to include more coverage in this code.
  • Rely on synthetic profile data to fill in plausible details for lightly-covered methods like this one
  • (possibly) make the jit skeptical of PGO that shows low total hit counts in complex methods. Note it is hard for the jit to know when profile data is really representative.
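As a rough illustration of the last option, a "skeptical" check could compare the total hit count against the size of the method. This is a hypothetical heuristic, not anything the jit implements; the function name and the threshold of 30 hits per block are invented for the example.

```python
def profile_looks_trustworthy(total_entry_count: int,
                              num_blocks: int,
                              min_hits_per_block: int = 30) -> bool:
    """Hypothetical heuristic: trust a profile only if the method was
    entered often enough relative to how many blocks it has.

    The default threshold is an arbitrary illustrative value.
    """
    return total_entry_count >= min_hits_per_block * num_blocks


# The inlinee above: 96 calls spread over 25 blocks (BB16..BB40).
print(profile_looks_trustworthy(96, 25))
```

A check like this would flag the 96-call profile as too thin, but as noted it cannot tell whether a well-sampled profile is actually representative.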

@AndyAyersMS
Member

I don't see anything here we can address at this point in .NET 6, so am going to move this out of the 6.0 milestone.

@AndyAyersMS removed the needs-further-triage label (Issue has been initially triaged, but needs deeper consideration or reconsideration) Sep 22, 2021
@AndyAyersMS modified the milestones: from 6.0.0 to 7.0.0 Sep 22, 2021
@AndyAyersMS
Member

A few more notes -- as with the case @EgorBo noted above

| Method | Job | EnvironmentVariables | Count | Options | Mean | Error | StdDev |
|---|---|---|---|---|---|---|---|
| Compare_Same_Upper | Job-OOQIED | DOTNET_JitDisablePgo=1 | 1024 | (en-U(...)Case) [26] | 917.7 ns | 13.85 ns | 15.95 ns |
| Compare_Same_Upper | Job-LRXJIG | DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1 | 1024 | (en-U(...)Case) [26] | 1,509.1 ns | 10.26 ns | 10.98 ns |
| Compare_Same_Upper | Job-SQKDSE | Empty | 1024 | (en-U(...)Case) [26] | 1,373.2 ns | 14.82 ns | 15.85 ns |

Looking at full pgo, we see it helps the 12345 case but not the other two. So more mysteries here to sort out; even if we see (presumably) high quality PGO data we don't seem to benefit.

Default

| Method | value | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| FormatterInt64 | -9223372036854775808 | 40.49 ns | 0.316 ns | 0.280 ns | 40.41 ns | 40.23 ns | 41.20 ns | - |
| FormatterInt64 | 12345 | 18.96 ns | 0.163 ns | 0.152 ns | 18.88 ns | 18.80 ns | 19.26 ns | - |
| FormatterInt64 | 9223372036854775807 | 49.86 ns | 0.152 ns | 0.127 ns | 49.82 ns | 49.74 ns | 50.14 ns | - |

No PGO

| Method | value | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| FormatterInt64 | -9223372036854775808 | 38.07 ns | 0.104 ns | 0.082 ns | 38.04 ns | 37.98 ns | 38.22 ns | - |
| FormatterInt64 | 12345 | 16.38 ns | 0.142 ns | 0.126 ns | 16.37 ns | 16.15 ns | 16.59 ns | - |
| FormatterInt64 | 9223372036854775807 | 48.11 ns | 0.169 ns | 0.141 ns | 48.11 ns | 47.91 ns | 48.39 ns | - |

Full PGO

| Method | value | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| FormatterInt64 | -9223372036854775808 | 42.35 ns | 0.523 ns | 0.489 ns | 42.48 ns | 41.27 ns | 42.98 ns | - |
| FormatterInt64 | 12345 | 13.53 ns | 0.723 ns | 0.832 ns | 13.68 ns | 10.58 ns | 14.48 ns | - |
| FormatterInt64 | 9223372036854775807 | 52.26 ns | 1.132 ns | 1.059 ns | 51.95 ns | 50.32 ns | 53.92 ns | - |
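For a quick side-by-side, the Mean values from the three FormatterInt64 tables above can be turned into ratios against the No PGO run. This is just arithmetic on the numbers already posted; the dictionary layout and variable names are illustrative.

```python
# Mean times in nanoseconds, copied from the Default / No PGO / Full PGO
# tables above. Keys are the benchmark's `value` argument.
means_ns = {
    "Default": {
        "-9223372036854775808": 40.49,
        "12345": 18.96,
        "9223372036854775807": 49.86,
    },
    "No PGO": {
        "-9223372036854775808": 38.07,
        "12345": 16.38,
        "9223372036854775807": 48.11,
    },
    "Full PGO": {
        "-9223372036854775808": 42.35,
        "12345": 13.53,
        "9223372036854775807": 52.26,
    },
}

baseline = means_ns["No PGO"]

# Ratio > 1 means slower than No PGO; ratio < 1 means faster.
ratios = {
    config: {value: mean / baseline[value] for value, mean in rows.items()}
    for config, rows in means_ns.items()
}

for config, rows in ratios.items():
    for value, ratio in rows.items():
        print(f"{config:>8} {value:>21} {ratio:.2f}x of No PGO")
```

Only the 12345 case under Full PGO comes out below 1.0; every Default row and the other two Full PGO rows are slower than No PGO.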

I wonder how much of this might be that BDN doesn't run the Tier0 code sufficient times or something similar...? Playing around with --warmupCount I get different results for Full PGO:

Full PGO + --warmupCount 100

| Method | value | Mean | Error | StdDev | Median | Min | Max | Allocated |
|---|---|---|---|---|---|---|---|---|
| FormatterInt64 | -9223372036854775808 | 36.472 ns | 0.2137 ns | 0.1895 ns | 36.347 ns | 36.289 ns | 36.814 ns | - |
| FormatterInt64 | 12345 | 9.882 ns | 0.0709 ns | 0.0629 ns | 9.878 ns | 9.766 ns | 9.994 ns | - |
| FormatterInt64 | 9223372036854775807 | 43.894 ns | 0.1425 ns | 0.1263 ns | 43.893 ns | 43.696 ns | 44.157 ns | - |

however that trick doesn't help the Compare_Same_Upper performance:

| Method | Job | EnvironmentVariables | WarmupCount | Count | Options | Mean | Error | StdDev |
|---|---|---|---|---|---|---|---|---|
| Compare_Same_Upper | Job-ESPZVK | DOTNET_JitDisablePgo=1 | Default | 1024 | (en-U(...)Case) [26] | 876.6 ns | 9.96 ns | 11.47 ns |
| Compare_Same_Upper | Job-MAFOCS | DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1 | Default | 1024 | (en-U(...)Case) [26] | 1,500.9 ns | 5.40 ns | 6.00 ns |
| Compare_Same_Upper | Job-TMUCWV | DOTNET_ReadyToRun=0,DOTNET_TC_QuickJitForLoops=1,DOTNET_TieredPGO=1 | 100 | 1024 | (en-U(...)Case) [26] | 1,510.0 ns | 9.58 ns | 11.03 ns |
| Compare_Same_Upper | Job-ZEOHZJ | Empty | Default | 1024 | (en-U(...)Case) [26] | 1,321.5 ns | 18.99 ns | 21.11 ns |

@adamsitnik
Member Author

> I wonder how much of this might be that BDN doesn't run the Tier0 code sufficient times or something similar...?

Usually BDN invokes the code enough times for it to get promoted, but sometimes the background thread used by Tiered JIT does not get a chance to "kick in" and promote things to Tier 1.

#13069

@AndyAyersMS
Member

Performance of Compare_Same_Upper still showing regressions, but FormatterInt64 now back to where it was:
[plot: newplot - 2022-05-02T112428 218]

Runtime commit range for the perf drop was 489b034...250fda5 which doesn't show anything relevant.

Perf repo commit range was dotnet/performance@759f8b0...bff6fec which also doesn't show anything relevant.

So guessing FormatterInt64 was some microarchitectural issue.

@AndyAyersMS
Member

All the benchmarks other than Compare_Same_Upper seem to be at or near their best levels. This benchmark seems to be fairly volatile but rarely gets to the sub-1000 level that we had before April 2021 (the one time it dipped down then up was a pair of PGO updates).

[plot: newplot - 2022-06-27T124926 097]

More recently there seems to have been a more or less sustained regression in early May 2022 with #68869 where we no longer dip below 1200. This was undone at the end of May via #70144 but we did not completely recover the old perf.

There does not seem to be any good explanation for the recent spike up to 1800 and then drop back down.

[plot: newplot - 2022-06-27T125713 028]

@AndyAyersMS
Member

Drilling into Compare_Same_Upper shows the comments above are still relevant (except that we no longer do early block reordering).

The late block reordering scrambles the loop body, interposing non-loop blocks. This sort of reordering is risky unless you have very high confidence in your profile.

This is something I hope we can improve on in .NET 8.

@AndyAyersMS modified the milestones: from 7.0.0 to Future Jul 23, 2022
@AndyAyersMS
Member

Compare_Same_Upper performance now back to where it was long ago, looks like #85130 was the change responsible
[plot: newplot - 2023-04-22T080252 554]

Seems like this might not have been spotted by auto filing, going to double-check.

Ah, it was dotnet/perf-autofiling-issues#12928 that we evidently never triaged.

@ghost locked as resolved and limited conversation to collaborators May 22, 2023