RyuJIT: Tight loop performance regression #71628

ptr-null · 2022-07-05T09:44:19Z

Description

Consider the following method containing tight loop:

int[] array = new int[1000000];

public int Sum()
{
    int sum = 0;

    for (int i = 0; i < array.Length; i++)
        sum += array[i];

    return sum;
}

The method is modified then by adding a short check before the loop, as follows:

long state = 1;

public int Sum2()
{
    int sum = 0;

    if (Interlocked.Read(ref state) == 0)
        return sum;

    for (int i = 0; i < array.Length; i++)
        sum += array[i];

    return sum;
}

It is reasonable to expect (given that array is sufficiently long) that the check added should not affect method performance dramatically, right?

But it does:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
Intel Core i5-4690 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.301
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  DefaultJob : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT

Method	Mean	Error	StdDev	Ratio	RatioSD
Sum	590,207.1 ns	9,395.29 ns	8,328.68 ns	1.00	0.00
Sum2	1,943,186.9 ns	34,793.55 ns	32,545.90 ns	3.29	0.06

Modified method becomes ~3x slower, which suggests that the loop is performed slower.

Analysis

Indeed, it is. Inspecting IL does not expose any difference (loop bodies are identical in both cases). Code generated by RyuJIT for the loop is different however.

In the first case, the sum is accumulated into eax register:

...
add       eax,[r8+r9*4+10]        // sum += array[i]
...

Where as in the second one it is accumulated into r9d register, which is loaded from and saved to stack on each iteration of the loop:

...
mov       r9d,[rsp+24]            // sum <-- [stack]
add       r9d,[rcx+r8*4+10]       // sum += array[i]
...
mov       [rsp+24],r9d            // [stack] <-- sum
...

There is one additional jump in the second case also. See complete generated codes at the end.

Regression

As one may see from the benchmark results below, this is the regression actually:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
Intel Core i5-4690 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.301
  [Host]             : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  .NET 6.0           : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  .NET 5.0           : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  .NET Core 3.1      : .NET Core 3.1.26 (CoreCLR 4.700.22.26002, CoreFX 4.700.22.26801), X64 RyuJIT
  .NET Core 2.2      : .NET Core 2.2.8 (CoreCLR 4.6.28207.03, CoreFX 4.6.28208.02), X64 RyuJIT
  .NET Framework 4.8 : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD
Sum	.NET 6.0	587,219.9 ns	11,480.13 ns	10,738.52 ns	1.00	0.00
Sum2	.NET 6.0	1,943,794.5 ns	15,662.22 ns	13,078.66 ns	3.33	0.04

Sum	.NET 5.0	584,441.7 ns	9,659.41 ns	9,035.41 ns	1.00	0.00
Sum2	.NET 5.0	1,955,357.4 ns	18,500.14 ns	16,399.89 ns	3.35	0.06

Sum	.NET Core 3.1	597,269.8 ns	8,984.42 ns	8,404.03 ns	1.00	0.00
Sum2	.NET Core 3.1	1,987,871.5 ns	35,514.92 ns	33,220.67 ns	3.33	0.09

Sum	.NET Core 2.2	585,353.0 ns	7,527.64 ns	7,041.36 ns	1.00	0.00
Sum2	.NET Core 2.2	576,804.1 ns	2,975.21 ns	2,637.45 ns	0.99	0.01

Sum	.NET Framework 4.8	580,485.7 ns	5,634.76 ns	5,270.76 ns	1.00	0.00
Sum2	.NET Framework 4.8	591,283.8 ns	8,772.34 ns	8,205.65 ns	1.02	0.02

Both methods were on par in the .NET Framework 4.8 and .NET Core 2.2

Generated code

.NET 6.0.6 (6.0.622.26707), X64 RyuJIT

; Sum()
       sub       rsp,28
       xor       eax,eax
       xor       edx,edx
       mov       rcx,[rcx+8]
       cmp       dword ptr [rcx+8],0
       jle       short M00_L01
       nop       dword ptr [rax]
       nop       dword ptr [rax]
M00_L00:
       mov       r8,rcx
       cmp       edx,[r8+8]
       jae       short M00_L02
       movsxd    r9,edx
       add       eax,[r8+r9*4+10]
       inc       edx
       cmp       [rcx+8],edx
       jg        short M00_L00
M00_L01:
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 67

; Sum2()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+24],eax
       cmp       [rcx],ecx
       lea       rdx,[rcx+10]
       xor       r8d,r8d
       xor       eax,eax
       lock cmpxchg [rdx],r8
       test      rax,rax
       jne       short M00_L00
       xor       eax,eax
       add       rsp,28
       ret
M00_L00:
       xor       eax,eax
       mov       rdx,[rcx+8]
       cmp       dword ptr [rdx+8],0
       jle       short M00_L04
M00_L01:
       mov       rcx,rdx
       cmp       eax,[rcx+8]
       jae       short M00_L05
       movsxd    r8,eax
       mov       r9d,[rsp+24]
       add       r9d,[rcx+r8*4+10]
       inc       eax
       cmp       [rdx+8],eax
       jg        short M00_L03
M00_L02:
       mov       eax,r9d
       add       rsp,28
       ret
M00_L03:
       mov       [rsp+24],r9d
       jmp       short M00_L01
M00_L04:
       mov       r9d,[rsp+24]
       jmp       short M00_L02
M00_L05:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 106

category:cq
theme:register-allocator
skill-level:expert
cost:large
impact:large

The text was updated successfully, but these errors were encountered:

gfoidl · 2022-07-05T12:20:45Z

Beside the codegen you should write the method like

public int Sum()
{
    if (Interlocked.Read(ref _state) == 0)
    {
        return 0;
    }

    int sum = 0;
    int[] array = _array;

    for (int i = 0; i < array.Length; ++i)
        sum += array[i];

    return sum;
}

It's faster then, and codegen will be as expected.

So key points are:

don't return sum from the Interlocked-check, as this causes the spilling --> return 0 and initialize sum after the check
loop over a local array, to avoid the bound-checks

ghost · 2022-07-05T16:34:32Z

This issue has been marked needs-author-action and may be missing some important information.

JulieLeeMSFT · 2022-07-05T16:35:35Z

@ptr-null please try what @gfoidl suggested and let us know if it removes the regression.
cc @dotnet/jit-contrib.

kunalspathak · 2022-07-05T19:31:17Z

This is an epic problem of LSRA's including resolution moves in the middle of the loop where spill/reload happens. The generated code that gfoidl suggested is much better, but regardless the LSRA problem should be handled in runtime.

ptr-null · 2022-07-05T21:01:58Z

@gfoidl, the example I used here is quite reduced form of the real code piece (still reproducing the issue). Workaround is not something to worry about (I have several for my real problem). Thanks for the array-trick hint though. I've seen it before, but forgot happily. Those range checks are especially annoying to see in my case, because of I have array as readonly field of the instance (hello to #11797). The other annoying thing I noticed is this != null added (#44087 once again).
cc @JulieLeeMSFT

kunalspathak · 2023-06-05T18:57:47Z

This is likely won't get time during .NET 8, but I will mark it as Pri3 for .NET 8.

kunalspathak · 2024-07-16T23:26:22Z

This falls in the category of "resolution phase" of LSRA noted in #47194.

ptr-null added the tenet-performance Performance related issue label Jul 5, 2022

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 5, 2022

ghost added the untriaged New issue has not been triaged by the area owner label Jul 5, 2022

JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 5, 2022

JulieLeeMSFT assigned ptr-null Jul 5, 2022

JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 5, 2022

JulieLeeMSFT added this to the 7.0.0 milestone Jul 5, 2022

JulieLeeMSFT assigned BruceForstall Jul 5, 2022

kunalspathak assigned kunalspathak and unassigned BruceForstall Jul 5, 2022

kunalspathak modified the milestones: 7.0.0, 6.0.x, 8.0.0 Jul 5, 2022

ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 5, 2022

kunalspathak added Priority:3 Work that is nice to have and removed needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration labels Jun 5, 2023

kunalspathak modified the milestones: 8.0.0, 9.0.0 Jul 30, 2023

kunalspathak removed this from the 9.0.0 milestone May 3, 2024

JulieLeeMSFT removed the Priority:3 Work that is nice to have label May 16, 2024

JulieLeeMSFT added this to the 9.0.0 milestone May 16, 2024

kunalspathak modified the milestones: 9.0.0, Future Jul 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RyuJIT: Tight loop performance regression #71628

RyuJIT: Tight loop performance regression #71628

ptr-null commented Jul 5, 2022 •

edited by BruceForstall

Loading

gfoidl commented Jul 5, 2022

ghost commented Jul 5, 2022

JulieLeeMSFT commented Jul 5, 2022

kunalspathak commented Jul 5, 2022

ptr-null commented Jul 5, 2022

kunalspathak commented Jun 5, 2023

kunalspathak commented Jul 16, 2024

RyuJIT: Tight loop performance regression #71628

RyuJIT: Tight loop performance regression #71628

Comments

ptr-null commented Jul 5, 2022 • edited by BruceForstall Loading

Description

Analysis

Regression

Generated code

.NET 6.0.6 (6.0.622.26707), X64 RyuJIT

gfoidl commented Jul 5, 2022

ghost commented Jul 5, 2022

JulieLeeMSFT commented Jul 5, 2022

kunalspathak commented Jul 5, 2022

ptr-null commented Jul 5, 2022

kunalspathak commented Jun 5, 2023

kunalspathak commented Jul 16, 2024

ptr-null commented Jul 5, 2022 •

edited by BruceForstall

Loading