Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RyuJIT: Tight loop performance regression #71628

Open
ptr-null opened this issue Jul 5, 2022 · 7 comments
Open

RyuJIT: Tight loop performance regression #71628

ptr-null opened this issue Jul 5, 2022 · 7 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@ptr-null
Copy link

ptr-null commented Jul 5, 2022

Description

Consider the following method containing tight loop:

int[] array = new int[1000000];

public int Sum()
{
    int sum = 0;

    for (int i = 0; i < array.Length; i++)
        sum += array[i];

    return sum;
}

The method is modified then by adding a short check before the loop, as follows:

long state = 1;

public int Sum2()
{
    int sum = 0;

    if (Interlocked.Read(ref state) == 0)
        return sum;

    for (int i = 0; i < array.Length; i++)
        sum += array[i];

    return sum;
}

It is reasonable to expect (given that array is sufficiently long) that the check added should not affect method performance dramatically, right?

But it does:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
Intel Core i5-4690 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.301
  [Host]     : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  DefaultJob : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
Method Mean Error StdDev Ratio RatioSD
Sum 590,207.1 ns 9,395.29 ns 8,328.68 ns 1.00 0.00
Sum2 1,943,186.9 ns 34,793.55 ns 32,545.90 ns 3.29 0.06

Modified method becomes ~3x slower, which suggests that the loop is performed slower.

Analysis

Indeed, it is. Inspecting IL does not expose any difference (loop bodies are identical in both cases). Code generated by RyuJIT for the loop is different however.

In the first case, the sum is accumulated into eax register:

...
add       eax,[r8+r9*4+10]        // sum += array[i]
...

Where as in the second one it is accumulated into r9d register, which is loaded from and saved to stack on each iteration of the loop:

...
mov       r9d,[rsp+24]            // sum <-- [stack]
add       r9d,[rcx+r8*4+10]       // sum += array[i]
...
mov       [rsp+24],r9d            // [stack] <-- sum
...

There is one additional jump in the second case also. See complete generated codes at the end.

Regression

As one may see from the benchmark results below, this is the regression actually:

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19044.1766 (21H2)
Intel Core i5-4690 CPU 3.50GHz (Haswell), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.301
  [Host]             : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  .NET 6.0           : .NET 6.0.6 (6.0.622.26707), X64 RyuJIT
  .NET 5.0           : .NET 5.0.17 (5.0.1722.21314), X64 RyuJIT
  .NET Core 3.1      : .NET Core 3.1.26 (CoreCLR 4.700.22.26002, CoreFX 4.700.22.26801), X64 RyuJIT
  .NET Core 2.2      : .NET Core 2.2.8 (CoreCLR 4.6.28207.03, CoreFX 4.6.28208.02), X64 RyuJIT
  .NET Framework 4.8 : .NET Framework 4.8 (4.8.4515.0), X64 RyuJIT
Method Runtime Mean Error StdDev Ratio RatioSD
Sum .NET 6.0 587,219.9 ns 11,480.13 ns 10,738.52 ns 1.00 0.00
Sum2 .NET 6.0 1,943,794.5 ns 15,662.22 ns 13,078.66 ns 3.33 0.04
Sum .NET 5.0 584,441.7 ns 9,659.41 ns 9,035.41 ns 1.00 0.00
Sum2 .NET 5.0 1,955,357.4 ns 18,500.14 ns 16,399.89 ns 3.35 0.06
Sum .NET Core 3.1 597,269.8 ns 8,984.42 ns 8,404.03 ns 1.00 0.00
Sum2 .NET Core 3.1 1,987,871.5 ns 35,514.92 ns 33,220.67 ns 3.33 0.09
Sum .NET Core 2.2 585,353.0 ns 7,527.64 ns 7,041.36 ns 1.00 0.00
Sum2 .NET Core 2.2 576,804.1 ns 2,975.21 ns 2,637.45 ns 0.99 0.01
Sum .NET Framework 4.8 580,485.7 ns 5,634.76 ns 5,270.76 ns 1.00 0.00
Sum2 .NET Framework 4.8 591,283.8 ns 8,772.34 ns 8,205.65 ns 1.02 0.02

Both methods were on par in the .NET Framework 4.8 and .NET Core 2.2

Generated code

.NET 6.0.6 (6.0.622.26707), X64 RyuJIT

; Sum()
       sub       rsp,28
       xor       eax,eax
       xor       edx,edx
       mov       rcx,[rcx+8]
       cmp       dword ptr [rcx+8],0
       jle       short M00_L01
       nop       dword ptr [rax]
       nop       dword ptr [rax]
M00_L00:
       mov       r8,rcx
       cmp       edx,[r8+8]
       jae       short M00_L02
       movsxd    r9,edx
       add       eax,[r8+r9*4+10]
       inc       edx
       cmp       [rcx+8],edx
       jg        short M00_L00
M00_L01:
       add       rsp,28
       ret
M00_L02:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 67
; Sum2()
       sub       rsp,28
       xor       eax,eax
       mov       [rsp+24],eax
       cmp       [rcx],ecx
       lea       rdx,[rcx+10]
       xor       r8d,r8d
       xor       eax,eax
       lock cmpxchg [rdx],r8
       test      rax,rax
       jne       short M00_L00
       xor       eax,eax
       add       rsp,28
       ret
M00_L00:
       xor       eax,eax
       mov       rdx,[rcx+8]
       cmp       dword ptr [rdx+8],0
       jle       short M00_L04
M00_L01:
       mov       rcx,rdx
       cmp       eax,[rcx+8]
       jae       short M00_L05
       movsxd    r8,eax
       mov       r9d,[rsp+24]
       add       r9d,[rcx+r8*4+10]
       inc       eax
       cmp       [rdx+8],eax
       jg        short M00_L03
M00_L02:
       mov       eax,r9d
       add       rsp,28
       ret
M00_L03:
       mov       [rsp+24],r9d
       jmp       short M00_L01
M00_L04:
       mov       r9d,[rsp+24]
       jmp       short M00_L02
M00_L05:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 106

category:cq
theme:register-allocator
skill-level:expert
cost:large
impact:large

@ptr-null ptr-null added the tenet-performance Performance related issue label Jul 5, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 5, 2022
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jul 5, 2022
@gfoidl
Copy link
Member

gfoidl commented Jul 5, 2022

Beside the codegen you should write the method like

public int Sum()
{
    if (Interlocked.Read(ref _state) == 0)
    {
        return 0;
    }

    int sum = 0;
    int[] array = _array;

    for (int i = 0; i < array.Length; ++i)
        sum += array[i];

    return sum;
}

It's faster then, and codegen will be as expected.

So key points are:

  • don't return sum from the Interlocked-check, as this causes the spilling --> return 0 and initialize sum after the check
  • loop over a local array, to avoid the bound-checks

@JulieLeeMSFT JulieLeeMSFT added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jul 5, 2022
@ghost
Copy link

ghost commented Jul 5, 2022

This issue has been marked needs-author-action and may be missing some important information.

@JulieLeeMSFT
Copy link
Member

@ptr-null please try what @gfoidl suggested and let us know if it removes the regression.
cc @dotnet/jit-contrib.

@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 5, 2022
@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Jul 5, 2022
@kunalspathak
Copy link
Member

This is an epic problem of LSRA's including resolution moves in the middle of the loop where spill/reload happens. The generated code that gfoidl suggested is much better, but regardless the LSRA problem should be handled in runtime.

@kunalspathak kunalspathak modified the milestones: 7.0.0, 6.0.x, 8.0.0 Jul 5, 2022
@ptr-null
Copy link
Author

ptr-null commented Jul 5, 2022

@gfoidl, the example I used here is quite reduced form of the real code piece (still reproducing the issue). Workaround is not something to worry about (I have several for my real problem). Thanks for the array-trick hint though. I've seen it before, but forgot happily. Those range checks are especially annoying to see in my case, because of I have array as readonly field of the instance (hello to #11797). The other annoying thing I noticed is this != null added (#44087 once again).
cc @JulieLeeMSFT

@ghost ghost added needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jul 5, 2022
@kunalspathak
Copy link
Member

This is likely won't get time during .NET 8, but I will mark it as Pri3 for .NET 8.

@kunalspathak kunalspathak added Priority:3 Work that is nice to have and removed needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration labels Jun 5, 2023
@kunalspathak kunalspathak modified the milestones: 8.0.0, 9.0.0 Jul 30, 2023
@kunalspathak kunalspathak removed this from the 9.0.0 milestone May 3, 2024
@JulieLeeMSFT JulieLeeMSFT removed the Priority:3 Work that is nice to have label May 16, 2024
@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone May 16, 2024
@kunalspathak
Copy link
Member

This falls in the category of "resolution phase" of LSRA noted in #47194.

@kunalspathak kunalspathak modified the milestones: 9.0.0, Future Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

5 participants