Performance regression with SIMD in .NET 6 #51915

Closed
aalmada opened this issue Apr 27, 2021 · 6 comments
Labels: area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI), tenet-performance (Performance related issue)

@aalmada

aalmada commented Apr 27, 2021

Description

I've been periodically running benchmarks on multiple LINQ libraries. I recently upgraded these to .NET 6 and noticed a regression for some SIMD cases.

Configuration

The benchmarks use BenchmarkDotNet and the configuration can be found at https://github.com/NetFabric/LinqBenchmarks/blob/afdb508341242c94d525f6858addbba2d96bc132/LinqBenchmarks/Program.cs#L25

I'm using .NET 6.0.100-preview.3.21202.5

The regression can be reproduced both using LinqFaster and NetFabric.Hyperlinq.

Regression?

The benchmark repository contains the latest results of the benchmarks, comparing the results of .NET 5 against .NET 6.

Data

The benchmarks for the query Range().ToArray() show no major difference between .NET 5 and .NET 6: https://github.com/NetFabric/LinqBenchmarks/blob/afdb508341242c94d525f6858addbba2d96bc132/Results/Range.RangeToArray.md

But, for the query Range().Select().ToArray(), the SIMD-enabled .NET 6 version is much slower, for both libraries: https://github.com/NetFabric/LinqBenchmarks/blob/afdb508341242c94d525f6858addbba2d96bc132/Results/Range.RangeSelectToArray.md

Analysis

I'm very sorry, I tried, but I cannot pinpoint the issue. Still, I hope this will help.

Both libraries use System.Numerics.

I'm the developer of NetFabric.Hyperlinq and I can point you to the core source code used for both cases:

In both cases, an array is allocated with the known size and passed to one of these methods as a Span<int>.
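
For readers without the linked sources at hand, the pattern described above can be sketched roughly as follows. This is a hypothetical illustration only, not the actual NetFabric.Hyperlinq or LinqFaster code: the names `RangeSelectSketch`, `RangeSelectToArray` and `Fill`, and the use of a multiply-by-constant selector, are assumptions made for the example.

```csharp
using System;
using System.Numerics;

// Hypothetical sketch; not taken from NetFabric.Hyperlinq or LinqFaster.
public static class RangeSelectSketch
{
    // Allocates an array of the known size and fills it through a Span<int>.
    public static int[] RangeSelectToArray(int start, int count, int factor)
    {
        var result = new int[count];
        Fill(result, start, factor);
        return result;
    }

    static void Fill(Span<int> destination, int start, int factor)
    {
        var index = 0;
        if (Vector.IsHardwareAccelerated && destination.Length >= Vector<int>.Count)
        {
            // Seed vector: start, start + 1, ..., start + Count - 1
            Span<int> seed = stackalloc int[Vector<int>.Count];
            for (var i = 0; i < seed.Length; i++)
                seed[i] = start + i;

            var values = new Vector<int>(seed);
            var step = new Vector<int>(Vector<int>.Count);

            // Vector<int> * scalar: the kind of operation involved in the regression.
            for (; index <= destination.Length - Vector<int>.Count; index += Vector<int>.Count)
            {
                (values * factor).CopyTo(destination.Slice(index));
                values += step;
            }
        }

        // Scalar loop for the remaining elements (or for the whole span without SIMD).
        for (; index < destination.Length; index++)
            destination[index] = (start + index) * factor;
    }
}
```

With this shape, the per-element cost is dominated by the `Vector<int>` multiplication, so any slowdown in how the JIT compiles that operator would show up directly in a Range().Select().ToArray() benchmark.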

I ran the benchmarks multiple times and always got the same results.

@aalmada added the tenet-performance label Apr 27, 2021
@dotnet-issue-labeler (bot) added the untriaged label (New issue has not been triaged by the area owner) Apr 27, 2021
@dotnet-issue-labeler

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@EgorBo
Member

EgorBo commented Apr 27, 2021

After some investigation, I think the cause is the same as #49071
(it should be fixed in an upcoming Preview release; see #49503).
For example:

[Benchmark]
public Vector<int> Bench2()
{
    return new Vector<int>(Vector<int>.Count) * 3;
}

BDN:

.NET 5.0.4 (5.0.421.11614), X64 RyuJIT

; Prog.Bench2()
       vzeroupper
       mov       eax,8
       vmovd     xmm0,eax
       vpbroadcastd ymm0,xmm0
       mov       eax,3
       vmovd     xmm1,eax
       vpbroadcastd ymm1,xmm1
       vpmulld   ymm0,ymm1,ymm0
       vmovupd   [rdx],ymm0
       mov       rax,rdx
       vzeroupper
       ret
; Total bytes of code 47

.NET 6.0.0 (6.0.21.20104), X64 RyuJIT

; Prog.Bench2()
       push      rdi
       push      rsi
       sub       rsp,68
       vzeroupper
       mov       rsi,rdx
       mov       ecx,8
       vmovd     xmm0,ecx
       vpbroadcastd ymm0,xmm0
       vmovupd   [rsp+20],ymm0
       vxorps    ymm0,ymm0,ymm0
       vmovupd   [rsp+40],ymm0
       xor       edi,edi
M00_L00:
       lea       rcx,[rsp+20]
       mov       ecx,[rcx+rdi*4]
       mov       edx,3
       call      System.Numerics.Vector`1[[System.Int32, System.Private.CoreLib]].ScalarMultiply(Int32, Int32)
       lea       rdx,[rsp+40]
       mov       [rdx+rdi*4],eax
       inc       rdi
       cmp       rdi,8
       jl        short M00_L00
       vmovupd   ymm0,[rsp+40]
       vmovupd   [rsi],ymm0
       mov       rax,rsi
       vzeroupper
       add       rsp,68
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 102

@tannergooding
Member

For reference, the codegen as of the current nightly bits is now:

; Prog.Bench2()
    vzeroupper 
    mov      eax, 8
    vmovd    xmm0, eax
    vpbroadcastd ymm0, ymm0
    vpmulld  ymm0, ymm0, ymmword ptr[reloc @RWD00]
    vmovupd  ymmword ptr[rcx], ymm0
    mov      rax, rcx
    vzeroupper
    ret
; Total bytes of code 37

@tannergooding removed the untriaged label Apr 27, 2021
@tannergooding
Member

@aalmada, would you be willing to retest with the latest nightly SDK: https://github.com/dotnet/installer#installers-and-binaries?

Doing so would allow us to confirm there are no other regressions in the area and that the fix does indeed cover the regression you detected.

@tannergooding self-assigned this Apr 27, 2021
@jeffschwMSFT added the area-CodeGen-coreclr label Apr 27, 2021
@aalmada
Author

aalmada commented Apr 27, 2021

@tannergooding
I reran the benchmark for the affected query, now with the nightly build, and I can confirm that the issue is gone. 👍
I'm now going to let all the benchmarks run through the night.
Thanks!

@aalmada
Author

aalmada commented Apr 28, 2021

@tannergooding
I reran all the benchmarks using .NET 6.0.100-preview.4.21227.6 and can confirm once again that the issue is gone.
In case you'd like to compare the performance of multiple LINQ implementations, both on .NET 5 and this version of .NET 6, you can find all the results at https://github.com/NetFabric/LinqBenchmarks/tree/444de2fe44b60fa86a2da02751551804dd834e61

@aalmada closed this as completed Apr 28, 2021
@ghost locked as resolved and limited conversation to collaborators May 28, 2021