Performance of Trigonometric math function have unbelievable loss at .NET8 #95954

Closed
kingsznhone opened this issue Dec 13, 2023 · 33 comments · Fixed by #98261
@kingsznhone

Description

I have a compute-heavy program that does intensive trigonometric calculation.

I compiled it with .NET 8 and then ran a benchmark.

Compute speed dropped by more than 10x.

Configuration

OS: Windows 11 x64
Runtime: .NET 6 7 8

Data

[Image: 20231213171827]

Analysis

The function I use is (su, cu) = Math.SinCos(u);

On .NET 6 and .NET 7 it is faster than

su = Math.Sin(u);
cu = Math.Cos(u);

Then I changed my code to

su = Math.Sin(u);
cu = Math.Cos(u);

Here is the benchmark result.

[Image: 20231213172355]

I think there is a serious problem with the internal implementation of the Math.SinCos() method.

@kingsznhone kingsznhone added the tenet-performance Performance related issue label Dec 13, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Dec 13, 2023
@ghost

ghost commented Dec 13, 2023

Tagging subscribers to this area: @dotnet/area-system-numerics
See info in area-owners.md if you want to be subscribed.


@huoyaoyuan
Member

I don't see a performance difference in the following benchmark. Can you share more information about your benchmark?

| Method   | Job      | Runtime  | Angle  | Mean     | Error    | StdDev   | Ratio |
|----------|----------|----------|--------|----------|----------|----------|-------|
| Separate | .NET 6.0 | .NET 6.0 | 123.45 | 13.89 ns | 0.132 ns | 0.123 ns | 1.00  |
| SinCos   | .NET 6.0 | .NET 6.0 | 123.45 | 14.24 ns | 0.031 ns | 0.029 ns | 1.03  |
| Separate | .NET 7.0 | .NET 7.0 | 123.45 | 13.56 ns | 0.186 ns | 0.174 ns | 1.00  |
| SinCos   | .NET 7.0 | .NET 7.0 | 123.45 | 13.68 ns | 0.043 ns | 0.036 ns | 1.01  |
| Separate | .NET 8.0 | .NET 8.0 | 123.45 | 13.46 ns | 0.028 ns | 0.026 ns | 1.00  |
| SinCos   | .NET 8.0 | .NET 8.0 | 123.45 | 13.74 ns | 0.036 ns | 0.034 ns | 1.02  |

using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

// Run the same benchmarks on each of the three runtimes.
[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net70)]
[SimpleJob(RuntimeMoniker.Net80)]
public class Program
{
    static void Main()
    {
        BenchmarkRunner.Run<Program>();
    }

    [Params(123.45)]
    public double Angle { get; set; }

    // Baseline: two independent calls.
    [Benchmark(Baseline = true)]
    public double Separate()
    {
        double sin = Math.Sin(Angle);
        double cos = Math.Cos(Angle);
        return sin + cos;
    }

    // Combined call that returns both values as a tuple.
    [Benchmark]
    public double SinCos()
    {
        (double sin, double cos) = Math.SinCos(Angle);
        return sin + cos;
    }
}

@kingsznhone
Author

Here are the performance profiler results for 1000 iterations of the hotspot function.

In Debug mode, .NET 7 and .NET 8 behave the same.

.NET 7 Debug

[Image: 7 0 debug]

.NET 8 Debug

[Image: 8 0 debug]

But in Release mode, .NET 8 takes many times longer than .NET 7.

.NET 7 Release

[Image: 7 0 release]

.NET 8 Release

[Image: 8 0 release]

As I mentioned before, changing the code to separate Sin and Cos calls gives .NET 7 and .NET 8 the same execution time.

So my inference is that SinCos(), in certain contexts and after Release optimization, causes the performance loss.

@kingsznhone
Author

kingsznhone commented Dec 13, 2023

I went down to the assembly level, and what I found blew my mind.
The compiler does a lot of negative optimizations in this hotspot.
I have no idea why this happens.

.NET 7 RELEASE

               for (int n = 0; n < terms.Length; n++)
00007FFF232B3E63  xor         r9d,r9d  
00007FFF232B3E66  mov         r10d,dword ptr [rcx+8]  
00007FFF232B3E6A  mov         dword ptr [rbp+1Ch],r10d  
00007FFF232B3E6E  test        r10d,r10d  
00007FFF232B3E71  jle         Calculate(VariableTable, Double)+0169h (07FFF232B3EE9h)  
                {
                    u = terms[n].aa + terms[n].bb * tj;
00007FFF232B3E73  mov         dword ptr [rbp+3Ch],r9d  
00007FFF232B3E77  mov         r11d,r9d  
00007FFF232B3E7A  shl         r11,5  
00007FFF232B3E7E  mov         qword ptr [rbp+20h],r11  
00007FFF232B3E82  vmulsd      xmm0,xmm6,mmword ptr [rcx+r11+28h]  
00007FFF232B3E89  vaddsd      xmm0,xmm0,mmword ptr [rcx+r11+20h]  
00007FFF232B3E90  lea         rdx,[rbp+30h]  
00007FFF232B3E94  lea         r8,[rbp+28h]  
00007FFF232B3E98  call        00007FFF82F98DE0  
00007FFF232B3E9D  vmovsd      xmm1,qword ptr [rbp+30h]  
00007FFF232B3EA2  vmovsd      xmm0,qword ptr [rbp+28h]  
                    result += t[it] * (terms[n].ss * su + terms[n].cc * cu);
00007FFF232B3EA7  cmp         r12d,15h  
00007FFF232B3EAB  jae         Calculate(VariableTable, Double)+01E8h (07FFF232B3F68h)  
00007FFF232B3EB1  mov         rcx,qword ptr [rbp+8]  
00007FFF232B3EB5  mov         r11,qword ptr [rbp+20h]  
00007FFF232B3EB9  vmulsd      xmm1,xmm1,mmword ptr [rcx+r11+10h]  
00007FFF232B3EC0  vmulsd      xmm0,xmm0,mmword ptr [rcx+r11+18h]  
00007FFF232B3EC7  vaddsd      xmm1,xmm1,xmm0  
00007FFF232B3ECB  mov         rax,qword ptr [rbp+10h]  
00007FFF232B3ECF  vmulsd      xmm1,xmm1,mmword ptr [r15+rax*8]  
00007FFF232B3ED5  vaddsd      xmm7,xmm1,xmm7 
        for (int n = 0; n < terms.Length; n++)

.NET 8 RELEASE

                for (int n = 0; n < terms.Length; n++)
00007FFE52398E2A  xor         ecx,ecx  
00007FFE52398E2C  mov         dword ptr [rbp+74h],ecx  
00007FFE52398E2F  nop  
00007FFE52398E30  jmp         Calculate(VariableTable, Double)+032Ah (07FFE52398F4Ah)  
                {
                    u = terms[n].aa + terms[n].bb * tj;
00007FFE52398E35  mov         rcx,qword ptr [rbp+80h]  
00007FFE52398E3C  mov         eax,dword ptr [rbp+74h]  
00007FFE52398E3F  cmp         eax,dword ptr [rcx+8]  
00007FFE52398E42  jb          Calculate(VariableTable, Double)+0229h (07FFE52398E49h)  
00007FFE52398E44  call        00007FFEB1FD2DB0  
00007FFE52398E49  mov         edx,eax  
00007FFE52398E4B  imul        rdx,rdx,20h  
00007FFE52398E4F  lea         rcx,[rcx+rdx+10h]  
00007FFE52398E54  vmovsd      xmm1,qword ptr [rcx+10h]  
00007FFE52398E59  mov         rcx,qword ptr [rbp+80h]  
00007FFE52398E60  mov         eax,dword ptr [rbp+74h]  
00007FFE52398E63  cmp         eax,dword ptr [rcx+8]  
00007FFE52398E66  jb          Calculate(VariableTable, Double)+024Dh (07FFE52398E6Dh)  
00007FFE52398E68  call        00007FFEB1FD2DB0  
00007FFE52398E6D  mov         edx,eax  
00007FFE52398E6F  imul        rdx,rdx,20h  
00007FFE52398E73  lea         rcx,[rcx+rdx+10h]  
00007FFE52398E78  vmovsd      xmm0,qword ptr [rcx+18h]  
00007FFE52398E7D  vmulsd      xmm0,xmm0,mmword ptr [rbp+0B8h]  
00007FFE52398E85  vaddsd      xmm1,xmm1,xmm0  
00007FFE52398E89  lea         rcx,[rbp+20h]  
00007FFE52398E8D  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFE526D4FA8 (07FFE526D4FA8h)]  
                    (su, cu) = Math.SinCos(u);
00007FFE52398E93  vmovsd      xmm0,qword ptr [rbp+20h]  
00007FFE52398E98  vmovsd      qword ptr [rbp+98h],xmm0  
00007FFE52398EA0  vmovsd      xmm0,qword ptr [rbp+28h]  
00007FFE52398EA5  vmovsd      qword ptr [rbp+90h],xmm0  
                    result += t[it] * (terms[n].ss * su + terms[n].cc * cu);
00007FFE52398EAD  vmovsd      xmm0,qword ptr [rbp+0A0h]  
00007FFE52398EB5  vmovsd      qword ptr [rbp+18h],xmm0  
00007FFE52398EBA  lea         rcx,[rbp+0A8h]  
00007FFE52398EC1  mov         edx,dword ptr [rbp+78h]  
00007FFE52398EC4  call        qword ptr [CLRStub[MethodDescPrestub]@00007FFE52A16760 (07FFE52A16760h)]  
00007FFE52398ECA  mov         qword ptr [rbp+10h],rax  
00007FFE52398ECE  mov         rax,qword ptr [rbp+10h]  
00007FFE52398ED2  vmovsd      xmm1,qword ptr [rax]  
00007FFE52398ED6  mov         rax,qword ptr [rbp+80h]  
00007FFE52398EDD  mov         ecx,dword ptr [rbp+74h]  
00007FFE52398EE0  cmp         ecx,dword ptr [rax+8]  
00007FFE52398EE3  jb          Calculate(VariableTable, Double)+02CAh (07FFE52398EEAh)  
00007FFE52398EE5  call        00007FFEB1FD2DB0  
00007FFE52398EEA  mov         edx,ecx  
00007FFE52398EEC  imul        rdx,rdx,20h  
00007FFE52398EF0  lea         rax,[rax+rdx+10h]  
00007FFE52398EF5  vmovsd      xmm0,qword ptr [rax]  
00007FFE52398EF9  vmulsd      xmm0,xmm0,mmword ptr [rbp+98h]  
00007FFE52398F01  mov         rax,qword ptr [rbp+80h]  
00007FFE52398F08  mov         ecx,dword ptr [rbp+74h]  
00007FFE52398F0B  cmp         ecx,dword ptr [rax+8]  
00007FFE52398F0E  jb          Calculate(VariableTable, Double)+02F5h (07FFE52398F15h)  
00007FFE52398F10  call        00007FFEB1FD2DB0  
00007FFE52398F15  mov         edx,ecx  
00007FFE52398F17  imul        rdx,rdx,20h  
00007FFE52398F1B  lea         rax,[rax+rdx+10h]  
00007FFE52398F20  vmovsd      xmm2,qword ptr [rax+8]  
00007FFE52398F25  vmulsd      xmm2,xmm2,mmword ptr [rbp+90h]  
00007FFE52398F2D  vaddsd      xmm0,xmm0,xmm2  
00007FFE52398F31  vmulsd      xmm1,xmm1,xmm0  
00007FFE52398F35  vaddsd      xmm1,xmm1,mmword ptr [rbp+18h]  
00007FFE52398F3A  vmovsd      qword ptr [rbp+0A0h],xmm1  
        for (int n = 0; n < terms.Length; n++)

The compiler is far beyond my knowledge. I hope this information is useful.

@gfoidl
Member

gfoidl commented Dec 13, 2023

This looks like Tier-0 (unoptimized code).
Did you have a debugger attached, etc?

For BenchmarkDotNet you can use the [Disassembly]-diagnoser to get the asm.
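
As a rough sketch of what that looks like, the diagnoser is applied as an attribute on the benchmark class. The class name `TrigBenchmarks` and the parameter value below are placeholders, not taken from the project under discussion, and the BenchmarkDotNet package is assumed to be referenced.

```csharp
using System;
using BenchmarkDotNet.Attributes;

// [DisassemblyDiagnoser] asks BenchmarkDotNet to capture the JIT-generated
// assembly for each benchmark method in this class.
[DisassemblyDiagnoser]
public class TrigBenchmarks // hypothetical class name
{
    [Params(123.45)]
    public double Angle { get; set; }

    [Benchmark]
    public double SinCos()
    {
        (double sin, double cos) = Math.SinCos(Angle);
        return sin + cos;
    }
}
```

Running this through `BenchmarkRunner.Run<TrigBenchmarks>()` in a Release build should export the disassembly alongside the other results. Note that it only covers managed code, so anything executed inside a native math routine will not appear in the listing.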

@kingsznhone
Copy link
Author

kingsznhone commented Dec 13, 2023

> This looks like Tier-0 (unoptimized code). Did you have a debugger attached, etc?
>
> For BenchmarkDotNet you can use the [Disassembly]-diagnoser to get the asm.

Even Debug is faster than Release.
I think it's a negative optimization for sure :(

[Image: 20231213204828]

@stephentoub
Member

Please share the full benchmark you're running.

@kingsznhone
Author

> Please share the full benchmark you're running.

Check this repo: VSOP2013
Add net8.0 to the target frameworks,
then add [SimpleJob(RuntimeMoniker.Net80)] in /Demo/PerfTest.cs.

Running the demo will produce the benchmark result.

@andrewjsaid
Contributor

I am unable to reproduce this when running the project as per your instructions.

[Image]

@kingsznhone
Author

kingsznhone commented Dec 13, 2023

> I am unable to reproduce this when running the project as per your instructions.

I ran the benchmark on another computer with a fresh SDK.
Here is the result.

[Image: 13400]

Another result, from Windows 10 in Hyper-V.

[Image: hyperV]

Result from my friend's AMD PC.
Is it a compiler bug on Intel processors?

[Image: 5900hx]

@tannergooding
Member

Can you add the [DisassemblyDiagnoser] attribute to the benchmark and share the result: https://benchmarkdotnet.org/articles/features/disassembler.html

It's possible this is something like the JCC Erratum that exists on Intel processors. It's also possible there is some subtle codegen difference that is impacting things here.

Notably this currently just defers down to the C Runtime and there were no explicit changes to the logic done in the .NET 7/8 timeframe.

@saucecontrol
Member

saucecontrol commented Dec 13, 2023

I'm seeing a regression on Intel 11th gen (Tiger Lake) as well, though not of the same magnitude as shown on the hybrid models above

[Image]

Here's the DisassemblyDiagnoser output

Edit: Oops, pasted without reading. It's useless...


.NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2

; Demo.PerfTest.Compute()
       mov       [rsp+8],rcx
       mov       rcx,[rcx+8]
       mov       r8,[rsp+8]
       mov       r8,[r8+10]
       mov       edx,4
       cmp       [rcx],ecx
       jmp       near ptr VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
; Total bytes of code 30
; VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,38
       mov       rsi,rcx
       mov       ebx,edx
       mov       rdi,r8
       mov       rcx,offset MT_VSOP2013.Calculator+<>c__DisplayClass4_0
       call      CORINFO_HELP_NEWSFAST
       mov       rbp,rax
       lea       rcx,[rbp+10]
       mov       rdx,rsi
       call      CORINFO_HELP_ASSIGN_REF
       mov       [rbp+20],ebx
       lea       rcx,[rbp+18]
       mov       rdx,rdi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Double[]
       mov       edx,6
       call      CORINFO_HELP_NEWARR_1_VC
       lea       rcx,[rbp+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Action`1[[System.Int32, System.Private.CoreLib]]
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       lea       rcx,[rsi+8]
       mov       rdx,rbp
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset VSOP2013.Calculator+<>c__DisplayClass4_0.<GetPlanetPosition>b__0(Int32)
       mov       [rsi+18],rcx
       lea       rcx,[rsp+20]
       mov       r9,rsi
       xor       edx,edx
       mov       r8d,6
       call      System.Threading.Tasks.Parallel.For(Int32, Int32, System.Action`1<Int32>)
       mov       rcx,offset MT_VSOP2013.VSOPResult_ELL
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       mov       edx,[rbp+20]
       mov       rax,[rbp+18]
       mov       rdi,[rbp+8]
       mov       [rsi+18],edx
       lea       rcx,[rsi+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       lea       rcx,[rsi+10]
       mov       rdx,rdi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rax,rsi
       add       rsp,38
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 226

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

; Demo.PerfTest.Compute()
       mov       r8,rcx
       mov       rcx,[r8+8]
       mov       r8,[r8+10]
       mov       edx,4
       cmp       [rcx],ecx
       jmp       qword ptr [7FFC1FB8F960]; VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
; Total bytes of code 24
; VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
       push      rbp
       push      r14
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,70
       lea       rbp,[rsp+90]
       mov       rbx,rcx
       mov       edi,edx
       mov       rsi,r8
       mov       rcx,offset MT_VSOP2013.Calculator+<>c__DisplayClass4_0
       call      CORINFO_HELP_NEWSFAST
       mov       r14,rax
       lea       rcx,[r14+10]
       mov       rdx,rbx
       call      CORINFO_HELP_ASSIGN_REF
       mov       [r14+20],edi
       lea       rcx,[r14+18]
       mov       rdx,rsi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Double[]
       mov       edx,6
       call      CORINFO_HELP_NEWARR_1_VC
       lea       rcx,[r14+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Action`1[[System.Int32, System.Private.CoreLib]]
       call      CORINFO_HELP_NEWSFAST
       mov       rbx,rax
       lea       rcx,[rbx+8]
       mov       rdx,r14
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,7FFC200B8888
       mov       [rbx+18],rcx
       mov       rcx,1B07C404478
       mov       rcx,[rcx]
       mov       [rsp+20],rcx
       mov       [rsp+28],rbx
       xor       ecx,ecx
       mov       [rsp+30],rcx
       mov       [rsp+38],rcx
       mov       [rsp+40],rcx
       mov       [rsp+48],rcx
       lea       rcx,[rbp-38]
       mov       rdx,offset MD_System.Threading.Tasks.Parallel.ForWorker[[System.Object, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1<Int32>, System.Action`2<Int32,System.Threading.Tasks.ParallelLoopState>, System.Func`4<Int32,System.Threading.Tasks.ParallelLoopState,System.Object,System.Object>, System.Func`1<System.Object>, System.Action`1<System.Object>)
       xor       r8d,r8d
       mov       r9d,6
       call      qword ptr [7FFC1FBE55A8]; System.Threading.Tasks.Parallel.ForWorker[[System.__Canon, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1<Int32>, System.Action`2<Int32,System.Threading.Tasks.ParallelLoopState>, System.Func`4<Int32,System.Threading.Tasks.ParallelLoopState,System.__Canon,System.__Canon>, System.Func`1<System.__Canon>, System.Action`1<System.__Canon>)
       mov       rcx,offset MT_VSOP2013.VSOPResult_ELL
       call      CORINFO_HELP_NEWSFAST
       mov       rbx,rax
       mov       ecx,[r14+20]
       mov       rdx,[r14+18]
       mov       rsi,[r14+8]
       mov       [rbx+18],ecx
       lea       rcx,[rbx+8]
       call      CORINFO_HELP_ASSIGN_REF
       lea       rcx,[rbx+10]
       mov       rdx,rsi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rax,rbx
       add       rsp,70
       pop       rbx
       pop       rsi
       pop       rdi
       pop       r14
       pop       rbp
       ret
; Total bytes of code 290

@AndyAyersMS
Member

AndyAyersMS commented Dec 13, 2023

Yeah I see similar results... the problem is down in native code so BDN isn't going to give much insight.

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22621.2861/22H2/2022Update/SunValley2)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.100
[Host] : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
Job-CWERMR : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
Job-CNLSJN : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

IterationCount=32

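| Method  | Runtime  | Mean     | Error     | StdDev    | Ratio | RatioSD | Allocated | Alloc Ratio |
|---------|----------|----------|-----------|-----------|-------|---------|-----------|-------------|
| Compute | .NET 6.0 | 1.782 ms | 0.0165 ms | 0.0252 ms | 1.00  | 0.00    | 2.85 KB   | 1.00        |
| Compute | .NET 8.0 | 2.800 ms | 0.0041 ms | 0.0061 ms | 1.57  | 0.02    | 2.9 KB    | 1.02        |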

Profiling claims the impact is almost all in coreclr!_sse4_sin2:

6.0

00.05%   2.4E+05     ?        Unknown
72.53%   3.8E+08     native   coreclr.dll
25.38%   1.33E+08    FullOpt  [VSOP2013.NET]Calculator.Calculate(value class VSOP2013.VariableTable,float64)
01.28%   6.72E+06    native   ntoskrnl.exe
00.15%   8.1E+05     native   ntdll.dll
00.07%   3.8E+05     native   KernelBase.dll
[Image]
8.0

00.02%   1.6E+05     ?        Unknown
82.04%   6.884E+08   native   coreclr.dll
17.09%   1.434E+08   FullOpt  [VSOP2013.NET]Calculator.Calculate(value class VSOP2013.VariableTable,float64)
00.46%   3.9E+06     native   ntoskrnl.exe
00.08%   6.5E+05     native   ntdll.dll
[Image]

Interestingly the Calculate method uses both stackalloc and has a loop, and so it bypasses tiering entirely (since OSR can't handle stackalloc, yet).

This is via using the benchmark switcher and forcing the two runs to do the same amount of work (so counts are comparable). (PerfView snippets above are not filtered to just benchmark intervals, so percentages differ a bit from my analysis tool, which currently can't resolve symbols in native code.)

@kingsznhone
Author

> Can you add the [DisassemblyDiagnoser] attribute to the benchmark and share the result: https://benchmarkdotnet.org/articles/features/disassembler.html
>
> It's possible this is something like the JCC Erratum that exists on Intel processors. It's also possible there is some subtle codegen difference that is impacting things here.
>
> Notably this currently just defers down to the C Runtime and there were no explicit changes to the logic done in the .NET 7/8 timeframe.


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.2861/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]   : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


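| Method  | Job      | Runtime  | Mean        | Error    | StdDev   | Code Size | Allocated |
|---------|----------|----------|-------------|----------|----------|-----------|-----------|
| Compute | .NET 6.0 | .NET 6.0 | 695.9 μs    | 13.69 μs | 16.30 μs | 256 B     | 3.28 KB   |
| Compute | .NET 8.0 | .NET 8.0 | 10,180.5 μs | 78.00 μs | 72.96 μs | 314 B     | 3.44 KB   |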

.NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2

; Demo.PerfTest.Compute()
       mov       [rsp+8],rcx
       mov       rcx,[rcx+8]
       mov       r8,[rsp+8]
       mov       r8,[r8+10]
       mov       edx,4
       cmp       [rcx],ecx
       jmp       near ptr VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
; Total bytes of code 30
; VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,38
       mov       rsi,rcx
       mov       ebx,edx
       mov       rdi,r8
       mov       rcx,offset MT_VSOP2013.Calculator+<>c__DisplayClass4_0
       call      CORINFO_HELP_NEWSFAST
       mov       rbp,rax
       lea       rcx,[rbp+10]
       mov       rdx,rsi
       call      CORINFO_HELP_ASSIGN_REF
       mov       [rbp+20],ebx
       lea       rcx,[rbp+18]
       mov       rdx,rdi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Double[]
       mov       edx,6
       call      CORINFO_HELP_NEWARR_1_VC
       lea       rcx,[rbp+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Action`1[[System.Int32, System.Private.CoreLib]]
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       lea       rcx,[rsi+8]
       mov       rdx,rbp
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset VSOP2013.Calculator+<>c__DisplayClass4_0.<GetPlanetPosition>b__0(Int32)
       mov       [rsi+18],rcx
       lea       rcx,[rsp+20]
       mov       r9,rsi
       xor       edx,edx
       mov       r8d,6
       call      System.Threading.Tasks.Parallel.For(Int32, Int32, System.Action`1<Int32>)
       mov       rcx,offset MT_VSOP2013.VSOPResult_ELL
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       mov       edx,[rbp+20]
       mov       rax,[rbp+18]
       mov       rdi,[rbp+8]
       mov       [rsi+18],edx
       lea       rcx,[rsi+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       lea       rcx,[rsi+10]
       mov       rdx,rdi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rax,rsi
       add       rsp,38
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
; Total bytes of code 226

.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

; Demo.PerfTest.Compute()
       mov       r8,rcx
       mov       rcx,[r8+8]
       mov       r8,[r8+10]
       mov       edx,4
       cmp       [rcx],ecx
       jmp       qword ptr [7FFE58BD7870]; VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
; Total bytes of code 24
; VSOP2013.Calculator.GetPlanetPosition(VSOP2013.VSOPBody, VSOP2013.VSOPTime)
       push      rbp
       push      r14
       push      rdi
       push      rsi
       push      rbx
       sub       rsp,70
       lea       rbp,[rsp+90]
       mov       rbx,rcx
       mov       edi,edx
       mov       rsi,r8
       mov       rcx,offset MT_VSOP2013.Calculator+<>c__DisplayClass4_0
       call      CORINFO_HELP_NEWSFAST
       mov       r14,rax
       lea       rcx,[r14+10]
       mov       rdx,rbx
       call      CORINFO_HELP_ASSIGN_REF
       mov       [r14+20],edi
       lea       rcx,[r14+18]
       mov       rdx,rsi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Double[]
       mov       edx,6
       call      CORINFO_HELP_NEWARR_1_VC
       lea       rcx,[r14+8]
       mov       rdx,rax
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset MT_System.Action`1[[System.Int32, System.Private.CoreLib]]
       call      CORINFO_HELP_NEWSFAST
       mov       rbx,rax
       lea       rcx,[rbx+8]
       mov       rdx,r14
       call      CORINFO_HELP_ASSIGN_REF
       mov       rcx,offset VSOP2013.Calculator+<>c__DisplayClass4_0.<GetPlanetPosition>b__0(Int32)
       mov       [rbx+18],rcx
       mov       rcx,15CCB006470
       mov       rcx,[rcx]
       mov       [rsp+20],rcx
       mov       [rsp+28],rbx
       xor       ecx,ecx
       mov       [rsp+30],rcx
       mov       [rsp+38],rcx
       mov       [rsp+40],rcx
       mov       [rsp+48],rcx
       lea       rcx,[rbp-38]
       mov       rdx,offset MD_System.Threading.Tasks.Parallel.ForWorker[[System.Object, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1<Int32>, System.Action`2<Int32,System.Threading.Tasks.ParallelLoopState>, System.Func`4<Int32,System.Threading.Tasks.ParallelLoopState,System.Object,System.Object>, System.Func`1<System.Object>, System.Action`1<System.Object>)
       xor       r8d,r8d
       mov       r9d,6
       call      qword ptr [7FFE58BDCED0]; System.Threading.Tasks.Parallel.ForWorker[[System.__Canon, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1<Int32>, System.Action`2<Int32,System.Threading.Tasks.ParallelLoopState>, System.Func`4<Int32,System.Threading.Tasks.ParallelLoopState,System.__Canon,System.__Canon>, System.Func`1<System.__Canon>, System.Action`1<System.__Canon>)
       mov       rcx,offset MT_VSOP2013.VSOPResult_ELL
       call      CORINFO_HELP_NEWSFAST
       mov       rbx,rax
       mov       ecx,[r14+20]
       mov       rdx,[r14+18]
       mov       rsi,[r14+8]
       mov       [rbx+18],ecx
       lea       rcx,[rbx+8]
       call      CORINFO_HELP_ASSIGN_REF
       lea       rcx,[rbx+10]
       mov       rdx,rsi
       call      CORINFO_HELP_ASSIGN_REF
       mov       rax,rbx
       add       rsp,70
       pop       rbx
       pop       rsi
       pop       rdi
       pop       r14
       pop       rbp
       ret
; Total bytes of code 290

@kingsznhone
Author

kingsznhone commented Dec 13, 2023

> Interestingly the `Calculate` method uses both `stackalloc` and has a loop, and so it bypasses tiering entirely (since OSR can't handle stackalloc, yet).
>
> This is via using the benchmark switcher and forcing the two runs to do the same amount of work (so counts are comparable). (PerfView snippets above are not filtered to just benchmark intervals, so percentages differ a bit from my analysis tool, which currently can't resolve symbols in native code.)

You pointed out stackalloc.
I changed the code to Span<double> t = new double[21];
and the benchmark result is back to normal.
The key to the puzzle is stackalloc.


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.2861/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]   : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


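| Method  | Job      | Runtime  | Mean     | Error    | StdDev   | Code Size | Allocated |
|---------|----------|----------|----------|----------|----------|-----------|-----------|
| Compute | .NET 6.0 | .NET 6.0 | 711.7 μs | 13.03 μs | 13.38 μs | 256 B     | 4.41 KB   |
| Compute | .NET 8.0 | .NET 8.0 | 707.2 μs | 11.09 μs | 10.37 μs | 314 B     | 4.44 KB   |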
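
To make the change concrete, here is a minimal sketch of the two shapes side by side. It is not the VSOP2013 code: the class name, method names, and the power-table loop are illustrative assumptions. The only difference between the two methods is where the 21-element scratch buffer lives.

```csharp
using System;

public static class ScratchBufferDemo // hypothetical name, for illustration only
{
    // Original shape: the scratch table is allocated on the stack.
    public static double WithStackalloc(double x)
    {
        Span<double> t = stackalloc double[21];
        FillPowers(t, x);
        double result = 0;
        for (int i = 0; i < t.Length; i++)
        {
            (double s, double c) = Math.SinCos(i);
            result += t[i] * (s + c);
        }
        return result;
    }

    // Changed shape: identical logic, but the table is a heap array.
    public static double WithHeapArray(double x)
    {
        Span<double> t = new double[21];
        FillPowers(t, x);
        double result = 0;
        for (int i = 0; i < t.Length; i++)
        {
            (double s, double c) = Math.SinCos(i);
            result += t[i] * (s + c);
        }
        return result;
    }

    // Fills t with 1, x, x^2, ... so both variants read the same data.
    private static void FillPowers(Span<double> t, double x)
    {
        t[0] = 1.0;
        for (int i = 1; i < t.Length; i++)
        {
            t[i] = t[i - 1] * x;
        }
    }
}
```

Both methods compute the same value. The heap version trades the stack buffer for one small allocation per call, which matches the higher Allocated figures in the table above.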

@kingsznhone
Author

Inference based on the current situation:
the performance loss only happens when all three conditions are met.

  1. a stackalloc'd Span is used
  2. Math.SinCos() is used
  3. the CPU is Intel

@AndyAyersMS
Member

If not using stackalloc fixes things for you, then great.

It would be good to understand what is going on more deeply as it is still a bit mysterious, especially the huge performance differences you see. The only things I know of that cause that magnitude of perf issues are some very rare cases like handling partially initialized vector data.

If you can run as admin on your box where you see very slow behavior, can you try and capture the ETW profile like I did above?
The steps required are:

  • modify the benchmark project, adding
    <PackageReference Include="BenchmarkDotNet.Diagnostics.Windows" Version="0.13.11" />
  • modify the main method of the benchmark to have
    BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
    instead of using BenchmarkRunner, and remove the [SimpleJob...] attributes.
  • invoke the benchmark in an admin cmd window, like this:
    dotnet run -c Release -f net8.0 -- -p ETW -r net6.0 net8.0 --iterationCount 32 -f *

As part of the run it will print a line like the following:

Exported 2 trace file(s). Example:
C:\repos\VSOP2013.NET\Demo\BenchmarkDotNet.Artifacts\Demo.PerfTest.Compute-.NET 6.0-20231213-133329.etl
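
Put together, the entry-point change from the steps above might look like the sketch below. Only the BenchmarkSwitcher line comes from the steps; the benchmark class `SinCosBenchmarks` and its method are placeholders standing in for the real benchmark.

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class Program
{
    // BenchmarkSwitcher parses the command-line arguments, so the runtimes and
    // the profiler are chosen at invocation time instead of via [SimpleJob].
    public static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);
}

// Hypothetical benchmark class; note there are no [SimpleJob] attributes.
public class SinCosBenchmarks
{
    [Benchmark]
    public double SinCosSum()
    {
        double result = 0;
        for (int i = 0; i < 21; i++)
        {
            (double s, double c) = Math.SinCos(i);
            result += s + c;
        }
        return result;
    }
}
```

With this entry point, the same binary can be run against several runtimes from one command line, which keeps the two traces comparable.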

Share out the ETL files for .NET 6 and .NET 8 and I or somebody else will dig in.

Feel free to investigate on your own, if you know your way around perfview.

@kingsznhone
Author


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.2861/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-SLXVWU : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
  Job-LMYLXB : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

IterationCount=32  

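| Method  | Runtime  | Mean        | Error    | StdDev   | Ratio | RatioSD |
|---------|----------|-------------|----------|----------|-------|---------|
| Compute | .NET 6.0 | 680.4 μs    | 4.26 μs  | 6.63 μs  | 1.00  | 0.00    |
| Compute | .NET 8.0 | 10,071.7 μs | 31.93 μs | 48.76 μs | 14.80 | 0.18    |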

etl files here

BenchmarkDotNet.Artifacts.zip

@kingsznhone
Author

I found that the IL code for .NET 6 and .NET 8 is identical,
so the C# compiler should be fine.
The bug might be located in RyuJIT.
Here is a minimal repro function.
Even though the Span is not used in the second loop, it still causes the performance regression.
The mere presence of stackalloc seriously interferes with RyuJIT's codegen.

public double TestSinCos()
{
    Span<double> t = stackalloc double[21];
    t[0] = 1.0d;
    t[1] = 0.9;
    for (int i = 2; i < 21; i++)
    {
        t[i] = t[1] * t[i - 1];
    }
    double s, c;
    double result = 0;
    for (int i = 0; i < 21; i++)
    {
        (s, c) = Math.SinCos(i);
        result += s + c;
    }
    return result;
}

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.2861/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]   : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2 [AttachedDebugger]
  .NET 6.0 : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2


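| Method     | Job      | Runtime  | Mean       | Error    | StdDev   | Code Size | Allocated |
|------------|----------|----------|------------|----------|----------|-----------|-----------|
| TestSinCos | .NET 6.0 | .NET 6.0 | 148.4 ns   | 1.79 ns  | 1.40 ns  | 239 B     | -         |
| TestSinCos | .NET 8.0 | .NET 8.0 | 2,243.1 ns | 16.27 ns | 13.59 ns | 263 B     | -         |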
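
For comparison, here is a minimal sketch of the same function with the buffer moved to a heap array. Elsewhere in this thread, not using stackalloc is reported to avoid the slowdown. The class and method names (SinCosRepro, TestSinCosHeap) are hypothetical and not part of the original project.

```csharp
using System;

public static class SinCosRepro
{
    // Same work as TestSinCos above, but the 21-element buffer is a heap
    // array instead of a stackalloc'd Span<double>.
    public static double TestSinCosHeap()
    {
        double[] t = new double[21];
        t[0] = 1.0;
        t[1] = 0.9;
        for (int i = 2; i < 21; i++)
        {
            t[i] = t[1] * t[i - 1];
        }

        double result = 0;
        for (int i = 0; i < 21; i++)
        {
            // int i converts implicitly to double here.
            (double s, double c) = Math.SinCos(i);
            result += s + c;
        }
        return result;
    }
}
```

Benchmarking this variant next to TestSinCos on the same machine would show whether the stackalloc is what triggers the difference there.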

@AndyAyersMS
Member

AndyAyersMS commented Dec 14, 2023

etl files here

BenchmarkDotNet.Artifacts.zip

Thanks for the ETL files—is the table just above that text from the same run?

When I analyze those files I didn't immediately see the huge slowdown in .NET 8. The raw data shows

=== 6.0 ===

00.08%   7.5E+05     ?        Unknown
53.36%   4.736E+08   native   coreclr.dll
40.15%   3.564E+08   FullOpt  [VSOP2013.NET]Calculator.Calculate(value class VSOP2013.VariableTable,float64)
03.72%   3.299E+07   native   ntoskrnl.exe
00.94%   8.34E+06    native   nvlddmkm.sys

Benchmark: found 32 intervals; mean interval 696.737ms
000 7032.966 -- 7736.418 : 703.452
001 7739.778 -- 8437.664 : 697.886
002 8441.099 -- 9131.644 : 690.545

=== 8.0 ===

00.01%   1.1E+05     ?        Unknown
55.97%   4.166E+08   native   coreclr.dll
42.85%   3.189E+08   FullOpt  [VSOP2013.NET]Calculator.Calculate(value class VSOP2013.VariableTable,float64)
00.42%   3.16E+06    native   nvlddmkm.sys

Benchmark: found 32 intervals; mean interval 644.912ms
000 6430.704 -- 7073.339 : 642.635
001 7076.786 -- 7718.127 : 641.341
002 7721.748 -- 8369.540 : 647.792

and the time for 8.0 if anything is a bit faster.

This assumes that Benchmark Dot Net actually did the same amount of work (iterations) per interval. To verify this we need to look at the log—do you happen to still have Demo.PerfTest-20231214-153218.log? If so, can you attach it here too?

On my local runs BDN did 256 iterations per interval, but looking at your data above, I suspect it may have done 1024.
If it did 1024 (assuming nominal overhead) then from your data, we should see a report with something like

6.0: 680us       (matches the BDN table)
8.0: 629us       (not 10,071.7)

@kingsznhone
Author

kingsznhone commented Dec 14, 2023

Here is the log. It was generated simultaneously with the ETL files.
Demo.PerfTest-20231214-153218.log

@kingsznhone
Author

kingsznhone commented Dec 14, 2023

Comparing the native assembly for the two runtimes, I notice that in .NET 6 the loop control variable i is kept in register esi and compared there.
In .NET 8, RyuJIT stores i in memory and compares it from there.

Further inference: the stackalloc variable makes RyuJIT conclude that there are not enough registers,
so it spills i to memory.
On each iteration, i is loaded from memory, compared with 21, then stored back.

[image "asm": disassembly comparison, .NET 6 on the left, .NET 8 on the right]

The same happens with (s,c)= Math.SinCos(a);
.NET 8 loads the result from memory into xmm registers, then stores it back to memory.
At result += s + c; it loads s and c from memory into xmm registers again.
These extra memory round trips must add significant latency.

[image "xmm": disassembly comparison, .NET 6 on the left, .NET 8 on the right]

This is the limit of my knowledge.
We need someone who really knows RyuJIT and Intel x86-64 internals to help.

@AndyAyersMS
Member

This assumes that Benchmark Dot Net actually did the same amount of work (iterations) per interval. To verify this we need to look at the log—do you happen to still have Demo.PerfTest-20231214-153218.log? If so, can you attach it here too?

Per the log, BDN is NOT doing the same amount of work. It runs 1024 iterations/invocation with .NET 6, but only 64 for .NET 8.

=== 6 ===
// Runtime=.NET 6.0.25 (6.0.2523.51912), X64 RyuJIT AVX2
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT,AvxVnni VectorSize=256
// Job: Job-ZAIVVM(IterationCount=32)

OverheadJitting  1: 1 op, 278600.00 ns, 278.6000 us/op
WorkloadJitting  1: 1 op, 2465600.00 ns, 2.4656 ms/op

OverheadJitting  2: 16 op, 651200.00 ns, 40.7000 us/op
WorkloadJitting  2: 16 op, 11173100.00 ns, 698.3188 us/op

WorkloadPilot    1: 16 op, 10901000.00 ns, 681.3125 us/op
WorkloadPilot    2: 32 op, 21496100.00 ns, 671.7531 us/op
WorkloadPilot    3: 64 op, 43180100.00 ns, 674.6891 us/op
WorkloadPilot    4: 128 op, 86651800.00 ns, 676.9672 us/op
WorkloadPilot    5: 256 op, 179121000.00 ns, 699.6914 us/op
WorkloadPilot    6: 512 op, 347089000.00 ns, 677.9082 us/op
WorkloadPilot    7: 1024 op, 699272000.00 ns, 682.8828 us/op

=== 8 ===
// Runtime=.NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
// GC=Concurrent Workstation
// HardwareIntrinsics=AVX2,AES,BMI1,BMI2,FMA,LZCNT,PCLMUL,POPCNT,AvxVnni,SERIALIZE VectorSize=256
// Job: Job-OBLZXA(IterationCount=32)

OverheadJitting  1: 1 op, 236300.00 ns, 236.3000 us/op
WorkloadJitting  1: 1 op, 11928700.00 ns, 11.9287 ms/op

OverheadJitting  2: 16 op, 554700.00 ns, 34.6688 us/op
WorkloadJitting  2: 16 op, 161574900.00 ns, 10.0984 ms/op

WorkloadPilot    1: 16 op, 161143700.00 ns, 10.0715 ms/op
WorkloadPilot    2: 32 op, 324932100.00 ns, 10.1541 ms/op
WorkloadPilot    3: 64 op, 649269700.00 ns, 10.1448 ms/op

Given that, the reported results are correct:

6.0:  696.737ms / 1024 = 680us       (matches the BDN table)
8.0:  644.912ms / 64   = 10,077us    (close to the BDN table's 10,071.7us)
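
The division above can be sketched as a small helper. The names BdnMath and MicrosecondsPerOp are hypothetical, and the inputs are the mean interval lengths from the ETW data together with the operations-per-interval counts from the BDN log.

```csharp
using System;

public static class BdnMath
{
    // Mean time per operation in microseconds, given the mean interval
    // length in milliseconds and the number of operations BDN ran in
    // each interval.
    public static double MicrosecondsPerOp(double intervalMs, int opsPerInterval)
        => intervalMs * 1000.0 / opsPerInterval;
}
```

With the figures quoted here, MicrosecondsPerOp(696.737, 1024) comes out at roughly 680.4 and MicrosecondsPerOp(644.912, 64) at roughly 10,076.8, in line with the BDN table means of 680.4 μs and 10,071.7 μs.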

So the questions are: (1) why does BDN's strategy diverge, and (2) why does this lead to quite different results overall? Not clear yet which is cause and effect, but it seems like the benchmark runs slower so BDN does less work to meet its iteration time goal.

In my local profiles I see the memory traffic in coreclr!_sse4_sin2 be slower in 8 than in 6. That suggests that somehow 8 is incurring cache conflicts or other memory unit stalls. How this is related to the jitted codegen is unclear. Let me see if I see the same thing in your profiles.

@MichalPetryka
Contributor

MichalPetryka commented Dec 14, 2023

why does BDN's strategy diverge

BDN (by default) does a variable amount of iterations depending on how much time the code takes.

@saucecontrol
Member

seems like the benchmark runs slower so BDN does less work to meet its iteration time goal

That's correct. It targets a run time of ~.5s and sets the UnrollFactor (ops in the output) based on the timing of sample runs. https://github.com/dotnet/BenchmarkDotNet/blob/master/docs/articles/guides/how-it-works.md

You can set UnrollFactor explicitly to match them up.

@AndyAyersMS
Member

AndyAyersMS commented Dec 14, 2023

Thanks all, I know all too well how BDN's strategy can shift about.

@kingsznhone can you try this experiment (with the unmodified stackalloc version of the code) on your box?

dotnet run -c Release -f net8.0 -- -p ETW -r net6.0 net8.0 --iterationCount 32 -f * --envVars DOTNET_EnableAVX:0

We suspect maybe we are seeing AVX-SSE transition penalties triggered by the fact that the stackalloc version uses ymm registers to zero the allocated stack region in .NET 8.

This fits what we know so far pretty well:

  • .NET 6 did not use ymm to clear the stackalloc
  • Using new instead of stackalloc removes all uses of ymm in the .NET 8 version of Calculate
  • Math.Sin and Math.Cos resolve to entrypoints in ucrtbase which are likely well-optimized for the specific machine, but Math.SinCos resolves to an assembly routine embedded in coreclr.dll which is not AVX aware.
  • DOTNET_EnableAVX=0 (or for BDN, DOTNET_EnableAVX:0) will likewise block the jit from using ymm registers.
  • All reports so far of regressions are on Intel processors
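
To confirm that a setting such as DOTNET_EnableAVX=0 actually took effect in a given process, one option is to print what the runtime reports for the relevant instruction sets. This is a minimal sketch: the IsaReport class is a hypothetical name, Vector256.IsHardwareAccelerated assumes .NET 7 or later, and the expectation that Avx.IsSupported reads false when AVX is disabled should be checked on your own machine.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class IsaReport
{
    // Summarizes which AVX-related instruction sets the JIT is willing to use
    // in the current process.
    public static string Describe()
        => $"Avx.IsSupported={Avx.IsSupported}, " +
           $"Avx2.IsSupported={Avx2.IsSupported}, " +
           $"Vector256.IsHardwareAccelerated={Vector256.IsHardwareAccelerated}";
}
```

Calling Console.WriteLine(IsaReport.Describe()) once at startup, with and without the environment variable, gives a quick sanity check before comparing benchmark numbers.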

@AndyAyersMS AndyAyersMS removed the untriaged New issue has not been triaged by the area owner label Dec 15, 2023
@kingsznhone
Author

kingsznhone commented Dec 15, 2023

@AndyAyersMS Here is the result you asked for. It gets us closer to the answer. The ETL files are attached below.

By the way, I wonder why SinCos(double a) was designed to return a tuple. Could that be a hidden source of trouble?

I would prefer a SinCos(double a, out double sa, out double ca) style.


BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.2861/23H2/2023Update/SunValley3)
12th Gen Intel Core i9-12950HX, 1 CPU, 24 logical and 16 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-YCTKEM : .NET 6.0.25 (6.0.2523.51912), X64 RyuJIT SSE4.2
  Job-TCDZGP : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT SSE4.2

EnvironmentVariables=DOTNET_EnableAVX=0  IterationCount=32  

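| Method  | Runtime  | Mean     | Error    | StdDev   | Ratio | RatioSD | Code Size | Allocated | Alloc Ratio |
|---------|----------|----------|----------|----------|-------|---------|-----------|-----------|-------------|
| Compute | .NET 6.0 | 862.5 μs | 14.71 μs | 22.01 μs | 1.00  | 0.00    | 256 B     | 2.85 KB   | 1.00        |
| Compute | .NET 8.0 | 836.4 μs | 11.45 μs | 17.14 μs | 0.97  | 0.04    | 314 B     | 2.89 KB   | 1.01        |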

Demo.PerfTest-20231215-194206.zip

@kunalspathak
Member

Related: #82132 (comment)

@AndyAyersMS
Member

@tannergooding any thoughts on this?

@tannergooding
Member

The issue here isn't the tuple, that's desirable from a usability perspective and for performance/efficiency on most platforms.

The actual issue here ends up being two parts:

  1. pessimization caused by MSVC in the codegen used to call the underlying __sincos
  2. The vzeroupper issue

For 1, there isn't much we can do. The code quality will ideally improve over time or we may be able to explicitly work around the issue by calling Sin and Cos from Managed for Windows in particular.
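
In user code, the idea of calling Sin and Cos separately can be sketched as a drop-in helper with the same tuple shape as Math.SinCos. TrigHelpers and SinCosSeparate are hypothetical names, and whether this is faster on a given machine and runtime is something to measure rather than assume.

```csharp
using System;

public static class TrigHelpers
{
    // Hypothetical helper: returns the same (Sin, Cos) tuple shape as
    // Math.SinCos, but computes each value with its own call.
    public static (double Sin, double Cos) SinCosSeparate(double x)
        => (Math.Sin(x), Math.Cos(x));
}
```

A call site such as (s, c) = Math.SinCos(u); can then be switched to (s, c) = TrigHelpers.SinCosSeparate(u); without other changes.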

For 2, this can be closed as a duplicate of #82132

The simple fix is we should be emitting vzeroupper before transferring control to anything which may be "AVX unaware" (P/Invokes and some R2R methods) and therefore which could use the legacy encoded instructions.

On modern hardware, vzeroupper tends to be free (handled in register renaming); and as noted above the transition penalty is already not expensive for AMD; but it may still incur cost on older hardware which can be a net negative for method calls which don't use floating-point/SIMD at all.

We have a separate issue (#11496) tracking our existing overuse of vzeroupper in other areas, which itself really only needs to be before or after such transition boundaries, not as part of every managed function. This is because there is no penalty going between 128-bit legacy <-> 128-bit VEX, only when going between 128-bit legacy <-> 256-bit or higher VEX/EVEX. The diagram for that (in the worst case) is:
[image: transition diagram]

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 12, 2024
@tannergooding
Member

vzeroupper fix is #98261, no longer see the regression locally with the fix

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 13, 2024
@kingsznhone
Author

vzeroupper fix is #98261, no longer see the regression locally with the fix

Thanks for your work. Will this fix be released with the next .NET 8 minor release?

@tannergooding
Member

@kingsznhone I commented on that here: #98261 (comment)

Reposting for convenience:

my expectation is "no", but it would ultimately be up to @JulieLeeMSFT on whether or not we take it for a servicing bar check.

This is a general issue going back to .NET Framework, so it's not technically a regression. There were two new customer reported scenarios that it shows up in .NET 8, but they are just variations on the same general issue and are showing up primarily due to the context of broader code (user code + library code + user optimizations happen to trigger it for this scenario).

The fix here is relatively straightforward, but it's also not isolated and impacts a lot of code across the BCL. Because of this it's possible that there are scenarios not covered or a particular microarchitecture this doesn't fix, so it's not easy to label it as "low risk". Given a couple months time, it might be easier to label this as "low risk", certainly after we get the first set of benchmark numbers in our weekly perf triage next Tuesday.

And then finally, there are some "workarounds" devs can do to "fix" this by utilizing knowledge of when the JIT emits vzeroupper. Most notably you can "force" the JIT to emit a vzeroupper before a P/Invoke by simply ensuring some V256 usage exists before the P/Invoke call. One example of this is the following, where you'd simply use _ = GetZero(); before the P/Invoke. This will force a call which emits vzeroupper and then never mutates the upper bits, ensuring you're in a "clean" state so that the penalty doesn't exist.

[MethodImpl(MethodImplOptions.NoInlining)]
public static Vector128<float> GetZero() => Vector128<float>.Zero;
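
A usage sketch of that pattern, placed in front of the SinCos loop from the repro earlier in this thread. The class and method names (VzeroupperWorkaround, SumSinCos) are hypothetical. The description above is about P/Invokes, so whether the JIT emits vzeroupper at this call site and whether that removes the penalty for Math.SinCos are assumptions to verify with a benchmark.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class VzeroupperWorkaround
{
    // The helper from the comment above, kept out of line on purpose.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static Vector128<float> GetZero() => Vector128<float>.Zero;

    public static double SumSinCos(int count)
    {
        // Discarded call placed before the potentially AVX-unaware call,
        // following the workaround described above.
        _ = GetZero();

        double result = 0;
        for (int i = 0; i < count; i++)
        {
            (double s, double c) = Math.SinCos(i);
            result += s + c;
        }
        return result;
    }
}
```

The numeric result is the same with or without the GetZero call; only the timing is expected to differ, if the workaround applies at all.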

@github-actions github-actions bot locked and limited conversation to collaborators Mar 23, 2024