Investigate regressions of binary-trees and fannkuch on benchmarksgame site #40810

Closed
danmoseley opened this issue Aug 14, 2020 · 24 comments
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI os-windows tenet-performance Performance related issue

@danmoseley
Member

The Benchmarks Game site has a comparison of 3.1 vs. 5.0 preview 7: https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharppreview.html

There are 4 regressions; 2 may be noise level. The other two are binary-trees (7%) and fannkuch-redux (4%).

I'll check those two using the source of the latest submissions.

@danmoseley
Member Author

danmoseley commented Aug 14, 2020

Here are the results I get on Ubuntu using the latest sources for these two games (BinaryTrees_6 and FannkuchRedux_9 from dotnet/performance#1453, but with the official input parameters).

I included the various other flavors we already have, which are older.

There is no evidence of a regression here for the "official" sources, so it's unclear why they're seeing one, unless it's from slightly different hardware or builds. I guess we'll see whether it goes away in the final official results.
cc @AndyAyersMS

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-HFLFME : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-XLBCVM : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-HEKPDG : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT
  Job-YYWTGM : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-DISMCE : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-HDJRLZ : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MinIterationCount=15  WarmupCount=1  

| Type | Method | Job | Runtime | Toolchain | MaxIterationCount | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BinaryTrees_2 | BinaryTrees_2 | Job-HFLFME | .NET Core 2.1 | netcoreapp2.1 | 20 | 123.4 ms | 1.74 ms | 1.71 ms | 122.9 ms | 120.9 ms | 127.8 ms | 1.00 | 0.00 | 53000.0000 | 1500.0000 | 500.0000 | 232789.44 KB |
| BinaryTrees_2 | BinaryTrees_2 | Job-XLBCVM | .NET Core 3.1 | netcoreapp3.1 | 20 | 150.3 ms | 2.72 ms | 2.54 ms | 150.6 ms | 145.7 ms | 154.1 ms | 1.22 | 0.03 | 55000.0000 | 1000.0000 | - | 232789.44 KB |
| BinaryTrees_2 | BinaryTrees_2 | Job-HEKPDG | .NET Core 5.0 | netcoreapp5.0 | 20 | 177.2 ms | 6.09 ms | 7.01 ms | 177.7 ms | 162.5 ms | 189.3 ms | 1.43 | 0.06 | 50000.0000 | 1000.0000 | - | 232789.72 KB |
| BinaryTrees_5 | BinaryTrees_5 | Job-YYWTGM | .NET Core 2.1 | netcoreapp2.1 | 40 | 140.0 ms | 2.77 ms | 3.88 ms | 140.2 ms | 130.7 ms | 145.6 ms | 1.00 | 0.00 | 39000.0000 | 12500.0000 | 2500.0000 | 1.7 KB |
| BinaryTrees_5 | BinaryTrees_5 | Job-DISMCE | .NET Core 3.1 | netcoreapp3.1 | 40 | 145.3 ms | 2.89 ms | 4.41 ms | 145.7 ms | 133.5 ms | 152.2 ms | 1.04 | 0.04 | 38000.0000 | 11000.0000 | 4000.0000 | 232799.7 KB |
| BinaryTrees_5 | BinaryTrees_5 | Job-HDJRLZ | .NET Core 5.0 | netcoreapp5.0 | 40 | 141.7 ms | 2.82 ms | 5.01 ms | 141.6 ms | 132.0 ms | 154.0 ms | 1.01 | 0.04 | 38000.0000 | 9000.0000 | 3000.0000 | 232793.72 KB |
| BinaryTrees_6 | BinaryTrees_6 | Job-YYWTGM | .NET Core 2.1 | netcoreapp2.1 | 40 | 8,559.7 ms | 164.30 ms | 153.69 ms | 8,558.8 ms | 8,295.0 ms | 8,832.5 ms | 1.00 | 0.00 | 1942000.0000 | 359000.0000 | 9000.0000 | 65551.55 KB |
| BinaryTrees_6 | BinaryTrees_6 | Job-DISMCE | .NET Core 3.1 | netcoreapp3.1 | 40 | 8,494.8 ms | 154.52 ms | 158.68 ms | 8,498.8 ms | 8,143.3 ms | 8,747.1 ms | 0.99 | 0.02 | 1940000.0000 | 370000.0000 | 9000.0000 | 9546425.66 KB |
| BinaryTrees_6 | BinaryTrees_6 | Job-HDJRLZ | .NET Core 5.0 | netcoreapp5.0 | 40 | 7,929.7 ms | 158.14 ms | 147.92 ms | 7,900.6 ms | 7,603.8 ms | 8,176.0 ms | 0.93 | 0.03 | 1947000.0000 | 367000.0000 | 9000.0000 | 9546422.11 KB |

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-YVYDTA : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-DOYOOY : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-ZKDSPK : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  

| Type | Method | Job | Runtime | Toolchain | n | expectedSum | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FannkuchRedux_5 | FannkuchRedux_5 | Job-YVYDTA | .NET Core 2.1 | netcoreapp2.1 | 10 | 38 | 37.02 ms | 1.385 ms | 1.595 ms | 36.31 ms | 35.61 ms | 40.08 ms | 1.00 | 0.00 | - | - | - | 3120 B |
| FannkuchRedux_5 | FannkuchRedux_5 | Job-DOYOOY | .NET Core 3.1 | netcoreapp3.1 | 10 | 38 | 37.87 ms | 0.632 ms | 0.592 ms | 37.70 ms | 36.97 ms | 39.07 ms | 1.03 | 0.04 | - | - | - | 4632 B |
| FannkuchRedux_5 | FannkuchRedux_5 | Job-ZKDSPK | .NET Core 5.0 | netcoreapp5.0 | 10 | 38 | 36.98 ms | 0.698 ms | 0.619 ms | 36.92 ms | 36.30 ms | 38.26 ms | 1.01 | 0.04 | - | - | - | 4673 B |
| FannkuchRedux_2 | FannkuchRedux_2 | Job-YVYDTA | .NET Core 2.1 | netcoreapp2.1 | 10 | 73196 | 150.94 ms | 1.181 ms | 0.986 ms | 150.94 ms | 149.68 ms | 152.25 ms | 1.00 | 0.00 | - | - | - | 224 B |
| FannkuchRedux_2 | FannkuchRedux_2 | Job-DOYOOY | .NET Core 3.1 | netcoreapp3.1 | 10 | 73196 | 143.72 ms | 1.420 ms | 1.108 ms | 143.52 ms | 142.36 ms | 146.02 ms | 0.95 | 0.01 | - | - | - | 224 B |
| FannkuchRedux_2 | FannkuchRedux_2 | Job-ZKDSPK | .NET Core 5.0 | netcoreapp5.0 | 10 | 73196 | 153.09 ms | 1.336 ms | 1.184 ms | 152.98 ms | 151.65 ms | 155.17 ms | 1.01 | 0.01 | - | - | - | 1168 B |
| FannkuchRedux_9 | FannkuchRedux_5 | Job-YVYDTA | .NET Core 2.1 | netcoreapp2.1 | 12 | 3968050 | 4,430.65 ms | 7.464 ms | 6.233 ms | 4,431.80 ms | 4,421.10 ms | 4,439.71 ms | 1.00 | 0.00 | - | - | - | 2368 B |
| FannkuchRedux_9 | FannkuchRedux_5 | Job-DOYOOY | .NET Core 3.1 | netcoreapp3.1 | 12 | 3968050 | 4,702.96 ms | 62.185 ms | 55.126 ms | 4,679.59 ms | 4,655.59 ms | 4,827.31 ms | 1.06 | 0.01 | - | - | - | 2536 B |
| FannkuchRedux_9 | FannkuchRedux_5 | Job-ZKDSPK | .NET Core 5.0 | netcoreapp5.0 | 12 | 3968050 | 4,411.65 ms | 48.697 ms | 45.551 ms | 4,389.69 ms | 4,373.25 ms | 4,502.68 ms | 1.00 | 0.01 | - | - | - | 2824 B |

@danmoseley
Member Author

cc @benaadams

@billwert
Member

billwert commented Aug 14, 2020

If this is to track a potential regression in the product should we move it to dotnet/runtime @danmosemsft?

@danmoseley danmoseley transferred this issue from dotnet/performance Aug 14, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-Meta untriaged New issue has not been triaged by the area owner labels Aug 14, 2020
@danmoseley danmoseley added tenet-performance Performance related issue and removed untriaged New issue has not been triaged by the area owner labels Aug 14, 2020
@adamsitnik adamsitnik added this to the 5.0.0 milestone Aug 18, 2020
@adamsitnik
Member

FWIW the regression is Windows-specific

BenchmarksGame.BinaryTrees_2.RunBench

| Conclusion | Base | Diff | Base/Diff | Modality | Operating System | Bit | Processor Name | Base Runtime | Diff Runtime |
|---|---|---|---|---|---|---|---|---|---|
| Slower | 104517050.00 | 115141650.00 | 0.91 | | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | .NET Core 5.0.0 |
| Same | 121320538.00 | 126695979.00 | 0.96 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | .NET Core 5.0.0 |
| Same | 189125801.00 | 185862914.00 | 1.02 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | .NET Core 5.0.0 |

BenchmarksGame.FannkuchRedux_5.RunBench(n: 10, expectedSum: 38)

| Conclusion | Base | Diff | Base/Diff | Modality | Operating System | Bit | Processor Name | Base Runtime | Diff Runtime |
|---|---|---|---|---|---|---|---|---|---|
| Slower | 24379150.00 | 39701257.14 | 0.61 | bimodal | Windows 10.0.18363.959 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | .NET Core 5.0.0 |
| Same | 27439371.19 | 27543953.25 | 1.00 | | ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz | .NET Core 3.1.6 | .NET Core 5.0.0 |
| Same | 82078762.25 | 87040901.88 | 0.94 | | macOS Mojave 10.14.5 | X64 | Intel Core i7-5557U CPU 3.10GHz (Broadwell) | .NET Core 3.1.6 | .NET Core 5.0.0 |
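
For readers checking the numbers: the Base/Diff column appears to be the 3.1 (Base) measurement divided by the 5.0 (Diff) measurement, so values below 1.00 mean 5.0 took longer. A minimal sketch that reproduces the two Windows rows; the helper name `base_over_diff` is made up for illustration, and the measurements are assumed to be in nanoseconds.

```python
def base_over_diff(base: float, diff: float) -> float:
    """Ratio as shown in the Base/Diff column, rounded to two places."""
    return round(base / diff, 2)

# (Base, Diff) pairs copied from the Windows rows of the two result tables.
windows_rows = {
    "BinaryTrees_2": (104517050.00, 115141650.00),
    "FannkuchRedux_5": (24379150.00, 39701257.14),
}

ratios = {name: base_over_diff(b, d) for name, (b, d) in windows_rows.items()}
for name, ratio in ratios.items():
    print(f"{name}: Base/Diff = {ratio:.2f}")
```

Both results agree with the reported 0.91 and 0.61.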

@danmoseley
Member Author

They measure on Ubuntu, so if there's a Windows regression, that's coincidental 🙂

@AndyAyersMS
Member

The Benchmarks Game now uses a slightly more up-to-date Ivy Bridge CPU.

Measured on a quad-core 3.0GHz Intel® i5-3330® with 15.8 GiB of RAM and 2TB SATA disk drive; using Ubuntu™ 20.04 x86_64 GNU/Linux 5.4.0-40-generic.

@AndyAyersMS
Member

I'll investigate but may not get around to it until next week sometime. So if anyone wants to drill in the meantime, please go ahead.

cc @dotnet/jit-contrib

@ericstj ericstj added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Meta labels Aug 26, 2020
@AndyAyersMS
Member

AndyAyersMS commented Aug 28, 2020

For BinaryTrees_2, there is a small codegen diff in bottomUpTree at Tier1. The lea does not get CSE'd in 5.0 and there's an extra mov in the prolog.

;; 3.1

G_M54868_IG01:
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32

G_M54868_IG02:
       85C9                 test     ecx, ecx
       7E4C                 jle      SHORT G_M54868_IG04
       8D71FF               lea      esi, [rcx-1]
       8BCE                 mov      ecx, esi
       E8E3F3FFFF           call     TreeNode:bottomUpTree(int):struct
       488BF8               mov      rdi, rax
       8BCE                 mov      ecx, esi
       E8D9F3FFFF           call     TreeNode:bottomUpTree(int):struct

;; 5.0

G_M46436_IG01:
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32
       8BF1                 mov      esi, ecx

G_M46436_IG02:
       85F6                 test     esi, esi
       7E4B                 jle      SHORT G_M46436_IG05

G_M46436_IG03:
       8D4EFF               lea      ecx, [rsi-1]
       E8EBF3FFFF           call     TreeNode:bottomUpTree(int):TreeNode
       488BF8               mov      rdi, rax
       8D4EFF               lea      ecx, [rsi-1]
       E8E0F3FFFF           call     TreeNode:bottomUpTree(int):TreeNode

I have not yet confirmed this is the root cause of the perf difference, will try to do that soon.

Since this is a call-crossing lifetime I wonder if this is a consequence of #1463; the PR there shows a small perf score regression so presumably also a small code size regression.

Not sure if the new heuristic takes into account that CSEing something that uses a call-crossing live var might be cheaper than it seems if the CSEs are the only crossing appearances of the var; basically we'd be trading one call crossing lifetime for another.

cc @briansull

@AndyAyersMS
Member

More or less the same diff in the Linux version:

;; 3.1

G_M54868_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       50                   push     rax
       488D6C2420           lea      rbp, [rsp+20H]

G_M54868_IG02:
       85FF                 test     edi, edi
       7E4F                 jle      SHORT G_M54868_IG04
       8D5FFF               lea      ebx, [rdi-1]
       8BFB                 mov      edi, ebx
       E800000000           call     TreeNode:bottomUpTree(int):struct
       4C8BF0               mov      r14, rax
       8BFB                 mov      edi, ebx
       E800000000           call     TreeNode:bottomUpTree(int):struct

;; 5.0

G_M46436_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       50                   push     rax
       488D6C2420           lea      rbp, [rsp+20H]
       8BDF                 mov      ebx, edi

G_M46436_IG02:
       85DB                 test     ebx, ebx
       7E4E                 jle      SHORT G_M46436_IG05

G_M46436_IG03:
       8D7BFF               lea      edi, [rbx-1]
       E800000000           call     TreeNode:bottomUpTree(int):TreeNode
       4C8BF0               mov      r14, rax
       8D7BFF               lea      edi, [rbx-1]
       E800000000           call     TreeNode:bottomUpTree(int):TreeNode

@briansull
Contributor

@AndyAyersMS I will take a look

@briansull
Contributor

briansull commented Aug 28, 2020

I wouldn't expect that
lea edi, [rbx-1]
would be slower than
mov edi, ebx

The extra instruction in the prolog
mov ebx, edi
should be offset by the extra instruction to setup the CSE
lea ebx, [rdi-1]
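
As a rough sanity check of the "one instruction offsets the other" argument, the sketch below tallies instruction count and encoded size for the Windows x64 excerpts posted earlier in this thread (prolog through the second call). The hex encodings are copied from those listings; the `tally` helper is made up for illustration. Static counts like these say nothing about latency or alignment.

```python
# Machine-code bytes copied from the Windows x64 listings in this thread,
# one hex string per instruction, prolog through the second recursive call.
CODE_3_1 = ["57", "56", "53", "4883EC20",
            "85C9", "7E4C", "8D71FF", "8BCE", "E8E3F3FFFF",
            "488BF8", "8BCE", "E8D9F3FFFF"]
CODE_5_0 = ["57", "56", "53", "4883EC20", "8BF1",
            "85F6", "7E4B", "8D4EFF", "E8EBF3FFFF",
            "488BF8", "8D4EFF", "E8E0F3FFFF"]

def tally(encodings):
    """Return (instruction count, total encoded bytes) for a list of hex strings."""
    return len(encodings), sum(len(bytes.fromhex(e)) for e in encodings)

print("3.1:", tally(CODE_3_1))
print("5.0:", tally(CODE_5_0))
```

On this excerpt both versions have 12 instructions; the 5.0 sequence is one byte longer (32 vs 31 bytes) because each `lea` encodes in 3 bytes where the `mov` it replaces takes 2.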

@briansull
Contributor

CSE doesn't see this as profitable.

@AndyAyersMS
Member

But in 3.0 it was considered profitable?

@AndyAyersMS
Member

Locally I see a potential regression in FannkuchRedux_9 on Windows x64, though the 5.0 run has high variance. The other two versions (FannkuchRedux_5 and FannkuchRedux_2) seem to be consistently faster in 5.0.

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.450 (2004/May2020Update/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.1.20407.13
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT
  Job-NMJCAW : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-JQYCRY : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT

| Type | RT | n | sum | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| FannkuchRedux_5 | 3.1 | 10 | 38 | 45.39 ms | 2.493 ms | 2.771 ms | 44.09 ms | 42.60 ms | 52.44 ms | 1.00 |
| FannkuchRedux_5 | 5.0 | 10 | 38 | 43.68 ms | 1.006 ms | 1.076 ms | 43.57 ms | 42.21 ms | 45.91 ms | 0.97 |
| FannkuchRedux_2 | 3.1 | 10 | 73196 | 163.94 ms | 10.318 ms | 11.882 ms | 164.25 ms | 146.87 ms | 191.22 ms | 1.00 |
| FannkuchRedux_2 | 5.0 | 10 | 73196 | 146.51 ms | 4.870 ms | 5.609 ms | 146.05 ms | 139.46 ms | 160.05 ms | 0.90 |
| FannkuchRedux_9 | 3.1 | 11 | 556355 | 404.73 ms | 8.088 ms | 7.943 ms | 404.47 ms | 392.71 ms | 421.88 ms | 1.00 |
| FannkuchRedux_9 | 5.0 | 11 | 556355 | 436.61 ms | 19.457 ms | 22.407 ms | 438.80 ms | 405.44 ms | 469.88 ms | 1.07 |

Will take a look on Ubuntu too.

@AndyAyersMS
Member

Ubuntu looks similar for me (note older HW and Preview 8 vs RC1, so not directly comparable to the above) -- FannkuchRedux_9 seems slower in 5.0.

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20417.9
  [Host]     : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-JHIWSX : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-CWFQLH : .NET Core 5.0.0 (CoreCLR 5.0.20.40711, CoreFX 5.0.20.40711), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  

| Type | RT | n | sum | Mean | Error | StdDev | Median | Min | Max | Ratio |
|---|---|---|---|---|---|---|---|---|---|---|
| FannkuchRedux_5 | 3.1 | 10 | 38 | 75.09 ms | 0.745 ms | 0.660 ms | 74.92 ms | 74.34 ms | 76.77 ms | 1.00 |
| FannkuchRedux_5 | 5.0 | 10 | 38 | 76.80 ms | 1.218 ms | 1.139 ms | 76.38 ms | 75.27 ms | 79.02 ms | 1.02 |
| FannkuchRedux_2 | 3.1 | 10 | 73196 | 317.39 ms | 0.212 ms | 0.188 ms | 317.37 ms | 317.07 ms | 317.80 ms | 1.00 |
| FannkuchRedux_2 | 5.0 | 10 | 73196 | 319.04 ms | 0.403 ms | 0.337 ms | 319.04 ms | 318.08 ms | 319.56 ms | 1.01 |
| FannkuchRedux_9 | 3.1 | 11 | 556355 | 676.94 ms | 6.637 ms | 6.208 ms | 677.03 ms | 665.76 ms | 685.21 ms | 1.00 |
| FannkuchRedux_9 | 5.0 | 11 | 556355 | 710.44 ms | 2.657 ms | 2.218 ms | 710.34 ms | 705.68 ms | 714.14 ms | 1.05 |

@AndyAyersMS
Member

For FannkuchRedux_9, all the key perf is in Run.

For Windows x64, codegen in 3.1 and 5.0 is very similar. 5.0 does a few more CSEs and avoids some spills. Most CQ (code quality) looks locally better; e.g., here is the inner loop of the last for in FirstPermutation:

;;; 3.1 (x64 win)

G_M27408_IG15:
       428D043B             lea      eax, [rbx+r15]
       458D6601             lea      r12d, [r14+1]           // loop invariant
       99                   cdq      
       41F7FC               idiv     edx:eax, r12d
       4863C2               movsxd   rax, edx
       480FBF0447           movsx    rax, word  ptr [rdi+2*rax]
       4863D3               movsxd   rdx, ebx
       66890456             mov      word  ptr [rsi+2*rdx], ax
       FFC3                 inc      ebx
       413BDE               cmp      ebx, r14d
       7EDE                 jle      SHORT G_M27408_IG15

;;; 5.0 (one fewer instruction: the loop-invariant lea is gone from the loop)

G_M7882_IG16:
       438D043C             lea      eax, [r12+r15]
       99                   cdq      
       41F7FD               idiv     edx:eax, r13d
       4863C2               movsxd   rax, edx
       480FBF0447           movsx    rax, word  ptr [rdi+2*rax]
       4963D4               movsxd   rdx, r12d
       66890456             mov      word  ptr [rsi+2*rdx], ax
       41FFC4               inc      r12d
       453BE6               cmp      r12d, r14d
       7EE1                 jle      SHORT G_M7882_IG16

Because of this, the loops end up aligned differently; I suspect this might be the root cause of the perf issues.

Similar diffs in Linux x64 codegen.

@AndyAyersMS AndyAyersMS self-assigned this Aug 28, 2020
@AndyAyersMS
Member

For FannkuchRedux_9, looking at Linux perf, 5.0 executes fewer instructions but takes more cycles (IPC is 0.99 in 3.1; 0.94 in 5.0). Branch misses are up. So this is some kind of micro-architectural effect, quite possibly code alignment.

andy@andy-ubuntu:~/bugs/r40810$ perf stat -d dotnet exec /home/andy/bugs/r40810/f/bin/Release/net5.0/f.dll 
556355
Pfannkuchen(11) = 51

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/f/bin/Release/net5.0/f.dll':

          5,776.17 msec task-clock                #    6.407 CPUs utilized          
               219      context-switches          #    0.038 K/sec                  
                 5      cpu-migrations            #    0.001 K/sec                  
             2,187      page-faults               #    0.379 K/sec                  
    11,369,904,249      cycles                    #    1.968 GHz                      (32.64%)
     7,166,244,393      stalled-cycles-frontend   #   63.03% frontend cycles idle     (42.47%)
     4,246,791,011      stalled-cycles-backend    #   37.35% backend cycles idle      (43.00%)
    10,734,941,194      instructions              #    0.94  insn per cycle         
                                                  #    0.67  stalled cycles per insn  (53.86%)
     1,554,131,325      branches                  #  269.059 M/sec                    (51.40%)
       202,785,963      branch-misses             #   13.05% of all branches          (52.87%)
     1,923,302,116      L1-dcache-loads           #  332.972 M/sec                    (38.33%)
         2,188,800      L1-dcache-load-misses     #    0.11% of all L1-dcache hits    (31.07%)
           699,606      LLC-loads                 #    0.121 M/sec                    (22.26%)
   <not supported>      LLC-load-misses                                             

       0.901590482 seconds time elapsed

       5.738926000 seconds user
       0.043961000 seconds sys


andy@andy-ubuntu:~/bugs/r40810$ perf stat -d dotnet exec /home/andy/bugs/r40810/f/bin/Release/netcoreapp3.1/f.dll
556355
Pfannkuchen(11) = 51

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/f/bin/Release/netcoreapp3.1/f.dll':

          5,541.41 msec task-clock                #    6.571 CPUs utilized          
               200      context-switches          #    0.036 K/sec                  
                 8      cpu-migrations            #    0.001 K/sec                  
             1,996      page-faults               #    0.360 K/sec                  
    10,895,741,345      cycles                    #    1.966 GHz                      (33.64%)
     6,964,856,321      stalled-cycles-frontend   #   63.92% frontend cycles idle     (41.88%)
     3,964,367,855      stalled-cycles-backend    #   36.38% backend cycles idle      (43.15%)
    10,761,932,188      instructions              #    0.99  insn per cycle         
                                                  #    0.65  stalled cycles per insn  (54.04%)
     1,557,636,040      branches                  #  281.090 M/sec                    (54.03%)
       183,314,400      branch-misses             #   11.77% of all branches          (54.76%)
     1,934,439,969      L1-dcache-loads           #  349.088 M/sec                    (43.98%)
         3,206,904      L1-dcache-load-misses     #    0.17% of all L1-dcache hits    (33.38%)
           701,527      LLC-loads                 #    0.127 M/sec                    (22.60%)
   <not supported>      LLC-load-misses                                             

       0.843306468 seconds time elapsed

       5.519643000 seconds user
       0.027957000 seconds sys
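
The IPC and branch-miss figures quoted above follow directly from the raw counters. A minimal sketch, with values copied from the two `perf stat` runs; the helper names are made up for illustration. Note that the counters were multiplexed (the percentages in parentheses in the perf output), so they are estimates rather than exact counts.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per cycle."""
    return instructions / cycles

def branch_miss_pct(misses: int, branches: int) -> float:
    """Branch misses as a percentage of all branches."""
    return 100.0 * misses / branches

# Counters copied from the FannkuchRedux_9 perf stat output in this thread.
runs = {
    "5.0": dict(cycles=11_369_904_249, instructions=10_734_941_194,
                branches=1_554_131_325, branch_misses=202_785_963),
    "3.1": dict(cycles=10_895_741_345, instructions=10_761_932_188,
                branches=1_557_636_040, branch_misses=183_314_400),
}

summary = {
    rt: (round(ipc(c["instructions"], c["cycles"]), 2),
         round(branch_miss_pct(c["branch_misses"], c["branches"]), 2))
    for rt, c in runs.items()
}
for rt, (ipc_value, miss_pct) in summary.items():
    print(f"{rt}: IPC = {ipc_value:.2f}, branch misses = {miss_pct:.2f}%")
```

This reproduces the 0.94 vs 0.99 IPC and the 13.05% vs 11.77% branch-miss rates reported by perf.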

@AndyAyersMS
Member

Similar data for BinaryTrees_2. Here we see 5.0 has slightly better IPC but quite a few more instructions to execute:

andy@andy-ubuntu:~/bugs/r40810/b$ perf stat -d dotnet exec /home/andy/bugs/r40810/b/bin/Release/net5.0/b.dll

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/b/bin/Release/net5.0/b.dll':

            499.71 msec task-clock                #    1.010 CPUs utilized          
                51      context-switches          #    0.102 K/sec                  
                 5      cpu-migrations            #    0.010 K/sec                  
             5,637      page-faults               #    0.011 M/sec                  
       938,804,171      cycles                    #    1.879 GHz                      (32.00%)
       334,250,542      stalled-cycles-frontend   #   35.60% frontend cycles idle     (43.19%)
       140,647,936      stalled-cycles-backend    #   14.98% backend cycles idle      (44.77%)
     1,576,795,897      instructions              #    1.68  insn per cycle         
                                                  #    0.21  stalled cycles per insn  (56.48%)
       346,894,682      branches                  #  694.191 M/sec                    (57.98%)
         1,670,050      branch-misses             #    0.48% of all branches          (57.25%)
       523,532,754      L1-dcache-loads           # 1047.672 M/sec                    (50.91%)
        10,172,775      L1-dcache-load-misses     #    1.94% of all L1-dcache hits    (20.91%)
         1,872,450      LLC-loads                 #    3.747 M/sec                    (20.66%)
   <not supported>      LLC-load-misses                                             

       0.494821787 seconds time elapsed

       0.452010000 seconds user
       0.052001000 seconds sys


andy@andy-ubuntu:~/bugs/r40810/b$ perf stat -d dotnet exec /home/andy/bugs/r40810/b/bin/Release/netcoreapp3.1/b.dll

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/b/bin/Release/netcoreapp3.1/b.dll':

            466.87 msec task-clock                #    1.007 CPUs utilized          
                58      context-switches          #    0.124 K/sec                  
                 1      cpu-migrations            #    0.002 K/sec                  
             4,798      page-faults               #    0.010 M/sec                  
       885,595,535      cycles                    #    1.897 GHz                      (35.04%)
       345,941,828      stalled-cycles-frontend   #   39.06% frontend cycles idle     (47.13%)
       149,589,073      stalled-cycles-backend    #   16.89% backend cycles idle      (47.06%)
     1,464,498,744      instructions              #    1.65  insn per cycle         
                                                  #    0.24  stalled cycles per insn  (57.32%)
       328,724,263      branches                  #  704.102 M/sec                    (56.46%)
         1,478,562      branch-misses             #    0.45% of all branches          (53.65%)
       473,982,827      L1-dcache-loads           # 1015.235 M/sec                    (49.88%)
         8,932,883      L1-dcache-load-misses     #    1.88% of all L1-dcache hits    (22.16%)
         1,522,914      LLC-loads                 #    3.262 M/sec                    (22.02%)
   <not supported>      LLC-load-misses                                             

       0.463624216 seconds time elapsed

       0.416904000 seconds user
       0.055062000 seconds sys
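
To put a number on "quite a few more instructions", the sketch below computes the relative increase from 3.1 to 5.0 for the BinaryTrees_2 counters above. The `pct_increase` helper is made up for illustration, and the counters are multiplexed estimates from a single run each, so the percentages are only indicative.

```python
def pct_increase(new: int, old: int) -> float:
    """Relative increase of new over old, in percent, rounded to one place."""
    return round(100.0 * (new / old - 1.0), 1)

# Counters copied from the BinaryTrees_2 perf stat output in this thread.
instructions = {"5.0": 1_576_795_897, "3.1": 1_464_498_744}
cycles = {"5.0": 938_804_171, "3.1": 885_595_535}

more_instructions = pct_increase(instructions["5.0"], instructions["3.1"])
more_cycles = pct_increase(cycles["5.0"], cycles["3.1"])
print(f"instructions: +{more_instructions}%")
print(f"cycles:       +{more_cycles}%")
```

In this run 5.0 executes roughly 7.7% more instructions and takes about 6.0% more cycles, the same order as the 7% binary-trees slowdown that opened this issue.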

Looking at profiles, we spend quite a bit more time than I realized in TreeNode:itemCheck, and looking at the code there, we no longer do the recursive inlines we used to do in 3.1.

So likely the main source of the perf difference in BinaryTrees_2 is from #35020.

It seems like we might want to reconsider the policy of not doing any recursive inlines, though I suspect the main beneficiaries of such inlines are microbenchmarks. Opened #41542.

@AndyAyersMS
Member

I think this is understood and there's nothing we can or need to address in 5.0.

So unless there are any objections I propose we close this.

@AndyAyersMS
Member

Some more data -- here's the perf history of BinaryTrees_2

[Image: perf history chart for BinaryTrees_2]

Timing of the regression correlates with #35020 which merged on April 15.

Similar look for BinaryTrees_5, though somehow we recovered the perf later on, so perhaps we should look into that as well.

[Image: perf history chart for BinaryTrees_5]

FannkuchRedux_9 is new, so no history to view.

@AndyAyersMS
Member

(recovery of perf in BinaryTrees_5 may be from #38586 -- need to verify)

@AndyAyersMS
Member

@danmosemsft per the above I am planning on closing this -- let me know if you agree.

@danmoseley
Member Author

Sounds good to me. Thanks for investigating.

@danmoseley
Member Author

5.0 will still be a clear improvement on Benchmarks game in general.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 7, 2020