Investigate regressions of binary-trees and fannkuch on benchmarksgame site #40810

danmoseley · 2020-08-14T00:35:11Z

Benchmarks game site has a comparison up of 3.1 vs. 5.0 preview 7: https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharppreview.html

There are 4 regressions; 2 may be noise level. The other two are binary-trees (7%) and fannkuch-redux (4%).

I'll check those two using the source of the latest submissions.

danmoseley · 2020-08-14T00:48:23Z

Here are the results I get on Ubuntu using the latest sources for these two games (BinaryTrees_6 and FannkuchRedux_9 from dotnet/performance#1453, but with the official input parameters).

I included the various other flavors we already have, which are older.

There is no evidence of a regression here for the "official" sources, so it's unclear why they're seeing one unless it's from slightly different hardware or builds. I guess we'll see whether it goes away in final official results.
cc @AndyAyersMS

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-HFLFME : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-XLBCVM : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-HEKPDG : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT
  Job-YYWTGM : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-DISMCE : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-HDJRLZ : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MinIterationCount=15  WarmupCount=1

Type	Method	Job	Runtime	Toolchain	MaxIterationCount	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
BinaryTrees_2	BinaryTrees_2	Job-HFLFME	.NET Core 2.1	netcoreapp2.1	20	123.4 ms	1.74 ms	1.71 ms	122.9 ms	120.9 ms	127.8 ms	1.00	0.00	53000.0000	1500.0000	500.0000	232789.44 KB
BinaryTrees_2	BinaryTrees_2	Job-XLBCVM	.NET Core 3.1	netcoreapp3.1	20	150.3 ms	2.72 ms	2.54 ms	150.6 ms	145.7 ms	154.1 ms	1.22	0.03	55000.0000	1000.0000	-	232789.44 KB
BinaryTrees_2	BinaryTrees_2	Job-HEKPDG	.NET Core 5.0	netcoreapp5.0	20	177.2 ms	6.09 ms	7.01 ms	177.7 ms	162.5 ms	189.3 ms	1.43	0.06	50000.0000	1000.0000	-	232789.72 KB

BinaryTrees_5	BinaryTrees_5	Job-YYWTGM	.NET Core 2.1	netcoreapp2.1	40	140.0 ms	2.77 ms	3.88 ms	140.2 ms	130.7 ms	145.6 ms	1.00	0.00	39000.0000	12500.0000	2500.0000	1.7 KB
BinaryTrees_5	BinaryTrees_5	Job-DISMCE	.NET Core 3.1	netcoreapp3.1	40	145.3 ms	2.89 ms	4.41 ms	145.7 ms	133.5 ms	152.2 ms	1.04	0.04	38000.0000	11000.0000	4000.0000	232799.7 KB
BinaryTrees_5	BinaryTrees_5	Job-HDJRLZ	.NET Core 5.0	netcoreapp5.0	40	141.7 ms	2.82 ms	5.01 ms	141.6 ms	132.0 ms	154.0 ms	1.01	0.04	38000.0000	9000.0000	3000.0000	232793.72 KB

BinaryTrees_6	BinaryTrees_6	Job-YYWTGM	.NET Core 2.1	netcoreapp2.1	40	8,559.7 ms	164.30 ms	153.69 ms	8,558.8 ms	8,295.0 ms	8,832.5 ms	1.00	0.00	1942000.0000	359000.0000	9000.0000	65551.55 KB
BinaryTrees_6	BinaryTrees_6	Job-DISMCE	.NET Core 3.1	netcoreapp3.1	40	8,494.8 ms	154.52 ms	158.68 ms	8,498.8 ms	8,143.3 ms	8,747.1 ms	0.99	0.02	1940000.0000	370000.0000	9000.0000	9546425.66 KB
BinaryTrees_6	BinaryTrees_6	Job-HDJRLZ	.NET Core 5.0	netcoreapp5.0	40	7,929.7 ms	158.14 ms	147.92 ms	7,900.6 ms	7,603.8 ms	8,176.0 ms	0.93	0.03	1947000.0000	367000.0000	9000.0000	9546422.11 KB

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
  [Host]     : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-YVYDTA : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
  Job-DOYOOY : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
  Job-ZKDSPK : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Type	Method	Job	Runtime	Toolchain	n	expectedSum	Mean	Error	StdDev	Median	Min	Max	Ratio	RatioSD	Gen 0	Gen 1	Gen 2	Allocated
FannkuchRedux_5	FannkuchRedux_5	Job-YVYDTA	.NET Core 2.1	netcoreapp2.1	10	38	37.02 ms	1.385 ms	1.595 ms	36.31 ms	35.61 ms	40.08 ms	1.00	0.00	-	-	-	3120 B
FannkuchRedux_5	FannkuchRedux_5	Job-DOYOOY	.NET Core 3.1	netcoreapp3.1	10	38	37.87 ms	0.632 ms	0.592 ms	37.70 ms	36.97 ms	39.07 ms	1.03	0.04	-	-	-	4632 B
FannkuchRedux_5	FannkuchRedux_5	Job-ZKDSPK	.NET Core 5.0	netcoreapp5.0	10	38	36.98 ms	0.698 ms	0.619 ms	36.92 ms	36.30 ms	38.26 ms	1.01	0.04	-	-	-	4673 B

FannkuchRedux_2	FannkuchRedux_2	Job-YVYDTA	.NET Core 2.1	netcoreapp2.1	10	73196	150.94 ms	1.181 ms	0.986 ms	150.94 ms	149.68 ms	152.25 ms	1.00	0.00	-	-	-	224 B
FannkuchRedux_2	FannkuchRedux_2	Job-DOYOOY	.NET Core 3.1	netcoreapp3.1	10	73196	143.72 ms	1.420 ms	1.108 ms	143.52 ms	142.36 ms	146.02 ms	0.95	0.01	-	-	-	224 B
FannkuchRedux_2	FannkuchRedux_2	Job-ZKDSPK	.NET Core 5.0	netcoreapp5.0	10	73196	153.09 ms	1.336 ms	1.184 ms	152.98 ms	151.65 ms	155.17 ms	1.01	0.01	-	-	-	1168 B

FannkuchRedux_9	FannkuchRedux_5	Job-YVYDTA	.NET Core 2.1	netcoreapp2.1	12	3968050	4,430.65 ms	7.464 ms	6.233 ms	4,431.80 ms	4,421.10 ms	4,439.71 ms	1.00	0.00	-	-	-	2368 B
FannkuchRedux_9	FannkuchRedux_5	Job-DOYOOY	.NET Core 3.1	netcoreapp3.1	12	3968050	4,702.96 ms	62.185 ms	55.126 ms	4,679.59 ms	4,655.59 ms	4,827.31 ms	1.06	0.01	-	-	-	2536 B
FannkuchRedux_9	FannkuchRedux_5	Job-ZKDSPK	.NET Core 5.0	netcoreapp5.0	12	3968050	4,411.65 ms	48.697 ms	45.551 ms	4,389.69 ms	4,373.25 ms	4,502.68 ms	1.00	0.01	-	-	-	2824 B

danmoseley · 2020-08-14T00:52:29Z

cc @benaadams

billwert · 2020-08-14T01:16:37Z

If this is to track a potential regression in the product should we move it to dotnet/runtime @danmosemsft?

adamsitnik · 2020-08-18T10:31:11Z

FWIW the regression is Windows-specific

BenchmarksGame.BinaryTrees_2.RunBench

Conclusion	Base	Diff	Base/Diff	Operating System	Bit	Processor Name	Base Runtime	Diff Runtime
Slower	104517050.00	115141650.00	0.91	Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	.NET Core 3.1.6	.NET Core 5.0.0
Same	121320538.00	126695979.00	0.96	ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	.NET Core 3.1.6	.NET Core 5.0.0
Same	189125801.00	185862914.00	1.02	macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	.NET Core 3.1.6	.NET Core 5.0.0

BenchmarksGame.FannkuchRedux_5.RunBench(n: 10, expectedSum: 38)

Conclusion	Base	Diff	Base/Diff	Modality	Operating System	Bit	Processor Name	Base Runtime	Diff Runtime
Slower	24379150.00	39701257.14	0.61	bimodal	Windows 10.0.18363.959	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	.NET Core 3.1.6	.NET Core 5.0.0
Same	27439371.19	27543953.25	1.00		ubuntu 18.04	X64	Intel Xeon CPU E5-1650 v4 3.60GHz	.NET Core 3.1.6	.NET Core 5.0.0
Same	82078762.25	87040901.88	0.94		macOS Mojave 10.14.5	X64	Intel Core i7-5557U CPU 3.10GHz (Broadwell)	.NET Core 3.1.6	.NET Core 5.0.0
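For anyone reading the tables above, the Base/Diff column appears to be simply the base time divided by the diff time, so values below 1.00 mean 5.0 took longer. A minimal Python sketch that recomputes the column from the reported numbers (the helper name `base_over_diff` is made up for illustration; how the comparison tool decides between "Slower" and "Same" is not shown in this thread and is not modeled here):

```python
# Recompute the Base/Diff column from the Base and Diff values posted above.
# base = .NET Core 3.1.6 time, diff = .NET Core 5.0.0 time (units as reported).

def base_over_diff(base: float, diff: float) -> float:
    """Ratio below 1.0 means the diff runtime took longer than the base."""
    return round(base / diff, 2)

ROWS = [
    ("BinaryTrees_2", "Windows", 104517050.00, 115141650.00),
    ("BinaryTrees_2", "ubuntu", 121320538.00, 126695979.00),
    ("BinaryTrees_2", "macOS", 189125801.00, 185862914.00),
    ("FannkuchRedux_5", "Windows", 24379150.00, 39701257.14),
    ("FannkuchRedux_5", "ubuntu", 27439371.19, 27543953.25),
    ("FannkuchRedux_5", "macOS", 82078762.25, 87040901.88),
]

if __name__ == "__main__":
    for bench, os_name, base, diff in ROWS:
        print(f"{bench:16} {os_name:8} {base_over_diff(base, diff):.2f}")
```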

danmoseley · 2020-08-18T14:41:40Z

They measure on Ubuntu, so if there's a Windows regression that's coincidental 🙂

AndyAyersMS · 2020-08-18T18:51:11Z

Benchmarks games now uses a slightly more up to date Ivy Bridge cpu.

Measured on a quad-core 3.0GHz Intel® i5-3330® with 15.8 GiB of RAM and 2TB SATA disk drive; using Ubuntu™ 20.04 x86_64 GNU/Linux 5.4.0-40-generic.

AndyAyersMS · 2020-08-21T23:35:00Z

I'll investigate but may not get around to it until next week sometime. So if anyone wants to drill in the meantime, please go ahead.

cc @dotnet/jit-contrib

AndyAyersMS · 2020-08-28T01:19:12Z

For BinaryTrees_2, there is a small codegen diff in bottomUpTree at Tier1. The lea does not get CSE'd in 5.0 and there's an extra mov in the prolog.

;; 3.1

G_M54868_IG01:
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32

G_M54868_IG02:
       85C9                 test     ecx, ecx
       7E4C                 jle      SHORT G_M54868_IG04
       8D71FF               lea      esi, [rcx-1]
       8BCE                 mov      ecx, esi
       E8E3F3FFFF           call     TreeNode:bottomUpTree(int):struct
       488BF8               mov      rdi, rax
       8BCE                 mov      ecx, esi
       E8D9F3FFFF           call     TreeNode:bottomUpTree(int):struct

;; 5.0

G_M46436_IG01:
       57                   push     rdi
       56                   push     rsi
       53                   push     rbx
       4883EC20             sub      rsp, 32
       8BF1                 mov      esi, ecx

G_M46436_IG02:
       85F6                 test     esi, esi
       7E4B                 jle      SHORT G_M46436_IG05

G_M46436_IG03:
       8D4EFF               lea      ecx, [rsi-1]
       E8EBF3FFFF           call     TreeNode:bottomUpTree(int):TreeNode
       488BF8               mov      rdi, rax
       8D4EFF               lea      ecx, [rsi-1]
       E8E0F3FFFF           call     TreeNode:bottomUpTree(int):TreeNode

I have not yet confirmed this is the root cause of the perf difference; I will try to do that soon.

Since this is a call-crossing lifetime I wonder if this is a consequence of #1463; the PR there shows a small perf score regression so presumably also a small code size regression.

Not sure if the new heuristic takes into account that CSEing something that uses a call-crossing live var might be cheaper than it seems if the CSEs are the only crossing appearances of the var; basically we'd be trading one call crossing lifetime for another.
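To make the two code shapes concrete, here is a rough source-level model in Python. The C# source of bottomUpTree is not quoted in this thread, so the function bodies below are an assumption inferred from the disassembly (two recursive calls that both take `depth - 1`). The first variant computes `depth - 1` once and keeps it live across the first call, like the 3.1 code; the second keeps `depth` live and recomputes `depth - 1` for each call, like the 5.0 code. Either way exactly one value has to survive the first call.

```python
# Hypothetical model of TreeNode:bottomUpTree; the real C# body is assumed,
# not quoted from the benchmark source.

class TreeNode:
    __slots__ = ("left", "right")

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right


def bottom_up_tree_cse(depth):
    # 3.1 shape: depth - 1 is computed once (the CSE) and is the value
    # that stays live across the first recursive call.
    if depth > 0:
        d = depth - 1
        left = bottom_up_tree_cse(d)
        right = bottom_up_tree_cse(d)
        return TreeNode(left, right)
    return TreeNode()


def bottom_up_tree_recompute(depth):
    # 5.0 shape: depth itself stays live across the first call and
    # depth - 1 is recomputed before each call.
    if depth > 0:
        left = bottom_up_tree_recompute(depth - 1)
        right = bottom_up_tree_recompute(depth - 1)
        return TreeNode(left, right)
    return TreeNode()


def count_nodes(node):
    if node.left is None:
        return 1
    return 1 + count_nodes(node.left) + count_nodes(node.right)


if __name__ == "__main__":
    print(count_nodes(bottom_up_tree_cse(4)), count_nodes(bottom_up_tree_recompute(4)))
```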

cc @briansull

AndyAyersMS · 2020-08-28T02:45:55Z

More or less the same diff in the linux version:

;; 3.1

G_M54868_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       50                   push     rax
       488D6C2420           lea      rbp, [rsp+20H]

G_M54868_IG02:
       85FF                 test     edi, edi
       7E4F                 jle      SHORT G_M54868_IG04
       8D5FFF               lea      ebx, [rdi-1]
       8BFB                 mov      edi, ebx
       E800000000           call     TreeNode:bottomUpTree(int):struct
       4C8BF0               mov      r14, rax
       8BFB                 mov      edi, ebx
       E800000000           call     TreeNode:bottomUpTree(int):struct

;; 5.0

G_M46436_IG01:
       55                   push     rbp
       4157                 push     r15
       4156                 push     r14
       53                   push     rbx
       50                   push     rax
       488D6C2420           lea      rbp, [rsp+20H]
       8BDF                 mov      ebx, edi

G_M46436_IG02:
       85DB                 test     ebx, ebx
       7E4E                 jle      SHORT G_M46436_IG05

G_M46436_IG03:
       8D7BFF               lea      edi, [rbx-1]
       E800000000           call     TreeNode:bottomUpTree(int):TreeNode
       4C8BF0               mov      r14, rax
       8D7BFF               lea      edi, [rbx-1]
       E800000000           call     TreeNode:bottomUpTree(int):TreeNode

briansull · 2020-08-28T17:22:51Z

@AndyAyersMS I will take a look

briansull · 2020-08-28T17:31:03Z

I wouldn't expect that
lea edi, [rbx-1]
would be slower than
mov edi, ebx

The extra instruction in the prolog
mov ebx, edi
should be offset by the extra instruction to setup the CSE
lea ebx, [rdi-1]

briansull · 2020-08-28T17:33:57Z

CSE doesn't see this as profitable.

AndyAyersMS · 2020-08-28T17:50:36Z

But in 3.1 it was considered profitable?

AndyAyersMS · 2020-08-28T18:30:50Z

Locally I see a potential regression in FannkuchRedux_9 on Windows x64, though the 5.0 run has high variance. The other two variants (FannkuchRedux_5 and FannkuchRedux_2) seem to be consistently faster in 5.0.

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.450 (2004/May2020Update/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.1.20407.13
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT
  Job-NMJCAW : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
  Job-JQYCRY : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT

Type	RT	n	sum	Mean	Error	StdDev	Median	Min	Max	Ratio
FannkuchRedux_5	3.1	10	38	45.39 ms	2.493 ms	2.771 ms	44.09 ms	42.60 ms	52.44 ms	1.00
FannkuchRedux_5	5.0	10	38	43.68 ms	1.006 ms	1.076 ms	43.57 ms	42.21 ms	45.91 ms	0.97

FannkuchRedux_2	3.1	10	73196	163.94 ms	10.318 ms	11.882 ms	164.25 ms	146.87 ms	191.22 ms	1.00
FannkuchRedux_2	5.0	10	73196	146.51 ms	4.870 ms	5.609 ms	146.05 ms	139.46 ms	160.05 ms	0.90

FannkuchRedux_9	3.1	11	556355	404.73 ms	8.088 ms	7.943 ms	404.47 ms	392.71 ms	421.88 ms	1.00
FannkuchRedux_9	5.0	11	556355	436.61 ms	19.457 ms	22.407 ms	438.80 ms	405.44 ms	469.88 ms	1.07

Will take a look on Ubuntu too.

AndyAyersMS · 2020-08-28T18:53:27Z

Ubuntu looks similar for me (note older HW and Preview 8 vs RC1, so not directly comparable to the above) -- FannkuchRedux_9 seems slower in 5.0.

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20417.9
  [Host]     : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-JHIWSX : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
  Job-CWFQLH : .NET Core 5.0.0 (CoreCLR 5.0.20.40711, CoreFX 5.0.20.40711), X64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Arguments=/p:DebugType=portable  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1

Type	RT	n	sum	Mean	Error	StdDev	Median	Min	Max	Ratio
FannkuchRedux_5	3.1	10	38	75.09 ms	0.745 ms	0.660 ms	74.92 ms	74.34 ms	76.77 ms	1.00
FannkuchRedux_5	5.0	10	38	76.80 ms	1.218 ms	1.139 ms	76.38 ms	75.27 ms	79.02 ms	1.02

FannkuchRedux_2	3.1	10	73196	317.39 ms	0.212 ms	0.188 ms	317.37 ms	317.07 ms	317.80 ms	1.00
FannkuchRedux_2	5.0	10	73196	319.04 ms	0.403 ms	0.337 ms	319.04 ms	318.08 ms	319.56 ms	1.01

FannkuchRedux_9	3.1	11	556355	676.94 ms	6.637 ms	6.208 ms	677.03 ms	665.76 ms	685.21 ms	1.00
FannkuchRedux_9	5.0	11	556355	710.44 ms	2.657 ms	2.218 ms	710.34 ms	705.68 ms	714.14 ms	1.05

AndyAyersMS · 2020-08-28T22:34:33Z

For FannkuchRedux_9, all the key perf is in Run.

For windows x64 codegen from 3.1 to 5.0 is very similar. 5.0 does a few more CSEs and avoids some spills. Most CQ looks locally better, eg here is the inner loop of the last for in FirstPermutation:

;;; 3.1 (x64 win)

G_M27408_IG15:
       428D043B             lea      eax, [rbx+r15]
       458D6601             lea      r12d, [r14+1]           // loop invariant
       99                   cdq      
       41F7FC               idiv     edx:eax, r12d
       4863C2               movsxd   rax, edx
       480FBF0447           movsx    rax, word  ptr [rdi+2*rax]
       4863D3               movsxd   rdx, ebx
       66890456             mov      word  ptr [rsi+2*rdx], ax
       FFC3                 inc      ebx
       413BDE               cmp      ebx, r14d
       7EDE                 jle      SHORT G_M27408_IG15

;;; 5.0 (loop-invariant lea hoisted out of the loop; one fewer instruction)

G_M7882_IG16:
       438D043C             lea      eax, [r12+r15]
       99                   cdq      
       41F7FD               idiv     edx:eax, r13d
       4863C2               movsxd   rax, edx
       480FBF0447           movsx    rax, word  ptr [rdi+2*rax]
       4963D4               movsxd   rdx, r12d
       66890456             mov      word  ptr [rsi+2*rdx], ax
       41FFC4               inc      r12d
       453BE6               cmp      r12d, r14d
       7EE1                 jle      SHORT G_M7882_IG16

Because of this, loops end up aligned differently; I suspect this might be the root cause of the perf issues.

Similar diffs in linux x64 codegen.
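As a quick check on the size difference, the byte counts of the two loop bodies can be read straight off the encodings in the listings above. The Python sketch below only adds up those bytes and decodes the 8-bit displacement of the closing `jle`; it says nothing about where the loops actually start in memory, which is what would determine alignment. The helper names are made up for illustration.

```python
# Instruction encodings copied from the two inner-loop listings above.
LOOP_3_1 = ["428D043B", "458D6601", "99", "41F7FC", "4863C2", "480FBF0447",
            "4863D3", "66890456", "FFC3", "413BDE", "7EDE"]
LOOP_5_0 = ["438D043C", "99", "41F7FD", "4863C2", "480FBF0447",
            "4963D4", "66890456", "41FFC4", "453BE6", "7EE1"]


def code_bytes(encodings):
    """Total size in bytes: two hex digits per byte."""
    return sum(len(e) // 2 for e in encodings)


def rel8(byte_hex):
    """Decode a signed 8-bit branch displacement (two's complement)."""
    value = int(byte_hex, 16)
    return value - 256 if value >= 128 else value


if __name__ == "__main__":
    for name, loop in (("3.1", LOOP_3_1), ("5.0", LOOP_5_0)):
        # The last two hex digits of the jle encoding are its displacement,
        # measured from the end of the jle back to the loop head.
        print(name, code_bytes(loop), "bytes, jle displacement", rel8(loop[-1][2:]))
```

The 3.1 loop is 34 bytes and the 5.0 loop is 31 bytes, and both match the backward `jle` displacements, so any code that follows these loops is shifted by at least that difference.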

AndyAyersMS · 2020-08-29T00:07:34Z

For FannkuchRedux_9, looking at linux perf, 5.0 executes slightly fewer instructions but takes more cycles (IPC is 0.99 in 3.1; 0.94 in 5.0). Branch misses are up (13.05% vs 11.77% of branches). So some kind of micro-architectural effect, quite possibly code alignment.

andy@andy-ubuntu:~/bugs/r40810$ perf stat -d dotnet exec /home/andy/bugs/r40810/f/bin/Release/net5.0/f.dll 
556355
Pfannkuchen(11) = 51

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/f/bin/Release/net5.0/f.dll':

          5,776.17 msec task-clock                #    6.407 CPUs utilized          
               219      context-switches          #    0.038 K/sec                  
                 5      cpu-migrations            #    0.001 K/sec                  
             2,187      page-faults               #    0.379 K/sec                  
    11,369,904,249      cycles                    #    1.968 GHz                      (32.64%)
     7,166,244,393      stalled-cycles-frontend   #   63.03% frontend cycles idle     (42.47%)
     4,246,791,011      stalled-cycles-backend    #   37.35% backend cycles idle      (43.00%)
    10,734,941,194      instructions              #    0.94  insn per cycle         
                                                  #    0.67  stalled cycles per insn  (53.86%)
     1,554,131,325      branches                  #  269.059 M/sec                    (51.40%)
       202,785,963      branch-misses             #   13.05% of all branches          (52.87%)
     1,923,302,116      L1-dcache-loads           #  332.972 M/sec                    (38.33%)
         2,188,800      L1-dcache-load-misses     #    0.11% of all L1-dcache hits    (31.07%)
           699,606      LLC-loads                 #    0.121 M/sec                    (22.26%)
   <not supported>      LLC-load-misses                                             

       0.901590482 seconds time elapsed

       5.738926000 seconds user
       0.043961000 seconds sys


andy@andy-ubuntu:~/bugs/r40810$ perf stat -d dotnet exec /home/andy/bugs/r40810/f/bin/Release/netcoreapp3.1/f.dll
556355
Pfannkuchen(11) = 51

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/f/bin/Release/netcoreapp3.1/f.dll':

          5,541.41 msec task-clock                #    6.571 CPUs utilized          
               200      context-switches          #    0.036 K/sec                  
                 8      cpu-migrations            #    0.001 K/sec                  
             1,996      page-faults               #    0.360 K/sec                  
    10,895,741,345      cycles                    #    1.966 GHz                      (33.64%)
     6,964,856,321      stalled-cycles-frontend   #   63.92% frontend cycles idle     (41.88%)
     3,964,367,855      stalled-cycles-backend    #   36.38% backend cycles idle      (43.15%)
    10,761,932,188      instructions              #    0.99  insn per cycle         
                                                  #    0.65  stalled cycles per insn  (54.04%)
     1,557,636,040      branches                  #  281.090 M/sec                    (54.03%)
       183,314,400      branch-misses             #   11.77% of all branches          (54.76%)
     1,934,439,969      L1-dcache-loads           #  349.088 M/sec                    (43.98%)
         3,206,904      L1-dcache-load-misses     #    0.17% of all L1-dcache hits    (33.38%)
           701,527      LLC-loads                 #    0.127 M/sec                    (22.60%)
   <not supported>      LLC-load-misses                                             

       0.843306468 seconds time elapsed

       5.519643000 seconds user
       0.027957000 seconds sys
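The IPC and branch-miss figures quoted for FannkuchRedux_9 can be re-derived from the raw counters in the two `perf stat` outputs above. A small Python sketch follows; the dictionary layout and function names are made up for illustration. Note that perf multiplexes these counters (the percentages in parentheses show how long each one was actually running), so the counts are scaled estimates and small differences should be read with care.

```python
# Raw counter values copied from the two perf stat runs above.
COUNTERS = {
    "5.0": {"cycles": 11_369_904_249, "instructions": 10_734_941_194,
            "branches": 1_554_131_325, "branch_misses": 202_785_963},
    "3.1": {"cycles": 10_895_741_345, "instructions": 10_761_932_188,
            "branches": 1_557_636_040, "branch_misses": 183_314_400},
}


def ipc(c):
    """Instructions retired per cycle."""
    return c["instructions"] / c["cycles"]


def branch_miss_pct(c):
    """Mispredicted branches as a percentage of all branches."""
    return 100.0 * c["branch_misses"] / c["branches"]


if __name__ == "__main__":
    for runtime, c in COUNTERS.items():
        print(f"{runtime}: IPC {ipc(c):.2f}, branch misses {branch_miss_pct(c):.2f}%")
```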

AndyAyersMS · 2020-08-29T00:26:34Z

Similar data for BinaryTrees_2. Here we see 5.0 has slightly better IPC (1.68 vs 1.65) but quite a few more instructions to execute (about 1.58 billion vs 1.46 billion).

andy@andy-ubuntu:~/bugs/r40810/b$ perf stat -d dotnet exec /home/andy/bugs/r40810/b/bin/Release/net5.0/b.dll

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/b/bin/Release/net5.0/b.dll':

            499.71 msec task-clock                #    1.010 CPUs utilized          
                51      context-switches          #    0.102 K/sec                  
                 5      cpu-migrations            #    0.010 K/sec                  
             5,637      page-faults               #    0.011 M/sec                  
       938,804,171      cycles                    #    1.879 GHz                      (32.00%)
       334,250,542      stalled-cycles-frontend   #   35.60% frontend cycles idle     (43.19%)
       140,647,936      stalled-cycles-backend    #   14.98% backend cycles idle      (44.77%)
     1,576,795,897      instructions              #    1.68  insn per cycle         
                                                  #    0.21  stalled cycles per insn  (56.48%)
       346,894,682      branches                  #  694.191 M/sec                    (57.98%)
         1,670,050      branch-misses             #    0.48% of all branches          (57.25%)
       523,532,754      L1-dcache-loads           # 1047.672 M/sec                    (50.91%)
        10,172,775      L1-dcache-load-misses     #    1.94% of all L1-dcache hits    (20.91%)
         1,872,450      LLC-loads                 #    3.747 M/sec                    (20.66%)
   <not supported>      LLC-load-misses                                             

       0.494821787 seconds time elapsed

       0.452010000 seconds user
       0.052001000 seconds sys


andy@andy-ubuntu:~/bugs/r40810/b$ perf stat -d dotnet exec /home/andy/bugs/r40810/b/bin/Release/netcoreapp3.1/b.dll

 Performance counter stats for 'dotnet exec /home/andy/bugs/r40810/b/bin/Release/netcoreapp3.1/b.dll':

            466.87 msec task-clock                #    1.007 CPUs utilized          
                58      context-switches          #    0.124 K/sec                  
                 1      cpu-migrations            #    0.002 K/sec                  
             4,798      page-faults               #    0.010 M/sec                  
       885,595,535      cycles                    #    1.897 GHz                      (35.04%)
       345,941,828      stalled-cycles-frontend   #   39.06% frontend cycles idle     (47.13%)
       149,589,073      stalled-cycles-backend    #   16.89% backend cycles idle      (47.06%)
     1,464,498,744      instructions              #    1.65  insn per cycle         
                                                  #    0.24  stalled cycles per insn  (57.32%)
       328,724,263      branches                  #  704.102 M/sec                    (56.46%)
         1,478,562      branch-misses             #    0.45% of all branches          (53.65%)
       473,982,827      L1-dcache-loads           # 1015.235 M/sec                    (49.88%)
         8,932,883      L1-dcache-load-misses     #    1.88% of all L1-dcache hits    (22.16%)
         1,522,914      LLC-loads                 #    3.262 M/sec                    (22.02%)
   <not supported>      LLC-load-misses                                             

       0.463624216 seconds time elapsed

       0.416904000 seconds user
       0.055062000 seconds sys
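To put the BinaryTrees_2 counters side by side, the relative change from 3.1 to 5.0 can be computed directly from the two runs above. This is a minimal Python sketch over single runs, so it gives a rough indication rather than a measurement with error bars; the names are made up for illustration.

```python
# Values copied from the two BinaryTrees_2 perf stat runs above.
RUN_3_1 = {"task_clock_ms": 466.87, "cycles": 885_595_535,
           "instructions": 1_464_498_744}
RUN_5_0 = {"task_clock_ms": 499.71, "cycles": 938_804_171,
           "instructions": 1_576_795_897}


def pct_change(old, new):
    """Percentage increase (positive) or decrease (negative) from old to new."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for key in RUN_3_1:
        print(f"{key:14} {pct_change(RUN_3_1[key], RUN_5_0[key]):+.1f}%")
```

On these numbers 5.0 executes roughly 7.7% more instructions and uses about 6% more cycles, which fits the picture of extra work rather than a stall-driven slowdown.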

Looking at profiles we spend quite a bit more time than I realized in TreeNode:itemCheck and looking at the code there, we no longer do the recursive inlines we used to do in 3.1.

So likely the main source of the perf difference in BinaryTrees_2 is from #35020.

Seems like we might want to reconsider the policy of not doing any recursive inlines, though suspect for the most part the beneficiary of such inlines is in microbenchmarks. Opened #41542.
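For readers unfamiliar with recursive inlining, the effect can be imitated by hand. The Python sketch below is a hypothetical model: the name itemCheck comes from the profile, but its body is assumed here to be a simple recursive node count and is not quoted from the benchmark source. The second function has one level of the recursion expanded into the caller, which is roughly what a recursive inline buys; the `calls` list is only there to count how many real calls happen.

```python
# Hypothetical model of TreeNode:itemCheck with and without one level of
# recursive inlining. Function bodies are assumptions for illustration.

class TreeNode:
    __slots__ = ("left", "right")

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right


def build(depth):
    """Perfect binary tree with levels 0..depth."""
    if depth > 0:
        return TreeNode(build(depth - 1), build(depth - 1))
    return TreeNode()


def item_check(node, calls):
    calls[0] += 1
    if node.left is None:
        return 1
    return 1 + item_check(node.left, calls) + item_check(node.right, calls)


def item_check_inlined_once(node, calls):
    calls[0] += 1
    if node.left is None:
        return 1
    total = 1
    for child in (node.left, node.right):
        # Body of item_check expanded in place for each child: only the
        # grandchildren are reached through a real call.
        if child.left is None:
            total += 1
        else:
            total += (1 + item_check_inlined_once(child.left, calls)
                      + item_check_inlined_once(child.right, calls))
    return total


if __name__ == "__main__":
    tree = build(4)
    plain, inlined = [0], [0]
    print(item_check(tree, plain), plain[0])
    print(item_check_inlined_once(tree, inlined), inlined[0])
```

Both versions return the same count, but on a depth-4 tree the expanded version makes 21 calls instead of 31, at the cost of a larger function body.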

AndyAyersMS · 2020-08-29T00:28:34Z

I think this is understood and there's nothing we can or need to address in 5.0.

So unless there are any objections I propose we close this.

AndyAyersMS · 2020-08-29T01:26:53Z

Some more data -- here's the perf history of BinaryTrees_2

Timing of the regression correlates with #35020 which merged on April 15.

Similar look for BinaryTrees_5, though somehow we recovered the perf later on, so perhaps we should look into that as well.

FannkuchRedux_9 is new, so no history to view.

AndyAyersMS · 2020-08-29T01:31:21Z

(recovery of perf in BinaryTrees_5 may be from #38586 -- need to verify)

AndyAyersMS · 2020-09-01T01:50:49Z

@danmosemsft per the above I am planning on closing this -- let me know if you agree.

danmoseley · 2020-09-01T18:59:44Z

Sounds good to me. Thanks for investigating.

danmoseley · 2020-09-01T19:10:26Z

5.0 will still be a clear improvement on Benchmarks game in general.

danmoseley transferred this issue from dotnet/performance Aug 14, 2020

Dotnet-GitSync-Bot added area-Meta untriaged New issue has not been triaged by the area owner labels Aug 14, 2020

danmoseley added tenet-performance Performance related issue and removed untriaged New issue has not been triaged by the area owner labels Aug 14, 2020

adamsitnik added the os-windows label Aug 18, 2020

adamsitnik added this to the 5.0.0 milestone Aug 18, 2020

ericstj added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed area-Meta labels Aug 26, 2020

AndyAyersMS self-assigned this Aug 28, 2020

AndyAyersMS mentioned this issue Aug 29, 2020

Jit: reconsider policy for recursive inlines #41542

Closed

danmoseley closed this as completed Sep 1, 2020

ghost locked as resolved and limited conversation to collaborators Dec 7, 2020
