-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Investigate regressions of binary-trees and fannkuch on benchmarksgame site #40810
Comments
Here's the results i get on Ubuntu using the latest sources for these two games (BinaryTrees_6 and FannkuchRedux_9 from dotnet/performance#1453 but with the official input parameters) I included the various other flavors we already have, which are older. There is no evidence of a regression here for the "official" sources, so it's unclear why they're seeing one unless its from slightly different hardware or builds. I guess we'll see whether it goes away in final official results. BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-HFLFME : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
Job-XLBCVM : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-HEKPDG : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT
Job-YYWTGM : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
Job-DISMCE : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-HDJRLZ : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable IterationTime=250.0000 ms
MinIterationCount=15 WarmupCount=1
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-7700 CPU 3.60GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.7.20366.6
[Host] : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-YVYDTA : .NET Core 2.1.21 (CoreCLR 4.6.29130.01, CoreFX 4.6.29130.02), X64 RyuJIT
Job-DOYOOY : .NET Core 3.1.1 (CoreCLR 4.700.19.60701, CoreFX 4.700.19.60801), X64 RyuJIT
Job-ZKDSPK : .NET Core 5.0.0 (CoreCLR 5.0.20.36411, CoreFX 5.0.20.36411), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
cc @benaadams |
If this is to track a potential regression in the product should we move it to dotnet/runtime @danmosemsft? |
FWIW the regression is Windows-specific BenchmarksGame.BinaryTrees_2.RunBench
BenchmarksGame.FannkuchRedux_5.RunBench(n: 10, expectedSum: 38)
|
They measure on Ubuntu so if there's a windows regression that's coincidental 🙂 |
Benchmarks games now uses a slightly more up to date Ivy Bridge cpu.
|
I'll investigate but may not get around to it until next week sometime. So if anyone wants to drill in the meantime, please go ahead. cc @dotnet/jit-contrib |
For BinaryTrees_2, there is a small codegen diff in ;; 3.1
G_M54868_IG01:
57 push rdi
56 push rsi
53 push rbx
4883EC20 sub rsp, 32
G_M54868_IG02:
85C9 test ecx, ecx
7E4C jle SHORT G_M54868_IG04
8D71FF lea esi, [rcx-1]
8BCE mov ecx, esi
E8E3F3FFFF call TreeNode:bottomUpTree(int):struct
488BF8 mov rdi, rax
8BCE mov ecx, esi
E8D9F3FFFF call TreeNode:bottomUpTree(int):struct
;; 5.0
G_M46436_IG01:
57 push rdi
56 push rsi
53 push rbx
4883EC20 sub rsp, 32
8BF1 mov esi, ecx
G_M46436_IG02:
85F6 test esi, esi
7E4B jle SHORT G_M46436_IG05
G_M46436_IG03:
8D4EFF lea ecx, [rsi-1]
E8EBF3FFFF call TreeNode:bottomUpTree(int):TreeNode
488BF8 mov rdi, rax
8D4EFF lea ecx, [rsi-1]
E8E0F3FFFF call TreeNode:bottomUpTree(int):TreeNode I have not yet confirmed this is the root cause of the perf difference, will try to do that soon. Since this is a call-crossing lifetime I wonder if this is a consequence of #1463; the PR there shows a small perf score regression so presumably also a small code size regression. Not sure if the new heuristic takes into account that CSEing something that uses a call-crossing live var might be cheaper than it seems if the CSEs are the only crossing appearances of the var; basically we'd be trading one call crossing lifetime for another. cc @briansull |
More or less the same diff in the linux version: ;; 3.1
G_M54868_IG01:
55 push rbp
4157 push r15
4156 push r14
53 push rbx
50 push rax
488D6C2420 lea rbp, [rsp+20H]
G_M54868_IG02:
85FF test edi, edi
7E4F jle SHORT G_M54868_IG04
8D5FFF lea ebx, [rdi-1]
8BFB mov edi, ebx
E800000000 call TreeNode:bottomUpTree(int):struct
4C8BF0 mov r14, rax
8BFB mov edi, ebx
E800000000 call TreeNode:bottomUpTree(int):struct
;; 5.0
G_M46436_IG01:
55 push rbp
4157 push r15
4156 push r14
53 push rbx
50 push rax
488D6C2420 lea rbp, [rsp+20H]
8BDF mov ebx, edi
G_M46436_IG02:
85DB test ebx, ebx
7E4E jle SHORT G_M46436_IG05
G_M46436_IG03:
8D7BFF lea edi, [rbx-1]
E800000000 call TreeNode:bottomUpTree(int):TreeNode
4C8BF0 mov r14, rax
8D7BFF lea edi, [rbx-1]
E800000000 call TreeNode:bottomUpTree(int):TreeNode |
@AndyAyersMS I will take a look |
I wouldn't expect that The extra instruction in the prolog |
CSE doesn't see this as profitable. |
But in 3.0 it was considered profitable? |
Locally I see a potential regression in FannkuchRedux_9 on windows x64, though the 5.0 run has high variance. But the other two versions seem to be consistently faster. BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.450 (2004/May2020Update/20H1)
Intel Core i7-8650U CPU 1.90GHz (Kaby Lake R), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-rc.1.20407.13
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT
Job-NMJCAW : .NET Core 3.1.4 (CoreCLR 4.700.20.20201, CoreFX 4.700.20.21406), X64 RyuJIT
Job-JQYCRY : .NET Core 5.0.0 (CoreCLR 5.0.20.40416, CoreFX 5.0.20.40416), X64 RyuJIT
Will take a look on Ubuntu too. |
Ubuntu looks similar for me (note older HW and RC8 vs Preview1, so not directly comparable to the above) -- FannkuchRedux_9 seems slower in 5.0. BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
Intel Core i7-2720QM CPU 2.20GHz (Sandy Bridge), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.100-preview.8.20417.9
[Host] : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
Job-JHIWSX : .NET Core 3.1.7 (CoreCLR 4.700.20.36602, CoreFX 4.700.20.37001), X64 RyuJIT
Job-CWFQLH : .NET Core 5.0.0 (CoreCLR 5.0.20.40711, CoreFX 5.0.20.40711), X64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
|
For FannkuchRedux_9, all the key perf is in For windows x64 codegen from 3.1 to 5.0 is very similar. 5.0 does a few more CSEs and avoids some spills. Most CQ looks locally better, eg here is the inner loop of the last ;;; 3.1 (x64 win)
G_M27408_IG15:
428D043B lea eax, [rbx+r15]
458D6601 lea r12d, [r14+1] // loop invariant
99 cdq
41F7FC idiv edx:eax, r12d
4863C2 movsxd rax, edx
480FBF0447 movsx rax, word ptr [rdi+2*rax]
4863D3 movsxd rdx, ebx
66890456 mov word ptr [rsi+2*rdx], ax
FFC3 inc ebx
413BDE cmp ebx, r14d
7EDE jle SHORT G_M27408_IG15
;;; 5.0 (one fewer mov)
G_M7882_IG16:
438D043C lea eax, [r12+r15]
99 cdq
41F7FD idiv edx:eax, r13d
4863C2 movsxd rax, edx
480FBF0447 movsx rax, word ptr [rdi+2*rax]
4963D4 movsxd rdx, r12d
66890456 mov word ptr [rsi+2*rdx], ax
41FFC4 inc r12d
453BE6 cmp r12d, r14d
7EE1 jle SHORT G_M7882_IG16 Because of this loops end up aligned differently; suspect this might be the root cause for perf issues. Similar diffs in linux x64 codegen. |
For FannkuchRedux_9, Looking at linux perf, 5.0 has fewer instructions but higher clocks (IPC is 0.99 in 3.1; 0.94 in 5.0). Branch misses are up. So some kind of micro-architectural effect, quite possibly code alignment.
|
Similar data for BinaryTrees_2. Here we see 5.0 has slightly better IPC but quite a few more instructions to execute;
Looking at profiles we spend quite a bit more time than I realized in So likely the main source of the perf difference in BinaryTrees_2 is from #35020. Seems like we might want to reconsider the policy of not doing any recursive inlines, though suspect for the most part the beneficiary of such inlines is in microbenchmarks. Opened #41542. |
I think this is understood and there's nothing we can or need to address in 5.0. So unless there are any objections I propose we close this. |
Some more data -- here's the perf history of BinaryTrees_2 Timing of the regression correlates with #35020 which merged on April 15. Similar look for BinaryTrees_5, though somehow we recovered the perf later on, so perhaps we should look into that as well. FannkuchRedux_9 is new, so no history to view. |
(recovery of perf in BinaryTrees_5 may be from #38586 -- need to verify) |
@danmosemsft per the above I am planning on closing this -- let me know if you agree. |
Sounds good to me. Thanks for investigating. |
5.0 will still be a clear improvement on Benchmarks game in general. |
Benchmarks game site has a comparison up of 3.1 vs . 5.0 preview 7 : https://benchmarksgame-team.pages.debian.net/benchmarksgame/fastest/csharpcore-csharppreview.html
There are 4 regressions, 2 may be noise level. The other two are binary-trees (7%) and fannkuch-redux (4%)
I'll check those two using the source of the latest submissions.
The text was updated successfully, but these errors were encountered: