
[Perf] Regression in System.Collections.CtorFromCollection<String>.ConcurrentQueue #56017

Closed
DrewScoggins opened this issue Jul 20, 2021 · 15 comments
Assignees: eiriktsarpalis
Labels: arch-x64, area-CodeGen-coreclr, needs-further-triage, os-windows, tenet-performance, tenet-performance-benchmarks
Milestone: 7.0.0

Comments

@DrewScoggins
Member

Run Information

Architecture: x64
OS: Windows 10.0.18362
Baseline: 07336810acf3b4e7bdd0fb7da87b54920ea9c382
Compare: 405da674b633a99cc41be9e18c2e4d84b1b01639

Regressions in System.Collections.CtorFromCollection<String>

Benchmark: ConcurrentQueue - Duration of single invocation
Baseline: 7.75 μs
Test: 8.35 μs
Test/Base: 1.08
Test Quality: 0.00
Edge Detector: True

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.CtorFromCollection<String>*'

Payloads

Baseline
Compare

Histogram

System.Collections.CtorFromCollection<String>.LinkedList(Size: 512)


System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added the os-windows, tenet-performance, tenet-performance-benchmarks, arch-x64 labels on Jul 20, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-System.Collections, untriaged labels on Jul 20, 2021
@ghost

ghost commented Jul 20, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.


@eiriktsarpalis eiriktsarpalis added this to the 6.0.0 milestone Jul 21, 2021
@eiriktsarpalis eiriktsarpalis added the needs-further-triage label and removed the untriaged label on Jul 21, 2021
@danmoseley danmoseley changed the title from "[Perf] Regression in System.Collections.CtorFromCollection<String>" to "[Perf] Regression in System.Collections.CtorFromCollection<String>.ConcurrentQueue" on Jul 27, 2021
@danmoseley
Member

danmoseley commented Jul 27, 2021

There were no changes to S.C.C (System.Collections.Concurrent) in this interval, and nothing jumps out of the other libraries' history (I would assume this test doesn't use much else).

This is labeled x64, meaning it didn't regress on the other architectures. Falling back to codegen, I see these:

C:\git\runtime\src\coreclr>git lg ede3733b1cf5902899e56caf748492ed993c98c8...efd0bb58bdc2d7591d77e604b810c096a7a5d58e -- **
* f57b6e7308e - Add crash report to createdump for Linux Watson  (#55438) (2 weeks ago) <Mike McLaughlin>
* a4590197c10 - Use init rather than ms extension syntax (#55475) (2 weeks ago) <Adeel Mujahid>
* 6a47ecf4a8e - W^X support (#54954) (2 weeks ago) <Jan Vorlicek>
* fce412be777 - Support filtering ObjectAllocated callback for pinned object heap allocation only (#55448) (2 weeks ago) <Yauk>
* 02ccdacb069 - Improve loop cloning, with debugging improvements (#55299) (2 weeks ago) <Bruce Forstall>
* 65b472a0f31 - Handle a missing case in zero-extend peephole (#55129) (2 weeks ago) <Jakob Botsch Nielsen>
* 27b564ab965 - Spill single-def variable at definition to avoid further spilling (#54345) (2 weeks ago) <Kunal Pathak>
* 0f5a39df193 - [Mono] Add Mono Profiler events into EventPipe. (#55264) (2 weeks ago) <Johan Lorensson>
* 39d1161c889 - Fix helloworld on x86 Linux (#55095) (2 weeks ago) <t-mustafin>
* c8486b69ced - Remove some left-over debug code (#55408) (2 weeks ago) <Jakob Botsch Nielsen>
* 43cdfa10ac6 - Fix GCMemoryInfo.TotalAvailableMemoryBytes under Wow (#55387) (2 weeks ago) <Jan Kotas>

The test loops over 512 strings, adding each to a queue segment. There is no contention, so I'm not sure there is much nested looping, though.
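(For reference, the benchmark is roughly of the following shape — a simplified sketch, not the exact dotnet/performance source; the real harness generates its values differently.)

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class CtorFromCollectionSketch
{
    private ICollection<string> _collection;

    [Params(512)]
    public int Size;

    [GlobalSetup]
    public void Setup() =>
        // 512 unique strings; the real benchmark uses its own values generator.
        _collection = Enumerable.Range(0, Size).Select(i => i.ToString()).ToArray();

    [Benchmark]
    public ConcurrentQueue<string> ConcurrentQueue() =>
        // The ConcurrentQueue(IEnumerable<T>) constructor enqueues each item one by one,
        // growing internal segments as needed.
        new ConcurrentQueue<string>(_collection);
}
```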

@BruceForstall is it possible something like #55299 could have affected this?

@BruceForstall
Member

#55299 could be the cause if (1) there's a tight, commonly-executed loop, (2) the JIT's loop cloning kicked in (this happens for loops that look like "for (int i = 0; i < n; i++)") and there's an array access using the loop variable, e.g., "a[i]", where it knows it can get rid of the array bounds check, and (3) there's some other conditional branch in the loop. Then, it's possible that a change in code alignment causes the Intel jcc erratum to kick in.
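(To make the cloning conditions concrete, here is a hypothetical loop of the shape described above — illustrative only, not code taken from ConcurrentQueue.)

```csharp
static int Sum(int[] a, int n, bool flag)
{
    int sum = 0;
    // (1) a tight, commonly-executed counted loop
    for (int i = 0; i < n; i++)
    {
        // (3) an extra conditional branch inside the loop body
        if (flag)
            sum += a[i];   // (2) array access via the loop variable; loop cloning
        else               //     can emit a fast-path copy of the loop with the
            sum -= a[i];   //     bounds check on a[i] removed
    }
    return sum;
}
```

If the jcc instructions that end up in such a loop shift across a 32-byte boundary because of an alignment change elsewhere, the Intel jcc erratum mitigation can slow the loop down even though the generated instructions are otherwise unchanged.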

#54345 is another possibility. cc @kunalspathak

@kunalspathak
Member

I reverted each of the commits below and generated disassembly for the entire benchmark run, but nothing stood out:

* 02ccdacb069 - Improve loop cloning, with debugging improvements (#55299) (2 weeks ago) <Bruce Forstall>
* 65b472a0f31 - Handle a missing case in zero-extend peephole (#55129) (2 weeks ago) <Jakob Botsch Nielsen>
* 27b564ab965 - Spill single-def variable at definition to avoid further spilling (#54345) (2 weeks ago) <Kunal Pathak>

VTune points at the TryEnqueue method as the hottest.


Further digging shows that cmp is the hottest instruction.


I tried looking at the disassembly difference before and after the LSRA changes in #54345, and the only thing that stands out is that before the LSRA change we would spill the value into [rsp+2CH] a few instructions before using that value in cmp eax, [rsp+2CH]. After the LSRA change, the mov has moved to the top of the block, and I am guessing there might be some data caching involved that no longer applies after my change?


@eiriktsarpalis
Member

I'm not able to discern a regression running the particular benchmark on my machine. Here are the results of a few runs:

0733681 (baseline)

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 6.021 us 0.0779 us 0.0691 us 5.998 us 5.941 us 6.180 us 0.8383 0.0493 - 8 KB
ConcurrentQueue 512 6.527 us 0.4597 us 0.5294 us 6.416 us 6.040 us 7.538 us 0.8380 0.0270 - 8 KB
ConcurrentQueue 512 6.287 us 0.1688 us 0.1876 us 6.213 us 6.058 us 6.695 us 0.8272 0.0487 - 8 KB
ConcurrentQueue 512 6.004 us 0.0181 us 0.0160 us 5.997 us 5.984 us 6.036 us 0.8410 0.0481 - 8 KB
ConcurrentQueue 512 6.400 us 0.3274 us 0.3770 us 6.404 us 5.967 us 7.328 us 0.8333 0.0476 - 8 KB
ConcurrentQueue 512 5.990 us 0.0215 us 0.0168 us 5.984 us 5.972 us 6.032 us 0.8307 0.0489 - 8 KB
ConcurrentQueue 512 6.360 us 0.3821 us 0.4400 us 6.037 us 5.969 us 7.261 us 0.8340 0.0477 - 8 KB
ConcurrentQueue 512 5.967 us 0.0190 us 0.0158 us 5.959 us 5.945 us 5.995 us 0.8301 0.0268 - 8 KB
ConcurrentQueue 512 6.329 us 0.4166 us 0.4797 us 6.014 us 5.926 us 7.704 us 0.8323 0.0268 - 8 KB

405da67 (compare)

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 6.039 us 0.0432 us 0.0337 us 6.038 us 5.992 us 6.104 us 0.8443 0.0482 - 8 KB
ConcurrentQueue 512 5.986 us 0.0231 us 0.0205 us 5.987 us 5.950 us 6.024 us 0.8482 0.0485 - 8 KB
ConcurrentQueue 512 5.994 us 0.0361 us 0.0282 us 5.987 us 5.964 us 6.073 us 0.8426 0.0482 - 8 KB
ConcurrentQueue 512 6.066 us 0.0991 us 0.0927 us 6.031 us 5.984 us 6.311 us 0.8337 0.0253 - 8 KB
ConcurrentQueue 512 6.348 us 0.3685 us 0.4244 us 6.210 us 5.944 us 7.202 us 0.8352 0.0477 - 8 KB
ConcurrentQueue 512 6.325 us 0.3304 us 0.3805 us 6.252 us 5.913 us 6.843 us 0.8340 0.0477 - 8 KB
ConcurrentQueue 512 6.305 us 0.3562 us 0.3959 us 6.250 us 5.929 us 6.951 us 0.8346 0.0477 - 8 KB
ConcurrentQueue 512 5.951 us 0.0275 us 0.0230 us 5.951 us 5.922 us 5.990 us 0.8324 0.0476 - 8 KB
ConcurrentQueue 512 5.930 us 0.0225 us 0.0188 us 5.933 us 5.904 us 5.974 us 0.8311 0.0475 - 8 KB
ConcurrentQueue 512 5.973 us 0.0109 us 0.0091 us 5.970 us 5.961 us 5.990 us 0.8433 0.0482 - 8 KB
ConcurrentQueue 512 6.332 us 0.3394 us 0.3909 us 6.078 us 5.974 us 7.293 us 0.8410 0.0481 - 8 KB

Running statistical tests comparing the two commits with a 3% threshold also shows no perf regression.

@DrewScoggins
Member Author

I will try and repro this on the lab machines. Maybe this is an architectural thing that we only hit on older Intel hardware.

@eiriktsarpalis
Member

Should we close or move to 7.0.0?

@eiriktsarpalis eiriktsarpalis self-assigned this Aug 12, 2021
@eiriktsarpalis eiriktsarpalis added the area-CodeGen-coreclr label and removed the area-System.Collections label on Aug 14, 2021
@ghost

ghost commented Aug 14, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@JulieLeeMSFT
Member

@DrewScoggins have you reproduced this regression on newer hardware?

@DrewScoggins
Member Author

I will look at this right now on my local machine.

@DrewScoggins
Member Author

This was collected on my machine (Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores). As can be seen, we still seem to have a measurable regression. The command line I used for this run was:

py -3 .\scripts\benchmarks_ci.py -f net6.0 --filter System.Collections.CtorFromCollection<String>.ConcurrentQueue --corerun D:\git\runtime\artifacts\bin\testhost\net6.0-windows-Release-x64\shared\Microsoft.NETCore.App\6.0.0\corerun.exe --dotnet-versions 6.0.100-preview.3.21125.4

Baseline ede3733 Test Host

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 5.549 us 0.0543 us 0.0508 us 5.532 us 5.474 us 5.644 us 0.8073 0.0449 - 8 KB

Compare efd0bb5 Test Host

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 6.102 us 0.0370 us 0.0346 us 6.096 us 6.030 us 6.151 us 0.8022 0.0243 - 8 KB

@eiriktsarpalis
Member

Ok, it looks like the originally reported baseline commit was in fact after the regression was first recorded. This might explain why I wasn't seeing any regressions on my machine. I'd need to run a bisect on the correct commit range, but at this point I find it unlikely we'd be able to squeeze in a fix this late.

Looking at the older history for that particular benchmark, performance appears to be a net improvement compared to .NET 5, so I would suggest punting this to 7.0.0.

@eiriktsarpalis
Member

Using a wider commit range (890fde7...405da67) still does not show a regression on my machine. The benchmark results tend to fluctuate on the order of 500 ns across runs, and this is typically the only source of variation when comparing builds. More importantly, statistical testing with a 1% threshold does not indicate any performance regression either:

cd src/benchmarks/micro

dotnet run -c release -f net6.0 --corerun \
    <path to baseline commit build> \
    <path to compare commit build> \
    --filter 'System.Collections.CtorFromCollection<String>.ConcurrentQueue' \
    --statisticalTest 1%

Getting the following results:

BenchmarkDotNet=v0.13.1.1600-nightly, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=6.0.100-rc.1.21415.3
  [Host]     : .NET 6.0.0 (6.0.21.41404), X64 RyuJIT
  Job-WJLAGI : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DWJFTN : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

Sample results:

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio MannWhitney(1%) RatioSD Gen 0 Gen 1 Allocated
ConcurrentQueue Job-DWJFTN baseline 512 6.561 us 0.4747 us 0.5467 us 6.485 us 5.783 us 7.561 us 1.00 Base 0.00 0.8387 0.0271 8 KB
ConcurrentQueue Job-WJLAGI compare 512 6.662 us 0.4548 us 0.5238 us 6.641 us 5.763 us 7.390 us 1.02 Same 0.10 0.8356 0.0279 8 KB

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio MannWhitney(1%) RatioSD Gen 0 Gen 1 Allocated
ConcurrentQueue Job-UHAZOW baseline 512 6.605 us 0.4502 us 0.5185 us 6.542 us 5.833 us 7.537 us 1.00 Base 0.00 0.8415 0.0255 8 KB
ConcurrentQueue Job-KIMTAF compare 512 6.576 us 0.3786 us 0.4360 us 6.576 us 5.876 us 7.320 us 1.00 Same 0.09 0.8454 0.0282 8 KB

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio MannWhitney(1%) RatioSD Gen 0 Gen 1 Allocated
ConcurrentQueue Job-RWKAKP baseline 512 6.723 us 0.3103 us 0.3573 us 6.727 us 6.088 us 7.351 us 1.00 Base 0.00 0.8408 0.0467 8 KB
ConcurrentQueue Job-KQJEVY compare 512 6.327 us 0.2968 us 0.3418 us 6.388 us 5.818 us 7.171 us 0.94 Faster 0.07 0.8330 0.0490 8 KB

@DrewScoggins given that you're measuring a consistent regression on your hardware, would it be possible to run a git-bisect?

@JulieLeeMSFT
Member

Let's investigate the perf regression in .NET 7.

@JulieLeeMSFT JulieLeeMSFT modified the milestones: 6.0.0, 7.0.0 Aug 19, 2021
@jakobbotsch
Member

jakobbotsch commented Jul 21, 2022

> I tried looking at the disassembly difference before and after the LSRA changes in #54345, and the only thing that stands out is that before the LSRA change we would spill the value into [rsp+2CH] a few instructions before using that value in cmp eax, [rsp+2CH]. After the LSRA change, the mov has moved to the top of the block, and I am guessing there might be some data caching involved that no longer applies after my change?

Maybe moving this instruction broke store-to-load forwarding? It could also be a combination of that and the data dependency of the spill being closer now, so the CPU might stall on it. In any case, I think changing the spilling logic is just as likely to come with a bunch more regressions, and it's not something we should be doing at this point in the release cycle. Looking at the benchmark data, it also seems that performance has improved in .NET 7 compared to .NET 6, so I will just close this.

@jakobbotsch jakobbotsch closed this as not planned on Jul 21, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2022