
[Perf] Regression in System.Collections.CtorFromCollection<String>.ConcurrentQueue #56017

Closed
DrewScoggins opened this issue Jul 20, 2021 · 15 comments
Assignees: eiriktsarpalis
Labels: arch-x64, area-CodeGen-coreclr, needs-further-triage, os-windows, tenet-performance, tenet-performance-benchmarks
Milestone: 7.0.0

Comments

@DrewScoggins
Member

Run Information

Architecture: x64
OS: Windows 10.0.18362
Baseline: 07336810acf3b4e7bdd0fb7da87b54920ea9c382
Compare: 405da674b633a99cc41be9e18c2e4d84b1b01639

Regressions in System.Collections.CtorFromCollection<String>

Benchmark: ConcurrentQueue - Duration of single invocation
Baseline: 7.75 μs
Test: 8.35 μs
Test/Base: 1.08
Test Quality: 0.00
Edge Detector: True

Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.CtorFromCollection<String>*'

Payloads

Baseline
Compare

Histogram

System.Collections.CtorFromCollection<String>.LinkedList(Size: 512)


System.Collections.CtorFromCollection<String>.ConcurrentQueue(Size: 512)


Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

@DrewScoggins DrewScoggins added the os-windows, tenet-performance, tenet-performance-benchmarks, arch-x64 labels on Jul 20, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-System.Collections, untriaged labels on Jul 20, 2021
@ghost

ghost commented Jul 20, 2021

Tagging subscribers to this area: @eiriktsarpalis
See info in area-owners.md if you want to be subscribed.


@eiriktsarpalis eiriktsarpalis added this to the 6.0.0 milestone Jul 21, 2021
@eiriktsarpalis eiriktsarpalis added the needs-further-triage label and removed the untriaged label on Jul 21, 2021
@danmoseley danmoseley changed the title from "[Perf] Regression in System.Collections.CtorFromCollection<String>" to "[Perf] Regression in System.Collections.CtorFromCollection<String>.ConcurrentQueue" on Jul 27, 2021
@danmoseley
Member

danmoseley commented Jul 27, 2021

There were no changes to S.C.C (System.Collections.Concurrent) in this interval, and nothing jumps out of the other libraries' history (I would assume this test doesn't use much else).

This is labeled x64, meaning it didn't regress on the other architectures. Falling back to codegen, I see these:

C:\git\runtime\src\coreclr>git lg ede3733b1cf5902899e56caf748492ed993c98c8...efd0bb58bdc2d7591d77e604b810c096a7a5d58e -- **
* f57b6e7308e - Add crash report to createdump for Linux Watson  (#55438) (2 weeks ago) <Mike McLaughlin>
* a4590197c10 - Use init rather than ms extension syntax (#55475) (2 weeks ago) <Adeel Mujahid>
* 6a47ecf4a8e - W^X support (#54954) (2 weeks ago) <Jan Vorlicek>
* fce412be777 - Support filtering ObjectAllocated callback for pinned object heap allocation only (#55448) (2 weeks ago) <Yauk>
* 02ccdacb069 - Improve loop cloning, with debugging improvements (#55299) (2 weeks ago) <Bruce Forstall>
* 65b472a0f31 - Handle a missing case in zero-extend peephole (#55129) (2 weeks ago) <Jakob Botsch Nielsen>
* 27b564ab965 - Spill single-def variable at definition to avoid further spilling (#54345) (2 weeks ago) <Kunal Pathak>
* 0f5a39df193 - [Mono] Add Mono Profiler events into EventPipe. (#55264) (2 weeks ago) <Johan Lorensson>
* 39d1161c889 - Fix helloworld on x86 Linux (#55095) (2 weeks ago) <t-mustafin>
* c8486b69ced - Remove some left-over debug code (#55408) (2 weeks ago) <Jakob Botsch Nielsen>
* 43cdfa10ac6 - Fix GCMemoryInfo.TotalAvailableMemoryBytes under Wow (#55387) (2 weeks ago) <Jan Kotas>

The test loops over 512 strings, adding each to a queue segment. There is no contention, so I'm not sure there is much nested looping, though.
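(For reference, the benchmark is roughly of the following shape — a simplified sketch, not the exact dotnet/performance source; the real harness generates its values differently.)

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using BenchmarkDotNet.Attributes;

public class CtorFromCollectionSketch
{
    private ICollection<string> _collection;

    [Params(512)]
    public int Size;

    [GlobalSetup]
    public void Setup() =>
        // 512 unique strings; the real benchmark uses its own values generator.
        _collection = Enumerable.Range(0, Size).Select(i => i.ToString()).ToArray();

    [Benchmark]
    public ConcurrentQueue<string> ConcurrentQueue() =>
        // The ConcurrentQueue(IEnumerable<T>) constructor enqueues each item one by one,
        // growing internal segments as needed.
        new ConcurrentQueue<string>(_collection);
}
```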

@BruceForstall is it possible something like #55299 could have affected this?

@BruceForstall
Member

#55299 could be the cause if (1) there's a tight, commonly-executed loop, (2) the JIT's loop cloning kicked in (this happens for loops that look like "for (int i = 0; i < n; i++)") and there's an array access using the loop variable, e.g., "a[i]", where it knows it can get rid of the array bounds check, and (3) there's some other conditional branch in the loop. Then, it's possible that a change in code alignment causes the Intel jcc erratum to kick in.
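(To make the cloning conditions concrete, here is a hypothetical loop of the shape described above — illustrative only, not code taken from ConcurrentQueue.)

```csharp
static int Sum(int[] a, int n, bool flag)
{
    int sum = 0;
    // (1) a tight, commonly-executed counted loop
    for (int i = 0; i < n; i++)
    {
        // (3) an extra conditional branch inside the loop body
        if (flag)
            sum += a[i];   // (2) array access via the loop variable; loop cloning
        else               //     can emit a fast-path copy of the loop with the
            sum -= a[i];   //     bounds check on a[i] removed
    }
    return sum;
}
```

If the jcc instructions that end up in such a loop shift across a 32-byte boundary because of an alignment change elsewhere, the Intel jcc erratum mitigation can slow the loop down even though the generated instructions are otherwise unchanged.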

#54345 is another possibility. cc @kunalspathak

@kunalspathak
Member

I reverted each of the commits below and generated disassembly for the entire benchmark run, but nothing stood out:

* 02ccdacb069 - Improve loop cloning, with debugging improvements (#55299) (2 weeks ago) <Bruce Forstall>
* 65b472a0f31 - Handle a missing case in zero-extend peephole (#55129) (2 weeks ago) <Jakob Botsch Nielsen>
* 27b564ab965 - Spill single-def variable at definition to avoid further spilling (#54345) (2 weeks ago) <Kunal Pathak>

VTune points at the TryEnqueue method as the hottest.


Further digging shows that cmp is the hottest instruction.


I tried looking at the disassembly difference before and after the LSRA changes in #54345, and the only thing that stands out is that before the LSRA change we would spill the value into [rsp+2CH] a few instructions before using that value in cmp eax, [rsp+2CH]. After the LSRA change, the mov has moved to the top of the block, and I am guessing there might be some data caching involved that no longer applies after my change?


@eiriktsarpalis
Member

I'm not able to discern a regression running the particular benchmark on my machine. Here are the results of a few runs:

0733681 (baseline)

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 6.021 us 0.0779 us 0.0691 us 5.998 us 5.941 us 6.180 us 0.8383 0.0493 - 8 KB
ConcurrentQueue 512 6.527 us 0.4597 us 0.5294 us 6.416 us 6.040 us 7.538 us 0.8380 0.0270 - 8 KB
ConcurrentQueue 512 6.287 us 0.1688 us 0.1876 us 6.213 us 6.058 us 6.695 us 0.8272 0.0487 - 8 KB
ConcurrentQueue 512 6.004 us 0.0181 us 0.0160 us 5.997 us 5.984 us 6.036 us 0.8410 0.0481 - 8 KB
ConcurrentQueue 512 6.400 us 0.3274 us 0.3770 us 6.404 us 5.967 us 7.328 us 0.8333 0.0476 - 8 KB
ConcurrentQueue 512 5.990 us 0.0215 us 0.0168 us 5.984 us 5.972 us 6.032 us 0.8307 0.0489 - 8 KB
ConcurrentQueue 512 6.360 us 0.3821 us 0.4400 us 6.037 us 5.969 us 7.261 us 0.8340 0.0477 - 8 KB
ConcurrentQueue 512 5.967 us 0.0190 us 0.0158 us 5.959 us 5.945 us 5.995 us 0.8301 0.0268 - 8 KB
ConcurrentQueue 512 6.329 us 0.4166 us 0.4797 us 6.014 us 5.926 us 7.704 us 0.8323 0.0268 - 8 KB

405da67 (compare)

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 6.039 us 0.0432 us 0.0337 us 6.038 us 5.992 us 6.104 us 0.8443 0.0482 - 8 KB
ConcurrentQueue 512 5.986 us 0.0231 us 0.0205 us 5.987 us 5.950 us 6.024 us 0.8482 0.0485 - 8 KB
ConcurrentQueue 512 5.994 us 0.0361 us 0.0282 us 5.987 us 5.964 us 6.073 us 0.8426 0.0482 - 8 KB
ConcurrentQueue 512 6.066 us 0.0991 us 0.0927 us 6.031 us 5.984 us 6.311 us 0.8337 0.0253 - 8 KB
ConcurrentQueue 512 6.348 us 0.3685 us 0.4244 us 6.210 us 5.944 us 7.202 us 0.8352 0.0477 - 8 KB
ConcurrentQueue 512 6.325 us 0.3304 us 0.3805 us 6.252 us 5.913 us 6.843 us 0.8340 0.0477 - 8 KB
ConcurrentQueue 512 6.305 us 0.3562 us 0.3959 us 6.250 us 5.929 us 6.951 us 0.8346 0.0477 - 8 KB
ConcurrentQueue 512 5.951 us 0.0275 us 0.0230 us 5.951 us 5.922 us 5.990 us 0.8324 0.0476 - 8 KB
ConcurrentQueue 512 5.930 us 0.0225 us 0.0188 us 5.933 us 5.904 us 5.974 us 0.8311 0.0475 - 8 KB
ConcurrentQueue 512 5.973 us 0.0109 us 0.0091 us 5.970 us 5.961 us 5.990 us 0.8433 0.0482 - 8 KB
ConcurrentQueue 512 6.332 us 0.3394 us 0.3909 us 6.078 us 5.974 us 7.293 us 0.8410 0.0481 - 8 KB

Running statistical tests comparing the two commits with a 3% threshold also shows no perf regression.

@DrewScoggins
Member Author

I will try and repro this on the lab machines. Maybe this is an architectural thing that we only hit on older Intel hardware.

@eiriktsarpalis
Member

Should we close or move to 7.0.0?

@eiriktsarpalis eiriktsarpalis self-assigned this Aug 12, 2021
@eiriktsarpalis eiriktsarpalis added the area-CodeGen-coreclr label and removed the area-System.Collections label on Aug 14, 2021
@ghost

ghost commented Aug 14, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@JulieLeeMSFT
Member

@DrewScoggins have you reproduced this regression on newer hardware?

@DrewScoggins
Member Author

I will look at this right now on my local machine.

@DrewScoggins
Member Author

This was collected on my machine (Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores). As can be seen, we still seem to have a measurable regression. The command line I used for this run was:

py -3 .\scripts\benchmarks_ci.py -f net6.0 --filter System.Collections.CtorFromCollection<String>.ConcurrentQueue --corerun D:\git\runtime\artifacts\bin\testhost\net6.0-windows-Release-x64\shared\Microsoft.NETCore.App\6.0.0\corerun.exe --dotnet-versions 6.0.100-preview.3.21125.4

Baseline ede3733 Test Host

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 5.549 us 0.0543 us 0.0508 us 5.532 us 5.474 us 5.644 us 0.8073 0.0449 - 8 KB

Compare efd0bb5 Test Host

Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
ConcurrentQueue 512 6.102 us 0.0370 us 0.0346 us 6.096 us 6.030 us 6.151 us 0.8022 0.0243 - 8 KB

@eiriktsarpalis
Member

Ok, it looks like the originally reported baseline commit was in fact after the regression was first recorded. This might explain why I wasn't seeing any regressions on my machine. I'd need to run a bisect on the correct commit range, but at this point I find it unlikely we'd be able to squeeze in a fix this late.

Looking at the older history for that particular benchmark, performance appears to be a net improvement compared to .NET 5, so I would suggest punting this to 7.0.0.

@eiriktsarpalis
Member

Using a wider commit range (890fde7...405da67) still does not show a regression on my machine. The benchmark results tend to fluctuate on the order of 500 ns across runs, and this is typically the only source of variation when comparing builds. More importantly, statistical testing with a 1% threshold does not indicate any performance regression either:

cd src/benchmarks/micro

dotnet run -c release -f net6.0 --corerun \
    <path to baseline commit build> \
    <path to compare commit build> \
    --filter 'System.Collections.CtorFromCollection<String>.ConcurrentQueue' \
    --statisticalTest 1%

Getting the following results:

BenchmarkDotNet=v0.13.1.1600-nightly, OS=Windows 10.0.19043.1165 (21H1/May2021Update)
Intel Core i9-10900X CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK=6.0.100-rc.1.21415.3
  [Host]     : .NET 6.0.0 (6.0.21.41404), X64 RyuJIT
  Job-WJLAGI : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
  Job-DWJFTN : .NET 6.0.0 (42.42.42.42424), X64 RyuJIT

Sample results:

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio MannWhitney(1%) RatioSD Gen 0 Gen 1 Allocated
ConcurrentQueue Job-DWJFTN baseline 512 6.561 us 0.4747 us 0.5467 us 6.485 us 5.783 us 7.561 us 1.00 Base 0.00 0.8387 0.0271 8 KB
ConcurrentQueue Job-WJLAGI compare 512 6.662 us 0.4548 us 0.5238 us 6.641 us 5.763 us 7.390 us 1.02 Same 0.10 0.8356 0.0279 8 KB

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio MannWhitney(1%) RatioSD Gen 0 Gen 1 Allocated
ConcurrentQueue Job-UHAZOW baseline 512 6.605 us 0.4502 us 0.5185 us 6.542 us 5.833 us 7.537 us 1.00 Base 0.00 0.8415 0.0255 8 KB
ConcurrentQueue Job-KIMTAF compare 512 6.576 us 0.3786 us 0.4360 us 6.576 us 5.876 us 7.320 us 1.00 Same 0.09 0.8454 0.0282 8 KB

Method Job Toolchain Size Mean Error StdDev Median Min Max Ratio MannWhitney(1%) RatioSD Gen 0 Gen 1 Allocated
ConcurrentQueue Job-RWKAKP baseline 512 6.723 us 0.3103 us 0.3573 us 6.727 us 6.088 us 7.351 us 1.00 Base 0.00 0.8408 0.0467 8 KB
ConcurrentQueue Job-KQJEVY compare 512 6.327 us 0.2968 us 0.3418 us 6.388 us 5.818 us 7.171 us 0.94 Faster 0.07 0.8330 0.0490 8 KB

@DrewScoggins given that you're measuring a consistent regression on your hardware, would it be possible to run a git-bisect?

@JulieLeeMSFT
Member

Let's investigate the perf regression in .NET 7.

@JulieLeeMSFT JulieLeeMSFT modified the milestones: 6.0.0, 7.0.0 Aug 19, 2021
@jakobbotsch
Member

jakobbotsch commented Jul 21, 2022

> I tried looking at the disassembly difference before and after the LSRA changes in #54345, and the only thing that stands out is that before the LSRA change we would spill the value into [rsp+2CH] a few instructions before using that value in cmp eax, [rsp+2CH]. After the LSRA change, the mov has moved to the top of the block, and I am guessing there might be some data caching involved that no longer applies after my change?

Maybe moving this instruction broke store-to-load forwarding? It could also be a combination of that and the data dependency of the spill being closer now, so the CPU might stall on it. In any case, I think changing the spilling logic is just as likely to come with a bunch more regressions, and it's not something we should be doing at this point in the release cycle. Looking at the benchmark data, it also seems that performance has improved in .NET 7 compared to .NET 6, so I will just close this.

@jakobbotsch jakobbotsch closed this as not planned on Jul 21, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2022