Update where and when vzeroupper is emitted #98261

tannergooding · 2024-02-10T05:11:26Z

This resolves #82132 and resolves #11496 and resolves #96211 and resolves #95954

The transition diagrams are as seen below. The The Intel optimization manual guidance in 3.11.5.3 Fixing Instruction Slowdowns states:

Insert a VZEROUPPER to tell the hardware that the state of the higher registers is clean
between the VEX and the legacy SSE instructions. Often the best way to do this is to insert a
VZEROUPPER before returning from any function that uses VEX (that does not produce a VEX
register) and before any call to an unknown function.

Given the diagrams and this statement, we can come to two conclusions:

We were emitting vzeroupper in cases it wasn't needed, such as prologues of methods
We weren't emitting vzeroupper in cases it was needed, such as before p/invoke transitions

Essentially, for any method compiled by the JIT during the lifetime of the program, we know it is VEX-aware and thus regardless of the UpperState=Dirty or UpperState=Clean, managed to managed calls for such methods are safe and do not need vzeroupper and incur no transition penalty.

Likewise, if we are going from unmanaged to managed we are also safe because we are going from UpperState=Clean or UpperState=Dirty to UpperState=Dirty (assuming we aren't on a pre-Skylake microarchitecture where native itself placed us in UpperState=PreservedNonInit) and thus no transition penalty exists.

The only case we really care about is managed to unmanaged (such as for a P/Invoke), as for such a scenario we cannot assume to know whether or not the unmanaged code is VEX aware. Thus, we need to emit vzeroupper before such calls (as the optimization manual guidance states) to ensure we aren't executing legacy encoded instructions where UpperState=Dirty or UpperState=PreservedNonInit.

This consideration largely only applies to P/Invokes to user functions and does not apply to most JIT helpers. It additionally applies to calls from a managed method that was jitted during the execution of the program to a managed method that was compiled for R2R, which may target the legacy encoding.

Older micro-architectures:

Skylake and newer micro-architectures:

ghost · 2024-02-10T05:11:40Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This resolves #82132 and resolves #11496

The transition diagrams are as seen below. The The Intel optimization manual guidance in 3.11.5.3 Fixing Instruction Slowdowns states:

Insert a VZEROUPPER to tell the hardware that the state of the higher registers is clean
between the VEX and the legacy SSE instructions. Often the best way to do this is to insert a
VZEROUPPER before returning from any function that uses VEX (that does not produce a VEX
register) and before any call to an unknown function.

Given the diagrams and this statement, we can come to two conclusions:

We were emitting vzeroupper in cases it wasn't needed, such as prologues of methods
We weren't emitting vzeroupper in cases it was needed, such as before p/invoke transitions

Essentially, for any method compiled by the JIT during the lifetime of the program, we know it is VEX-aware and thus regardless of the UpperState=Dirty or UpperState=Clean, managed to managed calls for such methods are safe and do not need vzeroupper and incur no transition penalty.

Likewise, if we are going from unmanaged to managed we are also safe because we are going from UpperState=Clean or UpperState=Dirty to UpperState=Dirty (assuming we aren't on a pre-Skylake microarchitecture where native itself placed us in UpperState=PreservedNonInit) and thus no transition penalty exists.

The only case we really care about is managed to unmanaged (such as for a P/Invoke), as for such a scenario we cannot assume to know whether or not the unmanaged code is VEX aware. Thus, we need to emit vzeroupper before such calls (as the optimization manual guidance states) to ensure we aren't executing legacy encoded instructions where UpperState=Dirty or UpperState=PreservedNonInit.

This consideration largely only applies to P/Invokes to user functions and does not apply to most JIT helpers. It additionally applies to calls from a managed method that was jitted during the execution of the program to a managed method that was compiled for R2R, which may target the legacy encoding.

Older micro-architectures:

Skylake and newer micro-architectures:

Author:	tannergooding
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

src/coreclr/jit/codegenxarch.cpp

ryujit-bot · 2024-02-10T07:18:39Z

Diff results for #98261

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,620,764 contexts (360,162 MinOpts, 1,260,602 FullOpts).

MISSED contexts: 3,086 (0.19%)

Overall (-790,628 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.linux.x64.checked.mch	11,931,107	-19,269
benchmarks.run_pgo.linux.x64.checked.mch	57,210,208	-50,798
benchmarks.run_tiered.linux.x64.checked.mch	18,554,064	-38,326
coreclr_tests.run.linux.x64.checked.mch	247,128,973	-392,798
libraries.pmi.linux.x64.checked.mch	60,382,766	-116,383
libraries_tests.run.linux.x64.Release.mch	31,730,047	-30,736
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	130,006,281	-130,040
realworld.run.linux.x64.checked.mch	13,217,922	-11,925
smoke_tests.nativeaot.linux.x64.checked.mch	4,173,941	-353

MinOpts (-306,427 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.linux.x64.checked.mch	169,702	-975
benchmarks.run_pgo.linux.x64.checked.mch	17,746,512	-34,802
benchmarks.run_tiered.linux.x64.checked.mch	15,055,746	-34,283
coreclr_tests.run.linux.x64.checked.mch	140,366,881	-204,420
libraries.pmi.linux.x64.checked.mch	112,857	-42
libraries_tests.run.linux.x64.Release.mch	15,927,817	-20,049
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	10,583,855	-11,853
realworld.run.linux.x64.checked.mch	388,536	-3

FullOpts (-484,201 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.linux.x64.checked.mch	11,761,405	-18,294
benchmarks.run_pgo.linux.x64.checked.mch	39,463,696	-15,996
benchmarks.run_tiered.linux.x64.checked.mch	3,498,318	-4,043
coreclr_tests.run.linux.x64.checked.mch	106,762,092	-188,378
libraries.pmi.linux.x64.checked.mch	60,269,909	-116,341
libraries_tests.run.linux.x64.Release.mch	15,802,230	-10,687
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	119,422,426	-118,187
realworld.run.linux.x64.checked.mch	12,829,386	-11,922
smoke_tests.nativeaot.linux.x64.checked.mch	4,172,992	-353

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 1,999,231 contexts (587,594 MinOpts, 1,411,637 FullOpts).

MISSED contexts: 3,657 (0.18%)

Overall (-1,094,187 bytes)

Collection	Base size (bytes)	Diff size (bytes)
aspnet.run.windows.x64.checked.mch	46,755,443	-66,423
benchmarks.run.windows.x64.checked.mch	11,726,687	-12,375
benchmarks.run_pgo.windows.x64.checked.mch	34,354,002	-66,859
benchmarks.run_tiered.windows.x64.checked.mch	19,448,991	-38,617
coreclr_tests.run.windows.x64.checked.mch	296,147,801	-497,185
libraries.pmi.windows.x64.checked.mch	67,659,390	-161,764
libraries_tests.run.windows.x64.Release.mch	42,430,197	-62,987
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	142,635,439	-167,708
realworld.run.windows.x64.checked.mch	14,768,267	-19,303
smoke_tests.nativeaot.windows.x64.checked.mch	5,049,682	-966

MinOpts (-483,299 bytes)

Collection	Base size (bytes)	Diff size (bytes)
aspnet.run.windows.x64.checked.mch	18,488,740	-27,292
benchmarks.run.windows.x64.checked.mch	595	-3
benchmarks.run_pgo.windows.x64.checked.mch	18,836,696	-39,778
benchmarks.run_tiered.windows.x64.checked.mch	15,367,889	-34,929
coreclr_tests.run.windows.x64.checked.mch	185,774,390	-300,707
libraries.pmi.windows.x64.checked.mch	113,521	-42
libraries_tests.run.windows.x64.Release.mch	31,641,880	-54,124
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	10,782,870	-26,421
realworld.run.windows.x64.checked.mch	386,609	-3

FullOpts (-610,888 bytes)

Collection	Base size (bytes)	Diff size (bytes)
aspnet.run.windows.x64.checked.mch	28,266,703	-39,131
benchmarks.run.windows.x64.checked.mch	11,726,092	-12,372
benchmarks.run_pgo.windows.x64.checked.mch	15,517,306	-27,081
benchmarks.run_tiered.windows.x64.checked.mch	4,081,102	-3,688
coreclr_tests.run.windows.x64.checked.mch	110,373,411	-196,478
libraries.pmi.windows.x64.checked.mch	67,545,869	-161,722
libraries_tests.run.windows.x64.Release.mch	10,788,317	-8,863
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	131,852,569	-141,287
realworld.run.windows.x64.checked.mch	14,381,658	-19,300
smoke_tests.nativeaot.windows.x64.checked.mch	5,048,735	-966

Details here

Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 1,618,717 contexts (327,626 MinOpts, 1,291,091 FullOpts).

MISSED contexts: 11,022 (0.68%)

Overall (-504,237 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.windows.x86.checked.mch	11,115,500	-3,943
benchmarks.run_pgo.windows.x86.checked.mch	31,815,296	+27,317
benchmarks.run_tiered.windows.x86.checked.mch	13,989,178	-6,113
coreclr_tests.run.windows.x86.checked.mch	215,108,646	-367,256
libraries.pmi.windows.x86.checked.mch	50,246,165	-92,043
libraries_tests.run.windows.x86.Release.mch	14,793,337	-5,742
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	107,842,128	-47,463
realworld.run.windows.x86.checked.mch	11,479,674	-8,994

MinOpts (-200,802 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run_pgo.windows.x86.checked.mch	6,121,948	-3,295
benchmarks.run_tiered.windows.x86.checked.mch	6,854,637	-4,273
coreclr_tests.run.windows.x86.checked.mch	122,261,024	-189,664
libraries.pmi.windows.x86.checked.mch	95,233	-3
libraries_tests.run.windows.x86.Release.mch	5,490,195	-3,351
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	8,952,773	-213
realworld.run.windows.x86.checked.mch	295,714	-3

FullOpts (-303,435 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.windows.x86.checked.mch	11,115,022	-3,943
benchmarks.run_pgo.windows.x86.checked.mch	25,693,348	+30,612
benchmarks.run_tiered.windows.x86.checked.mch	7,134,541	-1,840
coreclr_tests.run.windows.x86.checked.mch	92,847,622	-177,592
libraries.pmi.windows.x86.checked.mch	50,150,932	-92,040
libraries_tests.run.windows.x86.Release.mch	9,303,142	-2,391
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	98,889,355	-47,250
realworld.run.windows.x86.checked.mch	11,183,960	-8,991

Details here

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.00% to +0.03%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.01%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.03%
coreclr_tests.run.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.01%
libraries_tests.run.linux.x64.Release.mch	+0.03%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%
realworld.run.linux.x64.checked.mch	+0.01%

MinOpts (-0.00% to +0.08%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.01%
benchmarks.run_pgo.linux.x64.checked.mch	+0.05%
benchmarks.run_tiered.linux.x64.checked.mch	+0.05%
coreclr_tests.run.linux.x64.checked.mch	+0.03%
libraries.pmi.linux.x64.checked.mch	+0.05%
libraries_tests.run.linux.x64.Release.mch	+0.06%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.04%
realworld.run.linux.x64.checked.mch	+0.08%

FullOpts (+0.00% to +0.02%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.01%
benchmarks.run_pgo.linux.x64.checked.mch	+0.01%
benchmarks.run_tiered.linux.x64.checked.mch	+0.01%
coreclr_tests.run.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.01%
libraries_tests.run.linux.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%
realworld.run.linux.x64.checked.mch	+0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (-0.00% to +0.03%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.01%
benchmarks.run.windows.x64.checked.mch	+0.01%
benchmarks.run_pgo.windows.x64.checked.mch	+0.01%
benchmarks.run_tiered.windows.x64.checked.mch	+0.02%
coreclr_tests.run.windows.x64.checked.mch	+0.02%
libraries.pmi.windows.x64.checked.mch	+0.01%
libraries_tests.run.windows.x64.Release.mch	+0.03%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.01%
realworld.run.windows.x64.checked.mch	+0.01%

MinOpts (-0.01% to +0.07%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.05%
benchmarks.run.windows.x64.checked.mch	+0.01%
benchmarks.run_pgo.windows.x64.checked.mch	+0.04%
benchmarks.run_tiered.windows.x64.checked.mch	+0.03%
coreclr_tests.run.windows.x64.checked.mch	+0.03%
libraries.crossgen2.windows.x64.checked.mch	-0.01%
libraries.pmi.windows.x64.checked.mch	+0.05%
libraries_tests.run.windows.x64.Release.mch	+0.05%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.03%
realworld.run.windows.x64.checked.mch	+0.07%
smoke_tests.nativeaot.windows.x64.checked.mch	-0.01%

FullOpts (-0.00% to +0.02%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.01%
benchmarks.run.windows.x64.checked.mch	+0.01%
benchmarks.run_pgo.windows.x64.checked.mch	+0.01%
benchmarks.run_tiered.windows.x64.checked.mch	+0.01%
coreclr_tests.run.windows.x64.checked.mch	+0.02%
libraries.pmi.windows.x64.checked.mch	+0.01%
libraries_tests.run.windows.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.01%
realworld.run.windows.x64.checked.mch	+0.01%

Details here

Throughput diffs for windows/x86 ran on windows/x86

Overall (+0.00% to +0.04%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.02%
benchmarks.run_pgo.windows.x86.checked.mch	+0.02%
benchmarks.run_tiered.windows.x86.checked.mch	+0.03%
coreclr_tests.run.windows.x86.checked.mch	+0.03%
libraries.pmi.windows.x86.checked.mch	+0.01%
libraries_tests.run.windows.x86.Release.mch	+0.04%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	+0.03%
realworld.run.windows.x86.checked.mch	+0.02%

MinOpts (+0.00% to +0.16%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.09%
benchmarks.run_pgo.windows.x86.checked.mch	+0.07%
benchmarks.run_tiered.windows.x86.checked.mch	+0.07%
coreclr_tests.run.windows.x86.checked.mch	+0.06%
libraries.pmi.windows.x86.checked.mch	+0.11%
libraries_tests.run.windows.x86.Release.mch	+0.11%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	+0.10%
realworld.run.windows.x86.checked.mch	+0.16%

FullOpts (+0.00% to +0.03%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.02%
benchmarks.run_pgo.windows.x86.checked.mch	+0.02%
benchmarks.run_tiered.windows.x86.checked.mch	+0.02%
coreclr_tests.run.windows.x86.checked.mch	+0.02%
libraries.pmi.windows.x86.checked.mch	+0.01%
libraries_tests.run.windows.x86.Release.mch	+0.03%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	+0.03%
realworld.run.windows.x86.checked.mch	+0.02%

Details here

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.01% to +0.00%)

Collection	PDIFF
benchmarks.run_tiered.linux.x64.checked.mch	-0.01%

MinOpts (-0.05% to +0.02%)

Collection	PDIFF
benchmarks.run_tiered.linux.x64.checked.mch	-0.01%
coreclr_tests.run.linux.x64.checked.mch	-0.01%
libraries.crossgen2.linux.x64.checked.mch	-0.02%
libraries.pmi.linux.x64.checked.mch	+0.01%
benchmarks.run.linux.x64.checked.mch	-0.05%
realworld.run.linux.x64.checked.mch	+0.02%
smoke_tests.nativeaot.linux.x64.checked.mch	-0.01%
benchmarks.run_pgo.linux.x64.checked.mch	-0.01%

Details here

src/coreclr/jit/codegenxarch.cpp

tannergooding · 2024-02-12T00:03:11Z

Diffs look better now. Still overwhelmingly an improvement, but now with less examples of regressions.

The regressions are places where we called a P/Invoke but didn't use any 256-bit or higher AVX in the method itself, this is what fixes the perf issues called out in the original post. In such scenarios, we "hoist" the vzeroupper to be emitted in the prologue to avoid needing to do it before every P/Invoke call.

The improvements are primarily places where we used floating-point/simd in the method. Previously we would always emit a vzeroupper in the prologue for these methods. However, this was unnecessary since the JIT always emits VEX aware instructions and thus there is no penalty regardless of whether the UpperState=Clean or UpperState=Dirty.

We continue emitting vzeroupper in the epilogue of methods that use any 256-bit or higher AVX in the method itself, as is best practice according to the architecture manual. While this isn't strictly necessary for the JIT, since any managed caller will likely be VEX aware itself, it does ensure that if we return to a R2R method or a native method, that they won't incur any penalty. It likewise ensures that if we return to a method that needs to call a P/Invoke where the vzeroupper was hoisted, that the "right stuff" happens.

We likewise continue emitting vzeroupper before any P/Invokes for a method that uses 256-bit or higher AVX in the method itself. We could do more flow analysis to hoist some of these as well, but it's likely not worth the complexity.

tannergooding · 2024-02-12T03:24:33Z

Reduced the TP impact a bit and limited it to only x64. This should be ready for review, @dotnet/jit-contrib

ryujit-bot · 2024-02-12T15:09:41Z

Diff results for #98261

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,730,987 contexts (430,855 MinOpts, 1,300,132 FullOpts).

Overall (-823,764 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.linux.x64.checked.mch	12,567,522	-20,337
benchmarks.run_pgo.linux.x64.checked.mch	69,885,519	-66,168
benchmarks.run_tiered.linux.x64.checked.mch	23,156,301	-41,331
coreclr_tests.run.linux.x64.checked.mch	246,265,337	-408,042
libraries.pmi.linux.x64.checked.mch	60,776,347	-110,913
libraries_tests.run.linux.x64.Release.mch	32,207,624	-28,116
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	140,919,691	-136,986
realworld.run.linux.x64.checked.mch	13,946,490	-11,076
smoke_tests.nativeaot.linux.x64.checked.mch	4,232,799	-795

MinOpts (-338,370 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.linux.x64.checked.mch	199,298	-903
benchmarks.run_pgo.linux.x64.checked.mch	27,322,199	-48,459
benchmarks.run_tiered.linux.x64.checked.mch	18,767,019	-37,281
coreclr_tests.run.linux.x64.checked.mch	139,079,884	-217,170
libraries.pmi.linux.x64.checked.mch	112,857	-27
libraries_tests.run.linux.x64.Release.mch	20,750,846	-22,671
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	10,584,167	-11,856
realworld.run.linux.x64.checked.mch	388,157	-3

FullOpts (-485,394 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.linux.x64.checked.mch	12,368,224	-19,434
benchmarks.run_pgo.linux.x64.checked.mch	42,563,320	-17,709
benchmarks.run_tiered.linux.x64.checked.mch	4,389,282	-4,050
coreclr_tests.run.linux.x64.checked.mch	107,185,453	-190,872
libraries.pmi.linux.x64.checked.mch	60,663,490	-110,886
libraries_tests.run.linux.x64.Release.mch	11,456,778	-5,445
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	130,335,524	-125,130
realworld.run.linux.x64.checked.mch	13,558,333	-11,073
smoke_tests.nativeaot.linux.x64.checked.mch	4,231,850	-795

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 1,837,795 contexts (509,217 MinOpts, 1,328,578 FullOpts).

MISSED contexts: 133 (0.01%)

Overall (-964,068 bytes)

Collection	Base size (bytes)	Diff size (bytes)
aspnet.run.windows.x64.checked.mch	46,760,847	-64,902
benchmarks.run.windows.x64.checked.mch	8,752,020	-9,990
benchmarks.run_pgo.windows.x64.checked.mch	26,046,814	-43,281
benchmarks.run_tiered.windows.x64.checked.mch	12,793,606	-24,594
coreclr_tests.run.windows.x64.checked.mch	286,363,008	-479,850
libraries.pmi.windows.x64.checked.mch	62,025,027	-121,785
libraries_tests.run.windows.x64.Release.mch	35,353,949	-42,561
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	136,924,584	-157,551
realworld.run.windows.x64.checked.mch	14,214,707	-18,276
smoke_tests.nativeaot.windows.x64.checked.mch	5,089,751	-1,278

MinOpts (-422,088 bytes)

Collection	Base size (bytes)	Diff size (bytes)
aspnet.run.windows.x64.checked.mch	18,490,815	-26,304
benchmarks.run.windows.x64.checked.mch	363	-3
benchmarks.run_pgo.windows.x64.checked.mch	11,756,366	-24,291
benchmarks.run_tiered.windows.x64.checked.mch	9,132,019	-20,877
coreclr_tests.run.windows.x64.checked.mch	179,104,349	-289,155
libraries.pmi.windows.x64.checked.mch	113,521	-27
libraries_tests.run.windows.x64.Release.mch	26,016,097	-35,004
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	10,511,309	-26,424
realworld.run.windows.x64.checked.mch	386,612	-3

FullOpts (-541,980 bytes)

Collection	Base size (bytes)	Diff size (bytes)
aspnet.run.windows.x64.checked.mch	28,270,032	-38,598
benchmarks.run.windows.x64.checked.mch	8,751,657	-9,987
benchmarks.run_pgo.windows.x64.checked.mch	14,290,448	-18,990
benchmarks.run_tiered.windows.x64.checked.mch	3,661,587	-3,717
coreclr_tests.run.windows.x64.checked.mch	107,258,659	-190,695
libraries.pmi.windows.x64.checked.mch	61,911,506	-121,758
libraries_tests.run.windows.x64.Release.mch	9,337,852	-7,557
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	126,413,275	-131,127
realworld.run.windows.x64.checked.mch	13,828,095	-18,273
smoke_tests.nativeaot.windows.x64.checked.mch	5,088,804	-1,278

Details here

Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 1,485,481 contexts (265,979 MinOpts, 1,219,502 FullOpts).

Overall (-493,551 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.windows.x86.checked.mch	7,144,882	-4,548
benchmarks.run_pgo.windows.x86.checked.mch	31,085,893	+28,212
benchmarks.run_tiered.windows.x86.checked.mch	9,486,951	-5,766
coreclr_tests.run.windows.x86.checked.mch	207,102,883	-364,980
libraries.pmi.windows.x86.checked.mch	49,622,010	-82,632
libraries_tests.run.windows.x86.Release.mch	8,693,631	-1,512
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	104,061,821	-52,320
realworld.run.windows.x86.checked.mch	11,356,453	-10,005

MinOpts (-193,488 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run_pgo.windows.x86.checked.mch	3,953,221	-3,201
benchmarks.run_tiered.windows.x86.checked.mch	4,279,358	-3,624
coreclr_tests.run.windows.x86.checked.mch	117,693,910	-186,045
libraries.pmi.windows.x86.checked.mch	95,233	-3
libraries_tests.run.windows.x86.Release.mch	1,591,385	-450
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	8,675,049	-162
realworld.run.windows.x86.checked.mch	295,717	-3

FullOpts (-300,063 bytes)

Collection	Base size (bytes)	Diff size (bytes)
benchmarks.run.windows.x86.checked.mch	7,144,601	-4,548
benchmarks.run_pgo.windows.x86.checked.mch	27,132,672	+31,413
benchmarks.run_tiered.windows.x86.checked.mch	5,207,593	-2,142
coreclr_tests.run.windows.x86.checked.mch	89,408,973	-178,935
libraries.pmi.windows.x86.checked.mch	49,526,777	-82,629
libraries_tests.run.windows.x86.Release.mch	7,102,246	-1,062
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	95,386,772	-52,158
realworld.run.windows.x86.checked.mch	11,060,736	-10,002

Details here

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Overall (-0.00% to +0.02%)

Collection	PDIFF
smoke_tests.nativeaot.linux.x64.checked.mch	+0.01%
benchmarks.run.linux.x64.checked.mch	+0.01%
libraries.crossgen2.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.01%
libraries.pmi.linux.x64.checked.mch	+0.01%
libraries_tests.run.linux.x64.Release.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.01%

MinOpts (-0.01% to +0.07%)

Collection	PDIFF
smoke_tests.nativeaot.linux.x64.checked.mch	+0.07%
libraries.crossgen2.linux.x64.checked.mch	+0.03%
realworld.run.linux.x64.checked.mch	+0.05%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
libraries.pmi.linux.x64.checked.mch	+0.07%
libraries_tests.run.linux.x64.Release.mch	+0.04%
coreclr_tests.run.linux.x64.checked.mch	-0.01%
benchmarks.run_tiered.linux.x64.checked.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.02%

FullOpts (+0.00% to +0.02%)

Collection	PDIFF
smoke_tests.nativeaot.linux.x64.checked.mch	+0.01%
benchmarks.run.linux.x64.checked.mch	+0.01%
libraries.crossgen2.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.01%
libraries.pmi.linux.x64.checked.mch	+0.01%
libraries_tests.run.linux.x64.Release.mch	+0.01%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.01%

Details here

Throughput diffs for linux/x64 ran on windows/x64

Overall (+0.01% to +0.04%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.03%
coreclr_tests.run.linux.x64.checked.mch	+0.01%
libraries.crossgen2.linux.x64.checked.mch	+0.04%
libraries.pmi.linux.x64.checked.mch	+0.03%
libraries_tests.run.linux.x64.Release.mch	+0.03%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.03%
realworld.run.linux.x64.checked.mch	+0.02%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.03%

MinOpts (-0.00% to +0.10%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.03%
benchmarks.run_pgo.linux.x64.checked.mch	+0.04%
benchmarks.run_tiered.linux.x64.checked.mch	+0.03%
libraries.crossgen2.linux.x64.checked.mch	+0.06%
libraries.pmi.linux.x64.checked.mch	+0.08%
libraries_tests.run.linux.x64.Release.mch	+0.05%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.03%
realworld.run.linux.x64.checked.mch	+0.07%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.10%

FullOpts (+0.02% to +0.04%)

Collection	PDIFF
benchmarks.run.linux.x64.checked.mch	+0.02%
benchmarks.run_pgo.linux.x64.checked.mch	+0.02%
benchmarks.run_tiered.linux.x64.checked.mch	+0.02%
coreclr_tests.run.linux.x64.checked.mch	+0.02%
libraries.crossgen2.linux.x64.checked.mch	+0.04%
libraries.pmi.linux.x64.checked.mch	+0.03%
libraries_tests.run.linux.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch	+0.03%
realworld.run.linux.x64.checked.mch	+0.02%
smoke_tests.nativeaot.linux.x64.checked.mch	+0.03%

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)

Collection	PDIFF
libraries.pmi.windows.arm64.checked.mch	+0.01%

Throughput diffs for windows/x64 ran on windows/x64

Overall (+0.01% to +0.03%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.02%
benchmarks.run.windows.x64.checked.mch	+0.02%
benchmarks.run_pgo.windows.x64.checked.mch	+0.01%
benchmarks.run_tiered.windows.x64.checked.mch	+0.02%
coreclr_tests.run.windows.x64.checked.mch	+0.01%
libraries.crossgen2.windows.x64.checked.mch	+0.03%
libraries.pmi.windows.x64.checked.mch	+0.02%
libraries_tests.run.windows.x64.Release.mch	+0.03%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.03%
realworld.run.windows.x64.checked.mch	+0.02%
smoke_tests.nativeaot.windows.x64.checked.mch	+0.03%

MinOpts (-0.03% to +0.09%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.04%
benchmarks.run.windows.x64.checked.mch	-0.03%
benchmarks.run_pgo.windows.x64.checked.mch	+0.02%
benchmarks.run_tiered.windows.x64.checked.mch	+0.02%
libraries.crossgen2.windows.x64.checked.mch	+0.05%
libraries.pmi.windows.x64.checked.mch	+0.08%
libraries_tests.run.windows.x64.Release.mch	+0.04%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.01%
realworld.run.windows.x64.checked.mch	+0.06%
smoke_tests.nativeaot.windows.x64.checked.mch	+0.09%

FullOpts (+0.01% to +0.03%)

Collection	PDIFF
aspnet.run.windows.x64.checked.mch	+0.02%
benchmarks.run.windows.x64.checked.mch	+0.02%
benchmarks.run_pgo.windows.x64.checked.mch	+0.01%
benchmarks.run_tiered.windows.x64.checked.mch	+0.02%
coreclr_tests.run.windows.x64.checked.mch	+0.02%
libraries.crossgen2.windows.x64.checked.mch	+0.03%
libraries.pmi.windows.x64.checked.mch	+0.02%
libraries_tests.run.windows.x64.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch	+0.03%
realworld.run.windows.x64.checked.mch	+0.02%
smoke_tests.nativeaot.windows.x64.checked.mch	+0.03%

Details here

Throughput diffs for windows/x86 ran on windows/x86

Overall (-0.02% to +0.02%)

Collection	PDIFF
benchmarks.run_pgo.windows.x86.checked.mch	+0.01%
coreclr_tests.run.windows.x86.checked.mch	-0.02%
libraries.crossgen2.windows.x86.checked.mch	+0.02%
libraries.pmi.windows.x86.checked.mch	-0.01%
libraries_tests.run.windows.x86.Release.mch	+0.02%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	+0.01%
realworld.run.windows.x86.checked.mch	-0.01%

MinOpts (-0.03% to +0.07%)

Collection	PDIFF
benchmarks.run.windows.x86.checked.mch	+0.05%
coreclr_tests.run.windows.x86.checked.mch	-0.03%
libraries.crossgen2.windows.x86.checked.mch	+0.04%
libraries.pmi.windows.x86.checked.mch	+0.05%
libraries_tests.run.windows.x86.Release.mch	+0.07%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	+0.05%
realworld.run.windows.x86.checked.mch	+0.07%

FullOpts (-0.02% to +0.02%)

Collection	PDIFF
benchmarks.run_pgo.windows.x86.checked.mch	+0.01%
coreclr_tests.run.windows.x86.checked.mch	-0.02%
libraries.crossgen2.windows.x86.checked.mch	+0.02%
libraries.pmi.windows.x86.checked.mch	-0.01%
libraries_tests.run.windows.x86.Release.mch	+0.01%
libraries_tests_no_tiered_compilation.run.windows.x86.Release.mch	+0.01%
realworld.run.windows.x86.checked.mch	-0.01%

Details here

src/coreclr/jit/lsraxarch.cpp

kunalspathak · 2024-02-13T15:54:24Z

Not sure why windows-arm64 TP is affected:

kunalspathak

LGTM

tannergooding · 2024-02-13T16:09:34Z

Not sure why windows-arm64 TP is affected:

The tool is only measuring how many instructions were executed. This naturally fluctuates based on several factors and so it's always possible (although rare) that the tool reports additional TP changes for an architecture that wasn't touched.

In this case, the changes have all been made in xarch specific files or definitions (either *xarch.h, *xarch.cpp, or under #ifdef TARGET_XARCH), so there's nothing that could have actually changed for Arm64.

jakobbotsch · 2024-02-13T19:44:51Z

I've also noticed that lately the arm64 variance in the TP jobs has been higher than previously. I should take a look at where that variance is coming from. The variance used to be significantly less than 0.01%.

jnyrup · 2024-02-14T08:56:27Z

#82132 (comment)

This is a longstanding perf issue, but not a regression nor a correctness issue. Moving to .NET 9

With the two reported regressions for .NET 8 fixed by this PR is there a hope of meeting the bar for having this PR backported to .NET 8?

tannergooding · 2024-02-14T15:44:27Z

@jnyrup, my expectation is "no", but it would ultimately be up to @JulieLeeMSFT on whether or not we take it for a servicing bar check.

This is a general issue going back to .NET Framework, so it's not technically a regression. There were two new customer reported scenarios that it shows up in .NET 8, but they are just variations on the same general issue and are showing up primarily due to the context of broader code (user code + library code + user optimizations happen to trigger it for this scenario).

The fix here is relatively straightforward, but its also not isolated and impacts a lot of code across the BCL. Because of this it's possible that there are scenarios not covered or a particular microarchitecture this doesn't fix, so it's not easy to label it as "low risk". Given a couple months time, it might be easier to label this as "low risk", certainly after we get the first set of benchmark numbers in our weekly perf triage next Tuesday.

And then finally, there are some "workarounds" devs can do to "fix" this by utilizing knowledge of when the JIT emits vzeroupper. Most notably you can "force" the JIT to emit a vzeroupper before a P/Invoke by simply ensuring some V256 usage exists before the P/Invoke call. One example of this is the following, where you'd simply use _ = GetZero(); before the P/Invoke. This will force a call which emits vzeroupper and then never mutates the upper bits, ensuring you're in a "clean" state so that the penalty doesn't exist.

[MethodImpl(MethodImplOptions.NoInlining)]
public static Vector128<float> GetZero() => Vector128<float>.Zero;

Update where and when vzeroupper is emitted

f764a47

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 10, 2024

ghost assigned tannergooding Feb 10, 2024

tannergooding commented Feb 10, 2024

View reviewed changes

src/coreclr/jit/codegenxarch.cpp Outdated Show resolved Hide resolved

tannergooding commented Feb 10, 2024

View reviewed changes

src/coreclr/jit/codegenxarch.cpp Outdated Show resolved Hide resolved

jkotas reviewed Feb 10, 2024

View reviewed changes

src/coreclr/jit/codegenxarch.cpp Outdated Show resolved Hide resolved

build-analysis bot mentioned this pull request Feb 10, 2024

Tests crashing in CI with no dump: exit code 137 means SIGKILL Killed #97049

Closed

tannergooding commented Feb 11, 2024

View reviewed changes

src/coreclr/jit/codegenxarch.cpp Outdated Show resolved Hide resolved

build-analysis bot mentioned this pull request Feb 11, 2024

[8.0] Failing to build native components #94823

Closed

Ensure we emit vzeroupper for JIT helpers that need it

e1b0354

tannergooding force-pushed the vzeroupper branch from bcf9262 to e1b0354 Compare February 11, 2024 15:58

tannergooding added 2 commits February 11, 2024 08:00

Make sure vzeroupper is in genRestoreCalleeSavedFltRegs

d442cd3

Scope when vzeroupper is emitted to fewer places

e6847f2

build-analysis bot mentioned this pull request Feb 11, 2024

Test failure _ParallelCrashTester::ParallelCrashTester.ParallelCrashMainThreadAndWorkerThreads() #94393

Closed

Revert the simplification done to SetContainsAVXFlags

d50c09b

tannergooding mentioned this pull request Feb 11, 2024

Ensure that SetContainsAVX passes in the regSize #98283

Merged

tannergooding added 3 commits February 11, 2024 16:03

Merge remote-tracking branch 'dotnet/main' into vzeroupper

7422d8b

Try to minify the TP impact of the improved vzeroupper handling

76ae261

Merge remote-tracking branch 'dotnet/main' into vzeroupper

db9df66

build-analysis bot mentioned this pull request Feb 12, 2024

System.Net.Security.Tests.SslStreamCertificateContextOcspLinuxTests.RefreshOcspResponse_BeforeExpiration test failure #97779

Closed

runfoapp bot mentioned this pull request Feb 12, 2024

Methodical_others test JIT/Methodical/Coverage/copy_prop_byref_to_native_int crashing #69832

Open

This was referenced Feb 12, 2024

SqlServer.Types is 20% slower on .NET 8 #96211

Closed

Performance of Trigonometric math function have unbelievable loss at .NET8 #95954

Closed

kunalspathak reviewed Feb 12, 2024

View reviewed changes

src/coreclr/jit/lsraxarch.cpp Show resolved Hide resolved

kunalspathak approved these changes Feb 13, 2024

View reviewed changes

tannergooding merged commit 6d877c5 into dotnet:main Feb 13, 2024
137 of 139 checks passed

tannergooding deleted the vzeroupper branch February 13, 2024 16:09

DrewScoggins mentioned this pull request Feb 20, 2024

Performance changes related to vzeroupper emit changes #98705

Closed

github-actions bot locked and limited conversation to collaborators Mar 16, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update where and when vzeroupper is emitted #98261

Update where and when vzeroupper is emitted #98261

tannergooding commented Feb 10, 2024 •

edited

Loading

ghost commented Feb 10, 2024

ryujit-bot commented Feb 10, 2024

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Assembly diffs for windows/x64 ran on windows/x64

Assembly diffs for windows/x86 ran on windows/x86

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

Throughput diffs for windows/x64 ran on windows/x64

Throughput diffs for windows/x86 ran on windows/x86

Throughput diffs for linux/x64 ran on linux/x64

tannergooding commented Feb 12, 2024 •

edited

Loading

tannergooding commented Feb 12, 2024

ryujit-bot commented Feb 12, 2024

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Assembly diffs for windows/x64 ran on windows/x64

Assembly diffs for windows/x86 ran on windows/x86

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Throughput diffs for linux/x64 ran on windows/x64

Throughput diffs for windows/arm64 ran on windows/x64

Throughput diffs for windows/x64 ran on windows/x64

Throughput diffs for windows/x86 ran on windows/x86

kunalspathak commented Feb 13, 2024

kunalspathak left a comment

tannergooding commented Feb 13, 2024

jakobbotsch commented Feb 13, 2024

jnyrup commented Feb 14, 2024

tannergooding commented Feb 14, 2024

Update where and when vzeroupper is emitted #98261

Update where and when vzeroupper is emitted #98261

Conversation

tannergooding commented Feb 10, 2024 • edited Loading

ghost commented Feb 10, 2024

ryujit-bot commented Feb 10, 2024

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Assembly diffs for windows/x64 ran on windows/x64

Assembly diffs for windows/x86 ran on windows/x86

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

Throughput diffs for windows/x64 ran on windows/x64

Throughput diffs for windows/x86 ran on windows/x86

Throughput diffs for linux/x64 ran on linux/x64

tannergooding commented Feb 12, 2024 • edited Loading

tannergooding commented Feb 12, 2024

ryujit-bot commented Feb 12, 2024

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Assembly diffs for windows/x64 ran on windows/x64

Assembly diffs for windows/x86 ran on windows/x86

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

Throughput diffs for linux/x64 ran on windows/x64

Throughput diffs for windows/arm64 ran on windows/x64

Throughput diffs for windows/x64 ran on windows/x64

Throughput diffs for windows/x86 ran on windows/x86

kunalspathak commented Feb 13, 2024

kunalspathak left a comment

Choose a reason for hiding this comment

tannergooding commented Feb 13, 2024

jakobbotsch commented Feb 13, 2024

jnyrup commented Feb 14, 2024

tannergooding commented Feb 14, 2024

tannergooding commented Feb 10, 2024 •

edited

Loading

tannergooding commented Feb 12, 2024 •

edited

Loading