Update where and when vzeroupper is emitted #98261
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Diff results for #98261

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,620,764 contexts (360,162 MinOpts, 1,260,602 FullOpts). MISSED contexts: 3,086 (0.19%)

- Overall (-790,628 bytes)
- MinOpts (-306,427 bytes)
- FullOpts (-484,201 bytes)

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 1,999,231 contexts (587,594 MinOpts, 1,411,637 FullOpts). MISSED contexts: 3,657 (0.18%)

- Overall (-1,094,187 bytes)
- MinOpts (-483,299 bytes)
- FullOpts (-610,888 bytes)

Details here

Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 1,618,717 contexts (327,626 MinOpts, 1,291,091 FullOpts). MISSED contexts: 11,022 (0.68%)

- Overall (-504,237 bytes)
- MinOpts (-200,802 bytes)
- FullOpts (-303,435 bytes)

Details here

Throughput diffs

Throughput diffs for linux/x64 ran on windows/x64

- Overall (+0.00% to +0.03%)
- MinOpts (-0.00% to +0.08%)
- FullOpts (+0.00% to +0.02%)

Throughput diffs for windows/x64 ran on windows/x64

- Overall (-0.00% to +0.03%)
- MinOpts (-0.01% to +0.07%)
- FullOpts (-0.00% to +0.02%)

Details here

Throughput diffs for windows/x86 ran on windows/x86

- Overall (+0.00% to +0.04%)
- MinOpts (+0.00% to +0.16%)
- FullOpts (+0.00% to +0.03%)

Details here

Throughput diffs for linux/x64 ran on linux/x64

- Overall (-0.01% to +0.00%)
- MinOpts (-0.05% to +0.02%)

Details here
Force-pushed bcf9262 to e1b0354
Diffs look better now. Still overwhelmingly an improvement, but now with fewer examples of regressions. The regressions are places where we called a P/Invoke but didn't use any floating-point/SIMD in the method, so we now emit a `vzeroupper` that we previously did not. The improvements are primarily places where we used floating-point/SIMD in the method; previously we would always emit a `vzeroupper` for such methods, and now we only do so where the transition rules require it. We continue emitting `vzeroupper` before P/Invokes, and we likewise continue emitting it before calls into code that may target the legacy encoding.
Reduced the TP impact a bit and limited it to x64 only. This should be ready for review, @dotnet/jit-contrib
Diff results for #98261

Assembly diffs

Assembly diffs for linux/x64 ran on windows/x64

Diffs are based on 1,730,987 contexts (430,855 MinOpts, 1,300,132 FullOpts).

- Overall (-823,764 bytes)
- MinOpts (-338,370 bytes)
- FullOpts (-485,394 bytes)

Assembly diffs for windows/x64 ran on windows/x64

Diffs are based on 1,837,795 contexts (509,217 MinOpts, 1,328,578 FullOpts). MISSED contexts: 133 (0.01%)

- Overall (-964,068 bytes)
- MinOpts (-422,088 bytes)
- FullOpts (-541,980 bytes)

Details here

Assembly diffs for windows/x86 ran on windows/x86

Diffs are based on 1,485,481 contexts (265,979 MinOpts, 1,219,502 FullOpts).

- Overall (-493,551 bytes)
- MinOpts (-193,488 bytes)
- FullOpts (-300,063 bytes)

Details here

Throughput diffs

Throughput diffs for linux/x64 ran on linux/x64

- Overall (-0.00% to +0.02%)
- MinOpts (-0.01% to +0.07%)
- FullOpts (+0.00% to +0.02%)

Details here

Throughput diffs for linux/x64 ran on windows/x64

- Overall (+0.01% to +0.04%)
- MinOpts (-0.00% to +0.10%)
- FullOpts (+0.02% to +0.04%)

Throughput diffs for windows/arm64 ran on windows/x64

- MinOpts (-0.00% to +0.01%)

Throughput diffs for windows/x64 ran on windows/x64

- Overall (+0.01% to +0.03%)
- MinOpts (-0.03% to +0.09%)
- FullOpts (+0.01% to +0.03%)

Details here

Throughput diffs for windows/x86 ran on windows/x86

- Overall (-0.02% to +0.02%)
- MinOpts (-0.03% to +0.07%)
- FullOpts (-0.02% to +0.02%)

Details here
LGTM
The tool is only measuring how many instructions were executed. This naturally fluctuates based on several factors, so it's always possible (although rare) that the tool reports additional TP changes for an architecture that wasn't touched. In this case, the changes have all been made in xarch-specific files or definitions.
I've also noticed that lately the arm64 variance in the TP jobs has been higher than previously. I should take a look at where that variance is coming from. The variance used to be significantly less than 0.01%.
With the two reported regressions for .NET 8 fixed by this PR, is there any hope of meeting the bar for having this PR backported to .NET 8?
@jnyrup, my expectation is "no", but it would ultimately be up to @JulieLeeMSFT on whether or not we take it for a servicing bar check.

This is a general issue going back to .NET Framework, so it's not technically a regression. There were two new customer-reported scenarios in which it shows up in .NET 8, but they are just variations on the same general issue and are showing up primarily due to the context of the broader code (user code + library code + user optimizations happen to trigger it for this scenario).

The fix here is relatively straightforward, but it's also not isolated and impacts a lot of code across the BCL. Because of this, it's possible that there are scenarios not covered or a particular microarchitecture this doesn't fix, so it's not easy to label it as "low risk". Given a couple months' time, it might be easier to label this as "low risk", certainly after we get the first set of benchmark numbers in our weekly perf triage next Tuesday.

And then finally, there are some "workarounds" devs can do to "fix" this by utilizing knowledge of when the JIT emits `vzeroupper`:

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
public static Vector128<float> GetZero() => Vector128<float>.Zero;
```
This resolves #82132 and resolves #11496 and resolves #96211 and resolves #95954

The transition diagrams are as seen below. The Intel optimization manual guidance in `3.11.5.3 Fixing Instruction Slowdowns` states:

Given the diagrams and this statement, we can come to two conclusions:

Essentially, for any method compiled by the JIT during the lifetime of the program, we know it is `VEX-aware` and thus, regardless of `UpperState=Dirty` or `UpperState=Clean`, `managed to managed` calls for such methods are safe, do not need `vzeroupper`, and incur no transition penalty.

Likewise, if we are going from `unmanaged to managed` we are also safe, because we are going from `UpperState=Clean` or `UpperState=Dirty` to `UpperState=Dirty` (assuming we aren't on a pre-Skylake microarchitecture where native code itself placed us in `UpperState=PreservedNonInit`) and thus no transition penalty exists.

The only case we really care about is `managed to unmanaged` (such as for a P/Invoke), as for such a scenario we cannot assume to know whether or not the unmanaged code is `VEX` aware. Thus, we need to emit `vzeroupper` before such calls (as the optimization manual guidance states) to ensure we aren't executing legacy encoded instructions where `UpperState=Dirty` or `UpperState=PreservedNonInit`.

This consideration largely only applies to P/Invokes to user functions and does not apply to most JIT helpers. It additionally applies to calls from a managed method that was jitted during the execution of the program to a managed method that was compiled for R2R, which may target the legacy encoding.

Older micro-architectures:

Skylake and newer micro-architectures: