Dynamic PGO #43618
Comments
Bolded "in progress" work item. |
Quick idea: don't expand
BA67666666 mov edx, 0x66666667
8BC2 mov eax, edx
F7E9 imul edx:eax, ecx
8BC2 mov eax, edx
C1E81F shr eax, 31
C1FA02 sar edx, 2
03C2 add eax, edx
8D0480 lea eax, [rax+4*rax]
03C0 add eax, eax
2BC8 sub ecx, eax
8BC1 mov eax, ecx
while it could be just
8BC1 mov eax, ecx
99 cdq
41F7F8 idiv edx:eax, 10
8BC2 mov eax, edx |
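To make the trade-off concrete, here is a minimal Python sketch that models both listings. It assumes, as an inference from the instructions (the expression itself is missing from the comment), that both sequences compute the remainder of a signed 32-bit value divided by 10. The function names `rem10_magic` and `rem10_idiv` are hypothetical and exist only for illustration.

```python
MAGIC = 0x66666667  # the constant loaded by "mov edx, 0x66666667"

def rem10_magic(x):
    """Model of the long sequence: multiply by a magic constant, then fix up."""
    assert -2**31 <= x < 2**31, "models a signed 32-bit input"
    hi = (MAGIC * x) >> 32       # imul: high 32 bits of the signed 64-bit product
    sign = 1 if hi < 0 else 0    # shr eax, 31: extract the sign bit of hi
    q = (hi >> 2) + sign         # sar edx, 2 / add eax, edx: quotient x / 10
    return x - 10 * q            # lea / add / sub: x - 10 * q

def rem10_idiv(x):
    """Model of the short sequence: truncating signed division (cdq + idiv)."""
    q = abs(x) // 10
    if x < 0:
        q = -q                   # idiv truncates toward zero
    return x - 10 * q            # idiv leaves the remainder in edx

# Both forms agree, including at the 32-bit extremes.
samples = [0, 1, 9, 10, 37, -37, -10, 2**31 - 1, -2**31]
agree = all(rem10_magic(x) == rem10_idiv(x) for x in samples)
```

The multiply-based form avoids a slow division instruction at the cost of code size, which is why skipping the expansion is only attractive where the code is rarely executed.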
Anecdotal evidence of PGO making SIMD worse: https://twitter.com/AntaoAlmada/status/1382033309052588036 However, @aalmada hasn't been able to isolate it for a repro/issue. |
Happy to investigate if there's a repro. Couple random thoughts:
|
@AndyAyersMS I submitted an issue with more information #51915 |
Here's a current comparison of roughly 3400 microbenchmarks running with a variety of PGO configurations on Windows x64. In .NET 6.0, the default behavior is to use the Static PGO data available in the framework assemblies. This configuration is the baseline. All measurements are done via BenchmarkDotNet, which (in principle) should be measuring the performance of Tier1 jitted code. The configurations measured are:
The data below shows crude histograms of the ratios of baseline to configuration performance for the microbenchmarks. Values less than 1.0 mean that the baseline is running faster than the configuration; values larger than 1.0 mean that the configuration is running faster than the baseline. The last entry shows the geometric mean of the ratios; this gives a rough figure of merit for the entire configuration. From the No PGO data, we can see that Static PGO (the default configuration) provides an overall improvement of around 1.5% on microbenchmark performance. Dynamic PGO offers roughly 1% improvement over default (so a 2.5% impact over no PGO). Full PGO offers roughly 6% improvement over default (so 7.5% impact over no PGO).
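As a rough illustration of how such a figure of merit can be computed, the sketch below derives per-benchmark ratios and their geometric mean. It assumes the ratio is baseline time divided by configuration time, so that values above 1.0 mean the configuration is faster; the benchmark names and timings are made up for the example and are not measurements from this issue.

```python
import math

def speedup_ratios(baseline_ns, config_ns):
    """Ratio of baseline time to configuration time for each shared benchmark."""
    return {
        name: baseline_ns[name] / config_ns[name]
        for name in baseline_ns
        if name in config_ns
    }

def geometric_mean(ratios):
    """Geometric mean, computed in log space to avoid overflow on long lists."""
    values = list(ratios)
    return math.exp(sum(math.log(r) for r in values) / len(values))

# Hypothetical timings in nanoseconds (illustrative only).
baseline = {"BenchA": 100.0, "BenchB": 200.0, "BenchC": 50.0}
config = {"BenchA": 80.0, "BenchB": 250.0, "BenchC": 50.0}

ratios = speedup_ratios(baseline, config)
geomean = geometric_mean(ratios.values())
```

The geometric mean is a natural fit for ratios because a 2x speedup and a 2x slowdown cancel out, which an arithmetic mean would not do.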
|
Dynamic PGO: Extreme Results
Here Static PGO is the baseline, and Dynamic PGO the diff. So higher is better for Dynamic PGO, and the "bottom" results are tests that fare poorly with Dynamic PGO. We suspect that some of these very poor results are cases where BenchmarkDotNet doesn't run enough iterations to get all the key methods running Tier1 code, but this needs more investigation.
Bottom 20 results
Top 20 results
Not clear yet what's going on with some of these tests with outsized gains. Will fill in with analysis when I have it. Suspect the running time of these tests is so short that the measurement is below BDN's noise floor. |
Full PGO: Extreme Results
(more details as I have time to fill them in)
Bottom 20 results
Top 20 results
|
@AndyAyersMS the latest round of TE benchmark for the inliner: #52708 (comment) I'm also watching other metrics like time-to-first response, latency, memory, etc. I wonder how fast we can go so I'm testing a more aggressive version at the moment. |
Closing per updated top comment:
|
Epic for improving how the jit produces and consumes profile data, with an emphasis on the "dynamic" scenario where everything happens in-process.
Much of the work is also applicable to AOT PGO scenarios.
All non-stretch items are completed for .NET 6. We'll open a follow-on issue to capture the stretch items below and new work envisioned for .NET 7.
Link to related github project
Overview document: Dynamic PGO
(intro from that doc)
Profile based optimization relies heavily on the principle that past behavior is a good predictor of future behavior. Thus observations about past program behavior can steer optimization decisions in profitable directions, so that future program execution is more efficient.
These observations may come from the recent past, perhaps even from the current execution of a program, or from the distant past. Observations can be from the same version of the program or from different versions.
Observations are most often block counts, but can cover many different aspects of behavior; some of these are sketched below.
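As a toy illustration of what a block count observation looks like, the sketch below hand-instruments a small function with one counter per basic block and derives a branch likelihood from the counts. This is only a conceptual model; the block names and the `taken_fraction` helper are hypothetical and do not reflect how the JIT actually stores or consumes profile data.

```python
from collections import Counter

def count_negatives(values, counts):
    """Toy 'instrumented' function: bumps a counter on entry to each basic block."""
    counts["entry"] += 1
    negatives = 0
    for v in values:
        counts["loop_body"] += 1
        if v < 0:
            counts["then_negative"] += 1
            negatives += 1
        else:
            counts["else_nonnegative"] += 1
    counts["exit"] += 1
    return negatives

def taken_fraction(counts, branch_block, parent_block):
    """Estimate how often a branch target runs relative to its parent block."""
    parent = counts[parent_block]
    return counts[branch_block] / parent if parent else 0.0

counts = Counter()
count_negatives([3, -1, 4, 1, -5, 9, 2, 6], counts)
p_negative = taken_fraction(counts, "then_negative", "loop_body")
```

An optimizer holding these counts could, for example, treat the rarely taken branch as cold when deciding on code layout or inlining.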
A number of important optimizations are really only practical when profile feedback is available. Key among these is aggressive inlining, but many other speculative, time-consuming, or size-expanding optimizations fall in this category.
Profile feedback is especially crucial in JIT-based environments, where compile time is at a premium. Indeed, one can argue that the performance of modern Java and JavaScript implementations hinges crucially on effective leverage of profile feedback.
Profile guided optimization benefits both JIT and AOT compilation. While this document focuses largely on the benefits to JIT compilation, much of what follows is also applicable to AOT. The big distinction is ease of use -- in a jitted environment profile based optimization can be done automatically, and so can be offered as a platform feature without requiring any changes to applications.
.NET currently has a somewhat arm's-length approach to profile guided optimization, and does not obtain much benefit from it. Significant opportunity awaits us if we can tap into this technology.
.NET 6 Scenarios
Work items
(stretch) indicates things that are not going to make it into .NET 6.0.
Representation of Profile Data
Incorporation of profile data
Heuristics and Optimization
Instrumentation
Sample Based PGO
Runtime
Maintenance
Debugging and Diagnostics
Testing and CI
Performance
Related issues:
Also of note:
category:planning
theme:planning
skill-level:expert
cost:large