JIT: make profile data available to inlinees #42277
Conversation
Update the jit to try and read profile data for inlinees, and if successful, scale it appropriately for the inline call site. This kicks in for crossgen BBOPT and TieredPGO Tier1. Update VM and Crossgen hosts to handle requests for inlinee profile counts. Crossgen2 does not seem to support profile data retrieval yet. Note crossgen experience may not be as good as one might expect, because crossgen BBINSTR loses counts for inlinees. But enabling this for crossgen even with this limitation is probably a win overall. Fix small issue in the jit where we were overly aggressive about merging the callee block's flags into the callsite block's flags.
cc @dotnet/jit-contrib @davidwrighton Crossgen diffs (ignore the assemblies with minor diffs, I think my baseline build is a bit off). Impact is on the Roslyn assemblies that have IBC data. Most of the code size diffs seem to be from changes to block layout; suspect our PGO based layout algorithm is reacting a bit too strongly here (and by way of contrast, the inliner is not reacting strongly enough).
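The scaling described in the PR summary can be sketched roughly as follows. This is an illustrative model, not the actual jit code; the function name and the plain list-of-counts representation are invented for the example. Each inlinee block count is multiplied by the ratio of the callsite's observed count to the inlinee's entry-block count.

```python
# Illustrative sketch of scaling inlinee profile counts to an inline
# call site -- not the actual jit implementation.
def scale_inlinee_profile(inlinee_counts, callsite_count):
    """Scale raw inlinee block counts so the entry block matches the
    number of times this particular call site was observed to run."""
    entry_count = inlinee_counts[0]  # count of the inlinee's entry block
    if entry_count == 0:
        # No profile signal for this inlinee; treat all blocks as cold.
        return [0.0] * len(inlinee_counts)
    scale = callsite_count / entry_count
    return [c * scale for c in inlinee_counts]

# The inlinee ran 100 times overall, but this call site only ran 50
# times, so every block count is halved for the inlined copy.
print(scale_inlinee_profile([100, 80, 20], 50))  # [50.0, 40.0, 10.0]
```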
@AndyAyersMS is there a doc on how …? upd: wow, TIL
@EgorBo yes, that's right -- just set …. Still lots to do in this space.
Still quite impressive! Question: do we emit PGO counters to R2R images for BCL libs? So when I promote, let's say, String.IsNullOrEmpty to tier1 (from the R2R'd version), can I get weights for the cases when the input is empty and when it's not?
Approved, but see my comment about possible race conditions with profiler rejit.
```cpp
{
    hr = E_NOTIMPL;
    COR_ILMETHOD_DECODER decoder(pMD->GetILHeader());
    codeSize = decoder.GetCodeSize();
```
It feels like there is some opportunity for a race condition here while interacting with Profiler Rejit, but otherwise this looks good. I'm going to mark this as approved, but we may need a conversation with @davmason to make sure this is safe.
Agree rejitting could cause troubles. I wonder if this is a broader issue -- we tend to query properties of inlinees over time, and not all in one go.
I actually have the IL size on the jit side so could pass the information down, if needed.
I suspect the broader issue is a thing, but given the rarity of rejit we would never find the issue in our testing.
Inlining has always ignored profiler modified IL, whether it's from SetILFunctionBody or via rejit. When grabbing the IL for inlining the jit grabs it directly from the MethodDesc and doesn't query the ReJITManager or call Module::GetDynamicIL.
It's always been an unexpected gotcha for profilers that if you rewrite method B and method A inlines B, even if the inlining occurs after the IL rewrite, it will inline the original IL. I never knew the reason for this; perhaps it's intentional because of the way the jit requests inlinee data. Starting in 3.0, inlining is explicitly blocked for any rejitted methods: I added logic to CEEInfo::canInline that returns INLINE_FAIL if the method's active IL has been modified with rejit.
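The guard just described amounts to a simple check. Here is an illustrative model in Python -- not the runtime's C++ code, and the function and field names are invented for the example:

```python
# Illustrative model of the inlining guard described above -- not the
# runtime's actual CEEInfo::canInline implementation.
def can_inline(callee):
    # Since 3.0, the runtime refuses to inline a method whose active IL
    # has been replaced via rejit, because the inliner reads the original
    # IL straight from the MethodDesc (bypassing the ReJITManager).
    if callee.get("active_il_modified_by_rejit", False):
        return "INLINE_FAIL"
    return "INLINE_PASS"

print(can_inline({"active_il_modified_by_rejit": True}))  # INLINE_FAIL
print(can_inline({}))                                     # INLINE_PASS
```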
Ok, thanks.
It might be worth tracing out the fuller picture and rationale somewhere, if it's not written down.
@EgorBo no we don't put counters in R2R, normally. If you crossgen with the right options it will enable BBINSTR mode in the jit and allow collection of counts when the code is run; these counts can be read back by the jit during a subsequent crossgen that enables BBOPT (this is the "IBC" mode you may have seen referenced in places -- Instrumented Block Counts). The Roslyn assemblies noted above are set up this way, as are (or should be) some/many of our officially built assemblies. R2R-collected counts are not (yet) available to the jit; we've had a case or two where, because of this, Tier1 code is slower than R2R code. Likewise, methods that bypass Tier0 won't get instrumented; by default this includes methods with loops and methods with prejitted (R2R) code.
To get TieredPGO profile data for as many Tier1 methods as possible, you thus need:
- COMPlus_ReadyToRun=0 (so prejitted methods go through Tier0)
- COMPlus_TC_QuickJitForLoops=1 (so methods with loops go through Tier0)
- COMPlus_TieredPGO=1
Also note the exact time at which a Tier1 rejit happens is unpredictable. The Tier0 method may well be running concurrently, so there is some risk of reading inconsistent counts, and some run-to-run variation in the counts Tier1 sees.
I tried to test a few random micro benchmarks from dotnet/performance with TieredPGO:

```csharp
[Config(typeof(MyConfig))]
public class PgoBench
{
    private class MyConfig : ManualConfig
    {
        public MyConfig()
        {
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithId("Default params"));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_TieredCompilation", "0"))
                .WithId("No TC"));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_ReadyToRun", "0"),
                    new EnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("COMPlus_TieredPGO", "0"))
                .WithId("NoR2R, QJ4L"));
            AddJob(Job.Default.WithRuntime(CoreRuntime.Core50)
                .WithEnvironmentVariables(
                    new EnvironmentVariable("COMPlus_ReadyToRun", "0"),
                    new EnvironmentVariable("COMPlus_TC_QuickJitForLoops", "1"),
                    new EnvironmentVariable("COMPlus_TieredPGO", "1"))
                .WithId("NoR2R, QJ4L, TieredPGO"));
        }
    }

    [Benchmark]
    [Arguments(true)]
    public string BoolToString(bool value) => value.ToString();

    [Benchmark]
    [Arguments("4242.5555")]
    public float StringToFloat(string value) => float.Parse(value, CultureInfo.InvariantCulture);

    static readonly char[] s_colonAndSemicolon = { ':', ';' };

    [Benchmark]
    public int StringIndexOfAny() =>
        "All the world's a stage, and all the men and women merely players: they have their exits and their entrances; and one man in his time plays many parts, his acts being seven ages."
            .IndexOfAny(s_colonAndSemicolon);
}
```

Results:
👍 as expected. (.NET 5 Preview 8) PS: Is OSR already there in the jit, so …?
Nice to see. I've tried running some of the bigger benchmarks and don't see consistent improvement (yet)
No, OSR is not enabled by default (and there are a few bugs to fix). OSR methods jit at Tier1 so should pick up pgo data too. Enable via ….
Also note if you try running bigger things that the "profile data slab" the runtime creates is fixed size and may run out of space, meaning some Tier0 code doesn't get profiled. Current size is 512K bytes, which is 64K profile entries. A Tier0 method uses 2 entry slots for header data, and N slots for counters (where N is the number of basic blocks). So roughly speaking it will fill up at around 8K methods or so. There is no notification/detection if it fills up. The slab design was intended to make import and export easy; it's not ideal for in-process pgo. I have various plans to make the in-process PGO more space efficient, like ….
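As a back-of-envelope check on those slab numbers: the 8-byte entry size is derived from "512K bytes = 64K entries" above, while the average basic-block count per method is an assumption made for the estimate.

```python
# Sanity check of the profile slab capacity estimate from the discussion.
SLAB_BYTES = 512 * 1024   # fixed slab size
ENTRY_BYTES = 8           # derived: 512K bytes / 64K entries
HEADER_SLOTS = 2          # per-method header entries
AVG_BLOCKS = 6            # assumed average Tier0 basic-block count

entries = SLAB_BYTES // ENTRY_BYTES               # 65536 entries total
methods = entries // (HEADER_SLOTS + AVG_BLOCKS)  # methods until full
print(entries, methods)  # 65536 8192 -- "around 8K methods or so"
```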
Initial focus will be ensuring the jit uses the profile data to its best advantage. There is a lot of work to do there too. I am working on a more detailed plan and will try and share it out soon.
@AndyAyersMS thanks for the detailed explanation! A small issue bothers me: it's enough to call a method 30 times to promote it to tier1 and bake BB weights in there forever. So when we inline it in a different context (callsite), those weights can be irrelevant.
Does the jit need the ability to deoptimize methods to solve this problem? Or maybe weights should be reset for new callsites?
Good question. Profile based optimizations rely on the past being a good predictor of the future. So any sort of profiling scheme is vulnerable to the profile not accurately representing future behavior. We are using Tier0 to collect counts because it was fairly easy to set up, and provided an easy way to get profile data into the jit, so we could start improving the ways the jit uses profile data. But as you note, that means we only get to observe the first few calls for some methods. It remains to be seen how well (or poorly) this predicts future behavior in general. I think we will get around to deopt eventually, as we start making stronger bets based on profile data and need the ability to reconsider and reoptimize if we have bet wrong. I don't know if we'd deopt because of poor block layout, though -- I was imagining it would be more for things like guarded (or unguarded) speculative devirtualization.
Could also sample methods? e.g. periodically switch to an optimised version but with the addition of counters; then switch back. If the counts differ interestingly, could then reoptimize the method with the new weights?
@AndyAyersMS note 👆 that's 5.0 RC1 PGO; not this PR
@benaadams interesting -- I don't expect big improvements just yet. Was this just with …? It's possible ASP scenarios overflow the 512K slab, depending on how much ends up getting jitted at Tier0 (meaning some methods don't get pgo data).
Used the triple of environment variables above (COMPlus_ReadyToRun=0, COMPlus_TC_QuickJitForLoops=1, COMPlus_TieredPGO=1).
I'd love to see TieredPGO + Guarded Devirtualization combo 🙂

```csharp
void Foo(IDisposable d)
{
    d.Dispose();
}
```

TieredPGO should somehow detect that in most cases d is, let's say, a Foo with an empty Dispose impl:

```csharp
void Foo(IDisposable d)
{
    if (d is Foo) // guarded devirt.
        return;
    else
        d.Dispose();
}
```
Yes, we are headed in that direction... either we'll get "currently monomorphic" info from the runtime, or else we'll inform ourselves via profiling. For the latter we might be able to peek at the VSD cell state (for interface calls) or else profile the types that reach the call with custom instrumentation.
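The "profile the types that reach the call" idea could look something like the sketch below. This is purely illustrative Python, not runtime code; the helper name and the dominance threshold are invented for the example. The jit would record the concrete type observed at a virtual/interface call site and, if one type clearly dominates, emit a guarded direct call for it.

```python
# Hypothetical sketch of call-site type profiling for guarded
# devirtualization -- not an actual runtime mechanism.
from collections import Counter

def dominant_type(observed_types, threshold=0.5):
    """Return the concrete type that dominates the observations at a
    call site, or None if no type is frequent enough to justify a
    guarded direct call."""
    if not observed_types:
        return None
    ty, n = Counter(observed_types).most_common(1)[0]
    return ty if n / len(observed_types) > threshold else None

# Three of four observed receivers were Foo, so guard on Foo.
print(dominant_type(["Foo", "Foo", "Foo", "Bar"]))  # Foo
```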
Looks like in TE some of the C++ implementations run a PGO loopback warmup as part of the compile step and then recompile with the PGO data: https://github.com/TechEmpower/FrameworkBenchmarks/blob/11bcc746b3444ad393eaaf350060367611ddc8fa/frameworks/C%2B%2B/lithium/lithium.cc#L44-L56 For us, TE does run a warmup test before starting the measurements, which works for tiered compilation; would that also work for runtime PGO?
We'll eventually be building a robust solution for feeding profile data from one run to the next, so some sort of two-phase process might be doable. Though it would be nicer to just get what we need from Tier0.