-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Guarded devirtualization foundations #21270
Guarded devirtualization foundations #21270
Conversation
guesses the wrong class, but the break-even point (not shown) is at a relatively | ||
small probability of a correct guess. | ||
|
||
![two classes interface devirt](TwoClassesInterface.JPG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small jpg TwoClassesInterface.JPG
=> TwoClassesInterface.jpg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or change the file name to have a CAPs extension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will stick with caps for now, though it looks wrongish.
Beautiful! ❤️ Resolves: https://github.com/dotnet/coreclr/issues/8819 and then some 😄 |
Great analysis in the design doc. One area that may get significant benefits from guarded devirtualization of Interface Calls (over something like Linq) is dependency injection (and ASP.NET Core due to this). Generally types are commonly defined via interface; so they can be easily adapted for the app and for ease of testing; however per app the concrete type to resolve is generally defined at start up and baked to be only one type for the application's lifetime so monomorphic dispatch becomes very common. The impact would likely improve as you move further up the ASP.NET Core stack (into MVC and user code) as layers of interfaces compound on each other through specific call sites. |
May also be worth considering a Through perhaps instrumenting Tier0/Tier1 would determine this and could "deoptimze" the callsite automatically; or it the fallback dispatch stub grew beyond a certain number of permutations this could trigger the "deoptimzation" automatically. |
If you peer back just a bit in the commit chain there is a version where the runtime keeps track of which classes implement a given interface. And if there's just one implementing class for an interface call at the time a method is jitted then the jit will guess that class. That may in fact be the thing we try to get fully working for 3.0, since it tends to impact a relatively small number of call sites (so less code bloat and jit time cost) and should have a pretty good CQ payoff. |
IMHO "add instrumenting Tier1 to collect data and Tier2 to optimize?" sounds good, as Tier 1 is more "stationary" than Tier 0 (i.e. I mean reflects the actual workload better than the startup-phase). |
@benaadams Wow that is a lot of interfaces. @gfoidl As far as adding an "instrumenting" Tier1 -- it is certain something to consider. But it may be a while. The jit isn't set up to take advantage of rich profile feedback (in particular the inliner doesn't use profile information in any meaningful way, and what profile data there is is only read in for top-level methods and is ignored for inlinees). So we have a fairly large backlog of things to address. |
Started looking at the failures. Something is going wrong in the void return case (eg method returning struct implicitly by-ref... as in GitHub_20499). We can't map a GT_RET_EXPR to a GT_NOP -- if we fail to inline we then lose the attached call. |
|
Another common aspnet monomorphic interface callsite (which may be covered) is via generic interface dispatch; with type categorized logging via public class MyPage : PageModel
{
private readonly ILogger<MyPage> _logger;
public MyPage(ILogger<MyPage> logger) => _logger = logger;
public void OnGet() => _logger.LogInformation("Get Request"); // Call site
} Further complicating; it may be the generic interface held in a non-generic field public class MyPage : PageModel
{
private readonly ILogger _logger;
public MyPage(ILogger<MyPage> logger) => _logger = logger;
public void OnGet() => _logger.LogInformation("Get Request"); // Call site
} Further complicating; the fast-path non allocating when logger is off route goes via a static action created by the public class MyPage : PageModel
{
private readonly static Action<ILogger, Exception> _log =
LoggerMessage.Define(
eventId: 1,
logLevel: LogLevel.Information,
formatString: "Get Request");
private readonly ILogger _logger;
public MyPage(ILogger<MyPage> logger) => _logger = logger;
public void OnGet() => _log(_logger, null); // Call site
} Some common more complicated use-cases for the mix? |
For something that gets executed millions of times per second during the TE benchmarks, this fairly innocuous line in Kestrel is probably a wild ride through interfaces, generics, delegates and virtuals: Log.ConnectionKeepAlive(ConnectionId); |
@fiigii will get perf data once things are bit further along... want to fix remaining bugs in the transformation and iron out some annoyances first. @benaadams thanks much for the examples. |
Hoping this x64 failure is just a glitch:
|
Yes, it is also failed in #21264 |
@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0) |
Lay the groundwork for guarded devirtualization of virtual and interface calls in the jit. Introduce the notion of a guarded devirtualization candidate and identify these if regular devirtualization fails. Use simple heuristics to produce a class to guess for. Require that the method that would be invoked if the class guess is correct be a plausible inline candidate. Generalize the calli transformer to become an indirect call transformer. This runs after importation because it needs to introduce control flow and runs before inlining so that the new direct calls it introduces can be inlined. Implement the transformation to duplicate the call site, devirtualize on the side where the class is now known exactly, and turn the resulting direct call into an inline candidate.
d35d4dd
to
7cbdc06
Compare
Needed to rebase to pick up recent fixes. The large commit chain in the initial PR had several conflicts and marginal value so decided to squash it all down... |
Here's an example of the poor interaction of guarded devirt and late devirt: both arms of the guarded call devirtualize (one early, one late) and so both end up calling // example of bad interaction between guarded and late devirt.
class B
{
public virtual int F() => 33;
}
class D : B
{
public override int F() => 44;
}
class X
{
static B G() => new B();
public static int Main()
{
// GuardedDevirt will insert runtime type test here.
//
// Then residiual virtual call to F() will devirt "late"
// after inlining G()
return G().F() + 100 - 33;
}
} ; Assembly listing for method X:Main():int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 OutArgs [V00 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V01 tmp1 [V01,T00] ( 3, 4.20) ref -> rax class-hnd exact "spilling ret_expr"
; V02 tmp2 [V02,T02] ( 3, 1.50) int -> rsi "guarded devirt return temp"
;* V03 tmp3 [V03 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "guarded devirt this exact temp"
; V04 tmp4 [V04,T01] ( 2, 4 ) ref -> rax class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 32
G_M10001_IG01:
56 push rsi
4883EC20 sub rsp, 32
G_M10001_IG02:
48B92813845EFB7F0000 mov rcx, 0x7FFB5E841328
E8BCE7885F call CORINFO_HELP_NEWSFAST
48B92813845EFB7F0000 mov rcx, 0x7FFB5E841328
483908 cmp qword ptr [rax], rcx
7507 jne SHORT G_M10001_IG03
BE21000000 mov esi, 33
EB0A jmp SHORT G_M10001_IG04
G_M10001_IG03:
488BC8 mov rcx, rax
E866F4FFFF call B:F():int:this
8BF0 mov esi, eax
G_M10001_IG04:
8D4643 lea eax, [rsi+67]
G_M10001_IG05:
4883C420 add rsp, 32
5E pop rsi
C3 ret
; Total bytes of code 61, prolog size 5 for method X:Main():int We could perhaps fix this up later if the jit knew that the |
OSX build failed with what look like in-use files:
will retry... @dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test |
Ubuntu arm64 leg has an msbuild failure:
(similar in the other leg). |
Ended up reordering the (debug only) search for early inline failures to simplify the logic, and added comments. Marked the transformer classes as `final` to save some jit cycles. Also removing the option to allow GDv for non-inline candidates as it wasn't working. Will likely sort this out later as it may prove useful in tracking down bugs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good
Verified latest is no-diff if I disable Guarded Devirt. With GuardedDevirt enabled we still see the ~5% size increase. Am going to push a final commit -- once testing above goes green -- that flips the default to disabled, and that's what I will check in. |
Please remove the "WIP" tag. |
arm cross checked network hangup (same for arm64 cross checked)
|
Perf test runs seem perhaps to be looking for python 3.7 and failing to find it:
|
@adiaaida does this python 3.7 thing seem familiar? |
@AndyAyersMS yes. We moved the python3.7 artifact to the regular helix queues, so I moved the smoketests jobs to use the regular queue, but we broke dumpling because python3 was the first on the path, so they went back to a previous deployment so the queue lost python3.7. We are working on a new deployment. See https://github.com/dotnet/core-eng/issues/4753 |
InlineCandidateInfo* gtInlineCandidateInfo; | ||
void* gtStubCallStubAddr; // GTF_CALL_VIRT_STUB - these are never inlined | ||
InlineCandidateInfo* gtInlineCandidateInfo; | ||
GuardedDevirtualizationCandidateInfo* gtGuardedDevirtualizationCandidateInfo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably need to handle the new field in Compiler::gtCloneExpr
. If not, we would see JITStress failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing this out. Unfortunately we can't safely clone inline candidates in isolation like this, as they and the associated GT_RET_EXPR
s must refer to one another. Also we can't safely share inline candidate info across multiple calls.
To put it another there are some special and unusual rules in the IR early on and any cloning of an inline candidate requires cloning of the GT_RET_EXPR
and special post processing to fix things up (which is a good part of what the guarded devirt transformation and up-front GT_RET_EXPR
spilling does).
So what goes on here when cloning calls is wrongish but can't easily be made right. We should I suppose null out the candidate information as that would be somewhat less wrong, but it is still wrong. And we need to clone calls so banishing cloning for inline candidates or guarded devirtualization candidates outright (like we do for GT_RET_EXPR
s) doesn't help either.
I doubt stress mode cloning runs before we've inlined as it would already be busted (and after we get past fgInline
these special complications vanish). So I further doubt these changes will pose any new problems beyond the ones we already have and are avoiding.
But I could be wrong. I'll keep an eye on cloning as I continue working on this feature, but for now don't expect to need to change this aspect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more, probably the right change is to ban cloning of inline candidates via gtCloneExpr
and introduce a separate method that can be used to clone them.
Because inline candidate calls can't ever be subtrees or live on the eval stack it's unlikely anyone would clone them inadvertently (say as part of cloning some ancestor tree), so having a deliberate API to call in this case seems like it might be a good compromise.
Will follow up on this in a subsequent change.
This reverts commit 045f470.
Follow-up from a review comment in dotnet#21270.
Follow-up from a review comment in #21270.
Follow-up from dotnet#21270 and dotnet#21414. Block general cloning from inadvertently cloning a candidate call. Add a separate path for cloning calls that are inline and guarded devirtualization candidates for use by guarded devirtualization. Callers need to take extra steps after cloning one of these calls to properly fix everything up. Also fix up some issues in the large comment for the fat calli transformation.
Follow-up from #21270 and #21414. Block general cloning from inadvertently cloning a candidate call. Add a separate path for cloning calls that are inline and guarded devirtualization candidates for use by guarded devirtualization. Callers need to take extra steps after cloning one of these calls to properly fix everything up. Also fix up some issues in the large comment for the fat calli transformation.
middle and left hand side differ and are more costly. | ||
|
||
For the un-optimized case (orange) the difference is directly attributable to | ||
the performance of the indirect ranch predictor. Even when `p` is small there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo: ranch → branch
Perf tests like those in the design doc: dotnet/performance#174 |
Follow-up from #21270 and #21414. Block general cloning from inadvertently cloning a candidate call. Add a separate path for cloning calls that are inline and guarded devirtualization candidates for use by guarded devirtualization. Callers need to take extra steps after cloning one of these calls to properly fix everything up. Also fix up some issues in the large comment for the fat calli transformation.
Lay the groundwork for guarded devirtualization of virtual and interface calls in the jit. Introduce the notion of a guarded devirtualization candidate and identify these if regular devirtualization fails. Use simple heuristics to produce a class to guess for. Require that the method that would be invoked if the class guess is correct be a plausible inline candidate. Generalize the calli transformer to become an indirect call transformer. This runs after importation because it needs to introduce control flow and runs before inlining so that the new direct calls it introduces can be inlined. Implement the transformation to duplicate the call site, devirtualize on the side where the class is now known exactly, and turn the resulting direct call into an inline candidate. Add a motivation and design document. Commit migrated from dotnet/coreclr@045f470
…coreclr#21414) Follow-up from a review comment in dotnet/coreclr#21270. Commit migrated from dotnet/coreclr@6e9c14e
Follow-up from dotnet/coreclr#21270 and dotnet/coreclr#21414. Block general cloning from inadvertently cloning a candidate call. Add a separate path for cloning calls that are inline and guarded devirtualization candidates for use by guarded devirtualization. Callers need to take extra steps after cloning one of these calls to properly fix everything up. Also fix up some issues in the large comment for the fat calli transformation. Commit migrated from dotnet/coreclr@08be98c
Exploratory changes to enable guarded devirtualization. See the design doc for motivation and discussion.
This series of changes mainly introduces the ability to do the proper transformation. The heuristics for when to transform and what to guess for are quite crude, but serve, for now, as a good stress test. The middle part of the commit stream includes some experimental runtime changes (thanks to @davidwrighton) that would support better decision making, but those have been undone by later commits. They will return at some point.
I will likely split this up into a series of smaller PRs once I've thought though some of the open issues and figured out how I really want to refactor things. But I want to expose this now for some broader testing and early feedback.
[Edit: am now going to just submit this one PR, but the transformation will be disabled by default for now].
cc @dotnet/jit-contrib (feel free to comment, but don't feel obligated)
Current crude heuristic has a pretty sizeable impact on code size: