Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Guarded devirtualization foundations #21270

Merged
merged 8 commits into from
Dec 6, 2018

Conversation

AndyAyersMS
Copy link
Member

@AndyAyersMS AndyAyersMS commented Nov 29, 2018

Exploratory changes to enable guarded devirtualization. See the design doc for motivation and discussion.

This series of changes mainly introduces the ability to do the proper transformation. The heuristics for when to transform and what to guess for are quite crude, but serve, for now, as a good stress test. The middle part of the commit stream includes some experimental runtime changes (thanks to @davidwrighton) that would support better decision making, but those have been undone by later commits. They will return at some point.

I will likely split this up into a series of smaller PRs once I've thought though some of the open issues and figured out how I really want to refactor things. But I want to expose this now for some broader testing and early feedback.

[Edit: am now going to just submit this one PR, but the transformation will be disabled by default for now].

cc @dotnet/jit-contrib (feel free to comment, but don't feel obligated)

Current crude heuristic has a pretty sizeable impact on code size:

PMI Diffs for System.Private.CoreLib.dll, framework assemblies for x64 default jit
Summary:
(Lower is better)
Total bytes of diff: 1923401 (5.15% of base)
    diff is a regression.
Top file regressions by size (bytes):
      436212 : Microsoft.CodeAnalysis.VisualBasic.dasm (7.80% of base)
      339130 : System.Private.Xml.dasm (9.37% of base)
      330916 : Microsoft.CodeAnalysis.CSharp.dasm (7.60% of base)
      105677 : System.Data.Common.dasm (8.07% of base)
       97797 : System.Private.DataContractSerialization.dasm (12.75% of base)
119 total files with size differences (0 improved, 119 regressed), 10 unchanged.
Top method regressions by size (bytes):
       19555 (150.86% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundTreeVisitor`2:VisitInternal(ref,struct):long:this
        8076 (96.52% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - BoundTreeVisitor`2:VisitInternal(ref,double):long:this
        7747 (111.55% of base) : System.Private.Xml.dasm - XmlSerializationWriterCodeGen:WriteElements(ref,ref,ref,ref,ref,ref,bool,bool):this
        7483 (51.50% of base) : System.Data.Common.dasm - XmlTreeGen:HandleTable(ref,ref,ref,bool):ref:this
        7277 (111.64% of base) : Microsoft.CodeAnalysis.CSharp.dasm - BoundTreeVisitor`2:VisitInternal(ref,struct):long:this
Top method improvements by size (bytes):
        -790 (-4.49% of base) : Microsoft.CodeAnalysis.VisualBasic.dasm - Binder:ReportOverloadResolutionFailureForASingleCandidate(ref,ref,int,byref,struct,struct,bool,bool,bool,bool,ref,ref,bool,ref,ref):this
        -155 (-16.47% of base) : Microsoft.CodeAnalysis.CSharp.dasm - IteratorStateMachine:.ctor(ref,ref,ref,int,bool,ref):this
        -132 (-10.56% of base) : System.Net.Requests.dasm - FtpWebRequest:FinishRequestStage(int):int:this
         -82 (-8.35% of base) : System.Net.Http.dasm - WinHttpRequestCallback:OnRequestSendingRequest(ref)
         -67 (-32.84% of base) : System.Collections.NonGeneric.dasm - ReadOnlyCollectionBase:get_Count():int:this
Top method regressions by size (percentage):
         220 (1,047.62% of base) : System.Private.CoreLib.dasm - Progress`1:System.IProgress<T>.Report(struct):this
         158 (831.58% of base) : Newtonsoft.Json.dasm - JToken:System.Dynamic.IDynamicMetaObjectProvider.GetMetaObject(ref):ref:this
         189 (756.00% of base) : System.Private.DataContractSerialization.dasm - XmlWriterDelegator:WriteDecimal(struct):this
         144 (600.00% of base) : System.Net.Requests.dasm - HttpWebResponse:Close():this
         125 (568.18% of base) : System.Private.Xml.dasm - XmlBaseConverter:ToString(ref):ref:this
Top method improvements by size (percentage):
         -10 (-50.00% of base) : System.Private.CoreLib.dasm - CultureInfo:ToString():ref:this (2 base, 1 diff methods)
         -18 (-50.00% of base) : System.Private.CoreLib.dasm - RuntimeAssembly:get_ManifestModule():ref:this (2 base, 1 diff methods)
         -20 (-50.00% of base) : System.Private.CoreLib.dasm - FileStream:get_IsClosed():bool:this (2 base, 1 diff methods)
         -24 (-50.00% of base) : System.Private.CoreLib.dasm - Hashtable:Add(ref,ref):this (2 base, 1 diff methods)
          -9 (-50.00% of base) : System.Runtime.Extensions.dasm - ArrayList:get_Count():int:this (2 base, 1 diff methods)
27393 total methods with size differences (129 improved, 27264 regressed), 163883 unchanged.

guesses the wrong class, but the break-even point (not shown) is at a relatively
small probability of a correct guess.

![two classes interface devirt](TwoClassesInterface.JPG)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small jpg TwoClassesInterface.JPG => TwoClassesInterface.jpg

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or change the file name to have a CAPs extension

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will stick with caps for now, though it looks wrongish.

@benaadams
Copy link
Member

Beautiful! ❤️

Resolves: https://github.com/dotnet/coreclr/issues/8819 and then some 😄

@benaadams
Copy link
Member

Great analysis in the design doc.

One area that may get significant benefits from guarded devirtualization of Interface Calls (over something like Linq) is dependency injection (and ASP.NET Core due to this).

Generally types are commonly defined via interface; so they can be easily adapted for the app and for ease of testing; however per app the concrete type to resolve is generally defined at start up and baked to be only one type for the application's lifetime so monomorphic dispatch becomes very common.

The impact would likely improve as you move further up the ASP.NET Core stack (into MVC and user code) as layers of interfaces compound on each other through specific call sites.

@benaadams
Copy link
Member

benaadams commented Nov 29, 2018

May also be worth considering a MethodImplOptions value for known multimorphic/megamorphic call sites?

Through perhaps instrumenting Tier0/Tier1 would determine this and could "deoptimze" the callsite automatically; or it the fallback dispatch stub grew beyond a certain number of permutations this could trigger the "deoptimzation" automatically.

src/jit/morph.cpp Outdated Show resolved Hide resolved
@AndyAyersMS
Copy link
Member Author

If you peer back just a bit in the commit chain there is a version where the runtime keeps track of which classes implement a given interface. And if there's just one implementing class for an interface call at the time a method is jitted then the jit will guess that class.

That may in fact be the thing we try to get fully working for 3.0, since it tends to impact a relatively small number of call sites (so less code bloat and jit time cost) and should have a pretty good CQ payoff.

@gfoidl
Copy link
Member

gfoidl commented Nov 29, 2018

what is the best mechanism for guessing which class to test for?

IMHO "add instrumenting Tier1 to collect data and Tier2 to optimize?" sounds good, as Tier 1 is more "stationary" than Tier 0 (i.e. I mean reflects the actual workload better than the startup-phase).

@benaadams
Copy link
Member

To quantify, interfaces used in base ASP.NET Core that this could benefit:

aspnetcoreinterfaces

@AndyAyersMS
Copy link
Member Author

@benaadams Wow that is a lot of interfaces.

@gfoidl As far as adding an "instrumenting" Tier1 -- it is certain something to consider. But it may be a while. The jit isn't set up to take advantage of rich profile feedback (in particular the inliner doesn't use profile information in any meaningful way, and what profile data there is is only read in for top-level methods and is ignored for inlinees). So we have a fairly large backlog of things to address.

@AndyAyersMS
Copy link
Member Author

Started looking at the failures.

Something is going wrong in the void return case (eg method returning struct implicitly by-ref... as in GitHub_20499). We can't map a GT_RET_EXPR to a GT_NOP -- if we fail to inline we then lose the attached call.

@fiigii
Copy link

fiigii commented Nov 29, 2018

Total bytes of diff: 1923401 (5.15% of base)
diff is a regression.

  • It would be great if we have perf data as well (like TechEmpower).
  • Adding deoptimization in tiered JIT may help to optimize code size and the fallback handling (e.g., guard propagation, actually it also helps assertion propagation).

@benaadams
Copy link
Member

Another common aspnet monomorphic interface callsite (which may be covered) is via generic interface dispatch; with type categorized logging via interface ILogger<out TCategoryName> : ILogger being the most obvious demonstrator:

public class MyPage : PageModel
{
    private readonly ILogger<MyPage> _logger;
    public MyPage(ILogger<MyPage> logger) => _logger = logger;

    public void OnGet() => _logger.LogInformation("Get Request"); // Call site
}

Further complicating; it may be the generic interface held in a non-generic field

public class MyPage : PageModel
{
    private readonly ILogger _logger;
    public MyPage(ILogger<MyPage> logger) => _logger = logger;

    public void OnGet() => _logger.LogInformation("Get Request"); // Call site
}

Further complicating; the fast-path non allocating when logger is off route goes via a static action created by the LoggerMessage helper Defines; which probably also introduces delegate speculation?:

public class MyPage : PageModel
{
    private readonly static Action<ILogger, Exception> _log = 
        LoggerMessage.Define(
            eventId: 1,
            logLevel: LogLevel.Information,
            formatString: "Get Request");

    private readonly ILogger _logger;
    public MyPage(ILogger<MyPage> logger) => _logger = logger;

    public void OnGet() => _log(_logger, null); // Call site
}

Some common more complicated use-cases for the mix?

@benaadams
Copy link
Member

benaadams commented Nov 29, 2018

For something that gets executed millions of times per second during the TE benchmarks, this fairly innocuous line in Kestrel is probably a wild ride through interfaces, generics, delegates and virtuals:

Log.ConnectionKeepAlive(ConnectionId);

@AndyAyersMS
Copy link
Member Author

@fiigii will get perf data once things are bit further along... want to fix remaining bugs in the transformation and iron out some annoyances first.

@benaadams thanks much for the examples.

@AndyAyersMS
Copy link
Member Author

Hoping this x64 failure is just a glitch:

CSC : error CS2011: Error opening response file 
'D:\j\workspace\x64_checked_w---fa31220a\bin\int\Managed\baseservices\typeequivalence\impl\TypeImpl\TypeImpl.rsp' 
[D:\j\workspace\x64_checked_w---fa31220a\tests\src\baseservices\typeequivalence\impl\TypeImpl.csproj]

@fiigii
Copy link

fiigii commented Nov 29, 2018

Hoping this x64 failure is just a glitch:

Yes, it is also failed in #21264

@AndyAyersMS
Copy link
Member Author

@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test (Jit - TieredCompilation=0)

@benaadams
Copy link
Member

Hoping this x64 failure is just a glitch:

Yes, it is also failed in #21264

Workaround merged #21294 (Switching off test)

Lay the groundwork for guarded devirtualization of virtual and interface
calls in the jit.

Introduce the notion of a guarded devirtualization candidate and identify
these if regular devirtualization fails. Use simple heuristics to produce
a class to guess for. Require that the method that would be invoked if the class
guess is correct be a plausible inline candidate.

Generalize the calli transformer to become an indirect call transformer.
This runs after importation because it needs to introduce control flow and
runs before inlining so that the new direct calls it introduces can be inlined.

Implement the transformation to duplicate the call site, devirtualize on the side
where the class is now known exactly, and turn the resulting direct call into an
inline candidate.
@AndyAyersMS
Copy link
Member Author

Needed to rebase to pick up recent fixes. The large commit chain in the initial PR had several conflicts and marginal value so decided to squash it all down...

@AndyAyersMS
Copy link
Member Author

Here's an example of the poor interaction of guarded devirt and late devirt: both arms of the guarded call devirtualize (one early, one late) and so both end up calling B.F (early devirt'd call gets inlined, late one doesn't):

// example of bad interaction between guarded and late devirt.

class B
{
    public virtual int F() => 33;
}

class D : B
{
    public override int F() => 44;
}

class X
{
    static B G() => new B();

    public static int Main()
    {
        // GuardedDevirt will insert runtime type test here.
        //
        // Then residiual virtual call to F() will devirt "late"
        // after inlining G()
        return G().F() + 100 - 33;
    }
}
; Assembly listing for method X:Main():int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 OutArgs      [V00    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V01 tmp1         [V01,T00] (  3,  4.20)     ref  ->  rax         class-hnd exact "spilling ret_expr"
;  V02 tmp2         [V02,T02] (  3,  1.50)     int  ->  rsi         "guarded devirt return temp"
;* V03 tmp3         [V03    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "guarded devirt this exact temp"
;  V04 tmp4         [V04,T01] (  2,  4   )     ref  ->  rax         class-hnd exact "NewObj constructor temp"
;
; Lcl frame size = 32

G_M10001_IG01:
       56                   push     rsi
       4883EC20             sub      rsp, 32

G_M10001_IG02:
       48B92813845EFB7F0000 mov      rcx, 0x7FFB5E841328
       E8BCE7885F           call     CORINFO_HELP_NEWSFAST
       48B92813845EFB7F0000 mov      rcx, 0x7FFB5E841328
       483908               cmp      qword ptr [rax], rcx
       7507                 jne      SHORT G_M10001_IG03
       BE21000000           mov      esi, 33
       EB0A                 jmp      SHORT G_M10001_IG04

G_M10001_IG03:
       488BC8               mov      rcx, rax
       E866F4FFFF           call     B:F():int:this
       8BF0                 mov      esi, eax

G_M10001_IG04:
       8D4643               lea      eax, [rsi+67]

G_M10001_IG05:
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret

; Total bytes of code 61, prolog size 5 for method X:Main():int

We could perhaps fix this up later if the jit knew that the newobj helper installed its argument in the vtable slot. But probably better to catch this during late devirt and undo the guarded transform on the spot.

@AndyAyersMS
Copy link
Member Author

OSX build failed with what look like in-use files:

16:56:54 [ 64%] Built target paltest_wcschr_test1
16:56:54 ld: can't write to output file: paltest_wcscmp_test1, errno=28 for architecture x86_64
16:56:54 clang: error: linker command failed with exit code 1 (use -v to see invocation)
16:56:54 make[2]: *** [src/pal/tests/palsuite/c_runtime/wcscmp/test1/paltest_wcscmp_test1] 

will retry...

@dotnet-bot test OSX10.12 x64 Checked Innerloop Build and Test

@AndyAyersMS
Copy link
Member Author

Ubuntu arm64 leg has an msbuild failure:

12:52:49 
MSBUILD : error MSB4166: Child node "4" exited prematurely. Shutting down. Diagnostic information may be found in files in "/tmp/" and will be named MSBuild_*.failure.txt. This location can be changed by setting the MSBUILDDEBUGPATH environment variable to a different directory.
12:52:49 
12:52:49 Build FAILED.

(similar in the other leg).

Ended up reordering the (debug only) search for early inline failures
to simplify the logic, and added comments.

Marked the transformer classes as `final` to save some jit cycles.

Also removing the option to allow GDv for non-inline candidates as it
wasn't working. Will likely sort this out later as it may prove useful
in tracking down bugs.
Copy link

@briansull briansull left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks Good

@AndyAyersMS
Copy link
Member Author

Verified latest is no-diff if I disable Guarded Devirt. With GuardedDevirt enabled we still see the ~5% size increase.

Am going to push a final commit -- once testing above goes green -- that flips the default to disabled, and that's what I will check in.

@BruceForstall
Copy link
Member

Please remove the "WIP" tag.

@AndyAyersMS AndyAyersMS changed the title Guarded devirtualization foundations [WIP] Guarded devirtualization foundations Dec 5, 2018
@AndyAyersMS
Copy link
Member Author

arm cross checked network hangup (same for arm64 cross checked)

Will be doing one more push so will let that trigger a retest.
16:15:28     Finished:    JIT.Directed.XUnitWrapper
16:15:36     Finished:    JIT.Performance.XUnitWrapper
16:16:59 FATAL: command execution failed
16:16:59 java.nio.channels.ClosedChannelException

@AndyAyersMS
Copy link
Member Author

Perf test runs seem perhaps to be looking for python 3.7 and failing to find it:

17:59:52 D:\j\w\perf_perflab_---6af0d14a>C:\python3.7.0\python.exe "D:\j\w\perf_perflab_---6af0d14a\Microsoft.BenchView.JSONFormat\tools\submission-metadata.py" --name "coreclr private Guarded devirtualization foundations" --user-email "dotnet-bot@microsoft.com" 
17:59:52 The system cannot find the path specified

@AndyAyersMS
Copy link
Member Author

@adiaaida does this python 3.7 thing seem familiar?

@michellemcdaniel
Copy link

@AndyAyersMS yes. We moved the python3.7 artifact to the regular helix queues, so I moved the smoketests jobs to use the regular queue, but we broke dumpling because python3 was the first on the path, so they went back to a previous deployment so the queue lost python3.7. We are working on a new deployment. See https://github.com/dotnet/core-eng/issues/4753

InlineCandidateInfo* gtInlineCandidateInfo;
void* gtStubCallStubAddr; // GTF_CALL_VIRT_STUB - these are never inlined
InlineCandidateInfo* gtInlineCandidateInfo;
GuardedDevirtualizationCandidateInfo* gtGuardedDevirtualizationCandidateInfo;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to handle the new field in Compiler::gtCloneExpr. If not, we would see JITStress failures.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing this out. Unfortunately we can't safely clone inline candidates in isolation like this, as they and the associated GT_RET_EXPRs must refer to one another. Also we can't safely share inline candidate info across multiple calls.

To put it another there are some special and unusual rules in the IR early on and any cloning of an inline candidate requires cloning of the GT_RET_EXPR and special post processing to fix things up (which is a good part of what the guarded devirt transformation and up-front GT_RET_EXPR spilling does).

So what goes on here when cloning calls is wrongish but can't easily be made right. We should I suppose null out the candidate information as that would be somewhat less wrong, but it is still wrong. And we need to clone calls so banishing cloning for inline candidates or guarded devirtualization candidates outright (like we do for GT_RET_EXPRs) doesn't help either.

I doubt stress mode cloning runs before we've inlined as it would already be busted (and after we get past fgInline these special complications vanish). So I further doubt these changes will pose any new problems beyond the ones we already have and are avoiding.

But I could be wrong. I'll keep an eye on cloning as I continue working on this feature, but for now don't expect to need to change this aspect.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this a bit more, probably the right change is to ban cloning of inline candidates via gtCloneExpr and introduce a separate method that can be used to clone them.

Because inline candidate calls can't ever be subtrees or live on the eval stack it's unlikely anyone would clone them inadvertently (say as part of cloning some ancestor tree), so having a deliberate API to call in this case seems like it might be a good compromise.

Will follow up on this in a subsequent change.

@AndyAyersMS AndyAyersMS merged commit 045f470 into dotnet:master Dec 6, 2018
@AndyAyersMS AndyAyersMS deleted the GuardedDevirtFoundations branch December 6, 2018 17:03
michellemcdaniel added a commit that referenced this pull request Dec 6, 2018
AndyAyersMS added a commit to AndyAyersMS/coreclr that referenced this pull request Dec 6, 2018
AndyAyersMS added a commit that referenced this pull request Dec 6, 2018
AndyAyersMS added a commit to AndyAyersMS/coreclr that referenced this pull request Dec 6, 2018
Follow-up from dotnet#21270 and dotnet#21414.

Block general cloning from inadvertently cloning a candidate call.

Add a separate path for cloning calls that are inline and guarded
devirtualization candidates for use by guarded devirtualization.
Callers need to take extra steps after cloning one of these calls
to properly fix everything up.

Also fix up some issues in the large comment for the fat calli
transformation.
AndyAyersMS added a commit that referenced this pull request Dec 7, 2018
Follow-up from #21270 and #21414.

Block general cloning from inadvertently cloning a candidate call.

Add a separate path for cloning calls that are inline and guarded
devirtualization candidates for use by guarded devirtualization.
Callers need to take extra steps after cloning one of these calls
to properly fix everything up.

Also fix up some issues in the large comment for the fat calli
transformation.
middle and left hand side differ and are more costly.

For the un-optimized case (orange) the difference is directly attributable to
the performance of the indirect ranch predictor. Even when `p` is small there
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: ranch → branch

@AndyAyersMS
Copy link
Member Author

Perf tests like those in the design doc: dotnet/performance#174

morganbr pushed a commit that referenced this pull request Dec 14, 2018
Follow-up from #21270 and #21414.

Block general cloning from inadvertently cloning a candidate call.

Add a separate path for cloning calls that are inline and guarded
devirtualization candidates for use by guarded devirtualization.
Callers need to take extra steps after cloning one of these calls
to properly fix everything up.

Also fix up some issues in the large comment for the fat calli
transformation.
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Lay the groundwork for guarded devirtualization of virtual and interface
calls in the jit.

Introduce the notion of a guarded devirtualization candidate and identify
these if regular devirtualization fails. Use simple heuristics to produce
a class to guess for. Require that the method that would be invoked if the class
guess is correct be a plausible inline candidate.

Generalize the calli transformer to become an indirect call transformer.
This runs after importation because it needs to introduce control flow and
runs before inlining so that the new direct calls it introduces can be inlined.

Implement the transformation to duplicate the call site, devirtualize on the side
where the class is now known exactly, and turn the resulting direct call into an
inline candidate.

Add a motivation and design document.

Commit migrated from dotnet/coreclr@045f470
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Follow-up from dotnet/coreclr#21270 and dotnet/coreclr#21414.

Block general cloning from inadvertently cloning a candidate call.

Add a separate path for cloning calls that are inline and guarded
devirtualization candidates for use by guarded devirtualization.
Callers need to take extra steps after cloning one of these calls
to properly fix everything up.

Also fix up some issues in the large comment for the fat calli
transformation.


Commit migrated from dotnet/coreclr@08be98c
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.