Stack Allocation Enhancements #104936

Open
20 tasks
AndyAyersMS opened this issue Jul 16, 2024 · 5 comments

@AndyAyersMS
Member

AndyAyersMS commented Jul 16, 2024

Stack allocation of non-escaping ref classes and boxed value classes was enabled in #103361, but only works in limited cases. This issue tracks further enhancements (see also #11192).

Abilities:

  • zeroing strategy (zero in prolog vs zero at first use)
  • stack allocation of some delegates (and perhaps their closures), e.g.
    public static int Test()
    {
        int a = 100;
        Func<int> f = () => { return a; };
        return f();
    }

A small tweak to escape analysis gets the delegate on the stack, but the invoke expansion currently happens in lower, so we don't get any physical promotion. We would need to move the expansion earlier.

See note below.

Analysis:

Implementation:

  • stop relying on top-level ALLOCOBJ
  • stop relying on ALLOCOBJ assigned to single-def temp in importer
  • use custom GC layout for boxes; stop fetching placeholder type from the runtime
  • make sure object stack allocation doesn't block fast tail call optimization unnecessarily (currently fast tail call optimization is disabled if there are any exposed locals)

NAOT:

  • interprocedural escape analysis. May also be viable in jitted contexts, either as a hint for the inliner or as some sort of property we can guarantee on profiler-driven re-jit.

Advanced:

  • partial escape analysis. Allocate objects that are unlikely to escape on the stack. Either compute the "escape frontier" of an object and copy it to the stack when that frontier is crossed, or else add capabilities to write barriers to note when a stack allocated object reference is going to be stored on the heap, and "promote" the object at that point (using GC info to rewrite the stack references to heap references). Likely requires PGO, to ensure we're not wrong too often.
  • for objects that don't escape but that we don't want to stack allocate, treat them as thread private: we can use more aggressive value numbering for instance, since we don't have to assume the field values can change asynchronously.
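
The "escape frontier" variant of partial escape analysis can be pictured with a small standalone model. This is only a sketch under stated assumptions: `Point`, `g_published`, `PromoteToHeap`, and `SumMaybePublish` are hypothetical names invented for illustration, and none of this is JIT code. The object lives on the stack, and only the rare path that would make it heap-visible pays for a heap copy.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical object type, used only for this illustration.
struct Point
{
    int x;
    int y;
};

// Stand-in for the heap-visible world: anything stored here has escaped.
static std::vector<std::unique_ptr<Point>> g_published;

// Counts heap allocations so the effect of the transformation is observable.
static int g_heapAllocs = 0;

// Models crossing the escape frontier: copy the stack object to the heap
// and publish the copy.
static Point* PromoteToHeap(const Point& local)
{
    g_heapAllocs++;
    g_published.push_back(std::make_unique<Point>(local));
    Point* heapCopy = g_published.back().get();
    assert(heapCopy != nullptr);
    return heapCopy;
}

// The object escapes only on the (assumed rare) publish path, so it starts
// life on the stack.
int SumMaybePublish(int x, int y, bool publish)
{
    Point p{x, y}; // models the stack allocated object

    if (publish)
    {
        // Escape frontier: from here on, the heap copy is the object.
        Point* heapCopy = PromoteToHeap(p);
        return heapCopy->x + heapCopy->y;
    }

    // Common path: no heap allocation at all.
    return p.x + p.y;
}
```

Calling `SumMaybePublish(1, 2, false)` leaves `g_heapAllocs` at zero, while the publishing path increments it once. In a real implementation, profile data would decide whether the publish path is rare enough for the speculation to pay off, which is why the bullet above mentions PGO.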

Diagnostics:

  • if VM/GC/WriteBarriers catch an escaped object-on-stack, they should provide a helpful assert instead of a generic GC-hole-like assert. We can either reserve a debug bit in the sync block or rely on the object's address being within the code heap.

FYI @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 16, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 16, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@AndyAyersMS
Member Author

AndyAyersMS commented Jul 17, 2024

For the delegate case, if we add

index 89e28c5978c..a069d98503e 100644
--- a/src/coreclr/jit/objectalloc.cpp
+++ b/src/coreclr/jit/objectalloc.cpp
@@ -719,12 +719,23 @@ bool ObjectAllocator::CanLclVarEscapeViaParentStack(ArrayStack<GenTree*>* parent

             case GT_CALL:
             {
-                GenTreeCall* asCall = parent->AsCall();
+                GenTreeCall* const call = parent->AsCall();

-                if (asCall->IsHelperCall())
+                if (call->IsHelperCall())
                 {
                     canLclVarEscapeViaParentStack =
-                        !Compiler::s_helperCallProperties.IsNoEscape(comp->eeGetHelperNum(asCall->gtCallMethHnd));
+                        !Compiler::s_helperCallProperties.IsNoEscape(comp->eeGetHelperNum(call->gtCallMethHnd));
+                }
+                else if (call->gtCallType == CT_USER_FUNC)
+                {
+                    // Delegate invoke won't escape the delegate which is passed as "this"
+                    // And gets expanded inline later.
+                    //
+                    if ((call->gtCallMoreFlags & GTF_CALL_M_DELEGATE_INV) != 0)
+                    {
+                        GenTree* const thisArg = call->gtArgs.GetThisArg()->GetNode();
+                        canLclVarEscapeViaParentStack = thisArg != tree;
+                    }

Then the example above becomes

; Method Y:Test():int (FullOpts)
G_M53607_IG01:  ;; offset=0x0000
       sub      rsp, 88
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  ymmword ptr [rsp+0x20], ymm4
       vmovdqa  xmmword ptr [rsp+0x40], xmm4
       xor      eax, eax
       mov      qword ptr [rsp+0x50], rax
						;; size=27 bbWeight=1 PerfScore 5.83

G_M53607_IG02:  ;; offset=0x001B
       mov      rcx, 0x7FFD4BB04580      ; Y+<>c__DisplayClass0_0
       call     CORINFO_HELP_NEWSFAST
       mov      dword ptr [rax+0x08], 100
       mov      rcx, 0x7FFD4BB04B30      ; System.Func`1[int]
       mov      qword ptr [rsp+0x20], rcx
       mov      gword ptr [rsp+0x28], rax
       mov      rcx, 0x7FFD4B8783F0      ; code for Y+<>c__DisplayClass0_0:<Test>b__0():int:this
       mov      qword ptr [rsp+0x38], rcx
       lea      rax, [rsp+0x20]
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Func`1[int]:Invoke():int:this
       nop      
						;; size=70 bbWeight=1 PerfScore 11.50

G_M53607_IG03:  ;; offset=0x0061
       add      rsp, 88
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
; Total bytes of code: 102

where the closure is still on the heap and we're invoking the delegate func "directly" but via a convoluted path where we store the func's (indirection cell) address to the stack allocated delegate and then fetch it back and indirect through it.
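
The round trip through the stack slot can be mimicked in a standalone model. This is a rough sketch, not the runtime's actual delegate definition: `Closure`, `DelegateModel`, `LambdaBody`, and both test functions are hypothetical, and only the two offsets visible in the listing (`+0x08` for the target, `+0x18` for the call target) are modeled, assuming a 64-bit target.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical stand-in for the compiler-generated closure class.
struct Closure
{
    const void* methodTable; // 0x00
    int         a;           // 0x08: the captured local
};

using InvokeFn = int (*)(Closure*);

// Simplified delegate layout; fields the listing never touches are placeholders.
struct DelegateModel
{
    const void* methodTable; // 0x00: stored to [rsp+0x20] in the listing
    Closure*    target;      // 0x08: stored to [rsp+0x28], reloaded as "this"
    void*       unmodeled;   // 0x10: not written by the listing
    InvokeFn    methodPtr;   // 0x18: stored to [rsp+0x38], then called through
};

// The offset checks only apply on 64-bit targets.
static_assert(sizeof(void*) != 8 || offsetof(DelegateModel, target) == 0x08, "target offset");
static_assert(sizeof(void*) != 8 || offsetof(DelegateModel, methodPtr) == 0x18, "methodPtr offset");

// Models the lambda body: return the captured value.
static int LambdaBody(Closure* self)
{
    return self->a;
}

// Mirrors the listing: heap closure, stack delegate, store then reload then call.
int TestWithRoundTrip()
{
    std::unique_ptr<Closure> closure = std::make_unique<Closure>(Closure{nullptr, 100});
    DelegateModel d{nullptr, closure.get(), nullptr, &LambdaBody};
    DelegateModel* p = &d;          // lea rax, [rsp+0x20]
    return p->methodPtr(p->target); // load target, call through [rax+0x18]
}

// What promoting the delegate's fields would allow in this model: the stored
// values are known, so the call no longer needs the delegate in memory.
int TestAfterPromotion()
{
    std::unique_ptr<Closure> closure = std::make_unique<Closure>(Closure{nullptr, 100});
    return LambdaBody(closure.get());
}
```

Both functions return 100. The second shows why expanding the invoke before promotion matters: once the delegate's fields are tracked as separate values, the store and reload disappear.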

Ideally we'd like to be able to inline and perhaps realize the closure doesn't escape either, but that seems far off. Perhaps we can just summarily claim the closure can't escape. I am not sure.

Moving delegate invoke expansion earlier does not look to be simple -- currently there is some prep work in morph and then the actual expansion in lower, and tail calls are a complication.
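
The escape rule added by the diff earlier in this comment can be restated as a small standalone model. This is a sketch only: `CallKind`, `CallModel`, and `CanEscapeViaCall` are hypothetical, node identities are reduced to integer ids, and the conservative default of "can escape" is an assumption, since the diff as quoted does not show how the flag is initialized.

```cpp
#include <cassert>

// Hypothetical call classification; the real JIT has more call kinds.
enum class CallKind
{
    Helper,
    UserFunc,
    Other
};

// Hypothetical summary of the facts the diff consults on a call node.
struct CallModel
{
    CallKind kind;
    bool     helperIsNoEscape; // models s_helperCallProperties.IsNoEscape(...)
    bool     isDelegateInvoke; // models the GTF_CALL_M_DELEGATE_INV flag
    int      thisArgId;        // id of the node passed as "this", or -1 if none
};

// Returns true if the local referenced by node `treeId` may escape via this call.
bool CanEscapeViaCall(const CallModel& call, int treeId)
{
    bool canEscape = true; // assumed conservative default

    if (call.kind == CallKind::Helper)
    {
        canEscape = !call.helperIsNoEscape;
    }
    else if (call.kind == CallKind::UserFunc && call.isDelegateInvoke)
    {
        // The delegate passed as "this" does not escape through its own invoke;
        // any other argument is still treated as escaping.
        canEscape = (call.thisArgId != treeId);
    }

    return canEscape;
}
```

In this model, a delegate invoke with `thisArgId == treeId` reports no escape, while the same local passed as an ordinary argument to the invoke still escapes.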

@JulieLeeMSFT JulieLeeMSFT added this to the 10.0.0 milestone Jul 17, 2024
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Jul 17, 2024
@hez2010
Contributor

hez2010 commented Jul 19, 2024

@AndyAyersMS With the array (non-gc elems) support + my field analysis prototype + the above delegate handling (branch at https://github.com/hez2010/runtime/tree/field-stackalloc), the codegen becomes:

G_M30166_IG01:  ;; offset=0x0000
       sub      rsp, 104
       vxorps   xmm4, xmm4, xmm4
       vmovdqu  ymmword ptr [rsp+0x20], ymm4
       vmovdqa  xmmword ptr [rsp+0x40], xmm4
       xor      eax, eax
       mov      qword ptr [rsp+0x50], rax
                                                ;; size=27 bbWeight=1 PerfScore 5.83
G_M30166_IG02:  ;; offset=0x001B
       xor      ecx, ecx
       mov      qword ptr [rsp+0x58], rcx
       mov      dword ptr [rsp+0x60], ecx
       mov      rcx, 0x7FFEBB4211B8      ; Y+<>c__DisplayClass7_0
       mov      qword ptr [rsp+0x58], rcx
       mov      dword ptr [rsp+0x60], 100
       mov      rcx, 0x7FFEBB420EE8      ; System.Func`1[int]
       mov      qword ptr [rsp+0x20], rcx
       lea      rcx, [rsp+0x58]
       mov      qword ptr [rsp+0x28], rcx
       mov      rcx, 0x7FFEBB3F0678      ; code for Y+<>c__DisplayClass7_0:<Test>b__0():int:this
       mov      qword ptr [rsp+0x38], rcx
       lea      rax, [rsp+0x20]
       mov      rcx, gword ptr [rax+0x08]
       call     [rax+0x18]System.Func`1[int]:Invoke():int:this
       nop
                                                ;; size=87 bbWeight=1 PerfScore 14.25
G_M30166_IG03:  ;; offset=0x0072
       add      rsp, 104
       ret

@EgorBo
Member

EgorBo commented Jul 22, 2024

Added Diagnostics section

@hez2010
Contributor

hez2010 commented Aug 16, 2024

Some more useful links:

An overview of the implementation of escape analysis, including interprocedural analysis, in the JVM: https://cr.openjdk.org/~cslucas/escape-analysis/EscapeAnalysis.html

Some insights on allocating objects in a loop, etc.: https://devblogs.microsoft.com/java/improving-openjdk-scalar-replacement-part-2-3/
