JIT: Extend escape analysis to account for arrays with non-gcref elements #104906

hez2010 · 2024-07-15T17:09:51Z

Positive case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[1] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;* V03 tmp2         [V03    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V04 tmp3         [V04    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V05 tmp4         [V05    ] (  0,  0   )   short  ->  zero-ref    "V02.[018..020)"
;
; Lcl frame size = 40

G_M25548_IG01:  ;; offset=0x0000
       sub      rsp, 40
                                                ;; size=4 bbWeight=1 PerfScore 0.25
G_M25548_IG02:  ;; offset=0x0004
       mov      ecx, 84
       call     [System.Console:WriteLine(int)]
       nop
                                                ;; size=12 bbWeight=1 PerfScore 3.50
G_M25548_IG03:  ;; offset=0x0010
       add      rsp, 40
       ret
                                                ;; size=5 bbWeight=1 PerfScore 1.25

; Total bytes of code 21, prolog size 4, PerfScore 5.00, instruction count 6, allocated bytes for code 21 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)
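
The `struct (104)` size of the "stack allocated array temp" and the promoted field ranges in the listing can be reproduced with a little arithmetic. The sketch below is not code from the PR. It assumes the 64-bit array layout implied by the listing: an 8-byte method table pointer at `[000..008)`, a 4-byte length at `[008..012)`, and elements starting at offset 16. The names `kArrayHeaderSize`, `AlignUp`, `StackArraySize` and `ElementOffset` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed 64-bit array header: method table pointer (8 bytes),
// length (4 bytes), padding (4 bytes). Elements start right after it.
const std::size_t kArrayHeaderSize = 16;

// Round `value` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t AlignUp(std::size_t value, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}

// Bytes needed for a stack temp that holds an array of `length` elements,
// rounded up to pointer size (8 bytes, an assumption for x64).
inline std::size_t StackArraySize(std::size_t elemSize, std::size_t length)
{
    return AlignUp(kArrayHeaderSize + elemSize * length, 8);
}

// Byte offset of element `index` inside that temp.
inline std::size_t ElementOffset(std::size_t elemSize, std::size_t index)
{
    return kArrayHeaderSize + elemSize * index;
}
```

With 2-byte `char` elements, `StackArraySize(2, 42)` is 16 + 84 = 100 rounded up to 104, matching `struct (104)`, and `ElementOffset(2, 1)` is 18, matching the `V02.[018..020)` field for `chs[1]`.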

Negative case:

var chs = new char[42];
chs[1] = 'a';
Console.WriteLine((int)chs[42] + chs.Length);

Codegen:

; Assembly listing for method ArrayAllocator.Program:Main() (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;* V00 loc0         [V00    ] (  0,  0   )    long  ->  zero-ref    class-hnd exact <short[]>
;  V01 OutArgs      [V01    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V02 tmp1         [V02    ] (  0,  0   )  struct (104) zero-ref    do-not-enreg[SF] "stack allocated array temp"
;  V03 tmp2         [V03,T00] (  1,  0   )   byref  ->  rbx         must-init "dummy temp of must thrown exception"
;* V04 tmp3         [V04    ] (  0,  0   )    long  ->  zero-ref    single-def "V02.[000..008)"
;* V05 tmp4         [V05    ] (  0,  0   )     int  ->  zero-ref    single-def "V02.[008..012)"
;* V06 tmp5         [V06    ] (  0,  0   )   short  ->  zero-ref    single-def "V02.[018..020)"
;
; Lcl frame size = 32

G_M25548_IG01:  ;; offset=0x0000
       push     rbx
       sub      rsp, 32
       xor      ebx, ebx
                                                ;; size=7 bbWeight=0 PerfScore 0.00
G_M25548_IG02:  ;; offset=0x0007
       call     CORINFO_HELP_RNGCHKFAIL
       movsx    rcx, word  ptr [rbx]
       call     [System.Console:WriteLine(int)]
       int3
                                                ;; size=16 bbWeight=0 PerfScore 0.00

; Total bytes of code 23, prolog size 5, PerfScore 0.00, instruction count 7, allocated bytes for code 23 (MethodHash=5b0b9c33) for method ArrayAllocator.Program:Main() (FullOpts)
; ============================================================
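
In the negative case both the index (42) and the array length (42) are known at JIT time, so the range check can be decided statically and the method reduces to an unconditional call to `CORINFO_HELP_RNGCHKFAIL`. A minimal sketch of that decision, assuming a constant index and a constant length, is below. The `BoundsCheck` enum and `ClassifyConstantIndex` are hypothetical names, not the JIT's own.

```cpp
#include <cassert>
#include <cstdint>

enum class BoundsCheck
{
    Removable,   // index is provably in range: the check can be dropped
    AlwaysFails  // index is provably out of range: emit an unconditional throw
};

// Classify a constant index against a constant array length.
// A single unsigned comparison also rejects negative indices.
inline BoundsCheck ClassifyConstantIndex(std::int32_t index, std::int32_t length)
{
    assert(length >= 0);
    return static_cast<std::uint32_t>(index) < static_cast<std::uint32_t>(length)
               ? BoundsCheck::Removable
               : BoundsCheck::AlwaysFails;
}
```

For the two cases above, `ClassifyConstantIndex(1, 42)` is `Removable` and `ClassifyConstantIndex(42, 42)` is `AlwaysFails`.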

Benchmark on Mandelbrot:

Method	Job	Mean	Error	StdDev	Code Size	Allocated
MandelBrot	NoStackAllocationArray	199.7 us	1.30 us	1.22 us	1,996 B	2.49 KB
MandelBrot	StackAllocationArray	195.8 us	1.16 us	1.08 us	2,414 B	1.14 KB

Diff: https://www.diffchecker.com/bNP4qHdF/

src/coreclr/jit/objectalloc.h

src/coreclr/jit/objectalloc.cpp

AndyAyersMS

For arrays (and also perhaps boxes and ref classes) we ought to have some kind of size limit... possibly similar to the one we use for stackallocs.

We need to be careful we don't allocate a lot of stack for an object that might not be heavily used, as we'll pay per-call prolog zeroing costs.
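
One way such a limit could look is sketched below. This is an illustration only: the 512-byte budget is a made-up placeholder (the thread does not settle on a number), the 16-byte header is the assumed 64-bit array layout, and `FitsStackBudget` is a hypothetical name rather than a JIT function.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kMaxStackArrayBytes = 512; // hypothetical budget, not from the PR
const std::size_t kArrayHeaderBytes = 16;    // assumed 64-bit array header size

// Returns true when header + elemSize * length fits within `budget`.
// The comparison is written as a division so elemSize * length cannot overflow.
inline bool FitsStackBudget(std::size_t elemSize,
                            std::size_t length,
                            std::size_t budget = kMaxStackArrayBytes)
{
    assert(elemSize > 0);
    if (budget < kArrayHeaderBytes)
    {
        return false;
    }
    return length <= (budget - kArrayHeaderBytes) / elemSize;
}
```

Under these assumptions a `char[42]` (100 bytes) fits, while an `int[1000]` does not and would stay on the heap, which also bounds the prolog zeroing cost per call.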

src/coreclr/jit/lclmorph.cpp

hez2010 · 2024-12-07T11:07:38Z

@AndyAyersMS All tests are now green and this is ready to merge. Please take another look.
@MihuBot

AndyAyersMS · 2024-12-08T19:43:11Z

I have some other changes to escape analysis which are going to conflict, so my plan is to merge those first and then pick this (or something like it) up later. Not sure how long that will take, hopefully not too long.

In the meantime, could you check if your changes to gtFoldExpr and morph resolve #107542, and if so, split those off separately?

Also if you want to peel off the change to always use a temp for newarr we could take that in advance too; it would be nice to see it go in as a zero diff prerequisite.

AndyAyersMS · 2024-12-14T01:04:04Z

@hez2010 can you resolve conflicts? The work I was doing was held up so maybe we can work on this and get it in first.

This reverts commit 1914e80.

hez2010 · 2024-12-14T09:01:52Z

@MihuBot

AndyAyersMS · 2024-12-18T21:07:55Z

Hopefully #110787 unblocks this.

AndyAyersMS · 2024-12-18T22:26:01Z

@hez2010 given the small number of diffs from MihuBot, it would be good to understand what changes might be needed elsewhere to make this more effective.

I'm guessing the main blocker is lack of inlining, but a quantitative analysis might reveal other things.

AndyAyersMS · 2024-12-19T00:12:05Z

Some interesting diffs from SPMI

hez2010 · 2024-12-19T00:24:43Z

One of the regressions comes from Array.ForEach(new int[1], null):

-       sub      rsp, 40
+       sub      rsp, 56
 						;; size=4 bbWeight=0 PerfScore 0.00
 G_M52314_IG02:        ; bbWeight=0, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
+       vxorps   xmm0, xmm0, xmm0
+       vmovdqu  xmmword ptr [rsp+0x20], xmm0
+       vmovdqu  xmmword ptr [rsp+0x24], xmm0
+       mov      rcx, 0xD1FFAB1E      ; int[]
+       mov      qword ptr [rsp+0x20], rcx
+       mov      dword ptr [rsp+0x28], 1
        mov      ecx, 28
        call     [System.ThrowHelper:ThrowArgumentNullException(int)]
        ; gcr arg pop 0
        int3

The loop was originally unrolled, but it no longer is. It seems we need to propagate the assertion into loops so that the bound can be replaced by a constant, which is tracked in #110501.

AndyAyersMS · 2024-12-19T00:31:49Z

I will dig into some of these tomorrow. Need to look closely at the dumps.

src/coreclr/jit/lclmorph.cpp

hez2010 · 2024-12-23T19:24:52Z

@AndyAyersMS BTW we can mark Array.Copy and SpanHelper.Memmove as non-escaping to see if it can give us more opportunities.

AndyAyersMS · 2024-12-24T01:54:05Z

> @AndyAyersMS BTW we can mark Array.Copy and SpanHelper.Memmove as non-escaping to see if it can give us more opportunities.

If we start passing stack allocated ref classes to callees we also have to fix the GC reporting for those callee arguments to be managed (not object) pointers (and transitively, fix reporting for any place those arguments can propagate, including possibly in the native parts of the runtime). So there is (perhaps considerable) extra work.

hez2010 · 2024-12-24T09:09:17Z

@EgorBo and I discussed on Discord that we could probe the size argument using value probing, so that for arrays of unknown size we can do "guarded stack allocation" in the future.

It would effectively replace

Span<int> arr = new int[size];

with

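// Illustrative only: the unsigned compare sends negative sizes to the heap path.
bool useStack = (uint)size < 16;
Span<int> arr = useStack ? stackalloc int[16] : new int[size];
if (useStack)
{
    // Expose only the requested length of the fixed-size stack buffer.
    arr = arr.Slice(0, size);
}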
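
The same guard can be sketched outside of C#. The C++ class below is only an analogy for the idea, not how the JIT would implement it: requests below a fixed threshold use an inline buffer that lives wherever the object lives (the stack, for a local), and larger requests fall back to the heap. `GuardedIntBuffer` and its members are hypothetical names, and the threshold of 16 is taken from the snippet above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

class GuardedIntBuffer
{
public:
    static constexpr std::size_t kInlineCapacity = 16; // threshold from the snippet above

    explicit GuardedIntBuffer(std::size_t size)
        : m_size(size)
    {
        // Only the large case touches the heap; the inline buffer is
        // already zero-initialized by its default member initializer.
        if (size >= kInlineCapacity)
        {
            m_heap.assign(size, 0);
        }
    }

    bool UsesInlineStorage() const { return m_size < kInlineCapacity; }
    std::size_t Size() const { return m_size; }

    // Bounds are checked against the requested size, not the buffer capacity.
    int& operator[](std::size_t i)
    {
        assert(i < m_size);
        return UsesInlineStorage() ? m_inline[i] : m_heap[i];
    }

private:
    std::size_t m_size;
    std::array<int, kInlineCapacity> m_inline{};
    std::vector<int> m_heap;
};
```

A local `GuardedIntBuffer small(8)` never allocates, while `GuardedIntBuffer big(100)` does. The cost of the guard is that the inline buffer is reserved and zeroed even when the heap path is taken.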

hez2010 added 7 commits July 15, 2024 20:23

initial prototype

1b0e3d3

Morph ARR_LENGTH and INDEX_ADDR

57b7e42

Fix incorrect array length storage

1b5b25e

Use offset and correct type

395b735

handle reassignment

17de70b

range check

5443c42

throw range check failure

b2d07da

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Jul 15, 2024

dotnet-policy-service bot added the community-contribution label Jul 15, 2024

hez2010 added 2 commits July 16, 2024 02:13

update comments

b5ae9e7

add metrics

87b29de

jakobbotsch reviewed Jul 15, 2024

src/coreclr/jit/objectalloc.h (outdated, resolved)

jakobbotsch reviewed Jul 15, 2024

src/coreclr/jit/objectalloc.cpp (outdated, resolved)

jakobbotsch reviewed Jul 15, 2024

src/coreclr/jit/objectalloc.cpp (outdated, resolved)

minor cleanup

eeb681d

AndyAyersMS reviewed Jul 15, 2024

AndyAyersMS mentioned this pull request Jul 16, 2024

Stack Allocation Enhancements #104936 (open, 20 tasks)

hez2010 added 5 commits July 16, 2024 14:13

Introduce new temp and implement local address morphing

dee9f38

handle index out-of-range

94c103b

Refactor to remove duplicates

12b297b

Remove invalid asserts

e0fa91e

make compiler happy

9e0a04f

jakobbotsch reviewed Jul 16, 2024

src/coreclr/jit/lclmorph.cpp (outdated, resolved)