
JIT: Handle primitive-sized remainders overlapping padding/promotions in physical promotion #88109

Merged

Conversation

jakobbotsch
Member

@jakobbotsch jakobbotsch commented Jun 27, 2023

The remainder may be separated by a bit of padding or other promoted fields but still fit into a primitive; in this case it is still beneficial to copy it all as a primitive, instead of falling back to a full block copy.

Example.

```csharp
private S _s;

void Foo()
{
    S s = new();
    s.A = 10;
    s.D = 20;
    s.F = 30; // A, D, F get promoted
    _s = s;
}

private struct S
{
    public byte A;
    public byte B;
    public byte C;
    public byte D;
    public byte E;
    public byte F;
}
```
```diff
 Processing block operation [000018] that involves replacements
   dst+003 <- V04 (V01.[003..004)) (last use)
   dst+005 <- V05 (V01.[005..006)) (last use)
   Block op remainder: [001..003) [004..005)
-  => Remainder strategy: retain a full block op
+  => Remainder strategy: int at +001
 ;  V00 this         [V00,T01] (  3,  3   )     ref  ->  rcx         this class-hnd single-def
 ;* V01 loc0         [V01    ] (  0,  0   )  struct ( 8) zero-ref    do-not-enreg[SF] ld-addr-op
 ;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
 ;* V03 tmp1         [V03    ] (  0,  0   )   ubyte  ->  zero-ref    "V01.[000..001)"
 ;* V04 tmp2         [V04    ] (  0,  0   )   ubyte  ->  zero-ref    "V01.[003..004)"
 ;* V05 tmp3         [V05    ] (  0,  0   )   ubyte  ->  zero-ref    "V01.[005..006)"
 ;  V06 tmp4         [V06,T00] (  5, 10   )   byref  ->  rcx         single-def "Spilling address for field-by-field copy"
 ;
 ; Lcl frame size = 0
 
 G_M52879_IG01:  ;; offset=0000H
 						;; size=0 bbWeight=1 PerfScore 0.00
 G_M52879_IG02:  ;; offset=0000H
        add      rcx, 8
        xor      eax, eax
-       mov      dword ptr [rcx], eax
-       mov      dword ptr [rcx+02H], eax
+       mov      dword ptr [rcx+01H], eax
        mov      byte  ptr [rcx], 10
        mov      byte  ptr [rcx+03H], 20
        mov      byte  ptr [rcx+05H], 30
-						;; size=22 bbWeight=1 PerfScore 5.50
-G_M52879_IG03:  ;; offset=0016H
+						;; size=20 bbWeight=1 PerfScore 4.50
+G_M52879_IG03:  ;; offset=0014H
        ret      
 						;; size=1 bbWeight=1 PerfScore 1.00
 
-; Total bytes of code 23, prolog size 0, PerfScore 8.80, instruction count 8, allocated bytes for code 23 (MethodHash=1da13170) for method Program:Foo():this (FullOpts)
+; Total bytes of code 21, prolog size 0, PerfScore 7.60, instruction count 7, allocated bytes for code 21 (MethodHash=1da13170) for method Program:Foo():this (FullOpts)
 ; ============================================================
```
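For intuition, here is a minimal sketch (not the JIT's actual implementation) of the covering check behind this strategy choice: the remainder can be copied as a primitive when its covering segment is no wider than the primitive.

```csharp
// Minimal sketch, not the JIT's implementation: a remainder whose
// covering segment is [start..end) can be copied with one primitive
// store at +start if it is no wider than the primitive.
static bool FitsPrimitive(int start, int end, int primSizeBytes)
    => end - start <= primSizeBytes;

// For the example above: segments [001..003) and [004..005) have the
// covering segment [001..005), and 5 - 1 = 4 bytes fit an int, so
// "int at +001" is chosen instead of retaining a full block op.
```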

We have to be careful, however, since the covering segment can now contain promoted fields. If this happens, we need to make sure we write the promoted field after the remainder.
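Concretely, in the example above the covering int at +001 spans bytes [001..005) and therefore overlaps promoted field D at offset 3, so the stores must land in this order (a hypothetical, hand-written C# illustration of the emitted stores, not JIT code):

```csharp
// Hypothetical illustration of the required store ordering for S above.
unsafe static void CopyToDst(byte* dst, byte* src)
{
    // Remainder first: one int store covering [001..005). This also
    // writes the stale bytes where promoted field D (offset 3) lives.
    *(int*)(dst + 1) = *(int*)(src + 1);

    // Promoted fields afterwards; writing D before the wide store
    // would let the remainder clobber its value.
    dst[0] = 10; // A
    dst[3] = 20; // D
    dst[5] = 30; // F
}
```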

@dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Jun 27, 2023
@ghost ghost assigned jakobbotsch Jun 27, 2023
@ghost

ghost commented Jun 27, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.


…l promotion

The remainder may be separated by a bit of padding but still fit into a
primitive; in this case it is still beneficial to copy it all as a
primitive, instead of falling back to a full block copy.

Example.
Before:
```
Processing block operation [000737] that involves replacements
  dst+000 <- V101 (V47.[000..008)) (last use)
  dst+008 <- V102 (V47.[008..016)) (last use)
  dst+016 <- V103 (V47.[016..024)) (last use)
  dst+024 <- V104 (V47.[024..028)) (last use)
  dst+028 <- V105 (V47.[028..029)) (last use)
  Block op remainder: [032..033) [036..040)
  => Remainder strategy: retain a full block op

```

After:
```
Processing block operation [000737] that involves replacements
  dst+000 <- V101 (V47.[000..008)) (last use)
  dst+008 <- V102 (V47.[008..016)) (last use)
  dst+016 <- V103 (V47.[016..024)) (last use)
  dst+024 <- V104 (V47.[024..028)) (last use)
  dst+028 <- V105 (V47.[028..029)) (last use)
  Block op remainder: [032..033) [036..040)
  => Remainder strategy: long at +032
```

This leads to ASM diffs like the following:
```diff
        xor      edx, edx
        mov      qword ptr [rsp+148H], rdx
-						;; size=20 bbWeight=0.33 PerfScore 1.57
-G_M6338_IG24:        ; bbWeight=0.33, nogc, extend
-       vmovdqu  ymm0, ymmword ptr [rsp+128H]
-       vmovdqu  ymmword ptr [rsp+D8H], ymm0
-       mov      rdx, qword ptr [rsp+148H]
-       mov      qword ptr [rsp+F8H], rdx
-						;; size=34 bbWeight=0.33 PerfScore 2.32
-G_M6338_IG25:        ; bbWeight=0.33, extend
        mov      gword ptr [rsp+D8H], r12
-       xor      rdx, rdx
-       ; gcrRegs +[rdx]
+						;; size=28 bbWeight=0.33 PerfScore 1.90
+G_M6338_IG24:        ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
        mov      gword ptr [rsp+E0H], rdx
-						;; size=18 bbWeight=0.33 PerfScore 0.75
-G_M6338_IG26:        ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
-       ; gcrRegs -[rdx]
+						;; size=8 bbWeight=0.33 PerfScore 0.33
+G_M6338_IG25:        ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
        mov      gword ptr [rsp+E8H], rdx
        mov      dword ptr [rsp+F0H], r9d
        mov      byte  ptr [rsp+F4H], 0
+						;; size=24 bbWeight=0.33 PerfScore 0.99
+G_M6338_IG26:        ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
+       mov      qword ptr [rsp+F8H], rdx
```

We have to be careful, however, since the covering segment can now
contain promoted fields. If this happens we need to make sure we write
the promoted field _after_ the remainder.

Unfortunately doing this requires quite a bit of refactoring. I have
extracted all common code for handling creation of derived accesses of
the destination/source into a common class called LocationAccess.
@jakobbotsch jakobbotsch force-pushed the physical-promotion-covering-segment branch from 4ec5010 to 0c9b1c5 on June 30, 2023 22:07
@jakobbotsch
Member Author

/azp run runtime-coreclr superpmi-diffs

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch
Member Author

/azp run runtime-coreclr superpmi-diffs

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@jakobbotsch jakobbotsch marked this pull request as ready for review July 3, 2023 08:16
@jakobbotsch
Member Author

cc @dotnet/jit-contrib PTAL @EgorBo

Diffs. This is not just about padding, but also applies when the remainder overlaps promoted fields; the C# example in the PR description above shows the difference.


@jakobbotsch jakobbotsch requested a review from EgorBo July 3, 2023 08:36
@jakobbotsch jakobbotsch changed the title from "JIT: Handle primitive-sized remainders overlapping padding in physical promotion" to "JIT: Handle primitive-sized remainders overlapping padding/promotions in physical promotion" on Jul 3, 2023
Member

@EgorBo EgorBo left a comment


Are regressions like this

```diff
@@ -13,46 +13,48 @@
 ;  V02 arg1         [V02,T02] (  3,  3   )     ref  ->   r8         class-hnd single-def
 ;  V03 arg2         [V03,T03] (  3,  3   )  struct ( 8)  r9         single-def
 ;# V04 OutArgs      [V04    ] (  1,  1   )  struct ( 0) [rsp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-;  V05 tmp1         [V05,T01] (  3,  6   )  struct (16) [rsp+00H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+;  V05 tmp1         [V05,T04] (  2,  4   )  struct (16) [rsp+08H]   do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
 ;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref    "spilling helperCall"
 ;* V07 tmp3         [V07    ] (  0,  0   )     ref  ->  zero-ref    single-def "V05.[000..008)"
+;  V08 tmp4         [V08,T01] (  3,  6   )   byref  ->  rsi         single-def "Spilling address for field-by-field copy"
 ;
-; Lcl frame size = 16
+; Lcl frame size = 24
 
 G_M36974_IG01:        ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
-       push     rdi
        push     rsi
        push     rbx
-       sub      rsp, 16
+       sub      rsp, 24
        xor      eax, eax
-       mov      qword ptr [rsp], rax
+       mov      qword ptr [rsp+08H], rax
        mov      rbx, rcx
        ; byrRegs +[rbx]
-						;; size=16 bbWeight=1 PerfScore 4.75
+						;; size=16 bbWeight=1 PerfScore 3.75
 G_M36974_IG02:        ; bbWeight=1, gcrefRegs=0100 {r8}, byrefRegs=0008 {rbx}, byref
        ; gcrRegs +[r8]
-       mov      qword ptr [rsp+08H], r9
-       mov      gword ptr [rsp], r8
-       lea      rdi, bword ptr [rbx+08H]
-       ; byrRegs +[rdi]
-       lea      rsi, bword ptr [rsp]
+       mov      qword ptr [rsp+10H], r9
+       lea      rsi, bword ptr [rbx+08H]
        ; byrRegs +[rsi]
-       call     CORINFO_HELP_ASSIGN_BYREF
-       ; gcrRegs -[r8]
-       movsq    
+       mov      rcx, rsi
+       ; byrRegs +[rcx]
+       mov      rdx, r8
+       ; gcrRegs +[rdx]
+       call     CORINFO_HELP_CHECKED_ASSIGN_REF
+       ; gcrRegs -[rdx r8]
+       ; byrRegs -[rcx]
+       mov      rax, qword ptr [rsp+10H]
+       mov      qword ptr [rsi+08H], rax
        xor      rax, rax
        ; gcrRegs +[rax]
        mov      gword ptr [rbx], rax
-						;; size=29 bbWeight=1 PerfScore 6.25
+						;; size=34 bbWeight=1 PerfScore 6.25
 G_M36974_IG03:        ; bbWeight=1, epilog, nogc, extend
-       add      rsp, 16
+       add      rsp, 24
        pop      rbx
        pop      rsi
-       pop      rdi
        ret      
-						;; size=8 bbWeight=1 PerfScore 2.75
+						;; size=7 bbWeight=1 PerfScore 2.25
 
-; Total bytes of code 53, prolog size 13, PerfScore 19.05, instruction count 20, allocated bytes for code 53 (MethodHash=ebd16f91) for method System.Linq.Parallel.HashLookupValueList`2[System.__Canon,System.Nullable`1[int]]:.ctor(System.__Canon,System.Nullable`1[int]):this (FullOpts)
+; Total bytes of code 57, prolog size 13, PerfScore 17.95, instruction count 19, allocated bytes for code 57 (MethodHash=ebd16f91) for method System.Linq.Parallel.HashLookupValueList`2[System.__Canon,System.Nullable`1[int]]:.ctor(System.__Canon,System.Nullable`1[int]):this (FullOpts)
 ; ============================================================
```

expected?

@jakobbotsch
Member Author

Generally we do expect to replace CORINFO_HELP_ASSIGN_BYREF with CORINFO_HELP_CHECKED_ASSIGN_REF in the cases where we switch the remainder from a block copy (where the block has GC pointers) to a TYP_REF primitive copy.
That has the side effect that we no longer get the destination/source increments "for free" as part of that helper, so I suppose that can sometimes result in bigger code. But notice that in this case we get rid of a movsq, which is a very expensive instruction, so perf-wise I would expect this to be beneficial.

Now, this case is a bit peculiar because the fact that we no longer use the full block copy for the remainder turns off an unrelated optimization:

```diff
@@ -1,20 +1,28 @@
 STMT00001 ( 0x000[E-] ... ??? )
                [000027] nA-XG------                         ▌  STORE_BLK struct<System.Linq.Parallel.Pair`2, 16> (copy)
                [000025] ---X-------                         ├──▌  FIELD_ADDR byref  <unknown class>:<unknown field>
                [000026] -----------                         │  └──▌  LCL_VAR   byref  V00 this         
                [000024] -----------                         └──▌  LCL_VAR   struct<System.Linq.Parallel.Pair`2, 16> V05 tmp1          (last use)
 Processing block operation [000027] that involves replacements
   dst+000 <- V07 (V05.[000..008)) (last use)
   Block op remainder: [008..009) [012..016)
-  => Remainder strategy: retain a full block op
-  Will write back V07 (V05.[000..008)) to avoid an additional write barrier
-  Skipping dst+000 <- V07 (V05.[000..008)); it is up-to-date in its struct local and will be handled as part of the remainder
+  => Remainder strategy: long at +008
+
+lvaGrabTemp returning 8 (V08 tmp4) called for Spilling address for field-by-field copy.
+
+Local V05 should not be enregistered because: was accessed as a local field
 New statement:
 STMT00001 ( 0x000[E-] ... ??? )
-               [000040] -A-XG------                         ▌  COMMA     void  
-               [000039] UA---------                         ├──▌  STORE_LCL_FLD ref    V05 tmp1         [+0]
-               [000038] -----------                         │  └──▌  LCL_VAR   ref    V07 tmp3         
-               [000027] nA-XG------                         └──▌  STORE_BLK struct<System.Linq.Parallel.Pair`2, 16> (copy)
-               [000025] ---X-------                            ├──▌  FIELD_ADDR byref  <unknown class>:<unknown field>
-               [000026] -----------                            │  └──▌  LCL_VAR   byref  V00 this         
-               [000024] -----------                            └──▌  LCL_VAR   struct<System.Linq.Parallel.Pair`2, 16> V05 tmp1         
+               [000048] -A-XG------                         ▌  COMMA     void  
+               [000038] DA-X-------                         ├──▌  STORE_LCL_VAR byref  V08 tmp4         
+               [000025] ---X-------                         │  └──▌  FIELD_ADDR byref  <unknown class>:<unknown field>
+               [000026] -----------                         │     └──▌  LCL_VAR   byref  V00 this         
+               [000047] -A--G------                         └──▌  COMMA     void  
+               [000042] nA--G------                            ├──▌  STOREIND  ref   
+               [000041] -----------                            │  ├──▌  LCL_VAR   byref  V08 tmp4         
+               [000040] -----------                            │  └──▌  LCL_VAR   ref    V07 tmp3          (last use)
+               [000046] nA--G------                            └──▌  STOREIND  long  
+               [000045] -----------                               ├──▌  ADD       byref 
+               [000039] -----------                               │  ├──▌  LCL_VAR   byref  V08 tmp4         
+               [000044] -----------                               │  └──▌  CNS_INT   long   8
+               [000043] -----------                               └──▌  LCL_FLD   long   V05 tmp1         [+8]
```

Specifically, in the base we notice that we have a promoted TYP_REF field that is being written to the heap, which will involve a write barrier, in addition to the write barrier imposed for the block copy. So we write it back to the stack first, to avoid this extra write barrier.
This has the side effect that we don't need to write it back after the write barrier call. It also means we don't need to save the target address to a local because we end up only using the address once.

In the diff we don't do this optimization anymore and instead just copy the promoted TYP_REF directly to the heap, incurring the write barrier. It doesn't really matter because overall we still only end up needing one write barrier.
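For reference, a hypothetical C# shape (names assumed, inferred from the Pair`2 dump above) that exercises this path:

```csharp
// Names and layout are assumed, inferred from the Pair`2 dump above:
// a 16-byte struct whose first field is a GC ref (promoted) and whose
// remainder is covered by a single long store at +008.
struct Pair
{
    public object First;  // promoted TYP_REF field
    public long Second;   // lives in the [008..016) remainder
}

class Holder
{
    private Pair _pair;

    public void Set(object first, long second)
    {
        Pair p = default;
        p.First = first;
        p.Second = second;
        // Copying p to the heap stores First (a GC ref) into a heap
        // location, which needs a write barrier either way; the change
        // only affects how the non-ref remainder bytes are copied.
        _pair = p;
    }
}
```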

We should be able to avoid the spilled local address in this case by teaching physical promotion to peel FIELD_ADDR nodes, though that comes with a bit of complexity to get the right null-checking behavior. It would probably have quite decent diffs, though. Will open an issue for that.

@jakobbotsch jakobbotsch merged commit 210a7a5 into dotnet:main Jul 4, 2023
@jakobbotsch jakobbotsch deleted the physical-promotion-covering-segment branch July 4, 2023 16:02
@ghost ghost locked as resolved and limited conversation to collaborators Aug 3, 2023