-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Handle primitive-sized remainders overlapping padding/promotions in physical promotion #88109
JIT: Handle primitive-sized remainders overlapping padding/promotions in physical promotion #88109
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch Issue DetailsThe remainder may be separated by a bit of padding but still fit into a primitive; in this case it is still beneficial to copy it all as a primitive, instead of falling back to a full block copy. Example. Processing block operation [000737] that involves replacements
dst+000 <- V101 (V47.[000..008)) (last use)
dst+008 <- V102 (V47.[008..016)) (last use)
dst+016 <- V103 (V47.[016..024)) (last use)
dst+024 <- V104 (V47.[024..028)) (last use)
dst+028 <- V105 (V47.[028..029)) (last use)
Block op remainder: [032..033) [036..040)
=> Remainder strategy: retain a full block op
After: Processing block operation [000737] that involves replacements
dst+000 <- V101 (V47.[000..008)) (last use)
dst+008 <- V102 (V47.[008..016)) (last use)
dst+016 <- V103 (V47.[016..024)) (last use)
dst+024 <- V104 (V47.[024..028)) (last use)
dst+028 <- V105 (V47.[028..029)) (last use)
Block op remainder: [032..033) [036..040)
=> Remainder strategy: long at +032 This leads to ASM diffs like the following: xor edx, edx
mov qword ptr [rsp+148H], rdx
- ;; size=20 bbWeight=0.33 PerfScore 1.57
-G_M6338_IG24: ; bbWeight=0.33, nogc, extend
- vmovdqu ymm0, ymmword ptr [rsp+128H]
- vmovdqu ymmword ptr [rsp+D8H], ymm0
- mov rdx, qword ptr [rsp+148H]
- mov qword ptr [rsp+F8H], rdx
- ;; size=34 bbWeight=0.33 PerfScore 2.32
-G_M6338_IG25: ; bbWeight=0.33, extend
mov gword ptr [rsp+D8H], r12
- xor rdx, rdx
- ; gcrRegs +[rdx]
+ ;; size=28 bbWeight=0.33 PerfScore 1.90
+G_M6338_IG24: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
mov gword ptr [rsp+E0H], rdx
- ;; size=18 bbWeight=0.33 PerfScore 0.75
-G_M6338_IG26: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
- ; gcrRegs -[rdx]
+ ;; size=8 bbWeight=0.33 PerfScore 0.33
+G_M6338_IG25: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
mov gword ptr [rsp+E8H], rdx
mov dword ptr [rsp+F0H], r9d
mov byte ptr [rsp+F4H], 0
+ ;; size=24 bbWeight=0.33 PerfScore 0.99
+G_M6338_IG26: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref
+ mov qword ptr [rsp+F8H], rdx We have to be careful, however, since the covering segment can now contain promoted fields. If this happens we need to make sure we write the promoted field after the remainder.
|
581adc9
to
4ec5010
Compare
…l promotion The remainder may be separated by a bit of padding but still fit into a primitive; in this case it is still beneficial to copy it all as a primitive, instead of falling back to a full block copy. Example. Before: ``` Processing block operation [000737] that involves replacements dst+000 <- V101 (V47.[000..008)) (last use) dst+008 <- V102 (V47.[008..016)) (last use) dst+016 <- V103 (V47.[016..024)) (last use) dst+024 <- V104 (V47.[024..028)) (last use) dst+028 <- V105 (V47.[028..029)) (last use) Block op remainder: [032..033) [036..040) => Remainder strategy: retain a full block op ``` After: ``` Processing block operation [000737] that involves replacements dst+000 <- V101 (V47.[000..008)) (last use) dst+008 <- V102 (V47.[008..016)) (last use) dst+016 <- V103 (V47.[016..024)) (last use) dst+024 <- V104 (V47.[024..028)) (last use) dst+028 <- V105 (V47.[028..029)) (last use) Block op remainder: [032..033) [036..040) => Remainder strategy: long at +032 ``` This leads to ASM diffs like the following: ```diff xor edx, edx mov qword ptr [rsp+148H], rdx - ;; size=20 bbWeight=0.33 PerfScore 1.57 -G_M6338_IG24: ; bbWeight=0.33, nogc, extend - vmovdqu ymm0, ymmword ptr [rsp+128H] - vmovdqu ymmword ptr [rsp+D8H], ymm0 - mov rdx, qword ptr [rsp+148H] - mov qword ptr [rsp+F8H], rdx - ;; size=34 bbWeight=0.33 PerfScore 2.32 -G_M6338_IG25: ; bbWeight=0.33, extend mov gword ptr [rsp+D8H], r12 - xor rdx, rdx - ; gcrRegs +[rdx] + ;; size=28 bbWeight=0.33 PerfScore 1.90 +G_M6338_IG24: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref mov gword ptr [rsp+E0H], rdx - ;; size=18 bbWeight=0.33 PerfScore 0.75 -G_M6338_IG26: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref - ; gcrRegs -[rdx] + ;; size=8 bbWeight=0.33 PerfScore 0.33 +G_M6338_IG25: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref mov gword ptr [rsp+E8H], rdx mov dword ptr [rsp+F0H], r9d mov byte ptr [rsp+F4H], 0 + ;; size=24 bbWeight=0.33 PerfScore 0.99 +G_M6338_IG26: ; bbWeight=0.33, gcrefRegs=1028 {rbx rbp r12}, byrefRegs=4000 {r14}, byref + mov qword ptr [rsp+F8H], rdx ``` We have to be careful, however, since the covering segment can now contain promoted fields. If this happens we need to make sure we write the promoted field _after_ the remainder. Unfortunately doing this requires quite a bit of refactoring. I have extracted all common code for handling creation of derived accesses of the destination/source into a common class called LocationAccess.
4ec5010
to
0c9b1c5
Compare
/azp run runtime-coreclr superpmi-diffs |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run runtime-coreclr superpmi-diffs |
Azure Pipelines successfully started running 1 pipeline(s). |
cc @dotnet/jit-contrib PTAL @EgorBo Diffs. This is not just about padding, but also when the remainder is overlapping with promoted fields. A C# example of the difference: private S _s;
void Foo()
{
S s = new();
s.A = 10;
s.D = 20;
s.F = 30; // A, D, F gets promoted
_s = s;
}
private struct S
{
public byte A;
public byte B;
public byte C;
public byte D;
public byte E;
public byte F;
} Processing block operation [000018] that involves replacements
dst+003 <- V04 (V01.[003..004)) (last use)
dst+005 <- V05 (V01.[005..006)) (last use)
Block op remainder: [001..003) [004..005)
- => Remainder strategy: retain a full block op
+ => Remainder strategy: int at +001 ; V00 this [V00,T01] ( 3, 3 ) ref -> rcx this class-hnd single-def
;* V01 loc0 [V01 ] ( 0, 0 ) struct ( 8) zero-ref do-not-enreg[SF] ld-addr-op
;# V02 OutArgs [V02 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;* V03 tmp1 [V03 ] ( 0, 0 ) ubyte -> zero-ref "V01.[000..001)"
;* V04 tmp2 [V04 ] ( 0, 0 ) ubyte -> zero-ref "V01.[003..004)"
;* V05 tmp3 [V05 ] ( 0, 0 ) ubyte -> zero-ref "V01.[005..006)"
; V06 tmp4 [V06,T00] ( 5, 10 ) byref -> rcx single-def "Spilling address for field-by-field copy"
;
; Lcl frame size = 0
G_M52879_IG01: ;; offset=0000H
;; size=0 bbWeight=1 PerfScore 0.00
G_M52879_IG02: ;; offset=0000H
add rcx, 8
xor eax, eax
- mov dword ptr [rcx], eax
- mov dword ptr [rcx+02H], eax
+ mov dword ptr [rcx+01H], eax
mov byte ptr [rcx], 10
mov byte ptr [rcx+03H], 20
mov byte ptr [rcx+05H], 30
- ;; size=22 bbWeight=1 PerfScore 5.50
-G_M52879_IG03: ;; offset=0016H
+ ;; size=20 bbWeight=1 PerfScore 4.50
+G_M52879_IG03: ;; offset=0014H
ret
;; size=1 bbWeight=1 PerfScore 1.00
-; Total bytes of code 23, prolog size 0, PerfScore 8.80, instruction count 8, allocated bytes for code 23 (MethodHash=1da13170) for method Program:Foo():this (FullOpts)
+; Total bytes of code 21, prolog size 0, PerfScore 7.60, instruction count 7, allocated bytes for code 21 (MethodHash=1da13170) for method Program:Foo():this (FullOpts)
; ============================================================
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are regressions like this
@@ -13,46 +13,48 @@
; V02 arg1 [V02,T02] ( 3, 3 ) ref -> r8 class-hnd single-def
; V03 arg2 [V03,T03] ( 3, 3 ) struct ( 8) r9 single-def
;# V04 OutArgs [V04 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V05 tmp1 [V05,T01] ( 3, 6 ) struct (16) [rsp+00H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+; V05 tmp1 [V05,T04] ( 2, 4 ) struct (16) [rsp+08H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
;* V06 tmp2 [V06 ] ( 0, 0 ) long -> zero-ref "spilling helperCall"
;* V07 tmp3 [V07 ] ( 0, 0 ) ref -> zero-ref single-def "V05.[000..008)"
+; V08 tmp4 [V08,T01] ( 3, 6 ) byref -> rsi single-def "Spilling address for field-by-field copy"
;
-; Lcl frame size = 16
+; Lcl frame size = 24
G_M36974_IG01: ; bbWeight=1, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG
- push rdi
push rsi
push rbx
- sub rsp, 16
+ sub rsp, 24
xor eax, eax
- mov qword ptr [rsp], rax
+ mov qword ptr [rsp+08H], rax
mov rbx, rcx
; byrRegs +[rbx]
- ;; size=16 bbWeight=1 PerfScore 4.75
+ ;; size=16 bbWeight=1 PerfScore 3.75
G_M36974_IG02: ; bbWeight=1, gcrefRegs=0100 {r8}, byrefRegs=0008 {rbx}, byref
; gcrRegs +[r8]
- mov qword ptr [rsp+08H], r9
- mov gword ptr [rsp], r8
- lea rdi, bword ptr [rbx+08H]
- ; byrRegs +[rdi]
- lea rsi, bword ptr [rsp]
+ mov qword ptr [rsp+10H], r9
+ lea rsi, bword ptr [rbx+08H]
; byrRegs +[rsi]
- call CORINFO_HELP_ASSIGN_BYREF
- ; gcrRegs -[r8]
- movsq
+ mov rcx, rsi
+ ; byrRegs +[rcx]
+ mov rdx, r8
+ ; gcrRegs +[rdx]
+ call CORINFO_HELP_CHECKED_ASSIGN_REF
+ ; gcrRegs -[rdx r8]
+ ; byrRegs -[rcx]
+ mov rax, qword ptr [rsp+10H]
+ mov qword ptr [rsi+08H], rax
xor rax, rax
; gcrRegs +[rax]
mov gword ptr [rbx], rax
- ;; size=29 bbWeight=1 PerfScore 6.25
+ ;; size=34 bbWeight=1 PerfScore 6.25
G_M36974_IG03: ; bbWeight=1, epilog, nogc, extend
- add rsp, 16
+ add rsp, 24
pop rbx
pop rsi
- pop rdi
ret
- ;; size=8 bbWeight=1 PerfScore 2.75
+ ;; size=7 bbWeight=1 PerfScore 2.25
-; Total bytes of code 53, prolog size 13, PerfScore 19.05, instruction count 20, allocated bytes for code 53 (MethodHash=ebd16f91) for method System.Linq.Parallel.HashLookupValueList`2[System.__Canon,System.Nullable`1[int]]:.ctor(System.__Canon,System.Nullable`1[int]):this (FullOpts)
+; Total bytes of code 57, prolog size 13, PerfScore 17.95, instruction count 19, allocated bytes for code 57 (MethodHash=ebd16f91) for method System.Linq.Parallel.HashLookupValueList`2[System.__Canon,System.Nullable`1[int]]:.ctor(System.__Canon,System.Nullable`1[int]):this (FullOpts)
; ============================================================
expected?
Generally we do expect to replace Now, this case is a bit peculiar because the fact that we no longer uses the full block copy for the remainder turns off an unrelated optimization: @@ -1,20 +1,28 @@
STMT00001 ( 0x000[E-] ... ??? )
[000027] nA-XG------ ▌ STORE_BLK struct<System.Linq.Parallel.Pair`2, 16> (copy)
[000025] ---X------- ├──▌ FIELD_ADDR byref <unknown class>:<unknown field>
[000026] ----------- │ └──▌ LCL_VAR byref V00 this
[000024] ----------- └──▌ LCL_VAR struct<System.Linq.Parallel.Pair`2, 16> V05 tmp1 (last use)
Processing block operation [000027] that involves replacements
dst+000 <- V07 (V05.[000..008)) (last use)
Block op remainder: [008..009) [012..016)
- => Remainder strategy: retain a full block op
- Will write back V07 (V05.[000..008)) to avoid an additional write barrier
- Skipping dst+000 <- V07 (V05.[000..008)); it is up-to-date in its struct local and will be handled as part of the remainder
+ => Remainder strategy: long at +008
+
+lvaGrabTemp returning 8 (V08 tmp4) called for Spilling address for field-by-field copy.
+
+Local V05 should not be enregistered because: was accessed as a local field
New statement:
STMT00001 ( 0x000[E-] ... ??? )
- [000040] -A-XG------ ▌ COMMA void
- [000039] UA--------- ├──▌ STORE_LCL_FLD ref V05 tmp1 [+0]
- [000038] ----------- │ └──▌ LCL_VAR ref V07 tmp3
- [000027] nA-XG------ └──▌ STORE_BLK struct<System.Linq.Parallel.Pair`2, 16> (copy)
- [000025] ---X------- ├──▌ FIELD_ADDR byref <unknown class>:<unknown field>
- [000026] ----------- │ └──▌ LCL_VAR byref V00 this
- [000024] ----------- └──▌ LCL_VAR struct<System.Linq.Parallel.Pair`2, 16> V05 tmp1
+ [000048] -A-XG------ ▌ COMMA void
+ [000038] DA-X------- ├──▌ STORE_LCL_VAR byref V08 tmp4
+ [000025] ---X------- │ └──▌ FIELD_ADDR byref <unknown class>:<unknown field>
+ [000026] ----------- │ └──▌ LCL_VAR byref V00 this
+ [000047] -A--G------ └──▌ COMMA void
+ [000042] nA--G------ ├──▌ STOREIND ref
+ [000041] ----------- │ ├──▌ LCL_VAR byref V08 tmp4
+ [000040] ----------- │ └──▌ LCL_VAR ref V07 tmp3 (last use)
+ [000046] nA--G------ └──▌ STOREIND long
+ [000045] ----------- ├──▌ ADD byref
+ [000039] ----------- │ ├──▌ LCL_VAR byref V08 tmp4
+ [000044] ----------- │ └──▌ CNS_INT long 8
+ [000043] ----------- └──▌ LCL_FLD long V05 tmp1 [+8] Specifically, in the base we notice that we have a promoted In the diff we don't do this optimization anymore and instead just copy the promoted We should be able to avoid the spilled local address in this case by teaching physical promotion to peel |
The remainder may be separated by a bit of padding or other promoted fields but still fit into a primitive; in this case it is still beneficial to copy it all as a primitive, instead of falling back to a full block copy.
Example.
We have to be careful, however, since the covering segment can now contain promoted fields. If this happens we need to make sure we write the promoted field after the remainder.