JIT: Add a (disabled) prototype for a generalized promotion pass #83388
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
Currently I'm using some size heuristics to figure out when to do the promotion. Here is an example diff in System.Numerics.Quaternion:CreateFromRotationMatrix:

; Final local variable assignments
;
; V00 RetBuf [V00,T01] ( 4, 4 ) byref -> rcx single-def
-; V01 arg0 [V01,T00] ( 29, 34 ) byref -> rdx single-def
-; V02 loc0 [V02,T08] ( 3, 2.50) float -> mm2
+; V01 arg0 [V01,T00] ( 11, 22 ) byref -> rdx single-def
+; V02 loc0 [V02,T14] ( 3, 2.50) float -> mm9
; V03 loc1 [V03,T02] ( 13, 7 ) simd16 -> registers ld-addr-op
; V04 loc2 [V04,T07] ( 7, 3.50) float -> mm0
-; V05 loc3 [V05,T12] ( 3, 1.50) float -> mm3
-; V06 loc4 [V06,T09] ( 4, 2 ) float -> mm0
-; V07 loc5 [V07,T13] ( 3, 1.50) float -> mm3
-; V08 loc6 [V08,T10] ( 4, 2 ) float -> mm0
-; V09 loc7 [V09,T14] ( 3, 1.50) float -> mm0
-; V10 loc8 [V10,T11] ( 4, 2 ) float -> mm1
+; V05 loc3 [V05,T18] ( 3, 1.50) float -> mm0
+; V06 loc4 [V06,T15] ( 4, 2 ) float -> mm4
+; V07 loc5 [V07,T19] ( 3, 1.50) float -> mm0
+; V08 loc6 [V08,T16] ( 4, 2 ) float -> mm4
+; V09 loc7 [V09,T20] ( 3, 1.50) float -> mm0
+; V10 loc8 [V10,T17] ( 4, 2 ) float -> mm4
;# V11 OutArgs [V11 ] ( 1, 1 ) struct ( 0) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V12 cse0 [V12,T04] ( 7, 4.50) float -> mm1 "CSE - aggressive"
-; V13 cse1 [V13,T05] ( 7, 4.50) float -> mm3 "CSE - aggressive"
-; V14 cse2 [V14,T06] ( 7, 4.50) float -> mm0 "CSE - aggressive"
-; V15 cse3 [V15,T03] ( 12, 6 ) float -> mm2 "CSE - aggressive"
+; V12 tmp1 [V12,T04] ( 7, 4.50) float -> mm0 "V01.[000..004)"
+; V13 tmp2 [V13,T08] ( 5, 3 ) float -> mm1 "V01.[004..008)"
+; V14 tmp3 [V14,T09] ( 5, 3 ) float -> mm2 "V01.[008..012)"
+; V15 tmp4 [V15,T10] ( 5, 3 ) float -> mm3 "V01.[016..020)"
+; V16 tmp5 [V16,T05] ( 7, 4.50) float -> mm4 "V01.[020..024)"
+; V17 tmp6 [V17,T11] ( 5, 3 ) float -> mm5 "V01.[024..028)"
+; V18 tmp7 [V18,T12] ( 5, 3 ) float -> mm6 "V01.[032..036)"
+; V19 tmp8 [V19,T13] ( 5, 3 ) float -> mm7 "V01.[036..040)"
+; V20 tmp9 [V20,T06] ( 7, 4.50) float -> mm8 "V01.[040..044)"
+; V21 cse0 [V21,T03] ( 12, 6 ) float -> mm9 "CSE - aggressive"
;
-; Lcl frame size = 0
+; Lcl frame size = 88
G_M15800_IG01:
+ sub rsp, 88
vzeroupper
- ;; size=3 bbWeight=1 PerfScore 1.00
+ vmovaps xmmword ptr [rsp+40H], xmm6
+ vmovaps xmmword ptr [rsp+30H], xmm7
+ vmovaps xmmword ptr [rsp+20H], xmm8
+ vmovaps xmmword ptr [rsp+10H], xmm9
+ vmovaps xmmword ptr [rsp], xmm10
+ ;; size=36 bbWeight=1 PerfScore 11.25
G_M15800_IG02:
vmovss xmm0, dword ptr [rdx]
- vmovss xmm1, dword ptr [rdx+14H]
- vaddss xmm2, xmm0, xmm1
- vmovss xmm3, dword ptr [rdx+28H]
- vaddss xmm2, xmm2, xmm3
- vxorps xmm4, xmm4
- vucomiss xmm2, xmm4
+ vmovss xmm1, dword ptr [rdx+04H]
+ vmovss xmm2, dword ptr [rdx+08H]
+ vmovss xmm3, dword ptr [rdx+10H]
+ vmovss xmm4, dword ptr [rdx+14H]
+ vmovss xmm5, dword ptr [rdx+18H]
+ vmovss xmm6, dword ptr [rdx+20H]
+ vmovss xmm7, dword ptr [rdx+24H]
+ vmovss xmm8, dword ptr [rdx+28H]
+ vaddss xmm9, xmm0, xmm4
+ vaddss xmm9, xmm9, xmm8
+ vxorps xmm10, xmm10
+ vucomiss xmm9, xmm10
jbe SHORT G_M15800_IG04
- ;; size=32 bbWeight=1 PerfScore 21.33
+ ;; size=65 bbWeight=1 PerfScore 45.33
G_M15800_IG03:
- vaddss xmm0, xmm2, dword ptr [reloc @RWD00]
+ vaddss xmm0, xmm9, dword ptr [reloc @RWD00]
vsqrtss xmm0, xmm0
- vmovss xmm2, dword ptr [reloc @RWD04]
- vmulss xmm1, xmm0, xmm2
- vinsertps xmm1, xmm1, xmm1, 55
- vdivss xmm0, xmm2, xmm0
- vmovss xmm3, dword ptr [rdx+18H]
- vsubss xmm2, xmm3, dword ptr [rdx+24H]
+ vmovss xmm9, dword ptr [reloc @RWD04]
+ vmulss xmm4, xmm0, xmm9
+ vinsertps xmm4, xmm4, xmm4, 55
+ vdivss xmm0, xmm9, xmm0
+ vsubss xmm5, xmm5, xmm7
+ vmulss xmm5, xmm5, xmm0
+ vinsertps xmm4, xmm4, xmm5, 0
+ vsubss xmm2, xmm6, xmm2
vmulss xmm2, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm2, 0
- vmovss xmm2, dword ptr [rdx+20H]
- vsubss xmm2, xmm2, dword ptr [rdx+08H]
- vmulss xmm2, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm2, 16
- vmovss xmm2, dword ptr [rdx+04H]
- vsubss xmm2, xmm2, dword ptr [rdx+10H]
- vmulss xmm0, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm0, 32
+ vinsertps xmm2, xmm4, xmm2, 16
+ vsubss xmm1, xmm1, xmm3
+ vmulss xmm3, xmm1, xmm0
+ vinsertps xmm4, xmm2, xmm3, 32
jmp G_M15800_IG07
- ;; size=99 bbWeight=0.50 PerfScore 38.50
+ ;; size=82 bbWeight=0.50 PerfScore 28.00
G_M15800_IG04:
- vucomiss xmm0, xmm1
+ vucomiss xmm0, xmm4
jb SHORT G_M15800_IG05
- vucomiss xmm0, xmm3
+ vucomiss xmm0, xmm8
jb SHORT G_M15800_IG05
vaddss xmm0, xmm0, dword ptr [reloc @RWD00]
- vsubss xmm1, xmm0, xmm1
- vsubss xmm3, xmm1, xmm3
- vsqrtss xmm3, xmm3
- vmovss xmm2, dword ptr [reloc @RWD04]
- vdivss xmm0, xmm2, xmm3
- vmulss xmm2, xmm3, xmm2
- vinsertps xmm1, xmm1, xmm2, 14
- vmovss xmm2, dword ptr [rdx+04H]
- vaddss xmm2, xmm2, dword ptr [rdx+10H]
- vmulss xmm2, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm2, 16
- vmovss xmm2, dword ptr [rdx+08H]
- vaddss xmm2, xmm2, dword ptr [rdx+20H]
- vmulss xmm2, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm2, 32
- vmovss xmm2, dword ptr [rdx+18H]
- vsubss xmm2, xmm2, dword ptr [rdx+24H]
- vmulss xmm0, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm0, 48
- jmp G_M15800_IG07
- ;; size=119 bbWeight=0.50 PerfScore 44.50
-G_M15800_IG05:
- vucomiss xmm1, xmm3
- jbe SHORT G_M15800_IG06
- vaddss xmm1, xmm1, dword ptr [reloc @RWD00]
- vsubss xmm0, xmm1, xmm0
- vsubss xmm3, xmm0, xmm3
- vsqrtss xmm3, xmm3
- vmovss xmm2, dword ptr [reloc @RWD04]
- vdivss xmm0, xmm2, xmm3
- vmovss xmm1, dword ptr [rdx+10H]
- vaddss xmm1, xmm1, dword ptr [rdx+04H]
- vmulss xmm1, xmm1, xmm0
- vinsertps xmm1, xmm1, xmm1, 14
- vmulss xmm2, xmm3, xmm2
- vinsertps xmm1, xmm1, xmm2, 16
- vmovss xmm2, dword ptr [rdx+24H]
- vaddss xmm2, xmm2, dword ptr [rdx+18H]
- vmulss xmm2, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm2, 32
- vmovss xmm2, dword ptr [rdx+20H]
- vsubss xmm2, xmm2, dword ptr [rdx+08H]
- vmulss xmm0, xmm2, xmm0
- vinsertps xmm1, xmm1, xmm0, 48
- jmp SHORT G_M15800_IG07
- ;; size=110 bbWeight=0.50 PerfScore 43.00
-G_M15800_IG06:
- vaddss xmm2, xmm3, dword ptr [reloc @RWD00]
- vsubss xmm0, xmm2, xmm0
- vsubss xmm0, xmm0, xmm1
+ vsubss xmm4, xmm0, xmm4
+ vsubss xmm0, xmm4, xmm8
vsqrtss xmm0, xmm0
- vmovss xmm2, dword ptr [reloc @RWD04]
- vdivss xmm1, xmm2, xmm0
- vmovss xmm3, dword ptr [rdx+20H]
- vaddss xmm3, xmm3, dword ptr [rdx+08H]
- vmulss xmm3, xmm3, xmm1
- vinsertps xmm3, xmm3, xmm3, 14
- vmovss xmm4, dword ptr [rdx+24H]
- vaddss xmm4, xmm4, dword ptr [rdx+18H]
- vmulss xmm4, xmm4, xmm1
- vinsertps xmm3, xmm3, xmm4, 16
- vmulss xmm0, xmm0, xmm2
- vinsertps xmm0, xmm3, xmm0, 32
- vmovss xmm2, dword ptr [rdx+04H]
- vsubss xmm2, xmm2, dword ptr [rdx+10H]
- vmulss xmm1, xmm2, xmm1
+ vmovss xmm9, dword ptr [reloc @RWD04]
+ vdivss xmm4, xmm9, xmm0
+ vmulss xmm0, xmm0, xmm9
+ vinsertps xmm0, xmm0, xmm0, 14
+ vaddss xmm1, xmm1, xmm3
+ vmulss xmm3, xmm1, xmm4
+ vinsertps xmm0, xmm0, xmm3, 16
+ vaddss xmm2, xmm2, xmm6
+ vmulss xmm1, xmm2, xmm4
+ vinsertps xmm0, xmm0, xmm1, 32
+ vsubss xmm5, xmm5, xmm7
+ vmulss xmm1, xmm5, xmm4
vinsertps xmm0, xmm0, xmm1, 48
- vmovaps xmm1, xmm0
- ;; size=106 bbWeight=0.50 PerfScore 40.62
+ vmovaps xmm4, xmm0
+ jmp G_M15800_IG07
+ ;; size=108 bbWeight=0.50 PerfScore 34.12
+G_M15800_IG05:
+ vucomiss xmm4, xmm8
+ jbe SHORT G_M15800_IG06
+ vaddss xmm4, xmm4, dword ptr [reloc @RWD00]
+ vsubss xmm0, xmm4, xmm0
+ vsubss xmm0, xmm0, xmm8
+ vsqrtss xmm0, xmm0
+ vmovss xmm9, dword ptr [reloc @RWD04]
+ vdivss xmm4, xmm9, xmm0
+ vaddss xmm1, xmm3, xmm1
+ vmulss xmm3, xmm1, xmm4
+ vinsertps xmm1, xmm1, xmm3, 14
+ vmulss xmm0, xmm0, xmm9
+ vinsertps xmm0, xmm1, xmm0, 16
+ vaddss xmm5, xmm7, xmm5
+ vmulss xmm1, xmm5, xmm4
+ vinsertps xmm0, xmm0, xmm1, 32
+ vsubss xmm2, xmm6, xmm2
+ vmulss xmm1, xmm2, xmm4
+ vinsertps xmm0, xmm0, xmm1, 48
+ vmovaps xmm4, xmm0
+ jmp SHORT G_M15800_IG07
+ ;; size=99 bbWeight=0.50 PerfScore 32.62
+G_M15800_IG06:
+ vaddss xmm9, xmm8, dword ptr [reloc @RWD00]
+ vsubss xmm0, xmm9, xmm0
+ vsubss xmm0, xmm0, xmm4
+ vsqrtss xmm0, xmm0
+ vmovss xmm9, dword ptr [reloc @RWD04]
+ vdivss xmm4, xmm9, xmm0
+ vaddss xmm2, xmm6, xmm2
+ vmulss xmm2, xmm2, xmm4
+ vinsertps xmm2, xmm2, xmm2, 14
+ vaddss xmm5, xmm7, xmm5
+ vmulss xmm5, xmm5, xmm4
+ vinsertps xmm2, xmm2, xmm5, 16
+ vmulss xmm0, xmm0, xmm9
+ vinsertps xmm0, xmm2, xmm0, 32
+ vsubss xmm1, xmm1, xmm3
+ vmulss xmm1, xmm1, xmm4
+ vinsertps xmm0, xmm0, xmm1, 48
+ vmovaps xmm4, xmm0
+ ;; size=89 bbWeight=0.50 PerfScore 30.12
G_M15800_IG07:
- vmovups xmmword ptr [rcx], xmm1
+ vmovups xmmword ptr [rcx], xmm4
mov rax, rcx
;; size=7 bbWeight=1 PerfScore 2.25
G_M15800_IG08:
+ vmovaps xmm6, xmmword ptr [rsp+40H]
+ vmovaps xmm7, xmmword ptr [rsp+30H]
+ vmovaps xmm8, xmmword ptr [rsp+20H]
+ vmovaps xmm9, xmmword ptr [rsp+10H]
+ vmovaps xmm10, xmmword ptr [rsp]
+ add rsp, 88
ret
- ;; size=1 bbWeight=1 PerfScore 1.00
+ ;; size=34 bbWeight=1 PerfScore 21.25
RWD00 dd 3F800000h ; 1
RWD04 dd 3F000000h ; 0.5
-; Total bytes of code 477, prolog size 3, PerfScore 239.91, instruction count 100, allocated bytes for code 477 (MethodHash=096ec247) for method System.Numerics.Quaternion:CreateFromRotationMatrix(System.Numerics.Matrix4x4):System.Numerics.Quaternion
+; Total bytes of code 520, prolog size 36, PerfScore 256.96, instruction count 108, allocated bytes for code 520 (MethodHash=096ec247) for method System.Numerics.Quaternion:CreateFromRotationMatrix(System.Numerics.Matrix4x4):System.Numerics.Quaternion

We decide to promote all the fields of the matrix because we see each of them as having multiple accesses, and we only need to load them once (since they come from a parameter):

Accesses for V01
float @ 000
#: 6
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 004
#: 4
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 008
#: 4
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 016
#: 4
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 020
#: 6
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 024
#: 4
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 032
#: 4
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 036
#: 4
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
float @ 040
#: 6
# assigned from: 0
# assigned to: 0
# as call arg: 0
# as implicit by-ref call arg: 0
# as on-stack call arg: 0
# as retbuf: 0
# as returned value: 0
Evaluating access float @ 000
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 200
Cost without: 450
Promoting replacement
lvaGrabTemp returning 12 (V12 tmp1) (a long lifetime temp) called for V01.[000..004).
Evaluating access float @ 004
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 150
Cost without: 300
Promoting replacement
lvaGrabTemp returning 13 (V13 tmp2) (a long lifetime temp) called for V01.[004..008).
Evaluating access float @ 008
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 150
Cost without: 300
Promoting replacement
lvaGrabTemp returning 14 (V14 tmp3) (a long lifetime temp) called for V01.[008..012).
Evaluating access float @ 016
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 150
Cost without: 300
Promoting replacement
lvaGrabTemp returning 15 (V15 tmp4) (a long lifetime temp) called for V01.[016..020).
Evaluating access float @ 020
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 200
Cost without: 450
Promoting replacement
lvaGrabTemp returning 16 (V16 tmp5) (a long lifetime temp) called for V01.[020..024).
Evaluating access float @ 024
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 150
Cost without: 300
Promoting replacement
lvaGrabTemp returning 17 (V17 tmp6) (a long lifetime temp) called for V01.[024..028).
Evaluating access float @ 032
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 150
Cost without: 300
Promoting replacement
lvaGrabTemp returning 18 (V18 tmp7) (a long lifetime temp) called for V01.[032..036).
Evaluating access float @ 036
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 150
Cost without: 300
Promoting replacement
lvaGrabTemp returning 19 (V19 tmp8) (a long lifetime temp) called for V01.[036..040).
Evaluating access float @ 040
Write-back cost: 50
# write backs: 0
# read backs: 1
Cost with: 200
Cost without: 450
Promoting replacement
lvaGrabTemp returning 20 (V20 tmp9) (a long lifetime temp) called for V01.[040..044).
V01 promoted with 9 replacements
[000..004) promoted as float V12
[004..008) promoted as float V13
[008..012) promoted as float V14
[016..020) promoted as float V15
[020..024) promoted as float V16
[024..028) promoted as float V17
[032..036) promoted as float V18
[036..040) promoted as float V19
  [040..044) promoted as float V20

It still ends up being a size regression because we need to save a bunch of callee-saved registers. In addition, there is a large up-front cost in loading all the registers, but after that we do not need any more memory accesses.

runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/Quaternion.cs, lines 340 to 390 in c9173f2
So in reality, each field is only accessed once per execution anyway, and it's best to leave it in memory so it can be loaded/contained right at its use.
My current thinking is to try to base things more on PGO, and piggyback on @AndyAyersMS's synthetic PGO for my experiments. That is, make the logic much more conservative about introducing replacements, so that we only actually end up introducing them when we're quite sure that an execution of the function ends up accessing the fields many times.
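A minimal sketch of what such a conservative, weight-based gate could look like; the names and threshold here are hypothetical, not the actual JIT code:

```cpp
// Hypothetical summary of one candidate field range; mirrors the kind of
// data in the "Accesses for V01" dump above, but is not the real structure.
struct AccessSummary
{
    double countWtd; // profile-weighted number of accesses
};

// Only promote when, per invocation of the method, the field is expected
// to be accessed enough times to amortize the up-front load into a register.
bool ShouldPromote(const AccessSummary& access, double methodEntryWeight)
{
    const double MIN_ACCESSES_PER_CALL = 2.0; // hypothetical threshold

    if (methodEntryWeight <= 0)
    {
        return false; // no profile data: stay conservative
    }

    double accessesPerCall = access.countWtd / methodEntryWeight;
    return accessesPerCall >= MIN_ACCESSES_PER_CALL;
}
```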
Reworked things to use BB weights and am now testing with:

[15:35:30] 110 contexts with diffs (29 improvements, 70 regressions, 11 same size)
[15:35:30] -689/+6,564 bytes

PerfScore is a bit better:

[15:35:48] 104 total methods with Perf Score differences (66 improved, 38 regressed).

For aspnet we have:

[15:41:55] 524 contexts with diffs (195 improvements, 298 regressions, 31 same size)
[15:41:55] -4,605/+9,335 bytes
[15:42:29] 500 total methods with Perf Score differences (377 improved, 123 regressed).

Accurately predicting things like new spills and register pressure when creating the new locals this early in the JIT is bound to be tough, but we should be able to do better than this.

For the example in #71510 we get the following diff in FilteredSpanEnumerator`1[int]:Test:

; 0 inlinees with PGO data; 3 single block inlinees; 1 inlinees without PGO data
; Final local variable assignments
;
-; V00 arg0 [V00,T08] ( 3, 6 ) byref -> rcx single-def
-; V01 arg1 [V01,T10] ( 3, 3 ) ref -> rdx class-hnd single-def
-; V02 loc0 [V02 ] ( 7,437.00) struct (32) [rsp+48H] do-not-enreg[XSF] addr-exposed ld-addr-op
+; V00 arg0 [V00,T11] ( 3, 6 ) byref -> rcx single-def
+; V01 arg1 [V01,T12] ( 3, 3 ) ref -> rdx class-hnd single-def
+;* V02 loc0 [V02 ] ( 0, 0 ) struct (32) zero-ref do-not-enreg[SF] ld-addr-op
;* V03 loc1 [V03 ] ( 0, 0 ) struct (32) zero-ref do-not-enreg[S] ld-addr-op
; V04 OutArgs [V04 ] ( 1, 1 ) struct (32) [rsp+00H] do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
-; V05 tmp1 [V05,T09] ( 3, 6 ) struct (32) [rsp+28H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
-; V06 tmp2 [V06,T06] ( 3,216.00) ref -> rax class-hnd exact "Single-def Box Helper"
-;* V07 tmp3 [V07,T07] ( 0, 0 ) bool -> zero-ref "Inline return value spill temp"
-; V08 tmp4 [V08,T03] ( 4,320.00) int -> rcx "Inline stloc first use temp"
-; V09 tmp5 [V09,T00] ( 3,480.00) ref -> rax class-hnd "impAppendStmt"
-; V10 tmp6 [V10,T02] ( 3,480.00) int -> rcx "Span.get_Item index"
-; V11 tmp7 [V11,T01] ( 3,480.00) byref -> rax "Span.get_Item ptrToSpan"
-;* V12 tmp8 [V12 ] ( 0, 0 ) byref -> zero-ref V14._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
-;* V13 tmp9 [V13 ] ( 0, 0 ) int -> zero-ref V14._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
-;* V14 tmp10 [V14 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
-; V15 cse0 [V15,T05] ( 3,240.00) int -> rcx "CSE - aggressive"
-; V16 cse1 [V16,T04] ( 3,240.00) ref -> rax "CSE - aggressive"
+; V05 tmp1 [V05,T10] ( 5, 10 ) struct (32) [rsp+28H] do-not-enreg[SF] must-init ld-addr-op "NewObj constructor temp"
+; V06 tmp2 [V06,T03] ( 3,216.00) ref -> rax class-hnd exact "Single-def Box Helper"
+;* V07 tmp3 [V07,T04] ( 0, 0 ) bool -> zero-ref "Inline return value spill temp"
+; V08 tmp4 [V08,T02] ( 3,240.00) int -> rdi "Inline stloc first use temp"
+;* V09 tmp5 [V09 ] ( 0, 0 ) ref -> zero-ref class-hnd "impAppendStmt"
+; V10 tmp6 [V10,T00] ( 3,480.00) int -> rcx "Span.get_Item index"
+;* V11 tmp7 [V11 ] ( 0, 0 ) byref -> zero-ref V18._reference(offs=0x00) P-INDEP "field V00._reference (fldOffset=0x0)"
+;* V12 tmp8 [V12 ] ( 0, 0 ) int -> zero-ref V18._length(offs=0x08) P-INDEP "field V00._length (fldOffset=0x8)"
+; V13 tmp9 [V13,T06] ( 3, 82.00) ref -> rsi single-def "V02.[000..008)"
+; V14 tmp10 [V14,T05] ( 2,116.00) int -> rdi "V02.[008..012)"
+; V15 tmp11 [V15,T01] ( 4,241.00) int -> registers "V02.[012..016)"
+; V16 tmp12 [V16,T07] ( 2, 81.00) byref -> rbx single-def "V02.[016..024)"
+; V17 tmp13 [V17,T08] ( 2, 81.00) int -> rbp single-def "V02.[024..028)"
+;* V18 tmp14 [V18 ] ( 0, 0 ) struct (16) zero-ref "Promoted implicit byref"
+; V19 cse0 [V19,T09] ( 2, 81.00) int -> r14 "CSE - aggressive"
+; V20 cse1 [V20,T13] ( 2, 2 ) int -> rdi "CSE - aggressive"
;
-; Lcl frame size = 104
+; Lcl frame size = 72
G_M27694_IG01:
- sub rsp, 104
+ push r15
+ push r14
+ push rdi
+ push rsi
+ push rbp
+ push rbx
+ sub rsp, 72
vzeroupper
xor eax, eax
mov qword ptr [rsp+28H], rax
vxorps xmm4, xmm4
vmovdqa xmmword ptr [rsp+30H], xmm4
mov qword ptr [rsp+40H], rax
- ;; size=29 bbWeight=1 PerfScore 5.83
+ ;; size=37 bbWeight=1 PerfScore 11.83
G_M27694_IG02:
vmovdqu xmm0, xmmword ptr [rcx]
vmovdqu xmmword ptr [rsp+38H], xmm0
;; size=10 bbWeight=1 PerfScore 5.00
G_M27694_IG03:
mov gword ptr [rsp+28H], rdx
- ;; size=5 bbWeight=1 PerfScore 1.00
+ mov rsi, gword ptr [rsp+28H]
+ xor edi, edi
+ mov rbx, bword ptr [rsp+38H]
+ mov ebp, dword ptr [rsp+40H]
+ mov r14d, dword ptr [rsi+08H]
+ jmp SHORT G_M27694_IG05
+ ;; size=27 bbWeight=1 PerfScore 8.25
G_M27694_IG04:
- vmovdqu ymm0, ymmword ptr[rsp+28H]
- vmovdqu ymmword ptr[rsp+48H], ymm0
- ;; size=12 bbWeight=1 PerfScore 5.00
-G_M27694_IG05:
- jmp SHORT G_M27694_IG07
- ;; size=2 bbWeight=1 PerfScore 2.00
-G_M27694_IG06:
mov rcx, 0xD1FFAB1E ; System.Int32
call CORINFO_HELP_NEWSFAST
- mov ecx, dword ptr [rsp+50H]
- mov dword ptr [rax+08H], ecx
+ mov dword ptr [rax+08H], edi
mov rcx, rax
call [System.Console:WriteLine(System.Object)]
- ;; size=31 bbWeight=36.00 PerfScore 234.00
+ mov edi, r15d
+ ;; size=30 bbWeight=36.00 PerfScore 207.00
+G_M27694_IG05:
+ cmp r14d, edi
+ jle SHORT G_M27694_IG07
+ ;; size=5 bbWeight=80.00 PerfScore 100.00
+G_M27694_IG06:
+ lea ecx, [rdi+01H]
+ mov r15d, ecx
+ mov ecx, edi
+ mov ecx, dword ptr [rsi+4*rcx+10H]
+ cmp ecx, ebp
+ jae SHORT G_M27694_IG08
+ mov edi, dword ptr [rbx+4*rcx]
+ jmp SHORT G_M27694_IG04
+ ;; size=21 bbWeight=80.00 PerfScore 660.00
G_M27694_IG07:
- mov ecx, dword ptr [rsp+54H]
- mov rax, gword ptr [rsp+48H]
- cmp ecx, dword ptr [rax+08H]
- jge SHORT G_M27694_IG09
- ;; size=14 bbWeight=80.00 PerfScore 480.00
-G_M27694_IG08:
- lea edx, [rcx+01H]
- mov dword ptr [rsp+54H], edx
- cmp ecx, dword ptr [rax+08H]
- jae SHORT G_M27694_IG10
- mov ecx, ecx
- mov ecx, dword ptr [rax+4*rcx+10H]
- lea rax, bword ptr [rsp+58H]
- cmp ecx, dword ptr [rax+08H]
- jae SHORT G_M27694_IG10
- mov rax, bword ptr [rax]
- mov ecx, dword ptr [rax+4*rcx]
- mov dword ptr [rsp+50H], ecx
- jmp SHORT G_M27694_IG06
- ;; size=40 bbWeight=80.00 PerfScore 1540.00
-G_M27694_IG09:
- vzeroupper
- add rsp, 104
+ add rsp, 72
+ pop rbx
+ pop rbp
+ pop rsi
+ pop rdi
+ pop r14
+ pop r15
ret
- ;; size=8 bbWeight=10.00 PerfScore 22.50
-G_M27694_IG10:
+ ;; size=13 bbWeight=10.00 PerfScore 42.50
+G_M27694_IG08:
call CORINFO_HELP_RNGCHKFAIL
int3
;; size=6 bbWeight=0 PerfScore 0.00
-; Total bytes of code 157, prolog size 29, PerfScore 2311.03, instruction count 41, allocated bytes for code 157 (MethodHash=b4d393d1) for method FilteredSpanEnumerator`1[int]:Test(System.ReadOnlySpan`1[int],int[])
+; Total bytes of code 149, prolog size 37, PerfScore 1049.48, instruction count 48, allocated bytes for code 149 (MethodHash=b4d393d1) for method FilteredSpanEnumerator`1[int]:Test(System.ReadOnlySpan`1[int],int[])

So this looks encouraging at least.
I wouldn't worry too much about regressions in APIs like Quaternion.CreateFromRotationMatrix. That's an API that I plan on (and need to) rewrite to use explicit SIMD code. Namely, it needs to use the … It is somewhat representative of what "might" happen for some user code, but it's still a bit different since …
Eliminate commas early in block morphing, before the rest of the phase needs to look at them. Do this by extracting their side effects and adding them on top of the returned result instead. This allows us to handle arbitrary COMMAs on the source and destination operands in a general manner.
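To illustrate the shape of that transformation (not the JIT's actual GenTree code; the toy `Node` type and helpers here are illustrative), a sketch of peeling COMMAs off an operand and re-attaching their side effects on top of the result:

```cpp
#include <vector>

// Toy tree node standing in for the JIT's IR nodes.
struct Node
{
    enum Kind { Comma, Other } kind;
    Node* op1; // for Comma: the side-effecting part
    Node* op2; // for Comma: the value part
};

// Strip nested COMMAs from 'tree', collecting the side-effecting halves in
// evaluation order, and return the underlying value operand.
Node* ExtractCommas(Node* tree, std::vector<Node*>& sideEffects)
{
    while (tree->kind == Node::Comma)
    {
        sideEffects.push_back(tree->op1);
        tree = tree->op2;
    }
    return tree;
}

// After the phase has processed the comma-free operand into 'result',
// re-wrap the side effects so they still execute first, in original order
// (the first extracted side effect ends up outermost).
Node* PrependSideEffects(Node* result, std::vector<Node*>& sideEffects)
{
    for (auto it = sideEffects.rbegin(); it != sideEffects.rend(); ++it)
    {
        result = new Node{Node::Comma, *it, result};
    }
    return result;
}
```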
Otherwise we hit asserts when trying to allocate 0 bytes of memory.
Did a first pass ... overall this looks very nice.
src/coreclr/jit/promotion.h (outdated):

#ifndef _PROMOTION_H
#define _PROMOTION_H

class Compiler;
Doesn't look like we have a policy on this, but it is often nice for headers to compile on their own instead of relying on any includers to provide the right prerequisites.
Added includes instead of forward declarations.
// It may also be contained. Overall we are going to cost each use of
// an unpromoted local at 6.5 bytes.
// TODO-CQ: We can make much better guesses on what will and won't be contained.
costWithout += access.CountWtd * 6.5;
This seems to be mixing units? Weighted costs reflect (relative) frequency, so they should be used to scale times, but bytes are size related.

You may want to form both a time score and a size score and then explicitly try to model the tradeoffs between the two.

E.g. for inlining I would generally propose: if the size delta is negative then always transform; else if size increases, transform only if the ratio of time improvement to size increase is above some threshold. A sketch of that rule is below.
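A sketch of that decision rule, under the assumption that the two scores can be formed; the threshold is illustrative:

```cpp
// sizeDelta: estimated code-size change in bytes (negative = smaller code).
// timeDelta: estimated execution-time improvement (positive = faster).
bool ShouldTransform(double sizeDelta, double timeDelta)
{
    if (sizeDelta < 0)
    {
        return true; // code shrinks: always transform
    }

    if (sizeDelta == 0)
    {
        return timeDelta > 0; // size-neutral: transform on any time win
    }

    // Code grows: require enough time improvement per byte of growth.
    const double MIN_TIME_GAIN_PER_BYTE = 0.1; // illustrative threshold
    return (timeDelta / sizeDelta) > MIN_TIME_GAIN_PER_BYTE;
}
```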
Initially this was purely a size estimate, but then I switched it to make some use of PGO weights. I didn't really spend any time on the costing after that; it's pretty hard to make good choices when a lot of the basic stuff is still missing.
I will add a todo here -- your suggestion to eventually check both size and perf makes sense. But I found it very hard to predict the size cost (in particular due to prolog/epilog cost and potential spilling). And accurately modelling execution cost is also going to be pretty hard.
Actually I can see there is already a TODO above for it:
// TODO-CQ: Tune the following heuristics. Currently they are based on
// x64 code size although using BB weights when available.
I added clarification to the todo that this mixing of units does not make sense.
src/coreclr/jit/promotion.cpp (outdated):

argSize = m_compiler->typGetObjLayout(arg.GetSignatureClassHandle())->GetSize();
}

#ifdef WINDOWS_AMD64_ABI
Leverage getArgTypeForStruct here?
I will remove these for now; they are currently unused. The plan was to move ABI information determination back so that it happens before this pass and can be used for it (I guess the implicit byref case might be simple enough here, while the "passed on stack" hint would be harder to determine).
// A negative result from the binary search means "not found"; the bitwise
// complement of the result is the insertion point.
if ((ssize_t)index < 0)
{
    index = ~index;
    // Ranges in 'replacements' are disjoint, so only the entry immediately
    // before the insertion point can overlap [offs, offs+size).
    if ((index > 0) && replacements[index - 1].Overlaps(offs, size))
How do you know that there aren't any lower indexed entries that overlap?
replacements has only disjoint ranges in it (ensured by EvaluateReplacement).
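For context, a self-contained sketch of that lookup convention (illustrative types, not the JIT's): the table is kept sorted by offset with pairwise-disjoint ranges, so after a failed search only the entry just before the insertion point can possibly overlap.

```cpp
#include <cstddef>
#include <vector>

struct Replacement
{
    unsigned Offset;
    unsigned Size;

    bool Overlaps(unsigned offs, unsigned size) const
    {
        return (offs < Offset + Size) && (Offset < offs + size);
    }
};

// Returns the index of the entry starting exactly at 'offs', or the bitwise
// complement (~) of the insertion point when no such entry exists.
ptrdiff_t BinarySearch(const std::vector<Replacement>& replacements, unsigned offs)
{
    size_t lo = 0;
    size_t hi = replacements.size();
    while (lo < hi)
    {
        size_t mid = lo + (hi - lo) / 2;
        if (replacements[mid].Offset < offs)
            lo = mid + 1;
        else if (replacements[mid].Offset > offs)
            hi = mid;
        else
            return (ptrdiff_t)mid;
    }
    return ~(ptrdiff_t)lo; // encodes the insertion point
}
```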
// lcl - The struct local
// offs - The starting offset of the range in the struct local that needs to be read back from.
// size - The size of the range
// conservative - Whether this is a potentially conservative read back
Is this for the case where you may be able to prove that a reassembled struct won't be modified?
This is to make it clear in the jitdump that this is an assignment. Eventually we want to decompose these directly (like block morphing) and avoid this read back in many cases, but the logic is not implemented yet.
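A hypothetical jitdump-style illustration of the write-back/read-back pattern being discussed; the offsets and the call are made up for illustration:

```cpp
// Pseudo-IR sketch (not actual jitdump output):
//
//   V12 = V01.[000..004)      ; initial read of the field into its new local
//   ... uses of V12 ...
//   V01.[000..004) = V12      ; write-back before V01 is used as a whole
//   CALL Foo(&V01)            ; callee may read or write the struct
//   V12 = V01.[000..004)      ; read-back afterwards; "conservative" when we
//                             ; cannot prove the callee actually wrote it
```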
Fix issue seen in #83388 where tail merging ended up adding new predecessors to the scratch block, making adding more "initialization IR" impossible for downstream phases.
* Avoid setting GTF_GLOB_REF on GT_FIELD_ADDR nodes * Avoid setting GTF_GLOB_REF on GT_FIELD nodes off of implicit byrefs. This is ok now since implicit byref morphing indiscriminately sets GTF_GLOB_REF for these. * Manually clone a "pointer to span" in span intrinsic expansion when it points to a local. Unfortunately this does not fall out from the above since gtClone does not handle FIELD_ADDR, and making it handle this needs some more work. These changes are necessary to avoid address exposure in the two user benchmarks in #83388. Fix #74563 Fix #856
No diffs now. TP regressions are from the new phase (even with the early exit) and the change in …

@AndyAyersMS Anything else you think I need to do here?
LGTM.
@En3Tho, are you ok with me adding your benchmark?
@jakobbotsch Will be glad if you do so! I think that in that particular benchmark the size of the collection should be increased.
Introduce a "physical" promotion pass (name suggested by @SingleAccretion; thanks!) that generalizes the existing promotion. More specifically, it does not have restrictions on field count and it can handle arbitrary recursive promotion.
The pass is physical in the sense that it does not rely on any field metadata for structs. Instead, it works in two separate passes over the IR: an analysis pass that finds and summarizes all accesses into candidate struct locals, and a replacement pass that rewrites the accesses deemed profitable to use new field locals. For a simple example, we see IR like:

[IR example elided]

and the analysis produces:

[analysis summary elided]

Here the pairs are (#ref counts, wtd ref counts).
Based on this accounting, the analysis estimates the profitability of replacing some of the accessed parts of the struct with new locals. This may be costly because overlapping struct accesses (e.g. passing the whole struct as an argument) may require more expensive codegen after promotion. And of course, creating new locals introduces more register pressure. Currently the profitability analysis is very crude; a sketch of its shape is below.
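A minimal sketch of the shape of that estimate; the constants below are reverse-engineered to reproduce the "Cost with"/"Cost without" numbers in the dump above (e.g. 6 accesses, 1 read back: 6×25 + 50 = 200 vs. 6×75 = 450) and are not necessarily the real ones:

```cpp
// Hypothetical per-candidate summary; mirrors the jitdump fields
// ("#", "# write backs", "# read backs") but is not the actual structure.
struct ReplacementCandidate
{
    double   countWtd;      // weighted number of accesses to the field range
    unsigned numWriteBacks; // stores of the new local back into the struct
    unsigned numReadBacks;  // reloads of the new local from the struct
};

bool EvaluateReplacement(const ReplacementCandidate& c)
{
    const double UNPROMOTED_USE_COST  = 75; // est. cost of a memory access
    const double PROMOTED_USE_COST    = 25; // est. cost of a register access
    const double BACK_AND_FORTH_COST  = 50; // per write-back/read-back

    double costWithout = c.countWtd * UNPROMOTED_USE_COST;
    double costWith    = c.countWtd * PROMOTED_USE_COST +
                         (c.numWriteBacks + c.numReadBacks) * BACK_AND_FORTH_COST;

    return costWith < costWithout; // promote only if the estimate improves
}
```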
In this case the logic decides that promotion is not worth it:

[cost dump elided]

In the above case we can override the profitability analysis with the stress mode STRESS_PHYSICAL_PROMOTION_COST, and we get:

[diff elided]

The pass still only has rudimentary support and is missing many basic CQ optimizations. For example, it does not make use of any liveness yet, and it does not have any decomposition support for assignments. Yet, it already shows good potential in a couple of user benchmarks (benchmark 1, benchmark 2). I have listed some follow-up improvements in #76928.
This PR adds the pass, but it is disabled by default. It can be enabled by setting DOTNET_JitStressModeNames=STRESS_PHYSICAL_PROMOTION. There are two new scenarios added to jit-experimental that enable it, to be used for testing purposes.