Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Suboptimal block layout with loops exiting to returns #105083

Closed
jakobbotsch opened this issue Jul 18, 2024 · 3 comments · Fixed by #105084
Closed

JIT: Suboptimal block layout with loops exiting to returns #105083

jakobbotsch opened this issue Jul 18, 2024 · 3 comments · Fixed by #105084
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release
Milestone

Comments

@jakobbotsch
Copy link
Member

I noticed this codegen from the aspnet collection:

; Assembly listing for method Perfolizer.Horology.TimeUnit:GetBestTimeUnit(double[]):Perfolizer.Horology.TimeUnit (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
; optimized using Dynamic PGO
; rsp based frame
; fully interruptible
; with Dynamic PGO: fgCalledCount is 80
; 0 inlinees with PGO data; 2 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  4,  4   )     ref  ->  rcx         class-hnd single-def <double[]>
;  V01 loc0         [V01,T06] (  2,  2.60)  double  ->  mm0         single-def
;  V02 loc1         [V02,T04] (  2,  2.60)     ref  ->  rdx         class-hnd exact single-def <Perfolizer.Horology.TimeUnit[]>
;* V03 loc2         [V03,T05] (  0,  0   )     int  ->  zero-ref   
;  V04 loc3         [V04,T03] (  3,  4.20)     ref  ->  rax         class-hnd <Perfolizer.Horology.TimeUnit>
;  V05 OutArgs      [V05    ] (  1,  1   )  struct (32) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V06 rat0         [V06,T01] (  4,  5.80)    long  ->  rcx         "Strength reduced derived IV"
;  V07 rat1         [V07,T02] (  4,  5.80)     int  ->   r8         "Trip count IV"
;
; Lcl frame size = 40

G_M16141_IG01:        ; bbWeight=1, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref, nogc <-- Prolog IG
       sub      rsp, 40
						;; size=4 bbWeight=1 PerfScore 0.25
G_M16141_IG02:        ; bbWeight=1, gcrefRegs=0002 {rcx}, byrefRegs=0000 {}, byref, isz
       ; gcrRegs +[rcx]
       cmp      dword ptr [rcx+0x08], 0
       je       SHORT G_M16141_IG08
       call     [<unknown method>]
       ; gcrRegs -[rcx]
       ; gcr arg pop 0
       mov      rax, 0xD1FFAB1E      ; const ptr
       mov      rdx, gword ptr [rax]
       ; gcrRegs +[rdx]
       mov      ecx, 16
       mov      r8d, 7
						;; size=36 bbWeight=1 PerfScore 9.75
G_M16141_IG03:        ; bbWeight=1.60, gcrefRegs=0004 {rdx}, byrefRegs=0000 {}, byref, isz
       mov      rax, gword ptr [rdx+rcx]
       ; gcrRegs +[rax]
       imul     r10, qword ptr [rax+0x18], 0x3E8
       vxorps   xmm1, xmm1, xmm1
       vcvtsi2sd xmm1, xmm1, r10
       vucomisd xmm1, xmm0
       jbe      SHORT G_M16141_IG05
						;; size=27 bbWeight=1.60 PerfScore 26.13
G_M16141_IG04:        ; bbWeight=1, gcrefRegs=0001 {rax}, byrefRegs=0000 {}, byref, epilog, nogc
       ; gcrRegs -[rdx]
       add      rsp, 40
       ret      
						;; size=5 bbWeight=1 PerfScore 1.25
G_M16141_IG05:        ; bbWeight=1.60, gcVars=0000000000000000 {}, gcrefRegs=0004 {rdx}, byrefRegs=0000 {}, gcvars, byref, isz
       ; gcrRegs -[rax] +[rdx]
       add      rcx, 8
       dec      r8d
       jne      SHORT G_M16141_IG03
						;; size=9 bbWeight=1.60 PerfScore 2.40
G_M16141_IG06:        ; bbWeight=0, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref
       ; gcrRegs -[rdx]
       mov      rdx, 0xD1FFAB1E      ; const ptr
       mov      rdx, gword ptr [rdx]
       ; gcrRegs +[rdx]
       mov      rcx, 0xD1FFAB1E      ; <unknown method>
						;; size=23 bbWeight=0 PerfScore 0.00
G_M16141_IG07:        ; bbWeight=0, epilog, nogc, extend
       add      rsp, 40
       tail.jmp [System.Linq.Enumerable:Last[System.__Canon](System.Collections.Generic.IEnumerable`1[System.__Canon]):System.__Canon]
						;; size=10 bbWeight=0 PerfScore 0.00
G_M16141_IG08:        ; bbWeight=0, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, gcvars, byref
       ; gcrRegs -[rdx]
       mov      rax, 0xD1FFAB1E      ; const ptr
       mov      rax, gword ptr [rax]
       ; gcrRegs +[rax]
						;; size=13 bbWeight=0 PerfScore 0.00
G_M16141_IG09:        ; bbWeight=0, epilog, nogc, extend
       add      rsp, 40
       ret      
						;; size=5 bbWeight=0 PerfScore 0.00

; Total bytes of code 132, prolog size 4, PerfScore 39.78, instruction count 28, allocated bytes for code 132 (MethodHash=973dc0f2) for method Perfolizer.Horology.TimeUnit:GetBestTimeUnit(double[]):Perfolizer.Horology.TimeUnit (Tier1)
; ============================================================

Note the G_M16141_IG04 epilog in the middle of the loop. I would expect that epilog to be placed after the loop instead, such that G_M16141_IG05 was packed together with G_M16141_IG03. In this case that block even has higher block weight, so I am surprised to see it there.

FWIW, this problem seems quite common without PGO; e.g.

[MethodImpl(MethodImplOptions.NoInlining)]
public bool Contains(char c, char[] s)
{
    for (int i = 0; i < s.Length; i++)
    {
        if (s[i] == c)
            return true;
    }

    return false;
}

results in

G_M15173_IG04:  ;; offset=0x0010
       movzx    rax, word  ptr [r8]
       cmp      eax, edx
       jne      SHORT G_M15173_IG07
						;; size=8 bbWeight=4 PerfScore 13.00
G_M15173_IG05:  ;; offset=0x0018
       mov      eax, 1
						;; size=5 bbWeight=0.50 PerfScore 0.12
G_M15173_IG06:  ;; offset=0x001D
       ret      
						;; size=1 bbWeight=0.50 PerfScore 0.50
G_M15173_IG07:  ;; offset=0x001E
       add      r8, 2
       dec      ecx
       jne      SHORT G_M15173_IG04
						;; size=8 bbWeight=4 PerfScore 6.00

without PGO (e.g. when looking in disasmo, or with DOTNET_TieredCompilation=0).

Not sure if this is already on the radar, cc @amanasifkhalid @dotnet/jit-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jul 18, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged New issue has not been triaged by the area owner label Jul 18, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Member

Thanks for pointing this out. I suspect fgMoveHotJumps during layout is the culprit. Here's the layout before reordering for the example you gave:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..004)-> BB05(0.5),BB02(0.5)     ( cond )                     i idxlen
BB02 [0007]  1       BB01                  0.25 [004..???)-> BB03(1)                 (always)                     internal idxlen
BB03 [0001]  2       BB02,BB04             4    [004..00A)-> BB04(0.5),BB07(0.5)     ( cond )                     i loophead idxlen bwd bwd-target
BB04 [0003]  1       BB03                  4    [00C..016)-> BB03(0.5),BB05(0.5)     ( cond )                     i idxlen bwd
BB05 [0008]  2       BB01,BB04             0.50 [016..018)                           (return)                     i
BB07 [0002]  1       BB03                  0.50 [00A..00C)                           (return)                     i
BB08 [0009]  0                             0    [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

The edge likelihoods look sketchy. The block weights suggest the edge BB03->BB07 should be relatively unlikely, but the canned likelihoods don't reflect that. Running profile repair right before layout would probably fix this, but that's a work item for later.

In this case, BB03's edge likelihoods are equal, and one of its successors (which is intuitively the hotter one) is already the next block, so fgMoveHotJumps should probably do nothing in such cases. This should be an easy fix.

@amanasifkhalid amanasifkhalid self-assigned this Jul 18, 2024
@amanasifkhalid amanasifkhalid added this to the 9.0.0 milestone Jul 18, 2024
@amanasifkhalid amanasifkhalid added Priority:2 Work that is important, but not critical for the release and removed untriaged New issue has not been triaged by the area owner labels Jul 18, 2024
@amanasifkhalid
Copy link
Member

With #105084, here's the new layout:

---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   [IL range]   [jump]                            [EH region]        [flags]
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    [000..004)-> BB05(0.5),BB02(0.5)     ( cond )                     i idxlen
BB02 [0007]  1       BB01                  0.25 [004..???)-> BB03(1)                 (always)                     internal idxlen
BB03 [0001]  2       BB02,BB04             4    [004..00A)-> BB04(0.5),BB07(0.5)     ( cond )                     i loophead idxlen bwd bwd-target
BB04 [0003]  1       BB03                  4    [00C..016)-> BB03(0.5),BB05(0.5)     ( cond )                     i idxlen bwd
BB05 [0008]  2       BB01,BB04             0.50 [016..018)                           (return)                     i
BB07 [0002]  1       BB03                  0.50 [00A..00C)                           (return)                     i
BB08 [0009]  0                             0    [???..???)                           (throw )                     i rare keep internal
---------------------------------------------------------------------------------------------------------------------------------------------------------------------

@github-actions github-actions bot locked and limited conversation to collaborators Aug 18, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI Priority:2 Work that is important, but not critical for the release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants