Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: inefficient codegen for calls returning 16-byte structs on Linux x64 / arm64 #8571

Closed
AndyAyersMS opened this issue Jul 17, 2017 · 2 comments
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Milestone

Comments

@AndyAyersMS
Copy link
Member

From the binarytrees performance benchmark, initial call to bottomUpTree from Bench (other calls to this method have similar issues)

;;; Windows (return via hidden byref)
       488D4C2428           lea      rcx, bword ptr [rsp+28H]    ;; address of return byref
       448BC3               mov      r8d, ebx
       33D2                 xor      edx, edx
       E863FBFFFF           call     TreeNode:bottomUpTree(int,int):struct
       488D4C2428           lea      rcx, bword ptr [rsp+28H]
       E869FBFFFF           call     TreeNode:itemCheck():int:this

;;; Linux (return in register pair)
       418BF7               mov      esi, r15d
       33FF                 xor      edi, edi
       E85BFBFFFF           call     TreeNode:bottomUpTree(int,int):struct
       48894598             mov      gword ptr [rbp-68H], rax       ;; spill return to temp
       488955A0             mov      qword ptr [rbp-60H], rdx
       488D7D98             lea      rdi, bword ptr [rbp-68H]       ;; copy temp to another temp
       488B07               mov      rax, gword ptr [rdi]
       488945B0             mov      gword ptr [rbp-50H], rax
       8B7F08               mov      edi, dword ptr [rdi+8]
       897DB8               mov      dword ptr [rbp-48H], edi
       488D7DB0             lea      rdi, bword ptr [rbp-50H]       ;; pass 2nd temp to itemCheck
       E849FBFFFF           call     TreeNode:itemCheck():int:this

bottomUpTree has similar issues at its recursive call sites, and also does some redundant zeroing of temp structs that were zeroed in the prolog:

;; prolog: zero from rbp-28H to rbp-88H
       488DBD78FFFFFF       lea      rdi, [rbp-88H]
       B918000000           mov      ecx, 24
       33C0                 xor      rax, rax
       F3AB                 rep stosd 

;; later: re-zero part of the range
       488D7DB8             lea      rdi, bword ptr [rbp-48H]

G_M53682_IG03:
       660F57C0             xorpd    xmm0, xmm0
       F30F7F07             movdqu   qword ptr [rdi], xmm0

;; later: re-zero another part, overwrite it (partially with a zero),
;; then immediately read & return the values just written as a pair
       488D45C8             lea      rax, bword ptr [rbp-38H]

G_M53682_IG09:
       660F57C0             xorpd    xmm0, xmm0
       F30F7F00             movdqu   qword ptr [rax], xmm0

G_M53682_IG10:
       895DD0               mov      dword ptr [rbp-30H], ebx
       33C0                 xor      rax, rax
       488945C8             mov      gword ptr [rbp-38H], rax
       488B45C8             mov      rax, gword ptr [rbp-38H]
       488B55D0             mov      rdx, qword ptr [rbp-30H]

G_M53682_IG11:
       488D65D8             lea      rsp, [rbp-28H]
       5B                   pop      rbx
       415C                 pop      r12
       415D                 pop      r13
       415E                 pop      r14
       415F                 pop      r15
       5D                   pop      rbp
       C3                   ret      

Note this latter bit of code could simply be something like

   movsx   rdx, ebx
   lea     rsp, ...
   ...
   ret

category:cq
theme:structs
skill-level:expert
cost:large

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@kunalspathak kunalspathak changed the title JIT: inefficient codegen for calls returning 16-byte structs on Linux x64 JIT: inefficient codegen for calls returning 16-byte structs on Linux x64 / arm64 Mar 25, 2020
@kunalspathak
Copy link
Member

This is also applicable for ARM64 as well. When profiling this method on windows ARM4, we generate the following. The ctor call is inlined.

G_M43267_IG01:
        A9BE7BFD          stp     fp, lr, [sp,#-32]!
        910003FD          mov     fp, sp
        F9000BBF          str     xzr, [fp,#16] // [V02 tmp1]
                                                ;; bbWeight=1    PerfScore 2.50
G_M43267_IG02:
        F9400400          ldr     x0, [x0,#8]
        B40001C0          cbz     x0, G_M43267_IG06
                                                ;; bbWeight=1    PerfScore 4.00
G_M43267_IG03:
        B9400801          ldr     w1, [x0,#8]
        2A0103E1          mov     w1, w1
        F100283F          cmp     x1, #10
        54000143          blo     G_M43267_IG06
                                                ;; bbWeight=0.50 PerfScore 2.50
G_M43267_IG04:
        91004000          add     x0, x0, #16
        52800141          mov     w1, #10
        910043A2          add     x2, fp, #16   // [V02 tmp1]
        F9000040          str     x0, [x2]
        B9000841          str     w1, [x2,#8]
        F9400BA0          ldr     x0, [fp,#16]  // [V02 tmp1]
        F9400FA1          ldr     x1, [fp,#24]  // [V02 tmp1+0x08]
                                                ;; bbWeight=1    PerfScore 7.50
G_M43267_IG05:
        A8C27BFD          ldp     fp, lr, [sp],#32
        D65F03C0          ret     lr
                                                ;; bbWeight=1    PerfScore 2.00
G_M43267_IG06:
        97E65C22          bl      System.ThrowHelper:ThrowArgumentOutOfRangeException()
        D43E0000          bkpt
                                                ;; bbWeight=0    PerfScore 0.00

Here G_M43267_IG04 doesn't need to be so long and can be:

add x0, x0, #16
mov w1, #10
# At this point, x0 and x1 has the struct that we are trying to return

@CarolEidt
Copy link
Contributor

This issue seems to pertain to an older version of binarytrees (perhaps https://github.com/dotnet/coreclr/blob/414ab4ee1a6f31ae63f166de2b9d4d0af640574f/tests/src/JIT/Performance/CodeQuality/BenchmarksGame/binarytrees/binarytrees.csharp.cs). For that version, after #36862, we are largely keeping the return values in registers.

@CarolEidt CarolEidt modified the milestones: Future, 5.0.0 Oct 27, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-x64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI enhancement Product code improvement that does NOT require public API changes/additions optimization tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

4 participants