Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Emit power-of-two constant multiply as shift #13128

Merged
merged 1 commit into from
Aug 2, 2017

Conversation

JosephTremoulet
Copy link

Most of these get translated to shifts at Morph, but adding a failsafe
here next to the transform for 3/5/9->LEA ensures we won't have these in
the emitted code when they (re-)appear in later phases.

@JosephTremoulet
Copy link
Author

@dotnet/jit-contrib PTAL. This rewrites ~6,000 multiplications across ~570 files in the jit-diff input set. See stats and discussion for motivation/background.

else if (!requiresOverflowCheck && rmOp->isUsedFromReg() && isPow2(imm))
{
// Use shift for constant multiply when legal
unsigned int shiftAmount = genLog2((uint64_t)imm);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why uint64_t and not size_t? And a static_cast would be preferable to a C style cast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why uint64_t and not size_t?

I thought it best to exactly match one of the overloads of genLog2.

a static_cast would be preferable to a C style cast

Indeed; updated.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, right, unlike isPow2 genLog2 is not a template.

@pgavlin
Copy link

pgavlin commented Jul 31, 2017

@JosephTremoulet can you post an example or two of the diffs you're seeing? Also, it might be productive to file an issue to track moving this to Lower.

LGTM otherwise.

@BruceForstall
Copy link
Member

What if imm==2^31? In that case, ssize_t on x86 makes it negative, so I would think isPow2 will return false. (Similar for 2^63 on x64?)

@mikedn
Copy link

mikedn commented Aug 1, 2017

What if imm==2^31? In that case, ssize_t on x86 makes it negative, so I would think isPow2 will return false. (Similar for 2^63 on x64?)

imm should be cast to size_t in the call to isPow2. But multiplying by 2^31 is so rare that this is probably not useful.

And presumably we don't handle negative multipliers either - does x * -4 work?

PS

And presumably we don't handle negative multipliers either - does x * -4 work?

Looks like morph handles this but this codegen implementation obviously won't. Anyway, this optimization should be moved out of morph into lower/codegen and at that point it will work properly.

Most of these get translated to shifts at Morph, but adding a failsafe
here next to the transform for 3/5/9->LEA ensures we won't have these in
the emitted code when they (re-)appear in later phases.
@JosephTremoulet
Copy link
Author

Updated to use a bit test that will include min_int (might as well), but not to check for negative powers of two and insert the neg (my analysis showed almost none of these popping up at CodeGen, so I agree that deferring this until migrating the more complete pattern-matching from morph to lower makes sense, same as 3/5/9 times power-of-two).

Also updated to handle the case that src and target reg differ.

@JosephTremoulet
Copy link
Author

can you post an example or two of the diffs you're seeing?

Sure, here's a small method that changes:

Before:

; Assembly listing for method TestApp:test_88(ref,long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx        
;* V02 loc0         [V02    ] (  0,  0   )    long  ->  zero-ref   
;  V03 loc1         [V03    ] (  2,  2   )   byref  ->  [rsp+0x20]   must-init pinned
;  V04 tmp0         [V04,T02] (  2,  4   )    long  ->  rax        
;* V05 tmp1         [V05    ] (  0,  0   )    long  ->  zero-ref   
;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref   
;  V07 tmp3         [V07,T04] (  3,  3   )   byref  ->  rax        
;  V08 tmp4         [V08,T03] (  2,  4   )    long  ->  rax        
;  V09 OutArgs      [V09    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40
G_M10220_IG01:
       sub      rsp, 40
       xor      rax, rax
       mov      qword ptr [rsp+20H], rax
G_M10220_IG02:
       lea      rax, [rdx-1]
       mov      edx, 1
       sub      eax, dword ptr [rcx+24]
       cmp      eax, dword ptr [rcx+16]
       jae      SHORT G_M10220_IG04
       sub      edx, dword ptr [rcx+28]
       cmp      edx, dword ptr [rcx+20]
       jae      SHORT G_M10220_IG04
       mov      r8d, dword ptr [rcx+20]
       imul     r8, rax
       mov      rax, rdx
       add      rax, r8
       imul     rax, rax, 16             ; <-------- multiply by constant power-of-two
       lea      rax, bword ptr [rcx+rax+32]
       cmp      dword ptr [rax], eax
       add      rax, 8
       mov      bword ptr [rsp+20H], rax
       mov      rax, bword ptr [rsp+20H]
       mov      rax, qword ptr [rax]
G_M10220_IG03:
       add      rsp, 40
       ret      
G_M10220_IG04:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 89, prolog size 11 for method TestApp:test_88(ref,long):long

After:

; Assembly listing for method TestApp:test_88(ref,long):long
; Emitting BLENDED_CODE for X64 CPU with SSE2
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     ref  ->  rcx         class-hnd
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx        
;* V02 loc0         [V02    ] (  0,  0   )    long  ->  zero-ref   
;  V03 loc1         [V03    ] (  2,  2   )   byref  ->  [rsp+0x20]   must-init pinned
;  V04 tmp0         [V04,T02] (  2,  4   )    long  ->  rax        
;* V05 tmp1         [V05    ] (  0,  0   )    long  ->  zero-ref   
;* V06 tmp2         [V06    ] (  0,  0   )    long  ->  zero-ref   
;  V07 tmp3         [V07,T04] (  3,  3   )   byref  ->  rax        
;  V08 tmp4         [V08,T03] (  2,  4   )    long  ->  rax        
;  V09 OutArgs      [V09    ] (  1,  1   )  lclBlk (32) [rsp+0x00]  
;
; Lcl frame size = 40
G_M10220_IG01:
       sub      rsp, 40
       xor      rax, rax
       mov      qword ptr [rsp+20H], rax
G_M10220_IG02:
       lea      rax, [rdx-1]
       mov      edx, 1
       sub      eax, dword ptr [rcx+24]
       cmp      eax, dword ptr [rcx+16]
       jae      SHORT G_M10220_IG04
       sub      edx, dword ptr [rcx+28]
       cmp      edx, dword ptr [rcx+20]
       jae      SHORT G_M10220_IG04
       mov      r8d, dword ptr [rcx+20]
       imul     r8, rax
       mov      rax, rdx
       add      rax, r8
       shl      rax, 4             ; <-------- changed to shift
       lea      rax, bword ptr [rcx+rax+32]
       cmp      dword ptr [rax], eax
       add      rax, 8
       mov      bword ptr [rsp+20H], rax
       mov      rax, bword ptr [rsp+20H]
       mov      rax, qword ptr [rax]
G_M10220_IG03:
       add      rsp, 40
       ret      
G_M10220_IG04:
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 89, prolog size 11 for method TestApp:test_88(ref,long):long

@JosephTremoulet
Copy link
Author

it might be productive to file an issue to track moving this to Lower.

Created #13150

@JosephTremoulet
Copy link
Author

@BruceForstall, good with update?

@JosephTremoulet JosephTremoulet merged commit 21d7b38 into dotnet:master Aug 2, 2017
@JosephTremoulet JosephTremoulet deleted the Mulshift branch August 2, 2017 18:28
@karelz karelz modified the milestone: 2.1.0 Aug 28, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants