Use HW-intrinsics in BitConverter for double <-> long / float <-> int #33476

Merged
merged 2 commits into from
Mar 20, 2020

gfoidl
Member

@gfoidl gfoidl commented Mar 11, 2020

This change uses hardware intrinsics so that the JIT emits movd instead of going through the stack on hardware that has SSE2.

Ideally the JIT would emit code like this, so this workaround isn't needed.
(But my knowledge of JIT-programming is too limited to make the proper change over there.)

Cf. #12733 (comment) and #33057 (comment)
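
For readers who want to see the shape of such a workaround, here is a minimal sketch. It is not the code merged in this PR: the `BitCastSketch` type is a hypothetical name, and falling back to `BitConverter` when SSE2 is unavailable is an assumption made so that the example compiles without unsafe code. Only the intrinsic calls themselves are existing .NET APIs.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper for illustration only; not the actual BitConverter source.
static class BitCastSketch
{
    public static long DoubleToInt64Bits(double value)
    {
        if (Sse2.X64.IsSupported)
        {
            // Place the double in the low lane of a vector register, view the
            // lanes as Int64, then move the low 64 bits to a general-purpose register.
            Vector128<long> vec = Vector128.CreateScalarUnsafe(value).AsInt64();
            return Sse2.X64.ConvertToInt64(vec);
        }

        // Assumed fallback for hardware without SSE2 on x64.
        return BitConverter.DoubleToInt64Bits(value);
    }

    public static double Int64BitsToDouble(long value)
    {
        if (Sse2.X64.IsSupported)
        {
            // Move the integer into the low lane and read it back as a double.
            Vector128<double> vec = Vector128.CreateScalarUnsafe(value).AsDouble();
            return vec.ToScalar();
        }

        return BitConverter.Int64BitsToDouble(value);
    }

    public static int SingleToInt32Bits(float value)
    {
        if (Sse2.IsSupported)
        {
            Vector128<int> vec = Vector128.CreateScalarUnsafe(value).AsInt32();
            return Sse2.ConvertToInt32(vec);
        }

        return BitConverter.SingleToInt32Bits(value);
    }

    public static float Int32BitsToSingle(int value)
    {
        if (Sse2.IsSupported)
        {
            Vector128<float> vec = Vector128.CreateScalarUnsafe(value).AsSingle();
            return vec.ToScalar();
        }

        return BitConverter.Int32BitsToSingle(value);
    }
}
```

The `IsSupported` checks are treated as constants by the JIT, so only one branch should remain in the generated code. The 64-bit variants check `Sse2.X64` because moving 64 bits between a vector register and a general-purpose register is only available in 64-bit mode.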

Code used for generating the asm-dumps
using System;
using System.Runtime.CompilerServices;

namespace ConsoleApp4
{
    class Program
    {
        static int Main(string[] args)
        {
            long lval   = Double2Long(Math.PI);
            double dval = Long2Double(lval);

            int ival   = Float2Int(MathF.PI);
            float fval = Int2Float(ival);

            return dval == Math.PI && fval == MathF.PI ? 0 : 1;
        }

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static long Double2Long(double value) => BitConverter.DoubleToInt64Bits(value);

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static double Long2Double(long value) => BitConverter.Int64BitsToDouble(value);

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static int Float2Int(float value) => BitConverter.SingleToInt32Bits(value);

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static float Int2Float(int value) => BitConverter.Int32BitsToSingle(value);
    }
}
asm before
; Assembly listing for method ConsoleApp4.Program:Double2Long(double):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )  double  ->  mm0
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )  double  ->  [rsp+0x00]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 8

G_M41745_IG01:
       50                   push     rax
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 2.00
G_M41745_IG02:
       C5FB110424           vmovsd   qword ptr [rsp], xmm0
       488B0424             mov      rax, qword ptr [rsp]
                        ;; bbWeight=1    PerfScore 1.50
G_M41745_IG03:
       4883C408             add      rsp, 8
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 18, prolog size 4, PerfScore 6.65, (MethodHash=784c5cee) for method ConsoleApp4.Program:Double2Long(double):long
; ============================================================

; Assembly listing for method ConsoleApp4.Program:Long2Double(long):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->  rdi
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )    long  ->  [rsp+0x00]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 8

G_M5681_IG01:
       50                   push     rax
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 2.00
G_M5681_IG02:
       48893C24             mov      qword ptr [rsp], rdi
       C5FB100424           vmovsd   xmm0, qword ptr [rsp]
                        ;; bbWeight=1    PerfScore 3.00
G_M5681_IG03:
       4883C408             add      rsp, 8
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 18, prolog size 4, PerfScore 8.15, (MethodHash=79fee9ce) for method ConsoleApp4.Program:Long2Double(long):double
; ============================================================

; Assembly listing for method ConsoleApp4.Program:Float2Int(float):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   float  ->  mm0
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )   float  ->  [rsp+0x04]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 8

G_M977_IG01:
       50                   push     rax
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 2.00
G_M977_IG02:
       C5FA11442404         vmovss   dword ptr [rsp+04H], xmm0
       8B442404             mov      eax, dword ptr [rsp+04H]
                        ;; bbWeight=1    PerfScore 1.50
G_M977_IG03:
       4883C408             add      rsp, 8
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 19, prolog size 4, PerfScore 6.75, (MethodHash=2cd6fc2e) for method ConsoleApp4.Program:Float2Int(float):int
; ============================================================

; Assembly listing for method ConsoleApp4.Program:Int2Float(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->  rdi
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  4   )     int  ->  [rsp+0x04]   do-not-enreg[F] ld-addr-op "Inlining Arg"
;
; Lcl frame size = 8

G_M15857_IG01:
       50                   push     rax
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 2.00
G_M15857_IG02:
       897C2404             mov      dword ptr [rsp+04H], edi
       C5FA10442404         vmovss   xmm0, dword ptr [rsp+04H]
                        ;; bbWeight=1    PerfScore 3.00
G_M15857_IG03:
       4883C408             add      rsp, 8
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.25

; Total bytes of code 19, prolog size 4, PerfScore 8.25, (MethodHash=ce7cc20e) for method ConsoleApp4.Program:Int2Float(int):float
; ============================================================
asm after
; Assembly listing for method ConsoleApp4.Program:Double2Long(double):long
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  3,  3   )  double  ->  mm0
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  2,  2   )    long  ->  rax         "Inline return value spill temp"
;* V03 tmp2         [V03    ] (  0,  0   )  double  ->  zero-ref    ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0

G_M41745_IG01:
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 1.00
G_M41745_IG02:
       C4E1F97EC0           vmovd    rax, xmm0
                        ;; bbWeight=1    PerfScore 1.00
G_M41745_IG03:
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 9, prolog size 3, PerfScore 3.90, (MethodHash=784c5cee) for method ConsoleApp4.Program:Double2Long(double):long
; ============================================================

; Assembly listing for method ConsoleApp4.Program:Long2Double(long):double
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->  rdi
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  2   )  double  ->  mm0         "Inline return value spill temp"
;* V03 tmp2         [V03    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0

G_M5681_IG01:
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 1.00
G_M5681_IG02:
       C4E1F96EC7           vmovd    xmm0, rdi
                        ;; bbWeight=1    PerfScore 1.00
G_M5681_IG03:
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 9, prolog size 3, PerfScore 3.90, (MethodHash=79fee9ce) for method ConsoleApp4.Program:Long2Double(long):double
; ============================================================

; Assembly listing for method ConsoleApp4.Program:Float2Int(float):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T01] (  3,  3   )   float  ->  mm0
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T00] (  2,  2   )     int  ->  rax         "Inline return value spill temp"
;* V03 tmp2         [V03    ] (  0,  0   )   float  ->  zero-ref    ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0

G_M977_IG01:
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 1.00
G_M977_IG02:
       C5F97EC0             vmovd    eax, xmm0
                        ;; bbWeight=1    PerfScore 1.00
G_M977_IG03:
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 8, prolog size 3, PerfScore 3.90, (MethodHash=2cd6fc2e) for method ConsoleApp4.Program:Float2Int(float):int
; ============================================================

; Assembly listing for method ConsoleApp4.Program:Int2Float(int):float
; Emitting BLENDED_CODE for X64 CPU with AVX - Unix
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->  rdi
;# V01 OutArgs      [V01    ] (  1,  1   )  lclBlk ( 0) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T01] (  2,  2   )   float  ->  mm0         "Inline return value spill temp"
;* V03 tmp2         [V03    ] (  0,  0   )     int  ->  zero-ref    ld-addr-op "Inlining Arg"
;
; Lcl frame size = 0

G_M15857_IG01:
       C5F877               vzeroupper
                        ;; bbWeight=1    PerfScore 1.00
G_M15857_IG02:
       C5F96EC7             vmovd    xmm0, edi
                        ;; bbWeight=1    PerfScore 1.00
G_M15857_IG03:
       C3                   ret
                        ;; bbWeight=1    PerfScore 1.00

; Total bytes of code 8, prolog size 3, PerfScore 3.90, (MethodHash=ce7cc20e) for method ConsoleApp4.Program:Int2Float(int):float
; ============================================================
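
The listings above only show the instruction sequence; any such change also has to leave the bit patterns untouched. A small check along the following lines can confirm that. This is a sketch, with `BitPatternChecks` as a hypothetical name, and it uses only the public `BitConverter` methods exercised by the test program above.

```csharp
using System;

// Hypothetical helper for illustration; verifies that conversions preserve bits.
static class BitPatternChecks
{
    // True when double -> long -> double yields the same 64-bit pattern.
    public static bool RoundTrips(double value)
    {
        long bits = BitConverter.DoubleToInt64Bits(value);
        double back = BitConverter.Int64BitsToDouble(bits);
        return BitConverter.DoubleToInt64Bits(back) == bits;
    }

    // True when float -> int -> float yields the same 32-bit pattern.
    public static bool RoundTrips(float value)
    {
        int bits = BitConverter.SingleToInt32Bits(value);
        float back = BitConverter.Int32BitsToSingle(bits);
        return BitConverter.SingleToInt32Bits(back) == bits;
    }

    // True when the conversions agree with well-known IEEE 754 encodings.
    public static bool KnownEncodingsMatch()
    {
        return BitConverter.DoubleToInt64Bits(Math.PI) == 0x400921FB54442D18L
            && BitConverter.SingleToInt32Bits(MathF.PI) == 0x40490FDB
            && BitConverter.SingleToInt32Bits(1.0f) == 0x3F800000
            // Only the sign bit set encodes negative zero.
            && double.IsNegative(BitConverter.Int64BitsToDouble(long.MinValue));
    }
}
```

Comparing bit patterns rather than floating-point values matters here because `==` on floating-point values treats 0.0 and -0.0 as equal and never reports NaN as equal to itself.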

/cc: @tannergooding

@EgorBo
Member

EgorBo commented Mar 11, 2020

I guess this fixes #11413 (unless that was expected to be implemented in the JIT instead).

@GrabYourPitchforks
Member

Any concerns with this affecting AOT? I think there were some recent improvements to how AOT and SSE2 work together, but it would be good to get confirmation that this won't negatively impact anything.

@tannergooding
Member

tannergooding commented Mar 11, 2020

> Any concerns with this affecting AOT

There shouldn't be. This is SSE/SSE2 so it should be part of the baseline instruction set.

cc @dotnet/jit-contrib, who should also review and may be able to say whether this is better handled in the JIT (maybe reinterpret casts for int/uint<->float and long/ulong<->double can be recognized more generally and replaced with this).

Edit: If it can/should be in the JIT, but that work is unlikely to be done for .NET 5, then it might be worthwhile to take the change and to track removing it when the JIT can do this more generally.
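
To make the "reinterpret casting" mentioned above concrete, here is a sketch of one way such casts are commonly written in user code, using `Unsafe.As`. The `ReinterpretPatterns` type is a hypothetical name, and whether the JIT recognizes these forms and lowers them to a register move is not stated in this thread; the sketch only shows the kind of source pattern under discussion.

```csharp
using System;
using System.Runtime.CompilerServices;

// Hypothetical examples of reinterpret casts between same-sized primitives.
static class ReinterpretPatterns
{
    // Each method reinterprets the storage of its by-value parameter.
    // Taking a reference to the parameter is what can force it onto the stack.
    public static uint FloatToUInt32(float value) => Unsafe.As<float, uint>(ref value);

    public static float UInt32ToFloat(uint value) => Unsafe.As<uint, float>(ref value);

    public static ulong DoubleToUInt64(double value) => Unsafe.As<double, ulong>(ref value);

    public static double UInt64ToDouble(ulong value) => Unsafe.As<ulong, double>(ref value);
}
```

The unsigned variants are included because the comment above mentions uint and ulong as well as int and long.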

@tannergooding
Member

ping @dotnet/jit-contrib

Member

@GrabYourPitchforks GrabYourPitchforks left a comment

Assuming no JIT or AOT issues

@CarolEidt
Contributor

I don't have a problem with taking this improvement. It might be nice to add a comment explaining that this is working around the fact that the JIT doesn't produce the code that one might expect (and possibly leaving #11413 open to address that workaround).

@gfoidl
Member Author

gfoidl commented Mar 19, 2020

> add a comment explaining that this is working around

Added in d271fd3

@tannergooding
Member

Thanks for the contribution @gfoidl!

@tannergooding tannergooding merged commit 107fbc1 into dotnet:master Mar 20, 2020
@gfoidl gfoidl deleted the bitconverter-stack-usage branch March 20, 2020 16:28
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020