Optimize bswap+mov to movbe on xarch #66965

aromaa · 2022-03-21T23:20:40Z

Adds lowering for the pattern BSWAP|BSWAP16(IND) and STOREIND(addr, BSWAP|BSWAP16(x)) on xarch and emits the movbe instruction.

Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization, as their code has been laid out in a way that is not easily recognizable. This is not fixed in this PR.

Fixes #953
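
As a rough illustration of the two source-level shapes the description refers to, here is a minimal C++ sketch. It is not JIT code: `ByteSwap32`, `LoadThenSwap` and `SwapThenStore` are hypothetical names that only mirror BSWAP(IND) and STOREIND(addr, BSWAP(x)).

```cpp
#include <cstdint>
#include <cstring>

// Portable 32-bit byte swap; stands in for the x86 bswap instruction.
constexpr std::uint32_t ByteSwap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Shape 1, BSWAP(IND): load from memory, then swap the loaded value.
inline std::uint32_t LoadThenSwap(const unsigned char* p)
{
    std::uint32_t v;
    std::memcpy(&v, p, sizeof v); // native-endian load
    return ByteSwap32(v);
}

// Shape 2, STOREIND(addr, BSWAP(x)): swap the value, then store it to memory.
inline void SwapThenStore(unsigned char* p, std::uint32_t v)
{
    const std::uint32_t swapped = ByteSwap32(v);
    std::memcpy(p, &swapped, sizeof swapped); // native-endian store
}
```

In both shapes the swap sits directly next to the memory access, which is what allows a single load-and-swap or swap-and-store instruction to replace the pair.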

ghost · 2022-03-21T23:20:47Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Adds lowering for the pattern BSWAP(IND) and STOREIND(addr, BSWAP(x)) on xarch and emits the movbe instruction instead. This does not match the 16-bit node BSWAP16, as the importer wraps it inside a short <- int <- short cast, which makes it more complicated to deal with.

Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization as they use MemoryMarshal under the hood which breaks the pattern. This should be switched to use Unsafe to take advantage of this, which is not included in this PR.

Fixes #953

Author:	aromaa
Assignees:	-
Labels:	`area-CodeGen-coreclr`
Milestone:	-

Wraith2 · 2022-03-22T00:27:14Z

> Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization as they use MemoryMarshal under the hood which breaks the pattern. This should be switched to use Unsafe to take advantage of this, which is not included in this PR.

Can you expand on this a bit? What is it about them that breaks the pattern?
When I was investigating this, I discussed it with some people on Discord and chose to rewrite the read and write primitives to avoid a multi-use variable, which cleaned up the tree considerably for me.
Instead of:

public static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
{
    int result = MemoryMarshal.Read<int>(source);
    if (BitConverter.IsLittleEndian)
    {
        result = ReverseEndianness(result);
    }
    return result;
}

I used:

public static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
{
    if (BitConverter.IsLittleEndian)
    {
        return BinaryPrimitives.ReverseEndianness(MemoryMarshal.Read<int>(source));
    }
    else
    {
        return MemoryMarshal.Read<int>(source);
    }
}

aromaa · 2022-03-22T01:31:57Z

> Can you expand on this a bit? What is it about them that breaks the pattern?

Yes, the read case is trivial to solve as you mentioned above, and it gets optimized by this PR. The problematic one is MemoryMarshal.Write, which ends up creating bounds checks between the bswap and the mov, so the pattern can't be recognized easily. To fix this, the bounds check needs to be written manually before ReverseEndianness.

The IR for writing is the following:

N003 (???,???) [000115] ------------                 IL_OFFSET void   INLRT @ 0x000[E-] REG NA
N005 (  1,  1) [000091] -------N----        t91 =    LCL_VAR   byref  V00 arg0         u:1 rcx Zero Fseq[_pointer] REG rcx $80
                                                  /--*  t91    byref
N007 (  3,  2) [000092] n-----------        t92 = *  IND       byref  REG rax <l:$140, c:$81>
                                                  /--*  t92    byref
N009 (  7,  5) [000093] DA----------              *  STORE_LCL_VAR byref  V12 tmp10        d:1 rax REG rax
N011 (  1,  1) [000000] -------N----         t0 =    LCL_VAR   byref  V00 arg0         u:1 rcx (last use) REG rcx $80
                                                  /--*  t0     byref
N013 (  2,  2) [000096] -c----------        t96 = *  LEA(b+8)  byref  REG NA
                                                  /--*  t96    byref
N015 (  4,  4) [000097] n-----------        t97 = *  IND       int    REG rcx <l:$200, c:$c1>
                                                  /--*  t97    int
N017 (  4,  4) [000098] DA----------              *  STORE_LCL_VAR int    V13 tmp11        d:1 rcx REG rcx
N019 (???,???) [000116] ------------                 IL_OFFSET void   INL01 @ 0x000[E-] <- INLRT @ 0x000[E-] REG NA
N021 (  1,  1) [000001] ------------         t1 =    LCL_VAR   int    V01 arg1         u:1 rdx (last use) REG rdx $c0
                                                  /--*  t1     int
N023 (  2,  2) [000007] ------------         t7 = *  BSWAP     int    REG rdx $c2
                                                  /--*  t7     int
N025 (  2,  3) [000009] DA----------              *  STORE_LCL_VAR int    V04 tmp2         d:1 rdx REG rdx
N027 (???,???) [000117] ------------                 IL_OFFSET void   INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N029 (  3,  2) [000101] ------------       t101 =    LCL_VAR   byref  V12 tmp10        u:1 rax (last use) REG rax <l:$140, c:$81>
                                                  /--*  t101   byref
N031 (  3,  3) [000102] DA----------              *  STORE_LCL_VAR byref  V14 tmp12        d:1 rax REG rax
N033 (???,???) [000118] ------------                 IL_OFFSET void   INL03 @ ??? <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N035 (  1,  1) [000069] ------------        t69 =    LCL_VAR   int    V13 tmp11        u:1 rcx (last use) REG rcx <l:$200, c:$c1>
N037 (  1,  1) [000065] -c----------        t65 =    CNS_INT   int    4 REG NA $45
                                                  /--*  t69    int
                                                  +--*  t65    int
N039 (  3,  3) [000041] N------N-U--              *  LT        void   REG NA <l:$281, c:$280>
N041 (  5,  5) [000042] ------------              *  JTRUE     void   REG NA $VN.Void

------------ BB02 [000..001) (return), preds={BB01} succs={}
N045 (???,???) [000119] ------------                 IL_OFFSET void   INL03 @ 0x02B[E-] <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N047 (???,???) [000120] ------------                 IL_OFFSET void   INL08 @ 0x000[E-] <- INL03 @ 0x02B[E-] <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N049 (  1,  1) [000080] ------------        t80 =    LCL_VAR   byref  V14 tmp12        u:1 rax (last use) REG rax <l:$140, c:$81>
N051 (  1,  1) [000081] ------------        t81 =    LCL_VAR   int    V04 tmp2         u:1 rdx (last use) REG rdx $c2
                                                  /--*  t80    byref
                                                  +--*  t81    int
N053 (???,???) [000121] -A-XG-------              *  STOREIND  int    REG NA
N055 (???,???) [000122] ------------                 IL_OFFSET void   INLRT @ 0x007[E-] REG NA
N057 (  0,  0) [000005] ------------                 RETURN    void   REG NA $VN.Void

------------ BB03 [000..001) (throw), preds={BB01} succs={}
N061 (???,???) [000123] ------------                 IL_OFFSET void   INL03 @ 0x024[E-] <- INL01 @ ??? <- INLRT @ 0x000[E-] REG NA
N063 (  1,  1) [000052] ------------        t52 =    CNS_INT   int    41 REG rcx $46
                                                  /--*  t52    int
N065 (???,???) [000124] ------------       t124 = *  PUTARG_REG int    REG rcx
N067 (  2, 10) [000125] Hc----------       t125 =    CNS_INT(h) long   0x7ffa2e025c78 ftn REG NA
                                                  /--*  t125   long
N069 (  4, 12) [000126] -c----------       t126 = *  IND       long   REG NA
                                                  /--*  t124   int    arg0 in rcx
                                                  +--*  t126   long   control expr
N071 ( 15,  7) [000053] --CXG-------              *  CALL      void   System.ThrowHelper.ThrowArgumentOutOfRangeException REG NA $VN.Void
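
The ordering problem visible in this dump (BSWAP first, then the LT/JTRUE length check, then STOREIND in the next block) can be sketched in C++. This is only an illustration under assumptions: the function names are hypothetical, and a C++ exception stands in for the ThrowHelper call.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>

constexpr std::uint32_t ByteSwap32(std::uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) | ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Ordering seen in the dump: swap, then length check, then store.
// The check separates the swap from the store it feeds.
inline void WriteBigEndianSwapFirst(unsigned char* dst, std::size_t length, std::uint32_t value)
{
    const std::uint32_t swapped = ByteSwap32(value);
    if (length < sizeof swapped)
        throw std::out_of_range("destination too small");
    std::memcpy(dst, &swapped, sizeof swapped);
}

// Suggested ordering: check the length first, so the swap feeds the store directly.
inline void WriteBigEndianCheckFirst(unsigned char* dst, std::size_t length, std::uint32_t value)
{
    if (length < sizeof value)
        throw std::out_of_range("destination too small");
    const std::uint32_t swapped = ByteSwap32(value);
    std::memcpy(dst, &swapped, sizeof swapped);
}
```

Both functions behave identically; only the position of the check differs, which is the property a pattern match on a store of a swapped value depends on.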

MichalStrehovsky · 2022-03-22T02:26:37Z

Once this is ready, could you please also add this for NativeAOT configs? The blueprint for NativeAOT-specific changes is in #63563 - it should be mostly mechanical.

Since there's no public API, the extent of needed changes will be smaller than the above pull request.

jakobbotsch · 2022-04-24T12:00:07Z

@aromaa Can you let me know when this is ready to be reviewed again?

jakobbotsch · 2022-04-25T08:35:25Z

@aromaa can you merge from main? I think superpmi-diffs is failing because #68292 was merged in the meantime and it does not have a baseline release JIT that matches for this branch.

aromaa · 2022-04-25T08:57:27Z

> @aromaa can you merge from main? I think superpmi-diffs is failing because #68292 was merged in the meantime and it does not have a baseline release JIT that matches for this branch.

The diffs are actually failing because jiteeversionguid.h was changed in this PR, since the ISA was modified. You would get bogus diffs if it tried to run them. I tried to do a local collection a few days ago but had some trouble with it; I'm planning to get the diffs before merging.

jakobbotsch · 2022-04-25T09:09:56Z

> The diffs are actually failing because jiteeversionguid.h was changed in this PR, since the ISA was modified. You would get bogus diffs if it tried to run them. I tried to do a local collection a few days ago but had some trouble with it; I'm planning to get the diffs before merging.

Ah, of course. Usually in this case we would use jit-diff instead. But your change is also not really incompatible with previous collections, so it might be easiest to just make a temporary hack of the JIT-EE GUID/ISA check to collect SPMI diffs.

aromaa · 2022-04-25T09:26:31Z

> Ah, of course. Usually in this case we would use jit-diff instead. But your change is also not really incompatible with previous collections, so it might be easiest to just make a temporary hack of the JIT-EE GUID/ISA check to collect SPMI diffs.

I tried changing that, but it gave me bogus diffs where POPCNT and MOVBE were missing, so I gave up on that. Removing the ISA before running the diffs works fine, but I didn't bother to do that too many times and relied on the test cases.

aromaa · 2022-04-26T00:52:28Z

Summary of Perf Score diffs:
(Lower is better)

Total PerfScoreUnits of base: 2.4243817735148437E+31
Total PerfScoreUnits of diff: 2.4243817735148437E+31
Total PerfScoreUnits of delta: -16,86 (-0.00 % of base)
Total relative delta: NaN
    diff is an improvement.
    relative diff is a regression.

Detail diffs

Top file regressions (PerfScoreUnits):
        0,25 : System.Net.Sockets.dasm (0,00 % of base)
        0,20 : System.Diagnostics.DiagnosticSource.dasm (0,00 % of base)

Top file improvements (PerfScoreUnits):
       -9,70 : System.Private.CoreLib.dasm (-0,00 % of base)
       -5,20 : System.Formats.Cbor.dasm (-0,03 % of base)
       -2,34 : System.Memory.dasm (-0,00 % of base)
       -0,07 : System.Net.Primitives.dasm (-0,00 % of base)

6 total files with Perf Score differences (4 improved, 2 regressed), 265 unchanged.

Top method regressions (PerfScoreUnits):
        0,90 (5,59 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,90 (5,59 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,50 (3,25 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,50 (3,25 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,40 (0,86 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivitySpanId:.ctor(System.ReadOnlySpan`1[Byte]):this
        0,25 (0,49 % of base) : System.Net.Sockets.dasm - System.Net.Sockets.SocketPal:SetMulticastOption(System.Net.Sockets.SafeSocketHandle,int,System.Net.Sockets.MulticastOption):int

Top method improvements (PerfScoreUnits):
       -2,00 (-2,11 % of base) : System.Private.CoreLib.dasm - System.Guid:<TryParseExactD>g__TryCompatParsing|30_0(System.ReadOnlySpan`1[Char],byref):bool
       -1,80 (-1,28 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadHalf():System.Half:this
       -1,80 (-9,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryWriteHalfBigEndian(System.Span`1[Byte],System.Half):bool
       -1,80 (-10,37 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:WriteHalfBigEndian(System.Span`1[Byte],System.Half)
       -1,80 (-11,73 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-11,73 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-2,52 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborWriter:WriteHalf(System.Half):this
       -1,25 (-8,29 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,25 (-8,94 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,22 (-0,76 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadReverseEndianness(byref,byref):bool (3 methods)
       -1,12 (-1,01 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadBigEndian(byref,byref):bool (3 methods)
       -0,80 (-0,46 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadSingle():float:this
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,40 (-0,20 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadDouble():double:this
       -0,20 (-0,23 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivityTraceId:.ctor(System.ReadOnlySpan`1[Byte]):this
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,07 (-0,26 % of base) : System.Net.Primitives.dasm - System.Net.IPAddress:MapToIPv6():System.Net.IPAddress:this

Top method regressions (percentages):
        0,90 (5,59 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,90 (5,59 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
        0,50 (3,25 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,50 (3,25 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadSingleBigEndian(System.ReadOnlySpan`1[Byte]):float
        0,40 (0,86 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivitySpanId:.ctor(System.ReadOnlySpan`1[Byte]):this
        0,25 (0,49 % of base) : System.Net.Sockets.dasm - System.Net.Sockets.SocketPal:SetMulticastOption(System.Net.Sockets.SafeSocketHandle,int,System.Net.Sockets.MulticastOption):int

Top method improvements (percentages):
       -1,80 (-11,73 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-11,73 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborHelpers:ReadHalfBigEndian(System.ReadOnlySpan`1[Byte]):System.Half
       -1,80 (-10,37 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:WriteHalfBigEndian(System.Span`1[Byte],System.Half)
       -1,80 (-9,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryWriteHalfBigEndian(System.Span`1[Byte],System.Half):bool
       -1,25 (-8,94 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,25 (-8,29 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt16BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,50 (-3,84 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt32BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -1,80 (-2,52 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborWriter:WriteHalf(System.Half):this
       -2,00 (-2,11 % of base) : System.Private.CoreLib.dasm - System.Guid:<TryParseExactD>g__TryCompatParsing|30_0(System.ReadOnlySpan`1[Char],byref):bool
       -1,80 (-1,28 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadHalf():System.Half:this
       -1,12 (-1,01 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadBigEndian(byref,byref):bool (3 methods)
       -1,22 (-0,76 % of base) : System.Memory.dasm - System.Buffers.SequenceReaderExtensions:TryReadReverseEndianness(byref,byref):bool (3 methods)
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,10 (-0,72 % of base) : System.Private.CoreLib.dasm - System.Buffers.Binary.BinaryPrimitives:TryReadUInt64BigEndian(System.ReadOnlySpan`1[Byte],byref):bool
       -0,80 (-0,46 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadSingle():float:this
       -0,07 (-0,26 % of base) : System.Net.Primitives.dasm - System.Net.IPAddress:MapToIPv6():System.Net.IPAddress:this
       -0,20 (-0,23 % of base) : System.Diagnostics.DiagnosticSource.dasm - System.Diagnostics.ActivityTraceId:.ctor(System.ReadOnlySpan`1[Byte]):this
       -0,40 (-0,20 % of base) : System.Formats.Cbor.dasm - System.Formats.Cbor.CborReader:ReadDouble():double:this

25 total methods with Perf Score differences (19 improved, 6 regressed), 378877 unchanged.

Regression example

@@ -2,15 +2,14 @@ G_M10490_IG01:
        sub      rsp, 40
        vzeroupper
                                                ;; size=7 bbWeight=0    PerfScore 0.00
 G_M10490_IG02:
        mov      rax, bword ptr [rcx]
        mov      ecx, dword ptr [rcx+8]
        cmp      ecx, 8
        jl       SHORT G_M10490_IG04
-       mov      rcx, qword ptr [rax]
-       bswap    rcx
+       movbe    rcx, qword ptr [rax]
        vmovd    xmm0, rcx
-                                               ;; size=22 bbWeight=1    PerfScore 10.25
+                                               ;; size=21 bbWeight=1    PerfScore 11.25
 G_M10490_IG03:
        add      rsp, 40
        ret
@@ -15,10 +14,10 @@ G_M10490_IG03:
        add      rsp, 40
        ret
                                                ;; size=5 bbWeight=1    PerfScore 1.25
 G_M10490_IG04:
        mov      ecx, 41
        call     [System.ThrowHelper:ThrowArgumentOutOfRangeException(int)]
        int3
                                                ;; size=12 bbWeight=0    PerfScore 0.00

-; Total bytes of code 46, prolog size 7, PerfScore 16.10, instruction count 14, allocated bytes for code 46 (MethodHash=cb5ed705) for method System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
+; Total bytes of code 45, prolog size 7, PerfScore 17.00, instruction count 13, allocated bytes for code 45 (MethodHash=cb5ed705) for method System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double

jakobbotsch

Looks good now, just a couple of small nits.

jakobbotsch · 2022-04-26T09:18:28Z

> Regression

I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies.

jakobbotsch · 2022-04-26T09:22:13Z

cc @tannergooding, can you take a look just to confirm that the various ISA table entries look right?

aromaa · 2022-04-26T10:38:11Z

> I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies.

I investigated it a bit, and it looks to be due to the logic in emitter::insEvaluateExecutionCost. It decreases the latency by one per instruction and then takes max(throughput, latency), which ends up with a value of 0.5 for 16- and 32-bit bswap and 1 for 64-bit bswap. So with otherwise identical perf scores, we ironically get a higher perf score estimate because of having one fewer instruction.

Thank you for the reviews! Learned a lot and hopefully further optimization attempts go more smoothly :)
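
The scoring effect described in this comment can be sketched as a toy model. This is an assumption-laden illustration, not the code of emitter::insEvaluateExecutionCost: the `InsCost` struct, the helper names and every throughput and latency figure below are hypothetical and chosen only to show how fusing two instructions can raise the estimate.

```cpp
// Hypothetical per-instruction characteristics; not the JIT's actual tables.
struct InsCost
{
    double throughput; // assumed reciprocal throughput, in cycles
    double latency;    // assumed latency, in cycles
};

// Toy version of the rule from the comment: discount one cycle of latency
// per instruction, then charge the larger of throughput and that latency.
constexpr double EstimatedCost(InsCost ins)
{
    return (ins.throughput > ins.latency - 1.0) ? ins.throughput : ins.latency - 1.0;
}

// Illustrative inputs only (assumptions, not measured data).
constexpr InsCost kMovLoad{0.5, 2.0};
constexpr InsCost kBswap64{1.0, 2.0};
constexpr InsCost kMovbe64{1.0, 4.0};

// Two instructions: each one gets its own one-cycle latency discount.
constexpr double SeparateCost()
{
    return EstimatedCost(kMovLoad) + EstimatedCost(kBswap64);
}

// One fused instruction: only a single discount is applied.
constexpr double FusedCost()
{
    return EstimatedCost(kMovbe64);
}
```

With these assumed inputs the separate sequence is charged 2.0 and the fused instruction 3.0, even though the raw latencies add up to the same 4 cycles, because the discount is applied once per instruction.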

jakobbotsch · 2022-04-26T10:51:47Z

> I investigated it a bit, and it looks to be due to the logic in emitter::insEvaluateExecutionCost. It decreases the latency by one per instruction and then takes max(throughput, latency), which ends up with a value of 0.5 for 16- and 32-bit bswap and 1 for 64-bit bswap. So with otherwise identical perf scores, we ironically get a higher perf score estimate because of having one fewer instruction.

Ah, that's quite unfortunate, but good to know. Probably something we ought to look into.

> Thank you for the reviews! Learned a lot and hopefully further optimization attempts go more smoothly :)

Don't worry about it, so did I. And FWIW, I would not consider the process of this PR unsmooth -- the JIT is complex and the optimization you made in this PR has to deal with a lot of the details, so it is understandable that there were a few corner cases that required some extra treatment.

jakobbotsch · 2022-05-02T15:22:18Z

ping @tannergooding for a review of the ISA related changes

tannergooding · 2022-05-06T12:54:30Z

@aromaa, could you please resolve the merge conflict? You should be able to just keep the GUID already generated for this PR.

We should be able to merge this once that's in (provided CI is passing).

aromaa · 2022-05-06T20:12:57Z

Failure is #68690. Not sure why the Mono leg is failing; there doesn't seem to be much to go on?

jakobbotsch · 2022-05-08T12:35:27Z

Failures looked unrelated. Thanks for the contribution!

pentp · 2022-05-09T12:49:19Z

> I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies.

> I investigated it a bit, and it looks to be due to the logic in emitter::insEvaluateExecutionCost. It decreases the latency by one per instruction and then takes max(throughput, latency), which ends up with a value of 0.5 for 16- and 32-bit bswap and 1 for 64-bit bswap. So with otherwise identical perf scores, we ironically get a higher perf score estimate because of having one fewer instruction.

This is probably not a mistake. According to uops.info:
On Skylake-X, mov has a min. latency of 2 and 64-bit bswap has a latency of 2.
On Skylake-X, 64-bit movbe has a min. latency of 4.
On AMD, movbe has a latency of 6, while mov is 5 and bswap is 1.
So this optimization might in some cases actually be slower.

aromaa requested a review from MichalStrehovsky as a code owner March 21, 2022 23:20

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 21, 2022

ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 21, 2022

Wraith2 reviewed Mar 22, 2022

jakobbotsch reviewed Mar 22, 2022

jakobbotsch reviewed Mar 22, 2022

This was referenced Mar 22, 2022

ThreadPoolTests.CooperativeBlockingCanCreateThreadsFaster timed out #64964

Closed

Threadpool test CooperativeBlockingCanCreateThreadsFaster failing on Mac #66852

Closed

SingleAccretion reviewed Mar 22, 2022

tannergooding self-requested a review March 27, 2022 06:31

aromaa force-pushed the opts/movbe branch from 297b178 to 001448f Compare March 27, 2022 18:08

JulieLeeMSFT assigned aromaa and jakobbotsch Apr 4, 2022

JulieLeeMSFT added this to the 7.0.0 milestone Apr 4, 2022

jakobbotsch reviewed Apr 7, 2022

jakobbotsch reviewed Apr 7, 2022

This was referenced Apr 7, 2022

Fix wrong constant folding for bswap16 #67726

Closed

Make BSWAP16 nodes normalize upper 16 bits #67903

Merged

aromaa added 6 commits April 22, 2022 02:06

Optimize bswap+mov to movbe

d444a91

Fix build

0b6d2b9

PR feedback

3866ab1

Support BSWAP16

20cb5d9

Add NativeAOT configs

e35aa3c

PR feedback

4a99422

aromaa force-pushed the opts/movbe branch from 001448f to 4a99422 Compare April 22, 2022 20:12

jakobbotsch reviewed Apr 25, 2022

Fix BSWAP16 loads

f068a16

jakobbotsch reviewed Apr 26, 2022

Nit

548d5df

jakobbotsch reviewed Apr 26, 2022

Ensure same size on store

0273f69

This was referenced Apr 26, 2022

jit/performance/codequality/benchmarksgame/k-nucleotide/k-nucleotide-9/k-nucleotide-9.sh #67675

Closed

jit.1 work item failing on mono #67888

Closed

jakobbotsch approved these changes Apr 26, 2022

tannergooding approved these changes May 5, 2022

Merge branch 'main' of https://github.com/dotnet/runtime into opts/movbe

bee2b34

jakobbotsch merged commit 24714ef into dotnet:main May 8, 2022

aromaa deleted the opts/movbe branch May 8, 2022 12:42

aromaa mentioned this pull request May 9, 2022

Adjust the usage of ReverseEndianness in BinaryPrimitives #69063

Merged

JulieLeeMSFT mentioned this pull request Jun 3, 2022

What's new in .NET 7 Preview 5 [WIP] dotnet/core#7441

Closed

ghost locked as resolved and limited conversation to collaborators Jun 8, 2022
