-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize bswap+mov to movbe on xarch #66965
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsAdds lowering for the pattern BSWAP(IND) ands STOREIND(addr, BSWAP(x)) on xarch and emits the Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization as they use MemoryMarshal under the hood which breaks the pattern. This should be switched to use Unsafe to take advantage of this, which is not included in this PR. Fixes #953
|
Can you expand on this a bit? What is it about them that breaks the pattern? public static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
{
int result = MemoryMarshal.Read<int>(source);
if (BitConverter.IsLittleEndian)
{
result = ReverseEndianness(result);
}
return result;
} I used: public static int ReadInt32BigEndian(ReadOnlySpan<byte> source)
{
if (BitConverter.IsLittleEndian)
{
return BinaryPrimitives.ReverseEndianness(MemoryMarshal.Read<int>(source));
}
else
{
return MemoryMarshal.Read<int>(source);
}
} |
Yes, the read one is trivial to solve as you mentioned above and gets optimized by this PR. But the problematic one is the MemoryMarshal.Write, which ends up creating bound checks between bswap and mov and can't be recognized easily. To fix this the bound check needs to be manually written before ReverseEndianess. The IR for writing is following:
|
Once this is ready, could you please also add this for NativeAOT configs? The blueprint for NativeAOT-specific changes is in #63563 - it should be mostly mechanical. Since there's no public API, the extent of needed changes will be smaller than the above pull request. |
@aromaa Can you let me know when this is ready to be reviewed again? |
The diffs are actually failing because the jiteeversionguid.h was changed in the PR because the ISA was modified. You would get bogus diffs if it tried to run them. I tried to do local collection but I had some trouble on it few days ago, but I'm planning to get the diffs before merging. |
Af of course. Usually in this case we would use jit-diff instead. But your change also not really incompatible with previous collections, so it might be the easiest to just make a temporary hack of the JIT-EE GUID/ISA check to collect SPMI diffs. |
I tried changing that but it gives me bogus diffs where POPCNT and MOVBE were missing so I gave up on that. Removing the ISA before running the diffs works fine but I didint bother to do that too many times and relied on the test cases. |
Detail diffs
Regression example@@ -2,15 +2,14 @@ G_M10490_IG01:
sub rsp, 40
vzeroupper
;; size=7 bbWeight=0 PerfScore 0.00
G_M10490_IG02:
mov rax, bword ptr [rcx]
mov ecx, dword ptr [rcx+8]
cmp ecx, 8
jl SHORT G_M10490_IG04
- mov rcx, qword ptr [rax]
- bswap rcx
+ movbe rcx, qword ptr [rax]
vmovd xmm0, rcx
- ;; size=22 bbWeight=1 PerfScore 10.25
+ ;; size=21 bbWeight=1 PerfScore 11.25
G_M10490_IG03:
add rsp, 40
ret
@@ -15,10 +14,10 @@ G_M10490_IG03:
add rsp, 40
ret
;; size=5 bbWeight=1 PerfScore 1.25
G_M10490_IG04:
mov ecx, 41
call [System.ThrowHelper:ThrowArgumentOutOfRangeException(int)]
int3
;; size=12 bbWeight=0 PerfScore 0.00
-; Total bytes of code 46, prolog size 7, PerfScore 16.10, instruction count 14, allocated bytes for code 46 (MethodHash=cb5ed705) for method System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double
+; Total bytes of code 45, prolog size 7, PerfScore 17.00, instruction count 13, allocated bytes for code 45 (MethodHash=cb5ed705) for method System.Buffers.Binary.BinaryPrimitives:ReadDoubleBigEndian(System.ReadOnlySpan`1[Byte]):double |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good now, just a couple of small nits.
I'm not sure why some of these would show up as perfscore regressions but I wouldn't worry too much about it. I would expect this transform to be an improvement whenever it applies. |
cc @tannergooding, can you take a look just to confirm that the various ISA table entries look right? |
I investigated it a bit and looks like to be due to the logic in Thank you for the reviews! Learned a lot and hopefully further optimization attempts go more smoothly :) |
Ah, that's quite unfortunate, but good to know. Probably something we ought to look into.
Don't worry about it, so did I. And FWIW, I would not consider the process of this PR unsmooth -- the JIT is complex and the optimization you made in this PR has to deal with a lot of the details, so it is understandable that there were a few corner cases that requires some extra treatment. |
ping @tannergooding for a review of the ISA related changes |
@aromaa, could you please resolve the merge conflict? You should just be able to just keep the guid already generated for this PR. We should be able to merge this once that's in (provided CI is passing). |
Failure is #68690. Not sure why the Mono leg is failing, there doesn't seem to be much to go with? |
Failures looked unrelated. Thanks for the contribution! |
This is probably not a mistake - according to uops.info: |
Adds lowering for the pattern BSWAP|BSWAP16(IND) and STOREIND(addr, BSWAP|BSWAP16(x)) on xarch and emits the
movbe
instruction.Methods using the BinaryPrimitives read & write helpers do not yet benefit from this optimization as their code has been layed out in a way that is not easily recognizable. This is not fixed in this PR.
Fixes #953