[JIT] [Issue: 61620] Optimizing ARM64 for *x = dblCns; #61847
Conversation
Tagging subscribers to this area: @JulieLeeMSFT

Issue Details

Old assembly code for ARM64:

    movi    v16.16b, #0x00
    str     s16, [x1]

New assembly code for ARM64:

    str     wzr, [x0]

Other changes:

Previously, the code was like this:

    data->SetContained();
    data->BashToConst(intCns, type);

Now:

    data->BashToConst(intCns, type);
    #if defined(TARGET_ARM64)
    data->SetContained();
    #endif

BashToConst zeroes the node flags, so the earlier SetContained call had no effect.
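For background, the change relies on the fact that an FP constant whose bit pattern is known at JIT time can be stored through a pointer as a same-sized integer. Below is a minimal, self-contained C++ sketch of that bit reinterpretation (illustrative only; in the JIT, BashToConst operates on GenTree nodes rather than on raw values):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reinterpret the bits of an FP constant as a same-sized integer. This is
// what lets the store be emitted as an integer store (e.g. "str wzr, [x0]"
// for *x = 0.0f) instead of first materializing an FP register.
static uint32_t FloatBits(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits)); // bit-exact, no numeric conversion
    return bits;
}

static uint64_t DoubleBits(double d)
{
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));
    return bits;
}

int main()
{
    std::printf("0.0f -> 0x%08X\n", (unsigned)FloatBits(0.0f)); // 0x00000000: storable via wzr
    std::printf("1.0f -> 0x%08X\n", (unsigned)FloatBits(1.0f)); // 0x3F800000
    std::printf("2.2f -> 0x%08X\n", (unsigned)FloatBits(2.2f)); // 0x400CCCCD, as in the x86 listing further down
    std::printf("0.0  -> 0x%016llX\n", (unsigned long long)DoubleBits(0.0)); // storable via xzr
    return 0;
}
```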
src/coreclr/jit/lower.cpp (outdated)

@@ -6793,6 +6793,40 @@ void Lowering::LowerStoreIndirCommon(GenTreeStoreInd* ind)
     if (!comp->codeGen->gcInfo.gcIsWriteBarrierStoreIndNode(ind))
     {
         LowerStoreIndir(ind);
The pattern in lower.cpp is that LowerCommonX methods call LowerX methods. It may be that LowerStoreIndir has modified ind by the time we check for our optimization opportunity.

So I suggest a reorder: move the optimization before LowerStoreIndir. You should then be able to delete the #if for SetContained, as it would be dead code - LowerStoreIndir should check for the containment as appropriate.

Also: is containing all FP constants worth it (does it even work?) on ARM/ARM64? The codegen for things like *x = 1.1 should be checked.
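To make the *x = 1.1 concern concrete: a general double constant has a 64-bit pattern that ARM64 may need a movz plus up to three movk instructions to materialize (or a load from the constant pool), so retyping such a store into an integer store is not an obvious win. A small self-contained check of that bit pattern (not part of the PR, purely illustrative):

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

int main()
{
    double d = 1.1;
    uint64_t bits;
    std::memcpy(&bits, &d, sizeof(bits));

    // 1.1 -> 0x3FF199999999999A: every 16-bit chunk is non-zero, so building
    // it as an integer immediate on ARM64 would take movz + 3 * movk, whereas
    // the FP path shown later in this thread is a single ldr from the
    // constant pool followed by the store.
    std::printf("bits(1.1) = 0x%016llX\n", (unsigned long long)bits);

    int nonZeroChunks = 0;
    for (int i = 0; i < 4; i++)
    {
        if (((bits >> (16 * i)) & 0xFFFF) != 0)
        {
            nonZeroChunks++;
        }
    }
    std::printf("non-zero 16-bit chunks: %d\n", nonZeroChunks); // prints 4
    return 0;
}
```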
Thank you for your comments.

> So I suggest a reorder: move the optimization before LowerStoreIndir. You should then be able to delete the #if for SetContained, as it would be dead code - LowerStoreIndir should check for the containment as appropriate.

I'll make the call to LowerStoreIndir after the optimization.

> Also: is containing all FP constants worth it (does it even work?) on ARM/ARM64? The codegen for things like *x = 1.1 should be checked.

Those are fair points. I will fix them and come back with new changes soon.
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
Force-pushed from 3d27939 to 36bc6b4
So I corrected a couple of points and now everything should work normally, but I need to dig a little deeper into the JIT and the ARM instructions.

Current output for x64/x86 (win/unix) - I did not post each one, since they are almost identical except for the registers:

    ; Method of Program:Foo (*x = 0)
           xor      eax, eax
           mov      dword ptr [rcx], eax
    ; Total bytes of code: 5

    ; Method of Program:Foo (*x = 1)
           mov      dword ptr [rcx], 0xD1FFAB1E
    ; Total bytes of code: 7

    ; Method of Program:Foo (*x = 2.2)
           mov      dword ptr [rcx], 0x400CCCCD
    ; Total bytes of code: 7

Current output for ARM64:

    ; Method of Program:Foo (*x = 0)
           str      wzr, [x0]
    ; Total bytes of code: 5

    ; Method of Program:Foo (*x = 1)
           fmov     s16, #1.0000
           str      s16, [x0]
    ; Total bytes of code: 7

    ; Method of Program:Foo (*x = 2.2)
           ldr      s16, [@RWD00]
           str      s16, [x0]
    ; Total bytes of code: 7

Current output for ARM: the "Is Contained" flag is not set for ARM for some reason; I looked in the code and came across this:

Tomorrow I will study ARM better and try to understand how this can be implemented in the JIT. I would be glad to get help, since this is my first PR and I have not worked with the JIT before.

    ; Method of Program:Foo (*x = 0)
           movs     r3, 0
           str      r3, [r0]
    ; Total bytes of code: 5

    ; Method of Program:Foo (*x = 1)
           mov      r3, 0x3f800000
           vmov.i2f s8, r3
           vstr     s8, [r0]
    ; Total bytes of code: 7

    ; Method of Program:Foo (*x = 2.2)
           movw     r3, 0xd1ff
           movt     r3, 0xd1ff
           vmov.i2f s8, r3
           vstr     s8, [r0]
    ; Total bytes of code: 7
As far as the ARM situation goes, I think we'll want to always switch to integral constants there, as we assemble FP constants from inline integers, and so just storing the integer directly will (always) save us a

Edit: for ARM64 (note I am not an expert), I think we should leave the "only contain zeroes" logic. Or only switch for immediates that are encodable in the instruction directly...
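Taken together, the two review comments above amount to a per-target policy roughly like the following sketch (the function name and shape are invented for illustration; the real logic lives in Lowering and works on GenTree nodes):

```cpp
#include <cstdint>

enum class Target { Arm, Arm64 };

// Illustrative policy sketch: should a store of an FP constant with the given
// bit pattern be retyped into an integer store?
//  - ARM: always, because FP constants are assembled from inline integers
//    anyway, so storing the integer skips the vmov into an FP register.
//  - ARM64: conservatively only for an all-zero pattern, which can be stored
//    straight from wzr/xzr (the "only contain zeroes" logic mentioned above).
static bool ShouldRetypeFpStoreToInt(Target target, uint64_t bits)
{
    switch (target)
    {
        case Target::Arm:
            return true;
        case Target::Arm64:
            return bits == 0;
    }
    return false;
}

int main()
{
    // *x = 0.0  -> bits == 0x0:        retyped on both targets.
    // *x = 2.2f -> bits == 0x400CCCCD: retyped on ARM only under this sketch.
    return ShouldRetypeFpStoreToInt(Target::Arm64, 0) ? 0 : 1;
}
```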
… with conditional compilation in lower.cpp
New codegen for ARM:

    ; Method Program:Foo(int) (*x = 0)
    G_M7200_IG01:
           push     {r11,lr}
           mov      r11, sp
           ;; bbWeight=1 PerfScore 2.00
    G_M7200_IG02:
           movs     r3, 0
           str      r3, [r0]
           ;; bbWeight=1 PerfScore 2.00
    G_M7200_IG03:
           pop      {r11,pc}
           ;; bbWeight=1 PerfScore 1.00
    ; Total bytes of code: 14

    ; Method Program:Foo(int,int) (*x = -0.0f)
    G_M21919_IG01:
           push     {r11,lr}
           mov      r11, sp
           ;; bbWeight=1 PerfScore 2.00
    G_M21919_IG02:
           mov      r3, 0x80000000
           str      r3, [r0]
           ;; bbWeight=1 PerfScore 2.00
    G_M21919_IG03:
           pop      {r11,pc}
           ;; bbWeight=1 PerfScore 1.00
    ; Total bytes of code: 16

    ; Method Program:Foo(int,int,int) (*x = 2.7f)
    G_M62624_IG01:
           push     {r11,lr}
           mov      r11, sp
           ;; bbWeight=1 PerfScore 2.00
    G_M62624_IG02:
           movw     r3, 0xd1ff
           movt     r3, 0xd1ff
           str      r3, [r0]
           ;; bbWeight=1 PerfScore 3.00
    G_M62624_IG03:
           pop      {r11,pc}
           ;; bbWeight=1 PerfScore 1.00
    ; Total bytes of code: 20
Generally LGTM, modulo the comment requests.
Someone from @dotnet/jit-contrib will have the final say.
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@echesakovMSFT PTAL at this community PR.
Hi @SeanWoo, thank you for your contribution! I am seeing that there are some regressions:

linux-arm64
linux-arm

Have you had a chance to look at them? If not, I can help with this.
@echesakovMSFT Hello, I had not seen these comparisons. I found the tests on which it regresses; now I need to work out how to quickly compare the asm code and identify the specific values on which it regresses. I'm still working on it.
@SeanWoo The diffs were generated using the SPMI tool in the
Sure, please let me know if you need help with superpmi.
@echesakovMSFT Is it right that I run asmdiff through this?:

A related question: is there a way to run not all the tests but only the specific one that fails? That would speed up SuperPMI, which takes especially long when the --diff_jit_dump key is set to get a dump. After the first and second runs I found that the regression shows up in different files, and sometimes not at all. I ran the command twice in a row and this is what came out:

It seems I'm doing something wrong :)
It is known that there can be spurious diffs on ARM/ARM64 due to #53773.
The usual way to check is to get the command line SPMI used (it prints it, something like:
@SingleAccretion @echesakovMSFT
So, after almost a week, I returned to the problem and decided to study it more thoroughly. I looked at the output of SuperPMI and JitDump, running the same test each time; it sometimes regresses and sometimes does not. I came to the conclusion that the problem lies exactly in #53773, since the regressions look exactly like the ones shown there (base on the left, current on the right):

The second run of this test, immediately after the first one finished, gives completely different results:

The third run, and again this is what we see:

The fourth run is fine as well.

But on the fifth run we see it again:

When the regression does occur, it always looks like this:

So, what are we going to do with this PR?
Hi @SeanWoo, I collected SuperPMI diffs for your change as well. I looked over some of the diffs and found two interesting artifacts in cases where the value in the floating-point register is not the last use. For example, this can lead to:
--- a/base/835.dasm
+++ b/diff/835.dasm
@@ -44,7 +44,7 @@ G_M59164_IG01: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Pr
stp x19, x20, [sp,#64]
mov fp, sp
;; bbWeight=1 PerfScore 5.50
-G_M59164_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+G_M59164_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz, align
movz x0, #0xd1ffab1e
movk x0, #0xd1ffab1e LSL #16
movk x0, #0xd1ffab1e LSL #32
@@ -54,11 +54,10 @@ G_M59164_IG02: ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
; gcr arg pop 0
mov x19, x0
; gcrRegs +[x19]
- movi v0.16b, #0x00
- str d0, [x19,#24]
- fmov d8, d0
+ str xzr, [x19,#24]
+ movi v8.16b, #0x00
mov w20, #2
--- a/base/226591.dasm
+++ b/diff/226591.dasm
@@ -508,8 +508,7 @@ G_M14851_IG04: ; , extend
; byrRegs +[r0]
movw r3, 0xd1ff
movt r3, 0xd1ff
- vmov.i2f s0, r3
- vstr s0, [r1+4]
+ str r3, [r1+4]
movw r12, 0xd1ff
movt r12, 0xd1ff
blx r12 // CORINFO_HELP_CHECKED_ASSIGN_REF
@@ -519,6 +518,9 @@ G_M14851_IG04: ; , extend
; gcrRegs +[r0]
movw r3, 0xd1ff
movt r3, 0xd1ff
+ vmov.i2f s0, r3
+ movw r3, 0xd1ff
+ movt r3, 0xd1ff
blx r3 // VariantNative:Marshal_Struct_ByValue_Single()
; gcrRegs -[r0]
movs r1, 1

I don't think lowering has the information to determine such cases, so perhaps there is not much we can do here.

There are two formatting jobs that failed - could you please run

to fix the formatting and push the change so that I can merge it?
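For what it's worth, the second diff looks like it comes from a shape where the constant is both stored through a pointer and then needed in a floating-point register for a call. A hypothetical C++ analogue (the real test code is managed, and these names are invented):

```cpp
// Hypothetical analogue of the pattern behind the second diff above: the FP
// constant is stored through a pointer AND then passed to a call in an FP
// register. With the change the store itself gets cheaper (a plain integer
// "str" instead of vmov.i2f + vstr), but the constant must still be rebuilt
// and moved into s0 for the call, so this particular path ends up longer.
extern void TakeFloat(float f); // stand-in for the real callee

void StoreAndPass(float* field)
{
    const float c = 2.7f; // any non-zero FP constant
    *field = c;           // candidate for retyping to an integer store
    TakeFloat(c);         // c still has to end up in an FP register here
}
```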
…into issue-61620-ARM64
@echesakovMSFT I executed the command; everything is ready.
> @echesakovMSFT I executed the command; everything is ready.

Thank you. I re-triggered the failing legs - will merge as soon as they pass.
@SeanWoo Thank you for your contribution!
Issue: #61620

Old assembly code for ARM64:

    movi    v16.16b, #0x00
    str     s16, [x1]

New assembly code for ARM64:

    str     wzr, [x0]

Other changes:

Previously, the code was like this:

    data->SetContained();
    data->BashToConst(intCns, type);

Now:

    data->BashToConst(intCns, type);
    #if defined(TARGET_ARM64)
    data->SetContained();
    #endif

BashToConst zeroes the node flags, so the earlier SetContained call had no effect.

I checked the assembler output for x64/x86 and ARM, and it remained the same as before the change. The change is applied only for ARM64.