[JIT] [Issue: 61620] Optimizing ARM64 for *x = dblCns; #61847

SeanWoo · 2021-11-19T18:39:30Z

Issue: #61620
old assembly code for ARM64:

            movi    v16.16b, #0x00
            str     s16, [x1]

new assembly code for ARM64:

            str     wzr, [x0]

Other changes:

The code was moved from lowerxarch.cpp to lower.cpp.
data->SetContained() is now called for ARM64 only.

Previously, the code was like this:

    data->SetContained();
    data->BashToConst(intCns, type);

Now:

    data->BashToConst(intCns, type);
#if defined(TARGET_ARM64)
    data->SetContained();
#endif

BashToConst zeroes the node flags, so the earlier SetContained call had no effect.
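The ordering problem can be illustrated with a toy model. This is only a sketch: `ToyNode`, `kToyContainedFlag`, and the two helper functions are hypothetical names invented for the illustration, not the JIT's real `GenTree` API. The only behavior carried over from the description is that bashing a node to a constant resets its flags.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag bit for this illustration; not the JIT's real flag value.
constexpr std::uint32_t kToyContainedFlag = 0x1;

// Toy stand-in for a JIT tree node; NOT the real GenTree class.
struct ToyNode
{
    std::uint32_t flags = 0;
    std::int64_t  value = 0;

    void SetContained() { flags |= kToyContainedFlag; }
    bool IsContained() const { return (flags & kToyContainedFlag) != 0; }

    // Models the behavior described above: turning the node into a constant
    // also resets its flags, discarding any earlier SetContained().
    void BashToConst(std::int64_t newValue)
    {
        value = newValue;
        flags = 0;
    }
};

// Old order: the contained bit is set first and then wiped by BashToConst.
inline bool ContainedWithOldOrder()
{
    ToyNode data;
    data.SetContained();
    data.BashToConst(0);
    return data.IsContained();
}

// New order: bash first, then mark the node as contained.
inline bool ContainedWithNewOrder()
{
    ToyNode data;
    data.BashToConst(0);
    data.SetContained();
    return data.IsContained();
}

// Small self-check of the two orderings.
inline void CheckOrdering()
{
    assert(!ContainedWithOldOrder());
    assert(ContainedWithNewOrder());
}
```

In this model the old order loses the contained bit and the new order keeps it, which matches the reason given for swapping the two calls.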

I checked the generated assembly for x64, x86, and ARM, and it is unchanged. The change applies only to ARM64.

ghost · 2021-11-19T18:39:39Z

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Author:	SeanWoo
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `community-contribution`
Milestone:	-

dnfadmin · 2021-11-19T18:39:43Z

All CLA requirements met.

SingleAccretion · 2021-11-19T19:21:53Z

src/coreclr/jit/lower.cpp

@@ -6793,6 +6793,40 @@ void Lowering::LowerStoreIndirCommon(GenTreeStoreInd* ind)
    if (!comp->codeGen->gcInfo.gcIsWriteBarrierStoreIndNode(ind))
    {
        LowerStoreIndir(ind);


The pattern in lower.cpp is that LowerXCommon methods call LowerX methods. It may be that LowerStoreIndir has modified ind by the time we check for our optimization opportunity.

So I suggest a reorder: move the optimization before LowerStoreIndir. You should then be able to delete the #if for SetContained, as it would be dead code - LowerStoreIndir should check for the containment as appropriate.

Also: is containing all FP constants worth it (does it even work?) on ARM/ARM64? The codegen for things like *x = 1.1 should be checked.

Thank you for your comments.

> So I suggest a reorder: move the optimization before LowerStoreIndir. You should then be able to delete the #if for SetContained, as it would be dead code - LowerStoreIndir should check for the containment as appropriate.

I'll call LowerStoreIndir after the optimization.

> Also: is containing all FP constants worth it (does it even work?) on ARM/ARM64? The codegen for things like *x = 1.1 should be checked.

Those are good points. I will address them and come back with new changes soon.

SeanWoo · 2021-11-20T00:45:35Z

I corrected a couple of points, and everything should now work as expected, but I still need to dig a little deeper into the JIT and the ARM instructions.

Current output for x64/x86 (win/unix)

I did not post each one, since they are almost the same except for the registers.

; Method Program:Foo (*x = 0)
    xor      eax, eax
    mov      dword ptr [rcx], eax
; Total bytes of code: 5

; Method Program:Foo (*x = 1)
    mov      dword ptr [rcx], 0xD1FFAB1E
; Total bytes of code: 7

; Method Program:Foo (*x = 2.2)
    mov      dword ptr [rcx], 0x400CCCCD
; Total bytes of code: 7
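The immediate stored for `*x = 2.2` above is the IEEE-754 single-precision bit pattern of the constant, which is what makes storing it through an integer instruction equivalent. A minimal sketch of that reinterpretation follows; `FloatBits` and `CheckListingConstants` are hypothetical helper names for this illustration, not JIT functions, and the checked values are the ones that appear in the listings in this thread.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns the IEEE-754 binary32 bit pattern of `value` as an integer.
// std::memcpy is used so the reinterpretation is well-defined C++.
inline std::uint32_t FloatBits(float value)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    return bits;
}

// Checks against constants that appear in the listings of this thread.
inline void CheckListingConstants()
{
    assert(FloatBits(0.0f) == 0x00000000u);  // stored via a zero register
    assert(FloatBits(1.0f) == 0x3F800000u);  // "mov r3, 0x3f800000" on ARM
    assert(FloatBits(2.2f) == 0x400CCCCDu);  // "mov dword ptr [rcx], 0x400CCCCD" on x64
    assert(FloatBits(-0.0f) == 0x80000000u); // "mov r3, 0x80000000" on ARM
}
```

Note that -0.0f has a non-zero bit pattern, so it cannot be stored from a zero register even though it compares equal to 0.0f.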

Current output for ARM64

; Method Program:Foo (*x = 0)
    str     wzr, [x0]
; Total bytes of code: 5

; Method Program:Foo (*x = 1)
    fmov    s16, #1.0000
    str     s16, [x0]
; Total bytes of code: 7

; Method Program:Foo (*x = 2.2)
    ldr     s16, [@RWD00]
    str     s16, [x0]
; Total bytes of code: 7

Current output for ARM

The IsContained flag is not set on ARM for some reason. Looking at the code, I found this comment:

    // ARM floating-point load/store doesn't support a form similar to integer
    // ldr Rdst, [Rbase + Roffset] with offset in a register. The only supported
    // form is vldr Rdst, [Rbase + imm] with a more limited constraint on the imm.

Tomorrow I will study ARM more closely and try to understand how this can be implemented in the JIT. I would be glad to get help, since this is my first PR and I have not worked with the JIT before.
Also, as far as I understand, str on ARM/ARM64 cannot take a PC-relative expression or an immediate as the value to store, because of the fixed instruction size, so we cannot emit something like str 1, [x0].

; Method Program:Foo (*x = 0)
    movs    r3, 0
    str     r3, [r0]
; Total bytes of code: 5

; Method Program:Foo (*x = 1)
    mov     r3, 0x3f800000
    vmov.i2f s8, r3
    vstr    s8, [r0]
; Total bytes of code: 7

; Method Program:Foo (*x = 2.2)
    movw    r3, 0xd1ff
    movt    r3, 0xd1ff
    vmov.i2f s8, r3
    vstr    s8, [r0]
; Total bytes of code: 7

SingleAccretion · 2021-11-20T12:14:34Z

As for ARM, I think we'll want to always switch to integral constants there: we assemble FP constants from inline integers, so storing the integer directly will (always) save us a vmov.i2f.

Edit: for ARM64 (note I am not an expert), I think we should leave the "only contain zeroes" logic. Or only switch for immediates that are encodable in the instruction directly...
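One possible reading of "encodable in the instruction directly" is the 8-bit floating-point immediate of ARM64 `fmov`, the form seen in `fmov s16, #1.0000` earlier in the thread. The sketch below checks that encoding for single precision. It reflects my understanding of the architectural rule (sign bit, a 3-bit exponent, a 4-bit fraction) and is an assumption about what was meant; `IsFmovImmediateFloat` and `CheckFmovExamples` are hypothetical names, not the JIT's own helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Returns true if `value` fits the 8-bit floating-point immediate of the
// ARM64 `fmov` instruction (single precision). Such values expand to:
//   sign : NOT(b) : b b b b b : 6 more bits : 19 zero bits
inline bool IsFmovImmediateFloat(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));

    // The low 19 fraction bits must all be zero.
    if ((bits & 0x7FFFFu) != 0)
    {
        return false;
    }

    // Bits 30..25 must be either 0b100000 or 0b011111.
    const std::uint32_t expHigh = (bits >> 25) & 0x3Fu;
    return (expHigh == 0x20u) || (expHigh == 0x1Fu);
}

// Examples matching the ARM64 listing in this thread.
inline void CheckFmovExamples()
{
    assert(IsFmovImmediateFloat(1.0f));  // emitted as "fmov s16, #1.0000"
    assert(!IsFmovImmediateFloat(2.2f)); // loaded from data instead ("ldr s16, [@RWD00]")
    assert(!IsFmovImmediateFloat(0.0f)); // zero is not representable in this form
}
```

Under this reading, zero is handled separately (it is stored from the zero register), and constants such as 2.2 would stay on the existing path.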

SeanWoo · 2021-11-21T18:50:55Z

New codegen for ARM.
I think this codegen is better.

; Method Program:Foo(int) (*x = 0)
G_M7200_IG01:
            push    {r11,lr}
            mov     r11, sp
                                                ;; bbWeight=1    PerfScore 2.00

G_M7200_IG02:
            movs    r3, 0
            str     r3, [r0]
                                                ;; bbWeight=1    PerfScore 2.00

G_M7200_IG03:
            pop     {r11,pc}
                                                ;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 14

; Method Program:Foo(int,int) (*x = -0.0f)
G_M21919_IG01:
            push    {r11,lr}
            mov     r11, sp
                                                ;; bbWeight=1    PerfScore 2.00

G_M21919_IG02:
            mov     r3, 0x80000000
            str     r3, [r0]
                                                ;; bbWeight=1    PerfScore 2.00

G_M21919_IG03:
            pop     {r11,pc}
                                                ;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 16

; Method Program:Foo(int,int,int) (*x = 2.7f)
G_M62624_IG01:
            push    {r11,lr}
            mov     r11, sp
                                                ;; bbWeight=1    PerfScore 2.00

G_M62624_IG02:
            movw    r3, 0xd1ff
            movt    r3, 0xd1ff
            str     r3, [r0]
                                                ;; bbWeight=1    PerfScore 3.00

G_M62624_IG03:
            pop     {r11,pc}
                                                ;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 20

SingleAccretion

Generally LGTM to me, modulo comment requests.

Someone from @dotnet/jit-contrib will have the final say.

JulieLeeMSFT · 2021-11-22T13:33:58Z

@echesakovMSFT PTAL the community PR.

echesakov · 2021-11-24T02:20:02Z

Hi @SeanWoo, thank you for your contribution! I am seeing that there are some regressions:

linux-arm64

8 (1.98 % of base) : 246789.dasm - Benchstone.BenchF.Adams:Bench()

linux-arm

6 (0.24 % of base) : 224924.dasm - Test_VariantTest:TestByRef(bool)
6 (0.24 % of base) : 225888.dasm - Test_VariantTest:TestByRef(bool)
6 (0.26 % of base) : 224923.dasm - Test_VariantTest:TestByValue(bool)
6 (0.26 % of base) : 225887.dasm - Test_VariantTest:TestByValue(bool)
6 (0.27 % of base) : 224927.dasm - Test_VariantTest:TestFieldByRef(bool)
6 (0.27 % of base) : 225891.dasm - Test_VariantTest:TestFieldByRef(bool)
6 (0.27 % of base) : 224926.dasm - Test_VariantTest:TestFieldByValue(bool)
6 (0.27 % of base) : 225890.dasm - Test_VariantTest:TestFieldByValue(bool)

Have you had a chance to look at them? If not, I can help with this.

SeanWoo · 2021-11-24T13:02:14Z

@echesakovMSFT Hello, I had not seen these comparisons. I found the tests on which it regresses; I still need to understand how to quickly compare the asm code and identify for which specific values it regresses. I'm still working on it.

SingleAccretion · 2021-11-24T13:09:48Z

@SeanWoo The diffs were generated using the SPMI tool in the runtime-coreclr superpmi-asmdiffs pipeline. You can look up the docs on SPMI here: https://github.com/dotnet/runtime/blob/main/src/coreclr/scripts/superpmi.md.

echesakov · 2021-11-24T19:00:20Z

> @echesakovMSFT Hello, I had not seen these comparisons. I found the tests on which it regresses; I still need to understand how to quickly compare the asm code and identify for which specific values it regresses. I'm still working on it.

Sure, please let me know if you need help with superpmi.

SeanWoo · 2021-11-24T19:30:08Z

@echesakovMSFT
I kind of figured out how to use it, but there are a few questions.

Is this the right way to run asmdiffs?

py superpmi.py asmdiffs -jit_name clrjit_universal_arm64_x64.dll --altjit -target_os Linux -target_arch arm64 -arch x64 -filter coreclr_tests

A related question: is there a way to run only a specific failing test instead of all of them? That would speed up SuperPMI, especially since it runs for a long time when the --diff_jit_dump key is set to get a dump.

Between the first and second runs, the regressions showed up in different files, and sometimes there were none at all. I ran the command twice in a row, and this is what it produced:

First
Second

It seems I'm doing something wrong :)

SingleAccretion · 2021-11-24T19:42:41Z

It is known that there can be spurious diffs on ARM/ARM64 due to #53773.

> A related question: is there a way to run only a specific failing test instead of all of them? That would speed up SuperPMI, especially since it runs for a long time when the --diff_jit_dump key is set to get a dump.

The usual way to check is to get the command line SPMI used (it prints it), something like:

    C:\Users\Accretion\source\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\superpmi.exe -c ### C:\Users\Accretion\source\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\clrjit.dll C:\Users\Accretion\source\dotnet\diffs\spmi\mch\3df3e3ec-b1f6-4e6d-8439-2e7f3f7fa2ac.windows.x64\libraries.pmi.windows.x64.checked.mch

Here the -c argument is known as "the method context" and is the same as the file name printed in the diffs. Then run "the base Jit" (also printed by SPMI) and "the diff Jit" with JitDumps enabled, and diff the dumps manually.

SeanWoo · 2021-11-25T12:04:20Z

@SingleAccretion @echesakovMSFT
Given #53773, what should I do next? As far as I understand, this problem is not related to my change, since the code is sometimes optimized as expected and sometimes not, because of #53773.

SeanWoo · 2021-11-29T23:22:06Z

So, after almost a week, I returned to the problem and decided to study it more thoroughly. I looked at the output of SuperPMI and JitDump, running the same test every time; it sometimes regresses and sometimes does not. I came to the conclusion that the problem lies exactly in #53773, since the regressions look exactly like those indicated here.

(Base on the left, current on the right)
First run: the test testout1:Func_0():int for Linux ARM64 failed, with the following regression:
24 ( 0,05% of base) : 222814.dasm - testout1:Func_0():int

Second run of the same test, immediately after the first finished; the results are completely different:
-928 (-2,01% of base) : 222814.dasm - testout1:Func_0():int
Multiple changes are immediately visible.

Third run, and this is what we see:
-24 (-0,05% of base) : 222814.dasm - testout1:Func_0():int

On the fourth run, everything is fine as well.

But on the fifth run we see a regression again:
352 ( 0,78% of base) : 222814.dasm - testout1:Func_0():int

When a regression occurs, it still looks like this:

Just like here

So, what are we going to do with this PR?

echesakov · 2021-11-30T00:01:11Z

Hi @SeanWoo,

I collected SuperPMI diffs for your change as well.

I looked over some of the diffs and found two interesting artifacts that occur when the store is not the last use of the value in the floating-point register.

For example, that could lead to:

Swapping the order of the movi and str instructions (which sometimes results in better code, as in this example):

--- a/base/835.dasm
+++ b/diff/835.dasm
@@ -44,7 +44,7 @@ G_M59164_IG01:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Pr
             stp     x19, x20, [sp,#64]
             mov     fp, sp
                                                ;; bbWeight=1    PerfScore 5.50
-G_M59164_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+G_M59164_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz, align
             movz    x0, #0xd1ffab1e
             movk    x0, #0xd1ffab1e LSL #16
             movk    x0, #0xd1ffab1e LSL #32
@@ -54,11 +54,10 @@ G_M59164_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
             ; gcr arg pop 0
             mov     x19, x0
             ; gcrRegs +[x19]
-            movi    v0.16b, #0x00
-            str     d0, [x19,#24]
-            fmov    d8, d0
+            str     xzr, [x19,#24]
+            movi    v8.16b, #0x00
             mov     w20, #2

Rematerialization of the constant later in the code (this happens only on arm). As far as I can tell, these are the only regressions I observe locally (I don't see instances of #53773, "Static field addresses are not deterministic when doing SPMI replay").

--- a/base/226591.dasm
+++ b/diff/226591.dasm
@@ -508,8 +508,7 @@ G_M14851_IG04:        ; , extend
             ; byrRegs +[r0]
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
-            vmov.i2f s0, r3
-            vstr    s0, [r1+4]
+            str     r3, [r1+4]
             movw    r12, 0xd1ff
             movt    r12, 0xd1ff
             blx     r12                // CORINFO_HELP_CHECKED_ASSIGN_REF
@@ -519,6 +518,9 @@ G_M14851_IG04:        ; , extend
             ; gcrRegs +[r0]
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
+            vmov.i2f s0, r3
+            movw    r3, 0xd1ff
+            movt    r3, 0xd1ff
             blx     r3         // VariantNative:Marshal_Struct_ByValue_Single()
             ; gcrRegs -[r0]
             movs    r1, 1

I don't think lowering has the information needed to detect such cases, so perhaps there is not much we can do here.

There are two formatting jobs that failed - can you please run

%jitutils%\bin\jit-format.exe --coreclr %runtime%\src\coreclr --fix --untidy --arch x64 --os windows --build checked

to fix the formatting and push the change before I can merge it?

SeanWoo · 2021-11-30T09:09:00Z

@echesakovMSFT I executed this command, everything is ready

echesakov

> @echesakovMSFT I executed this command, everything is ready

Thank you. I re-triggered the failing legs - will merge as soon as they pass.

echesakov · 2021-11-30T20:24:04Z

@SeanWoo Thank you for your contribution!

[Issue: 61620] Optimizing ARM64 for *x = 0;

b91555e

ghost added the community-contribution label Nov 19, 2021

dotnet-issue-labeler bot added the area-CodeGen-coreclr label Nov 19, 2021

SeanWoo mentioned this pull request Nov 19, 2021

JIT: Optimize *x = dblCns to *x = intCns for arm64 #61620

Closed

SeanWoo changed the title ~~[Issue: 61620] Optimizing ARM64 for *x = 0;~~ [Issue: 61620] Optimizing ARM64 for *x = dblCns; Nov 19, 2021

SeanWoo changed the title ~~[Issue: 61620] Optimizing ARM64 for *x = dblCns;~~ [JIT] [Issue: 61620] Optimizing ARM64 for *x = dblCns; Nov 19, 2021

SingleAccretion reviewed Nov 19, 2021

Update src/coreclr/jit/lower.cpp

ee617c7

Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>

runfoapp bot mentioned this pull request Nov 19, 2021

system.text.regularexpressions.tests.regexmatchtests.match_cachedpattern_newtimeoutapplies #61794

Closed

Fixed bug with * x = dConst if dConst is not 0

36bc6b4

SeanWoo force-pushed the issue-61620-ARM64 branch from 3d27939 to 36bc6b4 Compare November 19, 2021 23:50

remove extra printf

0596430

SeanWoo requested a review from SingleAccretion November 20, 2021 11:22

SingleAccretion reviewed Nov 20, 2021

SeanWoo added 2 commits November 20, 2021 21:10

Replacing IsFPZero with IsCnsNonZeroFltOrDbl for STOREIND Minor edits with conditional compilation in lower.cpp

b9b0f55

fixed ARM codegen for STOREIND

9a2a2d5

SeanWoo requested a review from SingleAccretion November 21, 2021 18:51

SingleAccretion approved these changes Nov 21, 2021

SeanWoo and others added 2 commits November 22, 2021 00:57

Update src/coreclr/jit/lower.cpp

25f047e

Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>

Update src/coreclr/jit/lower.cpp

153fd39

Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>

JulieLeeMSFT assigned SeanWoo and echesakov Nov 22, 2021

JulieLeeMSFT added this to the 7.0.0 milestone Nov 22, 2021

JulieLeeMSFT requested a review from echesakov November 22, 2021 13:34

SeanWoo requested a review from SingleAccretion November 28, 2021 17:40

SeanWoo added 2 commits November 30, 2021 04:22

fix formatting

d5dd56b

Merge branch 'issue-61620-ARM64' of https://github.com/SeanWoo/runtime into issue-61620-ARM64

ec0f188

echesakov approved these changes Nov 30, 2021

echesakov merged commit bb597e2 into dotnet:main Nov 30, 2021

joshpeterson mentioned this pull request Nov 30, 2021

bot upstream main merge 2021 11 30 Unity-Technologies/runtime#11

Closed

SeanWoo deleted the issue-61620-ARM64 branch November 30, 2021 21:41

joshpeterson mentioned this pull request Dec 1, 2021

bot upstream main merge 2021 12 01 Unity-Technologies/runtime#12

Closed

ghost locked as resolved and limited conversation to collaborators Dec 31, 2021
