Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JIT] [Issue: 61620] Optimizing ARM64 for *x = dblCns; #61847

Merged
merged 10 commits into from
Nov 30, 2021

Conversation

SeanWoo
Copy link
Contributor

@SeanWoo SeanWoo commented Nov 19, 2021

Issue: #61620
old assembly code for ARM64:

            movi    v16.16b, #0x00
            str     s16, [x1]

new assembly code for ARM64:

            str     wzr, [x0]

image

Other changes:

  • The code was moved from lowerxarch.cpp in lower.cpp
  • Calling data->SetContained() for ARM64 only.

Previously, the code was like this:

    data->SetContained();
    data->BashToConst(intCns, type);

Now:

    data->BashToConst(intCns, type);
#if defined(TARGET_ARM64)
    data->SetContained();
#endif

BashToConst zeroes the node flags and SetContained didn't work.

I checked the output of the assembler code for x64-x86, arm and it remained the same as before the change. The change is applied only for ARM64

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 19, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 19, 2021
@ghost
Copy link

ghost commented Nov 19, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

old assembly code for ARM64:

            movi    v16.16b, #0x00
            str     s16, [x1]

new assembly code for ARM64:

            str     wzr, [x0]

image

Other changes:

  • The code was moved from lowerxarch.cpp in lower.cpp
  • Calling data->SetContained() for ARM64 only.

Previously, the code was like this:

    data->SetContained();
    data->BashToConst(intCns, type);

Now:

    data->BashToConst(intCns, type);
#if defined(TARGET_ARM64)
    data->SetContained();
#endif

BashToConst zeroes the node flags and SetContained didn't work.

Author: SeanWoo
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@dnfadmin
Copy link

dnfadmin commented Nov 19, 2021

CLA assistant check
All CLA requirements met.

@SeanWoo SeanWoo changed the title [Issue: 61620] Optimizing ARM64 for *x = 0; [Issue: 61620] Optimizing ARM64 for *x = dblCns; Nov 19, 2021
@SeanWoo SeanWoo changed the title [Issue: 61620] Optimizing ARM64 for *x = dblCns; [JIT] [Issue: 61620] Optimizing ARM64 for *x = dblCns; Nov 19, 2021
@@ -6793,6 +6793,40 @@ void Lowering::LowerStoreIndirCommon(GenTreeStoreInd* ind)
if (!comp->codeGen->gcInfo.gcIsWriteBarrierStoreIndNode(ind))
{
LowerStoreIndir(ind);
Copy link
Contributor

@SingleAccretion SingleAccretion Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pattern in lower.cpp is that LowerCommonX methods call LowerX methods. It may be that LowerStoreIndir has modified ind by the time we check we check for our optimization opportunity.

So I suggest a reorder: move the optimization before LowerStoreIndir. You should then be able to delete the #if for SetContained, as it would be dead code - LowerStoreIndir should check for the containment as appropriate.

Also: is containing all FP constants worth it (does it even work?) on ARM/ARM64? The codegen for things like *x = 1.1 should be checked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comments.

So I suggest a reorder: move the optimization before LowerStoreIndir. You should then be able to delete the #if for SetContained, as it would be dead code - LowerStoreIndir should check for the containment as appropriate.

I'll make a call to LowerStoreInd after optimization.

Also: is containing all FP constants worth it (does it even work?) on ARM/ARM64? The codegen for things like *x = 1.1 should be checked.

You made the right remarks. I will fix them and come back with new changes soon.

src/coreclr/jit/lower.cpp Outdated Show resolved Hide resolved
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 20, 2021

And so I corrected a couple of points, now everything should work normally, but I need to go a little deeper into JIT and ARM instructions.

Current output for x64/x86 (win/unix)

I did not post each one, since they are almost the same except for the registers

; Method od Program:Foo (*x = 0)
    xor      eax, eax
    mov      dword ptr [rcx], eax
; Total bytes of code: 5

; Method od Program:Foo (*x = 1)
    mov      dword ptr [rcx], 0xD1FFAB1E
; Total bytes of code: 7

; Method od Program:Foo (*x = 2.2)
    mov      dword ptr [rcx], 0x400CCCCD
; Total bytes of code: 7
Current output for ARM64
; Method od Program:Foo (*x = 0)
    str     wzr, [x0]
; Total bytes of code: 5

; Method od Program:Foo (*x = 1)
    fmov    s16, #1.0000
    str     s16, [x0]
; Total bytes of code: 7

; Method od Program:Foo (*x = 2.2)
    ldr     s16, [@RWD00]
    str     s16, [x0]
; Total bytes of code: 7
Current output for ARM

The Is Contained flag is not set for ARM for some reason, I looked in the code and met this:

      // ARM floating-point load/store doesn't support a form similar to integer
     // ldr Rdst, [Rbase + Roffset] with offset in a register. The only supported
     // form is vldr Rdst, [Rbase + imm] with a more limited constraint on the imm.

Tomorrow I will study ARM better and try to understand how it can be implemented in JIT. I would be glad to get help, since this is my first PR and I have not worked with JIT before.
Also, as far as I understand, in ARM/ARM64 str does not allow the use of PC-relative expressions due to the size limitation of one instruction and this will not allow us to do something like str 1, [x0]

; Method od Program:Foo (*x = 0)
   movs    r3, 0
   str     r3, [r0]
; Total bytes of code: 5

; Method od Program:Foo (*x = 1)
    mov     r3, 0x3f800000
    vmov.i2f s8, r3
    vstr    s8, [r0]
; Total bytes of code: 7

; Method od Program:Foo (*x = 2.2)
    movw    r3, 0xd1ff
    movt    r3, 0xd1ff
    vmov.i2f s8, r3
    vstr    s8, [r0]
; Total bytes of code: 7

src/coreclr/jit/lower.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lower.cpp Outdated Show resolved Hide resolved
@SingleAccretion
Copy link
Contributor

SingleAccretion commented Nov 20, 2021

As far as the ARM situation, I think it is the case we'll want to always switch to integral constants to it, as we assembly FP constants from inline integers, and so just storing the integer directly will (always) save us a vmov.i2f.

Edit: for ARM64 (note I am not an expert), I think we should leave the "only contain zeroes" logic. Or only switch for immediates that are encodable in the instruction directly...

@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 21, 2021

New codegen for ARM.
I think this codegen is better.

; Method Program:Foo(int) (*x = 0)
G_M7200_IG01:
            push    {r11,lr}
            mov     r11, sp
						;; bbWeight=1    PerfScore 2.00

G_M7200_IG02:
            movs    r3, 0
            str     r3, [r0]
						;; bbWeight=1    PerfScore 2.00

G_M7200_IG03:
            pop     {r11,pc}
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 14

; Method Program:Foo(int,int) (*x = -0.0f)
G_M21919_IG01:
            push    {r11,lr}
            mov     r11, sp
						;; bbWeight=1    PerfScore 2.00

G_M21919_IG02:
            mov     r3, 0x80000000
            str     r3, [r0]
						;; bbWeight=1    PerfScore 2.00

G_M21919_IG03:
            pop     {r11,pc}
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 16

; Method Program:Foo(int,int,int) (*x = 2.7f)
G_M62624_IG01:
            push    {r11,lr}
            mov     r11, sp
						;; bbWeight=1    PerfScore 2.00

G_M62624_IG02:
            movw    r3, 0xd1ff
            movt    r3, 0xd1ff
            str     r3, [r0]
						;; bbWeight=1    PerfScore 3.00

G_M62624_IG03:
            pop     {r11,pc}
						;; bbWeight=1    PerfScore 1.00
; Total bytes of code: 20

Copy link
Contributor

@SingleAccretion SingleAccretion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM to me, modulo comment requests.

Someone from @dotnet/jit-contrib will have the final say.

src/coreclr/jit/lower.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/lower.cpp Show resolved Hide resolved
SeanWoo and others added 2 commits November 22, 2021 00:57
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
Co-authored-by: SingleAccretion <62474226+SingleAccretion@users.noreply.github.com>
@JulieLeeMSFT
Copy link
Member

@echesakovMSFT PTAL the community PR.

@echesakov
Copy link
Contributor

Hi @SeanWoo, thank you for your contribution! I am seeing that there some regressions:

linux-arm64

8 (1.98 % of base) : 246789.dasm - Benchstone.BenchF.Adams:Bench()

linux-arm

6 (0.24 % of base) : 224924.dasm - Test_VariantTest:TestByRef(bool)
6 (0.24 % of base) : 225888.dasm - Test_VariantTest:TestByRef(bool)
6 (0.26 % of base) : 224923.dasm - Test_VariantTest:TestByValue(bool)
6 (0.26 % of base) : 225887.dasm - Test_VariantTest:TestByValue(bool)
6 (0.27 % of base) : 224927.dasm - Test_VariantTest:TestFieldByRef(bool)
6 (0.27 % of base) : 225891.dasm - Test_VariantTest:TestFieldByRef(bool)
6 (0.27 % of base) : 224926.dasm - Test_VariantTest:TestFieldByValue(bool)
6 (0.27 % of base) : 225890.dasm - Test_VariantTest:TestFieldByValue(bool)

Have you had a chance to look at them? If not, I can help with this.

@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 24, 2021

@echesakovMSFT Hello, I have not seen these comparisons, I found the tests on which it falls, I need to understand how to quickly make a comparison of Asm code and identify at what specific values it falls. I'm still working on it

@SingleAccretion
Copy link
Contributor

@SeanWoo The diffs were generated using the SPMI tool in the runtime-coreclr superpmi-asmdiffs pipeline. You can look up the docs on SPMI here: https://github.com/dotnet/runtime/blob/main/src/coreclr/scripts/superpmi.md.

@echesakov
Copy link
Contributor

@echesakovMSFT Hello, I have not seen these comparisons, I found the tests on which it falls, I need to understand how to quickly make a comparison of Asm code and identify at what specific values it falls. I'm still working on it

Sure, please let me know if you need help with superpmi.

@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 24, 2021

@echesakovMSFT
I kind of figured out how to use it, but there are a few questions.

Is it right that I run asmdiff through this?:

py superpmi.py asmdiffs -jit_name clrjit_universal_arm64_x64.dll --altjit -target_os Linux -target_arch arm64 -arch x64 -filter coreclr_tests

The same question is, is there an opportunity to run not all tests, but some specific one that does not work? This would speed up the work of superpmi, especially it works for a long time when you set the --diff_jit_dump key to get a dump.

After the first and second run, I had a situation that regression occurs in different files, and sometimes it does not exist at all. I ran the command twice in a row and that's what brought out:

First
Second

It seems I'm doing something wrong :)

@SingleAccretion
Copy link
Contributor

It is known that there can be spurious diffs on ARM/ARM64 due to #53773.

The same question is, is there an opportunity to run not all tests, but some specific one that does not work? This would speed up the work of superpmi, especially it works for a long time when you set the --diff_jit_dump key to get a dump.

The usual way to check is to get the command line SPMI used (it prints it, something like: C:\Users\Accretion\source\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\superpmi.exe -c ### C:\Users\Accretion\source\dotnet\runtime\artifacts\tests\coreclr\windows.x64.Checked\Tests\Core_Root\clrjit.dll C:\Users\Accretion\source\dotnet\diffs\spmi\mch\3df3e3ec-b1f6-4e6d-8439-2e7f3f7fa2ac.windows.x64\libraries.pmi.windows.x64.checked.mch, here the -c argument is known as "the method context" and is the same as the file name printed in the diffs) and then run "the base Jit" (also printed by SPMI) and "the diff Jit" with JitDumps enabled. And then diff the dumps manually.

@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 25, 2021

@SingleAccretion @echesakovMSFT
Given #53773, what should I do next? As far as I understand, this problem is not related to me, since the code is then optimized as expected, then not optimized because of #53773

@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 29, 2021

So, after almost a week, I returned to the problem again and decided to study it more qualitatively. I looked at the output of SuperPMI and JitDump, running the same test every time, which then regresses, then does not. I came to the conclusion that the problem exactly lies in #53773, since regressions occur exactly the same as were indicated here

(Base on the left, current on the right)
The first run is a test called testout1:Func_0():int for Linux ARM64 failed, the following regressions appeared:
24 ( 0,05% of base) : 222814.dasm - testout1:Func_0():int

The second run of this test immediately after the completion of the first, the results are completely different:
-928 (-2,01% of base) : 222814.dasm - testout1:Func_0():int
Multiple changes are immediately visible

The third run, and again this is what we see:
-24 (-0,05% of base) : 222814.dasm - testout1:Func_0():int

Then the fourth launch, everything is fine there as well

But on the fifth launch we see it again:
352 ( 0,78% of base) : 222814.dasm - testout1:Func_0():int

When regression occurs, it still happens like this:
image
Just like here

So, what are we going to do with this PR?

@echesakov
Copy link
Contributor

Hi @SeanWoo,

I collected SuperPMI diffs for your change as well.

I looked over some of the diffs and found two interesting artifacts in a case when the value in the floating point register is not the last use.

For example, that could lead to:

  1. Swapping movi and str instructions order (that sometimes results in better code as in the example):
--- a/base/835.dasm
+++ b/diff/835.dasm
@@ -44,7 +44,7 @@ G_M59164_IG01:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Pr
             stp     x19, x20, [sp,#64]
             mov     fp, sp
                                                ;; bbWeight=1    PerfScore 5.50
-G_M59164_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
+G_M59164_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz, align
             movz    x0, #0xd1ffab1e
             movk    x0, #0xd1ffab1e LSL #16
             movk    x0, #0xd1ffab1e LSL #32
@@ -54,11 +54,10 @@ G_M59164_IG02:        ; gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, isz
             ; gcr arg pop 0
             mov     x19, x0
             ; gcrRegs +[x19]
-            movi    v0.16b, #0x00
-            str     d0, [x19,#24]
-            fmov    d8, d0
+            str     xzr, [x19,#24]
+            movi    v8.16b, #0x00
             mov     w20, #2
  1. Rematerialization of the constant later at the code (this happens only on arm) that, as far as I can tell, the only regressions I observe locally (I don't see instances of Static field addresses are not deterministic when doing SPMI replay #53773).
--- a/base/226591.dasm
+++ b/diff/226591.dasm
@@ -508,8 +508,7 @@ G_M14851_IG04:        ; , extend
             ; byrRegs +[r0]
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
-            vmov.i2f s0, r3
-            vstr    s0, [r1+4]
+            str     r3, [r1+4]
             movw    r12, 0xd1ff
             movt    r12, 0xd1ff
             blx     r12                // CORINFO_HELP_CHECKED_ASSIGN_REF
@@ -519,6 +518,9 @@ G_M14851_IG04:        ; , extend
             ; gcrRegs +[r0]
             movw    r3, 0xd1ff
             movt    r3, 0xd1ff
+            vmov.i2f s0, r3
+            movw    r3, 0xd1ff
+            movt    r3, 0xd1ff
             blx     r3         // VariantNative:Marshal_Struct_ByValue_Single()
             ; gcrRegs -[r0]
             movs    r1, 1

I don't think the lower has information to determine such cases, so perhaps, there is not much we can do here.

There are two formatting jobs that failed - can you please run

%jitutils%\bin\jit-format.exe --coreclr %runtime%\src\coreclr --fix --untidy --arch x64 --os windows --build checked

to fix the formatting and push the change before I can merge it?

@SeanWoo
Copy link
Contributor Author

SeanWoo commented Nov 30, 2021

@echesakovMSFT I executed this command, everything is ready

Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@echesakovMSFT I executed this command, everything is ready

Thank you. I re-triggered the failing legs - will merge as soon as they pass.

@echesakov echesakov merged commit bb597e2 into dotnet:main Nov 30, 2021
@echesakov
Copy link
Contributor

@SeanWoo Thank you for your contribution!

@SeanWoo SeanWoo deleted the issue-61620-ARM64 branch November 30, 2021 21:41
@ghost ghost locked as resolved and limited conversation to collaborators Dec 31, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants