Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64 - Emitting msub instruction #66621

Merged
merged 7 commits into from
Apr 12, 2022
Merged

ARM64 - Emitting msub instruction #66621

merged 7 commits into from
Apr 12, 2022

Conversation

TIHan
Copy link
Contributor

@TIHan TIHan commented Mar 14, 2022

Addresses the final piece for this issue: #34937
Though, this PR does not explicitly make changes to MOD or UMOD.

Description

Expression a - b * c can be optimized to a single instruction, msub, on ARM64.
It only optimizes that expression for integral types.

Acceptance Criteria

Some ARM64 diffs

             cmp     w3, #0
             beq     G_M9825_IG04
             udiv    w4, w1, w3
-            mul     w3, w4, w3
-            sub     w3, w1, w3
+            msub    w3, w4, w3, w1
             str     w3, [x2]
             ldr     x2, [x19]
             ; byrRegs -[x2]
             bl      CORINFO_HELP_LDELEMA_REF
             ; gcrRegs -[x0]
             ; byrRegs +[x0]
-						;; bbWeight=1    PerfScore 42.50
+						;; bbWeight=1    PerfScore 42.00
-            mul     w1, w1, w2
-            sub     w0, w0, w1
-						;; bbWeight=1    PerfScore 2.50
+            msub    w0, w1, w2, w0
+						;; bbWeight=1    PerfScore 2.00
             sdiv    w2, w0, w1
-            mul     w1, w2, w1
-            sub     w0, w0, w1
-						;; bbWeight=1    PerfScore 13.50
+            msub    w0, w2, w1, w0
+						;; bbWeight=1    PerfScore 13.00

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 14, 2022
@ghost ghost assigned TIHan Mar 14, 2022
@ghost
Copy link

ghost commented Mar 14, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Addresses the final piece for this issue: #34937

Description

Expression 'a - b * c' can be optimized to a single instruction, msub, on ARM64.
It only optimizes that expression for integral types.

Acceptance Criteria

  • Add Tests

Some ARM64 diffs

             cmp     w3, #0
             beq     G_M9825_IG04
             udiv    w4, w1, w3
-            mul     w3, w4, w3
-            sub     w3, w1, w3
+            msub    w3, w4, w3, w1
             str     w3, [x2]
             ldr     x2, [x19]
             ; byrRegs -[x2]
             bl      CORINFO_HELP_LDELEMA_REF
             ; gcrRegs -[x0]
             ; byrRegs +[x0]
-						;; bbWeight=1    PerfScore 42.50
+						;; bbWeight=1    PerfScore 42.00
-            mul     w1, w1, w2
-            sub     w0, w0, w1
-						;; bbWeight=1    PerfScore 2.50
+            msub    w0, w1, w2, w0
+						;; bbWeight=1    PerfScore 2.00
             sdiv    w2, w0, w1
-            mul     w1, w2, w1
-            sub     w0, w0, w1
-						;; bbWeight=1    PerfScore 13.50
+            msub    w0, w2, w1, w0
+						;; bbWeight=1    PerfScore 13.00
Author: TIHan
Assignees: TIHan
Labels:

area-CodeGen-coreclr

Milestone: -

@TIHan
Copy link
Contributor Author

TIHan commented Mar 14, 2022

@kunalspathak @jakobbotsch This is ready.

// Arguments:
// tree - GT_MSUB tree where op2 is GT_MUL
//
void CodeGen::genCodeForMsub(GenTreeOp* tree)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered to just handle it as part of GT_MADD? I bet it'd be much less lines of code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider it - I guess it's a design choice whether or not to use GT_MADD or introduce GT_MSUB.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually conflicted on it. On the one hand, just using GT_MADD is sufficient, but on the other hand, the codegen for GT_MADD is a bit complicated since we are making GT_NEG nodes as contained.

The goal of introducing GT_MSUB and its code-gen was to make it easy to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name GT_MADD as a lowering-only op that is specific for ARM64 which reflects the actual instruction 'madd' - at least for me, I wouldn't it to expect to emit 'msub', but I totally understand why it does though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consider actually just making GT_MADD only emit 'madd' ? Then you could do the decision to use GT_MADD or GT_MSUB in lowering rather than in code-gen. It would make containment less complicated and code-gen simpler.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consider actually just making GT_MADD only emit 'madd'

I like this and it makes it have "parity" with GT_ADD and GT_SUB.

@TIHan
Copy link
Contributor Author

TIHan commented Apr 11, 2022

@kunalspathak @EgorBo This is ready. CI is again failing due to unrelated reasons.

I know I'm adding GT_MSUB while GT_MADD can emit either madd or msub - but I'm willing to do a follow-up PR to make GT_MADD just emit madd and do the swapping and stuff in lowering to emit GT_MSUB so there is less confusion between the two.

@kunalspathak
Copy link
Member

I know I'm adding GT_MSUB while GT_MADD can emit either madd or msub - but I'm willing to do a follow-up PR to make GT_MADD just emit madd and do the swapping and stuff in lowering to emit GT_MSUB so there is less confusion between the two.

sounds good to me.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@TIHan
Copy link
Contributor Author

TIHan commented Apr 11, 2022

Made a follow-up issue: #67869

@TIHan TIHan merged commit cfd1241 into dotnet:main Apr 12, 2022
@DrewScoggins
Copy link
Member

Windows-Arm64 Improvements: dotnet/perf-autofiling-issues#4624

@dakersnar
Copy link
Contributor

dakersnar commented Apr 21, 2022

More Windows-Arm64 Improvements: dotnet/perf-autofiling-issues#4733

@AndyAyersMS
Copy link
Member

Ubuntu arm64 regression: dotnet/perf-autofiling-issues#4737

@tannergooding
Copy link
Member

Would be interesting to see the codegen difference: https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/runtime/Benchstones/BenchI/Pi.cs

I'd guess this is more likely due to some subtle loop alignment change from the smaller instruction sequence

@ghost ghost locked as resolved and limited conversation to collaborators May 22, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants