ARM64 - Emitting msub instruction #66621
Tagging subscribers to this area: @JulieLeeMSFT
Some ARM64 diffs

  cmp w3, #0
  beq G_M9825_IG04
  udiv w4, w1, w3
- mul w3, w4, w3
- sub w3, w1, w3
+ msub w3, w4, w3, w1
  str w3, [x2]
  ldr x2, [x19]
  ; byrRegs -[x2]
  bl CORINFO_HELP_LDELEMA_REF
  ; gcrRegs -[x0]
  ; byrRegs +[x0]
- ;; bbWeight=1 PerfScore 42.50
+ ;; bbWeight=1 PerfScore 42.00

- mul w1, w1, w2
- sub w0, w0, w1
- ;; bbWeight=1 PerfScore 2.50
+ msub w0, w1, w2, w0
+ ;; bbWeight=1 PerfScore 2.00

  sdiv w2, w0, w1
- mul w1, w2, w1
- sub w0, w0, w1
- ;; bbWeight=1 PerfScore 13.50
+ msub w0, w2, w1, w0
+ ;; bbWeight=1 PerfScore 13.00
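To make the diffs above easier to read, here is a minimal C++ model of what the 32-bit `msub wd, wn, wm, wa` form computes (`wd = wa - wn * wm`, wrapping modulo 2^32) and of how a remainder falls out of a divide followed by `msub`. The function names `msub32`, `urem_via_msub`, and `srem_via_msub` are made up for illustration and are not part of the JIT; this is a sketch of the arithmetic only, not of the code generator.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative model of `msub wd, wn, wm, wa`: wd = wa - wn * wm (mod 2^32).
inline std::uint32_t msub32(std::uint32_t wn, std::uint32_t wm, std::uint32_t wa)
{
    return wa - wn * wm; // unsigned arithmetic wraps, like the instruction
}

// Unsigned remainder as in the first diff: udiv, then msub.
inline std::uint32_t urem_via_msub(std::uint32_t dividend, std::uint32_t divisor)
{
    assert(divisor != 0);                              // caller must rule out a zero divisor
    const std::uint32_t quotient = dividend / divisor; // udiv w4, w1, w3
    return msub32(quotient, divisor, dividend);        // msub w3, w4, w3, w1
}

// Signed remainder as in the last diff: sdiv, then msub.
// Assumes the quotient is representable (not INT32_MIN / -1).
inline std::int32_t srem_via_msub(std::int32_t dividend, std::int32_t divisor)
{
    assert(divisor != 0);
    const std::int32_t quotient = dividend / divisor;  // sdiv w2, w0, w1
    // Do the multiply-subtract on the unsigned bit patterns to avoid signed overflow.
    return static_cast<std::int32_t>(msub32(static_cast<std::uint32_t>(quotient),
                                            static_cast<std::uint32_t>(divisor),
                                            static_cast<std::uint32_t>(dividend)));
}
```

For example, `urem_via_msub(17, 5)` yields 2 and `srem_via_msub(-17, 5)` yields -2, matching `%` with truncating division.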
@kunalspathak @jakobbotsch This is ready.
// Arguments:
//    tree - GT_MSUB tree where op2 is GT_MUL
//
void CodeGen::genCodeForMsub(GenTreeOp* tree)
Have you considered just handling it as part of GT_MADD? I bet it'd be far fewer lines of code.
I did consider it - I guess it's a design choice whether or not to use GT_MADD or introduce GT_MSUB.
I'm actually conflicted on it. On the one hand, just using GT_MADD is sufficient, but on the other hand, the codegen for GT_MADD is a bit complicated since we are making GT_NEG nodes as contained.
The goal of introducing GT_MSUB and its code-gen was to make it easy to understand.
GT_MADD is a lowering-only op that is specific to ARM64, and its name reflects the actual instruction 'madd'. At least for me, I wouldn't expect it to emit 'msub', but I totally understand why it does.
Can we consider actually just making GT_MADD only emit 'madd'? Then you could make the decision to use GT_MADD or GT_MSUB in lowering rather than in code-gen. It would make containment less complicated and code-gen simpler.
Can we consider actually just making GT_MADD only emit 'madd'
I like this and it makes it have "parity" with GT_ADD and GT_SUB.
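As a rough illustration of the suggestion in this thread (pick GT_MADD or GT_MSUB during lowering so that code-gen maps each node to exactly one instruction), here is a toy C++ sketch. The `Oper` enum, `Node` struct, and the `make`, `lower`, and `instruction_for` functions are all hypothetical and only borrow names from the discussion; the real JIT's `GenTree` and lowering code look different.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical node kinds that mirror the names used in the discussion.
enum class Oper { Local, Add, Sub, Mul, Madd, Msub };

struct Node
{
    Oper                  oper = Oper::Local;
    std::unique_ptr<Node> op1; // first operand (multiplicand for Madd/Msub)
    std::unique_ptr<Node> op2; // second operand (multiplier for Madd/Msub)
    std::unique_ptr<Node> op3; // addend (Madd) or minuend (Msub)
};

inline std::unique_ptr<Node> make(Oper oper, std::unique_ptr<Node> a = nullptr,
                                  std::unique_ptr<Node> b = nullptr)
{
    auto node  = std::make_unique<Node>();
    node->oper = oper;
    node->op1  = std::move(a);
    node->op2  = std::move(b);
    return node;
}

// Toy lowering: rewrite a + (b * c) to Madd and a - (b * c) to Msub.
inline void lower(Node& node)
{
    if (node.op1) lower(*node.op1);
    if (node.op2) lower(*node.op2);

    const bool isAdd = node.oper == Oper::Add;
    const bool isSub = node.oper == Oper::Sub;
    if (!isAdd && !isSub) return;

    std::unique_ptr<Node> mul;
    std::unique_ptr<Node> other;
    if (node.op2 && node.op2->oper == Oper::Mul)
    {
        mul   = std::move(node.op2); // a + (b * c) or a - (b * c)
        other = std::move(node.op1);
    }
    else if (isAdd && node.op1 && node.op1->oper == Oper::Mul)
    {
        mul   = std::move(node.op1); // (b * c) + a: addition commutes
        other = std::move(node.op2);
    }
    else
    {
        return; // note: (b * c) - a is not a - (b * c), so it stays a Sub
    }

    assert(mul->op1 && mul->op2);
    node.op1  = std::move(mul->op1);
    node.op2  = std::move(mul->op2);
    node.op3  = std::move(other);
    node.oper = isAdd ? Oper::Madd : Oper::Msub;
}

// With the decision made in lowering, code-gen is a one-to-one mapping.
inline const char* instruction_for(Oper oper)
{
    switch (oper)
    {
        case Oper::Add:  return "add";
        case Oper::Sub:  return "sub";
        case Oper::Mul:  return "mul";
        case Oper::Madd: return "madd";
        case Oper::Msub: return "msub";
        default:         return "";
    }
}
```

In this sketch a `Sub` whose second operand is a `Mul` becomes `Msub`, while `(b * c) - a` is left alone, since `msub` computes the minuend minus the product and not the other way round.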
@kunalspathak @EgorBo This is ready. CI is again failing due to unrelated reasons. I know I'm adding GT_MSUB while GT_MADD can emit either madd or msub - but I'm willing to do a follow-up PR to make GT_MADD just emit madd and do the swapping and stuff in lowering to emit GT_MSUB so there is less confusion between the two.
Sounds good to me.
LGTM
Made a follow-up issue: #67869
Windows-Arm64 Improvements: dotnet/perf-autofiling-issues#4624
More Windows-Arm64 Improvements: dotnet/perf-autofiling-issues#4733
Ubuntu arm64 regression: dotnet/perf-autofiling-issues#4737
Would be interesting to see the codegen difference: https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/runtime/Benchstones/BenchI/Pi.cs
I'd guess this is more likely due to some subtle loop alignment change from the smaller instruction sequence.
Addresses the final piece for this issue: #34937
Though, this PR does not explicitly make changes to MOD or UMOD.
Description
The expression a - b * c can be optimized to a single instruction, msub, on ARM64. It only optimizes that expression for integral types.
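For readers who want the shape of the expression in code, here is a small C++ sketch of the integral `a - b * c` pattern for 32-bit and 64-bit integers. The helper name `mul_sub` is hypothetical. The PR itself is about the .NET JIT; whether a given C++ compiler also folds this into a single `msub` on ARM64 depends on the compiler and its flags, so treat that part as an assumption.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical helper: computes a - b * c for 32-bit or 64-bit integers,
// wrapping on overflow the way the hardware instruction does.
template <typename T>
constexpr T mul_sub(T a, T b, T c)
{
    static_assert(std::is_integral<T>::value, "integral types only");
    static_assert(sizeof(T) >= 4, "sketch covers 32-bit and 64-bit integers");
    using U = std::make_unsigned_t<T>;
    // Work on the unsigned bit pattern so overflow wraps instead of being undefined.
    return static_cast<T>(static_cast<U>(a) - static_cast<U>(b) * static_cast<U>(c));
}
```

A call such as `mul_sub<std::int32_t>(10, 2, 3)` evaluates `10 - 2 * 3`. Floating-point types are rejected at compile time, in line with the integral-only scope stated above.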
Acceptance Criteria
Add tests (asmdiffs cover this; ARM64 - Optimizing a % b operations part 2 #66407 also includes tests)