-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM64-SVE: Delay free all ops within conditional select #107036
Conversation
@dotnet/arm64-contrib @kunalspathak |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks!
src/coreclr/jit/lsraarm64.cpp
Outdated
} | ||
|
||
srcCount += BuildDelayFreeUses(intrin.op3, emitOp1); | ||
prefUseNode = emitOp1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we have the problem here for the resultOpNum == 3
case? prefUseNode
will end up being intrinEmb.op3
, hence the last ref position we build can end up being a non-delay freed version of an earlier operand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AIUI, the code in hwintrinsiccodegenarm64.cpp
will choose the correct instruction based on which input register matches the output register.
For the resultOpNum == 3
case it knows op3
is the last use so ideally that should be the RMW operand. Then lsra ensures that is the one that does not get delay freed.
Possible there is something I'm missing here.
If that logic is correct, then I think the code can be simplified to:
unsigned resultOpNum =
embOp2Node->GetResultOpNumForRmwIntrinsic(user, intrinEmb.op1, intrinEmb.op2, intrinEmb.op3);
if (resultOpNum == 0)
{
prefUseNode = embOp2Node->Op(1);
}
else
{
assert(resultOpNum >= 1 && resultOpNum <= 3);
prefUseNode = embOp2Node->Op(resultOpNum);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm not understanding is when do we have the correctness requirement that the destination register is not the same as an operand register? Does this case have the correctness requirement?
I am less interesting in the optimizations happening here until we get the constraints encoded correctly in a way that I know register allocation will handle. Currently I am not convinced, but I also do not have a good understanding of what cases have the register requirements we have been discussing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The requirements here would be that 1) op1
and the destination are ideally same register for RMW use 2) The other inputs can't use the same operand as the destination (because it might follow a MOVPRFX
)
The codegen then outputs extra code to handle the case when op1
and the destination don't match. Plus it asserts if other inputs share the same register as the destination.
The optimisation is that any of the inputs can be chosen as the RMW node (which is what the GetResultOpNumForRmwIntrinsic()
is picking). Codegeneration spots this has happened by looking at which input matches the destination.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anything prevent prefUseNode
to be op3
and op2
to be the same interval as op3
? It will create a delay-freed op2
which has no effect. It will create non delay-freed op3
. Then the destination register is allowed to be the same as op2
and op3
. If I understand correctly, this will hit the same problem as in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does FusedMultiplyAdd
handle that? Does LSRA make an internal register available (z17
here) and then the FMA codegen has an extra move from the internal register back to the allocated destination register?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha, that's because the conditionalselect node which wraps it has it's own destination register:
[000083] -A-XG------ t83 = * HWINTRINSIC simd16 double ConditionalSelect REG d17
And the FusedMultiplyAdd
is contained within the ConditionalSelect
, so codegen uses the target register of the ConditionalSelect
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curiously, this is exactly the two issues in #106867. Which this PR now fixes:
With the first:
Generating: N017 ( 1, 1) [000028] ----------- t28 = LCL_VAR simd16 V07 cse0 u:1 d17 REG d17 <l:$280, c:$2c0>
Generating: N019 ( 3, 2) [000005] -----+----- t5 = CNS_VEC simd16<0x3f800000, 0x00000000, 0x00000000, 0x00000000> REG d18 $181
IN0009: ldr q18, (LARGELDC)[@RWD16]
Generating: N021 ( 1, 1) [000030] ----------- t30 = LCL_VAR simd16 V07 cse0 u:1 d17 (last use) REG d17 <l:$280, c:$2c0>
Generating: N023 (???,???) [000033] ----------- t33 = HWINTRINSIC mask float CreateTrueMaskAll REG p1
IN000a: ptrue p1.s
Generating: N025 (???,???) [000034] -c--------- t34 = CNS_VEC simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
/--* t28 simd16
+--* t5 simd16
+--* t30 simd16
Generating: N027 ( 12, 19) [000012] -A--G+----- t12 = * HWINTRINSIC simd16 float FusedMultiplySubtractNegated REG NA <l:$300, c:$301>
/--* t33 mask
+--* t12 simd16
+--* t34 simd16
Generating: N029 (???,???) [000035] -A--G------ t35 = * HWINTRINSIC simd16 float ConditionalSelect REG d17
V07 in reg d17 is becoming dead [000030]
Live regs: 0002000000000001 {x0 d17} - {d17} => 0000000000000001 {x0}
Live vars after [000030]: {V00 V07} -{V07} => {V00}
IN000b: movprfx z17.s, p1/z, z17.s
IN000c: fnmsb z17.s, p1/m, z18.s, z17.s
is now:
Generating: N017 ( 1, 1) [000028] ----------- t28 = LCL_VAR simd16 V07 cse0 u:1 d17 REG d17 <l:$280, c:$2c0>
Generating: N019 ( 3, 2) [000005] -----+----- t5 = CNS_VEC simd16<0x3f800000, 0x00000000, 0x00000000, 0x00000000> REG d18 $181
IN0009: ldr q18, (LARGELDC)[@RWD16]
Generating: N021 ( 1, 1) [000030] ----------- t30 = LCL_VAR simd16 V07 cse0 u:1 d17 (last use) REG d17 <l:$280, c:$2c0>
Generating: N023 (???,???) [000033] ----------- t33 = HWINTRINSIC mask float CreateTrueMaskAll REG p1
IN000a: ptrue p1.s
Generating: N025 (???,???) [000034] -c--------- t34 = CNS_VEC simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
/--* t28 simd16
+--* t5 simd16
+--* t30 simd16
Generating: N027 ( 12, 19) [000012] -A--G+----- t12 = * HWINTRINSIC simd16 float FusedMultiplySubtractNegated REG NA <l:$300, c:$301>
/--* t33 mask
+--* t12 simd16
+--* t34 simd16
Generating: N029 (???,???) [000035] -A--G------ t35 = * HWINTRINSIC simd16 float ConditionalSelect REG d19
V07 in reg d17 is becoming dead [000030]
Live regs: 0002000000000001 {x0 d17} - {d17} => 0000000000000001 {x0}
Live vars after [000030]: {V00 V07} -{V07} => {V00}
IN000b: movprfx z19.s, p1/z, z17.s
IN000c: fnmls z19.s, p1/m, z18.s, z17.s
The second one:
Generating: N017 ( 1, 1) [000003] -----+----- t3 = LCL_VAR simd16<System.Numerics.Vector`1> V00 loc0 u:2 d16 REG d16 $100
Generating: N019 ( 1, 1) [000004] -----+----- t4 = LCL_VAR simd16<System.Numerics.Vector`1> V00 loc0 u:2 d16 REG d16 $100
Generating: N021 ( 1, 1) [000005] -----+----- t5 = LCL_VAR simd16<System.Numerics.Vector`1> V00 loc0 u:2 d16 (last use) REG d16 $100
Generating: N023 (???,???) [000025] ----------- t25 = HWINTRINSIC mask short CreateTrueMaskAll REG p1
IN0006: ptrue p1.h
Generating: N025 (???,???) [000026] -c--------- t26 = CNS_VEC simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
/--* t3 simd16
+--* t4 simd16
+--* t5 simd16
Generating: N027 ( 4, 4) [000006] -c---+----- t6 = * HWINTRINSIC simd16 short MultiplyAdd REG NA $200
/--* t25 mask
+--* t6 simd16
+--* t26 simd16
Generating: N029 (???,???) [000027] ----------- t27 = * HWINTRINSIC simd16 short ConditionalSelect REG d16
V00 in reg d16 is becoming dead [000005]
Live regs: 0001000000000000 {d16} - {d16} => 0000000000000000 {}
Debug: Closing V00 debug range.
Live vars after [000005]: {V00} -{V00} => {}
IN0007: movprfx z16.h, p1/z, z16.h
IN0008: mad z16.h, p1/m, z16.h, z16.h
is now:
Generating: N017 ( 1, 1) [000003] -----+----- t3 = LCL_VAR simd16<System.Numerics.Vector`1> V00 loc0 u:2 d16 REG d16 $100
Generating: N019 ( 1, 1) [000004] -----+----- t4 = LCL_VAR simd16<System.Numerics.Vector`1> V00 loc0 u:2 d16 REG d16 $100
Generating: N021 ( 1, 1) [000005] -----+----- t5 = LCL_VAR simd16<System.Numerics.Vector`1> V00 loc0 u:2 d16 (last use) REG d16 $100
Generating: N023 (???,???) [000025] ----------- t25 = HWINTRINSIC mask short CreateTrueMaskAll REG p1
IN0006: ptrue p1.h
Generating: N025 (???,???) [000026] -c--------- t26 = CNS_VEC simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
/--* t3 simd16
+--* t4 simd16
+--* t5 simd16
Generating: N027 ( 4, 4) [000006] -c---+----- t6 = * HWINTRINSIC simd16 short MultiplyAdd REG NA $200
/--* t25 mask
+--* t6 simd16
+--* t26 simd16
Generating: N029 (???,???) [000027] ----------- t27 = * HWINTRINSIC simd16 short ConditionalSelect REG d17
V00 in reg d16 is becoming dead [000005]
Live regs: 0001000000000000 {d16} - {d16} => 0000000000000000 {}
Debug: Closing V00 debug range.
Live vars after [000005]: {V00} -{V00} => {}
IN0007: movprfx z17.h, p1/z, z16.h
IN0008: mla z17.h, p1/m, z16.h, z16.h
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the those tests to this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jakobbotsch - are you happy with this PR with the latest changes?
TEST_IMG: ubuntu/dotnet-build TEST_CMD: safe ./projects/dotnet/build-runtime.sh Jira: ENTLLT-7634 Change-Id: I337a291be6661f104fe90c7cdc27150eede43647
src/tests/JIT/Regression/JitBlue/Runtime_106867/Runtime_106867_1.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
/backport to release/9.0-rc1 |
Started backporting to release/9.0-rc1: https://github.com/dotnet/runtime/actions/runs/10729636992 |
/backport to release/9.0 |
Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/10729651396 |
@kunalspathak backporting to release/9.0 failed, the patch most likely resulted in conflicts: $ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch
Applying: ARM64-SVE: Delay free all ops within conditional select
Applying: Fix comment
Applying: Add test header
Applying: don't delay prefUseOpNum
Applying: Fix FMA
Applying: Add assert checks for delay free
Applying: Merge embedded op build code
Applying: fix formatting
Applying: simplify assert
Applying: simplify FMA code
Applying: Add tests for 106867
Applying: ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic
Applying: Add Sve.IsSupported to tests
Applying: Add Sve.IsSupported to test
Applying: fix formatting
error: sha1 information is lacking or useless (src/coreclr/jit/lowerarmarch.cpp).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0015 fix formatting
Error: The process '/usr/bin/git' failed with exit code 128 Please backport manually! |
@kunalspathak an error occurred while backporting to release/9.0, please check the run log for details! Error: git am failed, most likely due to a merge conflict. |
* ARM64-SVE: Delay free all ops within conditional select * Fix comment * Add test header * don't delay prefUseOpNum * Fix FMA * Add assert checks for delay free * Merge embedded op build code * fix formatting * simplify assert * simplify FMA code * Add tests for 106867 * ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic TEST_IMG: ubuntu/dotnet-build TEST_CMD: safe ./projects/dotnet/build-runtime.sh Jira: ENTLLT-7634 Change-Id: I337a291be6661f104fe90c7cdc27150eede43647 * Add Sve.IsSupported to tests * Add Sve.IsSupported to test * fix formatting * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"
…07436) * ARM64-SVE: Delay free all ops within conditional select * Fix comment * Add test header * don't delay prefUseOpNum * Fix FMA * Add assert checks for delay free * Merge embedded op build code * fix formatting * simplify assert * simplify FMA code * Add tests for 106867 * ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic TEST_IMG: ubuntu/dotnet-build TEST_CMD: safe ./projects/dotnet/build-runtime.sh Jira: ENTLLT-7634 Change-Id: I337a291be6661f104fe90c7cdc27150eede43647 * Add Sve.IsSupported to tests * Add Sve.IsSupported to test * fix formatting * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" Co-authored-by: Alan Hayward <a74nh@users.noreply.github.com>
* ARM64-SVE: Delay free all ops within conditional select * Fix comment * Add test header * don't delay prefUseOpNum * Fix FMA * Add assert checks for delay free * Merge embedded op build code * fix formatting * simplify assert * simplify FMA code * Add tests for 106867 * ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic TEST_IMG: ubuntu/dotnet-build TEST_CMD: safe ./projects/dotnet/build-runtime.sh Jira: ENTLLT-7634 Change-Id: I337a291be6661f104fe90c7cdc27150eede43647 * Add Sve.IsSupported to tests * Add Sve.IsSupported to test * fix formatting * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"
* ARM64-SVE: Delay free all ops within conditional select * Fix comment * Add test header * don't delay prefUseOpNum * Fix FMA * Add assert checks for delay free * Merge embedded op build code * fix formatting * simplify assert * simplify FMA code * Add tests for 106867 * ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic TEST_IMG: ubuntu/dotnet-build TEST_CMD: safe ./projects/dotnet/build-runtime.sh Jira: ENTLLT-7634 Change-Id: I337a291be6661f104fe90c7cdc27150eede43647 * Add Sve.IsSupported to tests * Add Sve.IsSupported to test * fix formatting * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic" * Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"
Fixes #106864
Fixes #106867
Embedded ops may be prefixed by a MOVPFRX instruction. If so, then the embedded op can't use the reuse a source register as the destination. Ensure all the inputs are marked as delay free.
This solves the case where an embedded mask operation is passed two identical inputs.