Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM64-SVE: Delay free all ops within conditional select #107036

Merged
merged 19 commits into from
Sep 5, 2024

Conversation

a74nh
Copy link
Contributor

@a74nh a74nh commented Aug 27, 2024

Fixes #106864
Fixes #106867

Embedded ops may be prefixed by a MOVPFRX instruction. If so, then the embedded op can't use the reuse a source register as the destination. Ensure all the inputs are marked as delay free.

This solves the case where an embedded mask operation is passed two identical inputs.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Aug 27, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Aug 27, 2024
@a74nh a74nh marked this pull request as ready for review August 27, 2024 16:11
@a74nh
Copy link
Contributor Author

a74nh commented Aug 27, 2024

@dotnet/arm64-contrib @kunalspathak

Copy link
Member

@amanasifkhalid amanasifkhalid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

src/coreclr/jit/lsraarm64.cpp Outdated Show resolved Hide resolved
}

srcCount += BuildDelayFreeUses(intrin.op3, emitOp1);
prefUseNode = emitOp1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't we have the problem here for the resultOpNum == 3 case? prefUseNode will end up being intrinEmb.op3, hence the last ref position we build can end up being a non-delay freed version of an earlier operand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AIUI, the code in hwintrinsiccodegenarm64.cpp will choose the correct instruction based on which input register matches the output register.

For the resultOpNum == 3 case it knows op3 is the last use so ideally that should be the RMW operand. Then lsra ensures that is the one that does not get delay freed.

Possible there is something I'm missing here.

If that logic is correct, then I think the code can be simplified to:

            unsigned resultOpNum =
                embOp2Node->GetResultOpNumForRmwIntrinsic(user, intrinEmb.op1, intrinEmb.op2, intrinEmb.op3);

            if (resultOpNum == 0)
            {
                   prefUseNode = embOp2Node->Op(1);
            }
            else
            {
                  assert(resultOpNum >= 1 && resultOpNum <= 3);
                 prefUseNode = embOp2Node->Op(resultOpNum);
            }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'm not understanding is when do we have the correctness requirement that the destination register is not the same as an operand register? Does this case have the correctness requirement?

I am less interesting in the optimizations happening here until we get the constraints encoded correctly in a way that I know register allocation will handle. Currently I am not convinced, but I also do not have a good understanding of what cases have the register requirements we have been discussing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The requirements here would be that 1) op1 and the destination are ideally same register for RMW use 2) The other inputs can't use the same operand as the destination (because it might follow a MOVPRFX)

The codegen then outputs extra code to handle the case when op1 and the destination don't match. Plus it asserts if other inputs share the same register as the destination.

The optimisation is that any of the inputs can be chosen as the RMW node (which is what the GetResultOpNumForRmwIntrinsic() is picking). Codegeneration spots this has happened by looking at which input matches the destination.

Copy link
Member

@jakobbotsch jakobbotsch Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does anything prevent prefUseNode to be op3 and op2 to be the same interval as op3? It will create a delay-freed op2 which has no effect. It will create non delay-freed op3. Then the destination register is allowed to be the same as op2 and op3. If I understand correctly, this will hit the same problem as in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does FusedMultiplyAdd handle that? Does LSRA make an internal register available (z17 here) and then the FMA codegen has an extra move from the internal register back to the allocated destination register?

Copy link
Contributor Author

@a74nh a74nh Aug 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha, that's because the conditionalselect node which wraps it has it's own destination register:

[000083] -A-XG------                   t83 = *  HWINTRINSIC simd16 double ConditionalSelect REG d17

And the FusedMultiplyAdd is contained within the ConditionalSelect, so codegen uses the target register of the ConditionalSelect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curiously, this is exactly the two issues in #106867. Which this PR now fixes:

With the first:

Generating: N017 (  1,  1) [000028] -----------                   t28 =    LCL_VAR   simd16 V07 cse0         u:1 d17 REG d17 <l:$280, c:$2c0>
Generating: N019 (  3,  2) [000005] -----+-----                    t5 =    CNS_VEC   simd16<0x3f800000, 0x00000000, 0x00000000, 0x00000000> REG d18 $181
IN0009:             ldr     q18, (LARGELDC)[@RWD16]
Generating: N021 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   simd16 V07 cse0         u:1 d17 (last use) REG d17 <l:$280, c:$2c0>
Generating: N023 (???,???) [000033] -----------                   t33 =    HWINTRINSIC mask   float CreateTrueMaskAll REG p1
IN000a:             ptrue   p1.s
Generating: N025 (???,???) [000034] -c---------                   t34 =    CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
                                                                        /--*  t28    simd16 
                                                                        +--*  t5     simd16 
                                                                        +--*  t30    simd16 
Generating: N027 ( 12, 19) [000012] -A--G+-----                   t12 = *  HWINTRINSIC simd16 float FusedMultiplySubtractNegated REG NA <l:$300, c:$301>
                                                                        /--*  t33    mask   
                                                                        +--*  t12    simd16 
                                                                        +--*  t34    simd16 
Generating: N029 (???,???) [000035] -A--G------                   t35 = *  HWINTRINSIC simd16 float ConditionalSelect REG d17
							V07 in reg d17 is becoming dead  [000030]
							Live regs: 0002000000000001 {x0 d17} - {d17} => 0000000000000001 {x0}
							Live vars after [000030]: {V00 V07} -{V07} => {V00}
IN000b:             movprfx z17.s, p1/z, z17.s
IN000c:             fnmsb   z17.s, p1/m, z18.s, z17.s

is now:

Generating: N017 (  1,  1) [000028] -----------                   t28 =    LCL_VAR   simd16 V07 cse0         u:1 d17 REG d17 <l:$280, c:$2c0>
Generating: N019 (  3,  2) [000005] -----+-----                    t5 =    CNS_VEC   simd16<0x3f800000, 0x00000000, 0x00000000, 0x00000000> REG d18 $181
IN0009:             ldr     q18, (LARGELDC)[@RWD16]
Generating: N021 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   simd16 V07 cse0         u:1 d17 (last use) REG d17 <l:$280, c:$2c0>
Generating: N023 (???,???) [000033] -----------                   t33 =    HWINTRINSIC mask   float CreateTrueMaskAll REG p1
IN000a:             ptrue   p1.s
Generating: N025 (???,???) [000034] -c---------                   t34 =    CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
                                                                        /--*  t28    simd16 
                                                                        +--*  t5     simd16 
                                                                        +--*  t30    simd16 
Generating: N027 ( 12, 19) [000012] -A--G+-----                   t12 = *  HWINTRINSIC simd16 float FusedMultiplySubtractNegated REG NA <l:$300, c:$301>
                                                                        /--*  t33    mask   
                                                                        +--*  t12    simd16 
                                                                        +--*  t34    simd16 
Generating: N029 (???,???) [000035] -A--G------                   t35 = *  HWINTRINSIC simd16 float ConditionalSelect REG d19
							V07 in reg d17 is becoming dead  [000030]
							Live regs: 0002000000000001 {x0 d17} - {d17} => 0000000000000001 {x0}
							Live vars after [000030]: {V00 V07} -{V07} => {V00}
IN000b:             movprfx z19.s, p1/z, z17.s
IN000c:             fnmls   z19.s, p1/m, z18.s, z17.s

The second one:

Generating: N017 (  1,  1) [000003] -----+-----                    t3 =    LCL_VAR   simd16<System.Numerics.Vector`1> V00 loc0         u:2 d16 REG d16 $100
Generating: N019 (  1,  1) [000004] -----+-----                    t4 =    LCL_VAR   simd16<System.Numerics.Vector`1> V00 loc0         u:2 d16 REG d16 $100
Generating: N021 (  1,  1) [000005] -----+-----                    t5 =    LCL_VAR   simd16<System.Numerics.Vector`1> V00 loc0         u:2 d16 (last use) REG d16 $100
Generating: N023 (???,???) [000025] -----------                   t25 =    HWINTRINSIC mask   short CreateTrueMaskAll REG p1
IN0006:             ptrue   p1.h
Generating: N025 (???,???) [000026] -c---------                   t26 =    CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
                                                                        /--*  t3     simd16 
                                                                        +--*  t4     simd16 
                                                                        +--*  t5     simd16 
Generating: N027 (  4,  4) [000006] -c---+-----                    t6 = *  HWINTRINSIC simd16 short MultiplyAdd REG NA $200
                                                                        /--*  t25    mask   
                                                                        +--*  t6     simd16 
                                                                        +--*  t26    simd16 
Generating: N029 (???,???) [000027] -----------                   t27 = *  HWINTRINSIC simd16 short ConditionalSelect REG d16
							V00 in reg d16 is becoming dead  [000005]
							Live regs: 0001000000000000 {d16} - {d16} => 0000000000000000 {}
Debug: Closing V00 debug range.
							Live vars after [000005]: {V00} -{V00} => {}
IN0007:             movprfx z16.h, p1/z, z16.h
IN0008:             mad     z16.h, p1/m, z16.h, z16.h

is now:

Generating: N017 (  1,  1) [000003] -----+-----                    t3 =    LCL_VAR   simd16<System.Numerics.Vector`1> V00 loc0         u:2 d16 REG d16 $100
Generating: N019 (  1,  1) [000004] -----+-----                    t4 =    LCL_VAR   simd16<System.Numerics.Vector`1> V00 loc0         u:2 d16 REG d16 $100
Generating: N021 (  1,  1) [000005] -----+-----                    t5 =    LCL_VAR   simd16<System.Numerics.Vector`1> V00 loc0         u:2 d16 (last use) REG d16 $100
Generating: N023 (???,???) [000025] -----------                   t25 =    HWINTRINSIC mask   short CreateTrueMaskAll REG p1
IN0006:             ptrue   p1.h
Generating: N025 (???,???) [000026] -c---------                   t26 =    CNS_VEC   simd16<0x00000000, 0x00000000, 0x00000000, 0x00000000> REG NA
                                                                        /--*  t3     simd16 
                                                                        +--*  t4     simd16 
                                                                        +--*  t5     simd16 
Generating: N027 (  4,  4) [000006] -c---+-----                    t6 = *  HWINTRINSIC simd16 short MultiplyAdd REG NA $200
                                                                        /--*  t25    mask   
                                                                        +--*  t6     simd16 
                                                                        +--*  t26    simd16 
Generating: N029 (???,???) [000027] -----------                   t27 = *  HWINTRINSIC simd16 short ConditionalSelect REG d17
							V00 in reg d16 is becoming dead  [000005]
							Live regs: 0001000000000000 {d16} - {d16} => 0000000000000000 {}
Debug: Closing V00 debug range.
							Live vars after [000005]: {V00} -{V00} => {}
IN0007:             movprfx z17.h, p1/z, z16.h
IN0008:             mla     z17.h, p1/m, z16.h, z16.h

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the those tests to this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobbotsch - are you happy with this PR with the latest changes?

TEST_IMG: ubuntu/dotnet-build
TEST_CMD: safe ./projects/dotnet/build-runtime.sh

Jira: ENTLLT-7634
Change-Id: I337a291be6661f104fe90c7cdc27150eede43647
@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Sep 3, 2024
Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks!

@kunalspathak kunalspathak merged commit 2dd517e into dotnet:main Sep 5, 2024
115 of 117 checks passed
@kunalspathak
Copy link
Member

/backport to release/9.0-rc1

Copy link
Contributor

github-actions bot commented Sep 5, 2024

Started backporting to release/9.0-rc1: https://github.com/dotnet/runtime/actions/runs/10729636992

@kunalspathak
Copy link
Member

/backport to release/9.0

Copy link
Contributor

github-actions bot commented Sep 5, 2024

Started backporting to release/9.0: https://github.com/dotnet/runtime/actions/runs/10729651396

Copy link
Contributor

github-actions bot commented Sep 5, 2024

@kunalspathak backporting to release/9.0 failed, the patch most likely resulted in conflicts:

$ git am --3way --empty=keep --ignore-whitespace --keep-non-patch changes.patch

Applying: ARM64-SVE: Delay free all ops within conditional select
Applying: Fix comment
Applying: Add test header
Applying: don't delay prefUseOpNum
Applying: Fix FMA
Applying: Add assert checks for delay free
Applying: Merge embedded op build code
Applying: fix formatting
Applying: simplify assert
Applying: simplify FMA code
Applying: Add tests for 106867
Applying: ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic
Applying: Add Sve.IsSupported to tests
Applying: Add Sve.IsSupported to test
Applying: fix formatting
error: sha1 information is lacking or useless (src/coreclr/jit/lowerarmarch.cpp).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0015 fix formatting
Error: The process '/usr/bin/git' failed with exit code 128

Please backport manually!

Copy link
Contributor

github-actions bot commented Sep 5, 2024

@kunalspathak an error occurred while backporting to release/9.0, please check the run log for details!

Error: git am failed, most likely due to a merge conflict.

kunalspathak pushed a commit to kunalspathak/runtime that referenced this pull request Sep 5, 2024
* ARM64-SVE: Delay free all ops within conditional select

* Fix comment

* Add test header

* don't delay prefUseOpNum

* Fix FMA

* Add assert checks for delay free

* Merge embedded op build code

* fix formatting

* simplify assert

* simplify FMA code

* Add tests for 106867

* ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic

TEST_IMG: ubuntu/dotnet-build
TEST_CMD: safe ./projects/dotnet/build-runtime.sh

Jira: ENTLLT-7634
Change-Id: I337a291be6661f104fe90c7cdc27150eede43647

* Add Sve.IsSupported to tests

* Add Sve.IsSupported to test

* fix formatting

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"
@a74nh a74nh deleted the reg_github branch September 6, 2024 11:22
carlossanlop pushed a commit that referenced this pull request Sep 6, 2024
…07436)

* ARM64-SVE: Delay free all ops within conditional select

* Fix comment

* Add test header

* don't delay prefUseOpNum

* Fix FMA

* Add assert checks for delay free

* Merge embedded op build code

* fix formatting

* simplify assert

* simplify FMA code

* Add tests for 106867

* ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic

TEST_IMG: ubuntu/dotnet-build
TEST_CMD: safe ./projects/dotnet/build-runtime.sh

Jira: ENTLLT-7634
Change-Id: I337a291be6661f104fe90c7cdc27150eede43647

* Add Sve.IsSupported to tests

* Add Sve.IsSupported to test

* fix formatting

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

Co-authored-by: Alan Hayward <a74nh@users.noreply.github.com>
jtschuster pushed a commit to jtschuster/runtime that referenced this pull request Sep 17, 2024
* ARM64-SVE: Delay free all ops within conditional select

* Fix comment

* Add test header

* don't delay prefUseOpNum

* Fix FMA

* Add assert checks for delay free

* Merge embedded op build code

* fix formatting

* simplify assert

* simplify FMA code

* Add tests for 106867

* ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic

TEST_IMG: ubuntu/dotnet-build
TEST_CMD: safe ./projects/dotnet/build-runtime.sh

Jira: ENTLLT-7634
Change-Id: I337a291be6661f104fe90c7cdc27150eede43647

* Add Sve.IsSupported to tests

* Add Sve.IsSupported to test

* fix formatting

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"
sirntar pushed a commit to sirntar/runtime that referenced this pull request Sep 30, 2024
* ARM64-SVE: Delay free all ops within conditional select

* Fix comment

* Add test header

* don't delay prefUseOpNum

* Fix FMA

* Add assert checks for delay free

* Merge embedded op build code

* fix formatting

* simplify assert

* simplify FMA code

* Add tests for 106867

* ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic

TEST_IMG: ubuntu/dotnet-build
TEST_CMD: safe ./projects/dotnet/build-runtime.sh

Jira: ENTLLT-7634
Change-Id: I337a291be6661f104fe90c7cdc27150eede43647

* Add Sve.IsSupported to tests

* Add Sve.IsSupported to test

* fix formatting

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"

* Revert "ARM64-SVE: Allow op inside conditionselect to be non HWintrinsic"
@github-actions github-actions bot locked and limited conversation to collaborators Oct 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI arm-sve Work related to arm64 SVE/SVE2 support community-contribution Indicates that the PR has been added by a community member
Projects
None yet
5 participants