JIT ARM64-SVE: Add FK_3{A,B,C}, EJ_3A, EK_3A, EY_3B, EW_3{A,B} #98187

Merged: 12 commits, Feb 11, 2024

amanasifkhalid
Member

Part of #94549. Implements the following encodings:

  • IF_SVE_FK_3A
  • IF_SVE_FK_3B
  • IF_SVE_FK_3C
  • IF_SVE_EJ_3A
  • IF_SVE_EK_3A
  • IF_SVE_EY_3B
  • IF_SVE_EW_3A (SVE2, unsupported by capstone)
  • IF_SVE_EW_3B (SVE2, unsupported by capstone)

cstool output:

sqrdmlah      z0.h, z1.h, z1.h[1]
sqrdmlah      z2.h, z3.h, z3.h[3]
sqrdmlsh      z4.h, z5.h, z5.h[5]
sqrdmlsh      z6.h, z7.h, z7.h[7]
sqrdmlah      z8.s, z9.s, z0.s[0]
sqrdmlah      z10.s, z11.s, z2.s[1]
sqrdmlsh      z12.s, z13.s, z4.s[2]
sqrdmlsh      z14.s, z15.s, z6.s[3]
sqrdmlah      z16.d, z17.d, z0.d[0]
sqrdmlah      z18.d, z19.d, z5.d[1]
sqrdmlsh      z20.d, z21.d, z10.d[0]
sqrdmlsh      z22.d, z23.d, z15.d[1]
cdot  z0.s, z1.b, z2.b, #0
cdot  z3.s, z4.b, z5.b, #90
cdot  z6.d, z7.h, z8.h, #180
cdot  z9.d, z10.h, z11.h, #270
cmla  z0.b, z1.b, z2.b, #0
cmla  z3.h, z4.h, z5.h, #90
cmla  z6.s, z7.s, z8.s, #180
cmla  z9.d, z10.d, z11.d, #270
sqrdcmlah     z12.b, z13.b, z14.b, #0
sqrdcmlah     z15.h, z16.h, z17.h, #90
sqrdcmlah     z18.s, z19.s, z20.s, #180
sqrdcmlah     z21.d, z22.d, z23.d, #270
sdot  z0.d, z1.h, z0.h[0]
sdot  z2.d, z3.h, z5.h[1]
udot  z4.d, z5.h, z10.h[0]
udot  z6.d, z7.h, z15.h[1]

JitDisasm output:

sqrdmlah z0.h, z1.h, z1.h[1]
sqrdmlah z2.h, z3.h, z3.h[3]
sqrdmlsh z4.h, z5.h, z5.h[5]
sqrdmlsh z6.h, z7.h, z7.h[7]
sqrdmlah z8.s, z9.s, z0.s[0]
sqrdmlah z10.s, z11.s, z2.s[1]
sqrdmlsh z12.s, z13.s, z4.s[2]
sqrdmlsh z14.s, z15.s, z6.s[3]
sqrdmlah z16.d, z17.d, z0.d[0]
sqrdmlah z18.d, z19.d, z5.d[1]
sqrdmlsh z20.d, z21.d, z10.d[0]
sqrdmlsh z22.d, z23.d, z15.d[1]
cdot    z0.s, z1.b, z2.b, #0
cdot    z3.s, z4.b, z5.b, #90
cdot    z6.d, z7.h, z8.h, #180
cdot    z9.d, z10.h, z11.h, #270
cmla    z0.b, z1.b, z2.b, #0
cmla    z3.h, z4.h, z5.h, #90
cmla    z6.s, z7.s, z8.s, #180
cmla    z9.d, z10.d, z11.d, #270
sqrdcmlah z12.b, z13.b, z14.b, #0
sqrdcmlah z15.h, z16.h, z17.h, #90
sqrdcmlah z18.s, z19.s, z20.s, #180
sqrdcmlah z21.d, z22.d, z23.d, #270
sdot    z0.d, z1.h, z0.h[0]
sdot    z2.d, z3.h, z5.h[1]
udot    z4.d, z5.h, z10.h[0]
udot    z6.d, z7.h, z15.h[1]

cc @dotnet/arm64-contrib

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Feb 8, 2024
@ghost ghost assigned amanasifkhalid Feb 8, 2024

@amanasifkhalid amanasifkhalid added the arm-sve Work related to arm64 SVE/SVE2 support label Feb 8, 2024
@amanasifkhalid
Member Author

@a74nh sorry I stole one of your encodings; I thought I would get IF_SVE_EW_3A out of the way if I'm going to do IF_SVE_EW_3B.

@@ -5912,6 +5912,42 @@ void CodeGen::genArm64EmitterUnitTestsSve()
theEmitter->emitIns_R_R_R_I(INS_sve_udot, EA_SCALABLE, REG_V7, REG_V8, REG_V3, 3,
INS_OPTS_SCALABLE_H); // UDOT <Zda>.S, <Zn>.H, <Zm>.H[<imm>]

// IF_SVE_EJ_3A
theEmitter->emitIns_R_R_R_I(INS_sve_cdot, EA_SCALABLE, REG_V0, REG_V1, REG_V2, 0,
Contributor

For these instructions that take a rotation value, shouldn't we be passing 0, 90, 180, 270 instead of 0, 1, 2, 3?

I also implemented a few instructions that need rotation values in #98141, where I pass 0, 90, 180, 270.

Member Author

@amanasifkhalid amanasifkhalid Feb 8, 2024

It looks a little weird passing 0-3 instead of 0-270, but I did this to match the bit-level representation of the rotation value, so that we don't need a helper method to encode the rr bits. When displaying the instruction in JitDisasm output, I multiply the immediate by 90 to display it correctly. I'm fine changing my approach to match yours, though I'll have to update a few encodings already merged in. @kunalspathak @a74nh do you have any preference?

Contributor

The way I look at it, the emitIns functions are APIs that should try to match what the instructions actually are. The bit-level representation/encoding is an implementation detail.

Member Author

I see. In that case, how about I wait for #98141 to be merged in, and then I'll update my encodings that use rotation values to use the helper methods you added?

Contributor

That seems fair to me

Contributor

I would prefer we write for readability first if we know that such an optimization would not make any difference in 99.9% of scenarios.

However, I'm fine with encoding it as 0, 1, 2, 3 on instrDesc if that is what we all want. I'll have to adjust my work as well.

Member

> I would prefer we write for readability first

It is readable from the perspective of calling the emitIns* method as seen here: https://github.com/dotnet/runtime/pull/98187/files#diff-d4f9f119d0a321cea7e82023cb754d8abdb800d6185c8bb9464d389ebd50debcR6288

and we flip it just before saving it in instrdesc:

https://github.com/dotnet/runtime/pull/98187/files#diff-2b2c8b9011607926410624d6f81613fad7b74c6e0516d578675a8b792998fe4fR11110

I am not sure if the emitOutputInstr() method is readable anyway :)

Member Author

I'm sympathetic to the readability argument. I guess the silver lining with our current approach is the bitwise representation of the rotation value is abstracted away from the API surface (i.e. the emitIns methods). Maybe I'm being naive, but I don't anticipate the code for handling the rotation values in emitIns or emitDispInsHelp changing with any frequency after this is merged in, whereas the usage of emitIns will certainly increase once we start using these SVE instructions. So in the "important" case, readability isn't hindered.

Contributor

It isn't necessarily about making emitOutputInstr readable, but about keeping the display code simple. We will have to decode the imm, whose values are 0-3, into 0-270 for display.

Contributor

I guess what I'm trying to say is, if we encode the values as 0-3 on the instrDesc, we will have to have an encode/decode for it, whereas if we store 0-270, we only need one encode function.
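
A minimal sketch of the encode/decode pair discussed in this thread, assuming the rotation occupies a 2-bit field. The names `encodeRotation` and `decodeRotation` are hypothetical and are not the helpers added in #98141; they only illustrate the mapping between degrees (0, 90, 180, 270) and the `rr` value (0-3).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (illustration only, not the dotnet/runtime API):
// maps a rotation in degrees to the 2-bit 'rr' field value.
inline uint32_t encodeRotation(int degrees)
{
    assert(degrees == 0 || degrees == 90 || degrees == 180 || degrees == 270);
    return static_cast<uint32_t>(degrees / 90);
}

// Hypothetical helper (illustration only): maps the 2-bit 'rr' field value
// back to degrees, e.g. for printing "#90" in disassembly.
inline int decodeRotation(uint32_t rr)
{
    assert(rr <= 3);
    return static_cast<int>(rr) * 90;
}
```

If the instruction descriptor stores degrees, only the encoder is needed at output time; if it stores the 2-bit value, both directions are needed, which is the trade-off raised in the comment above.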

@ryujit-bot

Diff results for #98187

Throughput diffs

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
benchmarks.run.windows.arm64.checked.mch +0.01%

Details here


{
assert(isValidUimm4(imm)); // ii rr
assert((REG_V0 <= reg3) && (reg3 <= REG_V7));
fmt = IF_SVE_FA_3A;
Member

What are these changes? I see that we moved it under emitIns_R_R_R_I_I(), and I'm wondering why this was not done when we implemented SVE_FA_3A, SVE_FA_3B, etc.

Member Author

When I first added emitIns_R_R_R_I_I, I was trying to minimize code duplication, so I just made it a wrapper for emitIns_R_R_R_I by bitwise OR-ing imm1 and imm2 into one imm, and then passing this along to emitIns_R_R_R_I to do the rest. Now I'm running into instructions that have encodings that take one immediate, and encodings that take two immediates, so it's easier to separate these two emitIns methods out.
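
As a rough illustration of the wrapper approach described here, the sketch below packs an index and a rotation into one 4-bit immediate. The function names are hypothetical, and the layout (index in the upper two bits, rotation in the lower two) is an assumption inferred from the `// ii rr` comment in the snippet above, not taken from the JIT source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packing helper (illustration only). Assumed layout: "ii rr",
// i.e. a 2-bit index in bits [3:2] and a 2-bit rotation in bits [1:0].
inline uint32_t packIndexAndRotation(uint32_t index, uint32_t rot)
{
    assert(index <= 3);
    assert(rot <= 3);
    return (index << 2) | rot;
}

// Recover the two fields from the combined 4-bit immediate.
inline uint32_t unpackIndex(uint32_t imm)
{
    assert(imm <= 0xF);
    return (imm >> 2) & 0x3;
}

inline uint32_t unpackRotation(uint32_t imm)
{
    assert(imm <= 0xF);
    return imm & 0x3;
}
```

A single-immediate emit path can carry the packed value, but every consumer then has to know the layout, which is one reason to keep the one-immediate and two-immediate paths separate.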

else
{
assert(opt == INS_OPTS_SCALABLE_D);
assert((REG_V0 <= reg3) && (reg3 <= REG_V15)); // mmmm
Member

worth adding a function like isLowVectorRegister()?

Member Author

Sure thing.
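
One way such a helper might look, as a sketch only. The `regNumber` enum below is a simplified stand-in with made-up values rather than the JIT's real register numbering, and the helper actually added in the PR may differ.

```cpp
#include <cassert>

// Simplified stand-in for the JIT's register enum (values are assumptions).
enum regNumber : int
{
    REG_V0  = 0,
    REG_V7  = 7,
    REG_V15 = 15,
    REG_V31 = 31
};

// Returns true when 'reg' is one of V0-V15, the range that fits in a
// 4-bit register field such as the 'mmmm' field asserted above.
inline bool isLowVectorRegister(regNumber reg)
{
    return (REG_V0 <= reg) && (reg <= REG_V15);
}
```

With a helper like this, the range check in the snippet above could read `assert(isLowVectorRegister(reg3));`.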

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Feb 9, 2024
@amanasifkhalid
Member Author

@kunalspathak thanks for the review -- I applied your feedback

@ryujit-bot

Diff results for #98187

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
libraries.crossgen2.linux.arm64.checked.mch +0.01%
libraries_tests.run.linux.arm64.Release.mch +0.01%
coreclr_tests.run.linux.arm64.checked.mch +0.01%
benchmarks.run.linux.arm64.checked.mch +0.01%
benchmarks.run_pgo.linux.arm64.checked.mch +0.01%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.01%
benchmarks.run_tiered.linux.arm64.checked.mch +0.01%

Details here


Member

@kunalspathak kunalspathak left a comment

LGTM. Thanks!

@amanasifkhalid
Member Author

I merged in #98141, and replaced my rotation value fixups with the helper methods @TIHan added.

@ryujit-bot

Diff results for #98187

Throughput diffs

Throughput diffs for windows/arm64 ran on windows/x64

MinOpts (-0.00% to +0.01%)
Collection PDIFF
libraries.pmi.windows.arm64.checked.mch +0.01%

Details here


Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
coreclr_tests.run.linux.arm64.checked.mch +0.01%
libraries.crossgen2.linux.arm64.checked.mch +0.01%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.01%
benchmarks.run_pgo.linux.arm64.checked.mch +0.01%
benchmarks.run_tiered.linux.arm64.checked.mch +0.01%
libraries_tests.run.linux.arm64.Release.mch +0.01%
benchmarks.run.linux.arm64.checked.mch +0.01%

Details here


@ryujit-bot

Diff results for #98187

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64

MinOpts (+0.00% to +0.01%)
Collection PDIFF
benchmarks.run_pgo.linux.arm64.checked.mch +0.01%
libraries.crossgen2.linux.arm64.checked.mch +0.01%
smoke_tests.nativeaot.linux.arm64.checked.mch +0.01%
benchmarks.run_tiered.linux.arm64.checked.mch +0.01%
benchmarks.run.linux.arm64.checked.mch +0.01%
coreclr_tests.run.linux.arm64.checked.mch +0.01%
libraries_tests.run.linux.arm64.Release.mch +0.01%

Details here


@amanasifkhalid amanasifkhalid deleted the sve-sqrdmlah branch February 12, 2024 03:37
@github-actions github-actions bot locked and limited conversation to collaborators Mar 13, 2024