Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable EVEX feature: Embedded Rounding for Avx512F.Add() #94684

Merged
merged 22 commits into from
Jan 19, 2024

Conversation

Ruihan-Yin
Copy link
Contributor

Description

Enabled EVEX feature: embedded rounding in JIT backend, and exposed only 1 related API:

public static Vector512<double> Add(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);

to establish the complete compiling path starting from importation.

Details

To tag the related AST node/instruction is embedded rounding enabled, we introduced some new flags:

  1. For AST node, we introduced GTF_HW_ER_*: As the APIs are designed in a way that FloatRoundingMode information is passed as an extra operand, we choose to let the AST node carry this operand until lowering, and during lowering, the node will be converted back to the normal version, and the extra operand will be converted to a flag on the node. In this way, we can simply reuse the existing emit paths in the embedded rounding enabled cases.
  2. When reaching the emit stage, we will have the corresponding instruction according to the node information, and we followed the design in Embedded Broadcast, used insOpts to carry this extra information to inform emitter embedded rounding is on, so in the instrDesc, we made the following changes:
    1. renamed _idEvexbContext and extend it to 2 bit as _idIsEmbBroadcast and _idIsEmbRounding
    2. introduced 2 extra bits to indicate the rounding mode: _idEmbRoundingMode

Follow-up in this PR

This PR, at current stage, is intended to show and discuss the design with maintainers, we are open to adjust and improve the designs.

And we will need to add some unit tests to better cover this feature and related APIs.

Follow-up after this PR

As this PR only exposed 1 embedded rounding related API, we will make another follow-up PR to expose other related APIs altogether. Since embedded rounding is mostly compatible with arithmetic and casting intrinsics, some more emit path will be impacted, we expect to extend the same design to different emit path if needed.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels Nov 13, 2023
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Nov 13, 2023
@ghost
Copy link

ghost commented Nov 13, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

Enabled EVEX feature: embedded rounding in JIT backend, and exposed only 1 related API:

public static Vector512<double> Add(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);

to establish the complete compiling path starting from importation.

Details

To tag the related AST node/instruction is embedded rounding enabled, we introduced some new flags:

  1. For AST node, we introduced GTF_HW_ER_*: As the APIs are designed in a way that FloatRoundingMode information is passed as an extra operand, we choose to let the AST node carry this operand until lowering, and during lowering, the node will be converted back to the normal version, and the extra operand will be converted to a flag on the node. In this way, we can simply reuse the existing emit paths in the embedded rounding enabled cases.
  2. When reaching the emit stage, we will have the corresponding instruction according to the node information, and we followed the design in Embedded Broadcast, used insOpts to carry this extra information to inform emitter embedded rounding is on, so in the instrDesc, we made the following changes:
    1. renamed _idEvexbContext and extend it to 2 bit as _idIsEmbBroadcast and _idIsEmbRounding
    2. introduced 2 extra bits to indicate the rounding mode: _idEmbRoundingMode

Follow-up in this PR

This PR, at current stage, is intended to show and discuss the design with maintainers, we are open to adjust and improve the designs.

And we will need to add some unit tests to better cover this feature and related APIs.

Follow-up after this PR

As this PR only exposed 1 embedded rounding related API, we will make another follow-up PR to expose other related APIs altogether. Since embedded rounding is mostly compatible with arithmetic and casting intrinsics, some more emit path will be impacted, we expect to extend the same design to different emit path if needed.

Author: Ruihan-Yin
Assignees: -
Labels:

area-CodeGen-coreclr, new-api-needs-documentation, community-contribution

Milestone: -

@Ruihan-Yin
Copy link
Contributor Author

Failures are known, will apply the format patch later.

Hi @tannergooding, I think we can start to have some discussion on the design of embedded rounding. Would you please take a look at this draft? Thanks!

The impact on the throughput is less than I expected considering the extra bits I introduced in the instrDesc, is it attributed to the optimization given in #87373?

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Nov 17, 2023
1. fix typo in commnets
2. Add SetEvexBroadcastIfNeeded
3. Added mask in insOpts
2. removed round-to-even, the default option from InsOpts as it will be handled on the default path.
@Ruihan-Yin
Copy link
Contributor Author

Ruihan-Yin commented Jan 2, 2024

Looks like this is pending some formatting fixes and tests, is that correct?

Rebased the branch, running the pipeline tests again. I'll push the generated format patch shortly.

For the tests, we are working internally on templates to cover all the embedded rounding APIs, I can push a commit with the template and cover all the Add unit tests, so we can review the test design in this PR, and I would expect to have a separate PR to expose the rest of the APIs and unit tests.

Will the plan sound good to you?

  • Update: format patch is applied, fails looks irrelevant.

…ol byte is not constant.

2. Create a template to generate the unit tests for embedded rounding APIs.
3. nit: fix naming.
@Ruihan-Yin
Copy link
Contributor Author

Ruihan-Yin commented Jan 4, 2024

Updates:

  1. Added the unit test template for embedded rounding APIs.
  2. Added a jump table fallback for non-constant rounding mode control byte.
  3. removed hand-written unit tests.

The expected test results were calculated on i9-9980XE with a C++ program.

Edit: Failures look irrelevant to the changes.

Hi @tannergooding, format patch and unit tests have been applied. There should be no other blockers.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC. @dotnet/jit-contrib for secondary review

Copy link
Member

@BruceForstall BruceForstall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small nits/questions/requests

@@ -774,13 +774,13 @@ class emitter
unsigned _idCallAddr : 1; // IL indirect calls: can make a direct call to iiaAddr
unsigned _idNoGC : 1; // Some helpers don't get recorded in GC tables
#if defined(TARGET_XARCH)
unsigned _idEvexbContext : 1; // does EVEX.b need to be set.
#endif // TARGET_XARCH
unsigned _idEvexbContext : 2 // Does Evex.b need to be set for embedded broadcast/embedded rounding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a semicolon (probably got lucky that there is a "stray" semicolon below on CLANG_FORMAT_COMMENT_ANCHOR

Suggested change
unsigned _idEvexbContext : 2 // Does Evex.b need to be set for embedded broadcast/embedded rounding.
unsigned _idEvexbContext : 2; // Does Evex.b need to be set for embedded broadcast/embedded rounding.

Also, "Does Evex.b need to be set" makes it sound like this is a 0/1, false/true boolean. Then why does it need 2 bits? I think you should expand the comment to state what the two bits are used for. (You don't need to put the comment all on this one line; make it a multi-line comment above this line if necessary)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments!

_idEvexbContext is used to differentiate several features: normal intrinsics, embedded broadcast, embedded rounding. In the normal or embedded broadcast cases, the semantic for Evex.L'L remains the same, intrinsic vector length, while in embedded rounding case, the semantic changes to indicate the rounding mode. _idEvexbContext will be used to inform emitter to specially handle the Evex.L'L in the embedded rounding scenario. I will properly document this part in the comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment like you just wrote would be great.

// x86: 53/49 bits
// amd64: 54/49 bits
// x86: 54/50 bits
// amd64: 55/51 bits
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "51" should be "50" AFAICT

Suggested change
// amd64: 55/51 bits
// amd64: 55/50 bits

@@ -1542,11 +1542,31 @@ class emitter
{
return _idEvexbContext != 0;
}
void idSetEvexbContext()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function above, idIsEvexbContext() sounds odd. Should it be renamed idIsEvexbContextSet()?

// GetEmbRoudingMode: Get the rounding mode for embedded rounding
//
// Arguments:
// mode -- the flag from the correspoding gentree node indicating the mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit (typo)

Suggested change
// mode -- the flag from the correspoding gentree node indicating the mode.
// mode -- the flag from the corresponding GenTree node indicating the mode.

@@ -1139,6 +1139,29 @@ static bool isLowSimdReg(regNumber reg)
#endif
}

//------------------------------------------------------------------------
// GetEmbRoudingMode: Get the rounding mode for embedded rounding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// GetEmbRoudingMode: Get the rounding mode for embedded rounding
// GetEmbRoundingMode: Get the rounding mode for embedded rounding

Comment on lines 558 to 562
GTF_HW_ER_MASK = 0x30000000, // Bits used by handle types below
GTF_HW_ER_TOEVEN = 0x00000000, // GT_HWINTRINSIC -- embedded rounding mode: ToEven (Default).
GTF_HW_ER_TONEGATIVEINFINITY = 0x10000000, // GT_HWINTRINSIC -- embedded rounding mode: ToNegativeInfinity.
GTF_HW_ER_TOPOSITIVEINFINITY = 0x20000000, // GT_HWINTRINSIC -- embedded rounding mode: ToPositiveInfinity.
GTF_HW_ER_TOZERO = 0x30000000, // GT_HWINTRINSIC -- embedded rounding mode: ToZero.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few notes:

  1. please align the =
  2. Add underscores between words (otherwise it's to easy to read "TOE...", "TONE...", "TOP..."
Suggested change
GTF_HW_ER_MASK = 0x30000000, // Bits used by handle types below
GTF_HW_ER_TOEVEN = 0x00000000, // GT_HWINTRINSIC -- embedded rounding mode: ToEven (Default).
GTF_HW_ER_TONEGATIVEINFINITY = 0x10000000, // GT_HWINTRINSIC -- embedded rounding mode: ToNegativeInfinity.
GTF_HW_ER_TOPOSITIVEINFINITY = 0x20000000, // GT_HWINTRINSIC -- embedded rounding mode: ToPositiveInfinity.
GTF_HW_ER_TOZERO = 0x30000000, // GT_HWINTRINSIC -- embedded rounding mode: ToZero.
GTF_HW_ER_MASK = 0x30000000, // Bits used by handle types below
GTF_HW_ER_TO_EVEN = 0x00000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToEven (Default) "{rn-sae}"
GTF_HW_ER_TO_NEGATIVE_INFINITY = 0x10000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToNegativeInfinity "{rd-sae}"
GTF_HW_ER_TO_POSITIVE_INFINITY = 0x20000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToPositiveInfinity "{ru-sae}"
GTF_HW_ER_TO_ZERO = 0x30000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToZero "{rz-sae}"

@@ -1055,6 +1055,70 @@ GenTree* Lowering::LowerHWIntrinsic(GenTreeHWIntrinsic* node)

NamedIntrinsic intrinsicId = node->GetHWIntrinsicId();

if (HWIntrinsicInfo::IsEmbRoundingCompatible(intrinsicId))
{

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary extra line

Suggested change

Comment on lines 1107 to 1108
// this
// point.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// this
// point.
// this point.

assert(gtOper == GT_HWINTRINSIC);
gtFlags &= ~GTF_HW_ER_MASK;
}
void SetEmbRoundingMode(uint8_t mode)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add function comments? In particular, it would be nice to know that mode is one of the values from System.Runtime.Intrinsics.X86.FloatRoundingMode.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is fine to go with no comma.

we adopted the no-comma format in embedded broadcast (#90123), and also MSVC adopted the same format.

Comment on lines +2255 to +2256
default:
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be:

Suggested change
default:
break;
case 0:
break;
default:
unreached();

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new test failures come from this change:
When we call the embedded rounding APIs in the reflection call, JIT will generate a jump table during the emit stage, this jump table is essentially a switch case containing all the possible results under different entry values, say in this case 0~11 (by the existing design, the jump table will iterate from 0 to the max value we set, say here 11), so even this function is supposed to accept 8~11 only, there are case when it might take 0~7.
Based on the given information, do we consider revert the changes, or do we want to modify the jump table generation logic a bit?

p.s. I made a mistake setting the upper bound to be 12, that will take a small fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks okay to revert the changes, and let this function accept those unexpected values.

For reference, by the current design, Avx2.GatherMaskVector128 is also generating a jump table containing all the results under entry values from 0~8, while this intrinsic only expect 1,2,4,8. Seems, those unexpected value shall be blocked out on the language level, say we don't define values other than those in the System.Runtime.Intrinsics.X86.FloatRoundingMode Enum.

And for the generated results with the unexpected values would be not "harmful" as they will be the results under the default rounding mode.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so the generic jump table code generates options for all values [0..maxValue] for some API-specific maximum value, but some APIs (like this one) have a much more specific, smaller set of expected values. Seems ok to leave your code as it was -- maybe document (comment) why the switch allows "unexpected" values.

@ghost ghost added needs-author-action An issue or pull request that requires more info or actions from the author. and removed needs-author-action An issue or pull request that requires more info or actions from the author. labels Jan 17, 2024
@BruceForstall
Copy link
Member

@Ruihan-Yin Looks like there are some test failures:

21:53:02.094 Running test: _Avx512F_r::JIT.HardwareIntrinsics.X86._Avx512F.Program.AddDoubleToNegativeInfinity()
Beginning scenario: RunBasicScenario_UnsafeRead
Beginning scenario: RunBasicScenario_Load
Beginning scenario: RunBasicScenario_LoadAligned
Beginning scenario: RunReflectionScenario_UnsafeRead

Assert failure(PID 73252 [0x00011e24], Thread: 73252 [0x11e24]): Assertion failed 'unreached' in 'System.Runtime.Intrinsics.X86.Avx512F:Add(System.Runtime.Intrinsics.Vector512`1[double],System.Runtime.Intrinsics.Vector512`1[double],ubyte):System.Runtime.Intrinsics.Vector512`1[double]' during 'Generate code' (IL size 9; hash 0x01102930; FullOpts)

    File: /__w/1/s/src/coreclr/jit/gentree.h Line: 2261
    Image: /datadisks/disk1/work/AFB708FB/p/corerun

https://dev.azure.com/dnceng-public/public/_build/results?buildId=532443&view=ms.vss-test-web.build-test-results-tab
https://dev.azure.com/dnceng-public/public/_build/results?buildId=532448&view=ms.vss-test-web.build-test-results-tab

@Ruihan-Yin
Copy link
Contributor Author

Yes, please see my comment here.

let SetEmbRoundingMode accept unexpected values to accomadate the jump table generatation logics.
@BruceForstall BruceForstall merged commit 2d751ca into dotnet:main Jan 19, 2024
204 of 208 checks passed
@Ruihan-Yin
Copy link
Contributor Author

Thanks for all the reviews and help!

tmds pushed a commit to tmds/runtime that referenced this pull request Jan 23, 2024
* some workaround with embedded rounding in compiler backend.

* extend _idEvexbContext to 2bit to distinguish embedded broadcast and embedded rounding

* Expose APIs with rounding mode.

* Apply format patch

* Do not include the third parameter in Avx512.Add(left, right)

* split _idEvexbContext bits and made a explicit convert function from uint8_t to insOpts for embedded rounding mode.

* Remove unexpected comment-out

* Fix unexpected deletion

* resolve comments:
removed redundent bits in instDesc for EVEX.b context.

Introduced `emitDispEmbRounding` to display the embedded rounding feature in the disassembly.

* bug fix:
fix un-needed assertion check.

* Apply format patch.

* Resolve comments:
merge INS_OPTS_EVEX_b and INS_OPTS_EVEX_er_rd
Do a pre-check for embedded rounding before lowering.

* Add a helper function to generalize the logic when lowering the embedded rounding intrinsics.

* Resolve comments:
1. fix typo in commnets
2. Add SetEvexBroadcastIfNeeded
3. Added mask in insOpts

* 1. Add unit case for non-default rounding mode
2. removed round-to-even, the default option from InsOpts as it will be handled on the default path.

* formatting

* 1. Create a fallback jump table for embedded rounding APIs when control byte is not constant.
2. Create a template to generate the unit tests for embedded rounding APIs.
3. nit: fix naming.

* remove hand-written unit tests for embedded rounding.

* formatting

* Resolve comments.

* formatting

* revert changes:
let SetEmbRoundingMode accept unexpected values to accomadate the jump table generatation logics.
@github-actions github-actions bot locked and limited conversation to collaborators Feb 19, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants