Enable EVEX feature: Embedded Rounding for Avx512F.Add() #94684

Ruihan-Yin · 2023-11-13T20:32:17Z

Description

Enables the EVEX embedded rounding feature in the JIT backend and exposes a single related API:

public static Vector512<double> Add(Vector512<double> left, Vector512<double> right, FloatRoundingMode mode);

to establish the complete compilation path, starting from importation.

Details

To mark an AST node or instruction as using embedded rounding, we introduced some new flags:

For the AST node, we introduced GTF_HW_ER_*. Because the APIs pass the FloatRoundingMode as an extra operand, the AST node carries this operand until lowering. During lowering, the node is converted back to the normal version and the extra operand is converted to a flag on the node. This way, the embedded rounding cases can simply reuse the existing emit paths.
At the emit stage, the instruction is selected according to the node information. Following the design of Embedded Broadcast, we use insOpts to tell the emitter that embedded rounding is on, so we made the following changes in the instrDesc:
1. renamed _idEvexbContext and extended it to 2 bits, as _idIsEmbBroadcast and _idIsEmbRounding
2. introduced 2 extra bits to indicate the rounding mode: _idEmbRoundingMode
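
The instrDesc layout described in this list can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the struct name `InstrDescSketch` and the setter names are hypothetical, and later commits in this PR changed the layout (see the review discussion), so it reflects only the description above. It relies on the fact that EVEX.b means broadcast for memory operands and static rounding for register-only forms, so the two flags are never set together.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the emitter's instrDesc; only the fields
// named in the description are modeled.
struct InstrDescSketch
{
    unsigned _idIsEmbBroadcast : 1;  // EVEX.b requests an embedded broadcast
    unsigned _idIsEmbRounding : 1;   // EVEX.b requests embedded rounding
    unsigned _idEmbRoundingMode : 2; // rounding mode, meaningful only with embedded rounding

    InstrDescSketch() : _idIsEmbBroadcast(0), _idIsEmbRounding(0), _idEmbRoundingMode(0)
    {
    }

    // True when EVEX.b must be set for either feature.
    bool idIsEvexbContextSet() const
    {
        return (_idIsEmbBroadcast != 0) || (_idIsEmbRounding != 0);
    }

    // Hypothetical setter: mark the instruction as using embedded broadcast.
    void idSetEmbBroadcast()
    {
        assert(_idIsEmbRounding == 0); // the two features are mutually exclusive
        _idIsEmbBroadcast = 1;
    }

    // Hypothetical setter: mark the instruction as using embedded rounding.
    void idSetEmbRounding(uint8_t mode)
    {
        assert(_idIsEmbBroadcast == 0); // the two features are mutually exclusive
        assert(mode < 4);               // only 2 bits are available
        _idIsEmbRounding   = 1;
        _idEmbRoundingMode = mode;
    }
};
```

Keeping the "is it on" bit separate from the mode bits costs extra space in instrDesc, which is the throughput concern raised later in the thread.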

Follow-up in this PR

At its current stage, this PR is intended to show the design and discuss it with maintainers; we are open to adjusting and improving it.

We will also need to add unit tests to better cover this feature and the related APIs.

Follow-up after this PR

As this PR exposes only 1 embedded rounding API, we will make a follow-up PR to expose the other related APIs together. Since embedded rounding mostly applies to arithmetic and casting intrinsics, more emit paths will be affected; we expect to extend the same design to those emit paths as needed.

dotnet-issue-labeler · 2023-11-13T20:34:10Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

ghost · 2023-11-13T20:34:28Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	Ruihan-Yin
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `new-api-needs-documentation`, `community-contribution`
Milestone:	-

Ruihan-Yin · 2023-11-14T01:51:39Z

The failures are known; I will apply the format patch later.

Hi @tannergooding, I think we can start to have some discussion on the design of embedded rounding. Would you please take a look at this draft? Thanks!

The impact on throughput is less than I expected, considering the extra bits I introduced in the instrDesc. Can that be attributed to the optimization in #87373?


Ruihan-Yin · 2024-01-02T22:05:51Z

Looks like this is pending some formatting fixes and tests, is that correct?

Rebased the branch, running the pipeline tests again. I'll push the generated format patch shortly.

For the tests, we are working internally on templates to cover all the embedded rounding APIs. I can push a commit with the template that covers all the Add unit tests, so we can review the test design in this PR. I would then expect a separate PR to expose the rest of the APIs and their unit tests.

Does this plan sound good to you?

Update: the format patch is applied; the failures look unrelated.


Ruihan-Yin · 2024-01-04T19:57:39Z

Updates:

Added the unit test template for embedded rounding APIs.
Added a jump table fallback for non-constant rounding mode control byte.
Removed the hand-written unit tests.

The expected test results were calculated on i9-9980XE with a C++ program.
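
The C++ program used for this is not shown in the PR. As a hypothetical sketch of how such reference values could be produced in software, the `<cfenv>` rounding modes can stand in for the embedded rounding modes: `FE_TONEAREST` for ToEven, `FE_DOWNWARD` for ToNegativeInfinity, `FE_UPWARD` for ToPositiveInfinity and `FE_TOWARDZERO` for ToZero. The helper name `addWithRounding` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Adds two doubles under the given <cfenv> rounding mode and restores the
// previous mode afterwards. The volatile locals keep the compiler from
// folding the addition at compile time or moving it across the mode changes.
inline double addWithRounding(double left, double right, int feMode)
{
    const int saved = std::fegetround();
    const int rc    = std::fesetround(feMode);
    assert(rc == 0); // a non-zero result means the mode is unsupported
    (void)rc;

    volatile double a   = left;
    volatile double b   = right;
    volatile double sum = a + b;

    std::fesetround(saved);
    return sum;
}
```

For example, adding 1.0 and 2^-60 is inexact in double precision: under `FE_DOWNWARD` the sum is 1.0, while under `FE_UPWARD` it is the next representable value above 1.0. Applying the helper to each of the 8 lanes would give the expected Vector512<double> result for one rounding mode.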

Edit: The failures look unrelated to the changes.

Hi @tannergooding, format patch and unit tests have been applied. There should be no other blockers.

tannergooding

CC. @dotnet/jit-contrib for secondary review

BruceForstall

Some small nits/questions/requests

BruceForstall · 2024-01-17T22:35:16Z

src/coreclr/jit/emit.h

@@ -774,13 +774,13 @@ class emitter
        unsigned _idCallAddr : 1; // IL indirect calls: can make a direct call to iiaAddr
        unsigned _idNoGC : 1;     // Some helpers don't get recorded in GC tables
 #if defined(TARGET_XARCH)
-        unsigned _idEvexbContext : 1; // does EVEX.b need to be set.
-#endif                                //  TARGET_XARCH
+        unsigned _idEvexbContext : 2 // Does Evex.b need to be set for embedded broadcast/embedded rounding.


Missing a semicolon (probably got lucky that there is a "stray" semicolon below on CLANG_FORMAT_COMMENT_ANCHOR).

Suggested change

unsigned _idEvexbContext : 2 // Does Evex.b need to be set for embedded broadcast/embedded rounding.

unsigned _idEvexbContext : 2; // Does Evex.b need to be set for embedded broadcast/embedded rounding.

Also, "Does Evex.b need to be set" makes it sound like this is a 0/1, false/true boolean. Then why does it need 2 bits? I think you should expand the comment to state what the two bits are used for. (You don't need to put the comment all on this one line; make it a multi-line comment above this line if necessary)

Thanks for the comments!

_idEvexbContext is used to differentiate several features: normal intrinsics, embedded broadcast, and embedded rounding. In the normal and embedded broadcast cases, the semantics of Evex.L'L remain the same (the intrinsic's vector length), while in the embedded rounding case Evex.L'L instead indicates the rounding mode. _idEvexbContext informs the emitter that it must handle Evex.L'L specially in the embedded rounding scenario. I will document this properly in the comments.
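
To make the Evex.L'L reinterpretation concrete, here is a small sketch of how the last EVEX payload byte could be patched for embedded rounding. The bit layout (z, L'L, b, V', aaa from high to low) and the rounding encodings (00 nearest, 01 down, 10 up, 11 toward zero) follow the EVEX prefix format; the helper name `applyEmbeddedRounding` and the enum are hypothetical and are not taken from the JIT sources.

```cpp
#include <cassert>
#include <cstdint>

// Bit positions within the last EVEX payload byte: z | L'L | b | V' | aaa
constexpr uint8_t EVEX_B_BIT    = 0x10; // bit 4
constexpr uint8_t EVEX_LL_MASK  = 0x60; // bits 6:5
constexpr int     EVEX_LL_SHIFT = 5;

// Rounding control as encoded in L'L when EVEX.b is set on a register-only form.
enum class RoundingControl : uint8_t
{
    ToNearestEven      = 0, // {rn-sae}
    ToNegativeInfinity = 1, // {rd-sae}
    ToPositiveInfinity = 2, // {ru-sae}
    ToZero             = 3, // {rz-sae}
};

// Hypothetical helper: sets EVEX.b and overwrites L'L with the rounding
// control, leaving the z, V' and aaa bits untouched.
inline uint8_t applyEmbeddedRounding(uint8_t payload, RoundingControl rc)
{
    assert(static_cast<uint8_t>(rc) < 4);
    const unsigned cleared = (payload | EVEX_B_BIT) & ~static_cast<unsigned>(EVEX_LL_MASK);
    const unsigned rcBits  = static_cast<unsigned>(rc) << EVEX_LL_SHIFT;
    return static_cast<uint8_t>(cleared | rcBits);
}
```

Because L'L no longer carries the vector length in this form, the hardware treats the operation as full-width (or scalar), which is why the emitter needs to know which meaning applies before it writes these bits.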

A comment like you just wrote would be great.

BruceForstall · 2024-01-17T22:37:08Z

src/coreclr/jit/emit.h

-        // x86:         53/49 bits
-        // amd64:       54/49 bits
+        // x86:         54/50 bits
+        // amd64:       55/51 bits


The "51" should be "50" AFAICT

Suggested change

// amd64: 55/51 bits

// amd64: 55/50 bits

BruceForstall · 2024-01-17T22:42:01Z

src/coreclr/jit/emit.h

@@ -1542,11 +1542,31 @@ class emitter
        {
            return _idEvexbContext != 0;
        }
-        void idSetEvexbContext()
+


The function above, idIsEvexbContext() sounds odd. Should it be renamed idIsEvexbContextSet()?

BruceForstall · 2024-01-17T22:46:31Z

src/coreclr/jit/emitxarch.cpp

+// GetEmbRoudingMode: Get the rounding mode for embedded rounding
+//
+// Arguments:
+//     mode -- the flag from the correspoding gentree node indicating the mode.


nit (typo)

Suggested change

// mode -- the flag from the correspoding gentree node indicating the mode.

// mode -- the flag from the corresponding GenTree node indicating the mode.

BruceForstall · 2024-01-17T22:46:47Z

src/coreclr/jit/emitxarch.cpp

@@ -1139,6 +1139,29 @@ static bool isLowSimdReg(regNumber reg)
 #endif
 }

+//------------------------------------------------------------------------
+// GetEmbRoudingMode: Get the rounding mode for embedded rounding


Suggested change

// GetEmbRoudingMode: Get the rounding mode for embedded rounding

// GetEmbRoundingMode: Get the rounding mode for embedded rounding

BruceForstall · 2024-01-17T23:14:18Z

src/coreclr/jit/gentree.h

+    GTF_HW_ER_MASK              = 0x30000000, // Bits used by handle types below 
+    GTF_HW_ER_TOEVEN            = 0x00000000, // GT_HWINTRINSIC -- embedded rounding mode: ToEven (Default).
+    GTF_HW_ER_TONEGATIVEINFINITY = 0x10000000, // GT_HWINTRINSIC -- embedded rounding mode: ToNegativeInfinity.
+    GTF_HW_ER_TOPOSITIVEINFINITY = 0x20000000, // GT_HWINTRINSIC -- embedded rounding mode: ToPositiveInfinity.
+    GTF_HW_ER_TOZERO            = 0x30000000, // GT_HWINTRINSIC -- embedded rounding mode: ToZero.


A few notes:

please align the =

Add underscores between words (otherwise it's too easy to read "TOE...", "TONE...", "TOP...").

Suggested change

GTF_HW_ER_MASK = 0x30000000, // Bits used by handle types below

GTF_HW_ER_TOEVEN = 0x00000000, // GT_HWINTRINSIC -- embedded rounding mode: ToEven (Default).

GTF_HW_ER_TONEGATIVEINFINITY = 0x10000000, // GT_HWINTRINSIC -- embedded rounding mode: ToNegativeInfinity.

GTF_HW_ER_TOPOSITIVEINFINITY = 0x20000000, // GT_HWINTRINSIC -- embedded rounding mode: ToPositiveInfinity.

GTF_HW_ER_TOZERO = 0x30000000, // GT_HWINTRINSIC -- embedded rounding mode: ToZero.

GTF_HW_ER_MASK = 0x30000000, // Bits used by handle types below

GTF_HW_ER_TO_EVEN = 0x00000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToEven (Default) "{rn-sae}"

GTF_HW_ER_TO_NEGATIVE_INFINITY = 0x10000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToNegativeInfinity "{rd-sae}"

GTF_HW_ER_TO_POSITIVE_INFINITY = 0x20000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToPositiveInfinity "{ru-sae}"

GTF_HW_ER_TO_ZERO = 0x30000000, // GT_HWINTRINSIC -- embedded rounding mode: FloatRoundingMode = ToZero "{rz-sae}"
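
As a rough illustration of how these flag values share one flags word with other node flags, the following sketch packs and unpacks the rounding mode through GTF_HW_ER_MASK. The flag constants are copied from the suggestion above; the struct `NodeFlagsSketch` and its method names are hypothetical and much simpler than the real GenTree class.

```cpp
#include <cassert>
#include <cstdint>

// Flag values as given in the suggested change above.
constexpr uint32_t GTF_HW_ER_MASK                 = 0x30000000;
constexpr uint32_t GTF_HW_ER_TO_EVEN              = 0x00000000;
constexpr uint32_t GTF_HW_ER_TO_NEGATIVE_INFINITY = 0x10000000;
constexpr uint32_t GTF_HW_ER_TO_POSITIVE_INFINITY = 0x20000000;
constexpr uint32_t GTF_HW_ER_TO_ZERO              = 0x30000000;

// Hypothetical, minimal stand-in for a node that owns a flags word.
struct NodeFlagsSketch
{
    uint32_t gtFlags = 0;

    // Drops any rounding mode, which leaves the default (ToEven).
    void ClearEmbRoundingFlag()
    {
        gtFlags &= ~GTF_HW_ER_MASK;
    }

    // Replaces the rounding-mode bits without touching unrelated flags.
    void SetEmbRoundingFlag(uint32_t erFlag)
    {
        assert((erFlag & ~GTF_HW_ER_MASK) == 0);
        ClearEmbRoundingFlag();
        gtFlags |= erFlag;
    }

    // Returns one of the GTF_HW_ER_* values.
    uint32_t GetEmbRoundingFlag() const
    {
        return gtFlags & GTF_HW_ER_MASK;
    }
};
```

Since GTF_HW_ER_TO_EVEN is zero, a node that never had a mode set reads back as the default rounding mode.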

BruceForstall · 2024-01-17T23:18:15Z

src/coreclr/jit/lowerxarch.cpp

@@ -1055,6 +1055,70 @@ GenTree* Lowering::LowerHWIntrinsic(GenTreeHWIntrinsic* node)

    NamedIntrinsic intrinsicId = node->GetHWIntrinsicId();

+    if (HWIntrinsicInfo::IsEmbRoundingCompatible(intrinsicId))
+    {
+


nit: unnecessary extra line

Suggested change

BruceForstall · 2024-01-17T23:19:11Z

src/coreclr/jit/lowerxarch.cpp

+                    // this
+                    // point.


Suggested change

// this

// point.

// this point.

BruceForstall · 2024-01-17T23:26:11Z

src/coreclr/jit/gentree.h

+        assert(gtOper == GT_HWINTRINSIC);
+        gtFlags &= ~GTF_HW_ER_MASK;
+    }
+    void SetEmbRoundingMode(uint8_t mode)


Can you please add function comments? In particular, it would be nice to know that mode is one of the values from System.Runtime.Intrinsics.X86.FloatRoundingMode.

I think it is fine to go with no comma.

We adopted the no-comma format in embedded broadcast (#90123), and MSVC adopted the same format.

BruceForstall · 2024-01-17T23:27:27Z

src/coreclr/jit/gentree.h

+            default:
+                break;


should this be:

Suggested change

default:

break;

case 0:

break;

default:

unreached();

?

The new test failures come from this change.
When the embedded rounding APIs are called through reflection, the JIT generates a jump table during the emit stage. This jump table is essentially a switch containing the result for every possible entry value, here 0~11 (by the existing design, the jump table iterates from 0 to the maximum value we set, here 11). So even though this function is supposed to accept only 8~11, there are cases where it receives 0~7.
Given this, should we revert the change, or do we want to modify the jump table generation logic a bit?

P.S. I made a mistake in setting the upper bound to 12; that will take a small fix.

It looks okay to revert the changes and let this function accept those unexpected values.

For reference, under the current design, Avx2.GatherMaskVector128 also generates a jump table containing the results for entry values 0~8, while that intrinsic only expects 1, 2, 4, and 8. It seems those unexpected values should be blocked at the language level, i.e. we don't define values other than those in the System.Runtime.Intrinsics.X86.FloatRoundingMode enum.

And the results generated for unexpected values would not be "harmful", as they would be the results under the default rounding mode.

I see, so the generic jump table code generates options for all values [0..maxValue] for some API-specific maximum value, but some APIs (like this one) have a much more specific, smaller set of expected values. Seems ok to leave your code as it was -- maybe document (comment) why the switch allows "unexpected" values.
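
The behavior agreed on in this thread can be sketched as below: a mapping function that recognizes the FloatRoundingMode values 0x09 to 0x0B and lets every other input fall back to the default rounding mode, plus a loop that mimics a jump table enumerating every value in [0, 11]. The function names and the use of `std::array` are assumptions for illustration; the real JIT emits code for each entry rather than building a table of flags.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr uint32_t GTF_HW_ER_TO_EVEN              = 0x00000000;
constexpr uint32_t GTF_HW_ER_TO_NEGATIVE_INFINITY = 0x10000000;
constexpr uint32_t GTF_HW_ER_TO_POSITIVE_INFINITY = 0x20000000;
constexpr uint32_t GTF_HW_ER_TO_ZERO              = 0x30000000;

// Hypothetical mapping from a FloatRoundingMode byte to a node flag.
inline uint32_t roundingModeToFlag(uint8_t mode)
{
    switch (mode)
    {
        case 0x09:
            return GTF_HW_ER_TO_NEGATIVE_INFINITY;
        case 0x0A:
            return GTF_HW_ER_TO_POSITIVE_INFINITY;
        case 0x0B:
            return GTF_HW_ER_TO_ZERO;
        default:
            // Deliberately not unreached(): the generic jump table also feeds
            // in 0..7, and 0x08 (ToEven) is the default mode anyway.
            return GTF_HW_ER_TO_EVEN;
    }
}

// Mimics the fallback jump table: one entry per value in [0, 11].
inline std::array<uint32_t, 12> buildFallbackTable()
{
    std::array<uint32_t, 12> table{};
    for (std::size_t i = 0; i < table.size(); i++)
    {
        table[i] = roundingModeToFlag(static_cast<uint8_t>(i));
    }
    return table;
}
```

With this shape, entries 0 through 8 all produce the default rounding behavior, so the extra table entries are redundant but not incorrect.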

BruceForstall · 2024-01-19T00:22:16Z

@Ruihan-Yin Looks like there are some test failures:

21:53:02.094 Running test: _Avx512F_r::JIT.HardwareIntrinsics.X86._Avx512F.Program.AddDoubleToNegativeInfinity()
Beginning scenario: RunBasicScenario_UnsafeRead
Beginning scenario: RunBasicScenario_Load
Beginning scenario: RunBasicScenario_LoadAligned
Beginning scenario: RunReflectionScenario_UnsafeRead

Assert failure(PID 73252 [0x00011e24], Thread: 73252 [0x11e24]): Assertion failed 'unreached' in 'System.Runtime.Intrinsics.X86.Avx512F:Add(System.Runtime.Intrinsics.Vector512`1[double],System.Runtime.Intrinsics.Vector512`1[double],ubyte):System.Runtime.Intrinsics.Vector512`1[double]' during 'Generate code' (IL size 9; hash 0x01102930; FullOpts)

    File: /__w/1/s/src/coreclr/jit/gentree.h Line: 2261
    Image: /datadisks/disk1/work/AFB708FB/p/corerun

https://dev.azure.com/dnceng-public/public/_build/results?buildId=532443&view=ms.vss-test-web.build-test-results-tab
https://dev.azure.com/dnceng-public/public/_build/results?buildId=532448&view=ms.vss-test-web.build-test-results-tab

Ruihan-Yin · 2024-01-19T00:24:19Z

Yes, please see my comment here.

Let SetEmbRoundingMode accept unexpected values to accommodate the jump table generation logic.

Ruihan-Yin · 2024-01-19T17:51:13Z

Thanks for all the reviews and help!

* some workaround with embedded rounding in compiler backend.
* extend _idEvexbContext to 2bit to distinguish embedded broadcast and embedded rounding
* Expose APIs with rounding mode.
* Apply format patch
* Do not include the third parameter in Avx512.Add(left, right)
* split _idEvexbContext bits and made an explicit convert function from uint8_t to insOpts for embedded rounding mode.
* Remove unexpected comment-out
* Fix unexpected deletion
* resolve comments: removed redundant bits in instrDesc for EVEX.b context. Introduced `emitDispEmbRounding` to display the embedded rounding feature in the disassembly.
* bug fix: fix un-needed assertion check.
* Apply format patch.
* Resolve comments: merge INS_OPTS_EVEX_b and INS_OPTS_EVEX_er_rd. Do a pre-check for embedded rounding before lowering.
* Add a helper function to generalize the logic when lowering the embedded rounding intrinsics.
* Resolve comments: 1. fix typo in comments 2. Add SetEvexBroadcastIfNeeded 3. Added mask in insOpts
* 1. Add unit case for non-default rounding mode 2. removed round-to-even, the default option from InsOpts as it will be handled on the default path.
* formatting
* 1. Create a fallback jump table for embedded rounding APIs when control byte is not constant. 2. Create a template to generate the unit tests for embedded rounding APIs. 3. nit: fix naming.
* remove hand-written unit tests for embedded rounding.
* formatting
* Resolve comments.
* formatting
* revert changes: let SetEmbRoundingMode accept unexpected values to accommodate the jump table generation logic.

dotnet-issue-labeler bot added the area-CodeGen-coreclr and new-api-needs-documentation labels Nov 13, 2023

ghost added the community-contribution label Nov 13, 2023

This was referenced Nov 13, 2023

CI error: System.Net.Quic.QuicException: The connection timed out from inactivity #91757

Closed

[mono][tvos] OOM in System.IO.Tests.MemoryStreamTests #92467

Closed

build-analysis bot mentioned this pull request Nov 14, 2023

Timeout in System.Net.Quic.Functional.Tests #86019

Closed

tannergooding reviewed Nov 17, 2023

src/coreclr/jit/emit.h (outdated, resolved)

BruceForstall added the avx512 label Nov 17, 2023

BruceForstall mentioned this pull request Nov 17, 2023

Intel architecture improvements for .NET 9 #93196

Closed

33 tasks

Ruihan-Yin requested a review from tannergooding November 21, 2023 18:15

tannergooding reviewed Nov 22, 2023

src/coreclr/jit/emit.h (outdated, resolved)

tannergooding reviewed Nov 22, 2023

src/coreclr/jit/emitxarch.cpp (outdated, resolved)

tannergooding reviewed Nov 22, 2023

src/coreclr/jit/emitxarch.cpp (resolved)

tannergooding reviewed Nov 22, 2023

src/coreclr/jit/lowerxarch.cpp (outdated, resolved)

tannergooding reviewed Nov 22, 2023

src/tests/JIT/HardwareIntrinsics/X86_Avx512/Avx512F/EmbeddedRounding.Double.cs (outdated, resolved)

Ruihan-Yin marked this pull request as ready for review November 27, 2023 17:36

Ruihan-Yin requested a review from tannergooding November 28, 2023 21:04

build-analysis bot mentioned this pull request Nov 29, 2023

WritingShouldUpdateWriteTime_After_SetLastAccessTime [FAIL] #95378

Closed