Cleanup some handling around Avx10v1 #103241

tannergooding · 2024-06-10T18:08:26Z

This does some minor cleanup around the new AVX10v1 support introduced in #101938

Namely it:

fixes up the instructionsetdesc implications to not repeat the EVEX dependencies twice
ensures the ``instructionsetdesc` implications are correctly ordered
has the minipal_getcpufeatures and EEJitManager::SetCpuInfo functions only report Avx512F if F+BW+CD+DQ+VL are supported
handles a couple minor edge cases in the JIT that weren't handling Avx10v1 support

dotnet-issue-labeler · 2024-06-10T18:08:31Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

dotnet-issue-labeler · 2024-06-10T18:08:33Z

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

dotnet-policy-service · 2024-06-10T18:09:05Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

tannergooding · 2024-06-10T18:16:02Z

src/coreclr/vm/codeman.cpp

+    if (((cpuFeatures & XArchIntrinsicConstants_Avx512f) != 0) &&
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512f_vl) != 0)
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512bw) != 0)
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512bw_vl) != 0)
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512cd) != 0)
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512cd_vl) != 0)
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512dq) != 0)
+        ((cpuFeatures & XArchIntrinsicConstants_Avx512dq_vl) != 0))
+    {


It might require a R2R breaking change, but it would be nice if we could just compress these down to 1 bit and cleanup the CorInfoInstructionSet handling a bit as well.

We're never going to support, in the JIT at least, having the baseline AVX512 ISAs independent, so it'd be nicer if we just had DOTNET_EnableAVX512, XArchIntrinsicConstants_Avx512, and InstructionSet_Avx512 instead. That would also allow us to do simpler queries in the JIT around ISA support

At the same time, the nested classes are implied by other information. For example, the X64, VL, and V512 nested classes are all implied by the enclosing type being supported and another implicit bit of information such as the target architecture or if the base class is supported. So we really don't "need" all the extra logic and instruction set IDs to support those.

tannergooding · 2024-06-12T17:18:48Z

CC. @dotnet/jit-contrib for review and @dotnet/avx512-contrib as an FYI

Just some post cleanup work for the AVX10v1 support that went in last week.

tannergooding · 2024-06-12T17:34:15Z

(Resolved merge conflicts)

tannergooding · 2024-06-13T22:36:28Z

CC. @jkotas for the VM side changes.

Relatively straightforward as detailed above, mostly just ensuring that the instructionsetdesc are ordered correctly so the implications work out as expected. The recursive implications could actually cause problems in some edges depending on how they were disabled via the environment variable knobs, so enforcing it as part of the minipal cpufeature querying seems overall much cleaner and avoids that complexity.

As called out, it would be nice if we could simplify this even more but I expect that'd require a breaking change to the config knobs we expose and to the R2R format.

src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt

jkotas · 2024-06-14T15:12:44Z

src/native/minipal/cpufeatures.c

-                                                }
-                                            }
+                                            result |= XArchIntrinsicConstants_Evex;
+                                            result |= XArchIntrinsicConstants_Avx512f;


If we bake the JIT policy of requiring all these features together at the same time into the minipal, we can fold all these bits into just a single bit XArchIntrinsicConstants_Avx512f.

The code in JIT/EE interface https://github.com/dotnet/runtime/pull/103241/files#diff-eba17a72b6caa8b52b196f3b84bfd20dfeaf39a89f410943a692d1f33f569f3bR1366 does the same thing. What's the benefit of doing this in the minipal too?

If we bake the JIT policy of requiring all these features together at the same time into the minipal, we can fold all these bits into just a single bit XArchIntrinsicConstants_Avx512f.

Right, and I think that's desirable long term. I've updated the PR to fold it down to Avx512 and Avx512Vbmi

What's the benefit of doing this in the minipal too?

It's helping enforce the consistency everywhere that consumes the CPUID bits.

What I ended up finding is that the implication defined in InstructionSetDesc don't work well with recursive implication sets. This is because it basically does a naive ordering and so what happened is that Avx512F would be supported, which meant that Avx512BW/CD/DQ had their requirements met. Then at end when processing the recursive implications, if one of BW/CD/DQ/VL wasn't supported, F would be unset but the others wouldn't leading to a torn view.

While handling this in codeman is sufficient for the JIT, it doesn't necessarily mean we followed suite in NAOT or other scenarios as they could get disconnected. So simply enforcing this at the minipal level makes it a bit easier and ensures everywhere has a consistent view of what we require to be supported together.

More ideally we'd also clean this up at the InstructionSet_* level too as we don't need instructionset64bit (this is implied by the target architecture bitness), we don't need instructionset Avx512F_VL (this could be compressed down to just Avx512F), etc. I think doing this when we're ready for the next R2R breaking change would be good, as it should free up bits and reduce the overall complexity of the JIT by just requiring a little bit of handling in the lookupIntrinsicId function instead.

I think doing this when we're ready for the next R2R breaking change would be good

We just took a R2R breaking change a few days ago, so if you would like to cleanup the CorInfoInstructionSet bits, it is a good time now.

👍, I'd like to do that in a follow up PR then so that the more core fixes can go in. The CorInfoInstructionSet cleanup is just nice to have at that point.

jkotas · 2024-06-15T06:01:20Z

src/native/minipal/cpufeatures.h

+    XArchIntrinsicConstants_Serialize = 0x20000,
+    XArchIntrinsicConstants_Avx10v1 = 0x40000,
+    XArchIntrinsicConstants_Avx10v1_V512 = 0x80000,
+    XArchIntrinsicConstants_Evex = 0x100000,


Do we really need the Evex bit here? It is implied by other bits.

The Evex bit is intentional and a bit special (will work on getting an explanation of it added to the docs).

The consideration is that with Avx10v1 we have two different ways to say that certain 128-bit or 256-bit EVEX encoded instructions are supported. These instructions having been made available originally as part of AVX512VL (which requires 512-bit support) but now also being available in AVX10v1 (which does not require 512-bit support).

So, if code is using say Avx512F.VL.* (and never uses say Avx512F.*), then what we want is to actually track this as requiring the EVEX bit and not as requiring Avx10v1 or Avx512 specifically, since it will work provided that either of these two ISAs are supported. This allows the JIT to track and report the actual requirements for certain instructions more cleanly (we only need to check the EVEX bit for such instructions) and allows NAOT images to be generated that don't fail if you're code was written using one but was generated using another.

If we had known about Avx10v1 prior to shipping Avx512 support, I imagine we would've exposed the managed API surface differently. We probably would've intentionally designed the surface area a bit differently (even with it not matching hardware). In particular, the actual setup we have is basically the following, where AvxF+BW+CD+DQ and below that have a nested V512 type representing the 512-bit extensions and where we would've come up with an ISA name to represent AvxF+BW+CD+DQ (possibly just Evex, maybe something different):

graph TD; Sse2-->Aes; Vex-->Avx; Avx-->Avx2; Evex-->AvxF+BW+CD+DQ; AvxF+BW+CD+DQ-->AvxVbmi; AvxF+BW+CD+DQ-->AvxIfma; AvxF+BW+CD+DQ-->AvxBf16; AvxF+BW+CD+DQ-->AvxBitalg; AvxF+BW+CD+DQ-->AvxFp16; AvxF+BW+CD+DQ-->AvxVbmi2; AvxF+BW+CD+DQ-->AvxVpopcntdq; AvxVbmi-->Avx10v1; AvxIfma-->Avx10v1; AvxBf16-->Avx10v1; AvxBitalg-->Avx10v1; AvxFp16-->Avx10v1; AvxVbmi2-->Avx10v1; AvxVpopcntdq-->Avx10v1; Avx2-->AvxVnni; Vex-->Bmi1; Vex-->Bmi2; Avx2-->Evex; Fma-->Evex; Avx-->Fma; X86Base-->Lzcnt Sse2-->Pclmulqdq Sse42-->Popcnt X86Base-->Sse; Sse-->Sse2; Sse2-->Sse3; Sse3-->Ssse3; Ssse3-->Sse41; Sse41-->Sse42; Sse42-->Vex;

Loading

I think it is fine for the JIT to have internal methods that return true when it is fine to use Vex or Evex instruction encodings. I do not see why the Evex bit has to leak outside the JIT all the way to the minipal. The Vex bit does not leak all the way to the minipal either.

The consideration is that for EVEX it's not internal to the JIT.

Anytime the JIT does a compExactlyDependsOn(InstructionSet_*) or compOpportunisticallyDependsOn(InstructionSet_*) query, it gets reported back to the VM. The different VMs handle it as follows:

For JIT this reporting is a no-op

For R2R this reporting exactly tracks the InstructionSet_* queried and whether it was "exactly" or "opportunistic"

This is where the cleanup of InstructionSet_* would be beneficial, because we have lots more flags than actually needed

For AOT this reporting gets compressed back to XArchIntrinsicConstants which gets queried for an exact match on AOT startup

So, we need the JIT query for EVEX usage to map back to InstructionSet_EVEX for R2R and XArchIntrinsicConstants_EVEX for AOT that way if the code only uses Avx512F.VL.Abs, it can work when Avx10v1 without V512 support exists. Otherwise we'd need it to report as Avx10v1 or Avx512, both of which cut out hardware that would otherwise work with the binary. We only query the more exact ISA when it isn't part of that synthetic EVEX baseline and thus needs to restrict the target.

For VEX there isn't any need, because the hierarchy is logical and makes sense. There is no historical split where two separate ISAs give access to the same underlying instructions.

Ok, got it.

EVEX is low-level instruction encoding. I think it would be best to confine it to the JIT. The mapping between EVEX and the ISAs is a bit of mental leap outside the JIT.

Would it be better to have bits for Avx512_VLOnly virtual ISA that is the common subset of Avx512 and Avx10v1? I assume that all Avx512*.VL ISAs are supported in Avx10v1. Is that correct?

if the code only uses Avx512F.VL.Abs, it can work when Avx10v1 without V512 support exists.

Is the user code going to look like this:

if (Avx512F.VL.IsSupported || Avx10v1.IsSupported) { ... code that uses Avx512F.VL.Abs ... }

Or are we going to make Avx512F.VL.IsSupported return true for Avx10v1 so that Avx10v1.IsSupported part of the condition is not needed?

Maybe, just as an idea, we could call it just Avx10 or Avx10v0 (representing the baseline common to Avx512 and Avx10v1)? Alternatively we could call it just AvxF_BW_CD_DQ?

I have looked around how other projects are solving this naming problem. LLVM seems to use EVEX256 name for the Avx10v1 and Avx512 intersection. I guess it is how we ended up with EVEX here. I like the 256 suffix. It helps with clarifying that the virtual ISA represents the non-512 intersection.

On a slightly related point, I believe it would it be a binary breaking change to change from an abstract class to an interface (i.e. change public abstract class Avx to instead be public interface Avx), correct?

It should not be a breaking change for typical uses, but as you have said there are likely cases that it would break.

I am not sure how well the ISAs modeling via inheritance hierarchy is going to work going forward. I would not be surprised if we see more places where new hardware revisions carve out subsets of existing instructions and give it a new ISA name. It just happened with Avx512/Avx10v1, Arm64 streaming mode is heading that way, and RISCV may end up there too given its development process. We may want to think about different approach to accommodate these trends.

I guess it is how we ended up with EVEX here. I like the 256 suffix. It helps with clarifying that the virtual ISA represents the non-512 intersection.

👍

I am not sure how well the ISAs modeling via inheritance hierarchy is going to work going forward.

I agree we should think some more, but overall I do believe this is still the correct model. We've seen quite a bit of praise on how we've modeled everything as compared to how C/C++ or Rust do it. The overall separation of xplat APIs vs platform specific APIs and then the division of the latter into clear classes that are based on the underlying hardware checks has helped make discoverability, approachability, and usability of the feature in .NET one of the best.

It just happened with Avx512/Avx10v1

This one I think was more x64 having tried something new with Avx512 and then finding out that it didn't work in practice, so they are now fixing it by reconverging everything with Avx10v1. The architectural documents around Avx10v1 then cover how they plan to handle this moving forward and that it's effectively going back to the model we had seen previously with Sse and Avx (where you have a baseline encoding and then incremental functionality extended on top of this in a version like manner).

Arm64 streaming mode is heading that way

This is a case where there is still a clear hierarchy given we have SveStreaming and SveNonStreaming as the split.

Even if we had shipped class Sve prior to streaming support becoming defined, we could have inserted a class SveStreaming and had Sve derive from it, moving relevant methods down (since that is safe to do for classes).

-- What I'll plan on doing given the conversation so far is getting this in as is first as it fixes some bugs without the refactoring being too complex and puts us in a good position for .NET 9.

I'll then follow up with a PR that does a broader cleanup of the data used by R2R and changes EVEX to EVEX256 (which will itself require some refactoring in the JIT).

if we had shipped class Sve prior to streaming support becoming defined, we could have inserted a class SveStreaming and had Sve derive from it, moving relevant methods down (since that is safe to do for classes).

I think this hierarchy would look weird since the streaming and non-streaming support do not form a hierarchy. You can have one without the other.

streaming and non-streaming support do not form a hierarchy

I think this is more just the names aren't the "right" ones.

What I had meant is that you have two sets of instructions. That is, there are SVE instructions that are available both in regular SVE and in SVE streaming mode. Then there are SVE instructions which are not available in streaming mode unless another architectural bit is set.

jkotas · 2024-06-15T06:04:51Z

@jkotas for the VM side changes.

LGTM

Cleanup the instruction set implications for xarch

c87d73c

tannergooding requested a review from MichalStrehovsky as a code owner June 10, 2024 18:08

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 10, 2024

dotnet-issue-labeler bot added the new-api-needs-documentation label Jun 10, 2024

dotnet-policy-service bot assigned tannergooding Jun 10, 2024

tannergooding commented Jun 10, 2024

View reviewed changes

tannergooding closed this Jun 10, 2024

tannergooding reopened this Jun 10, 2024

tannergooding force-pushed the avx10v1 branch from b7b9dd2 to 7ef9a0d Compare June 10, 2024 19:49

Cleanup some handling around Avx10v1

9bffa42

tannergooding force-pushed the avx10v1 branch from 7ef9a0d to 9bffa42 Compare June 10, 2024 23:05

Merge remote-tracking branch 'dotnet/main' into avx10v1

876d1a9

tannergooding mentioned this pull request Jun 12, 2024

remove unnecessary Avx512VL ISA flags. #103144

Merged

This was referenced Jun 12, 2024

System.Net.Http.Functional.Tests.HttpMetricsTest_Http11_Async_HttpMessageInvoker Test failure #96407

Closed

System.Net.Security.Tests.SslStreamTlsResumeTests failing #103354

Closed

jkotas reviewed Jun 14, 2024

View reviewed changes

src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt Show resolved Hide resolved

jkotas reviewed Jun 14, 2024

View reviewed changes

tannergooding added 2 commits June 14, 2024 09:02

Remove unnecessary XArchIntrinsicConstants bits

a1b690d

Add a clarifying comment

3d6f60c

jkotas reviewed Jun 15, 2024

View reviewed changes

TIHan approved these changes Jun 18, 2024

View reviewed changes

tannergooding merged commit cb5832d into dotnet:main Jun 18, 2024
184 of 186 checks passed

tannergooding deleted the avx10v1 branch June 18, 2024 03:19

build-analysis bot mentioned this pull request Jun 18, 2024

System.Numerics.Tensors.Tests.TensorSpanTests test failure #103525

Closed

JulieLeeMSFT mentioned this pull request Jun 18, 2024

Intel architecture improvements for .NET 9 #93196

Closed

33 tasks

This was referenced Jul 9, 2024

[NO REVIEW] Update the CPUID and XSAVE logics for APX #103019

Closed

Update the CPUID and XSAVE logics for APX #104637

Open

github-actions bot locked and limited conversation to collaborators Jul 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cleanup some handling around Avx10v1 #103241

Cleanup some handling around Avx10v1 #103241

tannergooding commented Jun 10, 2024

dotnet-issue-labeler bot commented Jun 10, 2024

dotnet-issue-labeler bot commented Jun 10, 2024

dotnet-policy-service bot commented Jun 10, 2024

tannergooding Jun 10, 2024

tannergooding commented Jun 12, 2024

tannergooding commented Jun 12, 2024

tannergooding commented Jun 13, 2024

jkotas Jun 14, 2024

jkotas Jun 14, 2024

tannergooding Jun 14, 2024 •

edited

Loading

jkotas Jun 15, 2024

tannergooding Jun 15, 2024

jkotas Jun 15, 2024

tannergooding Jun 15, 2024

jkotas Jun 15, 2024

tannergooding Jun 15, 2024

jkotas Jun 16, 2024 •

edited

Loading

jkotas Jun 17, 2024

tannergooding Jun 17, 2024

tannergooding Jun 17, 2024

jkotas Jun 17, 2024

tannergooding Jun 17, 2024

jkotas commented Jun 15, 2024

Cleanup some handling around Avx10v1 #103241

Cleanup some handling around Avx10v1 #103241

Conversation

tannergooding commented Jun 10, 2024

dotnet-issue-labeler bot commented Jun 10, 2024

dotnet-issue-labeler bot commented Jun 10, 2024

dotnet-policy-service bot commented Jun 10, 2024

Choose a reason for hiding this comment

tannergooding commented Jun 12, 2024

tannergooding commented Jun 12, 2024

tannergooding commented Jun 13, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tannergooding Jun 14, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas Jun 16, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jkotas commented Jun 15, 2024

tannergooding Jun 14, 2024 •

edited

Loading

jkotas Jun 16, 2024 •

edited

Loading