Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add VPCLMULQDQ intrinsics #109137

Merged
merged 20 commits into from
Nov 20, 2024
Merged

Add VPCLMULQDQ intrinsics #109137

merged 20 commits into from
Nov 20, 2024

Conversation

saucecontrol
Copy link
Member

@saucecontrol saucecontrol commented Oct 23, 2024

Fixes #95772

This is one of several similar new ISAs, where an existing ISA (PCLMULQDQ) was extended to 256-bit with one cpuid flag (VPCLMULQDQ) and then to 512-bit when combined with AVX-512 (VPCLMULQDQ+AVX512F) support.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 23, 2024
@saucecontrol
Copy link
Member Author

The existing modeling for ISA support presents a challenge in that JIT wants to see a 1:1 mapping between ISA and implementing class, but the actual ISAs are represented by a combination of flags.

For my first attempt, I have virtualized this in JIT similarly to the way that e.g. Vector256<T> is a 'fake' ISA built from some combination of other ISA support, however this is not working properly with R2R/AOT where it wants all ISAs to be selectable with a switch.

Looking at some of the existing implementations, I see that the fake _VL ISAs are leaking into the ilc --instruction-set list, which is less than ideal. It also seems that the Avx10v1_V512 handling is broken for R2R since the ThunkGenerator code only handles nested class names named X64 or VL.

I'd appreciate some guidance if there's a better way to handle this scenario since there will be more like this.

@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 23, 2024

@MichalStrehovsky could I trouble you to look at this? JIT side is working, but I've left it as draft for now because the NAOT leg is failing due to an assert on the (intentionally) unexposed fake ISA.
At this point it looks like I have two options:

  1. Leak the fake ISA into R2R/AOT configs, as is done for the others
  2. Change the way ISAs are mapped to implementing intrinsic classes in R2R/AOT

I'm not sure how much work 2) is in the end, or whether that's something the runtime team cares about

The currently broken code I mentioned is

case "Avx10v1_V512":
if (nestedTypeName == "X64")
{ return InstructionSet.X64_AVX10v1_V512_X64; }
else
{ return InstructionSet.X64_AVX10v1_V512; }

because that generated method is matching on managed type name, and there's no Avx10v1_V512 type. The generator currently has special handling for VL and X64 nested classes, so I could add V256 and V512 handling to that, but it also only handles one layer of nesting right now, so it can't handle Avx10v1.V512.X64 without bigger changes.

I guess there's also option 3) Do it the ugly way for now and hope somebody cleans it up later...

@MichalStrehovsky
Copy link
Member

@MichalStrehovsky could I trouble you to look at this?

I don't have much guidance to offer, sorry. Things got a lot more complicated since I last touched any of this (when all we had was AVX2) and I haven't exactly been keeping track of it. E.g. I don't know why _VL instructions are fake and whether we intentionally want/or do not want to support them as --instruction-set, and what to do about this one. I think there was some discussion about getting rid of _VL in the past but I can't quickly find the PR.

Ideally RyuJIT implementation details shouldn't leak out into the managed parts of the compiler or R2R file format, so if RyuJIT needs something fake to operate, it ideally shouldn't burden other components (because then the owners of said component who don't know about RyuJIT implementation details and don't know much about hardware intrinsics in general either have no clue about what's going on). But maybe it's necessary, I don't know. We pulled these RyuJIT implementation details from cpufeatures.h in the past, maybe they can be pulled from more places.

@tannergooding and @davidwrighton might have more of an opinion.

@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 24, 2024

I don't have much guidance to offer, sorry.

Fair enough. Thanks for the reply anyway.

For background, the reason the _VL ISAs are 'fake' is that they represent the intersection of other real cpuid bits. e.g. AV512BW_VL actually means AVX512BW+AVX512VL. In reality this distinction is unimportant since there isn't any actual hardware that implements one of those ISAs without the other, meaning no one would ever have a legit reason for wanting to test support of the intersection independent from the individual cpuid bits. It does represent a usability issue, however, since the fake ISA leaks into the --instruction-set options, and their being fake means people can't easily look up what they mean. This is mitigated by the fact we also have the x86-64-v4 option which includes all the _VL fake ISAs and is a well-known shortcut.

These newer intersection-style ISAs are more problematic because 1) they don't fall under a well-known x86-64 version set and 2) they actually do exist in hardware independent of each other. For example, Skylake-X implements PCLMULQDQ+AVX512F but not VPCLMULQDQ, Alder Lake implements PCLMULQDQ+VPCLMULQDQ but not AVX512F, and Zen 4 and 5 implement the full set. I'm hoping we can arrive at a better solution for them that also happens to clean up the handling of the ISAs that are already implemented.

@MichalStrehovsky
Copy link
Member

Thanks for the explanation! I agree that given all this, we should ideally not expose _VL as something people can specify on the command line.

@tannergooding
Copy link
Member

I think there was some discussion about getting rid of _VL in the past but I can't quickly find the PR.

It's spread out in a few places, I think the most recent was in: #103241 (comment)

There's a bit of a balance overall between modeling what the CPU exposes (irrespective of the implementation) and modeling something reasonable for users to consume and handle. For AVX512, I'd say our current setup is "wrong" and we should instead either:
A) Remove the various AVX512*_VL combined flags in favor of just having AVX512VL
B) Remote AVX512F, AVX512BW, AVX512CD, AVX512DQ and the various AVX512*_VL flags; only exposing AVX512

With A, we have something where each --instruction-set matches a corresponding CPUID bit. But the weird nuance is that we then have to figure out how that maps to the implementation since we require F+BW+CD+DQ+VL for any AVX512 functionality to work.

With B, we instead more closely model what the runtime requires and what is effectively the minimum baseline for things to "work".


With VPCLMULQDQ here, I don't think we have a choice other than having both PCLMULQDQ and VPCLMULQDQ as --instruction-set options due to how the CPUID bits and the implementation needs to exist. The rest of the combinations (PCLMULQDQ+AVX, VPCLMULQDQ+AVX, VPCLMULQDQ+AVX512F, and VPCLMULQDQ+AVX512VL) don't need their own flags and are instead represented by the other opt ins the user has given.

So, I think what we want is we need to do is always have virtual instruction sets for any managed exposed ISA class (such as Avx10v1.V512 or Pclmulqdq). These may not have a corresponding --instruction-set. We then ensure that some of the instruction sets do map to --instruction-set bits where it is sensible to do.

I think what you currently have in the PR roughly models that. We have Pclmulqdq (the class) which maps to pclmul (the instruction-set); we have Pclmulqdq_V256 (the class) which maps to vpclmulqdq (the instruction set, which maybe should be called vpclmul for parity); and we have Pclmulqdq_V512 (the class) which doesn't map to any instruction set (as the instruction set is implied by vpclmul+avx512).

@saucecontrol
Copy link
Member Author

saucecontrol commented Oct 25, 2024

Thanks, Tanner. I think if the decision is to change up the handling of the virtual ISAs in general, that's probably better done in a separate PR, which leads me to believe maybe the best path here is to go ahead and follow the existing pattern for now, and clean them all up later.

I've made the required changes to ThunkGenerator to fix the Avx10v1.V512 mappings as well as the V256 and V512 from this new API. That change ended up being bigger than I expected, because it led me to realize that all the _VL_X64 ISAs were in fact not used anywhere. I've split that change out into #109210 and will rebase this one once that's merged.

@saucecontrol
Copy link
Member Author

OK, this is ready for another review pass. All feedback addressed and updated tests passing.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CC. @dotnet/jit-contrib for secondary review

@tannergooding
Copy link
Member

ping @dotnet/jit-contrib for secondary review

@tannergooding tannergooding merged commit 08a36ca into dotnet:main Nov 20, 2024
167 of 169 checks passed
@saucecontrol saucecontrol deleted the vpclmulqdq branch November 20, 2024 20:56
MichalStrehovsky added a commit to MichalStrehovsky/runtime that referenced this pull request Nov 28, 2024
AFAIK methods on the nested X64/Arm64 classes shouldn't be considered intrinsics on 32bit platforms since they are as relevant as e.g. WASM intrinsics. This should fix widespread runtime-nativeaot-outerloop failure on x86. I think this regressed in dotnet#109137.
MichalStrehovsky added a commit that referenced this pull request Dec 3, 2024
AFAIK methods on the nested X64/Arm64 classes shouldn't be considered intrinsics on 32bit platforms since they are as relevant as e.g. WASM intrinsics. This should fix widespread runtime-nativeaot-outerloop failure on x86. I think this regressed in #109137.
eduardo-vp pushed a commit to eduardo-vp/runtime that referenced this pull request Dec 5, 2024
AFAIK methods on the nested X64/Arm64 classes shouldn't be considered intrinsics on 32bit platforms since they are as relevant as e.g. WASM intrinsics. This should fix widespread runtime-nativeaot-outerloop failure on x86. I think this regressed in dotnet#109137.
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
* add vpclmulqdq intrinsics

* add missing break

* add alternate instruction def for evex encoding

* rename instruction

* whitespace

* re-run thunk generator

* fix AOT instruction sets

* address feedback

* apply formatting patch

* address feedback round 2

* add missing brace

* fix smoketest expected results

* fix suffix order

* handle implied V512 support in AOT

* remove more unnecessary X64 ISA variants

---------

Co-authored-by: Tanner Gooding <tagoo@outlook.com>
mikelle-rogers pushed a commit to mikelle-rogers/runtime that referenced this pull request Dec 10, 2024
AFAIK methods on the nested X64/Arm64 classes shouldn't be considered intrinsics on 32bit platforms since they are as relevant as e.g. WASM intrinsics. This should fix widespread runtime-nativeaot-outerloop failure on x86. I think this regressed in dotnet#109137.
@github-actions github-actions bot locked and limited conversation to collaborators Dec 22, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime.Intrinsics community-contribution Indicates that the PR has been added by a community member new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API Proposal]: VPCLMULQDQ Intrinsics
4 participants