Allow specifying hardware support level for crossgen2 #226

Closed
cshung opened this issue Nov 22, 2019 · 16 comments

@cshung
Member

cshung commented Nov 22, 2019

Goals

For an ahead-of-time compiled binary to be widely applicable, it cannot make assumptions about the execution environment. For example, it cannot assume that the processor used to run the program supports AVX instructions.

This is unfortunate because ahead of time compilation is meant for performance, but it cannot be as performant as it could be, just because of the lack of information.

The change proposed in this issue is to remedy it - the problem is the lack of information - so we ask the user to supply it.

The abstract, high-level functional requirement:

  • Users specify the hardware support level during publishing.
  • Crossgen2 will make the assumption that the underlying hardware will have the support, and it generates code as such.
  • The runtime, upon load of the assembly, will verify that the current hardware does support the assumed level. Ready-to-run code will be used only when the validation succeeds.
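
As a rough illustration of the third requirement, the load-time check can be modeled as a subset test. This is a minimal sketch, not the runtime's implementation: the function name `can_use_r2r_code` and the example CPU sets are hypothetical. It also shows why support levels are sets rather than a single ordered level.

```python
from typing import Set


def can_use_r2r_code(assumed: Set[str], supported: Set[str]) -> bool:
    """Return True when every instruction set the image assumes is present.

    Hypothetical model: 'assumed' is what the user told crossgen2 at publish
    time, 'supported' is what the current machine reports at load time.
    """
    return assumed <= supported


# Hypothetical example data, for illustration only.
image_assumes = {"Sse", "Sse2", "Avx2"}
newer_cpu = {"Sse", "Sse2", "Avx", "Avx2", "Bmi1"}
older_cpu = {"Sse", "Sse2", "Popcnt"}

# Neither CPU's feature set contains the other, so there is no single
# "level" that orders them; only set containment is meaningful.
incomparable = not (older_cpu <= newer_cpu) and not (newer_cpu <= older_cpu)
```

Under this model the image would be usable on `newer_cpu` and rejected (falling back to the JIT) on `older_cpu`.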

The key challenges:

  • What exactly do we mean by the hardware support level?
    This is tricky: different processors support different subsets of instructions, and these subsets are not totally ordered. The only way to specify the exact hardware support is to specify the subset.

  • What if the hardware is more capable than the assumed level?
    Ideally, the hardware support is exactly as assumed. If it is less capable, we can bail out and refuse to load the assembly. But if it is more capable, then refusing to load the assembly seems harsh. If we do use the ready-to-run code, there could be problems as follows:

  1. Disagreeing on support level:
    Suppose the code below is jitted, while ready_to_run_use_x was ready-to-run compiled assuming x is not supported. ready_to_run_use_x would then throw PlatformNotSupportedException at runtime, which is not what we want.
void JittedCode()
{
  if (x is supported)
  {
    ready_to_run_use_x();
  }
}
  2. Disagreeing on Vector<T> size:
    Suppose we run this:
void JittedCode()
{
   Vector<T> x;
   ready_to_run_with_x(x);
}

If they do not agree on Vector<T> sizes, the call will not work.

  3. Disagreeing on calling convention:
    This is a general problem - if we change calling convention - then the call will not work. In general, calling convention is something we should just never change, but it appears to me that we will - due to this:
    https://github.com/dotnet/coreclr/issues/15943

AVX is likely to be much less useful if we cannot use Vector<T>.

Design

The solution to (1) is TBD; it will be a verbose and extensible format that describes the instruction subset.
The solution to (2) is that we specify a fixed Vector<T> size and also enforce the same size at runtime when JIT asks for it.
The solution to (3) is TBD - ideally, it is fixed so we can use Vector<T> in crossgen2.

If we zoom out a little bit, we notice the general problem is disagreement. The past approach for ready-to-run is to solve the disagreement by shutting up (i.e. not compiling). Here I am proposing something different: we should solve the disagreement by letting the JIT follow the assumption (i.e. the Vector<T> size).

Currently, Vector<T> size is the only thing I want to force the JIT to follow. In particular, suppose we implemented AVX512; I don't want to stop the JIT from using it, meaning that if there is any code that uses AVX512, crossgen2 will refuse to compile it, just as it does today.

Audience

The key customers of this feature are:

  • People who care about performance, and
  • People who have control over where their code runs.

This is likely to be rare in terms of the number of people. But if we could make it on the cloud, that could potentially benefit many people automatically.

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-crossgen2-coreclr untriaged New issue has not been triaged by the area owner labels Nov 22, 2019
@jeffschwMSFT jeffschwMSFT removed the untriaged New issue has not been triaged by the area owner label Jan 8, 2020
@jeffschwMSFT jeffschwMSFT added this to the 5.0 milestone Jan 8, 2020
@davidwrighton
Member

davidwrighton commented Feb 11, 2020

Crossgen2 Instruction Set Support

This covers specification of the baseline instruction set for use in compiles via Crossgen2.

Instruction sets

Instruction sets in crossgen2 are currently handled on a minimal baseline basis, and use of intrinsics which require instructions beyond the baseline set causes the compiler to not generate code and instead rely on the JIT. The intention with this work is to specify a means by which a developer can specify a new baseline to the compiler, and achieve correctness.

Problems for version resilient code generation with intrinsics and vector instructions

  • On X86 and X64 architectures, the jit has multiple different ways to generate the 128bit and 256 bit vector instructions. In particular, there is the legacy SSE encoding, the VEX encoding and the EVEX encoding (introduced by AVX512, and currently unsupported by the runtime). This leads to significantly different codegen between compilations when Avx is specified as an available instruction set and when it is not.
  • Vector<T> has 2 different sizes on X86: 16 bytes on Sse supporting architectures, and 32 bytes on Avx2 supporting architectures. (Note that Avx support is insufficient for support of 32 byte Vector<T>.)
  • For System.Private.CoreLib only, crossgen supports making IsSupported checks for some Sse variations. This is a pragmatic support for fixing startup time JIT penalties. In other assemblies, such support does not exist, as it is not considered quite as reliable as rejitting the method (in terms of guaranteeing that the intrinsics are correctly used, and do not cause an illegal instruction fault.)
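
The Vector<T> sizing rule in the list above can be written down as a small model. This is only a sketch under the stated rule (32 bytes with Avx2, otherwise 16 bytes on X86/X64); the function names are hypothetical and nothing here reflects actual JIT code.

```python
from typing import Set


def vector_t_size_bytes(supported: Set[str]) -> int:
    """Model of the X86/X64 rule described above: Avx2 gives 32 bytes, else 16.

    Note that Avx alone is not enough for the 32 byte size.
    """
    return 32 if "Avx2" in supported else 16


def sizes_agree(compile_time_sets: Set[str], runtime_sets: Set[str]) -> bool:
    """True when precompiled code and jitted code would use the same Vector<T> size."""
    return vector_t_size_bytes(compile_time_sets) == vector_t_size_bytes(runtime_sets)
```

If `sizes_agree` is false for an image and the machine it runs on, calls that pass Vector<T> between precompiled and jitted code would not work, which is the disagreement the proposal aims to rule out.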

New Command line arguments to support changed instruction set baselines

-instruction-set:<InstructionSetName>[+-]

This command line option may be specified multiple times on the command line. If there are multiple specifications for the same instruction set name, then the behavior specified rightmost on the command line shall take precedence. If no qualifier is provided for the instruction set, + is assumed.

e.g. -instruction-set:Avx2 -instruction-set:Bmi1 -instruction-set:Bmi2-
This indicates that the Avx2 and Bmi1 instruction sets are supported, and the Bmi2 instruction set is known to be unsupported. Note: the default state of instruction sets is indeterminate, which means support is unknown to the compiler. In general, that means if a method uses such an instruction, then crossgen2 will abort generation of the method, and instead force a JIT operation to occur.
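
The precedence rules for this option can be sketched as follows. This is an illustrative model of the semantics described above, not crossgen2's actual argument parser; `parse_instruction_set_args` and `PREFIX` are hypothetical names.

```python
from typing import Dict, Iterable

PREFIX = "-instruction-set:"  # spelling as proposed in this comment


def parse_instruction_set_args(args: Iterable[str]) -> Dict[str, bool]:
    """Map instruction set name -> True (supported) or False (unsupported).

    Names that never appear stay out of the result, which models the
    'indeterminate' default state.
    """
    state: Dict[str, bool] = {}
    for arg in args:
        if not arg.startswith(PREFIX):
            continue  # not an instruction set option
        spec = arg[len(PREFIX):]
        if spec.endswith("+"):
            name, supported = spec[:-1], True
        elif spec.endswith("-"):
            name, supported = spec[:-1], False
        else:
            name, supported = spec, True  # no qualifier: '+' is assumed
        if not name:
            raise ValueError(f"missing instruction set name in {arg!r}")
        state[name] = supported  # rightmost specification wins
    return state
```

For the example command line above, this yields Avx2 and Bmi1 as supported and Bmi2 as unsupported, while a set such as Lzcnt that was never mentioned remains indeterminate.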

Exception to the rules above

On the X86 and X64 platforms, the Sse and Sse2 instruction sets are supported by the baseline. These command line arguments provide a means for generating code which does not support those intrinsics, but also require that the runtime is run in a mode where Sse and Sse2 intrinsics are disabled. This can be done via an environment variable.

TODO: What is the baseline support on Arm64 and Arm

Effect of instruction set specification for crossgen2 in .NET 5.0

  • The backend compiler shall be run in a mode where all instruction sets supported by the baseline are enabled for use by the JIT. All instruction sets marked as unsupported will not be enabled. If the compiler encounters a use of a type or intrinsic which requires use of an instruction set which has indeterminate support, then the method will not be AOT compiled, as it does today.
  • Vector<T> hardware intrinsic support will be enabled in crossgen2 on X86 and X64 platforms if the status of the Avx2 instruction set is known.
  • Similarly, Vector<T> hardware intrinsic support will be enabled in crossgen2 on Arm64 platforms if the status of the appropriate instruction set is known.
  • Some instruction sets imply that other instruction sets are enabled. For instance, Sse2 implies that Sse is enabled. Conflicting specification will produce a crossgen2 compile time error. (An indeterminate specification cannot produce such an error, but may be overridden by a positive assertion to an instruction set which is supported.)

Supported instruction sets for crossgen2 in .NET 5.0

The naming of these instruction sets is based on the type names in the System.Runtime.Intrinsics namespace.

| Architecture | Instruction Set | Implied Instruction Sets | Notes |
|---|---|---|---|
| X86/X64 | Aes | Sse2 | |
| X86/X64 | Avx | Sse, Sse2, Sse3, Ssse3, Sse41, Sse42 | |
| X86/X64 | Avx2 | Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx | |
| X86/X64 | Bmi1 | Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx | This instruction set implies Avx because they use the VEX encoding |
| X86/X64 | Bmi2 | Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx | This instruction set implies Avx because they use the VEX encoding |
| X86/X64 | Fma | Sse, Sse2, Sse3, Ssse3, Sse41, Sse42, Avx | |
| X86/X64 | Lzcnt | | |
| X86/X64 | Pclmulqdq | Sse2 | |
| X86/X64 | Popcnt | Sse, Sse2, Sse3, Ssse3, Sse41, Sse42 | |
| X86/X64 | Sse | | |
| X86/X64 | Sse2 | Sse | |
| X86/X64 | Sse3 | Sse, Sse2 | |
| X86/X64 | Ssse3 | Sse, Sse2, Sse3 | |
| X86/X64 | Sse41 | Sse, Sse2, Sse3, Ssse3 | |
| X86/X64 | Sse42 | Sse, Sse2, Sse3, Ssse3, Sse41 | |
| Arm64 | AdvSimd | ArmBase | |
| Arm64 | Aes | ArmBase | |
| Arm64 | ArmBase | | |
| Arm64 | Crc32 | ArmBase | |
| Arm64 | Sha1 | ArmBase | |
| Arm64 | Sha256 | ArmBase | |
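
To make the implication and conflict rules concrete, here is a small sketch that expands a user specification using the X86/X64 rows of the table above. The data is copied from the table; the function name `resolve_implied` and the choice of `ValueError` for conflicts are assumptions for illustration, not crossgen2 behavior.

```python
from typing import Dict, List

_SSE_CHAIN = ["Sse", "Sse2", "Sse3", "Ssse3", "Sse41", "Sse42"]

# X86/X64 rows of the table above: instruction set -> implied instruction sets.
IMPLIED: Dict[str, List[str]] = {
    "Sse": [], "Sse2": _SSE_CHAIN[:1], "Sse3": _SSE_CHAIN[:2],
    "Ssse3": _SSE_CHAIN[:3], "Sse41": _SSE_CHAIN[:4], "Sse42": _SSE_CHAIN[:5],
    "Popcnt": _SSE_CHAIN, "Avx": _SSE_CHAIN,
    "Avx2": _SSE_CHAIN + ["Avx"], "Bmi1": _SSE_CHAIN + ["Avx"],
    "Bmi2": _SSE_CHAIN + ["Avx"], "Fma": _SSE_CHAIN + ["Avx"],
    "Aes": ["Sse2"], "Pclmulqdq": ["Sse2"], "Lzcnt": [],
}


def resolve_implied(state: Dict[str, bool]) -> Dict[str, bool]:
    """Expand supported sets with everything they imply.

    Raises ValueError when a supported set implies one that the user
    explicitly marked unsupported (the 'conflicting specification' case).
    Indeterminate sets (absent keys) are simply promoted to supported.
    """
    resolved = dict(state)
    pending = [name for name, supported in state.items() if supported]
    while pending:
        name = pending.pop()
        for implied in IMPLIED.get(name, []):
            if resolved.get(implied) is False:
                raise ValueError(f"{name} is supported but implies unsupported {implied}")
            if implied not in resolved:
                resolved[implied] = True
                pending.append(implied)  # follow transitive implications
    return resolved
```

For example, asserting only Aes marks Sse2 and, transitively, Sse as supported, while asserting Avx2 together with Sse2- is reported as a conflict.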

R2R format changes and runtime support to support changed instruction set baselines

  • The R2R format shall support a new fixup type READYTORUN_FIXUP_Check_InstructionSetSupport. This fixup type will encode a series of different instruction sets, indicating for each whether or not support is expected to exist. Each instruction set will be assigned a unique number. The fixup will consist of a single number indicating the number of instruction sets described, followed by each individual instruction set. The instruction set number shall be shifted left by one bit, leaving 31 bits of encoding for the instruction set number, and the low bit shall be used to indicate whether or not the instruction set is known to be supported. Individual methods may be marked with this fixup, or it may exist in an eager fixups section. If an individual method is marked with this fixup and the check fails, then that particular method isn't used. If the fixup is found in an eager fixups section (an import fixups section marked with CORCOMPILE_IMPORT_FLAGS_EAGER) and the check fails, then none of the code from the module may be used.

  • The behavior of R2R format eager fixups (fixups where their import section is marked with CORCOMPILE_IMPORT_FLAGS_EAGER) will change. In particular, instead of throwing BadImageFormatException when a fixup fails to process, they will instead disable use of the R2R image for the rest of program execution.

It is expected that crossgen2 updated baseline scenarios will add an eager fixups section that specifies the layout of Vector<T>, as well as the updated baseline instruction sets.
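
The fixup payload described above (a count, then one value per instruction set with the number shifted left by one and the low bit carrying the supported flag) can be sketched like this. The instruction set numbers used in the usage note are hypothetical, and the way the R2R file actually serializes these integers is not modeled here.

```python
from typing import List, Tuple


def encode_fixup(entries: List[Tuple[int, bool]]) -> List[int]:
    """Encode (instruction_set_number, supported) pairs as a list of integers.

    Layout: [count, entry0, entry1, ...] where each entry is
    (number << 1) | supported_bit, leaving 31 bits for the number.
    """
    words = [len(entries)]
    for number, supported in entries:
        if not 0 <= number < (1 << 31):
            raise ValueError(f"instruction set number out of range: {number}")
        words.append((number << 1) | (1 if supported else 0))
    return words


def decode_fixup(words: List[int]) -> List[Tuple[int, bool]]:
    """Inverse of encode_fixup; validates the leading count."""
    if not words or len(words) != words[0] + 1:
        raise ValueError("count does not match the number of entries")
    return [(word >> 1, bool(word & 1)) for word in words[1:]]
```

With hypothetical numbers 5 for Avx2 (supported) and 7 for Bmi2 (unsupported), `encode_fixup([(5, True), (7, False)])` gives `[2, 11, 14]`, and `decode_fixup` recovers the original pairs.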

Most common expected use cases

  1. Server deployed applications with well known processor support. Cloud platforms have very regular support for most of these instruction sets. In particular, support for Avx2 is extremely common in cloud environments, and can likely be relied on to exist.
  2. Testing of restricted instruction set surface area, such as testing on X64 non-hardware assisted paths for use on Arm machines. Use of this capability will require execution of the runtime with special configuration knobs set which disable baseline.

Plausible future work

  • Multiversioning of code, so that both a moderately recent and a very old instruction set baseline are supported. This implies supporting multiple different Vector<T> calling conventions within 1 compile with crossgen, as well as runtime support for choosing between 2 different variants of code in an R2R image. While technically very feasible, the cost is strictly higher than required to provide decent support for cloud customers.
  • Optimistic code gen for vector intrinsics support. Instead of multiversioning, have a baseline, and preferred instruction set. Where there are differences generate only one copy of code, but do it in a smart way as to provide maximum optimization on most hardware.

@davidwrighton
Member

@jkotas, does this sound like a reasonable plan?

@tannergooding
Member

Worth noting a few instances of Vector<T> don't show up properly since the < and > aren't escaped (nor are they in a code block).

It might be worth elaborating why Bmi1/Bmi2 imply Avx even though they don't inherit from it. Namely, Bmi1/Bmi2 use the VEX encoding and the JIT doesn't support separating them today.
The only scenario this really matters is if the CPU supports AVX but the OS hasn't set the flag indicating it supports XSAVE/XRSTOR (which allows saving YMM registers). You can currently enter this state on some operating systems by toggling a boot flag (such as bcdedit /set xsavedisable 0 on Windows).

The R2R format shall have a 64bit integer embedded

Will this be sufficient for all combinations in the future? AVX-512 itself has, what looks to be, 20 ISAs: https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512

@jkotas
Member

jkotas commented Feb 11, 2020

the runtime shall produce a FailFast with a message indicating the instruction support mismatch.

R2R gracefully falls back to JIT when the R2R payload is unusable. We used to throw exceptions on various mismatches in the past, but people found it very inconvenient. I believe we should fall back gracefully in this case as well. We have tracing and other mechanisms to allow diagnosing the mismatches.

The R2R format shall have a 64bit integer embedded in a new section to specify the instruction baseline.

Would it make sense to encode this as fixup? The fixups are designed to encode image or method body prerequisites among other things. For example, READYTORUN_FIXUP_Check_TypeLayout fixup is used to encode dependencies on external struct layout. This looks very similar. Also, fixups are variable length so it can be naturally extended infinitely in future.

The default state of instruction sets is ?

Nit: ? is wildcard on Unix shells. Shell wildcards are always fun to deal with ... do we even need it?

Multiversioning of code

An easier variant of multiversioning may be encoding method prerequisites. We know that the majority of machines out there do support many of the intrinsics. How hard would it be to generate the method code with the assumption that it runs on a recent machine, record the assumptions made by the method if there are any, and then fall back to JIT only when the assumptions are not met?

@davidwrighton
Member

@tannergooding @jkotas I've updated my idea on specification of instruction sets. It's now quite a bit more flexible.

Also, I've removed the concept of the ? specifier. That really only existed to allow command line overriding of an already passed command line option back to default indeterminate behavior. It's a bit … overdone.

@tannergooding Could you tell me if my set of implied instruction sets is correct? Of particular interest are the Lzcnt, Popcnt, Aes, and Pclmulqdq instruction sets, which nothing else implies.

@tannergooding, do we have baseline support on Arm64 for any of the instruction sets (such as ArmBase)? Is there an instruction for which support is required before support for Vector<T> appears?

@tannergooding, is Arm expected to support any of the instruction sets? Does it support Vector<T>?

@jkotas
Member

jkotas commented Feb 11, 2020

READYTORUN_SECTION_EAGER_FIXUPS

This section exists already. It is encoded via the CORCOMPILE_IMPORT_FLAGS_EAGER flag on CORCOMPILE_IMPORT_SECTION.

@davidwrighton
Member

@jkotas, fantastic. I've updated the spec to discuss CORCOMPILE_IMPORT_FLAGS_EAGER. I've added a tweak to that behavior to not throw BadImageFormatException in cases of a Check failure, but otherwise I think I can use the existing functionality.

@tannergooding
Member

Could you tell me if my set of implied instruction sets is correct. Of particular interest are the Lzcnt, Popcnt, Aes, and Pclmulqdq instruction sets, which nothing else implies

Looks correct and matches the inheritance hierarchy (modulo Bmi1/2 which have the associated explanation).

do we have baseline support on Arm64 for any of the instruction sets (such as ArmBase)? Is there an instruction for which support is required before support for Vector<T> appears?

I'm not aware of what hardware we have declared as baseline for ARM64. Vector<T> requires AdvSIMD support to exist.

is Arm expected to support any of the instruction sets? Does it support Vector<T>?

Arm32 (and prior) currently has no intrinsic support and will not for 5.0. I'm unaware if we have any plans to make it happen in the future.

@davidwrighton
Member

Adding @dotnet/crossgen-contrib for comments. My goal is to implement something as close as possible to my first comment.

davidwrighton added a commit that referenced this issue Apr 3, 2020
- Add support for the --instruction-set parameter as described in #226 . 
NOTE: As the abi for Vector parameters is not yet stable, support for the --instruction-set parameter is only enabled if --inputbubble is also enabled. Parallel work to stabilize the abi is in progress, but is not complete.
ALSO NOTE: The names of the instruction sets are shared with mono, and don't follow the names in issue #226
- Add concept of baseline instruction set support to R2R file format 
- Can be applied at a per method level or at the entire R2R file level
  - R2RDump support for dumping the extra data
- Refactor how support for hardware intrinsics beyond SSE2 support are handled in crossgen2 
- Add feature to the JIT to detect which hardware features are actually used
  - Tell the JIT unconditionally that SSE42+Lzcnt+Popcnt+Pclmulqdq are supported
  - But if support beyond the --instruction-set specified baseline is used, notate the method with a per-method instruction set support fixup.
  - This enables usage of many intrinsics in corelib with greater efficiency than today
  - This enables usage of SSE42 and below intrinsics safely in non-CoreLib code. Use of higher level intrinsics in non CoreLib code will generate code which does not use the higher level intrinsic, and note that the method's code should not be used in the presence of hardware which does support greater CPU capabilities. 
  - In the future a logical enhancement of this work would be to generate multiple bodies of code to handle these more complex cases.
  - In combination with the --instruction-set argument, if Avx2 is enabled, then the logic gracefully adds a dependency on Avx2 capability and Vector<T> becomes useable by crossgen'd code.
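
The per-method fixup idea in the commit notes above can be modeled as a set difference between what a method's generated code relies on and what the specified baseline covers. This is a hedged sketch only: the function name `per_method_fixup` and the example sets are hypothetical, and the real compiler tracks this inside the JIT interface rather than with string sets.

```python
from typing import List, Set


def per_method_fixup(used_by_method: Set[str], baseline: Set[str]) -> List[str]:
    """Return the instruction sets a method relies on beyond the baseline.

    An empty result means no per-method fixup is needed; otherwise the
    method would be annotated so the runtime can reject just that method
    (and JIT it instead) when the hardware check fails.
    """
    return sorted(used_by_method - baseline)


# Hypothetical example: default X86/X64 baseline of Sse and Sse2.
default_baseline = {"Sse", "Sse2"}
popcount_method = {"Sse2", "Popcnt"}
plain_method = {"Sse2"}
```

Here `popcount_method` would carry a fixup naming Popcnt, while `plain_method` needs none. Raising the baseline, for instance with --instruction-set, shrinks the difference and therefore the number of annotated methods.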
@twest820

twest820 commented May 3, 2020

What would the developer interface for this look like? In C++ a typical method for processor targeting would be to implement private methods of a class across several files (translation units) so that different /arch settings can be used. A .csproj build setting for a minimum instruction set seems to make sense (this already exists in a way, since an x64 platform target implies SSE2), but it's less clear to me how more granular instruction set targeting might be accomplished. For example, if I could have a project set to an AVX platform target I'd still be looking for something like [MethodImpl(InstructionSet = AVX2)] on certain methods where the AVX instruction set is limiting.

Another case I commonly encounter is register spilling due to 16 ymms being inadequate. So, even in methods keeping to 128 bit SIMD to avoid knocking the core out of turbo, current disassemblies suggest substantial performance improvement would be possible if inlining decisions could take advantage of 32 zmms. EVEX access from C# isn't currently much of a concern for me but, with increasing AVX-512 availability from Intel Ice Lake shipments, I expect this to be changing by the end of 2020.

I'm also curious about the extent to which the CPU dispatching required of .NET Core implementations might be optimized out. For example, I've ended up "reimplementing" certain System.Runtime.Intrinsics paths for AVX targets by dropping unnecessary branches. This isn't a big deal but it does carry some cost.

@davidwrighton
Member

@twest820 Actually, crossgen2 support has been checked in to the tool itself, and is functional now although the benefits are currently quite modest (See the --instruction-set command line argument). The current expectation is that the control will be at a module level of granularity for now, but sometime after crossgen2 replaces crossgen and intrinsics use becomes more common in the community, we will likely explore adding more granular controls.

This issue remains open as the developer interface (which will be some csproj property) has not been designed and implemented.

@twest820

twest820 commented May 5, 2020

Hi David, thanks for the update. My experience of /arch:AVX and /arch:AVX2 is quite modest as well. Also within my experience, it's common that compute-intensive applications don't need to support 10+ or 7+ year old processors. So the return on the developer time for changing the build target remains excellent.

Is there an anticipated timeframe for the developer interface? I'm kind of hearing after .NET 5.

@davidwrighton
Member

We're likely to ship the feature as opt-in for x64 in .NET 5, in a use-at-your-own-risk manner. We expect it to function correctly in all cases, but performance will not have been tuned.

@davidwrighton davidwrighton modified the milestones: 5.0.0, 6.0.0 Jul 10, 2020
@davidwrighton
Member

davidwrighton commented Jul 10, 2020

We've enabled the ability to use crossgen2 in the sdk for .NET 5 via the PublishReadyToRunUseCrossgen2 property, but a friendly developer interface to specify the minimum viable instruction set will not happen in this release. In the meantime, the PublishReadyToRunCrossgen2ExtraArgs property can be used to specify a custom instruction set with an entry like

<PublishReadyToRunCrossgen2ExtraArgs>--instruction-set:avx2,bmi2,fma,pclmul,popcnt,aes</PublishReadyToRunCrossgen2ExtraArgs>

Note: These property values will be respected by the 5.0 sdk, but as support will be in preview, it is possible for the properties to change in future releases.

@mjsabby
Contributor

mjsabby commented Jul 12, 2020

@davidwrighton Thanks for the update. Seems reasonable that it will work and may change in 6.0 LTS, but at least there is an escape hatch for 5.0.

@mangod9
Member

mangod9 commented Jul 9, 2021

Closing since crossgen2 now supports specifying instruction-sets.

@mangod9 mangod9 closed this as completed Jul 9, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Aug 9, 2021