Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

[CPAOT] Generating code for hardware intrinsic #26772

Merged

Conversation

cshung
Copy link
Member

@cshung cshung commented Sep 18, 2019

  • Testing
  • Restrict generation to CoreLib only

Some testing is in progress for validating this change:

  1. Compiling System.Private.CoreLib with the assert causing failure in #12093 commented out. The compilation succeeded.

  2. R2RDump verification of the generated binary.

  • Avx related artifacts are NOT generated.
  • get_IsSupported for x86/x64 hardware intrinsics are not pre-compiled (the ARM64 ones are generated, which is weird but doesn't hurt)
  • The hardware intrinsic method themselves are not generated. This fact turns out to be important. We had the code to filter the method attributes so that the hardware intrinsic bit is taken away, if the unsupported Avx methods are precompiled, the infinite recursion code would be precompiled as is into the binary, causing stack overflow.
  • The hardware intrinsic instructions are used in code path guard by get_IsSupported.
  • The fallback path is generated.

Below is a snippet showing the case where the hardware intrinsic instruction is generated and guarded by IsSupported with the fallback path generated as well:

Int32 System.Decimal+DecCalc.ScaleResult(System.Decimal+DecCalc+Buf24*, UInt32, Int32)
...
  338ad5: ff 15 1d 4e 55 00     call qword ptr [0x88d8f8]     // Boolean System.Runtime.Intrinsics.X86.Lzcnt.get_IsSupported() (METHOD_ENTRY_DEF_TOKEN)
  338adb: 0f b6 c8              movzx ecx, al
  338ade: 85 c9                 test ecx, ecx
  338ae0: 74 0a                 je 0x338aec
  338ae2: 45 33 ff              xor r15d, r15d
  338ae5: f3 45 0f bd fe        lzcnt r15d, r14d
  338aea: eb 28                 jmp 0x338b14
  338aec: 45 85 f6              test r14d, r14d
  338aef: 0f 94 c1              sete cl
  338af2: 0f b6 c9              movzx ecx, cl
  338af5: 85 c9                 test ecx, ecx
  338af7: 74 08                 je 0x338b01
  338af9: 41 bf 20 00 00 00     mov r15d, 32
  338aff: eb 13                 jmp 0x338b14
  338b01: 41 8b ce              mov ecx, r14d
  338b04: ff 15 8e d2 54 00     call qword ptr [0x885d98]     // Int32 System.Numerics.BitOperations.Log2SoftwareFallback(UInt32) (METHOD_ENTRY_DEF_TOKEN)
  338b0a: 44 8b f8              mov r15d, eax
  338b0d: 41 f7 df              neg r15d
  338b10: 41 83 c7 1f           add r15d, 31
  338b14: 41 2b ef              sub ebp, r15d
...
  1. test.cmd - following the standard procedure, the test binaries are built. I replaced all copies of System.Private.CoreLib.dll compiled with crossgen with the one compiled with crossgen2 and ran all the tests, 3 tests out of 2600+ tests failed, and they were failing before my changes.

  2. CI passes

@dotnet/crossgen-contrib

@cshung cshung force-pushed the dev/andrewau/crossgen2-hardware-intrinsic branch from 63a56b7 to a035d0a Compare September 18, 2019 23:02
@MichalStrehovsky
Copy link
Member

I have been experimenting with this feedback for a while today. I can't find the logic that stops the JIT from inlining the IsSupported methods, but I don't see them getting inlined either after I applied Hmm - not sure. So well, I am not sure, but it seems to work.

Just to make sure we're talking about the same thing: are you porting #24689, #24917, ideally also #25365? Hardware intrinsics are not actually enabled unless all the CORJIT_FLAG_USE_AES, etc. flags are enabled (they're all under a READYTORUN ifdef right now).

We're stripping the intrinsic bit and marking them as non-inlineable in a single place. The non-inlinineability is just a precaution - since IsSupported calls are currently implemented to recursively call themselves, they won't get inlined. But this shouldn't inline even if that definition changes for some reason or if RyuJIT becomes smart and turns it into throw StackOverflowException() - RyuJIT simply shouldn't be looking at that body and marking this non-inlineable accomplishes that.

@cshung
Copy link
Member Author

cshung commented Sep 26, 2019

Just to make sure we're talking about the same thing: are you porting #24689, #24917, ideally also #25365?

I wasn't aware of #25365, now I am.

Hardware intrinsics are not actually enabled unless all the CORJIT_FLAG_USE_AES, etc. flags are enabled (they're all under a READYTORUN ifdef right now).

Thanks for the catch, it was there when I did it on the CoreRT repo, and got drop accidentally when I port it over here. I will put it back and test again.

We're stripping the intrinsic bit and marking them as non-inlineable in a single place.

That is what I worried and not sure about. I just can't find that one place. Whatever that place is, it should be using CORINFO_FLG_DONT_INLINE, but searching across the whole repo that value is used only once in all CSharp files. The function FilterHardwareIntrinsicMethodAttribs doesn't seem to exist either which confuse me further.

since IsSupported calls are currently implemented to recursively call themselves, they won't get inlined

Here is my simple detector to check whether or not the JIT attempts to inline it or not.

Adding this line right before here:

printf("%s\n", info.compFullName);

I believe any function that is compiled or inlined by the JIT must go through there, and if I spot any IsSupported there, we are screwed. I confirmed there is none during the compilation of CoreLib. It might happen after the recursion check though. Maybe I can try to modify the implementation so that it is no longer a recursion (e.g. simply return true) and see if it got inlined to be sure.

@cshung cshung force-pushed the dev/andrewau/crossgen2-hardware-intrinsic branch from ed695fc to 018fcbd Compare September 26, 2019 07:54
@MichalStrehovsky
Copy link
Member

That is what I worried and not sure about. I just can't find that one place. Whatever that place is, it should be using CORINFO_FLG_DONT_INLINE, but searching across the whole repo that value is used only once in all CSharp files. The function FilterHardwareIntrinsicMethodAttribs doesn't seem to exist either which confuse me further.

I'm referring to code that needs to be ported over. They should be ported from the crossgen codebase - look at the referenced pull requests.

Maybe I can try to modify the implementation so that it is no longer a recursion (e.g. simply return true) and see if it got inlined to be sure.

You might be looking at a cross-version bubble inline (we block those, you need to run with large version bubble enabled etc.). Do not spend time on it, please just port over what crossgen does.

@cshung
Copy link
Member Author

cshung commented Sep 26, 2019

I'm referring to code that needs to be ported over. They should be ported from the crossgen codebase - look at the referenced pull requests.

Gotcha, I was confused earlier because I didn't know you were referring code that doesn't exist yet. Am I correct that you are suggesting the logic should be written here?

In general, not being able to tell what I was doing is right or wrong is painful. I was trying all sort of ways to make the code reports whether or not I am doing the right thing or not. Even when I missed a bunch of flags, a whole collection of hardware intrinsics are compiled, compared to the version without change. (Maybe adding the flags will add even more, I will have to experiment tomorrow).

@MichalStrehovsky
Copy link
Member

In general, not being able to tell what I was doing is right or wrong is painful.

Yep, it was painful when I was doing this for crossgen too. I ended up identifying a several methods of interest in CoreLib that use HW intrinsics, dumped them with r2r dump and validated that:

  1. We're generating them (or not generating them - in case of the Avx/Avx2 intrinsics)
  2. They have explicit calls to IsSupported (except when this is Sse/Sse2, in which case they should not call IsSupported anymore and they should be expanded as if IsSupported returned true).
  3. For the cases when IsSupported call is present, we generate both the intrinsic implementation and non-intrinsic implementation

So overall you need to inspect:

  • Something that uses Sse/Sse2
  • Something that uses Avx2
  • Something that uses e.g. Lzcnt or Popcnt

Then also have another test where you inspect it does the expected thing outside CoreLib.

@cshung cshung changed the title [WIP] [CPAOT] Generating code for hardware intrinsic [CPAOT] Generating code for hardware intrinsic Sep 26, 2019
@cshung cshung changed the title [CPAOT] Generating code for hardware intrinsic [WIP] [CPAOT] Generating code for hardware intrinsic Sep 27, 2019
@cshung cshung changed the title [WIP] [CPAOT] Generating code for hardware intrinsic [CPAOT] Generating code for hardware intrinsic Sep 27, 2019
Copy link
Member

@MichalStrehovsky MichalStrehovsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, modulo the 4 questions.

Maybe let's do the remaining two things in a separate pull request:

  • Not generate these into non-CoreLib modules: a check that grabs MethodBeingCompiled, extracts the owning types' module, and compares it with _compilation.TypeSystemContext.SystemModule might be enough (throw a RequiresRuntimeJitException to abort compilation of the method).
  • Making sure Vector64/128/256 are not generated (unless CoreLib is part of the version bubble). This should be just a small shuffle in ReadyToRunCompilerContext.cs around line VectorIntrinsicFieldLayoutAlgorithm.IsVectorType(type) - we want to only return _vectorIntrinsicFieldLayoutAlgorithm if CoreLib is in the version bubble. Otherwise we should return _vectorFieldLayoutAlgorithm (which is the thing we use for Vector<T> and will conveniently throw RequiresRuntimeJit whenever the layout is needed).

@cshung cshung force-pushed the dev/andrewau/crossgen2-hardware-intrinsic branch from 5c15f2b to 2899786 Compare September 30, 2019 23:56
@cshung cshung merged commit 6ec7898 into dotnet:master Oct 1, 2019
@cshung cshung deleted the dev/andrewau/crossgen2-hardware-intrinsic branch October 1, 2019 05:49
cshung added a commit to cshung/coreclr that referenced this pull request Oct 2, 2019