[RyuJIT] Avoid method call to fallback intrinsic method if immediate arg becomes constant #9989

fiigii · 2018-03-21T22:51:39Z

Currently, certain hardware intrinsics that accept an imm8 argument are replaced by a function call (whose body is usually a big jump table) if the imm8 argument is not a JIT-time constant.
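
To make the shape of such a fallback concrete, here is a minimal C++ sketch of the pattern. It is not the actual CoreCLR implementation: `Vec16`, `ExtractConst`, and `ExtractFallback` are hypothetical names, and a plain array stands in for a 128-bit vector. The point is only that a run-time index is dispatched through a switch to versions in which the index is a compile-time constant.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 128-bit vector of 16 signed bytes.
using Vec16 = std::array<std::int8_t, 16>;

// Models the "real" instruction: the index is a compile-time constant,
// just as an imm8 operand is encoded directly in the instruction.
template <unsigned Imm>
std::int8_t ExtractConst(const Vec16& v)
{
    static_assert(Imm < 16, "imm8 out of range for a 16-element vector");
    return v[Imm];
}

// Models the fallback method: the index is only known at run time, so a
// switch (typically compiled to a jump table) selects the constant version.
std::int8_t ExtractFallback(const Vec16& v, std::uint8_t index)
{
    switch (index & 0x0F) // only the low 4 bits select an element
    {
        case 0:  return ExtractConst<0>(v);
        case 1:  return ExtractConst<1>(v);
        case 2:  return ExtractConst<2>(v);
        case 3:  return ExtractConst<3>(v);
        case 4:  return ExtractConst<4>(v);
        case 5:  return ExtractConst<5>(v);
        case 6:  return ExtractConst<6>(v);
        case 7:  return ExtractConst<7>(v);
        case 8:  return ExtractConst<8>(v);
        case 9:  return ExtractConst<9>(v);
        case 10: return ExtractConst<10>(v);
        case 11: return ExtractConst<11>(v);
        case 12: return ExtractConst<12>(v);
        case 13: return ExtractConst<13>(v);
        case 14: return ExtractConst<14>(v);
        case 15: return ExtractConst<15>(v);
    }
    assert(false && "unreachable: the index was masked to 0..15");
    return 0;
}
```

Every call through `ExtractFallback` pays for the call and the dispatch, whereas `ExtractConst<N>` models the single instruction that would be emitted when the immediate is known.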

This feature provides more stable runtime behavior instead of throwing exceptions, but it may cause a significant performance regression, so we should avoid the fallback replacement where possible.

For example, the code below is not allowed in C++ but legal in C#.

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static sbyte Extract(Vector256<sbyte> value, byte index)
        {
            index &= 0x1F;
            if (index > 15)
            {
                return Sse41.Extract(Avx.ExtractVector128(value, 1), (byte)(index - 16));
            }
            else
            {
                return Sse41.Extract(Avx.GetLowerHalf(value), index);
            }
        }

In the first return statement, Sse41.Extract receives the expression (byte)(index - 16), which is not a constant locally. However, once the function is called with a literal index argument and inlined at the call site, (byte)(index - 16) can become a JIT-time constant.

The current problem is that we check whether the imm8 argument is constant in the importer, which is too early in some situations (e.g., a cast argument).
In this example, (byte)(index - 16) is not a constant in the importer, but the expression can eventually become a constant in the backend of RyuJIT. If we expand the fallback again after the mid-end optimizations (e.g., CSE, conditional constant propagation, integer-promotion elimination), the CQ (code quality) of imm-intrinsics would be much better.
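
The timing problem can be illustrated with a toy expression tree in C++. This is only a sketch under stated assumptions: `Expr`, `CallSiteArg`, `IsConstant`, and `Fold` are hypothetical and do not correspond to RyuJIT's GenTree or any of its phases. A check made on the tree as written sees a cast node, while the same check made after the call-site literal has been substituted and folded sees a constant.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

struct Expr;
using ExprPtr = std::shared_ptr<Expr>;

// Hypothetical miniature expression IR.
struct Expr
{
    enum class Kind { Const, Param, Sub, CastToByte };
    Kind         kind;
    std::int32_t value; // payload of a Const node
    ExprPtr      lhs;   // operand of Sub and CastToByte
    ExprPtr      rhs;   // second operand of Sub
};

ExprPtr MakeNode(Expr::Kind kind, std::int32_t value, ExprPtr lhs, ExprPtr rhs)
{
    return std::make_shared<Expr>(Expr{kind, value, lhs, rhs});
}
ExprPtr MakeConst(std::int32_t v) { return MakeNode(Expr::Kind::Const, v, nullptr, nullptr); }
ExprPtr MakeParam() { return MakeNode(Expr::Kind::Param, 0, nullptr, nullptr); }
ExprPtr MakeSub(ExprPtr a, ExprPtr b) { return MakeNode(Expr::Kind::Sub, 0, a, b); }
ExprPtr MakeCast(ExprPtr a) { return MakeNode(Expr::Kind::CastToByte, 0, a, nullptr); }

// The early ("importer-time") check: only a literal node counts as constant.
bool IsConstant(const ExprPtr& e) { return e->kind == Expr::Kind::Const; }

// What is known about the argument at the call site after inlining.
struct CallSiteArg
{
    bool         isLiteral;
    std::int32_t value;
};

// Substitutes a literal argument for the parameter and folds constants upward.
ExprPtr Fold(const ExprPtr& e, CallSiteArg arg)
{
    switch (e->kind)
    {
        case Expr::Kind::Const:
            return e;
        case Expr::Kind::Param:
            return arg.isLiteral ? MakeConst(arg.value) : e;
        case Expr::Kind::Sub:
        {
            ExprPtr a = Fold(e->lhs, arg);
            ExprPtr b = Fold(e->rhs, arg);
            if (IsConstant(a) && IsConstant(b)) { return MakeConst(a->value - b->value); }
            return MakeSub(a, b);
        }
        case Expr::Kind::CastToByte:
        {
            ExprPtr a = Fold(e->lhs, arg);
            if (IsConstant(a)) { return MakeConst(static_cast<std::uint8_t>(a->value)); }
            return MakeCast(a);
        }
    }
    assert(false && "unhandled node kind");
    return e;
}
```

Building `MakeCast(MakeSub(MakeParam(), MakeConst(16)))` models (byte)(index - 16): `IsConstant` is false on that tree, but after `Fold` with a literal index of 20 the result is the constant 4.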

cc @CarolEidt @AndyAyersMS @mikedn @tannergooding

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium

4creators · 2018-03-22T01:00:46Z

IMO the part of the function that extracts elements with index > 15 should not be implemented. Developers using HW intrinsics should be able to write Sse41.Extract(Avx.ExtractVector128(value, 1), (byte)(index - 16)) themselves, explicitly passing constant parameters instead of index - 16. To help with that, it should be sufficient to mention it in the docs.

However, this functionality should be part of the Numerics Vector<T> implementation.

tannergooding · 2018-03-22T01:40:20Z

@4creators, I don't agree. This is a helper function for extracting the proper element of a 256-bit vector. This includes the upper 16 indices.

I believe this is one of the "core" helper intrinsics that should be provided by the framework.

fiigii · 2018-03-22T02:17:27Z

Developers using HW intrinsics should be able to write Sse41.Extract(Avx.ExtractVector128(value, 1), (byte)(index - 16))

@4creators This is just an example of code that users may write; it shows the optimization opportunity.

I have moved the implementation of Avx.Insert/Extract into the JIT compiler dotnet/coreclr#17030

RussKeldorph · 2018-03-22T17:43:46Z

@dotnet/jit-contrib

mikedn · 2018-03-22T18:04:33Z

Overall it sounds like a good idea.

The main potential drawback is that the end result (whether a fallback is used or not) depends on the JIT's optimization abilities. Someone writes some code where the fallback happens not to be used and gets good performance. The code ends up running on a different version of the JIT that perhaps fails to do some constant folding, so the fallback version is used and the code gets significantly slower. But such mishaps are always possible when a JIT is used; it's not something that affects only intrinsics.

In terms of implementation there may be some problems. You'll want to delay the use of fallbacks until lowering. I don't think you want to generate the fallback code (that uses large switches) inline, so you'll need to turn the intrinsic node into a call node during lowering. This is not something that the JIT does often today; calls are usually expected to exist before morphing so they pass through fgMorphArgs.

CarolEidt · 2018-03-22T23:07:58Z

We have made it a design parameter that we expect developers using these intrinsics to be savvy enough to use analysis tools to ensure that they are getting what they expect. This is especially true if they are planning to rely on the JIT to optimize and inline.

That said, it is an interesting thought to delay the immediate fallback until Lowering. I am a bit more optimistic that it can be done, perhaps, than @mikedn (we had some issues some years ago when we first started introducing calls in Lowering, but I think the biggest ones have been resolved).

Even still I'm not certain that the benefit is worth the cost. I think we should hold off until we get some feedback from developers.

fiigii · 2018-03-23T01:40:42Z

Is the above optimization possible in RyuJIT?

@CarolEidt @mikedn Thank you so much for the excellent comments. The answer to the above question looks like "YES". But, of course, we need more workloads and data first.

fiigii · 2018-12-13T02:59:05Z

Updated the title and description.
I am trying a solution that expands imm-intrinsic's non-const fallback in morph to get the expected CQ. Perhaps we can leverage the new-style JIT intrinsics' isSpecialIntrinsic feature, which tries to expand an intrinsic (e.g., NI_System_Enum_HasFlag) multiple times after importation. Additionally, I plan to add a new field in GenTreeCall to cache the intrinsic ID (retrieved in the importer), which avoids calling the relatively expensive lookupNamedIntrinsic (currently a linear search).
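
The caching idea can be sketched in C++ as follows. This is a toy model, not JIT code: `IntrinsicId`, `kNameTable`, `LookupByName`, `CallNode`, and `GetIntrinsicId` are hypothetical names and do not reflect the real signature of lookupNamedIntrinsic or the layout of GenTreeCall. It only shows a linear name lookup being paid once per call node instead of on every visit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical intrinsic IDs; a real JIT has a much larger set.
enum class IntrinsicId { Illegal, Enum_HasFlag, Sse41_Extract, Avx_ExtractVector128 };

struct NameEntry
{
    const char* name;
    IntrinsicId id;
};

// Hypothetical name table that has to be scanned linearly.
const NameEntry kNameTable[] = {
    {"Enum.HasFlag", IntrinsicId::Enum_HasFlag},
    {"Sse41.Extract", IntrinsicId::Sse41_Extract},
    {"Avx.ExtractVector128", IntrinsicId::Avx_ExtractVector128},
};

// Counts the linear searches performed; for demonstration only.
int g_lookupCount = 0;

// Linear search by name: the cost grows with the size of the table.
IntrinsicId LookupByName(const char* name)
{
    ++g_lookupCount;
    for (std::size_t i = 0; i < sizeof(kNameTable) / sizeof(kNameTable[0]); ++i)
    {
        if (std::strcmp(kNameTable[i].name, name) == 0)
        {
            return kNameTable[i].id;
        }
    }
    return IntrinsicId::Illegal;
}

// Toy call node with extra fields that remember the resolved ID.
struct CallNode
{
    const char* methodName;
    bool        idCached;
    IntrinsicId cachedId;
};

// Resolves the ID on the first request and reuses it on later visits.
IntrinsicId GetIntrinsicId(CallNode& call)
{
    assert(call.methodName != nullptr);
    if (!call.idCached)
    {
        call.cachedId = LookupByName(call.methodName);
        call.idCached = true;
    }
    return call.cachedId;
}
```

The trade-off in this sketch is the usual one for caching: repeated expansion attempts become cheap, at the price of extra fields on every call node.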

@AndyAyersMS @CarolEidt @mikedn Does this approach look okay to you?

mikedn · 2018-12-13T06:57:11Z

I am trying a solution that expands imm-intrinsic's non-const fallback in morph

Why not in lowering?

- import all non imm intrinsics as usual
- also import all imm intrinsics with constant imm operands as usual
- leave the rest of imm intrinsics as calls
- in lowering attempt to convert these calls to intrinsics if the imm argument changed to a constant

It seems that this should work fine because:

- delaying call->intrinsic conversion to lowering is in general a bad idea because calls are "heavy", the sooner you get rid of them the better. But it's fine to delay imm intrinsics with non-constant imm operands because these are supposed to be rare, special cases.
- generating calls in lowering can be problematic but the opposite should work just fine
- there are cases where constants appear only after VN runs so doing the conversion only in global morph will miss some (I've seen a few such cases when I moved magic division from morph to lowering)
- doing the conversion outside global morph should work but then seriously, morph is already a hodge-podge of everything, I'd avoid adding stuff to it unless necessary. It's necessary to do it in morph only if this enables additional optimizations. This is unlikely in the case of intrinsics because the JIT doesn't do any intrinsic optimizations. And I doubt that it will do this anytime soon, there are far better things to improve in the JIT than trying to optimize intrinsic code. In most cases you'd expect the developers to write good intrinsic code to begin with.
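
The import and lowering steps listed above can be sketched as a toy C++ model. Everything here is hypothetical (`Node`, `ImmOperand`, `Import`, `PropagateConstant`, `LowerAll`) and greatly simplified compared with real JIT IR; in particular it ignores the argument setup that a real call node carries. It only illustrates the decision being made twice, once at import and once at lowering.

```cpp
#include <cassert>
#include <vector>

enum class NodeKind { HWIntrinsic, FallbackCall };

// The imm8 operand as the compiler currently sees it.
struct ImmOperand
{
    bool isConstant;
    int  value; // meaningful only when isConstant is true
};

struct Node
{
    NodeKind   kind;
    ImmOperand imm;
};

// Import: constant immediates become intrinsic nodes right away,
// everything else is left as a call to the fallback method.
Node Import(ImmOperand imm)
{
    return Node{imm.isConstant ? NodeKind::HWIntrinsic : NodeKind::FallbackCall, imm};
}

// Stand-in for mid-end optimizations that discover a constant value later.
void PropagateConstant(Node& node, int value)
{
    node.imm = ImmOperand{true, value};
}

// Lowering: revisit the remaining fallback calls and convert those whose
// immediate has become constant. Returns the number of converted nodes.
int LowerAll(std::vector<Node>& nodes)
{
    int converted = 0;
    for (Node& node : nodes)
    {
        if (node.kind == NodeKind::FallbackCall && node.imm.isConstant)
        {
            assert(node.imm.value >= 0 && node.imm.value <= 255); // must fit in imm8
            node.kind = NodeKind::HWIntrinsic;
            ++converted;
        }
    }
    return converted;
}
```

A node imported with a non-constant immediate stays a `FallbackCall` unless something like `PropagateConstant` runs before `LowerAll`, which mirrors the argument that the later the check happens, the more constants it can see.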

Additionally, I plan to add a new field in GenTreeCall to cache the intrinsic ID

Good luck with that, GenTreeCall is AFAIR already the biggest node in the JIT.

fiigii · 2018-12-13T08:36:59Z

Why not in lowering?

doing the conversion outside global morph should work but then seriously, morph is already a hodge-podge of everything, I'd avoid adding stuff to it unless necessary. It's necessary to do it in morph only if this enables additional optimizations.

@mikedn I had just looked at the NI_System_Enum_HasFlag code, which expands the intrinsic in morph. Thank you so much for the explanation; I will try it in lowering.

fiigii · 2018-12-14T00:22:29Z

@mikedn @CarolEidt I am trying to expand the fallback calls to intrinsic nodes in LowerCall. However, at that point the arguments have already been lowered to ARGPLACE (or PUTARG_REG).

lowering call (before):
N001 (  1,  1) [000040] ------------        t40 =    LCL_VAR   simd32 V04 tmp3         u:1 (last use) $102
N002 (  1,  1) [000041] -c----------        t41 =    CNS_INT   int    1 $47
                                                 /--*  t40    simd32 
                                                 +--*  t41    int    
N003 (  3,  3) [000042] -------N----        t42 = *  HWIntrinsic simd16 ubyte ExtractVector128 $c4
                                                 /--*  t42    simd16 
N005 (  7,  6) [000065] DA--G-----L-              *  STORE_LCL_VAR simd16(AX) V05 tmp4         
N009 (  3,  2) [000066] -------N----        t66 =    LCL_VAR_ADDR byref  V05 tmp4         
N011 (  1,  1) [000046] ------------        t46 =    CNS_INT   int    1 $47
                                                 /--*  t66    byref  arg0 in rcx
                                                 +--*  t46    int    arg1 in rdx
N014 ( 28, 17) [000047] --CXG-------        t47 = *  CALL      int    System.Runtime.Intrinsics.X86.Sse41.Extract $146

How can I get the actual argument (t46) out of ARGPLACE (to check whether it is constant)?

mikedn · 2018-12-14T07:03:34Z

How can I get the actual argument (t46) out of ARGPLACE (to check whether it is constant)?

Hrm, yes, because vectors are still treated as structs for ABI purposes, intrinsics calls are unfortunately rather complicated. Vector args get spilled to temporaries and that forces args into the late arg list. You should find the constant in gtCallLateArgs but it's probably easier to get it by using fgArgInfo->GetArgNode(2).

It would be better if intrinsic calls would follow vector calling conventions to avoid this mess but that's probably not going to happen too soon, if ever.

fiigii · 2018-12-14T07:51:57Z

You should find the constant in gtCallLateArgs but it's probably easier to get it by using fgArgInfo->GetArgNode(2)

Ah, great, thanks!

because vectors are still treated as structs for ABI purposes, intrinsics calls are unfortunately rather complicated.

Yes, another problem is that in LowerCall all the arguments have already been lowered. If we directly transform the fallback call site into an intrinsic node, the allocated registers and stack slots (e.g., t46 and tmp4) would become useless. I am concerned that this may be wasteful.

mikedn · 2018-12-14T08:30:34Z

If we directly transform the fallback call site into an intrinsic node, the allocated registers and stack slots (e.g., t46 and tmp4) would become useless. I am concerned that this may be wasteful.

Yes, you should be able to replace the call with the intrinsic but some of the consequences of call morphing might be more difficult to remove. In particular, since vectors are treated as structs, you'll probably end up with copies.

You really don't have many options. It's either lowering or global morph. And as explained earlier, doing this in global morph will miss various cases.

fiigii · 2018-12-14T08:35:25Z

You really don't have many options.

Hmm, let me try both options.

mikedn · 2018-12-14T10:41:18Z

Hmm, let me try both options.

If you're referring to the morph option then it won't work with your example. index won't become constant until VN/assertion prop.

CarolEidt · 2020-10-13T16:46:31Z

Related/duplicate issues: #11062, #11138, #36070, #38003

echesakov · 2021-07-07T01:58:50Z

I am going to move this to Future.

echesakov · 2022-03-15T19:30:23Z

Un-assigning myself
cc @BruceForstall

fiigii changed the title ~~[RyuJIT] Delay fallback-replacement of imm-intrinsics for better CQ~~ [RyuJIT] Expand imm-intrinsics' fallback for better CQ Dec 13, 2018

msftgits transferred this issue from dotnet/coreclr Jan 31, 2020

msftgits added this to the Future milestone Jan 31, 2020

tannergooding mentioned this issue Apr 15, 2020

API Proposal : Arm Shift and Permute intrinsics #31324

Closed

tannergooding mentioned this issue Jun 17, 2020

Static method that evaluates to a constant not being inlined by .NET Core 3 and 5 JIT #38003

Open

CarolEidt changed the title ~~[RyuJIT] Expand imm-intrinsics' fallback for better CQ~~ [RyuJIT] Avoid method call to fallback intrinsic method if immediate arg becomes constant Oct 12, 2020

CarolEidt modified the milestones: Future, 6.0.0 Oct 13, 2020

JulieLeeMSFT assigned echesakov Mar 23, 2021

JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021

JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021

echesakov modified the milestones: 6.0.0, Future Jul 7, 2021

echesakov removed their assignment Mar 15, 2022

This was referenced Mar 15, 2022

HW intrinsics: Poor codegen for operations requiring imm8 when the value is calculated to constant during JIT #11062

Closed

Reevaluate HW intrinsic immediate const parameters after method containing HW intrinsics is inlined #11138

Closed

BruceForstall added the optimization label Mar 16, 2022

gfoidl mentioned this issue May 18, 2022

Vector{128,256} operations that use MmShuffle fall back to method call SixLabors/ImageSharp#2121

Closed

4 tasks

tannergooding mentioned this issue May 29, 2024

Allow shuffle and other hwintrinsic that require a constant to stay intrinsic if the operand becomes constant later #102827

Merged

dotnet-policy-service bot added the in-pr There is an active PR which will close this issue when it is merged label May 29, 2024

TIHan closed this as completed in #102827 Jun 1, 2024

github-actions bot locked and limited conversation to collaborators Jul 2, 2024
