Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RyuJIT] Avoid method call to fallback intrinsic method if immediate arg becomes constant #9989

Closed
fiigii opened this issue Mar 21, 2018 · 19 comments · Fixed by #102827
Closed
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI design-discussion Ongoing discussion about design without consensus in-pr There is an active PR which will close this issue when it is merged optimization
Milestone

Comments

@fiigii
Copy link
Contributor

fiigii commented Mar 21, 2018

Now, certain hardware intrinsics that accept an imm8 argument would be replaced by a function call (usually the function body is big jump-table) if the imm8 argument is not a JIT time constant.

This feature provides more stable runtime behaviors instead of throwing exceptions, but it may cause the significant performance regression, so we should avoid the fallback-replacement if possible.

For example, the code below is not allowed in C++ but legal in C#.

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static sbyte Extract(Vector256<sbyte> value, byte index)
        {
            index &= 0x1F;
            if (index > 15)
            {
                return Sse41.Extract(Avx.ExtractVector128(value, 1), (byte)(index - 16));
            }
            else
            {
                return Sse41.Extract(Avx.GetLowerHalf(value), index);
            }
        }

In the first return statement, Sse41.Extract gets an expression (byte)(index - 16) that is not a static constant, locally. However, once the function is called with a literal argument of index and inlined at the call-site, (byte)(index - 16) could be a JIT time constant.

The current problem is that we check if the imm8 argument is constant in the importer, which is too early for some situations (e.g., casted argument).
In this example, (byte)(index - 16) is not a constant in the importer, but the expression could finally be a constant at the backend of RyuJIT. If we expand the fallback again after the mid-end optimizations (e.g., CSE, conditional constant propagation, integer-promotion elimination, etc.) the CQ of imm-intrinsics would be much better.

cc @CarolEidt @AndyAyersMS @mikedn @tannergooding

category:cq
theme:hardware-intrinsics
skill-level:expert
cost:medium

@4creators
Copy link
Contributor

IMO the part of function for extracting elements with index > 15 should not be implemented. Developers using HW intrinsics should be able to write Sse41.Extract(Avx.ExtractVector128(value, 1), (byte)(index - 16)) by themselves with explicit usage of constant parameters instead of index - 16. To help in that it should be sufficient to indicate it in docs.

However, this functionality should be a part of Numerics Vector<T> implementation.

@tannergooding
Copy link
Member

@4creators, I don't agree. This is a helper function for extracting the proper element of a 256-bit vector. This includes the upper 16 indices.

I believe this is one of the "core" helper intrinsics that should be provided by the framework.

@fiigii
Copy link
Contributor Author

fiigii commented Mar 22, 2018

Developers using HW intrinsics should be able to write Sse41.Extract(Avx.ExtractVector128(value, 1), (byte)(index - 16))

@4creators This is just an example that shows the optimization opportunity and the users may write.

I have moved the implementation of Avx.Insert/Extract into the JIT compiler dotnet/coreclr#17030

@RussKeldorph
Copy link
Contributor

@dotnet/jit-contrib

@mikedn
Copy link
Contributor

mikedn commented Mar 22, 2018

Overall it sounds like a good idea.

The main potential drawback is that the end result (whether a fallback is used or not) depends on the JIT optimization abilities. Someone writes some code where the fallback happens to not be used and gets good performance. The code ends up running on different version of JIT that perhaps fails to do some constant folding so the fallback version is used and the code gets significantly slower. But such mishaps are always possible when a JIT is used, it's not something that affects only intrinsics.

In terms of implementation there may be some problems. You'll want to delay the use of fallbacks until lowering. I don't think you want to generate the fallback code (that uses large switches) inline so you'll need to turn the intrinsic node into a call node during lowering. This is not something that the JIT does often today, calls are usually expected to exist before morphing so they pass through fgMorphArgs.

@CarolEidt
Copy link
Contributor

We have made it a design parameter that we expect developers using these intrinsics to be savvy enough to use analysis tools to ensure that they are getting what they expect. This is especially true if they are planning to rely on the JIT to optimize and inline.

That said, it is an interesting thought to delay the immediate fallback until Lowering. I am a bit more optimistic that it can be done, perhaps, than @mikedn (we had some issues some years ago when we first started introducing calls in Lowering, but I think the biggest ones have been resolved).

Even still I'm not certain that the benefit is worth the cost. I think we should hold off until we get some feedback from developers.

@fiigii
Copy link
Contributor Author

fiigii commented Mar 23, 2018

Is the above optimization possible in RyuJIT?

@CarolEidt @mikedn Thank you so much for the excellent comments. The answer to the above question looks like "YES". But, of course, we need more work loads and date at first.

@fiigii fiigii changed the title [RyuJIT] Delay fallback-replacement of imm-intrinsics for better CQ [RyuJIT] Expand imm-intrinsics' fallback for better CQ Dec 13, 2018
@fiigii
Copy link
Contributor Author

fiigii commented Dec 13, 2018

Updated the title and description.
I am trying a solution that expands imm-intrinsic's non-const fallback in morph to get the expected CQ. Perhaps, we can leverage the new-style JIT intrinsic's isSpecialIntrinsic feature, which tries to expand intrinsic (i.e., NI_System_Enum_HasFlag) multiple times after importation. Additionally, I plan to add a new field in GenTreeCall to cache the intrinsic ID (retrieved in importer), which avoids calling the relatively expensive lookupNamedIntrinsic (that is a linear search now).

@AndyAyersMS @CarolEidt @mikedn Does this approach look okay to you?

@mikedn
Copy link
Contributor

mikedn commented Dec 13, 2018

I am trying a solution that expands imm-intrinsic's non-const fallback in morph

Why not in lowering?

  • import all non imm intrinsics as usual
  • also import all imm intrinsics with constant imm operands as usual
  • leave the rest of imm intrinsics as calls
  • in lowering attempt to convert these calls to intrinsics if the imm argument changed to a constant

It seems that this should work fine because:

  • delaying call->intrinsic conversion to lowering is in general a bad idea because calls are "heavy", the sooner you get rid of them the better. But it's fine to delay imm intrinsics with non-constant imm operands because these are supposed to be rare, special cases.
  • generating calls in lowering can be problematic but the opposite should work just fine
  • there are cases where constants appear only after VN runs so doing the conversion only in global morph will miss some (I've seen a few such cases when I moved magic division from morph to lowering)
  • doing the conversion outside global morph should work but then seriously, morph is already a hodge- podge of everything, I'd avoid adding stuff to it unless necessary. It's necessary to do it in morph only if this enables additional optimizations. This is unlikely in the case of intrinsics because the JIT doesn't do any intrinsic optimizations. And I doubt that it will do this anytime soon, there are far better things to improve in the JIT, than trying to optimize intrinsic code. In most case you'd expect the developers to write good intrinsic code to begin with.

Additionally, I plan to add a new field in GenTreeCall to cache the intrinsic ID

Good luck with that, GenTreeCall is AFAIR already the biggest node in the JIT.

@fiigii
Copy link
Contributor Author

fiigii commented Dec 13, 2018

Why not in lowering?

doing the conversion outside global morph should work but then seriously, morph is already a hodge- podge of everything, I'd avoid adding stuff to it unless necessary. It's necessary to do it in morph only if this enables additional optimizations.

@mikedn I just looked NI_System_Enum_HasFlag code that expands the intrinsic in morph. Thank you so much for teaching, will try in lowering.

@fiigii
Copy link
Contributor Author

fiigii commented Dec 14, 2018

@mikedn @CarolEidt I am trying to expand the fallback calls to intrinsic nodes in LowerCall. However, at that position, arguments are already lowered to ARGPLACE (or PUTARG_REG)

lowering call (before):
N001 (  1,  1) [000040] ------------        t40 =    LCL_VAR   simd32 V04 tmp3         u:1 (last use) $102
N002 (  1,  1) [000041] -c----------        t41 =    CNS_INT   int    1 $47
                                                 /--*  t40    simd32 
                                                 +--*  t41    int    
N003 (  3,  3) [000042] -------N----        t42 = *  HWIntrinsic simd16 ubyte ExtractVector128 $c4
                                                 /--*  t42    simd16 
N005 (  7,  6) [000065] DA--G-----L-              *  STORE_LCL_VAR simd16(AX) V05 tmp4         
N009 (  3,  2) [000066] -------N----        t66 =    LCL_VAR_ADDR byref  V05 tmp4         
N011 (  1,  1) [000046] ------------        t46 =    CNS_INT   int    1 $47
                                                 /--*  t66    byref  arg0 in rcx
                                                 +--*  t46    int    arg1 in rdx
N014 ( 28, 17) [000047] --CXG-------        t47 = *  CALL      int    System.Runtime.Intrinsics.X86.Sse41.Extract $146

How can I take the actual arguments (t46) out from ARGPLACE (to check it is const or not)?

@mikedn
Copy link
Contributor

mikedn commented Dec 14, 2018

How can I take the actual arguments (t46) out from ARGPLACE (to check it is const or not)?

Hrm, yes, because vectors are still treated as structs for ABI purposes, intrinsics calls are unfortunately rather complicated. Vector args get spilled to temporaries and that forces args into the late arg list. You should find the constant in gtCallLateArgs but it's probably easier to get it by using fgArgInfo->GetArgNode(2).

It would be better if intrinsic calls would follow vector calling conventions to avoid this mess but that's probably not going to happen too soon, if ever.

@fiigii
Copy link
Contributor Author

fiigii commented Dec 14, 2018

You should find the constant in gtCallLateArgs but it's probably easier to get it by using fgArgInfo->GetArgNode(2)

Ah, great, thanks!

because vectors are still treated as structs for ABI purposes, intrinsics calls are unfortunately rather complicated.

Yes, another problem is that in LowerCall all the arguments have already get "lowered". If we directly transform the fallback call-site to an intrinsic node, the allocated registers and stack slots (e.g., t46 and tmp4) would become useless. So that may be wasteful, I concern.

@mikedn
Copy link
Contributor

mikedn commented Dec 14, 2018

If we directly transform the fallback call-site to an intrinsic node, the allocated registers and stack slots (e.g., t46 and tmp4) would become useless. So that may be wasteful, I concern.

Yes, you should be able to replace the call with the intrinsic but some of the consequences of call morphing might be more difficult to remove. In particular, since vectors are treated as structs, you'll probably end up with copies.

You really don't have many options. It's either lowering or global morph. And as explained earlier, doing this in global morph will miss various cases.

@fiigii
Copy link
Contributor Author

fiigii commented Dec 14, 2018

You really don't have many options.

Hmm, let me try the two options both.

@mikedn
Copy link
Contributor

mikedn commented Dec 14, 2018

Hmm, let me try the two options both.

If you're referring to the morph option then it won't work with your example. index won't become constant until VN/assertion prop.

@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the Future milestone Jan 31, 2020
@CarolEidt CarolEidt changed the title [RyuJIT] Expand imm-intrinsics' fallback for better CQ [RyuJIT] Avoid method call to fallback intrinsic method if immediate arg becomes constant Oct 12, 2020
@CarolEidt
Copy link
Contributor

CarolEidt commented Oct 13, 2020

Related/duplicate issues: #11062, #11138, #36070, #38003

@CarolEidt CarolEidt modified the milestones: Future, 6.0.0 Oct 13, 2020
@JulieLeeMSFT JulieLeeMSFT added the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Mar 23, 2021
@JulieLeeMSFT JulieLeeMSFT removed the needs-further-triage Issue has been initially triaged, but needs deeper consideration or reconsideration label Jun 7, 2021
@echesakov
Copy link
Contributor

I am going to move this to Future.

@echesakov echesakov modified the milestones: 6.0.0, Future Jul 7, 2021
@echesakov echesakov removed their assignment Mar 15, 2022
@echesakov
Copy link
Contributor

Un-assigning myself
cc @BruceForstall

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI design-discussion Ongoing discussion about design without consensus in-pr There is an active PR which will close this issue when it is merged optimization
Projects
None yet
10 participants