
[Proposal] Inline delegate invocation #11578

Closed
hypeartist opened this issue May 26, 2016 · 19 comments
Labels
Area-Compilers Resolution-External The behavior lies outside the functionality covered by this repository

Comments

@hypeartist

Is it possible to make it happen? (Maybe following the same rules applied to ordinary methods)

@HaloFour

You might want to ask that in the coreclr repo as method inlining is a concern for the JIT compiler, not the C# compiler.

@orthoxerox
Contributor

@hypeartist Do you mean things like (x => Console.Write(x))("Hello");? This is not supported by the current type inference engine. Delegates must have an explicitly defined type, so they can be assigned to explicitly typed variables or fields, passed as arguments, but not invoked immediately or assigned to variables with inferred types.
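To illustrate the point above, a minimal sketch (names are mine) of the workaround that does compile: give the lambda an explicit delegate type before invoking it.

```csharp
using System;

static class LambdaInvokeDemo
{
    public static int GetLength(string s)
    {
        // (x => x.Length)(s);  // error CS0149: a lambda expression has no type of its own

        // Casting to an explicit delegate type makes immediate invocation legal:
        int viaCast = ((Func<string, int>)(x => x.Length))(s);

        // Or assign to an explicitly typed variable first, as described above:
        Func<string, int> getLen = x => x.Length;
        return getLen(s) == viaCast ? viaCast : -1;
    }
}
```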

@AlgorithmsAreCool

@hypeartist Some of your pain might go away with local functions, which are (presumably) able to be inlined.

int MyFunc(string s)
{
    // Lambda/delegate (cannot be inlined, allocates a delegate object)
    Func<string, int> getLenDelegate = str => str.Length;

    // Local function using arrow syntax (a proposed feature at the time); no allocation, inlinable
    int GetLen(string str) => str.Length;

    return GetLen(s);
}

@hypeartist
Author

hypeartist commented May 27, 2016

@orthoxerox
I mean something like:

        public override void Check(Func<int> fn, out string result, out string expected)
        {
            result = fn().ToString();
            expected = (1 + 2 + 3 + 4).ToString();
        }

Make fn's body be inlined (if it meets the inlining rules).

@AlgorithmsAreCool

@hypeartist

I don't think this is possible at all. Here is why (forgive me if I tell you things you already know).

Inlining is only done at the JIT (or native compiler) level today and will probably always be done there. The JIT compiles the method Check before it is first called with actual data. Put another way, Check() only knows that fn is a reference type at JIT-compile time, not what any possible value (or IL body) of fn might be at invocation time, since the method will only be JIT-compiled once but called many times with many different values of fn.

The only way to take advantage of inlining is with data that is known at compile time (technically, JIT time). So if fn was declared locally with a body inside of Check(), it might be possible to inline its body if the JIT is fancy enough. But not if it is passed in from the outside, because it could be anything (or, more importantly, it could be different things at different times).
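A hand-written sketch of the situation described above (method names are illustrative): Check is JIT-compiled once, yet every call site can hand it a different delegate, so there is no single body to inline.

```csharp
using System;

static class InlineDemo
{
    // JIT-compiled once; at that point `fn` is just an opaque reference.
    public static int Check(Func<int> fn) => fn();

    public static int Run()
    {
        int a = Check(() => 1 + 2 + 3 + 4); // one possible body of fn...
        int b = Check(() => 42);            // ...and a completely different one
        // The same machine code for Check serves both calls,
        // so neither lambda body can be baked into it.
        return a + b;
    }
}
```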

Hypothetically we could make this work if you are willing to re-JIT Check() every time it is called in order to inline fn. This would add a lot of overhead and basically turn the JIT into an interpreter. Keep in mind callsite caching (like what we do to make dynamic fast) wouldn't help, because it only deals with binding calls to functions (argument types, not argument data).

Interpreting is almost always slower than JITing in general, and it might not be worthwhile in this case, because either:

  1. fn would only get called a couple of times:
    • Result: The time spent re-JITing Check() would still outweigh the gain from the optimization.
  2. fn is called enough times (dozens, probably) to be worth the cost of the re-JIT:
    • Result: Inlining fn might bloat the emitted code size of Check() so much that it might not be worth the tradeoff (and this would slow JITing a bit).
  3. fn is called in a tight loop thousands of times:
    • Result: This case might be worthwhile, but the body of fn would have to be tiny to make it useful. The JIT currently only inlines methods smaller than 32 bytes of IL, so your lambda body would need to be very simple.
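As an aside, when a method body is small but still skipped by the inliner's heuristics, C# today lets you nudge the JIT with MethodImplOptions.AggressiveInlining (the method below is a made-up example):

```csharp
using System.Runtime.CompilerServices;

static class InlineHints
{
    // AggressiveInlining asks the JIT to relax its usual size heuristics
    // for this method; it is a hint, not a guarantee.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int SumTo4() => 1 + 2 + 3 + 4;
}
```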

Does this answer your question?

@hypeartist
Author

hypeartist commented May 27, 2016

@AlgorithmsAreCool
Thanks a lot for such a detailed explanation. Now I see clearly. But what about some kind of "call (invocation) chain inspection"? Usually the Func<int> fn definition (or specialization) is "near" enough to backtrace to it and see what fn actually is. Maybe it could even be done at the IL-generation level.

@AlgorithmsAreCool

AlgorithmsAreCool commented May 27, 2016

I think this is possible, yes. I believe it is called Data Flow Analysis, and it is a powerful optimization technique typically seen in C++ compilers.

Let's look at an optimizable case:

void Main()
{
    var data = new[] { 1.0, 2.0, 5.0 ... };
    Func<double, double> scale = Math.Tanh;
    ScaleData(data, scale);
}

void ScaleData(double[] data, Func<double, double> scale)
{
     for(int i = 0; i < data.Length; i++)
         data[i] = scale(data[i]);
}

In this example the human eye can see that scale is a simple reference to a static function. So the C# compiler could emit a specialized version of ScaleData and replace the callsite in Main with a call to the new version, like this (this is actually a bad example, since Math.Tanh is a compiler intrinsic):

void Main()
{
    var data = new[] { 1.0, 2.0, 5.0 ... };

    //Compiler removed
    //Func<double, double> scale = Math.Tanh;
    //ScaleData(data, scale);

    //compiler generated call
    ScaleData_tanh(data);
}

//Compiler generated
void ScaleData_tanh(double[] data)
{
    for(int i = 0; i < data.Length; i++)
    {
         double x = data[i];
         data[i] = (Math.Exp(x) - Math.Exp(-x)) / (Math.Exp(x) + Math.Exp(-x));
    }
}

void ScaleData(double[] data, Func<double, double> scale)
{
     for(int i = 0; i < data.Length; i++)
         data[i] = scale(data[i]);
}

This would be a win because we have saved a call in a 'hot' loop. It could be done by Roslyn or by the JIT. But this optimization has a few limits/drawbacks.

  • ScaleData cannot be virtual or an interface call.
  • scale must be provable as constant at compile (or JIT) time.
  • There are now 2 or more versions of ScaleData in memory. Larger code size is not usually a good thing.

So yes, it is possible to do this. But the C#/VB team typically doesn't like putting heavy optimizations into Roslyn, and the JIT probably doesn't have the time to do this complex flow analysis. The JIT has to do quick optimizations that only take a few milliseconds, to keep the program starting up quickly. This optimization case would be better suited to the CoreRT or .NET Native projects, where the compiler can take several minutes to study and tightly optimize the code.

Did this answer your question?

@svick
Contributor

svick commented May 27, 2016

@AlgorithmsAreCool

This opimization case would be better suited for the CoreRT or .NET Native projects where the compiler can take several minutes to study and tightly optimize the code.

Either that, or tiered JIT.

@AlgorithmsAreCool

AlgorithmsAreCool commented May 27, 2016

@svick I hope that happens one day soon.

I forgot to mention llilc too!

@GSPP

GSPP commented May 29, 2016

This optimization is being done routinely by the HotSpot JVM. The .NET JITs are just very far behind. .NET developers are not used to vcalls being optimized away effectively.

From an optimization standpoint the techniques to devirtualize vcalls, interface calls and delegate calls are the same.

The JVM monitors what classes actually are loaded. That way it often knows that a vcall can have only one possible target. The CLR could do the same thing.
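In C# terms, the closest static analogue of "only one possible target" is a sealed type (the classes below are illustrative): once the receiver is known to be sealed, the vcall has exactly one target and can be devirtualized.

```csharp
public class Shape
{
    public virtual double Area() => 0.0;
}

// Sealing guarantees no further overrides of Area can ever be loaded,
// so a call through a Square reference has exactly one possible target.
public sealed class Square : Shape
{
    public double Side;
    public Square(double side) => Side = side;
    public override double Area() => Side * Side;
}

public static class DevirtDemo
{
    // Devirtualizable call: the receiver's static type is sealed.
    public static double Measure(Square s) => s.Area();
}
```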

The JVM also re-JITs and profiles the actual call targets. It then generates patterns like

if (isExpectedTarget) //99.999% case
 DirectCall(); //Possibly inline
else
 IndirectCall();

The check can often be pushed outside loops. isExpectedTarget is just obj.internalHeaderField == someConstant. Similar for delegates.
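Spelled out for a delegate call in C#, the pattern above might look like the following hand-rolled sketch (a real JIT would compare internal method pointers rather than allocate a delegate for the comparison; all names here are illustrative):

```csharp
using System;

static class GuardedCallDemo
{
    public static double ExpectedTarget(double x) => x * 2.0;

    public static double Apply(Func<double, double> f, double x)
    {
        // Guard: check the delegate against the profiled common target.
        // (Delegate equality here stands in for the JIT's cheap pointer check.)
        if (f == (Func<double, double>)ExpectedTarget) // the 99.999% case
            return ExpectedTarget(x);                  // direct call, possibly inlined
        else
            return f(x);                               // rare case: indirect call
    }
}
```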

Profiling is very easy with a tiered JIT. The first tier profiles, the second tier does not. The main drawback is that a profile that is created once never changes. But that is a rare problem, and this simple scheme already results in high gains.

The JVM has a very sophisticated tiering scheme, with multiple tiers and complicated rules for upgrading and downgrading the tier. I don't think that is necessary to realize gains. Rather, the JVM designers must have realized how important tiering is, so they invested a lot of time to squeeze the last points of performance out of it. .NET can get started with a simpler scheme.

@AlgorithmsAreCool

@GSPP

Good point; HotSpot is a wonderful JIT indeed.

Another thought: Chakra (the IE JIT) is also tiered, and it performs profile-guided monomorphic call optimization at runtime. Chakra can even replace an optimization profile if the execution flow changes in a way that breaks earlier optimizations.

In fact, that blog post has several optimizations that I wish .NET had.

@GSPP

GSPP commented May 29, 2016

It boggles my mind how a platform that runs on hundreds of millions of servers has such an under-invested JIT. Isn't .NET the platform that is 2nd in the number of servers it drives (where Java is 1st)?!

The amount of money that is wasted on, say, 10% more servers needed due to the sub-par code quality with .NET is unfathomably high. It's equal to the TCO of >10 million servers.

Just the amount that Microsoft themselves waste on unneeded servers with Bing and Office 365 must be more than the entire JIT team's budget.

It is known how to make the JIT far better. This is not research work (to a large extent; excluding the machine learning inlining decisions effort that is underway right now).

@AlgorithmsAreCool

AlgorithmsAreCool commented May 29, 2016

Looking into it a bit, I wonder if the departure of Kevin Frei to Facebook in July 2014 has slowed the advancement of RyuJIT. He was the codegen dev lead for .NET, and RyuJIT might have been his design.

Edit: See his twitter

@GSPP

GSPP commented May 29, 2016

RyuJIT is based on some pre-existing JIT codebase, according to Andrew Ayers in his LLVM talk. He says "it's a baroque codebase that nobody feels quite confident in". This wording is slightly misremembered by me; the actual wording might not have been as strong, but it was certainly in this direction.

It's certainly not entirely new code. The tree-based IR design feels wrong; it's not what state-of-the-art compilers (LLVM, HotSpot) use. I don't think anyone would use such a design if writing a new JIT. It would be entirely SSA-based.

@AlgorithmsAreCool

Fascinating! I'll have to find that talk.

@GSPP

GSPP commented May 29, 2016

Yeah... I so don't understand the strategy. Not sure why a new JIT was created that's hardly better than the old one. Maybe it's a stop-gap thing to be able to release CoreCLR? Maybe the old JIT was not suitable for that (licensing and patent fears?).

The team has stated that they want a tiered JIT. So the stop-gap theory makes sense where some powerful code generator is the 2nd tier (maybe LLVM). Once the 2nd tier is there the 1st tier can be an interpreter which is very cheap to implement.

The video is here: http://llvm.org/devmtg/2015-04/ 2:10

"baroque internal architecture ... even the people who work on it are kind of nervous about aspects of the code".

@orthoxerox
Contributor

If only Mike Pall wasn't so tired of writing JIT compilers...

@hypeartist
Author

hypeartist commented May 30, 2016

@GSPP

devirtualize vcalls, interface calls and delegate calls

This is exactly what I was thinking of! (I just lacked the right words :) )

@gafter
Member

gafter commented Jun 3, 2016

This issue is misplaced in the Roslyn repository. The Roslyn compilers are not intended to perform optimizations, but rather to translate source code into IL. The runtime compiler is where optimizations occur (or do not occur, as in this case). You probably want to report this in the coreclr repository.

@gafter gafter closed this as completed Jun 3, 2016
@gafter gafter added the Resolution-External The behavior lies outside the functionality covered by this repository label Jun 3, 2016