JIT: Force inline small BasicBlock methods #95484

amanasifkhalid · 2023-11-30T21:42:17Z

Several small methods aren't inlined in very large methods (such as Compiler::impImportBlockCode) by MSVC, probably due to its inlining budget. Force-inlining some of these methods, starting with a few BasicBlock methods, may slightly improve TP (at least this was the case locally).

ghost · 2023-11-30T21:42:27Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Author:	amanasifkhalid
Assignees:	amanasifkhalid
Labels:	`area-CodeGen-coreclr`
Milestone:	-

amanasifkhalid · 2023-12-01T01:05:33Z

cc @dotnet/jit-contrib. My intent was to limit this to roughly one-line helper methods, but I did see some non-inlined calls to isBBCallAlwaysPair and isBBCallAlwaysPairTail, so I decided to include them. Surprisingly, the TP diffs are biggest on Linux x64, so I guess Clang was running into similar inlining issues?

If we think the TP diffs are worth it, I'll add some comments explaining the decision (and maybe a TODO to remove these eventually?).

SingleAccretion

If you could collect a per-method TP diff, it would be interesting to see where this makes a difference.

SingleAccretion · 2023-12-01T08:31:36Z

src/coreclr/jit/block.cpp

-//    "retless" BBJ_CALLFINALLY blocks due to a requirement to use the BBJ_ALWAYS for
-//    generating code.
-//
-bool BasicBlock::isBBCallAlwaysPair() const


Moving methods from a .cpp file to a header is a step back w.r.t. maintainability (the coding convention encourages placing methods in implementation files). Does it make a good enough difference to pay for that?

amanasifkhalid · 2023-12-01

Locally on Windows x64, I observed an additional 0.01% or 0.02% TP improvement; technically a 25-33% improvement over not force-inlining these methods, but only because the TP improvements were small on Windows x64 to begin with. As for maintainability, I agree I'd rather not move code previously in implementation files into header files, but I'm curious where we draw the line for "very small inline member function implementations" in the coding convention -- just one-line methods? Small methods with no function calls to avoid size increases from inlining?
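To illustrate the trade-off in this exchange, here is a minimal sketch using a hypothetical `ToyBlock` type (not the JIT's `BasicBlock`; all names are assumptions made for illustration). A member defined in the class body is visible to every translation unit that includes the header, while a member defined out of line would normally live in a .cpp file, where callers in other translation units cannot inline it without link-time or whole-program optimization.

```cpp
#include <cassert>

// Hypothetical stand-in type; names and fields are illustrative only.
struct ToyBlock
{
    int       flags = 0;
    ToyBlock* next  = nullptr;

    // Defined in the class body (i.e. in the header): implicitly inline,
    // so the body is visible at every call site that includes the header.
    bool HasFlag(int mask) const
    {
        return (flags & mask) != 0;
    }

    // Declared only; the definition is out of line.
    bool IsPairHead(int mask) const;
};

// In a real project this definition would sit in a .cpp file. Callers in
// other translation units would then see only the declaration above.
bool ToyBlock::IsPairHead(int mask) const
{
    return HasFlag(mask) && (next != nullptr);
}
```

The cost of the in-header form is the one raised in the review: every edit to the body forces a rebuild of all files that include the header.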

SingleAccretion · 2023-12-01T08:32:48Z

src/coreclr/jit/block.h

@@ -538,7 +538,7 @@ struct BasicBlock : private LIR::Range
        return bbJumpKind;
    }

-    void SetJumpKind(BBjumpKinds jumpKind)
+    __forceinline void SetJumpKind(BBjumpKinds jumpKind)


__forceinline is an MSVC-specific construct. It works, but it would be more idiomatic to use the FORCEINLINE macro.
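A minimal sketch of the macro approach, assuming a hypothetical macro name `EXAMPLE_FORCEINLINE` and a toy `ExampleBlock` type (neither is the runtime's actual definition). The macro selects `__forceinline` under MSVC and the `always_inline` attribute under GCC/Clang, so the declarations themselves stay compiler-neutral.

```cpp
#include <cassert>

// Hypothetical macro for illustration; not the definition used in dotnet/runtime.
#if defined(_MSC_VER)
#define EXAMPLE_FORCEINLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#define EXAMPLE_FORCEINLINE inline __attribute__((always_inline))
#else
#define EXAMPLE_FORCEINLINE inline // fallback: only a hint to the compiler
#endif

// Hypothetical enum standing in for a jump-kind type.
enum ExampleJumpKind
{
    EX_NONE,
    EX_ALWAYS,
    EX_COND
};

struct ExampleBlock
{
    ExampleJumpKind kind = EX_NONE;

    // One-line accessors are the kind of candidates discussed in this PR.
    EXAMPLE_FORCEINLINE ExampleJumpKind GetJumpKind() const
    {
        return kind;
    }

    EXAMPLE_FORCEINLINE void SetJumpKind(ExampleJumpKind newKind)
    {
        kind = newKind;
    }
};
```

Even with such a macro, the compiler may still decline to inline in some situations (for example, recursive calls), so it is best read as a strong request rather than a guarantee.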

jakobbotsch · 2023-12-01T09:02:02Z

I think, as @jkotas has mentioned, that it does not make sense to try to optimize inlining decisions in non-PGO-enabled builds, since we expect PGO to be enabled and seriously influence these decisions. If we want to do that we should wait until we have native PGO enabled again and then do one-off TP collection in a PGO build with and without the change.

FWIW, there's some MSVC switches like /d2inlinelogfull:FunctionName and /d2inlinestats that can be used to give more information about inlining decisions. It might be interesting to see why MSVC decides not to inline.

amanasifkhalid · 2023-12-01T16:10:26Z

> If we want to do that we should wait until we have native PGO enabled again and then do one-off TP collection in a PGO build with and without the change.

Good idea, I'll wait for #91297 to be merged in. I think this should happen soon, since #95510 was recently opened.

> FWIW, there's some MSVC switches like /d2inlinelogfull:FunctionName and /d2inlinestats that can be used to give more information about inlining decisions.

Thanks for mentioning that. I suspect our exceptionally large callers are exhausting the inlining budget pretty quickly, leaving out the easy inlining candidates. I'll take another look once we have PGO data to see if this is an issue.

amanasifkhalid · 2023-12-01T16:12:26Z

By the way, we have some existing uses of FORCEINLINE (and __forceinline) in the codebase that might be worth revisiting with the updated PGO data. If the new PGO data negates the need for FORCEINLINE, then maybe I can turn this into a cleanup PR.

tannergooding · 2023-12-01T16:31:19Z

> we expect PGO to be enabled and seriously influence these decisions

Are we currently enabling and using PGO on all platforms? I had thought we still had some where it was not and could not be trivially enabled?

> It might be interesting to see why MSVC decides not to inline.

We are still using /Ox (Enable Most Speed Optimizations) and not /O2 (Maximize Speed) for Release builds: https://github.com/dotnet/runtime/blob/main/eng/native/configureoptimization.cmake

I still think this is a bit backwards: you'd likely be more inclined to debug Checked builds, so enabling most, but not all, optimizations to improve debuggability makes sense there. However, most users won't be trying to debug a Release build that we've shipped to production, so we really want it to run as quickly as possible there. We're using -O3 on Unix for release already.

By using /Ox instead of /O2, we are still getting the following:

- /Og (Global Optimizations)
- /Oi (Generate Intrinsic Functions)
  - We might be manually enabling this for all builds: https://github.com/dotnet/runtime/blob/main/eng/native/configurecompiler.cmake#L781
- /Ot (Favor Fast Code)
- /Oy (Frame-Pointer Omission)
  - We might be manually disabling this for all builds: https://github.com/dotnet/runtime/blob/main/eng/native/configurecompiler.cmake#L782
  - Worth noting it's ignored for x64 builds, however
- /Ob2 (Inline Function Expansion)

However, we're missing out on:

- /GF (Eliminate Duplicate Strings)
- /Gy (Enable Function-Level Linking)
  - We might be manually enabling this for all builds: https://github.com/dotnet/runtime/blob/main/eng/native/configurecompiler.cmake#L785C80-L785C80

Additionally, we aren't enabling other perf optimizations that might be beneficial (we do notably enable /LTCG in some cases):

- /GL (Whole Program Optimization)
- /Ob3 (Inline Function Expansion)
  - This is new in VS2019+ and results in "better" inlining heuristics

amanasifkhalid · 2023-12-01T16:42:35Z

> However, most users won't be trying to debug a Release build that we've shipped to production and so we really want it to run as quickly as possible there.

I agree with this sentiment, but could switching to /O2 impact how we support users facing issues in production? In other words, are there situations where we only have a Release build we can debug? Perhaps such cases are rare enough (and already tricky to debug) that they're worth the performance boost.

jkotas · 2023-12-01T16:45:26Z

/O2 vs /Ox is not about debuggability. It is about performance. Our historical experience is that enabling max speed optimizations makes the runtime slower. If you can prove that it is not the case, feel free to change it.

amanasifkhalid · 2023-12-01T16:48:00Z

> Our historical experience is that enabling max speed optimizations makes the runtime slower.

Do you know when we last did a comparison? If we haven't tried the new inlining heuristics, I'd be interested in trying them.

tannergooding · 2023-12-01T16:48:41Z

> Our historical experience is that enabling max speed optimizations makes the runtime slower.

That's surprising. I wouldn't expect that, given that it looks like we're only really missing /GF (Eliminate Duplicate Strings) and that we seem to be manually enabling /Gy (Enable Function-Level Linking).

I think it'd be worth testing again and then testing with /Ob3 and potentially /GL, which are relatively newer options.

jkotas · 2023-12-01T17:17:18Z

> Do you know when we last did a comparison?

It has been a while.

> /GL

/GL should be enabled by CMAKE_INTERPROCEDURAL_OPTIMIZATION.

EgorBo · 2023-12-01T17:30:02Z

I tested /O2 in #93336 and didn't detect any TP diffs (for the JIT).

BruceForstall · 2023-12-01T17:45:12Z

We absolutely should change MSVC compilation switches for Release to use -O2 instead of -Ox; see #53849. Checked and Release should have the same compiler optimization switches. There are bugs for doing that for x86. I might have done it for non-x86, but testing Release is "hard" -- our CI systems don't do it, and you'd want to do some perf testing on it also.

jkotas · 2023-12-01T17:47:50Z

> Checked and Release should have the same compiler optimization switches

I disagree. /GL makes sense for Release, but it does not make sense for Checked.

BruceForstall · 2023-12-01T17:54:10Z

> I disagree. /GL makes sense for Release, but it does not make sense for Checked.

Why do you think this? The goal for Checked is to run as fast as possible but still have DEBUG checking (asserts, etc.) enabled.

Is it because you want Checked builds to compile faster (and maybe be incremental?) for faster dev inner-loop?

One optimization I admit would not make sense in Checked is native PGO, since we'll presumably never collect Checked PGO data.

jkotas · 2023-12-01T17:58:19Z

> Is it because you want Checked builds to compile faster (and maybe be incremental?) for faster dev inner-loop?

Yes. Also, the speedup from /GL for code with debug asserts and no PGO data is not going to be dramatic.

Force inline small BasicBlock methods

9eda59d

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) Nov 30, 2023

ghost assigned amanasifkhalid Nov 30, 2023

amanasifkhalid marked this pull request as ready for review December 1, 2023 00:58

SingleAccretion reviewed Dec 1, 2023

Use FORCEINLINE macro

f934f03

amanasifkhalid closed this May 21, 2024

github-actions bot locked and limited conversation to collaborators Jun 21, 2024
