Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Force inline small BasicBlock methods #95484

Closed
wants to merge 2 commits into from

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Nov 30, 2023

Several small methods aren't inlined in very large methods (such as Compiler::impImportBlockCode) by MSVC, probably due to its inlining budget. Force-inlining some of these methods, starting with a few BasicBlock methods, may slightly improve TP (at least this was the case locally).

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Nov 30, 2023
@ghost ghost assigned amanasifkhalid Nov 30, 2023
@ghost
Copy link

ghost commented Nov 30, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Several small methods aren't inlined in very large methods (such as Compiler::impImportBlockCode) by MSVC, probably due to its inlining budget. Force-inlining some of these methods, starting with a few BasicBlock methods, may slightly improve TP.

Author: amanasifkhalid
Assignees: amanasifkhalid
Labels:

area-CodeGen-coreclr

Milestone: -

@amanasifkhalid
Copy link
Member Author

cc @dotnet/jit-contrib. My intent was to limit this to roughly one-line helper methods, but I did see some non-inlined calls to isBBCallAlwaysPair and isBBCallAlwaysPairTail, so I decided to include them. TP diffs are surprisingly biggest on Linux x64, so I guess Clang was running into similar inlining issues?

If we think the TP diffs are worth it, I'll add some comments explaining the decision (and maybe a TODO to remove these eventually?).

Copy link
Contributor

@SingleAccretion SingleAccretion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you could collect a per-method TP diff, it would be interesting to see where this makes a difference.

// "retless" BBJ_CALLFINALLY blocks due to a requirement to use the BBJ_ALWAYS for
// generating code.
//
bool BasicBlock::isBBCallAlwaysPair() const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving methods from a .cpp file to a header is a takeback w.r.t. maintainability (the coding convention encourages placing methods in implementation files). Does it make a good enough difference to pay for that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locally on Windows x64, I observed an additional 0.01% or 0.02% TP improvement; technically a 25-33% improvement over not force-inlining these methods, but only because the TP improvements were small on Windows x64 to begin with. As for maintainability, I agree I'd rather not move code previously in implementation files into header files, but I'm curious where we draw the line for "very small inline member function implementations" in the coding convention -- just one-line methods? Small methods with no function calls to avoid size increases from inlining?

@@ -538,7 +538,7 @@ struct BasicBlock : private LIR::Range
return bbJumpKind;
}

void SetJumpKind(BBjumpKinds jumpKind)
__forceinline void SetJumpKind(BBjumpKinds jumpKind)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__forceinline is an MSVC-original construct. It works, but it would be more idiomatic to use the FORCEINLINE macro.

@jakobbotsch
Copy link
Member

I think like @jkotas has mentioned that it does not make sense to try to optimize inlining decisions in non PGO enabled builds since we expect PGO to be enabled and seriously influence these decisions. If we want to do that we should wait until we have native PGO enabled again and then do one-off TP collection in a PGO build with and without the change.

FWIW, there's some MSVC switches like /d2inlinelogfull:FunctionName and /d2inlinestats that can be used to give more information about inlining decisions. It might be interesting to see why MSVC decides not to inline.

@amanasifkhalid
Copy link
Member Author

If we want to do that we should wait until we have native PGO enabled again and then do one-off TP collection in a PGO build with and without the change.

Good idea, I'll wait for #91297 to be merged in. I think this should happen soon, since #95510 was recently opened.

FWIW, there's some MSVC switches like /d2inlinelogfull:FunctionName and /d2inlinestats that can be used to give more information about inlining decisions.

Thanks for mentioning that. I suspect our exceptionally large callers are exhausting the inlining budget pretty quickly, leaving out the easy inlining candidates. I'll take another look once we have PGO data to see if this is an issue.

@amanasifkhalid
Copy link
Member Author

By the way, we have some existing uses of FORCEINLINE (and __forceinline) in the codebase that might be worth revisiting with the updated PGO data. If the new PGO data negates the need for FORCEINLINE, then maybe I can turn this into a cleanup PR.

@tannergooding
Copy link
Member

tannergooding commented Dec 1, 2023

we expect PGO to be enabled and seriously influence these decisions

Are we currently enabling and using PGO on all platforms? I had thought we still had some where it was not and could not be trivially enabled?

It might be interesting to see why MSVC decides not to inline.

We are still using /Ox (Enable Most Speed Optimizations) and not /O2 (Maximize Speed) for Release builds: https://github.com/dotnet/runtime/blob/main/eng/native/configureoptimization.cmake

I still think this is a bit backwards since you'd likely be more inclined to debug Checked builds and so enabling most, but not all optimizations to improve debuggability makes sense there. However, most users won't be trying to debug a Release build that we've shipped to production and so we really want it to run as quickly as possible there. We're using -O3 on Unix for release already.

By using /Ox instead of /O2, we are still getting the following:

However, we're missing out on:

Additionally, we aren't enabling other perf optimizations that might be beneficial (we do notably enable /LTCG in some cases):

  • /GL (Whole Program Optimization)
  • /Ob3 (Inline Function Expansion)
    • This is new in VS2019+ and results in "better" inlining heuristics

@amanasifkhalid
Copy link
Member Author

However, most users won't be trying to debug a Release build that we've shipped to production and so we really want it to run as quickly as possible there.

I agree with this sentiment, but could switching to /O2 impact how we support users facing issues in production? In other words, are there situations where we only have a Release build we can debug? Perhaps such cases are rare enough (and already tricky to debug) that they're worth the performance boost.

@jkotas
Copy link
Member

jkotas commented Dec 1, 2023

/O2 vs /Ox is not about debuggability. It is about performance. Our historical experience is that enabling max speed optimizations makes the runtime slower. If you can prove that it is not the case, feel free to change it.

@amanasifkhalid
Copy link
Member Author

Our historical experience is that enabling max speed optimizations makes the runtime slower.

Do you know when we last did a comparison? If we haven't tried the new inlining heuristics, I'd be interested in trying them.

@tannergooding
Copy link
Member

Our historical experience is that enabling max speed optimizations makes the runtime slower.

That's surprising. I wouldn't expect that given as it looks like we're only really missing /GF (Eliminate Duplicate Strings); given that we seem to be manually enabling /Gy (Enable Function-Level Linking).

I think it'd be worth testing again and then testing with /Ob3 and potentially /GL, which are relatively newer options.

@jkotas
Copy link
Member

jkotas commented Dec 1, 2023

Do you know when we last did a comparison?

It has been a while.

/GL

/GL should be enabled by CMAKE_INTERPROCEDURAL_OPTIMIZATION.

@EgorBo
Copy link
Member

EgorBo commented Dec 1, 2023

I tested O2 here #93336 and didn't detect any TP diffs (for jit)

@BruceForstall
Copy link
Member

We absolutely should change MSVC compilation switches for Release to use -O2 instead of -Ox. #53849. Checked and Release should have the same compiler optimization switches. There are bugs for doing that for x86. I might have done it for non-x86 but testing Release is "hard" -- our CI systems don't do it, and you'd want to do some perf testing on it also.

@jkotas
Copy link
Member

jkotas commented Dec 1, 2023

Checked and Release should have the same compiler optimization switches

I disagree. /GL makes sense for Release, but it does not make sense for Checked.

@BruceForstall
Copy link
Member

I disagree. /GL makes sense for Release, but it does not make sense for Checked.

Why do you think this? The goal for Checked is to run as fast as possible but still have DEBUG checking (asserts, etc.) enabled.

Is it because you want Checked builds to compile faster (and maybe be incremental?) for faster dev inner-loop?

One optimization I admit would not make sense in Checked is native PGO, since we'll presumably never collect Checked PGO data.

@jkotas
Copy link
Member

jkotas commented Dec 1, 2023

Is it because you want Checked builds to compile faster (and maybe be incremental?) for faster dev inner-loop?

Yes. Also, the speed up from /GY for code with debug asserts and no PGO data is not going to be dramatic.

@github-actions github-actions bot locked and limited conversation to collaborators Jun 21, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants