-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Substitute GT_RET_EXPR in inline candidate arguments #69117
Substitute GT_RET_EXPR in inline candidate arguments #69117
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsThese
|
src/coreclr/jit/fginline.cpp
Outdated
// Yes. We may still have GT_RET_EXPR in arguments, | ||
// substitute those here so the inlinee does not need to | ||
// deal with them. | ||
// Note that we leave late devirts in arguments for when we | ||
// handle inlined statements as fgLateDevirtualization | ||
// cannot handle being called twice on a late devirt | ||
// candidate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not ideal, in a case where we could late-devirt an argument containing a call you could imagine doing so expose valuable information to the inlinee for whole-program optimization. We might consider looking into how to fix this if we decide to invest in interprocedural analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Decided to fix this, didn't turn out to be too difficult.
/azp run Fuzzlyn, runtime-coreclr libraries-jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
/azp run Fuzzlyn, runtime-coreclr libraries-jitstress |
Azure Pipelines successfully started running 2 pipeline(s). |
Fuzzlyn failures are preexisting. libraries-jitstress failures are #69154 and #68803. Diffs. Generally we see small improvements in the non-test code, and some regressions in the test code. Presumably these are because we now have fewer "opaque" nodes during inlining. I will spot check these when I have some more time, but I don't think we need to block the review on them. There are some TP regressions, but they are mostly paid for by #69120 that I noticed while working on this. I think there are additional improvements to be made in the future; in particular, we currently walk the argument nodes once when we see the inline candidate, and then another time after they are added as part of inlining (because cc @dotnet/jit-contrib PTAL @AndyAyersMS |
I am curious about what's behind some of the larger diffs here, so please do take a look
|
Hmm yes. And after looking at the diff summary again I can see that
is no longer true. That's interesting because it was true before my latest commit 41e0af2, that changes late devirt to happen on the call arguments as well. These were the diffs before that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look good.
I wonder if the diffs here are because this change uncovers more devirt + inlines?
src/coreclr/jit/importer.cpp
Outdated
@@ -21299,6 +21300,7 @@ void Compiler::impDevirtualizeCall(GenTreeCall* call, | |||
// it's a union field used for other things by virtual | |||
// stubs) | |||
call->gtInlineCandidateInfo = nullptr; | |||
call->gtCallMoreFlags &= ~GTF_CALL_M_LATE_DEVIRT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should rename this flag? GTF_CALL_M_HAS_LATE_DEVIRT_INFO
perhaps? I introduced it but the name could better describe what it means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
The difference here is that we do another inline because of: -Inline candidate callsite is boring. Multiplier increased to 2.3.
+Inline candidate callsite is warm. Multiplier increased to 3. This happens because we now do the following substitution: +Replacing the return expression placeholder [000164] with [000812]
+ [000164] --C-------- ▌ RET_EXPR ref (inl return expr [000812])
+
+Inserting the inline return expression
+ [000812] ---XG------ ▌ FIELD ref _headers
+ [000811] ----------- └──▌ LCL_VAR ref V13 tmp6
+
Expanding INLINE_CANDIDATE in statement STMT00050 in BB105:
STMT00050 ( ??? ... ??? )
[000167] I-C-G------ ▌ CALL nullcheck int HttpHeaders.TryAddWithoutValidation (exactContextHnd=0x00007FFC32571731)
- [000164] --C-------- this ├──▌ RET_EXPR ref (inl return expr [000812])
+ [000812] ---XG------ this ├──▌ FIELD ref _headers
+ [000811] ----------- │ └──▌ LCL_VAR ref V13 tmp6
[000165] ----------- arg1 ├──▌ LCL_VAR ref V01 arg1
[000166] ----------- arg2 └──▌ LCL_VAR ref V03 loc0 And that |
The above may also explain parts of the TP diff, since if we do more inlines we expect to spend more time. |
In asm.benchmarks.run.windows.x64.checked 8113 we have -; 0 inlinees with PGO data; 59 single block inlinees; 39 inlinees without PGO data
+; 0 inlinees with PGO data; 69 single block inlinees; 48 inlinees without PGO data The cause is that we create a couple fewer locals, which means in the new version we do one less lva table grow (which has a *= 2 multiplier). Then we end up with a bunch of extra inlines due to space in the lva table: -Caller has 148 locals. Multiplier decreased to 9.6668.
+Caller has 98 locals. Multiplier decreased to 10.2186.
calleeNativeSizeEstimate=1122
callsiteNativeSizeEstimate=115
-benefit multiplier=9.6668
+benefit multiplier=10.2186
-threshold=1111
+threshold=1175
-Native estimate for function size exceeds threshold for inlining 112.2 > 111.1 (multiplier = 9.6668)
+Native estimate for function size is within threshold for inlining 112.2 <= 117.5 (multiplier = 10.2186) Seems strange that we are using the capacity here: runtime/src/coreclr/jit/inlinepolicy.cpp Lines 1708 to 1714 in 96db556
Should this be using lvaCount instead?
|
I think this is something @EgorBo added... yeah, it seems odd to check capacity and not how much is actually in use. |
In aspnet.benchmarks.run.windows.x64.checked 8840 we also have -; 0 inlinees with PGO data; 243 single block inlinees; 63 inlinees without PGO data
+; 0 inlinees with PGO data; 243 single block inlinees; 67 inlinees without PGO data Cause is the same as above, we end up with fewer locals and decide to inline more because of it. In 10502 ( INLINER: during 'fgInline' result 'success' reason 'aggressive inline attribute' for 'System.Text.Json.JsonSerializer:ReadFromSpan(System.ReadOnlySpan`1[Char],System.Text.Json.Serialization.Metadata.JsonTypeInfo):MicroBenchmarks.Serializers.SimpleStructWithProperties' calling 'System.Runtime.CompilerServices.Unsafe:SizeOf():int'
INLINER: during 'fgInline' result 'success' reason 'aggressive inline attribute'
+
+Replacing the return expression placeholder [000347] with [000391]
+ [000347] --C-------- ▌ RET_EXPR int (inl return expr [000391])
+
+Inserting the inline return expression
+ [000391] ----------- ▌ CNS_INT int 1
+
+
+Folding long operator with constant nodes into a constant:
+ [000348] --C-------- ▌ CAST long <- int
+ [000391] ----------- └──▌ CNS_INT int 1
+Bashed to long constant:
+ [000348] ----------- ▌ CNS_INT long 1
+
+Folding binary operator with a constant operand:
+ [000349] --C-------- ▌ MUL long
+ [000346] ----------- ├──▌ LCL_VAR long V27 tmp20
+ [000348] ----------- └──▌ CNS_INT long 1
+Transformed into:
+ [000346] ----------- ▌ LCL_VAR long V27 tmp20
Expanding INLINE_CANDIDATE in statement STMT00072 in BB50:
STMT00072 ( INL17 @ ??? ... ??? ) <- INLRT @ 0x06A[E-]
[000350] I-C-G------ ▌ CALL void System.SpanHelpers.ClearWithoutReferences (exactContextHnd=0x00007FFF1E6F83B1)
[000343] ----------- arg0 ├──▌ LCL_VAR byref V26 tmp19
- [000349] --C-------- arg1 └──▌ MUL long
- [000346] ----------- ├──▌ LCL_VAR long V27 tmp20
- [000348] --C-------- └──▌ CAST long <- int
- [000347] --C-------- └──▌ RET_EXPR int (inl return expr [000391])
+ [000346] ----------- arg1 └──▌ LCL_VAR long V27 tmp20 This leads to one fewer statements added: ------------ Statements (and blocks) added due to the inlining of call [000350] -----------
-
-Arguments setup:
-STMT00084 ( INL17 @ ??? ... ??? ) <- INLRT @ 0x06A[E-]
- [000410] -AC-------- ▌ ASG long
- [000409] D------N--- ├──▌ LCL_VAR long V32 tmp25
- [000349] --C-------- └──▌ MUL long
- [000346] ----------- ├──▌ LCL_VAR long V27 tmp20
- [000348] --C-------- └──▌ CAST long <- int
- [000347] --C-------- └──▌ RET_EXPR int (inl return expr [000391]) which finally means we clone a finally that brings the extra code: -Finally in EH#0 has 16 statements, limit is 15; skipping.
+EH#0 is a candidate for finally cloning: 13 blocks, 15 statements 8810 is another case of more inlines: -; 0 inlinees with PGO data; 33 single block inlinees; 30 inlinees without PGO data
+; 0 inlinees with PGO data; 42 single block inlinees; 34 inlinees without PGO data In this case we manage to fold some arguments to constants, which results in us folding some branches during import and overall using less budget, so we run out of budget later than before. I think from what I've seen so far that the diffs are good so will stop checking more cases. |
Timeouts are known. Can you please take a look at the above diffs and approve if it sounds fine @AndyAyersMS? |
Ping @AndyAyersMS |
Sorry, got caught up in other things. Looking now. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for looking at diffs.
These
GT_RET_EXPR
nodes would previously make it into the inlinees as part of the inline argument table. This was confusing to me, and it additionally means that we callgtRetExprVal()
in a few places to compensate (while for non-toplevel cases it would essentially cause the argument node containing theGT_RET_EXPR
to become "opaque" to the JIT). With this change we only ever expect to substituteGT_RET_EXPR
insidefgUpdateInlineReturnExpressionPlaceHolder
called from the mainfgInline
loop, so I have inlinedgtRetExprVal()
here and removed that function.