-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: peel off dominant switch case under PGO #52827
Conversation
If we have PGO data and the dominant non-default switch case has more than 30% of the profile, add an explicit test for that case upstream of the switch. We don't see switches all that often anymore as CSC is quite aggressive about turning them into if-then-else trees, but they still show up in the async methods.
@EgorBo PTAL Per SPMI, asm diffs in asp.net but nowhere else. Code size generally increases a bit, though it depends on what happens in flow opts and in LSRA. The
|
return bbsDstTab[bbsCount - 1]; | ||
} | ||
}; | ||
struct BBswtDesc; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rearranged so that I could refer to BasicBlock::weight_t
within BBswtDesc
.
src/coreclr/jit/fgprofile.cpp
Outdated
{ | ||
assert(block->bbJumpKind == BBJ_SWITCH); | ||
|
||
const BasicBlock::weight_t sufficientSamples = 100.0f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it expected that it won't work for dynamic PGO where maxSamples will be around 30 always?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point. For dynamic PGO we'll see at least 30 calls to a method, but we often see many more than that. Also this is the count of executions of the switch, not of the method; the switch count can be higher or lower depending.
But changing this to 30 makes sense.
The two failures are both running crossgen2 so potentially related. Will investigate. |
src/coreclr/jit/fgprofile.cpp
Outdated
for (Edge* edge = dominantEdge->m_nextOutgoingEdge; edge != nullptr; edge = edge->m_nextOutgoingEdge) | ||
{ | ||
if (edge->m_weightKnown) | ||
{ | ||
if (!dominantEdge->m_weightKnown || (edge->m_weight > dominantEdge->m_weight)) | ||
{ | ||
dominantEdge = edge; | ||
} | ||
} | ||
} | ||
|
||
if (!dominantEdge->m_weightKnown) | ||
{ | ||
JITDUMP("No edges with known counts, sorry\n"); | ||
return; | ||
} | ||
|
||
BasicBlock::weight_t fraction = dominantEdge->m_weight / info->m_weight; | ||
|
||
// Because of count inconsistency we can see nonsensical ratios. Cap these. | ||
// | ||
if (fraction > 1.0) | ||
{ | ||
fraction = 1.0; | ||
} | ||
|
||
if (fraction < sufficientFraction) | ||
{ | ||
JITDUMP("Maximum edge likelihood is " FMT_WT " < " FMT_WT "; not sufficient to trigger peeling)\n", fraction, | ||
sufficientFraction); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly this optimization will always trigger for switch cases with 3 or fewer cases, even if no switch case is dominant. Should sufficientFraction
instead be computed based on the number of cases, e.g. 1.0 / (numCases - 1)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should
sufficientFraction
instead be computed based on the number of cases...?
Maybe? Seems like 3 could be special case.
The likelihood needs to be high enough to offset the cost of the extra compare and branch if we guess wrong. So I think the likelihood should always be at least something like 0.3 and not go lower for larger switches. That covers switches with 4 or more cases.
We already specially transform switches with 1-2 cases in fgOptimizeSwitchBranches
. We likely also handle case switches, though I don't see where we look for these.
That leaves 3 case switches. If we have a dominant non-default case over 0.3 and are also going to peel for the default then we're basically reducing the switch to a compare tree, but not getting the full benefit of doing so. I think this is ok for now.
(Some other notes on switch peeling)
Ideally, once we peel, the residual switch would get simplified and perhaps some of this other logic would kick in and simplifies things further. We could do this now if we peeled the highest numbered case, but not if we peel some other case.
Since we're always peeling the default (when lowering, so it would be peeled under any peel we do here), we should perhaps change strategy if the default case has a reasonable likelihood. For example, say there are 10 cases and the default case likelihood is 0.5, and case 0 has likelihood 0.25. We'd be better off first peeling for the default case. Once we do that, we can renormalize the switch likelihoods (multiplying by 1 / ( 1 - 0.5 )) and so case 0 relative likelihood is now 0.5, so we should peel for that case.
Of course once we do that second peel we can renormalize again and perhaps another case now becomes likely enough that we'd be willing to peel for that too. The rule of thumb I've seen here is that going up to 3 levels here can pay off (say the dominant cases are 0.9, then 0.09, then 0.009, these all become 0.9 with successive normalization). But I'm not sure we're ready for this just yet, as we also have to keep wonder how accurate our profile is when we start getting into the tail of the frequency distribution, and balance likely performance win vs code size increases. And there's some cost to having a high branch density.
Failure is:
so some sort of switch invariant is not being honored. |
Looks like more issues to chase down... |
Block coalescing was not making the combined block an "IBC" block if either block was, so we were losing IBC flag on a switch that we'd selected for peeling. Fixed the coalescing logic. |
Failure is known issue #52710. |
Should we consider (not in this change) essentially "re-constituting switches" such that we can optimize them using this method? Or, if not that literally, given an if-then-else tree for which we have PGO data, reorder the conditions, if we can prove they are mutually exclusive with no intervening side-effects (say). E.g., with a simple case, it looks like adding a 5th case to: turns it from if-then-else to switch. But maybe we know the last of the if-then-else tree is dominant. |
I think there is some merit to exploring this (and reordering even non-switch control flow). But some of the upstream opts that CSC does are useful and there are a huge variety of possible switch expansions to explore that we aren't yet capable of doing, so we'd need to expand our bag of tricks. |
src/coreclr/jit/block.h
Outdated
{ | ||
BasicBlock** bbsDstTab; // case label table address | ||
unsigned bbsCount; // count of cases (includes 'default' if bbsHasDefault) | ||
unsigned bbsDominantCase; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you should add comments for the new fields. E.g., (1) define what "dominant" means, (2) are the fields always valid, or only under some PGO condition? (3) clarify that bbDominantCase
is an index into the bbsDstTab
array (but is only valid if bbsHasDominantCase
is true
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
src/coreclr/jit/block.h
Outdated
@@ -1291,6 +1254,47 @@ struct BasicBlockList | |||
} | |||
}; | |||
|
|||
// BBswtDesc -- descriptor for a switch block |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You inserted this in an odd location, because there is a huge comment above BasicBlockList
that covers BasicBlockList
as well as flowList
; why not put this before that comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
{ | ||
assert(block->bbJumpKind == BBJ_SWITCH); | ||
|
||
const BasicBlock::weight_t sufficientSamples = 30.0f; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be useful to add a comment here describing why these constants were chosen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/coreclr/jit/fgprofile.cpp
Outdated
{ | ||
if (edge->m_weightKnown) | ||
{ | ||
if (!dominantEdge->m_weightKnown || (edge->m_weight > dominantEdge->m_weight)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be safer to bail if there are any edges with unknown weights? Are we essentially assuming that unknown weight is zero weight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revised, we'll bail out if we see any unknown edges.
src/coreclr/jit/block.h
Outdated
|
||
BBswtDesc() : bbsHasDefault(true), bbsHasDominantCase(false) | ||
{ | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to update optCopyBlkDest
for the new fields (probably?) -- looks like that code is already broken for bbsHasDefault
. BBswtDesc
should probably have a copy function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should you dump the new fields in dspJumpKind
? fgTableDispBasicBlock
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed and added some output to both methods, eg
BB01 [0000] 1 56 56 [000..022)-> BB03,BB05,BB04[dom(0.5535714)],BB06,BB08,BB07,BB09,BB02[def] (switch) IBC
src/coreclr/jit/fgopt.cpp
Outdated
// If either block or bNext has a profile weight | ||
// or if both block and bNext have non-zero weights | ||
// then we select the highest weight block. | ||
// then we wil use the max weight for the block. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo: wil
src/coreclr/jit/fgopt.cpp
Outdated
BasicBlock* bDst = *jumpTab; | ||
flowList* edgeToDst = fgGetPredForBlock(bDst, bSrc); | ||
double outRatio = (double) edgeToDst->edgeWeightMin() / (double) bSrc->bbWeight; | ||
if (block->bbJumpKind != BBJ_SWITCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this ever called for LIR? Check for SWITCH before RunRarely, as it's more likely to be false?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't make sense to call for LIR as switches are lowered there. So changed to an assert.
Also, up the minimal fraction to 0.55 based on some simple local benchmarking.
Did some more detailed benchmarking and it looks like we really should only peel if the dominant case is hit more than half the time. So will up the required likelihood to 0.55. |
@BruceForstall thanks for the feedback, see the latest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
If we have PGO data and the dominant non-default switch case has more than
30% of the profile, add an explicit test for that case upstream of the switch.
We don't see switches all that often anymore as CSC is quite aggressive about
turning them into if-then-else trees, but they still show up in the async
methods.