JIT: peel off dominant switch case under PGO #52827

Merged 4 commits on May 19, 2021

Conversation

AndyAyersMS
Member

If we have PGO data and the dominant non-default switch case has more than
30% of the profile, add an explicit test for that case upstream of the switch.

We don't see switches all that often anymore as CSC is quite aggressive about
turning them into if-then-else trees, but they still show up in the async
methods.
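
As a rough source-level picture of what "peeling" means here (the JIT performs this on its flow graph, not on source code), the sketch below shows a switch before and after an explicit test for the dominant case is placed ahead of it. The function names, case values, and the choice of case 2 as the dominant one are all hypothetical, and the residual switch is assumed to be left unchanged.

```cpp
#include <cassert>

// Hypothetical dispatch before peeling: every value goes through the switch.
int dispatchPlain(int state)
{
    switch (state)
    {
        case 0:  return 10;
        case 1:  return 11;
        case 2:  return 12;
        case 3:  return 13;
        default: return -1;
    }
}

// The same dispatch after peeling case 2, which is ASSUMED to be the
// dominant case according to profile data.
int dispatchPeeled(int state)
{
    // Explicit test for the dominant case, upstream of the switch.
    if (state == 2)
    {
        return 12;
    }

    // Residual switch, left as it was; case 2 is still present but now cold.
    switch (state)
    {
        case 0:  return 10;
        case 1:  return 11;
        case 2:  return 12;
        case 3:  return 13;
        default: return -1;
    }
}
```

Both functions return the same result for every input; only the order of the tests changes, which is why the transformation pays off only when the peeled case really is hit often.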

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 16, 2021
@AndyAyersMS
Member Author

@EgorBo PTAL
cc @dotnet/jit-contrib

Per SPMI, asm diffs in asp.net but nowhere else. Code size generally increases a bit, though it depends on what happens in flow opts and in LSRA.

The <ProcessRequests>d__223`1:MoveNext():this methods are among the hottest methods in the TE scenarios. Didn't see any impact on TE numbers with this change, but it's not easy to make fine-grained measurements on those.

Total bytes of base: 271240
Total bytes of diff: 272341
Total bytes of delta: 1101 (0.41% of base)
    diff is a regression.


Top file regressions (bytes):
          67 : 30803.dasm (0.72% of base)
          63 : 20200.dasm (1.55% of base)
          60 : 39874.dasm (1.65% of base)
          31 : 37715.dasm (0.77% of base)
          29 : 30389.dasm (2.14% of base)
          29 : 24617.dasm (2.14% of base)
          27 : 15847.dasm (1.04% of base)
          25 : 37213.dasm (5.19% of base)
          25 : 38450.dasm (0.90% of base)
          25 : 39787.dasm (2.40% of base)
          24 : 24612.dasm (3.55% of base)
          24 : 30388.dasm (3.55% of base)
          24 : 31072.dasm (0.92% of base)
          24 : 32603.dasm (3.54% of base)
          24 : 39972.dasm (0.92% of base)
          22 : 12162.dasm (1.56% of base)
          19 : 39849.dasm (0.20% of base)
          17 : 30869.dasm (0.46% of base)
          17 : 36556.dasm (5.21% of base)
          15 : 30946.dasm (0.79% of base)

Top file improvements (bytes):
         -66 : 39457.dasm (-1.80% of base)
         -50 : 37462.dasm (-5.03% of base)
         -45 : 33399.dasm (-1.23% of base)
         -34 : 32371.dasm (-1.82% of base)
         -20 : 9853.dasm (-0.47% of base)
         -19 : 20218.dasm (-0.87% of base)
         -17 : 33205.dasm (-0.18% of base)
         -17 : 16073.dasm (-0.42% of base)
         -15 : 36282.dasm (-2.23% of base)
         -15 : 32961.dasm (-0.69% of base)
         -14 : 33612.dasm (-0.39% of base)
         -12 : 15898.dasm (-0.83% of base)
         -12 : 39941.dasm (-0.63% of base)
         -11 : 37750.dasm (-4.15% of base)
         -11 : 20460.dasm (-0.50% of base)
          -9 : 16046.dasm (-1.18% of base)
          -7 : 33415.dasm (-1.38% of base)
          -7 : 38025.dasm (-1.38% of base)
          -7 : 39891.dasm (-1.38% of base)
          -7 : 15678.dasm (-0.92% of base)

169 total files with Code Size differences (29 improved, 140 regressed), 2 unchanged.

Top method regressions (bytes):
          67 ( 0.72% of base) : 30803.dasm - <NextResult>d__48:MoveNext():this
          63 ( 1.55% of base) : 20200.dasm - <ProcessRequests>d__223`1:MoveNext():this
          60 ( 1.65% of base) : 39874.dasm - <MultiplexingWriteLoop>d__21:MoveNext():this
          31 ( 0.77% of base) : 37715.dasm - <ProcessRequests>d__223`1:MoveNext():this
          29 ( 2.14% of base) : 30389.dasm - ILGenerator:Emit(OpCode,int):this
          29 ( 2.14% of base) : 24617.dasm - ILGenerator:Emit(OpCode,int):this
          27 ( 1.04% of base) : 15847.dasm - KnownHeaders:GetCandidate(BytePtrAccessor):KnownHeader
          25 ( 5.19% of base) : 37213.dasm - String:EndsWith(String,int):bool:this
          25 ( 0.90% of base) : 38450.dasm - <MultiplexingReadLoop>d__195:MoveNext():this
          25 ( 2.40% of base) : 39787.dasm - <Invoke>d__4:MoveNext():this
          24 ( 3.55% of base) : 24612.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this
          24 ( 3.55% of base) : 30388.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this
          24 ( 0.92% of base) : 31072.dasm - <MultiplexingReadLoop>d__195:MoveNext():this
          24 ( 3.54% of base) : 32603.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this
          24 ( 0.92% of base) : 39972.dasm - <MultiplexingReadLoop>d__195:MoveNext():this
          22 ( 1.56% of base) : 12162.dasm - Uri:CheckSchemeSyntax(ReadOnlySpan`1,byref):int
          19 ( 0.20% of base) : 39849.dasm - <NextResult>d__48:MoveNext():this
          17 ( 0.46% of base) : 30869.dasm - <MultiplexingWriteLoop>d__21:MoveNext():this
          17 ( 5.21% of base) : 36556.dasm - SelectManySingleSelectorIterator`2:MoveNext():bool:this
          15 ( 0.79% of base) : 30946.dasm - <Read>d__44:MoveNext():this

Top method improvements (bytes):
         -66 (-1.80% of base) : 39457.dasm - HttpResponseHeaders:CopyToFast(byref):this
         -50 (-5.03% of base) : 37462.dasm - ParameterExpression:Make(Type,String,bool):ParameterExpression
         -45 (-1.23% of base) : 33399.dasm - <MultiplexingWriteLoop>d__21:MoveNext():this
         -34 (-1.82% of base) : 32371.dasm - DbConnectionOptions:GetKeyValuePair(String,int,StringBuilder,bool,byref,byref):int
         -20 (-0.47% of base) : 9853.dasm - <DoReceive>d__27:MoveNext():this
         -19 (-0.87% of base) : 20218.dasm - ControllerActionInvoker:Next(byref,byref,byref,byref):Task:this
         -17 (-0.18% of base) : 33205.dasm - <NextResult>d__48:MoveNext():this
         -17 (-0.42% of base) : 16073.dasm - <DoReceive>d__27:MoveNext():this
         -15 (-2.23% of base) : 36282.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this
         -15 (-0.69% of base) : 32961.dasm - <ProcessRequestsAsync>d__12`1:MoveNext():this
         -14 (-0.39% of base) : 33612.dasm - HttpResponseHeaders:CopyToFast(byref):this
         -12 (-0.83% of base) : 15898.dasm - ChunkedEncodingReadStream:ReadChunkFromConnectionBuffer(int,CancellationTokenRegistration):ReadOnlyMemory`1:this
         -12 (-0.63% of base) : 39941.dasm - <Read>d__44:MoveNext():this
         -11 (-4.15% of base) : 37750.dasm - SslStream:GetFrameSize(ReadOnlySpan`1):int:this
         -11 (-0.50% of base) : 20460.dasm - <ProcessRequestsAsync>d__12`1:MoveNext():this
          -9 (-1.18% of base) : 16046.dasm - IPv4AddressHelper:ParseNonCanonical(long,int,byref,bool):long
          -7 (-1.38% of base) : 33415.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
          -7 (-1.38% of base) : 38025.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
          -7 (-1.38% of base) : 39891.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
          -7 (-0.92% of base) : 15678.dasm - IPv4AddressHelper:ParseNonCanonical(long,int,byref,bool):long

Top method regressions (percentages):
          15 ( 7.46% of base) : 24665.dasm - Enum:ToUInt64():long:this
          14 ( 6.57% of base) : 10236.dasm - Http1Connection:ParseRequest(byref):bool:this
          13 ( 5.96% of base) : 37839.dasm - Http1Connection:ParseRequest(byref):bool:this
          13 ( 5.96% of base) : 20076.dasm - Http1Connection:ParseRequest(byref):bool:this
          13 ( 5.96% of base) : 39362.dasm - Http1Connection:ParseRequest(byref):bool:this
          13 ( 5.96% of base) : 33004.dasm - Http1Connection:ParseRequest(byref):bool:this
          12 ( 5.94% of base) : 6004.dasm - Enumerator:MoveNext():bool:this
          12 ( 5.94% of base) : 20407.dasm - Enumerator:MoveNext():bool:this
          12 ( 5.94% of base) : 39555.dasm - Enumerator:MoveNext():bool:this
          12 ( 5.94% of base) : 16583.dasm - Enumerator:MoveNext():bool:this
          12 ( 5.94% of base) : 33756.dasm - Enumerator:MoveNext():bool:this
          12 ( 5.94% of base) : 36869.dasm - Enumerator:MoveNext():bool:this
          12 ( 5.94% of base) : 10452.dasm - Enumerator:MoveNext():bool:this
          17 ( 5.21% of base) : 36556.dasm - SelectManySingleSelectorIterator`2:MoveNext():bool:this
          25 ( 5.19% of base) : 37213.dasm - String:EndsWith(String,int):bool:this
           4 ( 4.82% of base) : 36366.dasm - ResultCache:.ctor(int,Type,int):this
          10 ( 3.98% of base) : 25217.dasm - SslStream:GetFrameSize(ReadOnlySpan`1):int:this
          10 ( 3.58% of base) : 24825.dasm - SelectManySingleSelectorIterator`2:MoveNext():bool:this
          24 ( 3.55% of base) : 24612.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this
          24 ( 3.55% of base) : 30388.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this

Top method improvements (percentages):
         -50 (-5.03% of base) : 37462.dasm - ParameterExpression:Make(Type,String,bool):ParameterExpression
         -11 (-4.15% of base) : 37750.dasm - SslStream:GetFrameSize(ReadOnlySpan`1):int:this
         -15 (-2.23% of base) : 36282.dasm - MemberInfoCache`1:GetListByName(long,int,long,int,int,int):ref:this
         -34 (-1.82% of base) : 32371.dasm - DbConnectionOptions:GetKeyValuePair(String,int,StringBuilder,bool,byref,byref):int
         -66 (-1.80% of base) : 39457.dasm - HttpResponseHeaders:CopyToFast(byref):this
          -7 (-1.38% of base) : 33415.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
          -7 (-1.38% of base) : 38025.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
          -7 (-1.38% of base) : 39891.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
          -7 (-1.38% of base) : 30895.dasm - NpgsqlDataReader:TryFastRead():Nullable`1:this
         -45 (-1.23% of base) : 33399.dasm - <MultiplexingWriteLoop>d__21:MoveNext():this
          -9 (-1.18% of base) : 16046.dasm - IPv4AddressHelper:ParseNonCanonical(long,int,byref,bool):long
          -4 (-1.12% of base) : 10869.dasm - MemoryExtensions:Equals(ReadOnlySpan`1,ReadOnlySpan`1,int):bool
          -5 (-1.05% of base) : 36279.dasm - MemberInfoCache`1:Insert(byref,String,int):this
          -7 (-0.92% of base) : 15678.dasm - IPv4AddressHelper:ParseNonCanonical(long,int,byref,bool):long
         -19 (-0.87% of base) : 20218.dasm - ControllerActionInvoker:Next(byref,byref,byref,byref):Task:this
         -12 (-0.83% of base) : 15898.dasm - ChunkedEncodingReadStream:ReadChunkFromConnectionBuffer(int,CancellationTokenRegistration):ReadOnlyMemory`1:this
         -15 (-0.69% of base) : 32961.dasm - <ProcessRequestsAsync>d__12`1:MoveNext():this
         -12 (-0.63% of base) : 39941.dasm - <Read>d__44:MoveNext():this
         -11 (-0.50% of base) : 20460.dasm - <ProcessRequestsAsync>d__12`1:MoveNext():this
         -20 (-0.47% of base) : 9853.dasm - <DoReceive>d__27:MoveNext():this

169 total methods with Code Size differences (29 improved, 140 regressed), 2 unchanged.

return bbsDstTab[bbsCount - 1];
}
};
struct BBswtDesc;
Member Author

@AndyAyersMS AndyAyersMS May 16, 2021

Rearranged so that I could refer to BasicBlock::weight_t within BBswtDesc.

{
assert(block->bbJumpKind == BBJ_SWITCH);

const BasicBlock::weight_t sufficientSamples = 100.0f;
Member

Is it expected that it won't work for dynamic PGO where maxSamples will be around 30 always?

Member Author

Fair point. For dynamic PGO we'll see at least 30 calls to a method, but we often see many more than that. Also this is the count of executions of the switch, not of the method; the switch count can be higher or lower depending.

But changing this to 30 makes sense.

@AndyAyersMS
Member Author

The two failures are both running crossgen2 so potentially related. Will investigate.

Comment on lines 2605 to 2636
for (Edge* edge = dominantEdge->m_nextOutgoingEdge; edge != nullptr; edge = edge->m_nextOutgoingEdge)
{
    if (edge->m_weightKnown)
    {
        if (!dominantEdge->m_weightKnown || (edge->m_weight > dominantEdge->m_weight))
        {
            dominantEdge = edge;
        }
    }
}

if (!dominantEdge->m_weightKnown)
{
    JITDUMP("No edges with known counts, sorry\n");
    return;
}

BasicBlock::weight_t fraction = dominantEdge->m_weight / info->m_weight;

// Because of count inconsistency we can see nonsensical ratios. Cap these.
//
if (fraction > 1.0)
{
    fraction = 1.0;
}

if (fraction < sufficientFraction)
{
    JITDUMP("Maximum edge likelihood is " FMT_WT " < " FMT_WT "; not sufficient to trigger peeling)\n", fraction,
            sufficientFraction);
    return;
}
Member

If I understand correctly this optimization will always trigger for switch cases with 3 or fewer cases, even if no switch case is dominant. Should sufficientFraction instead be computed based on the number of cases, e.g. 1.0 / (numCases - 1)?

Member Author

> Should sufficientFraction instead be computed based on the number of cases...?

Maybe? Seems like 3 could be special case.

The likelihood needs to be high enough to offset the cost of the extra compare and branch if we guess wrong. So I think the likelihood should always be at least something like 0.3 and not go lower for larger switches. That covers switches with 4 or more cases.

We already specially transform switches with 1-2 cases in fgOptimizeSwitchBranches. We likely also handle case switches, though I don't see where we look for these.

That leaves 3 case switches. If we have a dominant non-default case over 0.3 and are also going to peel for the default then we're basically reducing the switch to a compare tree, but not getting the full benefit of doing so. I think this is ok for now.


(Some other notes on switch peeling)

Ideally, once we peel, the residual switch would get simplified and perhaps some of this other logic would kick in and simplify things further. We could do this now if we peeled the highest numbered case, but not if we peel some other case.

Since we're always peeling the default (when lowering, so it would be peeled under any peel we do here), we should perhaps change strategy if the default case has a reasonable likelihood. For example, say there are 10 cases and the default case likelihood is 0.5, and case 0 has likelihood 0.25. We'd be better off first peeling for the default case. Once we do that, we can renormalize the switch likelihoods (multiplying by 1 / ( 1 - 0.5 )) and so case 0 relative likelihood is now 0.5, so we should peel for that case.

Of course once we do that second peel we can renormalize again and perhaps another case now becomes likely enough that we'd be willing to peel for that too. The rule of thumb I've seen is that going up to 3 levels can pay off (say the dominant cases are 0.9, then 0.09, then 0.009; these all become 0.9 with successive normalization). But I'm not sure we're ready for this just yet, as we also have to wonder how accurate our profile is when we start getting into the tail of the frequency distribution, and balance the likely performance win against code size increases. And there's some cost to having a high branch density.
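
The renormalization arithmetic in these notes can be sketched as follows. This is only an illustration of the idea, not code from the PR: `renormalize` and `countProfitablePeels` are hypothetical names, the likelihoods are assumed to be listed in the order in which they would be peeled, and the threshold and peel limit are parameters chosen by the caller.

```cpp
#include <cassert>
#include <vector>

// Rescale a case's likelihood after another case with likelihood
// 'peeledLikelihood' has been peeled off ahead of it.
double renormalize(double likelihood, double peeledLikelihood)
{
    assert(peeledLikelihood >= 0.0 && peeledLikelihood < 1.0);
    return likelihood / (1.0 - peeledLikelihood);
}

// Count how many successive peels stay at or above 'threshold' once each
// candidate is renormalized against the probability mass that is still left.
// 'likelihoods' holds the original (un-renormalized) case likelihoods in peel order.
int countProfitablePeels(const std::vector<double>& likelihoods, double threshold, int maxPeels)
{
    int    peels     = 0;
    double remaining = 1.0; // probability mass not yet peeled away

    for (double p : likelihoods)
    {
        if (peels == maxPeels || remaining <= 0.0)
        {
            break;
        }
        if (p / remaining < threshold)
        {
            break; // next candidate is not likely enough, stop peeling
        }
        remaining -= p;
        ++peels;
    }
    return peels;
}
```

With the numbers from the notes, `renormalize(0.25, 0.5)` gives 0.5 for case 0 after the default is peeled, and the sequence 0.9, 0.09, 0.009 renormalizes to roughly 0.9 at every level, so all three peels would clear a 0.55 threshold.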

@AndyAyersMS
Member Author

Failure is:

Assertion failed 'pred->flDupCount == 1' in 'System.Xml.XmlTextReaderImpl:FinishOtherValueIterator():this' during 'Optimize layout' (IL size 222)

so some sort of switch invariant is not being honored.

@AndyAyersMS
Member Author

Looks like more issues to chase down...

@AndyAyersMS
Member Author

Block coalescing was not making the combined block an "IBC" block if either block was, so we were losing IBC flag on a switch that we'd selected for peeling. Fixed the coalescing logic.
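
A minimal sketch of the coalescing rule described above, assuming a simplified block record: `BlockInfo` and `coalesce` are hypothetical stand-ins for the JIT's own types, and what happens to the weight when neither block has profile data is a simplification made for this sketch only.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical, simplified view of a basic block's profile state.
struct BlockInfo
{
    double weight;           // execution weight of the block
    bool   hasProfileWeight; // true if the weight came from profile ("IBC") data
};

// Combine 'block' with its successor 'next'.
BlockInfo coalesce(const BlockInfo& block, const BlockInfo& next)
{
    BlockInfo merged = block;

    if (block.hasProfileWeight || next.hasProfileWeight)
    {
        // If either block carries profile data, keep the larger weight and
        // mark the combined block as profiled, so later phases (such as
        // switch peeling) still see it as having profile data.
        merged.weight           = std::max(block.weight, next.weight);
        merged.hasProfileWeight = true;
    }
    // Otherwise (ASSUMPTION for this sketch) the first block's state is kept as-is.

    return merged;
}
```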

@AndyAyersMS
Member Author

Failure is known issue #52710.

@AndyAyersMS AndyAyersMS requested a review from EgorBo May 18, 2021 00:36
@BruceForstall
Member

> We don't see switches all that often anymore as CSC is quite aggressive about turning them into if-then-else trees

Should we consider (not in this change) essentially "re-constituting switches" such that we can optimize them using this method? Or, if not that literally, given an if-then-else tree for which we have PGO data, reorder the conditions, if we can prove they are mutually exclusive with no intervening side-effects (say). E.g., with a simple case, it looks like adding a 5th case to:
https://sharplab.io/#v2:EYLgtghglgdgNAFxBAzmAPgAQEwEYCwAUJgMwAEOZAwmQN5FmMXmwJkCyAFK2VAJR0GTYSgDuUBAGMAFmW58hwxvUJK1ZSagCmZAAwgKAdjKGA3IvWNNKHbgOZjAFnOrLTazsP3jADhdv3bTIATm8yXH8AsgATLQAzCABXABskIxDItQBfCxzCLKA===

turns it from if-then-else to switch. But maybe we know the last of the if-then-else tree is dominant.

@AndyAyersMS
Member Author

> Should we consider (not in this change) essentially "re-constituting switches" ...

I think there is some merit to exploring this (and reordering even non-switch control flow). But some of the upstream opts that CSC does are useful and there are a huge variety of possible switch expansions to explore that we aren't yet capable of doing, so we'd need to expand our bag of tricks.

{
BasicBlock** bbsDstTab; // case label table address
unsigned bbsCount; // count of cases (includes 'default' if bbsHasDefault)
unsigned bbsDominantCase;
Member

I think you should add comments for the new fields. E.g., (1) define what "dominant" means, (2) are the fields always valid, or only under some PGO condition? (3) clarify that bbsDominantCase is an index into the bbsDstTab array (but is only valid if bbsHasDominantCase is true?)

Member Author

Added.

@@ -1291,6 +1254,47 @@ struct BasicBlockList
}
};

// BBswtDesc -- descriptor for a switch block
Member

You inserted this in an odd location, because there is a huge comment above BasicBlockList that covers BasicBlockList as well as flowList. Why not put this before that comment?

Member Author

Moved.

{
assert(block->bbJumpKind == BBJ_SWITCH);

const BasicBlock::weight_t sufficientSamples = 30.0f;
Member

It would be useful to add a comment here describing why these constants were chosen.

Member Author

Done.

{
if (edge->m_weightKnown)
{
if (!dominantEdge->m_weightKnown || (edge->m_weight > dominantEdge->m_weight))
Member

Wouldn't it be safer to bail if there are any edges with unknown weights? Are we essentially assuming that unknown weight is zero weight?

Member Author

Revised, we'll bail out if we see any unknown edges.
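
The revised selection logic might look roughly like the sketch below. It is a simplified stand-in, not the PR's code: `EdgeProfile` and `findDominantEdge` are hypothetical names, the caller is assumed to pass only the non-default edges, and the exact way the sample count is compared against the switch weight is an assumption. The default values of 30 samples and a 0.55 fraction are the ones this thread settles on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one outgoing switch edge and its profile data.
struct EdgeProfile
{
    bool   weightKnown;
    double weight;
};

// Returns the index of the dominant edge, or -1 if the switch should not be peeled.
int findDominantEdge(const std::vector<EdgeProfile>& edges,
                     double                          switchWeight,
                     double                          sufficientSamples  = 30.0,
                     double                          sufficientFraction = 0.55)
{
    // Too few executions of the switch to trust the profile (ASSUMED comparison).
    if (edges.empty() || (switchWeight < sufficientSamples))
    {
        return -1;
    }

    std::size_t dominant = 0;
    for (std::size_t i = 0; i < edges.size(); ++i)
    {
        // Bail out entirely if any edge weight is unknown, rather than
        // silently treating it as zero.
        if (!edges[i].weightKnown)
        {
            return -1;
        }
        if (edges[i].weight > edges[dominant].weight)
        {
            dominant = i;
        }
    }

    // Count inconsistencies can produce ratios above 1; cap them.
    double fraction = edges[dominant].weight / switchWeight;
    if (fraction > 1.0)
    {
        fraction = 1.0;
    }

    return (fraction < sufficientFraction) ? -1 : static_cast<int>(dominant);
}
```

For example, edge weights of 10, 70, and 20 on a switch executed 100 times select index 1, while a 40/30/30 split or any edge with an unknown weight yields -1.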


BBswtDesc() : bbsHasDefault(true), bbsHasDominantCase(false)
{
}
Member

You need to update optCopyBlkDest for the new fields (probably?) -- looks like that code is already broken for bbsHasDefault. BBswtDesc should probably have a copy function.

Member

Should you dump the new fields in dspJumpKind? fgTableDispBasicBlock?

Member Author

Fixed, and added some output to both methods, e.g.:

BB01 [0000]  1                            56     56    [000..022)-> BB03,BB05,BB04[dom(0.5535714)],BB06,BB08,BB07,BB09,BB02[def] (switch)                     IBC 

// If either block or bNext has a profile weight
// or if both block and bNext have non-zero weights
// then we select the highest weight block.
// then we wil use the max weight for the block.
Member

typo: wil

BasicBlock* bDst = *jumpTab;
flowList* edgeToDst = fgGetPredForBlock(bDst, bSrc);
double outRatio = (double) edgeToDst->edgeWeightMin() / (double) bSrc->bbWeight;
if (block->bbJumpKind != BBJ_SWITCH)
Member

Is this ever called for LIR? Check for SWITCH before RunRarely, as it's more likely to be false?

Member Author

Doesn't make sense to call for LIR as switches are lowered there. So changed to an assert.

Also, up the minimal fraction to 0.55 based on some simple local benchmarking.
@AndyAyersMS
Member Author

Did some more detailed benchmarking and it looks like we really should only peel if the dominant case is hit more than half the time. So will up the required likelihood to 0.55.

@AndyAyersMS
Member Author

@BruceForstall thanks for the feedback, see the latest.

Member

@BruceForstall BruceForstall left a comment

LGTM

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
5 participants