-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Add loop-aware RPO, and use as LSRA's block sequence #108086
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch |
src/coreclr/jit/flowgraph.cpp
Outdated
// | ||
// Notes: | ||
// If the flow graph has loops, the DFS will be reordered such that loop bodies are compact. | ||
// This will invalidate BasicBlock::bbPreorderNum and BasicBlock::bbPostorderNum. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we have any dependencies on bbPreorderNum
or bbPostorderNum
in the backend, but if we want to use loop-aware RPOs elsewhere in the JIT, I can work on making these members consistent.
cc @dotnet/jit-contrib, @AndyAyersMS PTAL. I decided to only run this when optimizing since the potential codegen improvement doesn't seem to warrant the TP cost in MinOpts. Diffs show up to a 0.16% TP cost in FullOpts, so moving layout entirely to the backend and reusing the RPO computation should easily pay for this. Thanks! |
For min opts (at least conceptually) block order shouldn't matter, should it? There are no cross-block live registers. It would be good to verify this. If so, we might be able to save some more time in min opts by just using the linear chain order. |
I think you're right. I tried this out locally, and I'm not getting any asmdiffs. I'll open a separate PR for it. |
src/coreclr/jit/flowgraph.cpp
Outdated
// If the flow graph has loops, the DFS will be reordered such that loop bodies are compact. | ||
// This will invalidate BasicBlock::bbPreorderNum and BasicBlock::bbPostorderNum. | ||
// | ||
FlowGraphDfsTree* Compiler::fgComputeLoopAwareDfs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything gained from trying to represent this as an actual FlowGraphDfsTree
? I think it would make more sense to have a utility function that given FlowGraphDfsTree
and FlowGraphNaturalLoops
visits the blocks in RPO that respects the loop structure. It would basically be a slight generalization of what we have in VN already.
The "compute DFS tree" into "identify loops" into "now create another DFS tree" seems wasteful and conceptually a bit odd.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything gained from trying to represent this as an actual FlowGraphDfsTree?
Probably not.
I think it would make more sense to have a utility function that given FlowGraphDfsTree and FlowGraphNaturalLoops visits the blocks in RPO that respects the loop structure.
That sounds sensible -- I'll try modeling this after FlowGraphNaturalLoop::VisitLoopBlocksReversePostOrder
.
src/coreclr/jit/compiler.hpp
Outdated
assert(blockToLoop != nullptr); | ||
|
||
EnsureBasicBlockEpoch(); | ||
BlockSet visitedBlocks(BlockSetOps::MakeEmpty(this)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to use the post order number traits that you can get from the DFS tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I should probably make a note of phasing this bbNum
dependency out elsewhere.
src/coreclr/jit/compiler.hpp
Outdated
// (first when we visit its containing loop, and then later as we iterate | ||
// through the initial RPO). | ||
// Thus, we need to keep track of visited blocks. | ||
if (!BlockSetOps::IsMember(this, visitedBlocks, block->bbNum)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TryAddElemD
can be used as a replacement for IsMember
+ AddElemD
.
src/coreclr/jit/compiler.hpp
Outdated
// If this block is a loop header, visit the entire loop before moving on | ||
if ((loop != nullptr) && (block == loop->GetHeader())) | ||
{ | ||
loop->VisitLoopBlocksReversePostOrder(visitBlock); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this handles nested loops properly. This needs some form of recursion, probably -- similarly to fgValueNumberBlocks
.
I think this utility can be implemented without a dependency on BlockToNaturalLoopMap
since FlowGraphNaturalLoops
stores loops in descendant order of the header's post order number, so FlowGraphNaturalLoops::GetLoopByHeader
can have an efficient binary search implementation (there is a TODO about it). It should also be possible to walk the current loop and current block in lockstep, although it seems unnecessary to go that far.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this utility can be implemented without a dependency on BlockToNaturalLoopMap since FlowGraphNaturalLoops stores loops in descendant order of the header's post order number
That seems simple enough -- I'll change it.
This needs some form of recursion, probably -- similarly to fgValueNumberBlocks.
I'm guessing we can't use a lambda for the recursive logic, right? If we need to split the recursive logic into another method, would it make sense to put the loop-aware RPO logic in some sort of visitor class to hide the recursive details? I suppose I could just stick the recursive logic in some struct local to the method as well...
Diffs look like a net PerfScore win, except on Linux arm32. The example diffs with size increases have trivial PerfScore diffs -- I'll run SPMI locally and take a look at the top PerfScore regressions. |
In cases with code size increases, I'm seeing spills in new places (usually off the hot path, hence the net PerfScore improvement), and increases in offsets between jumps can increase jump sizes as well, amplifying the effect. On Linux arm, looking at collections with PGO, I see instances of spilling/subpar register allocation in loops -- @kunalspathak I'm guessing this has less to do with changing the block order, and more to do with getting LSRA to allocate for loops first, right? Are we ok with taking this change (pending @jakobbotsch's review of the utility itself) if we plan to address allocation for loops separately? |
Sure. Looking at the
|
{ | ||
func(block); | ||
|
||
FlowGraphNaturalLoop* const loop = loops->GetLoopByHeader(block); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we may need to optimize the GetLoopByHeader
implementation since right now this is quadratic complexity for pathological cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed -- is it alright if I do this in a followup?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The utility looks good to me, but I think to be safe we should make GetLoopByHeader
use a binary search.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/ba-g blocked by warnings from 'System.Text.Json' security vulnerabilities |
) Part of dotnet#107749, and follow-up to dotnet#107927. When computing a RPO of the flow graph, ensuring that the entirety of a loop body is visited before any of the loop's successors has the benefit of keeping the loop body compact in the traversal. This is certainly ideal when computing an initial block layout, and may be preferable for register allocation, too. Thus, this change formalizes loop-aware RPO creation as part of the flowgraph API surface, and uses it for LSRA's block sequence.
Part of #107749, and follow-up to #107927. When computing a RPO of the flow graph, ensuring that the entirety of a loop body is visited before any of the loop's successors has the benefit of keeping the loop body compact in the traversal. This is certainly ideal when computing an initial block layout, and may be preferable for register allocation, too. Thus, this change formalizes loop-aware RPO creation as part of the flowgraph API surface, and uses it for LSRA's block sequence.
I plan to reuse the RPO computed during LSRA in
fgDoReversePostOrderLayout
once #107634 is in. To do this, I had to add a new phase check flag to disable checking basic block pre/postorder numbers, since the loop-aware RPO (or just a profile-aware RPO) won't match up with the expected DFS in our debug checks -- it seems simplest to just disable these checks altogether once we reach the backend.