
JIT: Add loop-aware RPO, and use as LSRA's block sequence #108086

Merged · 10 commits · Oct 10, 2024

Conversation

amanasifkhalid (Member)

Part of #107749, and follow-up to #107927. When computing an RPO of the flow graph, ensuring that the entirety of a loop body is visited before any of the loop's successors has the benefit of keeping the loop body compact in the traversal. This is certainly ideal when computing an initial block layout, and may be preferable for register allocation, too. Thus, this change formalizes loop-aware RPO creation as part of the flowgraph API surface, and uses it for LSRA's block sequence.

I plan to reuse the RPO computed during LSRA in fgDoReversePostOrderLayout once #107634 is in. To do this, I had to add a new phase check flag to disable checking basic block pre/postorder numbers, since the loop-aware RPO (or just a profile-aware RPO) won't match up with the expected DFS in our debug checks -- it seems simplest to just disable these checks altogether once we reach the backend.
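
To make the difference concrete, consider a small hypothetical flow graph: B1 is the entry and branches to B2; B2 heads a loop whose body is B3 (B3 jumps back to B2); B2's other successor B4 is the loop exit. Depending on the order the DFS visits B2's successors, a plain RPO can come out as

B1, B2, B4, B3

with the exit block B4 splitting the loop body, whereas a loop-aware RPO keeps the body together:

B1, B2, B3, B4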

dotnet-issue-labeler bot added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Sep 20, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

//
// Notes:
// If the flow graph has loops, the DFS will be reordered such that loop bodies are compact.
// This will invalidate BasicBlock::bbPreorderNum and BasicBlock::bbPostorderNum.
amanasifkhalid (Member Author)

I don't think we have any dependencies on bbPreorderNum or bbPostorderNum in the backend, but if we want to use loop-aware RPOs elsewhere in the JIT, I can work on making these members consistent.

@amanasifkhalid (Member Author)

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. I decided to only run this when optimizing since the potential codegen improvement doesn't seem to warrant the TP cost in MinOpts. Diffs show up to a 0.16% TP cost in FullOpts, so moving layout entirely to the backend and reusing the RPO computation should easily pay for this. Thanks!

@kunalspathak (Member)

Seems like there's a regression on linux/arm and windows/x86.

[screenshot of diff results]

@AndyAyersMS (Member)

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. I decided to only run this when optimizing since the potential codegen improvement doesn't seem to warrant the TP cost in MinOpts. Diffs show up to a 0.16% TP cost in FullOpts, so moving layout entirely to the backend and reusing the RPO computation should easily pay for this. Thanks!

For min opts (at least conceptually) block order shouldn't matter, should it? There are no cross-block live registers. It would be good to verify this. If so, we might be able to save some more time in min opts by just using the linear chain order.
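
If that holds, a minimal sketch of the MinOpts fast path might look like the following (hypothetical, not part of this PR; appendToSequence stands in for LSRA's actual bookkeeping):

// Hypothetical sketch: in MinOpts, skip the RPO/loop computation entirely and
// sequence blocks in lexical (bbNext) order, since there are no cross-block
// live registers for a smarter order to help with.
if (!compiler->opts.OptimizationEnabled())
{
    for (BasicBlock* block = compiler->fgFirstBB; block != nullptr; block = block->Next())
    {
        appendToSequence(block); // placeholder for LSRA's actual bookkeeping
    }
    return;
}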

@amanasifkhalid (Member Author)

For min opts (at least conceptually) block order shouldn't matter, should it? There are no cross-block live registers. It would be good to verify this. If so, we might be able to save some more time in min opts by just using the linear chain order.

I think you're right. I tried this out locally, and I'm not getting any asmdiffs. I'll open a separate PR for it.

// If the flow graph has loops, the DFS will be reordered such that loop bodies are compact.
// This will invalidate BasicBlock::bbPreorderNum and BasicBlock::bbPostorderNum.
//
FlowGraphDfsTree* Compiler::fgComputeLoopAwareDfs()
@jakobbotsch (Member), Oct 4, 2024

Is there anything gained from trying to represent this as an actual FlowGraphDfsTree? I think it would make more sense to have a utility function that given FlowGraphDfsTree and FlowGraphNaturalLoops visits the blocks in RPO that respects the loop structure. It would basically be a slight generalization of what we have in VN already.

The "compute DFS tree" into "identify loops" into "now create another DFS tree" seems wasteful and conceptually a bit odd.

amanasifkhalid (Member Author)

Is there anything gained from trying to represent this as an actual FlowGraphDfsTree?

Probably not.

I think it would make more sense to have a utility function that given FlowGraphDfsTree and FlowGraphNaturalLoops visits the blocks in RPO that respects the loop structure.

That sounds sensible -- I'll try modeling this after FlowGraphNaturalLoop::VisitLoopBlocksReversePostOrder.
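
For a sense of how such a utility might be consumed, here is a hypothetical call site in LSRA's block sequencing (the name fgVisitBlocksInLoopAwareRPO and the addToBlockSequenceWorkList bookkeeping are placeholders, not the merged API):

// Hypothetical call site (sketch): instead of computing a second DFS for
// LSRA, hand each block the loop-aware walk yields to the block sequence.
compiler->fgVisitBlocksInLoopAwareRPO(dfsTree, loops, [this](BasicBlock* block) {
    addToBlockSequenceWorkList(block); // placeholder for LSRA's actual bookkeeping
});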

assert(blockToLoop != nullptr);

EnsureBasicBlockEpoch();
BlockSet visitedBlocks(BlockSetOps::MakeEmpty(this));
@jakobbotsch (Member), Oct 8, 2024

It would be better to use the post order number traits that you can get from the DFS tree.

amanasifkhalid (Member Author)

Good point, I should probably make a note of phasing this bbNum dependency out elsewhere.
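
A sketch of that direction, with the helper shapes mirroring existing JIT bit-vector usage (treat the exact calls as assumptions): size the visited set by the DFS tree's post-order count and key it by bbPostorderNum, which removes the EnsureBasicBlockEpoch/bbNum dependency.

// Visited set keyed by post order numbers from the existing DFS tree,
// rather than by bbNum with an epoch-based BlockSet.
BitVecTraits poTraits(dfsTree->GetPostOrderCount(), this);
BitVec       visitedBlocks(BitVecOps::MakeEmpty(&poTraits));

// ... later, when considering a block ...
if (!BitVecOps::IsMember(&poTraits, visitedBlocks, block->bbPostorderNum))
{
    BitVecOps::AddElemD(&poTraits, visitedBlocks, block->bbPostorderNum);
    // visit the block
}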

// (first when we visit its containing loop, and then later as we iterate
// through the initial RPO).
// Thus, we need to keep track of visited blocks.
if (!BlockSetOps::IsMember(this, visitedBlocks, block->bbNum))
Member

TryAddElemD can be used as a replacement for IsMember + AddElemD.
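
Applied to the check above, the rewrite might look like this (a sketch; visitBlock is the functor from the surrounding diff):

// TryAddElemD adds the element and returns true only if it was not already
// present, folding the IsMember test and the AddElemD call into one.
if (BlockSetOps::TryAddElemD(this, visitedBlocks, block->bbNum))
{
    visitBlock(block);
}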

Comment on lines 5016 to 5020
// If this block is a loop header, visit the entire loop before moving on
if ((loop != nullptr) && (block == loop->GetHeader()))
{
    loop->VisitLoopBlocksReversePostOrder(visitBlock);
}
Member

I don't think this handles nested loops properly. This needs some form of recursion, probably -- similarly to fgValueNumberBlocks.

I think this utility can be implemented without a dependency on BlockToNaturalLoopMap, since FlowGraphNaturalLoops stores loops in descending order of the header's post order number, so FlowGraphNaturalLoops::GetLoopByHeader can have an efficient binary search implementation (there is a TODO about it). It should also be possible to walk the current loop and current block in lockstep, although it seems unnecessary to go that far.

@amanasifkhalid (Member Author), Oct 8, 2024

I think this utility can be implemented without a dependency on BlockToNaturalLoopMap since FlowGraphNaturalLoops stores loops in descending order of the header's post order number

That seems simple enough -- I'll change it.

This needs some form of recursion, probably -- similarly to fgValueNumberBlocks.

I'm guessing we can't use a lambda for the recursive logic, right? If we need to split the recursive logic into another method, would it make sense to put the loop-aware RPO logic in some sort of visitor class to hide the recursive details? I suppose I could just stick the recursive logic in some struct local to the method as well...
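
One way to keep the recursion local is a struct defined inside the (template) utility itself, so a member function can both hold the functor and call itself, similar in spirit to how fgValueNumberBlocks handles nested loops. The sketch below is illustrative only: the method name, the visited-set plumbing, and the exact recursion shape are assumptions, not the merged implementation.

// Sketch of the local-struct idiom. Because the struct lives inside a template
// method, it can store the functor directly, and VisitBlock can recurse into
// nested loops, which a plain lambda cannot easily do.
template <typename TFunc>
void Compiler::fgVisitBlocksInLoopAwareRPO(FlowGraphDfsTree* dfsTree, FlowGraphNaturalLoops* loops, TFunc func)
{
    struct LoopAwareVisitor
    {
        FlowGraphNaturalLoops* loops;
        BitVecTraits           traits;
        BitVec                 visited;
        TFunc                  func;

        void VisitBlock(BasicBlock* block)
        {
            // Skip blocks already emitted as part of an enclosing loop's body.
            if (!BitVecOps::TryAddElemD(&traits, visited, block->bbPostorderNum))
            {
                return;
            }

            func(block);

            // If this block heads a loop, emit the rest of that loop's body now
            // so it stays compact; nested loop headers recurse in turn.
            FlowGraphNaturalLoop* const loop = loops->GetLoopByHeader(block);
            if (loop != nullptr)
            {
                loop->VisitLoopBlocksReversePostOrder([&](BasicBlock* loopBlock) {
                    VisitBlock(loopBlock);
                    return BasicBlockVisit::Continue;
                });
            }
        }
    };

    BitVecTraits     traits(dfsTree->GetPostOrderCount(), this);
    LoopAwareVisitor visitor{loops, traits, BitVecOps::MakeEmpty(&traits), func};

    // Walk the DFS tree in reverse post order; loop bodies are expanded in place.
    for (unsigned i = dfsTree->GetPostOrderCount(); i != 0; i--)
    {
        visitor.VisitBlock(dfsTree->GetPostOrder(i - 1));
    }
}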

@amanasifkhalid (Member Author)

Diffs look like a net PerfScore win, except on Linux arm32. The example diffs with size increases have trivial PerfScore diffs -- I'll run SPMI locally and take a look at the top PerfScore regressions.

@amanasifkhalid (Member Author)

In cases with code size increases, I'm seeing spills in new places (usually off the hot path, hence the net PerfScore improvement), and increases in offsets between jumps can increase jump sizes as well, amplifying the effect. On Linux arm, looking at collections with PGO, I see instances of spilling/subpar register allocation in loops -- @kunalspathak I'm guessing this has less to do with changing the block order, and more to do with getting LSRA to allocate for loops first, right? Are we ok with taking this change (pending @jakobbotsch's review of the utility itself) if we plan to address allocation for loops separately?

@kunalspathak (Member)

and more to do with getting LSRA to allocate for loops first, right? Are we ok with taking this change

Yes. I am OK with this change since, in general, I see improvements. Can you double-check why there are some outliers in linux/arm64?

[screenshot of diff results showing the linux/arm64 outliers]

@amanasifkhalid (Member Author)

Can you double check why there are some outliers in linux/arm64?

Sure. Looking at the jit-analyze output for benchmarks.run_pgo, the top size regressions are inflated by System.Collections.Concurrent.ConcurrentQueueSegment[System.__Canon]:TryDequeue(byref):ubyte:this. Diffs in register allocation increased its prolog size from 24 to 32 bytes, resulting in a modest PerfScore increase. Since this method shows up a bunch in the collection, its size regression is probably overrepresented:

Top method regressions (percentages):
          24 (8.70 % of base) : 117399.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[int]:TryDequeue(byref):ubyte:this (Tier1-OSR)
          24 (8.45 % of base) : 72235.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.45 % of base) : 129076.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1-OSR)
          24 (8.33 % of base) : 120304.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 133026.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 136275.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 107615.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 114678.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 45719.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 51372.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 65472.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 98632.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 100311.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 149127.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 61639.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 93932.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 100008.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 105708.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 136676.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1-OSR)
          24 (8.33 % of base) : 93912.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1-OSR)

coreclr_tests looks like the same story:

Top method regressions (percentages):
          24 (8.45 % of base) : 667925.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.45 % of base) : 668908.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 368590.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 668964.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (8.33 % of base) : 669355.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.__Canon]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 587424.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 644144.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 659792.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 468496.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 516724.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 585147.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 647655.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 650824.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 353368.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 522504.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 551140.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 575103.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 583848.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 662800.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)
          24 (7.79 % of base) : 668616.dasm - System.Collections.Concurrent.ConcurrentQueueSegment`1[System.Net.Sockets.SocketAsyncEngine+SocketIOEvent]:TryDequeue(byref):ubyte:this (Tier1)

{
    func(block);

    FlowGraphNaturalLoop* const loop = loops->GetLoopByHeader(block);
Member

I think we may need to optimize the GetLoopByHeader implementation since right now this is quadratic complexity for pathological cases.
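
Since the loops are already sorted by descending header post-order number, the lookup lends itself to a binary search along these lines (a sketch; m_loops stands in for the actual backing storage, and the exact signature is assumed):

// Sketch: loops are stored in descending order of the header's post order
// number, so the header lookup can binary search instead of scanning linearly.
FlowGraphNaturalLoop* FlowGraphNaturalLoops::GetLoopByHeader(BasicBlock* block)
{
    unsigned lo = 0;
    unsigned hi = (unsigned)m_loops.size();

    while (lo < hi)
    {
        unsigned                    mid      = lo + (hi - lo) / 2;
        FlowGraphNaturalLoop* const loop     = m_loops[mid];
        unsigned const              headerPO = loop->GetHeader()->bbPostorderNum;

        if (headerPO == block->bbPostorderNum)
        {
            // Post order numbers are unique per block, so this loop's header is "block".
            return loop;
        }

        // Descending order: larger header post order numbers sit at smaller indices.
        if (headerPO > block->bbPostorderNum)
        {
            lo = mid + 1;
        }
        else
        {
            hi = mid;
        }
    }

    return nullptr; // "block" is not a loop header
}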

amanasifkhalid (Member Author)

Agreed -- is it alright if I do this in a followup?

@jakobbotsch (Member) left a comment

The utility looks good to me, but I think to be safe we should make GetLoopByHeader use a binary search.

@kunalspathak (Member) left a comment

LGTM

@amanasifkhalid (Member Author)

/ba-g blocked by warnings from 'System.Text.Json' security vulnerabilities

amanasifkhalid merged commit e32148a into dotnet:main on Oct 10, 2024
104 of 108 checks passed
amanasifkhalid deleted the loop-aware-rpo branch on October 10, 2024 04:40
rzikm pushed a commit to rzikm/dotnet-runtime that referenced this pull request Oct 11, 2024
github-actions bot locked and limited conversation to collaborators on Nov 9, 2024