
JIT: Clean up liveness #103809

Merged: 8 commits, Jun 25, 2024

Conversation

jakobbotsch (Member) commented Jun 21, 2024:

We have a DFS tree available in both early liveness and SSA's liveness, and we can use it to make the data flow cheaper by running it in post-order over the DFS tree. This lets us propagate the maximal amount of knowledge in each iteration and also stop the data flow early when there is no cycle in the DFS tree.
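
To make the shape of this concrete, here is a minimal sketch of a post-order backward data flow with the early out; the `Block`/`DfsTree` types and the bit-set representation of locals are illustrative assumptions, not the JIT's actual data structures:

```cpp
#include <cstdint>
#include <vector>

// Illustrative sketch only -- not the JIT's actual code. `Block` and `DfsTree`
// are hypothetical stand-ins for BasicBlock and the flow graph DFS tree.
struct Block
{
    std::vector<Block*> succs;
    uint64_t use = 0;     // locals read before any write in this block
    uint64_t def = 0;     // locals written in this block
    uint64_t liveIn = 0;  // result: live on entry
    uint64_t liveOut = 0; // result: live on exit
};

struct DfsTree
{
    std::vector<Block*> postOrder; // for any edge u->v in a DAG, v precedes u here
    bool hasCycle;                 // whether the DFS found a back edge
};

// Liveness is a backward problem, so a post-order visit sees (most) successors
// before their predecessors and propagates the maximal amount of information
// per pass. With no cycle in the flow graph, one pass reaches the fixed point.
void ComputeLiveness(const DfsTree& dfs)
{
    bool changed;
    do
    {
        changed = false;
        for (Block* block : dfs.postOrder)
        {
            uint64_t liveOut = 0;
            for (Block* succ : block->succs)
            {
                liveOut |= succ->liveIn;
            }
            uint64_t liveIn = block->use | (liveOut & ~block->def);
            changed |= (liveIn != block->liveIn) || (liveOut != block->liveOut);
            block->liveIn  = liveIn;
            block->liveOut = liveOut;
        }
    } while (changed && dfs.hasCycle); // stop the data flow early when acyclic
}
```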

We do not have the DFS tree available in lowering where we also call liveness. However, lowering was already iterating all blocks to remove dead blocks; switch this to fgDfsBlocksAndRemove to remove dead blocks and compute the DFS tree in one go, and remove the old code doing this.
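
A hedged sketch of the fused pattern follows; all types and helpers below are hypothetical stand-ins, and only the name `fgDfsBlocksAndRemove` comes from the actual change (which presumably handles far more bookkeeping):

```cpp
#include <algorithm>
#include <vector>

// Illustrative fusion of "remove dead blocks" with "compute the DFS tree".
struct Block
{
    std::vector<Block*> succs;
    bool visited = false;
};

struct FlowGraph
{
    Block* entry = nullptr;
    std::vector<Block*> blocks; // ownership elided for the sketch
};

static void Dfs(Block* block, std::vector<Block*>& postOrder)
{
    block->visited = true;
    for (Block* succ : block->succs)
    {
        if (!succ->visited)
        {
            Dfs(succ, postOrder);
        }
    }
    postOrder.push_back(block);
}

// One walk builds the post-order; blocks the DFS never reached are dead and
// can be dropped in the same pass, replacing lowering's separate removal loop.
std::vector<Block*> DfsBlocksAndRemove(FlowGraph& fg)
{
    std::vector<Block*> postOrder;
    Dfs(fg.entry, postOrder);
    fg.blocks.erase(std::remove_if(fg.blocks.begin(), fg.blocks.end(),
                                   [](Block* b) { return !b->visited; }),
                    fg.blocks.end());
    return postOrder;
}
```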

Additionally, there was a bunch of logic in liveness to consider debug scopes for debug codegen. I'm not sure what this code is doing; we do not run liveness in debug codegen, so it seems like it can just be removed.

jakobbotsch (Member, Author) commented:

We have a problem with switching lowering's liveness to use the DFS tree: we do not model throw helpers properly in the flow graph, which means that they are considered unreachable. With this PR that means they do not get `this` marked as live when `lvaKeepAliveAndReportThis` is true, or the generic context marked live when `lvaReportParamTypeArg` is true. This would not be a problem if they were truly unreachable, but we end up materializing edges to them in the backend, making this a problem.

One possibility is to do another pass over the throw helper blocks in liveness and mark the volatile vars live in them. I'm not a big fan of that. Ideally `VisitAllSuccs` would return the throw helpers for the blocks that may get edges to them, but I'm not sure how easy that is to do.
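
For illustration, a minimal sketch of what such an extra pass could look like, assuming a bit-set liveness representation; the types and names are hypothetical, not the JIT's:

```cpp
#include <cstdint>
#include <vector>

// Illustrative only: after the main data flow, force the "volatile" vars live
// into the throw helper blocks, since the flow graph has no edges to them but
// the backend may still materialize such edges.
struct Block
{
    uint64_t liveIn = 0; // bit set of tracked locals live on entry
};

void MarkThrowHelpersLive(std::vector<Block*>& throwHelpers, uint64_t volatileVars)
{
    // volatileVars would hold e.g. `this` (lvaKeepAliveAndReportThis) and the
    // generic context (lvaReportParamTypeArg) as tracked-index bit positions.
    for (Block* helper : throwHelpers)
    {
        helper->liveIn |= volatileVars;
        // Per the follow-up below, this alone is insufficient: a helper inside
        // a `try` also needs its EH successors' live-in vars treated as live.
    }
}
```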

cc @AndyAyersMS, any suggestions here?

jakobbotsch (Member, Author) commented:

> One possibility is to do another pass over the throw helper blocks in liveness and mark the volatile vars live in them. I'm not a big fan of that.

Pushed a commit that does this, but it is not sufficient; these blocks also need to consider their EH successors' vars as live.

Even with this hacky fix there is still a correctness issue: not modelling these edges means we won't build proper SSA that takes this kind of flow into account. I'll see if I can create an example.

jakobbotsch (Member, Author) commented:

> Even with this hacky fix there is still a correctness issue: not modelling these edges means we won't build proper SSA that takes this kind of flow into account. I'll see if I can create an example.

Hmm, this will probably be ok since we should add the necessary PHIs as part of the predecessors that will be in the try region.

jakobbotsch (Member, Author) commented:

Diffs seem to come from the fact that removing dead blocks before the first liveness pass has benefits in some cases, particularly when it results in liveness no longer marking a local as EH-live because we removed a dead try region.

There are also cases where it allows `fgUpdateFlowGraph` to make more changes, e.g. by removing some blocks that now allow more compaction to take place. Before this change we wouldn't get rid of the blocks until after `fgUpdateFlowGraph` and would end up with uncompacted blocks; this can result in more native <-> IL mappings. This was the cause of one 0-size diff I checked.

AndyAyersMS (Member) commented:

> cc @AndyAyersMS, any suggestions here?

Not really... I wonder sometimes if we should just make the throw helper flow explicit earlier and optimize it away when not needed.

jakobbotsch (Member, Author) commented:

> Not really... I wonder sometimes if we should just make the throw helper flow explicit earlier and optimize it away when not needed.

I think one improvement would be to create the BasicBlocks themselves as late as possible, so that getting all their state (like liveness) right is that code's responsibility, and so that we minimize the time these non-modelled basic blocks are around.

jakobbotsch (Member, Author) commented:

cc @dotnet/jit-contrib PTAL @AndyAyersMS

Diffs come from removing unreachable blocks before the liveness pass (see above). Some nice throughput (TP) improvements.

AndyAyersMS (Member) commented:

> Additionally, there was a bunch of logic in liveness to consider debug scopes for debug codegen. I'm not sure what this code is doing; we do not run liveness in debug codegen, so it seems like it can just be removed.

Once upon a time we tracked lifetimes in debug code instead of reporting all GC refs as untracked. This changed in dotnet/coreclr#9869. Then dotnet/coreclr#12665 simplified minopts liveness accordingly.

You should remove `JitMinOptsTrackGCrefs` and related code; there is no going back.

AndyAyersMS (Member) left a review comment:

Changes LGTM, but I recommend you do a bit more cleanup and remove the option to enable tracked GC reporting in minopts.

jakobbotsch (Member, Author) commented:

Hmm, looks like there's a bunch of new 0-sized diffs from removing `JitMinOptsTrackGCrefs` (it previously defaulted to 1 outside xarch). Need to look into what those diffs are -- presumably something in GC info, given that it had a use in `gcMakeRegPtrTable`.

jakobbotsch (Member, Author) commented:

The diff in the jitdump from an example (`****** START compiling System.Numerics.BitOperations:IsPow2(ulong):ubyte`):

```diff
-Defining 2 call sites:
-    Offset 0x2c, size 4.
-    Offset 0x50, size 4.
+Defining 0 call sites:
```

Looks like it's coming from here:

```cpp
if (noTrackedGCSlots && regMask == 0)
{
    // No live GC refs in regs at the call -> don't record the call.
}
```

That is, with `noTrackedGCSlots` we sometimes do not even record in the GC info that there are call sites. Before this PR this bool was true in MinOpts compilations, but only when compiling for x86 or x64; after this PR it is true in MinOpts compilations for all targets.
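
As an illustration of the pre-PR gating, here is a simplified, assumed reconstruction (not the JIT's exact code) of how a per-target default like `JitMinOptsTrackGCrefs` produces the xarch-only behavior:

```cpp
// Simplified, assumed reconstruction of the per-target default that produced
// the xarch-only behavior; names mirror the discussion, not the JIT sources.
#if defined(TARGET_X86) || defined(TARGET_AMD64)
#define MINOPTS_TRACK_GC_REFS_DEFAULT 0 // xarch: untracked -> skip empty call sites
#else
#define MINOPTS_TRACK_GC_REFS_DEFAULT 1 // other targets: keep recording call sites
#endif

static bool ComputeNoTrackedGCSlots(bool isMinOpts)
{
    // The "don't record the call" branch above only fires when this is true,
    // so before this PR it fired only for MinOpts code on x86/x64.
    return isMinOpts && (MINOPTS_TRACK_GC_REFS_DEFAULT == 0);
}
```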

I checked on benchmarks.run_pgo.windows.arm64 which has 49901 MinOpts contexts (out of ~103k total contexts). In 34188 of those the optimization kicks in, and overall it reduces the total GC info size for those MinOpts contexts from 1303717 bytes to 1034336 bytes. The total size of the GC info including the optimized contexts goes from 5091149 bytes to 4821768 bytes.
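
In other words, both deltas are the same 269381 bytes (1303717 − 1034336 = 5091149 − 4821768 = 269381): roughly a 20.7% reduction in the MinOpts GC info, and about 5.3% of the total GC info size.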

The downside is that the new logic to allow suspension on return addresses does not kick in if we do not record the calls in the GC info. Regardless, I don't think this PR should be changing this behavior, so I am going to change it back to what it was (keeping the difference between xarch and other platforms), but it is definitely something to reconsider separately.

cc @VSadov @jkotas

jkotas (Member) commented Jun 24, 2024:

> the difference between xarch and other platforms

This came from https://github.com/dotnet/coreclr/pull/9869/files#diff-081cd7404db539556560f8746ba2f1928ec14424fd64192aba02316bedac4183R219. I do not see a reason for having platform differences in GC encoding like this one. I think we should enable the x64 path everywhere.

I think it is fine to have worse GC suspension behavior with MinOpts. MinOpts means suboptimal performance, and GC suspension tail latency is part of the bundle.

jakobbotsch (Member, Author) commented:

>> the difference between xarch and other platforms
>
> This came from https://github.com/dotnet/coreclr/pull/9869/files#diff-081cd7404db539556560f8746ba2f1928ec14424fd64192aba02316bedac4183R219. I do not see a reason for having platform differences in GC encoding like this one. I think we should enable the x64 path everywhere.
>
> I think it is fine to have worse GC suspension behavior with MinOpts. MinOpts means suboptimal performance, and GC suspension tail latency is part of the bundle.

Sounds good to me, I'll submit a separate PR to enable the optimization globally once this one makes it in.

VSadov (Member) commented Jun 24, 2024:

> The downside is that the new logic to allow suspension on return addresses does not kick in if we do not record the calls in the GC info.

It is something to think about. Thanks for pointing this out!

MinOpts is generally not a problem for suspension latencies. I think that is because the code is not very tight: if we do not inline much, there should be lots of opportunity for hijacking as well. If some call sites are not recorded and we thus lose some interruptibility points, it might not be a big deal.

Not having calls recorded could be inconvenient for some other scenarios, like putting hijacking on the same plane as asynchronous suspension. It would still be doable, but without call sites recorded it will be a bit of a leap of faith: "please suspend me at the caller location; believe me, the location is GC safe, even though there is no call site reported for it...".
Maybe it just means that some asserts will not be as precise as they could be.
