-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Clean up liveness #103809
JIT: Clean up liveness #103809
Conversation
We have a DFS tree available in both early liveness and SSA's liveness, and we can use it to make the data flow cheaper by running in an RPO over the DFS tree. This allows us to propagate the maximal amount of knowledge in each iteration and also to stop the data flow early when there is no cycle in the DFS tree. We do not have the DFS tree available in lowering where we also call liveness. However, lowering was already iterating all blocks to remove dead blocks; switch this to `fgDfsBlocksAndRemove` to remove dead blocks and compute the DFS tree in one go, and remove the old code doing this. Additionally there was a bunch of logic in liveness to consider debug scopes for debug codegen. I'm not sure what this code is doing; we do not run liveness in debug codegen, so seems like it can just be removed.
We have a problem with switching lowering's liveness to use the DFS tree: we do not model throw helpers properly in the flow graph, which means that they are considered unreachable. With this PR that means they do not get One possibility is to do another pass over the throw helper blocks in liveness and then mark the volatile vars in them. I'm not a big fan of that. Ideally cc @AndyAyersMS, any suggestions here? |
Pushed a commit that does this, but it is not sufficient; these blocks also need to consider their EH successor's vars as live. Even doing it this hacky way is a correctness issue; not modelling it means we won't build proper SSA that takes this kind of flow into account. I'll see if I can create an example. |
Hmm, this will probably be ok since we should add the necessary PHIs as part of the predecessors that will be in the try region. |
Diffs seem to be because removing dead blocks before the first liveness has some benefits in some cases; particularly if it results in liveness no longer marking a local as EH-live because we removed a dead try region. There are also cases where it allows |
Not really... I wonder sometimes if we should just make the throw helper flow explicit earlier and optimize it away when not needed. |
I think an improvement on the situation would be to at least create the |
cc @dotnet/jit-contrib PTAL @AndyAyersMS Diffs. They come from removing unreachable blocks before the liveness pass, see above. Some nice TP improvements. |
Once upon a time we tracked lifetimes in debug code instead of reporting all gc refs as untracked. This changed in dotnet/coreclr#9869. Then dotnet/coreclr#12665 simplified minopts liveness accordingly. You should remove |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes LGTM, but I recommend you do a bit more cleanup and remove the option to enable tracked GC reporting in minopts.
Hmm, looks like there's a bunch of new 0-sized diffs from removing |
The diff in the jitdump from an example (****** START compiling System.Numerics.BitOperations:IsPow2(ulong):ubyte): -Defining 2 call sites:
- Offset 0x2c, size 4.
- Offset 0x50, size 4.
+Defining 0 call sites: Looks like it's coming from here: runtime/src/coreclr/jit/gcencode.cpp Lines 4497 to 4500 in 71ab8f1
That is, it seems with I checked on benchmarks.run_pgo.windows.arm64 which has 49901 MinOpts contexts (out of ~103k total contexts). In 34188 of those the optimization kicks in, and overall it reduces the total GC info size for those MinOpts contexts from 1303717 bytes to 1034336 bytes. The total size of the GC info including the optimized contexts goes from 5091149 bytes to 4821768 bytes. The downside is that the new logic to allow suspension on return addresses does not kick in if we do not record the calls in the GC info. Regardless I don't think this PR should be changing this behavior here, so I am going to change it back to what it was (with the difference between xarch and other platforms), but definitely something to reconsider separately. |
This came from https://github.com/dotnet/coreclr/pull/9869/files#diff-081cd7404db539556560f8746ba2f1928ec14424fd64192aba02316bedac4183R219 . I do not see a reason for having platform differences in GC encoding like this one. I think we should enable the x64 path everywhere. I think it is fine to have worse GC suspension behavior with minopts. Minops means suboptimal performance, and GC suspension tail latency is part of the bundle. |
Sounds good to me, I'll submit a separate PR to enable the optimization globally once this one makes it in. |
It is something to think about. Thanks for pointing at this! MinOpts is generally not a problem with suspension latencies. I think that is because the code is not very tight. If we do not inline much there should be lots of opportunity for hijacking as well. If some call sites are not recorded and thus we lose some interruptibility points it might not be a big deal. Not having calls recorded could be inconvenient for some other scenarios. Like putting hijacking on the same plan as asynchronous suspension. It would be still be doable, but without call sites recorded, it will be a bit of a leap of faith - "please suspend me at the caller location, believe me the location is GC safe, even though there is no callsite reported for it...". |
We have a DFS tree available in both early liveness and SSA's liveness, and we can use it to make the data flow cheaper by running in a post-order over the DFS tree. This allows us to propagate the maximal amount of knowledge in each iteration and also to stop the data flow early when there is no cycle in the DFS tree.
We do not have the DFS tree available in lowering where we also call liveness. However, lowering was already iterating all blocks to remove dead blocks; switch this to
fgDfsBlocksAndRemove
to remove dead blocks and compute the DFS tree in one go, and remove the old code doing this.Additionally there was a bunch of logic in liveness to consider debug scopes for debug codegen. I'm not sure what this code is doing; we do not run liveness in debug codegen, so seems like it can just be removed.