Cranelift: should we do branch optimizations in MachBuffer? #6818
(Thanks to @jameysharp, @elliottt and others for talking through this a bit offline, and especially Jamey for noting the implications on block order above. @alexcrichton if you want to discuss further we can definitely do so in the next Cranelift meeting, if that's easier!)
Thanks for taking the time to write this up! I've added this to the agenda for tomorrow to go over if there's anything remaining, mostly because I'm still curious to dig into more details if you and others are willing!
My gut is that this level of optimization is best done earlier in the pipeline, which is where my thinking stems from, but there are other concrete issues which I think would rely on doing branch-related optimizations earlier rather than at the very end.
Basically my thinking is that while we could make MachBuffer fancier with respect to branching, it seems like much of the benefit should come from the mid-end optimizations and their knock-on effects.
This is a good point, yeah, and I think a good way to help me frame my thinking. On one hand we as a compiler have the problem of optimizing a CFG: for example A->B->C could become A+B -> C under some conditions, and things like that. On the other hand there's the separate problem of laying that CFG out linearly. I feel like MachBuffer has historically conflated these two operations, performing both CFG optimization and optimal linear layout of a CFG at the same time. My thinking has been that the CFG-optimizing parts of MachBuffer would ideally be moved into an independent pass in Cranelift to get executed at some point, perhaps intertwined with egraph optimizations somehow (e.g. to make use of constant propagation and/or GVN).

I'll also clarify that I barely know what an EBB is, much less advocate for it. And again, if anything I'm thinking comes with a fundamental compile-time or runtime performance penalty then it's not worth it; I'm trying to see if we can have our cake and eat it too. I'll definitely admit that this may all be stemming from me misunderstanding MachBuffer and its optimizations. If all it's doing is the optimal linear layout of a CFG without extra optimizations, then there's no escaping that class of problem and there's nothing more that can be done.

All that being said, your observation about the recent-branches list being empty on islands is I think a great one and would be a way to solve the suboptimal branch generation from #6804. I might try to give that a stab in the future!
Ok, I talked more with folks at the Cranelift meeting today, and the conclusion I took away was that the MachBuffer optimizations are all required at this time to have an optimal linear layout of the CFG we input. In that sense, of the two operations I was mentioning above -- CFG optimization and optimal CFG layout -- the MachBuffer only deals with the latter, which all sounds good. With that I believe this issue is resolved, as the answer is "yes, the MachBuffer should continue to produce high-quality code". As for the one remaining issue of the eager promotion of veneers, I think @cfallin's idea will solve it well. I mentioned today I had a branch (which is here) which implements that idea.
Given that there is a small, ISA-specific, constant number of possible deadline-distances (e.g. 2^20, 2^27, 2^32), it occurs to me a priority queue isn't strictly necessary. We could keep one vec of fixups per deadline distance instead.
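For concreteness, here's a minimal sketch (with hypothetical types, not MachBuffer's actual fields) of what that bucketing could look like: one vec of fixups per ISA branch-range class, with the earliest deadline found by scanning the bucket heads instead of maintaining a heap.

```rust
/// Hypothetical record of a forward-reference fixup awaiting resolution.
struct Fixup {
    offset: u32,   // where the branch instruction lives in the buffer
    deadline: u32, // last offset at which the branch can still reach a veneer
}

/// One bucket per possible branch range (e.g. 2^20, 2^27, 2^32 bytes).
/// Within a bucket, deadlines are monotonically increasing because offsets
/// only ever grow and the range is fixed.
struct FixupBuckets {
    distances: &'static [u32],
    buckets: Vec<Vec<Fixup>>,
}

impl FixupBuckets {
    fn new(distances: &'static [u32]) -> Self {
        let buckets = distances.iter().map(|_| Vec::new()).collect();
        FixupBuckets { distances, buckets }
    }

    /// Register a fixup for a branch at `offset` using range class `class`.
    fn push(&mut self, offset: u32, class: usize) {
        let deadline = offset + self.distances[class];
        self.buckets[class].push(Fixup { offset, deadline });
    }

    /// The earliest deadline overall is the minimum over the bucket heads --
    /// a constant number of comparisons, with no heap to maintain.
    fn earliest_deadline(&self) -> Option<u32> {
        self.buckets
            .iter()
            .filter_map(|b| b.first().map(|f| f.deadline))
            .min()
    }
}
```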
I think that would work as well, yeah, but I think that'd still require the "flush" behavior during island insertion, right? Otherwise I'm at least not too worried about the overhead of a `BinaryHeap`.
It's possible we can make do with that, yes. That said, and as I mentioned in the earlier thread, I worry a lot about complexity here, so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on the ISA-specific set of deadline distances.
I figured we'd just have what amounts to a map keyed on the deadline distance. But I don't quite understand the interaction between veneer emission and label chomping, so I'll take your word for it, both of you, that the binary heap is fine.
* Don't force veneers on island emission

  This commit is a follow-up to #6804 and the discussion on #6818. It undoes some of the changes from #6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in #6818. Namely, fixups are now tracked in a `pending_fixups` list for editing and modification, and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally, calculation of the size of the pending island is now done as a single calculation rather than by updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission, so the previous logic of "reset the island size to zero and have it get recalculated as the island is emitted" no longer applies. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst-case size, but if necessary I think this could be made more precise by tracking each veneer in a counter. Closes #6818

* Keep a running count for pending fixup deadline

  Update this as pending fixups are added and then clear it out when islands are emitted.

* Don't force all fixups to go through a binary heap

  Instead process them immediately if they're ready.
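As a rough sketch of the shape that commit describes (hypothetical types and field names, not the real MachBuffer internals): branches with unresolved labels stay in an editable `pending_fixups` list so the branch-chomping peephole can still rewrite or drop them, and only when an island is emitted are the survivors flushed into a deadline-ordered `BinaryHeap`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Fixup {
    deadline: u32, // last offset at which the branch can still reach a veneer
    offset: u32,   // where the branch instruction lives
}

struct Buffer {
    /// Fixups since the last island; editable by branch simplification.
    pending_fixups: Vec<Fixup>,
    /// Long-lived fixups, popped in deadline order when islands are emitted.
    fixup_records: BinaryHeap<Reverse<Fixup>>,
    cur_offset: u32,
}

impl Buffer {
    fn emit_island(&mut self, worst_case_veneer_size: u32) {
        // Move the editable pending fixups into the heap; after this point
        // they can no longer be chomped by branch simplification.
        for f in self.pending_fixups.drain(..) {
            self.fixup_records.push(Reverse(f));
        }
        // Emit veneers only for fixups whose deadline is actually near,
        // rather than eagerly promoting everything.
        while let Some(Reverse(f)) = self.fixup_records.peek() {
            if f.deadline > self.cur_offset + worst_case_veneer_size {
                break;
            }
            let Reverse(_fixup) = self.fixup_records.pop().unwrap();
            // ... emit a veneer here and re-register a fixup pointing at it ...
        }
    }
}
```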
@alexcrichton brought up in #6810
This issue is meant to answer the question: should we do this refactor, and why or why not?
First, addressing some specific points:
It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in `MachBuffer` precludes replacing a `brif` with a `jump` when we know the conditional is constant (and we hope to do that optimization in the future!).

Unfortunately this is also not quite the case I think, for several reasons:
* `MachBuffer`'s rearranging of branches requires knowledge of block order, because it is taking advantage of fallthroughs. We don't know block order when we're doing midend optimizations, and the flexibility is important because regalloc may need to insert edge moves, which can leave a block whose body is just a jump; `MachBuffer` can then eat it (chomp the one jump in the body and redirect the label).

So basically pass ordering and regalloc needs mean that block ordering is only known late, and this means that block-order-dependent changes cannot happen in the mid-end.
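As an aside on the constant-branch folding mentioned in the quoted reply above, here is a purely illustrative sketch on a toy IR (not Cranelift's actual data structures) of the mid-end rewrite in question: a `brif` whose condition is a known constant becomes an unconditional `jump`, with no MachBuffer involvement at all.

```rust
#[derive(Clone, Copy)]
struct Block(u32);
#[derive(Clone, Copy)]
struct Value(u32);

enum Terminator {
    /// Two-target conditional branch: every destination is named explicitly.
    Brif { cond: Value, then_dst: Block, else_dst: Block },
    /// Unconditional branch.
    Jump { dst: Block },
}

/// Fold a conditional branch when the condition is a known constant.
/// `known_const` stands in for whatever constant-propagation facts the
/// mid-end already has available.
fn fold_constant_branch(
    term: Terminator,
    known_const: impl Fn(Value) -> Option<bool>,
) -> Terminator {
    match term {
        Terminator::Brif { cond, then_dst, else_dst } => match known_const(cond) {
            Some(true) => Terminator::Jump { dst: then_dst },
            Some(false) => Terminator::Jump { dst: else_dst },
            None => Terminator::Brif { cond, then_dst, else_dst },
        },
        other => other,
    }
}
```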
I think both of the above are possibly springing from a slight misinterpretation of what purpose the `MachBuffer`'s branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style branches". The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter is known as "EBBs" (extended basic blocks) and was formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.

EBBs were removed 4 years or so ago (I think? it was before my time), for good reason: they require subtle and not-well-understood tweaks to common algorithms, they're hard to reason about, and they infect all of the compiler backend. Basically, moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.
EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant thing that precludes a bunch of other optimizations as noted above.
So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.
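To make the distinction concrete, here is a sketch (again with made-up types) of the lowering that the second transform performs: two-target terminators become machine-style branch-or-fallthrough sequences, and the choice depends entirely on which block happens to come next in the final layout, which is why this step can only run once block order is fixed.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Block(u32);
#[derive(Clone, Copy)]
struct Cond(u32);

enum Term {
    Brif { cond: Cond, then_dst: Block, else_dst: Block },
    Jump { dst: Block },
}

enum MachBranch {
    /// Conditional branch; the not-taken edge falls through.
    CondBranch { cond: Cond, taken: Block },
    UncondBranch { dst: Block },
    // A pure fallthrough needs no instruction at all.
}

fn lower_terminator(term: &Term, next_in_layout: Option<Block>) -> Vec<MachBranch> {
    match *term {
        // The jump target is the next block in layout order: emit nothing.
        Term::Jump { dst } if Some(dst) == next_in_layout => vec![],
        Term::Jump { dst } => vec![MachBranch::UncondBranch { dst }],
        Term::Brif { cond, then_dst, else_dst } => {
            if Some(else_dst) == next_in_layout {
                // Branch if taken, fall through to the else block.
                vec![MachBranch::CondBranch { cond, taken: then_dst }]
            } else if Some(then_dst) == next_in_layout {
                // The then block is the fallthrough; the real lowering would
                // also invert the condition (elided in this sketch).
                vec![MachBranch::CondBranch { cond, taken: else_dst }]
            } else {
                // Neither successor is next: conditional branch plus a jump.
                vec![
                    MachBranch::CondBranch { cond, taken: then_dst },
                    MachBranch::UncondBranch { dst: else_dst },
                ]
            }
        }
    }
}
```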
Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label resolution algorithm. I don't think this is necessarily or fundamentally the case. The label resolution (island/veneer insertion) and branch optimizations interact in subtle ways, but they are separate, tied together by one invariant: the list of most-recent branches contiguous with the end of the buffer (this is the visibility of the peephole opts) must correspond to the latest fixups in the fixup list.
The veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because none are contiguous with the end of the buffer anymore), so actually we can do whatever we want with more complex data structures and algorithms, without violating the invariant. The only requirement is that between islands, we do keep a linear list of most recent fixups (we can clear it whenever we emit a non-branch instruction).
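A condensed sketch of that invariant (field names are illustrative, not MachBuffer's real ones): the editable most-recent-branches list only ever describes branches contiguous with the end of the buffer, and both islands and ordinary non-branch instructions clear it, so longer-lived fixup bookkeeping can use whatever structure it likes without the peephole ever seeing a stale view.

```rust
struct PendingBranch {
    start: u32,         // offset of the branch within the buffer
    fixup_index: usize, // index of its entry in `pending_fixups`
}

struct Buffer {
    data: Vec<u8>,
    /// Branches contiguous with the end of `data`; the peephole may rewrite
    /// or delete these (and their fixups) freely.
    latest_branches: Vec<PendingBranch>,
    /// Stand-in for the real fixup records kept between islands.
    pending_fixups: Vec<u32>,
}

impl Buffer {
    fn emit_non_branch(&mut self, bytes: &[u8]) {
        // Anything after a non-branch instruction is no longer a contiguous
        // run of branches, so the peephole's view is invalidated.
        self.latest_branches.clear();
        self.data.extend_from_slice(bytes);
    }

    fn emit_island(&mut self) {
        // Islands also break contiguity; the fixups themselves survive and
        // can be handed off to whatever cleverer structure tracks deadlines.
        self.latest_branches.clear();
        // ... emit veneers / constants here ...
    }
}
```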
So from my perspective, there is a proposal for a large refactor of the compiler backend, with high risk, motivated by two reasons that I don't think are really the case (branch constant folding gets better? --> no, different and orthogonal things; precludes more advanced veneer fixup? --> no, the invariant still allows this). It's entirely possible I've missed something or misunderstood some part of the proposal though -- curious what others think!