Cranelift: should we do branch optimizations in MachBuffer? #6818
(Thanks to @jameysharp, @elliottt and others for talking through this a bit offline, and especially Jamey for noting the implications on block order above. @alexcrichton if you want to discuss further we can definitely do so in the next Cranelift meeting, if that's easier!)
Thanks for taking the time to write this up! I've added this to the agenda for tomorrow to go over if there's anything remaining, mostly because I'm still curious to dig into more details if you and others are willing!
My gut is that this level of optimization is best done earlier in the pipeline, which is where my thinking stems from, but there are other concrete issues which I think would rely on doing branch-related optimizations earlier rather than at the very end.
Basically my thinking is that while we could make MachBuffer fancier with respect to branching, it seems like much of the benefit should come from the mid-end optimizations and their knock-on effects.
This is a good point, yeah, and I think a good way to help me frame my thinking. On one hand we as a compiler have the problem of optimizing a CFG: for example A->B->C could become A+B -> C under some conditions, and things like that. On the other hand there's the separate problem of laying that CFG out linearly. I feel like MachBuffer has historically conflated these two operations, performing both CFG optimization and optimal linear layout of a CFG at the same time. My thinking has been that the CFG-optimizing parts of MachBuffer would ideally be moved into an independent pass in Cranelift to get executed at some point, perhaps intertwined with egraph optimizations somehow (e.g. to make use of constant propagation and/or GVN).

I'll also clarify that I barely know what an EBB is, much less advocate for it. And again, if anything I'm thinking comes with a fundamental compile-time or runtime performance penalty then it's not worth it; I'm trying to see if we can have our cake and eat it too. I'll definitely admit that this may all be stemming from me misunderstanding MachBuffer and its optimizations. If all it's doing is the optimal linear layout of a CFG without extra optimizations, then there's no escaping that class of problem and there's nothing more that can be done.

All that being said, your observation about the recent-branches list being empty on islands is I think a great one and would be a way to solve the suboptimal branch generation from #6804. I might try to give that a stab in the future!
Ok, I talked more with folks at the Cranelift meeting today, and the conclusion I took away was that the MachBuffer optimizations are all required at this time to have an optimal linear layout of the CFG we input. In that sense, of the two operations I was mentioning above -- CFG optimization and optimal CFG layout -- the MachBuffer only deals with the latter, which all sounds good. With that I believe this issue is resolved, as the answer is "yes, the MachBuffer should continue to produce high-quality code". As for the one remaining issue of the eager promotion of veneers, I think @cfallin's idea will solve it well. I mentioned today I had a branch (which is here) which implements that idea.
Given that there is a small, ISA-specific, constant number of possible deadline-distances (e.g. 2^20, 2^27, 2^32), it occurs to me a priority queue isn't strictly necessary. We could keep one vec of fixups per deadline distance instead.
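For concreteness, here's a minimal sketch (with hypothetical types, not MachBuffer's actual fields) of what that bucketing could look like: one vec of fixups per ISA branch-range class, with the earliest deadline found by scanning the bucket heads instead of maintaining a heap.

```rust
/// Hypothetical record of a forward-reference fixup awaiting resolution.
struct Fixup {
    offset: u32,   // where the branch instruction lives in the buffer
    deadline: u32, // last offset at which the branch can still reach a veneer
}

/// One bucket per possible branch range (e.g. 2^20, 2^27, 2^32 bytes).
/// Within a bucket, deadlines are monotonically increasing because offsets
/// only ever grow and the range is fixed.
struct FixupBuckets {
    distances: &'static [u32],
    buckets: Vec<Vec<Fixup>>,
}

impl FixupBuckets {
    fn new(distances: &'static [u32]) -> Self {
        let buckets = distances.iter().map(|_| Vec::new()).collect();
        FixupBuckets { distances, buckets }
    }

    /// Register a fixup for a branch at `offset` using range class `class`.
    fn push(&mut self, offset: u32, class: usize) {
        let deadline = offset + self.distances[class];
        self.buckets[class].push(Fixup { offset, deadline });
    }

    /// The earliest deadline overall is the minimum over the bucket heads --
    /// a constant number of comparisons, with no heap to maintain.
    fn earliest_deadline(&self) -> Option<u32> {
        self.buckets
            .iter()
            .filter_map(|b| b.first().map(|f| f.deadline))
            .min()
    }
}
```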
I think that would work as well, yeah, but I think that'd still require the "flush" behavior during island insertion, right? Otherwise I'm at least not too worried about the overhead of a `BinaryHeap`.
It's possible we can make do with that, yes. That said, and as I mentioned in the earlier thread, I worry a lot about complexity here, so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on the ISA-specific set of deadline distances.
I figured we'd just have what amounts to a map keyed on the deadline distance. But I don't quite understand the interaction between veneer emission and label chomping, so I'll take your word for it, both of you, that the binary heap is fine.
* Don't force veneers on island emission

  This commit is a follow-up to #6804 and the discussion on #6818. It undoes some of the changes from #6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in #6818. Namely, fixups are now tracked in a `pending_fixups` list for editing and modification, and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally, calculation of the size of the pending island is now done as a single calculation rather than by updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission, so the previous logic of "reset the island size to zero and have it get recalculated as the island is emitted" no longer applies. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst-case size, but if necessary I think this could be made more precise by tracking each veneer in a counter. Closes #6818

* Keep a running count for pending fixup deadline

  Update this as pending fixups are added and then clear it out when islands are emitted.

* Don't force all fixups to go through a binary heap

  Instead process them immediately if they're ready.
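As a rough sketch of the shape that commit describes (hypothetical types and field names, not the real MachBuffer internals): branches with unresolved labels stay in an editable `pending_fixups` list so the branch-chomping peephole can still rewrite or drop them, and only when an island is emitted are the survivors flushed into a deadline-ordered `BinaryHeap`.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Fixup {
    deadline: u32, // last offset at which the branch can still reach a veneer
    offset: u32,   // where the branch instruction lives
}

struct Buffer {
    /// Fixups since the last island; editable by branch simplification.
    pending_fixups: Vec<Fixup>,
    /// Long-lived fixups, popped in deadline order when islands are emitted.
    fixup_records: BinaryHeap<Reverse<Fixup>>,
    cur_offset: u32,
}

impl Buffer {
    fn emit_island(&mut self, worst_case_veneer_size: u32) {
        // Move the editable pending fixups into the heap; after this point
        // they can no longer be chomped by branch simplification.
        for f in self.pending_fixups.drain(..) {
            self.fixup_records.push(Reverse(f));
        }
        // Emit veneers only for fixups whose deadline is actually near,
        // rather than eagerly promoting everything.
        while let Some(Reverse(f)) = self.fixup_records.peek() {
            if f.deadline > self.cur_offset + worst_case_veneer_size {
                break;
            }
            let Reverse(_fixup) = self.fixup_records.pop().unwrap();
            // ... emit a veneer here and re-register a fixup pointing at it ...
        }
    }
}
```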
@alexcrichton brought up in #6810
This issue is meant to answer the question: should we do this refactor, and why or why not?
First, addressing some specific points:
It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in `MachBuffer` precludes replacing a `brif` with a `jump` when we know the conditional is constant (and we hope to do that optimization in the future!).

Unfortunately this is also not quite the case I think, for several reasons:
* `MachBuffer`'s rearranging of branches requires knowledge of block order, because it is taking advantage of fallthroughs. We don't know block order when we're doing midend optimizations, and the flexibility is important because regalloc may need to insert edge moves, which can leave a block whose body is just a jump; `MachBuffer` can then eat it (chomp the one jump in the body and redirect the label).

So basically pass ordering and regalloc needs mean that block ordering is only known late, and this means that block-order-dependent changes cannot happen in the mid-end.
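As an aside on the constant-branch folding mentioned in the quoted reply above, here is a purely illustrative sketch on a toy IR (not Cranelift's actual data structures) of the mid-end rewrite in question: a `brif` whose condition is a known constant becomes an unconditional `jump`, with no MachBuffer involvement at all.

```rust
#[derive(Clone, Copy)]
struct Block(u32);
#[derive(Clone, Copy)]
struct Value(u32);

enum Terminator {
    /// Two-target conditional branch: every destination is named explicitly.
    Brif { cond: Value, then_dst: Block, else_dst: Block },
    /// Unconditional branch.
    Jump { dst: Block },
}

/// Fold a conditional branch when the condition is a known constant.
/// `known_const` stands in for whatever constant-propagation facts the
/// mid-end already has available.
fn fold_constant_branch(
    term: Terminator,
    known_const: impl Fn(Value) -> Option<bool>,
) -> Terminator {
    match term {
        Terminator::Brif { cond, then_dst, else_dst } => match known_const(cond) {
            Some(true) => Terminator::Jump { dst: then_dst },
            Some(false) => Terminator::Jump { dst: else_dst },
            None => Terminator::Brif { cond, then_dst, else_dst },
        },
        other => other,
    }
}
```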
I think both of the above are possibly springing from a slight misinterpretation of what purpose the `MachBuffer`'s branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style branches". The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter is known as "EBBs" (extended basic blocks) and was formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.

EBBs were removed 4 years or so ago (I think? it was before my time), for good reason: they require subtle and not-well-understood tweaks to common algorithms, they're hard to reason about, and they infect all of the compiler backend. Basically, moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.
EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant thing that precludes a bunch of other optimizations as noted above.
So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.
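To make the distinction concrete, here is a sketch (again with made-up types) of the lowering that the second transform performs: two-target terminators become machine-style branch-or-fallthrough sequences, and the choice depends entirely on which block happens to come next in the final layout, which is why this step can only run once block order is fixed.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
struct Block(u32);
#[derive(Clone, Copy)]
struct Cond(u32);

enum Term {
    Brif { cond: Cond, then_dst: Block, else_dst: Block },
    Jump { dst: Block },
}

enum MachBranch {
    /// Conditional branch; the not-taken edge falls through.
    CondBranch { cond: Cond, taken: Block },
    UncondBranch { dst: Block },
    // A pure fallthrough needs no instruction at all.
}

fn lower_terminator(term: &Term, next_in_layout: Option<Block>) -> Vec<MachBranch> {
    match *term {
        // The jump target is the next block in layout order: emit nothing.
        Term::Jump { dst } if Some(dst) == next_in_layout => vec![],
        Term::Jump { dst } => vec![MachBranch::UncondBranch { dst }],
        Term::Brif { cond, then_dst, else_dst } => {
            if Some(else_dst) == next_in_layout {
                // Branch if taken, fall through to the else block.
                vec![MachBranch::CondBranch { cond, taken: then_dst }]
            } else if Some(then_dst) == next_in_layout {
                // The then block is the fallthrough; the real lowering would
                // also invert the condition (elided in this sketch).
                vec![MachBranch::CondBranch { cond, taken: else_dst }]
            } else {
                // Neither successor is next: conditional branch plus a jump.
                vec![
                    MachBranch::CondBranch { cond, taken: then_dst },
                    MachBranch::UncondBranch { dst: else_dst },
                ]
            }
        }
    }
}
```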
Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label resolution algorithm. I don't think this is necessarily or fundamentally the case. The label resolution (island/veneer insertion) and branch optimizations interact in subtle ways, but they are separate, tied together by one invariant: the list of most-recent branches contiguous with the end of the buffer (this is the visibility of the peephole opts) must correspond to the latest fixups in the fixup list.
The veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because none are contiguous with the end of the buffer anymore), so actually we can do whatever we want with more complex data structures and algorithms, without violating the invariant. The only requirement is that between islands, we do keep a linear list of most recent fixups (we can clear it whenever we emit a non-branch instruction).
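A condensed sketch of that invariant (field names are illustrative, not MachBuffer's real ones): the editable most-recent-branches list only ever describes branches contiguous with the end of the buffer, and both islands and ordinary non-branch instructions clear it, so longer-lived fixup bookkeeping can use whatever structure it likes without the peephole ever seeing a stale view.

```rust
struct PendingBranch {
    start: u32,         // offset of the branch within the buffer
    fixup_index: usize, // index of its entry in `pending_fixups`
}

struct Buffer {
    data: Vec<u8>,
    /// Branches contiguous with the end of `data`; the peephole may rewrite
    /// or delete these (and their fixups) freely.
    latest_branches: Vec<PendingBranch>,
    /// Stand-in for the real fixup records kept between islands.
    pending_fixups: Vec<u32>,
}

impl Buffer {
    fn emit_non_branch(&mut self, bytes: &[u8]) {
        // Anything after a non-branch instruction is no longer a contiguous
        // run of branches, so the peephole's view is invalidated.
        self.latest_branches.clear();
        self.data.extend_from_slice(bytes);
    }

    fn emit_island(&mut self) {
        // Islands also break contiguity; the fixups themselves survive and
        // can be handed off to whatever cleverer structure tracks deadlines.
        self.latest_branches.clear();
        // ... emit veneers / constants here ...
    }
}
```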
So from my perspective, there is a proposal for a large refactor of the compiler backend, with high risk, motivated by two reasons that I don't think are really the case (branch constant folding gets better? --> no, different and orthogonal things; precludes more advanced veneer fixup? --> no, the invariant still allows this). It's entirely possible I've missed something or misunderstood some part of the proposal though -- curious what others think!