Cranelift: should we do branch optimizations in MachBuffer? #6818

Closed
cfallin opened this issue Aug 7, 2023 · 7 comments · Fixed by #6902
Comments

@cfallin
Member

cfallin commented Aug 7, 2023

@alexcrichton brought up in #6810:

> And again, to clarify, I'm not advocating for removing branch optimizations from the mach buffer here and now. I realize it's a deep refactoring that would take quite some work, and that this is the result of historical work and balancing efforts. I still believe, though, that this probably isn't the best place to do this, because it's extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch-folding with knock-on effects from that. For example, I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would fall out of it, in which case it may not be necessary for the mach buffer to be as clever as it is today.

This issue is meant to answer the question: should we do this refactor, and why or why not?

First, addressing some specific points:

> extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch-folding with knock-on effects from that.

It's unclear to me why this is the case -- could you clarify? Constant-branch folding can be done on CLIF in earlier stages; nothing in MachBuffer precludes replacing a `brif` with a `jump` when we know the conditional is constant (and we hope to do that optimization in the future!).

> For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that

Unfortunately this is also not quite the case, I think, for several reasons:

  • The MachBuffer's rearranging of branches requires knowledge of block order, because it is taking advantage of fallthroughs. We don't know block order when we're doing midend optimizations, and the flexibility is important because...
  • ... we also don't know if some blocks will become empty or not, as a result of optimizations or regalloc. For example, we need to split critical edges so regalloc has a place to put spills/reloads in all cases; but many of those split edges end up being an empty block, and MachBuffer can then eat it (chomp the one jump in the body and redirect the label; see the sketch below).

So, basically, pass ordering and regalloc needs mean that block order is only known late, which in turn means that block-order-dependent changes cannot happen in the mid-end.
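As a tiny sketch of the "eat the empty block" part (a toy stand-in for the label redirection MachBuffer does, not the real code): if a block's body is nothing but one unconditional jump, we can record a redirect from the block's label to the jump's target and drop the block, resolving chains of such redirects when labels are finally bound to offsets.

```rust
use std::collections::HashMap;

// Toy illustration only: `redirects` maps the label of an emptied block to
// the label its single jump pointed at. Resolving follows chains, so a run
// of empty split-edge blocks all collapse onto the real target.
fn resolve(redirects: &HashMap<u32, u32>, mut label: u32) -> u32 {
    while let Some(&next) = redirects.get(&label) {
        label = next;
    }
    label
}
```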


I think both of the above are possibly springing from a slight mis-interpretation of what purpose the MachBuffer's branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step that takes us from N-target branches (every block terminator always specifies all destination labels) to branch-or-fallthrough "machine-style branches". The former is the standard CFG representation in most SSA compilers and is convenient to work with, all the way down into regalloc2. The latter is known as "EBBs" (extended basic blocks) and was formerly used in Cranelift. Removing the lowering step means propagating the output side's invariants and properties (EBBs) upward into all VCode.
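To make that distinction concrete, here's a toy illustration (enums invented for this comment, not Cranelift or VCode types) of the two terminator styles:

```rust
// N-target form: every terminator names all of its successor labels,
// independent of any block layout.
enum CfgTerminator {
    Jump { dest: u32 },
    BrIf { cond: u32, taken: u32, not_taken: u32 },
}

// Machine/fallthrough form: a conditional branch names only its taken
// target and otherwise falls through to whichever block the final layout
// places next; a jump to the next block in layout can disappear entirely.
// This form only makes sense once block order is fixed.
enum MachTerminator {
    Jump { dest: u32 },
    CondBr { cond: u32, taken: u32 }, // not-taken path is the fallthrough
    Fallthrough,
}
```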

EBBs were removed 4 years or so ago (I think? it was before my time), for good reason: it requires subtle and not-well-understood tweaks to common algorithms, it's hard to reason about, and it infects all of the compiler backend. Basically moving back to EBBs would require a careful audit and refactor of regalloc2 and the ~30k lines of code in Cranelift that touch VCode, and that's a nonstarter.

EBBs also require knowledge of, and lock in, the block order, which is a weird nonlocal invariant thing that precludes a bunch of other optimizations as noted above.

So both branch-folding in machine-independent code and the lowering from N-target form to fallthrough form are useful transforms, but they serve different purposes. The former is an optimization and is optional; the latter is fundamental to the IR design, in that it lets us reason at a slightly higher semantic level and avoid huge complexities.


Finally, the original motivation for this discussion came from a perceived inability to make the branch simplifications work with a more complex forward-reference label resolution algorithm. I don't think this is necessarily or fundamentally the case. The label resolution (island/veneer insertion) and the branch optimizations interact in subtle ways, but they are separate, tied together by one invariant: the list of most-recent branches contiguous with the end of the buffer (this is the window visible to the peephole opts) must correspond to the latest fixups in the fixup list.

The veneer insertion happens only at island generation, and islands clear the "most recent branches" list (because none are contiguous with the end of the buffer anymore), so we can actually do whatever we want with more complex data structures and algorithms without violating the invariant. The only requirement is that, between islands, we keep a linear list of the most recent fixups (we can clear it whenever we emit a non-branch instruction).
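Spelled out as data structures, the bookkeeping is roughly shaped like this (simplified, made-up field names; an illustration of the invariant rather than the actual MachBuffer code):

```rust
struct Branch {
    start: u32,   // offset of this branch in the buffer
    end: u32,     // one past its end; for the newest entry this is the buffer length
    fixup: usize, // index of this branch's entry in `fixups`
}

struct Fixup {
    label: u32,    // label the reference points at
    offset: u32,   // where in the buffer the reference lives
    deadline: u32, // last offset at which it can still be resolved in range
}

struct Buffer {
    data: Vec<u8>,
    // Invariant: `latest_branches` describes only branches sitting contiguously
    // at the end of `data` (newest last), each referring to one of the most
    // recent entries of `fixups`. Emitting a non-branch instruction or an
    // island clears `latest_branches`; once it's empty, `fixups` can be stored
    // and reorganized however we like (sorted, heapified, ...) without
    // affecting the branch peephole optimizations.
    latest_branches: Vec<Branch>,
    fixups: Vec<Fixup>,
}
```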


So from my perspective, there is a proposal for a large refactor of the compiler backend, with high risk, motivated by two reasons that I don't think are really the case (branch constant folding gets better? --> no, different and orthogonal things; precludes more advanced veneer fixup? --> no, the invariant still allows this). It's entirely possible I've missed something or misunderstood some part of the proposal though -- curious what others think!

@cfallin
Member Author

cfallin commented Aug 7, 2023

(Thanks to @jameysharp, @elliottt and others for talking through this a bit offline and especially Jamey for noting the implications on block order above. @alexcrichton if you want to discuss further we can definitely do so in the next Cranelift meeting, if that's easier!)

@alexcrichton
Member

Thanks for taking the time to write this up! I've added this to the agenda for tomorrow to go over if there's anything remaining, mostly because I'm still curious to dig into more details if you and others are willing!

> > extremely late in the pipeline and we're already unable to perform optimizations such as constant-branch-folding with knock-on effects from that.
>
> It's unclear to me why this is the case -- could you clarify?

My gut is that this level of optimization is best done earlier in the pipeline, which is where my thinking stems from, but there are other concrete issues which I think would rely on doing branch-related optimizations earlier rather than at the very end.

Basically, my thinking is that while we could make MachBuffer fancier with respect to branching, it seems like much of the benefit should come from mid-end optimizations and their knock-on effects.

> I think both of the above are possibly springing from a slight mis-interpretation of what purpose the MachBuffer's branch folding rules actually serve. It is not an optimization (in the sense of an optional thing); it is a lowering step

This is a good point, yeah, and I think a good way to help frame my thinking. On one hand, we as a compiler have the problem of optimizing a CFG: for example A->B->C could become A+B -> C under some conditions, and things like that. Also things like `br_if true a, b` can become `jump a`, which might enable further CFG simplifications. This, in my mind, is all about taking a messy CFG to a "clean" CFG, which would handle the issues I described above (e.g. no ret-in-its-own-block, and DCE would fall out of CFG rewriting). There is, however, still the problem of taking a perfect CFG and lowering it to optimal machine code.
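As a rough sketch of the kind of mid-end rewrite I mean (toy types here, nothing to do with Cranelift's actual IR): fold a branch on a known-constant condition into an unconditional jump, which then feeds the usual single-predecessor block merging.

```rust
#[derive(Clone)]
enum Terminator {
    Jump { dest: usize },
    // `cond` is `Some(..)` when constant propagation has proven the condition.
    BrIf { cond: Option<bool>, then_dest: usize, else_dest: usize },
}

fn fold_constant_branch(term: &Terminator) -> Terminator {
    match term {
        Terminator::BrIf { cond: Some(c), then_dest, else_dest } => Terminator::Jump {
            dest: if *c { *then_dest } else { *else_dest },
        },
        other => other.clone(),
    }
}
```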

I feel like MachBuffer has historically conflated these two operations, performing both CFG optimization and optimal linear layout of a CFG at the same time. My thinking has been that the CFG-optimizing parts of MachBuffer would ideally be pulled out into an independent pass in Cranelift that runs at some earlier point, perhaps intertwined with the egraph optimizations somehow (e.g. to make use of constant propagation and/or GVN).

I'll also clarify that I barely know what an EBB is, much less advocate for one. And again, if anything I'm suggesting comes with a fundamental compile-time or runtime performance penalty then it's not worth it; I'm trying to see if we can have our cake and eat it too.

I'll definitely admit that this may all be stemming from me misunderstanding MachBuffer and its optimizations. If all it's doing is the optimal linear layout of a CFG, without extra optimizations, then there's no escaping that class of problem and there's nothing more to be done.


All that being said your observation about the recent branches list being empty on islands I think is a great one and would be a way to solve the suboptimal branch generation from #6804. I might try to give that a stab in the future!

@alexcrichton
Member

Ok, I talked more with folks at the Cranelift meeting today, and the conclusion I took away was that the MachBuffer optimizations are all required at this time to produce an optimal linear layout of the CFG it is given. In terms of what I was describing above about CFG optimization versus optimal CFG layout, the MachBuffer only deals with the latter, which all sounds good.

In that sense I believe this issue is resolved as the answer is "yes, the MachBuffer should continue to produce high-quality code".

As for the one remaining issue of the eager promotion of veneers, I think @cfallin's idea will solve it well. I mentioned today that I had a branch (which is here) implementing the `BinaryHeap` idea, although it only works by disabling optimization, so it's not landable as-is of course, but I wanted to share it since it has the bits about removing precalculation of the island deadline.

@jameysharp
Contributor

Given that there is a small, ISA-specific, constant number of possible deadline distances (e.g. 2^20, 2^27, 2^32), it occurs to me that a priority queue isn't strictly necessary. We could keep one `VecDeque` per distance instead. Then all the labels that are at their deadlines are together at the beginning of their respective lists; you can still remove the last-added labels efficiently if needed; and nothing needs to repeatedly visit labels in the middle. Does that help?
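Roughly like this, with made-up types just to show the shape of it; since fixups arrive in increasing offset order, each queue is already sorted by deadline and only the fronts ever need to be checked:

```rust
use std::collections::VecDeque;

struct PendingFixup {
    label: u32,
    offset: u32,
    deadline: u32,
}

struct FixupQueues {
    // One queue per distinct max branch range for the ISA (e.g. 2^20, 2^27,
    // 2^32); a handful at most, so linear scans are fine.
    by_range: Vec<(u32, VecDeque<PendingFixup>)>,
}

impl FixupQueues {
    fn new(ranges: &[u32]) -> Self {
        FixupQueues {
            by_range: ranges.iter().map(|&r| (r, VecDeque::new())).collect(),
        }
    }

    fn push(&mut self, max_range: u32, fixup: PendingFixup) {
        let (_, q) = self
            .by_range
            .iter_mut()
            .find(|(r, _)| *r == max_range)
            .expect("range registered in new()");
        q.push_back(fixup);
    }

    // The nearest deadline overall is the minimum over the queue fronts.
    fn earliest_deadline(&self) -> Option<u32> {
        self.by_range
            .iter()
            .filter_map(|(_, q)| q.front().map(|f| f.deadline))
            .min()
    }
}
```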

@alexcrichton
Member

I think that would work as well, yeah, but I think it'd still require the "flush" behavior during island insertion, right? Otherwise, I'm at least not too worried about overhead with a `BinaryHeap`, and it feels like a simpler alternative to reach for, but I think I'm on a bit of a one-track mind right now...

@cfallin
Member Author

cfallin commented Aug 9, 2023

It's possible we can make do with just `Vec`s, actually. The key thing is that (i) as fixups are pushed prior to an island, the only operation we need is "chomp" (so push/pop on a vec suffices); and (ii) after the island, we can't chomp anymore, so it suffices to sort by deadline. As you're observing, for a given class we will have monotonically increasing deadlines, so we can put the "long-term vecs" per class in reverse-deadline order and pop off the back of all of them.

That said, and as I mentioned in the earlier thread, I worry a lot about complexity here, so I'd want to be careful about premature optimization. One "deadline-keyed heap" that we pop from is a lot simpler than a radix-sort-like thing with N vecs, and it's not clear to me how to parameterize the latter on `LabelUse` arms either (they are different per architecture), strictly in terms of generic programming in Rust. (I guess the trait on the enum could map the kinds into an integer index space, with each concrete implementation a `repr(u8)` or whatever, but now we're way down the rabbit hole.) So maybe it's best to try the simpler thing (EDIT for clarity: that would be @alexcrichton's `BinaryHeap` suggestion) and then see whether we need anything more...
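For reference, the simpler thing looks roughly like this (made-up field names, not the actual MachBuffer types): a min-heap keyed on deadline via `Reverse`, so the fixup whose deadline comes up soonest is always the one we peek at when deciding whether to emit an island.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(PartialEq, Eq, PartialOrd, Ord)]
struct Fixup {
    deadline: u32, // first field, so the derived ordering sorts by deadline
    offset: u32,
    label: u32,
}

// True when the nearest deadline has been reached, i.e. veneers should be
// emitted now (a real implementation would also leave slack for the size
// of the island itself).
fn needs_island(fixups: &BinaryHeap<Reverse<Fixup>>, cur_offset: u32) -> bool {
    fixups.peek().map_or(false, |r| r.0.deadline <= cur_offset)
}
```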

@jameysharp
Contributor

I figured we'd just have what amounts to a map keyed on the `CodeOffset` (which is a `u32`) from `MachInstLabelUse::max_pos_range`. We don't actually care what kind of label it is for this purpose; we care what maximum offset it has. There are no more than four distinct max offsets on any current target, so even linear-scan in an array-of-pairs-based "map" is asymptotically O(1) and the constant factors might be better than a `HashMap`.

But I don't quite understand the interaction between veneer emission and label chomping, so I'll take both of your words for it that the binary heap is fine.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this issue Aug 24, 2023
This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This
undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter
back to `emit_island`, and implements the changes discussed in bytecodealliance#6818.
Namely fixups are now tracked in a `pending_fixups` list for editing and
modification and then during island emission they're flushed into a
`BinaryHeap` tracked as `fixup_records`. Additionally calculation of the
size of the pending island is now done as a single calculation rather
than updating metadata as we go along. This is required because fixups
are no longer entirely cleared during island emission so the previous
logic of "reset the island size to zero" and have it get recalculated as
the island is emitted is no longer applicable. I opted to go with a
simplistic version of this for now which assumes that all veneers are
the worst case size, but if necessary I think this could be more optimal
by tracking each veneer in a counter.

Closes bytecodealliance#6818
github-merge-queue bot pushed a commit that referenced this issue Aug 25, 2023
* Don't force veneers on island emission

This commit is a follow-up to #6804 with the discussion on #6818. This
undoes some of the changes from #6804, such as bringing a size parameter
back to `emit_island`, and implements the changes discussed in #6818.
Namely fixups are now tracked in a `pending_fixups` list for editing and
modification and then during island emission they're flushed into a
`BinaryHeap` tracked as `fixup_records`. Additionally calculation of the
size of the pending island is now done as a single calculation rather
than updating metadata as we go along. This is required because fixups
are no longer entirely cleared during island emission so the previous
logic of "reset the island size to zero" and have it get recalculated as
the island is emitted is no longer applicable. I opted to go with a
simplistic version of this for now which assumes that all veneers are
the worst case size, but if necessary I think this could be more optimal
by tracking each veneer in a counter.

Closes #6818

* Keep a running count for pending fixup deadline

Update this as pending fixups are added and then clear this out when
islands are emitted.

* Don't force all fixups to go through a binary heap

Instead process them immediately if they're ready.
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this issue Sep 6, 2023