-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cranelift: avoid quadratic behavior in label-fixup processing #6804
Conversation
I used your branch to compile a large module of ~115Mb, 3 times. x86_64: 11.5228035s, 11.0294104s, 11.2740543s Where on main it is was: x86_64: 13.9465882s, 11.2770526s, 11.0808334s Seems like a very big improvement! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit on the fence about this, but I think we should probably go with it. The main part that doesn't sit well with me is that I don't think there's a great way to describe this algorithm and its consequences. Any branch can impact any other branch at any time, so what a branch is doing can't be explained or understood with any local reasoning.
Running this through sightglass shows no perf difference, but that's expected. What I'd be worried about is that a critical branch in a hot loop could go awry and cause performance issues. That's a very precise situation to be in though and additionally in some sense if an island is emitted in the middle of a loop the you've already lost perf-wise anyway.
This also feels a bit weird in the sense that we do veneer promotion a few times but there's not really a great way to describe the consequences of it. For example there's not really a great reason to force all pending fixups to get a veneer every time an island is emitted other than "it was the smallest patch at the time to fix quadratic behavior". The comments indicate only in passing that the purpose of this is to avoid quadratic behavior, but I think it would be worth expanding on that. The quadratic-ness isn't necessarily inherent because as I described by representing fixup_records
with a BinaryHeap
it's possible to restore local reasoning about branches without quadratic behavior. That solution isn't viable, however, with the branch optimizations happening in the buffer.
I left a few coments as well for stylistic things, but given how this is expected to have little-to-no impact on runtime and clearly has a large impact on compile time I think it's good to land this. I do think the comments should be updated though to indicate that the purpose for this is purely to avoid larger changes and the performance implications of this are uncertain and likely to cause issues at some point, but that "point" is far enough away that we're buying ourselves runway.
Thanks for testing this! Would you be able to share a performance profile of the x86_64 compilation? Having a 100x slowdown for x86_64 is pretty surprising and may mean that there's lurking quadratic behavior there too worth looking into. |
Given the other numbers, I suspect maybe the units were meant to be seconds rather than milliseconds on that line? |
Yeah, unfortunately the branch opts have to happen in the buffer -- that's important for compile-time performance and lifting them elsewhere would be a large complexity shift (on the order of months of work) in the remainder of the backend because of the lower-level branch forms. (As an example, @elliottt worked to remove the last vestiges of EBBs and make branches truly two-target in CLIF as well, and that was something like a few weeks of delicate full-time work with a few followup issues.) We've been there before, we learned! That said, I don't know if I buy that the two issues are coupled to this degree. MachBuffer does both things -- branch folding and label fixups -- and the former has an invariant that any branches in the "most recent branches contiguous with end of buffer" list will have fixups -- but the invariant needs nothing (is vacuously true) if the recent-branches list is empty, and it is always empty after emitting an island. So I think one could do whatever one wants with a binary-heap or whatnot, leveraging that. However, there's also a complexity tradeoff here: I worry very much about bugs in this code (see the series of issues three summers ago that resulted in me writing the correctness proof in the comments) and I'd rather keep it using uniformly simple data structures where we can. "Nonlocal reasoning" sounds undesirable but that's blurring a bit what's going on here: basically we now have islands that cause every forward-crossing edge to go through a veneer. There are plenty of other ways that changes in some code can push some other code toward a different decision (regalloc, block ordering, side-effects blocking hoisting, ...) and we don't really worry about those nonlocal effects, or at least we accept them as normal. Anyway, I'll update on Monday based on your comments -- thanks very much for the review! |
Sorry, I made an error, my script errored for aarch64 with 'target not supported', used the wrong feature flag to compile wasmtime. I re-ran the script:
Running three times:
Seems more plausible :) |
@cfallin I don't think it's worth belaboring the point too much, but I think it's important to have more rationale in the code other than "otherwise it's quadratic". The actual result of the generated code here I personally think is pretty surprising where the purpose of veneers/promotion is to incrementally get more distance and that's not what's happening here because everything is "flushed" all at once no matter whether or not we're about to go out of range of a branch. At least to me it's surprising that given the otherwise meticulous attention to detail throughout the file this seemingly-large detail is quickly glossed over. And again to clarify I'm not advocating for here-and-now removing branch optimizations from the mach buffer. I realize it's a deep refactoring and would take quite some work and that additionally this is the result of historical work and balancing efforts. I still do believe though that this probably isn't the best place to do this because it's extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that. For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that, where in such a situation it may not be necessary for the mach buffer to be as clever as it is today. I mostly want to clarify that it sounds like you're advocating that this must stay as-is and there's no viable alternative to the mach buffer's optimizations today. I'm not personally convinced about that myself. In any case though this is all orthogonal to the compilation-time performance gains and is probably best left for a future issue/discussion rather than here. @TimonPost thanks for double-checking, and those new numbers make sense! It's true that 12 seconds still feels a bit high (although 115MB is a big module too). If you'd like it might be worth digging into the performance profile there. Locally in my example program most of the time was spent in b-trees in register allocation, which I don't know how to improve. If that's the case for your test case too it's probably at the limit of speed it's going to reach for now. |
I might do some more profiling to see where most time is spent now. Also compiling the workspace with release profile, CARGO_PROFILE_RELEASE_OPT_LEVEL=z, CARGO_INCREMENTAL=1, helps us to optimize for binary size and get ~59MB instead of the 115MB. So after this PR there are no big issues left. |
Thanks @alexcrichton -- I filed #6818 to continue the larger discussion and added a more detailed "why" section to the top doc-comment here. I think actually doing deadlines properly might not be so bad, after more thought; will try to tackle this after some other higher-priority things. |
Looks good to me, and thanks for opening an issue! I'll read over that later today 👍 |
…dealliance#6804) * WIP: two-tier processing for MachBuffer veneer processing in island emission * Unconditionally emit veneers for label forward-refs that cross islands. * Clean up and fix tests * Review feedback * Add some more detailed comments.
…dealliance#6804) * WIP: two-tier processing for MachBuffer veneer processing in island emission * Unconditionally emit veneers for label forward-refs that cross islands. * Clean up and fix tests * Review feedback * Add some more detailed comments.
…ing" to 12.0.1 (#6897) * Cranelift: avoid quadratic behavior in label-fixup processing (#6804) * WIP: two-tier processing for MachBuffer veneer processing in island emission * Unconditionally emit veneers for label forward-refs that cross islands. * Clean up and fix tests * Review feedback * Add some more detailed comments. * Update the releases file to reflect the patch --------- Co-authored-by: Chris Fallin <chris@cfallin.org>
This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in bytecodealliance#6818. Namely fixups are now tracked in a `pending_fixups` list for editing and modification and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally calculation of the size of the pending island is now done as a single calculation rather than updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission so the previous logic of "reset the island size to zero" and have it get recalculated as the island is emitted is no longer applicable. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst case size, but if necessary I think this could be more optimal by tracking each veneer in a counter. Closes bytecodealliance#6818
* Don't force veneers on island emission This commit is a follow-up to #6804 with the discussion on #6818. This undoes some of the changes from #6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in #6818. Namely fixups are now tracked in a `pending_fixups` list for editing and modification and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally calculation of the size of the pending island is now done as a single calculation rather than updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission so the previous logic of "reset the island size to zero" and have it get recalculated as the island is emitted is no longer applicable. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst case size, but if necessary I think this could be more optimal by tracking each veneer in a counter. Closes #6818 * Keep a running count for pending fixup deadline Update this as pending fixups are added and then clear this out when islands are emitted. * Don't force all fixups to go through a binary heap Instead process them immediately if they're ready.
* Don't force veneers on island emission This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in bytecodealliance#6818. Namely fixups are now tracked in a `pending_fixups` list for editing and modification and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally calculation of the size of the pending island is now done as a single calculation rather than updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission so the previous logic of "reset the island size to zero" and have it get recalculated as the island is emitted is no longer applicable. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst case size, but if necessary I think this could be more optimal by tracking each veneer in a counter. Closes bytecodealliance#6818 * Keep a running count for pending fixup deadline Update this as pending fixups are added and then clear this out when islands are emitted. * Don't force all fixups to go through a binary heap Instead process them immediately if they're ready.
Currently,
MachBuffer
is used to resolve forward and backward references to labels (basic-block entry points and other branch targets) within function bodies as well as calls to functions within a whole module. It implements a single-pass algorithm that tracks "pending fixups" -- references to some other place -- and fills in the offsets when known, also handling upgrading ranges by emitting veneers on architectures that need that behavior.In #6798 it was reported that on aarch64, module emission is slow for a certain large module. This module ends up forcing many islands of veneers but has branches whose ranges cross those islands and need not be extended with a veneer yet. This case exposes quadratic behavior: the island-emission logic puts the fixup back in a list and replaces the fixup list with that leftover-fixup list after processing all fixups.
This PR instead does the following:
Branch26
) while lots of islands for shorter-range references (e.g.Branch19
) are emitted.PCRel32
for branches, for example), we put it in a separate list that we don't process again until the very last "last-chance island" after all emission.These two rules together will bound the amount of processing to
O(|fixups| * |label kinds|)
rather thanO(|fixups|^2)
.The first new rule theoretically causes more veneers to be emitted, but in practice is not too likely to make a difference, I think. Unfortunately my aarch64 laptop is unavailable at the moment; if someone on that platform could benchmark the impact that would be quite appreciated!
Fixes #6798.