Cranelift: avoid quadratic behavior in label-fixup processing #6804

cfallin · 2023-08-04T16:51:26Z

Currently, MachBuffer is used to resolve forward and backward references to labels (basic-block entry points and other branch targets) within function bodies as well as calls to functions within a whole module. It implements a single-pass algorithm that tracks "pending fixups" -- references to some other place -- and fills in the offsets when known, also handling upgrading ranges by emitting veneers on architectures that need that behavior.

In #6798 it was reported that on aarch64, module emission is slow for a certain large module. This module ends up forcing many islands of veneers but has branches whose ranges cross those islands and need not be extended with a veneer yet. This case exposes quadratic behavior: the island-emission logic puts the fixup back in a list and replaces the fixup list with that leftover-fixup list after processing all fixups.

This PR instead does the following:

When an island is emitted, all unresolved forward references get a veneer, even if their current label kind would still be in range. This prevents fixups from hanging around arbitrarily long when they have longer range kinds (e.g., aarch64's Branch26) while lots of islands for shorter-range references (e.g. Branch19) are emitted.
When a label fixup reaches its "longest-range" kind (PCRel32 for branches, for example), we put it in a separate list that we don't process again until the very last "last-chance island" after all emission.

The first new rule theoretically causes more veneers to be emitted, but in practice is not too likely to make a difference, I think. Unfortunately my aarch64 laptop is unavailable at the moment; if someone on that platform could benchmark the impact that would be quite appreciated!

Fixes #6798.

…mission

TimonPost · 2023-08-05T13:19:47Z

I used your branch to compile a large module of ~115Mb, 3 times.

x86_64: 11.5228035s, 11.0294104s, 11.2740543s
aarch64: 12.8739ms, 17.9411ms, 14.3515ms

Where on main it is was:

x86_64: 13.9465882s, 11.2770526s, 11.0808334s
aarch64: 157.5406149s, 144.1816011s, 144.1495896s

Seems like a very big improvement!

alexcrichton

I'm a bit on the fence about this, but I think we should probably go with it. The main part that doesn't sit well with me is that I don't think there's a great way to describe this algorithm and its consequences. Any branch can impact any other branch at any time, so what a branch is doing can't be explained or understood with any local reasoning.

Running this through sightglass shows no perf difference, but that's expected. What I'd be worried about is that a critical branch in a hot loop could go awry and cause performance issues. That's a very precise situation to be in though and additionally in some sense if an island is emitted in the middle of a loop the you've already lost perf-wise anyway.

This also feels a bit weird in the sense that we do veneer promotion a few times but there's not really a great way to describe the consequences of it. For example there's not really a great reason to force all pending fixups to get a veneer every time an island is emitted other than "it was the smallest patch at the time to fix quadratic behavior". The comments indicate only in passing that the purpose of this is to avoid quadratic behavior, but I think it would be worth expanding on that. The quadratic-ness isn't necessarily inherent because as I described by representing fixup_records with a BinaryHeap it's possible to restore local reasoning about branches without quadratic behavior. That solution isn't viable, however, with the branch optimizations happening in the buffer.

I left a few coments as well for stylistic things, but given how this is expected to have little-to-no impact on runtime and clearly has a large impact on compile time I think it's good to land this. I do think the comments should be updated though to indicate that the purpose for this is purely to avoid larger changes and the performance implications of this are uncertain and likely to cause issues at some point, but that "point" is far enough away that we're buying ourselves runway.

cranelift/codegen/src/machinst/buffer.rs

alexcrichton · 2023-08-05T17:49:22Z

@TimonPost

x86_64: 11.5228035s, 11.0294104s, 11.2740543s
aarch64: 12.8739ms, 17.9411ms, 14.3515ms

Thanks for testing this! Would you be able to share a performance profile of the x86_64 compilation? Having a 100x slowdown for x86_64 is pretty surprising and may mean that there's lurking quadratic behavior there too worth looking into.

cfallin · 2023-08-05T18:30:42Z

Given the other numbers, I suspect maybe the units were meant to be seconds rather than milliseconds on that line?

cfallin · 2023-08-05T18:42:39Z

The quadratic-ness isn't necessarily inherent because #6798 (comment) by representing fixup_records with a BinaryHeap it's possible to restore local reasoning about branches without quadratic behavior. That solution isn't viable, however, with the branch optimizations happening in the buffer.

Yeah, unfortunately the branch opts have to happen in the buffer -- that's important for compile-time performance and lifting them elsewhere would be a large complexity shift (on the order of months of work) in the remainder of the backend because of the lower-level branch forms. (As an example, @elliottt worked to remove the last vestiges of EBBs and make branches truly two-target in CLIF as well, and that was something like a few weeks of delicate full-time work with a few followup issues.) We've been there before, we learned!

That said, I don't know if I buy that the two issues are coupled to this degree. MachBuffer does both things -- branch folding and label fixups -- and the former has an invariant that any branches in the "most recent branches contiguous with end of buffer" list will have fixups -- but the invariant needs nothing (is vacuously true) if the recent-branches list is empty, and it is always empty after emitting an island. So I think one could do whatever one wants with a binary-heap or whatnot, leveraging that.

However, there's also a complexity tradeoff here: I worry very much about bugs in this code (see the series of issues three summers ago that resulted in me writing the correctness proof in the comments) and I'd rather keep it using uniformly simple data structures where we can. "Nonlocal reasoning" sounds undesirable but that's blurring a bit what's going on here: basically we now have islands that cause every forward-crossing edge to go through a veneer. There are plenty of other ways that changes in some code can push some other code toward a different decision (regalloc, block ordering, side-effects blocking hoisting, ...) and we don't really worry about those nonlocal effects, or at least we accept them as normal.

Anyway, I'll update on Monday based on your comments -- thanks very much for the review!

TimonPost · 2023-08-06T11:40:56Z

Sorry, I made an error, my script errored for aarch64 with 'target not supported', used the wrong feature flag to compile wasmtime. I re-ran the script:

fn main() {
    const SANDBOX_PATH: &'static str = "src/sandbox.wasm";

    let mut times: HashMap<String, std::time::Duration,RandomState> = HashMap::default();

    for arch in &["x86_64", "aarch64"] {
        let start = Instant::now();
        let mut command = Command::new("wasmtime");
        command.arg("compile").arg(format!("--target={os}")).arg(format!("-o=output-{os}.wasm")).arg(
            SANDBOX_PATH,
        );

        println!("Start compiling on {}", arch);
        let child = command.spawn().unwrap();
        let output = child.wait_with_output().unwrap();
        times.insert(format!("sandbox-{}", arch), start.elapsed());

        println!(
            "[{:?}]: {os} ({})",
            start.elapsed(),
            String::from_utf8(output.stdout).unwrap()
        );
    }
}

Running three times:

x86_64: 11.6531752s, 11.5904301s, 11.5270225s
aarch64: 12.006055s, 11.5779827s,11.9106966s

Seems more plausible :)

alexcrichton · 2023-08-07T14:39:38Z

@cfallin I don't think it's worth belaboring the point too much, but I think it's important to have more rationale in the code other than "otherwise it's quadratic". The actual result of the generated code here I personally think is pretty surprising where the purpose of veneers/promotion is to incrementally get more distance and that's not what's happening here because everything is "flushed" all at once no matter whether or not we're about to go out of range of a branch. At least to me it's surprising that given the otherwise meticulous attention to detail throughout the file this seemingly-large detail is quickly glossed over.

And again to clarify I'm not advocating for here-and-now removing branch optimizations from the mach buffer. I realize it's a deep refactoring and would take quite some work and that additionally this is the result of historical work and balancing efforts. I still do believe though that this probably isn't the best place to do this because it's extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that. For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that, where in such a situation it may not be necessary for the mach buffer to be as clever as it is today.

I mostly want to clarify that it sounds like you're advocating that this must stay as-is and there's no viable alternative to the mach buffer's optimizations today. I'm not personally convinced about that myself. In any case though this is all orthogonal to the compilation-time performance gains and is probably best left for a future issue/discussion rather than here.

@TimonPost thanks for double-checking, and those new numbers make sense! It's true that 12 seconds still feels a bit high (although 115MB is a big module too). If you'd like it might be worth digging into the performance profile there. Locally in my example program most of the time was spent in b-trees in register allocation, which I don't know how to improve. If that's the case for your test case too it's probably at the limit of speed it's going to reach for now.

TimonPost · 2023-08-07T14:48:52Z

I might do some more profiling to see where most time is spent now. Also compiling the workspace with release profile, CARGO_PROFILE_RELEASE_OPT_LEVEL=z, CARGO_INCREMENTAL=1, helps us to optimize for binary size and get ~59MB instead of the 115MB. So after this PR there are no big issues left.

cfallin · 2023-08-07T19:59:44Z

Thanks @alexcrichton -- I filed #6818 to continue the larger discussion and added a more detailed "why" section to the top doc-comment here. I think actually doing deadlines properly might not be so bad, after more thought; will try to tackle this after some other higher-priority things.

alexcrichton · 2023-08-08T14:38:20Z

Looks good to me, and thanks for opening an issue! I'll read over that later today 👍

…dealliance#6804) * WIP: two-tier processing for MachBuffer veneer processing in island emission * Unconditionally emit veneers for label forward-refs that cross islands. * Clean up and fix tests * Review feedback * Add some more detailed comments.

…ing" to 12.0.1 (#6897) * Cranelift: avoid quadratic behavior in label-fixup processing (#6804) * WIP: two-tier processing for MachBuffer veneer processing in island emission * Unconditionally emit veneers for label forward-refs that cross islands. * Clean up and fix tests * Review feedback * Add some more detailed comments. * Update the releases file to reflect the patch --------- Co-authored-by: Chris Fallin <chris@cfallin.org>

This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in bytecodealliance#6818. Namely fixups are now tracked in a `pending_fixups` list for editing and modification and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally calculation of the size of the pending island is now done as a single calculation rather than updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission so the previous logic of "reset the island size to zero" and have it get recalculated as the island is emitted is no longer applicable. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst case size, but if necessary I think this could be more optimal by tracking each veneer in a counter. Closes bytecodealliance#6818

* Don't force veneers on island emission This commit is a follow-up to #6804 with the discussion on #6818. This undoes some of the changes from #6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in #6818. Namely fixups are now tracked in a `pending_fixups` list for editing and modification and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally calculation of the size of the pending island is now done as a single calculation rather than updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission so the previous logic of "reset the island size to zero" and have it get recalculated as the island is emitted is no longer applicable. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst case size, but if necessary I think this could be more optimal by tracking each veneer in a counter. Closes #6818 * Keep a running count for pending fixup deadline Update this as pending fixups are added and then clear this out when islands are emitted. * Don't force all fixups to go through a binary heap Instead process them immediately if they're ready.

* Don't force veneers on island emission This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter back to `emit_island`, and implements the changes discussed in bytecodealliance#6818. Namely fixups are now tracked in a `pending_fixups` list for editing and modification and then during island emission they're flushed into a `BinaryHeap` tracked as `fixup_records`. Additionally calculation of the size of the pending island is now done as a single calculation rather than updating metadata as we go along. This is required because fixups are no longer entirely cleared during island emission so the previous logic of "reset the island size to zero" and have it get recalculated as the island is emitted is no longer applicable. I opted to go with a simplistic version of this for now which assumes that all veneers are the worst case size, but if necessary I think this could be more optimal by tracking each veneer in a counter. Closes bytecodealliance#6818 * Keep a running count for pending fixup deadline Update this as pending fixups are added and then clear this out when islands are emitted. * Don't force all fixups to go through a binary heap Instead process them immediately if they're ready.

WIP: two-tier processing for MachBuffer veneer processing in island e…

456a469

…mission

cfallin mentioned this pull request Aug 4, 2023

Cranelift wasm module compilation seems slower #6798

Closed

Unconditionally emit veneers for label forward-refs that cross islands.

420dd42

github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Aug 4, 2023

Clean up and fix tests

27e1d97

cfallin changed the title ~~WIP: two-tier processing for MachBuffer veneer processing in island emission.~~ Cranelift: avoid quadratic behavior in label-fixup processing Aug 4, 2023

cfallin marked this pull request as ready for review August 4, 2023 22:54

cfallin requested a review from a team as a code owner August 4, 2023 22:54

cfallin requested review from fitzgen and removed request for a team August 4, 2023 22:54

github-actions bot added the cranelift:area:aarch64 Issues related to AArch64 backend. label Aug 4, 2023

alexcrichton approved these changes Aug 5, 2023

View reviewed changes

cranelift/codegen/src/machinst/buffer.rs Outdated Show resolved Hide resolved

cranelift/codegen/src/machinst/buffer.rs Show resolved Hide resolved

cfallin added 2 commits August 7, 2023 12:43

Review feedback

a54d43b

Add some more detailed comments.

f29e856

cfallin enabled auto-merge August 7, 2023 19:55

cfallin added this pull request to the merge queue Aug 7, 2023

Merged via the queue into bytecodealliance:main with commit 4f1b38f Aug 7, 2023

cfallin deleted the quadratic-islands branch August 7, 2023 21:07

alexcrichton mentioned this pull request Aug 8, 2023

Cranelift: should we do branch optimizations in MachBuffer? #6818

Closed

TimonPost mentioned this pull request Aug 11, 2023

Update RELEASES.md #6838

Merged

TimonPost mentioned this pull request Aug 23, 2023

Is it possible to release a patch 12.0.1 #6889

Closed

TimonPost mentioned this pull request Aug 24, 2023

Backport: "Cranelift: avoid quadratic behavior in label-fixup processing" to 12.0.1 #6897

Merged

alexcrichton mentioned this pull request Aug 24, 2023

Don't force veneers on island emission #6902

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Cranelift: avoid quadratic behavior in label-fixup processing #6804

Cranelift: avoid quadratic behavior in label-fixup processing #6804

cfallin commented Aug 4, 2023 •

edited

Loading

TimonPost commented Aug 5, 2023 •

edited

Loading

alexcrichton left a comment

alexcrichton commented Aug 5, 2023

cfallin commented Aug 5, 2023

cfallin commented Aug 5, 2023

TimonPost commented Aug 6, 2023 •

edited

Loading

alexcrichton commented Aug 7, 2023

TimonPost commented Aug 7, 2023 •

edited

Loading

cfallin commented Aug 7, 2023

alexcrichton commented Aug 8, 2023

Cranelift: avoid quadratic behavior in label-fixup processing #6804

Cranelift: avoid quadratic behavior in label-fixup processing #6804

Conversation

cfallin commented Aug 4, 2023 • edited Loading

TimonPost commented Aug 5, 2023 • edited Loading

alexcrichton left a comment

Choose a reason for hiding this comment

alexcrichton commented Aug 5, 2023

cfallin commented Aug 5, 2023

cfallin commented Aug 5, 2023

TimonPost commented Aug 6, 2023 • edited Loading

alexcrichton commented Aug 7, 2023

TimonPost commented Aug 7, 2023 • edited Loading

cfallin commented Aug 7, 2023

alexcrichton commented Aug 8, 2023

cfallin commented Aug 4, 2023 •

edited

Loading

TimonPost commented Aug 5, 2023 •

edited

Loading

TimonPost commented Aug 6, 2023 •

edited

Loading

TimonPost commented Aug 7, 2023 •

edited

Loading