Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift: avoid quadratic behavior in label-fixup processing #6804

Merged
merged 5 commits into from
Aug 7, 2023

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Aug 4, 2023

Currently, MachBuffer is used to resolve forward and backward references to labels (basic-block entry points and other branch targets) within function bodies as well as calls to functions within a whole module. It implements a single-pass algorithm that tracks "pending fixups" -- references to some other place -- and fills in the offsets when known, also handling upgrading ranges by emitting veneers on architectures that need that behavior.

In #6798 it was reported that on aarch64, module emission is slow for a certain large module. This module ends up forcing many islands of veneers but has branches whose ranges cross those islands and need not be extended with a veneer yet. This case exposes quadratic behavior: the island-emission logic puts the fixup back in a list and replaces the fixup list with that leftover-fixup list after processing all fixups.

This PR instead does the following:

  • When an island is emitted, all unresolved forward references get a veneer, even if their current label kind would still be in range. This prevents fixups from hanging around arbitrarily long when they have longer range kinds (e.g., aarch64's Branch26) while lots of islands for shorter-range references (e.g. Branch19) are emitted.
  • When a label fixup reaches its "longest-range" kind (PCRel32 for branches, for example), we put it in a separate list that we don't process again until the very last "last-chance island" after all emission.

These two rules together will bound the amount of processing to O(|fixups| * |label kinds|) rather than O(|fixups|^2).

The first new rule theoretically causes more veneers to be emitted, but in practice is not too likely to make a difference, I think. Unfortunately my aarch64 laptop is unavailable at the moment; if someone on that platform could benchmark the impact that would be quite appreciated!

Fixes #6798.

@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Aug 4, 2023
@cfallin cfallin changed the title WIP: two-tier processing for MachBuffer veneer processing in island emission. Cranelift: avoid quadratic behavior in label-fixup processing Aug 4, 2023
@cfallin cfallin marked this pull request as ready for review August 4, 2023 22:54
@cfallin cfallin requested a review from a team as a code owner August 4, 2023 22:54
@cfallin cfallin requested review from fitzgen and removed request for a team August 4, 2023 22:54
@github-actions github-actions bot added the cranelift:area:aarch64 Issues related to AArch64 backend. label Aug 4, 2023
@TimonPost
Copy link
Contributor

TimonPost commented Aug 5, 2023

I used your branch to compile a large module of ~115Mb, 3 times.

x86_64: 11.5228035s, 11.0294104s, 11.2740543s
aarch64: 12.8739ms, 17.9411ms, 14.3515ms

Where on main it is was:

x86_64: 13.9465882s, 11.2770526s, 11.0808334s
aarch64: 157.5406149s, 144.1816011s, 144.1495896s

Seems like a very big improvement!

Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit on the fence about this, but I think we should probably go with it. The main part that doesn't sit well with me is that I don't think there's a great way to describe this algorithm and its consequences. Any branch can impact any other branch at any time, so what a branch is doing can't be explained or understood with any local reasoning.

Running this through sightglass shows no perf difference, but that's expected. What I'd be worried about is that a critical branch in a hot loop could go awry and cause performance issues. That's a very precise situation to be in though and additionally in some sense if an island is emitted in the middle of a loop the you've already lost perf-wise anyway.

This also feels a bit weird in the sense that we do veneer promotion a few times but there's not really a great way to describe the consequences of it. For example there's not really a great reason to force all pending fixups to get a veneer every time an island is emitted other than "it was the smallest patch at the time to fix quadratic behavior". The comments indicate only in passing that the purpose of this is to avoid quadratic behavior, but I think it would be worth expanding on that. The quadratic-ness isn't necessarily inherent because as I described by representing fixup_records with a BinaryHeap it's possible to restore local reasoning about branches without quadratic behavior. That solution isn't viable, however, with the branch optimizations happening in the buffer.

I left a few coments as well for stylistic things, but given how this is expected to have little-to-no impact on runtime and clearly has a large impact on compile time I think it's good to land this. I do think the comments should be updated though to indicate that the purpose for this is purely to avoid larger changes and the performance implications of this are uncertain and likely to cause issues at some point, but that "point" is far enough away that we're buying ourselves runway.

cranelift/codegen/src/machinst/buffer.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/machinst/buffer.rs Show resolved Hide resolved
@alexcrichton
Copy link
Member

@TimonPost

x86_64: 11.5228035s, 11.0294104s, 11.2740543s
aarch64: 12.8739ms, 17.9411ms, 14.3515ms

Thanks for testing this! Would you be able to share a performance profile of the x86_64 compilation? Having a 100x slowdown for x86_64 is pretty surprising and may mean that there's lurking quadratic behavior there too worth looking into.

@cfallin
Copy link
Member Author

cfallin commented Aug 5, 2023

Given the other numbers, I suspect maybe the units were meant to be seconds rather than milliseconds on that line?

@cfallin
Copy link
Member Author

cfallin commented Aug 5, 2023

The quadratic-ness isn't necessarily inherent because #6798 (comment) by representing fixup_records with a BinaryHeap it's possible to restore local reasoning about branches without quadratic behavior. That solution isn't viable, however, with the branch optimizations happening in the buffer.

Yeah, unfortunately the branch opts have to happen in the buffer -- that's important for compile-time performance and lifting them elsewhere would be a large complexity shift (on the order of months of work) in the remainder of the backend because of the lower-level branch forms. (As an example, @elliottt worked to remove the last vestiges of EBBs and make branches truly two-target in CLIF as well, and that was something like a few weeks of delicate full-time work with a few followup issues.) We've been there before, we learned!

That said, I don't know if I buy that the two issues are coupled to this degree. MachBuffer does both things -- branch folding and label fixups -- and the former has an invariant that any branches in the "most recent branches contiguous with end of buffer" list will have fixups -- but the invariant needs nothing (is vacuously true) if the recent-branches list is empty, and it is always empty after emitting an island. So I think one could do whatever one wants with a binary-heap or whatnot, leveraging that.

However, there's also a complexity tradeoff here: I worry very much about bugs in this code (see the series of issues three summers ago that resulted in me writing the correctness proof in the comments) and I'd rather keep it using uniformly simple data structures where we can. "Nonlocal reasoning" sounds undesirable but that's blurring a bit what's going on here: basically we now have islands that cause every forward-crossing edge to go through a veneer. There are plenty of other ways that changes in some code can push some other code toward a different decision (regalloc, block ordering, side-effects blocking hoisting, ...) and we don't really worry about those nonlocal effects, or at least we accept them as normal.

Anyway, I'll update on Monday based on your comments -- thanks very much for the review!

@TimonPost
Copy link
Contributor

TimonPost commented Aug 6, 2023

Sorry, I made an error, my script errored for aarch64 with 'target not supported', used the wrong feature flag to compile wasmtime. I re-ran the script:

fn main() {
    const SANDBOX_PATH: &'static str = "src/sandbox.wasm";

    let mut times: HashMap<String, std::time::Duration,RandomState> = HashMap::default();

    for arch in &["x86_64", "aarch64"] {
        let start = Instant::now();
        let mut command = Command::new("wasmtime");
        command.arg("compile").arg(format!("--target={os}")).arg(format!("-o=output-{os}.wasm")).arg(
            SANDBOX_PATH,
        );

        println!("Start compiling on {}", arch);
        let child = command.spawn().unwrap();
        let output = child.wait_with_output().unwrap();
        times.insert(format!("sandbox-{}", arch), start.elapsed());

        println!(
            "[{:?}]: {os} ({})",
            start.elapsed(),
            String::from_utf8(output.stdout).unwrap()
        );
    }
}

Running three times:

x86_64: 11.6531752s, 11.5904301s, 11.5270225s
aarch64: 12.006055s, 11.5779827s,11.9106966s

Seems more plausible :)

@alexcrichton
Copy link
Member

@cfallin I don't think it's worth belaboring the point too much, but I think it's important to have more rationale in the code other than "otherwise it's quadratic". The actual result of the generated code here I personally think is pretty surprising where the purpose of veneers/promotion is to incrementally get more distance and that's not what's happening here because everything is "flushed" all at once no matter whether or not we're about to go out of range of a branch. At least to me it's surprising that given the otherwise meticulous attention to detail throughout the file this seemingly-large detail is quickly glossed over.

And again to clarify I'm not advocating for here-and-now removing branch optimizations from the mach buffer. I realize it's a deep refactoring and would take quite some work and that additionally this is the result of historical work and balancing efforts. I still do believe though that this probably isn't the best place to do this because it's extremely late in the pipeline and we're already unable to perform optimization such as constant-branch-folding with knock-on effects from that. For example I'd imagine that if constant-branch-folding were implemented then many of the optimizations that the mach buffer does would probably fall out of that, where in such a situation it may not be necessary for the mach buffer to be as clever as it is today.

I mostly want to clarify that it sounds like you're advocating that this must stay as-is and there's no viable alternative to the mach buffer's optimizations today. I'm not personally convinced about that myself. In any case though this is all orthogonal to the compilation-time performance gains and is probably best left for a future issue/discussion rather than here.


@TimonPost thanks for double-checking, and those new numbers make sense! It's true that 12 seconds still feels a bit high (although 115MB is a big module too). If you'd like it might be worth digging into the performance profile there. Locally in my example program most of the time was spent in b-trees in register allocation, which I don't know how to improve. If that's the case for your test case too it's probably at the limit of speed it's going to reach for now.

@TimonPost
Copy link
Contributor

TimonPost commented Aug 7, 2023

I might do some more profiling to see where most time is spent now. Also compiling the workspace with release profile, CARGO_PROFILE_RELEASE_OPT_LEVEL=z, CARGO_INCREMENTAL=1, helps us to optimize for binary size and get ~59MB instead of the 115MB. So after this PR there are no big issues left.

@cfallin cfallin enabled auto-merge August 7, 2023 19:55
@cfallin
Copy link
Member Author

cfallin commented Aug 7, 2023

Thanks @alexcrichton -- I filed #6818 to continue the larger discussion and added a more detailed "why" section to the top doc-comment here. I think actually doing deadlines properly might not be so bad, after more thought; will try to tackle this after some other higher-priority things.

@cfallin cfallin added this pull request to the merge queue Aug 7, 2023
Merged via the queue into bytecodealliance:main with commit 4f1b38f Aug 7, 2023
@cfallin cfallin deleted the quadratic-islands branch August 7, 2023 21:07
@alexcrichton
Copy link
Member

Looks good to me, and thanks for opening an issue! I'll read over that later today 👍

@TimonPost TimonPost mentioned this pull request Aug 11, 2023
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Aug 18, 2023
…dealliance#6804)

* WIP: two-tier processing for MachBuffer veneer processing in island emission

* Unconditionally emit veneers for label forward-refs that cross islands.

* Clean up and fix tests

* Review feedback

* Add some more detailed comments.
TimonPost pushed a commit to TimonPost/wasmtime that referenced this pull request Aug 24, 2023
…dealliance#6804)

* WIP: two-tier processing for MachBuffer veneer processing in island emission

* Unconditionally emit veneers for label forward-refs that cross islands.

* Clean up and fix tests

* Review feedback

* Add some more detailed comments.
cfallin added a commit that referenced this pull request Aug 24, 2023
…ing" to 12.0.1 (#6897)

* Cranelift: avoid quadratic behavior in label-fixup processing (#6804)

* WIP: two-tier processing for MachBuffer veneer processing in island emission

* Unconditionally emit veneers for label forward-refs that cross islands.

* Clean up and fix tests

* Review feedback

* Add some more detailed comments.

* Update the releases file to reflect the patch

---------

Co-authored-by: Chris Fallin <chris@cfallin.org>
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Aug 24, 2023
This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This
undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter
back to `emit_island`, and implements the changes discussed in bytecodealliance#6818.
Namely fixups are now tracked in a `pending_fixups` list for editing and
modification and then during island emission they're flushed into a
`BinaryHeap` tracked as `fixup_records`. Additionally calculation of the
size of the pending island is now done as a single calculation rather
than updating metadata as we go along. This is required because fixups
are no longer entirely cleared during island emission so the previous
logic of "reset the island size to zero" and have it get recalculated as
the island is emitted is no longer applicable. I opted to go with a
simplistic version of this for now which assumes that all veneers are
the worst case size, but if necessary I think this could be more optimal
by tracking each veneer in a counter.

Closes bytecodealliance#6818
github-merge-queue bot pushed a commit that referenced this pull request Aug 25, 2023
* Don't force veneers on island emission

This commit is a follow-up to #6804 with the discussion on #6818. This
undoes some of the changes from #6804, such as bringing a size parameter
back to `emit_island`, and implements the changes discussed in #6818.
Namely fixups are now tracked in a `pending_fixups` list for editing and
modification and then during island emission they're flushed into a
`BinaryHeap` tracked as `fixup_records`. Additionally calculation of the
size of the pending island is now done as a single calculation rather
than updating metadata as we go along. This is required because fixups
are no longer entirely cleared during island emission so the previous
logic of "reset the island size to zero" and have it get recalculated as
the island is emitted is no longer applicable. I opted to go with a
simplistic version of this for now which assumes that all veneers are
the worst case size, but if necessary I think this could be more optimal
by tracking each veneer in a counter.

Closes #6818

* Keep a running count for pending fixup deadline

Update this as pending fixups are added and then clear this out when
islands are emitted.

* Don't force all fixups to go through a binary heap

Instead process them immediately if they're ready.
eduardomourar pushed a commit to eduardomourar/wasmtime that referenced this pull request Sep 6, 2023
* Don't force veneers on island emission

This commit is a follow-up to bytecodealliance#6804 with the discussion on bytecodealliance#6818. This
undoes some of the changes from bytecodealliance#6804, such as bringing a size parameter
back to `emit_island`, and implements the changes discussed in bytecodealliance#6818.
Namely fixups are now tracked in a `pending_fixups` list for editing and
modification and then during island emission they're flushed into a
`BinaryHeap` tracked as `fixup_records`. Additionally calculation of the
size of the pending island is now done as a single calculation rather
than updating metadata as we go along. This is required because fixups
are no longer entirely cleared during island emission so the previous
logic of "reset the island size to zero" and have it get recalculated as
the island is emitted is no longer applicable. I opted to go with a
simplistic version of this for now which assumes that all veneers are
the worst case size, but if necessary I think this could be more optimal
by tracking each veneer in a counter.

Closes bytecodealliance#6818

* Keep a running count for pending fixup deadline

Update this as pending fixups are added and then clear this out when
islands are emitted.

* Don't force all fixups to go through a binary heap

Instead process them immediately if they're ready.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:aarch64 Issues related to AArch64 backend. cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cranelift wasm module compilation seems slower
3 participants