Cranelift wasm module compilation seems slower #6798
Comments
Going to gather some more data and reopen if I have more.
Indeed, Cranelift is the whole compiler, so it makes sense that "slow compilation" involves Cranelift! Thanks very much for the report; we definitely weren't aware of this. If you're able to bisect to a particular commit that would be really useful; otherwise, if you have an example wasm module that gets much slower to compile, that would be the next best thing.
In addition to Chris' comments, here are a couple other thoughts: The profile shows a third of time is spent in emitting islands, which makes me think this module has some very large functions in it. That could explain some of the difference between targets: the threshold where a function is large enough to need islands is different on aarch64 than on x86. That said, I'm suspicious about the fact that different x86 systems have wildly different compile times too. It might be useful to run To normalize timing measurements you can do all your tests on the same computer so you aren't comparing the performance of different CPUs. You can use
To the above points I wrote a small program to compile a 115 MB Use
Given this program:

    use std::fmt::{self, Write};

    const F: usize = 2000; // number of functions
    const N: usize = 1000; // number of calls per function

    fn main() -> fmt::Result {
        let mut s = String::new();
        writeln!(s, "(module")?;
        for _ in 0..F {
            writeln!(s, "(func")?;
            for _ in 0..N {
                writeln!(s, "call $f")?;
            }
            writeln!(s, ")")?;
        }
        writeln!(s, "(func $f)")?;
        writeln!(s, ")")?;
        println!("{s}");
        Ok(())
    }

which generates a wasm via:
I then can reproduce this issue I believe with:
I believe the issue here is that there's quadratic behavior in the way that fixups are handled in the I stressed this above by setting the One "easy" fix is to change this line to a I was otherwise testing out locally an entirely different strategy where the Do others have a better idea about how to handle the Alternatively, one perhaps more radical idea would be to remove the branch optimizations entirely. My naive understanding of them is that they're better suited for restructuring the CFG (e.g. basic-block style mid-end optimizations) rather than emission-time optimizations. One unfortunate part here is that the branch optimizations which require
@alexcrichton I'm taking a look at this now. My most basic unanswered question is actually: what changed recently? The original report above is that this is a regression between v9 and v10; the core
Do you mean avoiding use of the That's possible for sure; I guess the question would then be whether we switch back to full Abs8 relocations for all calls (and reintroduce support for relocs into Wasmtime as code won't be fully PIC anymore) or get island handling some other way. Fundamentally it seems to me that:
It seems to me that the most elegant approach may be to kick some label-fixup records off to an "at max range, don't reconsider until final fixup" list; then not consider them in I can try to prototype this today (but I have a pretty meeting-heavy day sadly so if you want to get to it first, please feel free!).
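To make the scaling argument concrete, here is a toy model, not Cranelift's actual mach buffer code. It assumes an island check runs after every function and walks every fixup that is still pending; `fixup_visits` and `park_after_scan` are hypothetical names invented for this sketch. It compares rescanning everything against setting scanned fixups aside until the final fixup pass, using the F and N values from the generator program.

```rust
/// Toy model: count how many times pending fixups are visited.
/// `funcs` functions each add `calls` unresolved forward references, and
/// (by assumption) an island check runs after every function.
fn fixup_visits(funcs: u64, calls: u64, park_after_scan: bool) -> u64 {
    let mut pending = 0u64; // fixups that the next island check will walk
    let mut visits = 0u64;
    for _ in 0..funcs {
        pending += calls;
        visits += pending; // the check walks every pending fixup
        if park_after_scan {
            pending = 0; // set aside until the final fixup pass
        }
    }
    visits
}

fn main() {
    let (funcs, calls) = (2000, 1000); // F and N from the generator program
    println!("rescan everything: {}", fixup_visits(funcs, calls, false));
    println!("park after scan:   {}", fixup_visits(funcs, calls, true));
}
```

Under these assumptions the first strategy does N * F * (F + 1) / 2 visits and the second does N * F, which is the gap between quadratic and linear behavior in the number of functions.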
Actually it turns out that a simple implementation of the above is a Caltrain-ride-sized chunk of work: #6804. It still doesn't resolve the issue because there is actually a 32-bit PCrel label kind that we want to be able to use, so the Branch26's stick around and participate in the quadratic dance. I think we want some sort of tiered deadline approach; continuing to think!
The latest commit in #6804 now solves the issue, I think. I'll do more thorough testing and cleanup later in the day when I have a chance. One interesting consequence of the new change (every forward-ref that crosses an island gets a veneer) is that I think we no longer need a notion of worst-case size; but I'll verify that. It does also pessimize the case where a bunch of 26-bit forward refs cross a 19-bit-caused island in a very large function body, but "calls jump through an island in the middle of a function" is not technically wrong, so... (I'll measure with some benchmarks to be sure!)
Thank you for confirming and swiftly addressing the optimization approach. I have a question that hasn't been answered yet: why is there such a significant difference in performance between the aarch64 and x86_64 architectures? Also, I'd like to mention @cfallin and express a bit of uncertainty regarding whether this is related to a version regression. I cannot confirm that it is (also see benchmarks #6798 (comment)). However, various people in our team have noticed that recently the aarch64 performance has been noticeably slower compared to before. Our code base is growing fast so it is not odd to experience slower compile times, however, I wonder if there are perhaps certain things we do that are very heavy on wasm-generated-code (.wasm), and why there is such a big mismatch between targets?
This is a consequence of the kinds of label-references (relocations, kind of) that the two architectures have, and more broadly a difference between RISC-y architectures and CISC-y ones. AArch64 has constant-size (32-bit) instructions, so its branches have either a 19-bit or 26-bit offset field (the former for conditionals, which need to encode more information) -- but code can be larger than that. Likewise references to the "constant pool" have a 19-bit offset. To make larger code work, we create "veneers" in an "island" embedded in the code, where a branch first jumps to a longer-range branch (the veneer). The island also contains constants whose references are about to go out of range. This allows us to do single-pass emission, rather than going back and re-emitting with a longer-form branch (which may be a sequence of multiple insts and thus shift other offsets, causing a cascading fixup process). On the other hand, x86-64 has variable-length instructions and so can encode a 32-bit offset almost everywhere; we unconditionally use 32-bit offsets for conditional and unconditional jumps, and data references, so we never need an island of veneers/constants.
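As a rough illustration of the offset widths mentioned above, the sketch below computes the byte range each branch form can reach. It assumes the AArch64 offset fields count 4-byte instruction units (instructions are 4-byte aligned) and that the x86-64 32-bit offset counts bytes; `branch_reach` is a hypothetical helper written for this example, not a Cranelift function.

```rust
/// Reach, in bytes, of a PC-relative branch whose signed offset field is
/// `bits` wide and counts units of `scale` bytes.
fn branch_reach(bits: u32, scale: i64) -> (i64, i64) {
    let half = 1i64 << (bits - 1);
    (-half * scale, (half - 1) * scale)
}

fn main() {
    let cases = [
        ("aarch64 conditional (19-bit field, 4-byte units)", 19, 4),
        ("aarch64 unconditional (26-bit field, 4-byte units)", 26, 4),
        ("x86-64 rel32 (32-bit field, byte units)", 32, 1),
    ];
    for (name, bits, scale) in cases {
        let (lo, hi) = branch_reach(bits, scale);
        println!("{name}: {lo} ..= {hi} bytes");
    }
}
```

That works out to roughly ±1 MiB for the 19-bit form, ±128 MiB for the 26-bit form, and ±2 GiB for the 32-bit x86-64 form, which is why only the AArch64 backend needs islands for code of this size.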
If you have an example module we can benchmark and a bisection range ("faster with vX, slower since vY") that'd be very helpful!
If I understand Chris and Alex correctly, the relevant difference between x86_64 and aarch64 is that x86 branch instructions take 32-bit signed offsets, while aarch64 unconditional branches take effectively 28-bit signed offsets. Your 115MB wasm module is almost 2^27 bytes, so if the compiled aarch64 binary is bigger than the wasm input, then a branch from one end of it to the other is just past the threshold. If you had a 2GB wasm module I'd expect to start seeing similar problems on x86, assuming you didn't run into other bugs first. I would guess that things have gotten slower not because of Wasmtime changes but because as your wasm modules get bigger, the number of branches that overflow the signed offset limit increases. Oddly, I think this means that topo-sorting the functions in the wasm module would tend to avoid hitting this behavior in Wasmtime, and also let Wasmtime generate slightly better code. I'm not sure that's worth pursuing though.
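The "almost 2^27 bytes" estimate can be checked with a few lines of arithmetic. This sketch assumes "115 MB" means 115 MiB and treats the forward reach of a 28-bit signed byte offset as roughly 2^27 bytes; `fits_in_branch26` and the growth percentages are hypothetical values chosen for illustration, not measurements of real compiled output.

```rust
const MIB: u64 = 1 << 20;

/// True if a forward branch spanning `span` bytes fits in the roughly
/// 2^27-byte forward reach of a 28-bit signed byte offset.
fn fits_in_branch26(span: u64) -> bool {
    span < (1 << 27)
}

fn main() {
    let wasm = 115 * MIB; // assumption: "115 MB" read as 115 MiB
    println!("wasm size: {wasm} bytes, limit: {} bytes", 1u64 << 27);
    // Hypothetical compiled-code sizes as a percentage of the wasm input.
    for growth_percent in [100u64, 110, 112] {
        let code = wasm * growth_percent / 100;
        println!("{growth_percent}% of input -> fits: {}", fits_in_branch26(code));
    }
}
```

Under those assumptions, compiled code only about 12% larger than the input is already enough for an end-to-end branch to exceed the 26-bit form's reach.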
I think the answer here is @TimonPost's module got bigger. If I take my example module and compile it with Wasmtime 4.0.0 (to pick a random version) which is circa Dec 2022 I get x86_64 being 0.9s and aarch64 being 16s (different computer than my original measurements). In that sense I don't think that this is a new issue, I think this is a scaling issue that @TimonPost's module has run into now.
Oh sorry no definitely not, that's too load bearing to replace with something else! What I was saying is that for the inter-function use case the branch optimizations are not necessary (as there are no branches). I'll also clarify that I'm not saying the branch optimizations aren't worth it, I'm saying that, if it works out perf-wise, it might be better to perform these optimizations at the mid-end instead of just before emission (e.g. basic block jump threading and other basic block optimizations on the clif layer). Note though that I only think this is a possible good idea if it retains all of the benefits that the current branch optimization brings; if something must be done at the mach buffer layer and can't be done at the clif layer then it's good to keep.
Nice! I'll also see your train ride and raise you a plane ride :) I'll take a look probably early next week 👍 (unless someone beats me to it)
Ah, I see. I think it's probably best to keep
I can confirm our
We have noticed a recent regression in module compilation times on the MBP M1/M2. Sometimes it takes over 100 seconds when it used to take less than 15.
Several non-scientific numbers have been reported on various platforms. At first three individual Mac users started reporting extreme numbers, often exceeding 90 seconds, while Windows and Linux users report an average of around 50 seconds, with some reporting more normal numbers below 10 seconds. We have attempted to isolate the problem by manually clearing all caches and measuring how long it takes to recompile from scratch.
Upon further profiling, it appears that the performance bottleneck is related to the "cranelift" library (see image).
Any tips on how we can help find out the root cause here are appreciated. Are there perhaps benchmarks to run to see if regressions happened?
Some non-scientific data:
Perhaps introduced in v10.0.0 or v10.0.1, as we have not upgraded to 11 yet. In 9.0.4 we fixed another module compilation issue and had not experienced any problems until quite recently.