-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
wasmtime 0.37 has much smaller call stack limit than 0.36 #4214
Comments
wasmtime 0.37.0 seems to have a much smaller call stack size. Several test cases with deep recursion are crashing the runtime. See: bytecodealliance/wasmtime#4214
cc @cfallin, this is probably related to regalloc2 perhaps? |
It could be; a larger stack frame in some particular function of this program due to different spilling behavior could very well cause this. @tiran, for context, we switched to a new register allocator (regalloc2) in wasmtime 0.37, and it works in quite a different way from the old one (generally better results, enough on average to justify switching, but not always).
I'm definitely happy to look at this -- it's possible that some heuristic can be improved, or that something is just not working as expected. We should also note though that stackframe size is a completely implementation-defined detail. One can configure the stack size of one's instances and sometimes this may be necessary for deeply recursive programs. To be clear I am absolutely wanting to dive into this and understand what happened, and we shouldn't have regressions of this sort if we can help it at all; but just want to set expectations at the same time :-) Thanks for reporting the issue! |
Nevermind this, I fail at reading occasionally; I see your .zip at the top of the comment. Thanks! |
I got an odd result with bisection locally reproducing with the I suspect there's probably x86_64-vs-aarch64 differences which makes it such that v0.37.0 fails for @tiran but doesn't fail for me. Otherwise though there may be something going on with the alias analysis added that accidentally increases stack sizes. (or it was really close to the limit prior and alias analysis just happened to push it over the edge or something like that) |
I did a quick analysis of compiled objects from the Before:
after:
so we are seeing significantly larger stack frames in some cases. My current hypothesis is that alias analysis is creating longer liveranges by doing what it does (finding aliases, using values from further up the function body), and so this is potentially causing more spilling. This is a pretty classic code-motion problem with the usual NP-hard flavor -- do we reuse a value, but pay for storing it, or do we recompute it (in this case reloading); the former is better if we have free registers, the latter if we don't, but compiler phasing puts this decision before we get to regalloc and know for sure. The alias analysis got some benefits but they were middling at best (see #4163), because it actually would do better when combined with GVN and other analyses concurrently; I'm happy to keep it in-tree but turn it off by default if we think that's the more predictable path for now, and turn it back on as part of a bigger mid-end optimization effort. Thoughts? |
and with alias analysis disabled, the distribution is:
so there are still some large stack frames, but slightly fewer. On my x86-64 machine, with alias analysis disabled I still get a Wasm stack overflow trap with the repro above. I suspect the difference wrt @alexcrichton 's results on aarch64 is that aarch64 has twice as many registers so needs smaller stackframes generally. So:
|
I picked the biggest stack frame in the object which was for
This function is indeed part of the recursive trace in the stack traces above, which means that inflating the stack size of this function indeed reduces the amount of recursion available. Given that 0.37.0 doesn't have alias analysis while I'm sure it affects stack usage somewhat (as is evidenced from my bisection run on aarch64) this seems more worrisome. While different stack sizes are expected is it expected that ra2 generates a stack size that's 4x larger for equivalent code on x86_64? |
Indeed, that corresponds to the 0x260 (608) in the histogram above; 0x950 (2384) bytes on Definitely not expected! I'm looking into the stackslot merging as mentioned above... |
Thank you for looking into the issue. I appreciate your promptly and detailed feedback. I'm in the process of adding experimental support for wasm32-emscripten and wasm32-wasi to CPython upstream. Basic Emscripten support is mostly done. We now have all downstream patches from Pyodide integrated into CPython upstream and are able to run the test suite successfully under Node. WASI support is in an earlier state. I had all crashes fixed with WASI SDK 15 and wasmtime 0.36. WASI SDK 16 fixed a bug in utimensat that affected CPython. Right now I'm going through all remaining test failures and added skip markers to test cases that do not work with wasmtime (e.g. symlinks with absolute paths, tests that depend on working chmod/umask, etc.). Once I have WASI builds + wasmtime figured out and stable, I plan to look into other runtime. Does WASI offer an API to introspect the call stack at runtime so we could determinate the Python recursion limit dynamically? @alexcrichton The function |
Currently there is no way for a wasm module to determine its own stack limit or otherwise see how big stack frames are, so there's no way for the wasm module itself to detect what its recursion limit should be. Otherwise though it definitely makes sense that The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well |
wasmtime 0.37.0 seems to have a much smaller call stack size. Several test cases with deep recursion are crashing the runtime. See: bytecodealliance/wasmtime#4214
The old logic, which did some linked-list rearranging to try to probe more-likely-to-be-free slots first and which was inherited straight from the original IonMonkey allocator, was slightly broken (error in translation and not in IonMonkey, to be clear): it did not get the list-splicing right, so quite often dropped a slot on the floor and failed to consider it for further reuse. On some experimentation, it seems to work just as well to keep a SmallVec of spillslot indices per size class instead, and save the last probe-point in order to spread load throughout the allocated slots while limiting the number of probes (to bound quadratic behavior). This change moves the maximum slot count from 285 to 92 in `python.wasm` from bytecodealliance/wasmtime#4214, and the maximum frame size from 2384 bytes to 752 bytes.
Alright, I've discovered a truly boneheaded mistake of my own making in the spillslot allocation in regalloc2; PR incoming, but the histogram is now:
or a maximum frame size of 752 bytes (on x86-64). The remaining delta from regalloc.rs I believe can be explained by: (i) regalloc.rs reuses frame space between integer and float/vec freelists, while RA2 does not, for simplicity (they are different size classes); and (ii) RA2 bounds the probing (testing for overlap of liveranges) of spillslots to 10 attempts max, to avoid quadratic behavior, while regalloc.rs does not have such a bound, so gets more compaction sometimes at the cost of compile time. |
Changes are up in a PR; maybe if we're lucky, we can get it reviewed, version-bumped, and pulled into Cranelift today before the 0.38 branch starts its two-week freeze/fuzzing period on Jun 5 (this weekend)! |
* Fix spillslot allocation to actually reuse spillslots. The old logic, which did some linked-list rearranging to try to probe more-likely-to-be-free slots first and which was inherited straight from the original IonMonkey allocator, was slightly broken (error in translation and not in IonMonkey, to be clear): it did not get the list-splicing right, so quite often dropped a slot on the floor and failed to consider it for further reuse. On some experimentation, it seems to work just as well to keep a SmallVec of spillslot indices per size class instead, and save the last probe-point in order to spread load throughout the allocated slots while limiting the number of probes (to bound quadratic behavior). This change moves the maximum slot count from 285 to 92 in `python.wasm` from bytecodealliance/wasmtime#4214, and the maximum frame size from 2384 bytes to 752 bytes.
@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report! |
Thank you for the quick analysis and fix. Much appreciated!
@alexcrichton Will do! :) As I mentioned before we are at a very early stage of WASI support. I'm new to WebAssembly world, too. I started with Emscripten support for CPython little more than half a year ago. I'll start to open feature requests once I have a better understanding of WASI. In the mean time you can find my notes at https://github.com/python/cpython/blob/main/Tools/wasm/README.md |
Test Case
wasmtime-stack-py.zip wasm32-wasi build of CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.
Steps to Reproduce
cd wasmtime-stack-py
wasmtime run --dir . python.wasm -- recursive.py 1500
with wasmtime 0.36 and 0.37Expected Results
wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.
Actual Results
wasmtime 0.37 fails with a much lower recursion limit. It starts to run into call stack exhaustion at about 480 Python recursions and 1155 WASM call stacks. Please note that the highest WASM call stack number with wasmtime 0.36 is 3584.
Versions and Environment
Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux
Operating system: Linux (Fedora 36)
Architecture: X84_64
WASI SDK: 16.0
Python 3.12-dev (latest commit on main as of today)
Extra Info
You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/
The text was updated successfully, but these errors were encountered: