wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

tiran · 2022-06-03T14:20:32Z

Test Case

wasmtime-stack-py.zip wasm32-wasi build of CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.

Steps to Reproduce

extract zipfile
cd wasmtime-stack-py
Run wasmtime run --dir . python.wasm -- recursive.py 1500 with wasmtime 0.36 and 0.37

Expected Results

wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.

Error: failed to run main module `python.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: call stack exhausted
       wasm backtrace:
           0: <unknown>!_PyEval_EvalFrameDefault
           1: 0x14e399 - <unknown>!_PyEval_Vector
           2: 0x57e85 - <unknown>!_PyFunction_Vectorcall
           3: 0x57d9d - <unknown>!PyObject_CallOneArg
           4: 0xc85dd - <unknown>!slot_tp_repr
...
         3574: 0x193776 - <unknown>!run_mod
         3575: 0x192e05 - <unknown>!_PyRun_SimpleFileObject
         3576: 0x1924bb - <unknown>!_PyRun_AnyFileObject
         3577: 0x1b23d3 - <unknown>!Py_RunMain
         3578: 0x1b2ede - <unknown>!pymain_main
         3579: 0x1b2f8d - <unknown>!Py_BytesMain
         3580: 0x4c52 - <unknown>!main
         3581: 0x31b09e - <unknown>!__main_void
         3582: 0x31b01d - <unknown>!__original_main
         3583: 0x4c36 - <unknown>!_start
         3584: 0x3343f7 - <unknown>!_start.command_export
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information

Actual Results

wasmtime 0.37 fails with a much lower recursion limit. It starts to run into call stack exhaustion at about 480 Python recursions and 1155 WASM call stacks. Please note that the highest WASM call stack number with wasmtime 0.36 is 3584.

Error: failed to run main module `python.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: call stack exhausted
       wasm backtrace:
           0: <unknown>!_PyEval_EvalFrameDefault
           1: 0x14e399 - <unknown>!_PyEval_Vector
           2: 0x57e85 - <unknown>!_PyFunction_Vectorcall
           3: 0x57d9d - <unknown>!PyObject_CallOneArg
           4: 0xc85dd - <unknown>!slot_tp_repr
...
         1145: 0x193776 - <unknown>!run_mod
         1146: 0x192e05 - <unknown>!_PyRun_SimpleFileObject
         1147: 0x1924bb - <unknown>!_PyRun_AnyFileObject
         1148: 0x1b23d3 - <unknown>!Py_RunMain
         1149: 0x1b2ede - <unknown>!pymain_main
         1150: 0x1b2f8d - <unknown>!Py_BytesMain
         1151: 0x4c52 - <unknown>!main
         1152: 0x31b09e - <unknown>!__main_void
         1153: 0x31b01d - <unknown>!__original_main
         1154: 0x4c36 - <unknown>!_start
         1155: 0x3343f7 - <unknown>!_start.command_export
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information

Versions and Environment

Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux

Operating system: Linux (Fedora 36)

Architecture: X84_64

WASI SDK: 16.0

Python 3.12-dev (latest commit on main as of today)

Extra Info

You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/

The text was updated successfully, but these errors were encountered:

wasmtime 0.37.0 seems to have a much smaller call stack size. Several test cases with deep recursion are crashing the runtime. See: bytecodealliance/wasmtime#4214

alexcrichton · 2022-06-03T14:59:05Z

cc @cfallin, this is probably related to regalloc2 perhaps?

cfallin · 2022-06-03T16:13:55Z

It could be; a larger stack frame in some particular function of this program due to different spilling behavior could very well cause this. @tiran, for context, we switched to a new register allocator (regalloc2) in wasmtime 0.37, and it works in quite a different way from the old one (generally better results, enough on average to justify switching, but not always).

@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.

I'm definitely happy to look at this -- it's possible that some heuristic can be improved, or that something is just not working as expected.

We should also note though that stackframe size is a completely implementation-defined detail. One can configure the stack size of one's instances and sometimes this may be necessary for deeply recursive programs. To be clear I am absolutely wanting to dive into this and understand what happened, and we shouldn't have regressions of this sort if we can help it at all; but just want to set expectations at the same time :-)

Thanks for reporting the issue!

cfallin · 2022-06-03T16:49:14Z

@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.

Nevermind this, I fail at reading occasionally; I see your .zip at the top of the comment. Thanks!

alexcrichton · 2022-06-03T17:10:40Z

I got an odd result with bisection locally reproducing with the *.zip above. I'm on an aarch64 machine and the v0.37.0 release actually runs the recursive.py just fine. On main, however, I get the stack overflow as reported here. Bisection points to 0824abb as the "cause" and I can confirm that before that commit recursive.py runs ok but on that commit recursive.py runs out of stack.

I suspect there's probably x86_64-vs-aarch64 differences which makes it such that v0.37.0 fails for @tiran but doesn't fail for me. Otherwise though there may be something going on with the alias analysis added that accidentally increases stack sizes. (or it was really close to the limit prior and alias analysis just happened to push it over the edge or something like that)

cfallin · 2022-06-03T17:23:41Z

I did a quick analysis of compiled objects from the python.wasm in the zip before (0.35.2 that I happened to have lying around) and after (main from today). I grepped out all sub $IMM, %rsp instructions, which appear in prologues to allocate stack frames, then sorted and counted instances to form histograms.

Before:

after:

so we are seeing significantly larger stack frames in some cases.

My current hypothesis is that alias analysis is creating longer liveranges by doing what it does (finding aliases, using values from further up the function body), and so this is potentially causing more spilling. This is a pretty classic code-motion problem with the usual NP-hard flavor -- do we reuse a value, but pay for storing it, or do we recompute it (in this case reloading); the former is better if we have free registers, the latter if we don't, but compiler phasing puts this decision before we get to regalloc and know for sure.

The alias analysis got some benefits but they were middling at best (see #4163), because it actually would do better when combined with GVN and other analyses concurrently; I'm happy to keep it in-tree but turn it off by default if we think that's the more predictable path for now, and turn it back on as part of a bigger mid-end optimization effort. Thoughts?

cfallin · 2022-06-03T17:35:08Z

and with alias analysis disabled, the distribution is:

so there are still some large stack frames, but slightly fewer. On my x86-64 machine, with alias analysis disabled I still get a Wasm stack overflow trap with the repro above. I suspect the difference wrt @alexcrichton 's results on aarch64 is that aarch64 has twice as many registers so needs smaller stackframes generally.

So:

I can take a look at stackslot merging in RA2 to see if it's doing anything clearly wrong, or easy to fix;
We can turn off alias analysis for now.

alexcrichton · 2022-06-03T18:04:22Z

I picked the biggest stack frame in the object which was for _wasm_function_3078 (_PyEval_EvalFrameDefault) and compared the sizes of it:

	0.36.0	0.37.0
x86_64	608	2816
aarch64	528	1168

This function is indeed part of the recursive trace in the stack traces above, which means that inflating the stack size of this function indeed reduces the amount of recursion available.

Given that 0.37.0 doesn't have alias analysis while I'm sure it affects stack usage somewhat (as is evidenced from my bisection run on aarch64) this seems more worrisome. While different stack sizes are expected is it expected that ra2 generates a stack size that's 4x larger for equivalent code on x86_64?

cfallin · 2022-06-03T18:21:09Z

Indeed, that corresponds to the 0x260 (608) in the histogram above; 0x950 (2384) bytes on main.

Definitely not expected! I'm looking into the stackslot merging as mentioned above...

tiran · 2022-06-03T18:26:04Z

Thank you for looking into the issue. I appreciate your promptly and detailed feedback.

I'm in the process of adding experimental support for wasm32-emscripten and wasm32-wasi to CPython upstream. Basic Emscripten support is mostly done. We now have all downstream patches from Pyodide integrated into CPython upstream and are able to run the test suite successfully under Node. WASI support is in an earlier state. I had all crashes fixed with WASI SDK 15 and wasmtime 0.36. WASI SDK 16 fixed a bug in utimensat that affected CPython. Right now I'm going through all remaining test failures and added skip markers to test cases that do not work with wasmtime (e.g. symlinks with absolute paths, tests that depend on working chmod/umask, etc.).

Once I have WASI builds + wasmtime figured out and stable, I plan to look into other runtime. Does WASI offer an API to introspect the call stack at runtime so we could determinate the Python recursion limit dynamically?

@alexcrichton The function _PyEval_EvalFrameDefault is the core of Python's ceval loop and executes Python byte code. It's a very long function with lots of locals. On GCC and clang it normally uses computed gotos. Since WASM target does not support computed gotos, it falls back to a long switch statement.

alexcrichton · 2022-06-03T18:34:32Z

Currently there is no way for a wasm module to determine its own stack limit or otherwise see how big stack frames are, so there's no way for the wasm module itself to detect what its recursion limit should be.

Otherwise though it definitely makes sense that _PyEval_EvalFrameDefault needs to be pretty optimized! @cfallin is investigating the possibility of an accidental bug being introduced when Wasmtime switched from one register allocator to another. The new register allocator seems to produce much larger stack sizes than before for _PyEval_EvalFrameDefault which is believed to be an unintended result of switching register allocators.

The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well

wasmtime 0.37.0 seems to have a much smaller call stack size. Several test cases with deep recursion are crashing the runtime. See: bytecodealliance/wasmtime#4214

The old logic, which did some linked-list rearranging to try to probe more-likely-to-be-free slots first and which was inherited straight from the original IonMonkey allocator, was slightly broken (error in translation and not in IonMonkey, to be clear): it did not get the list-splicing right, so quite often dropped a slot on the floor and failed to consider it for further reuse. On some experimentation, it seems to work just as well to keep a SmallVec of spillslot indices per size class instead, and save the last probe-point in order to spread load throughout the allocated slots while limiting the number of probes (to bound quadratic behavior). This change moves the maximum slot count from 285 to 92 in `python.wasm` from bytecodealliance/wasmtime#4214, and the maximum frame size from 2384 bytes to 752 bytes.

cfallin · 2022-06-03T21:05:47Z

Alright, I've discovered a truly boneheaded mistake of my own making in the spillslot allocation in regalloc2; PR incoming, but the histogram is now:

or a maximum frame size of 752 bytes (on x86-64). The remaining delta from regalloc.rs I believe can be explained by: (i) regalloc.rs reuses frame space between integer and float/vec freelists, while RA2 does not, for simplicity (they are different size classes); and (ii) RA2 bounds the probing (testing for overlap of liveranges) of spillslots to 10 attempts max, to avoid quadratic behavior, while regalloc.rs does not have such a bound, so gets more compaction sometimes at the cost of compile time.

cfallin · 2022-06-03T21:28:57Z

Changes are up in a PR; maybe if we're lucky, we can get it reviewed, version-bumped, and pulled into Cranelift today before the 0.38 branch starts its two-week freeze/fuzzing period on Jun 5 (this weekend)!

* Fix spillslot allocation to actually reuse spillslots. The old logic, which did some linked-list rearranging to try to probe more-likely-to-be-free slots first and which was inherited straight from the original IonMonkey allocator, was slightly broken (error in translation and not in IonMonkey, to be clear): it did not get the list-splicing right, so quite often dropped a slot on the floor and failed to consider it for further reuse. On some experimentation, it seems to work just as well to keep a SmallVec of spillslot indices per size class instead, and save the last probe-point in order to spread load throughout the allocated slots while limiting the number of probes (to bound quadratic behavior). This change moves the maximum slot count from 285 to 92 in `python.wasm` from bytecodealliance/wasmtime#4214, and the maximum frame size from 2384 bytes to 752 bytes.

cfallin · 2022-06-04T00:13:48Z

@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!

tiran · 2022-06-04T12:34:55Z

@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!

Thank you for the quick analysis and fix. Much appreciated!

The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well

@alexcrichton Will do! :) As I mentioned before we are at a very early stage of WASI support. I'm new to WebAssembly world, too. I started with Emscripten support for CPython little more than half a year ago. I'll start to open feature requests once I have a better understanding of WASI. In the mean time you can find my notes at https://github.com/python/cpython/blob/main/Tools/wasm/README.md

tiran added the bug Incorrect behavior in the current implementation that needs fixing label Jun 3, 2022

tiran added a commit to tiran/cpython_builddep that referenced this issue Jun 3, 2022

Downgrade to wasmtime 0.36.0

0be7089

wasmtime 0.37.0 seems to have a much smaller call stack size. Several test cases with deep recursion are crashing the runtime. See: bytecodealliance/wasmtime#4214

tiran mentioned this issue Jun 3, 2022

Downgrade to wasmtime 0.36.0 tiran/cpython_builddep#32

Merged

tiran added a commit to tiran/cpython_builddep that referenced this issue Jun 3, 2022

Downgrade to wasmtime 0.36.0 (#32)

46d5c0e

wasmtime 0.37.0 seems to have a much smaller call stack size. Several test cases with deep recursion are crashing the runtime. See: bytecodealliance/wasmtime#4214

cfallin mentioned this issue Jun 3, 2022

Fix spillslot allocation to actually reuse spillslots. bytecodealliance/regalloc2#56

Merged

cfallin mentioned this issue Jun 3, 2022

Upgrade to regalloc2 v0.2.2. #4222

Merged

cfallin closed this as completed in #4222 Jun 4, 2022

tiran mentioned this issue Jul 28, 2022

test_compiler_recursion_limit fails on wasm32-wasi python/cpython#95335

Closed

s0me0ne-unkn0wn mentioned this issue Jan 4, 2023

Executor Environment parameterization paritytech/polkadot#6161

Merged

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

tiran commented Jun 3, 2022

alexcrichton commented Jun 3, 2022

cfallin commented Jun 3, 2022 •

edited

Loading

cfallin commented Jun 3, 2022

alexcrichton commented Jun 3, 2022

cfallin commented Jun 3, 2022

cfallin commented Jun 3, 2022

alexcrichton commented Jun 3, 2022

cfallin commented Jun 3, 2022

tiran commented Jun 3, 2022

alexcrichton commented Jun 3, 2022 •

edited

Loading

cfallin commented Jun 3, 2022

cfallin commented Jun 3, 2022

cfallin commented Jun 4, 2022

tiran commented Jun 4, 2022

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

Comments

tiran commented Jun 3, 2022

Test Case

Steps to Reproduce

Expected Results

Actual Results

Versions and Environment

Extra Info

alexcrichton commented Jun 3, 2022

cfallin commented Jun 3, 2022 • edited Loading

cfallin commented Jun 3, 2022

alexcrichton commented Jun 3, 2022

cfallin commented Jun 3, 2022

cfallin commented Jun 3, 2022

alexcrichton commented Jun 3, 2022

cfallin commented Jun 3, 2022

tiran commented Jun 3, 2022

alexcrichton commented Jun 3, 2022 • edited Loading

cfallin commented Jun 3, 2022

cfallin commented Jun 3, 2022

cfallin commented Jun 4, 2022

tiran commented Jun 4, 2022

cfallin commented Jun 3, 2022 •

edited

Loading

alexcrichton commented Jun 3, 2022 •

edited

Loading