Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

Closed
tiran opened this issue Jun 3, 2022 · 14 comments · Fixed by #4222
Closed

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

tiran opened this issue Jun 3, 2022 · 14 comments · Fixed by #4222
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@tiran
Copy link

tiran commented Jun 3, 2022

Test Case

wasmtime-stack-py.zip wasm32-wasi build of CPython main branch (3.12-dev) with part of the stdlib and a reproducer file.

Steps to Reproduce

  • extract zipfile
  • cd wasmtime-stack-py
  • Run wasmtime run --dir . python.wasm -- recursive.py 1500 with wasmtime 0.36 and 0.37

Expected Results

wasmtime 0.36 works correctly with a Python recursion limit up to a little more than 1500.

Error: failed to run main module `python.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: call stack exhausted
       wasm backtrace:
           0: <unknown>!_PyEval_EvalFrameDefault
           1: 0x14e399 - <unknown>!_PyEval_Vector
           2: 0x57e85 - <unknown>!_PyFunction_Vectorcall
           3: 0x57d9d - <unknown>!PyObject_CallOneArg
           4: 0xc85dd - <unknown>!slot_tp_repr
...
         3574: 0x193776 - <unknown>!run_mod
         3575: 0x192e05 - <unknown>!_PyRun_SimpleFileObject
         3576: 0x1924bb - <unknown>!_PyRun_AnyFileObject
         3577: 0x1b23d3 - <unknown>!Py_RunMain
         3578: 0x1b2ede - <unknown>!pymain_main
         3579: 0x1b2f8d - <unknown>!Py_BytesMain
         3580: 0x4c52 - <unknown>!main
         3581: 0x31b09e - <unknown>!__main_void
         3582: 0x31b01d - <unknown>!__original_main
         3583: 0x4c36 - <unknown>!_start
         3584: 0x3343f7 - <unknown>!_start.command_export
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information

Actual Results

wasmtime 0.37 fails with a much lower recursion limit. It starts to run into call stack exhaustion at about 480 Python recursions and 1155 WASM call stacks. Please note that the highest WASM call stack number with wasmtime 0.36 is 3584.

Error: failed to run main module `python.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: call stack exhausted
       wasm backtrace:
           0: <unknown>!_PyEval_EvalFrameDefault
           1: 0x14e399 - <unknown>!_PyEval_Vector
           2: 0x57e85 - <unknown>!_PyFunction_Vectorcall
           3: 0x57d9d - <unknown>!PyObject_CallOneArg
           4: 0xc85dd - <unknown>!slot_tp_repr
...
         1145: 0x193776 - <unknown>!run_mod
         1146: 0x192e05 - <unknown>!_PyRun_SimpleFileObject
         1147: 0x1924bb - <unknown>!_PyRun_AnyFileObject
         1148: 0x1b23d3 - <unknown>!Py_RunMain
         1149: 0x1b2ede - <unknown>!pymain_main
         1150: 0x1b2f8d - <unknown>!Py_BytesMain
         1151: 0x4c52 - <unknown>!main
         1152: 0x31b09e - <unknown>!__main_void
         1153: 0x31b01d - <unknown>!__original_main
         1154: 0x4c36 - <unknown>!_start
         1155: 0x3343f7 - <unknown>!_start.command_export
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable to may show more debugging information

Versions and Environment

Wasmtime version or commit: wasmtime-v0.37.0-x86_64-linux

Operating system: Linux (Fedora 36)

Architecture: X84_64

WASI SDK: 16.0

Python 3.12-dev (latest commit on main as of today)

Extra Info

You can find build scripts and artifacts in the git repo https://github.com/ethanhs/python-wasm/

@tiran tiran added the bug Incorrect behavior in the current implementation that needs fixing label Jun 3, 2022
tiran added a commit to tiran/cpython_builddep that referenced this issue Jun 3, 2022
wasmtime 0.37.0 seems to have a much smaller call stack size. Several
test cases with deep recursion are crashing the runtime.

See: bytecodealliance/wasmtime#4214
@alexcrichton
Copy link
Member

cc @cfallin, this is probably related to regalloc2 perhaps?

@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

It could be; a larger stack frame in some particular function of this program due to different spilling behavior could very well cause this. @tiran, for context, we switched to a new register allocator (regalloc2) in wasmtime 0.37, and it works in quite a different way from the old one (generally better results, enough on average to justify switching, but not always).

@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.

I'm definitely happy to look at this -- it's possible that some heuristic can be improved, or that something is just not working as expected.

We should also note though that stackframe size is a completely implementation-defined detail. One can configure the stack size of one's instances and sometimes this may be necessary for deeply recursive programs. To be clear I am absolutely wanting to dive into this and understand what happened, and we shouldn't have regressions of this sort if we can help it at all; but just want to set expectations at the same time :-)

Thanks for reporting the issue!

@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

@tiran, I see your note about "artifacts and build scripts in the repo", but could you give a URL for a specific .wasm that I could download and compile, to compare the output before-and-after? That would greatly help diagnosing.

Nevermind this, I fail at reading occasionally; I see your .zip at the top of the comment. Thanks!

@alexcrichton
Copy link
Member

I got an odd result with bisection locally reproducing with the *.zip above. I'm on an aarch64 machine and the v0.37.0 release actually runs the recursive.py just fine. On main, however, I get the stack overflow as reported here. Bisection points to 0824abb as the "cause" and I can confirm that before that commit recursive.py runs ok but on that commit recursive.py runs out of stack.

I suspect there's probably x86_64-vs-aarch64 differences which makes it such that v0.37.0 fails for @tiran but doesn't fail for me. Otherwise though there may be something going on with the alias analysis added that accidentally increases stack sizes. (or it was really close to the limit prior and alias analysis just happened to push it over the edge or something like that)

@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

I did a quick analysis of compiled objects from the python.wasm in the zip before (0.35.2 that I happened to have lying around) and after (main from today). I grepped out all sub $IMM, %rsp instructions, which appear in prologues to allocate stack frames, then sorted and counted instances to form histograms.

Before:

   2517 0x10
   1654 0x20
   1069 0x30
   1376 0x40
    864 0x50
    495 0x60
    302 0x70
    150 0x80
    121 0x90
     61 0xa0
     39 0xb0
     20 0xc0
     15 0xd0
     19 0xe0
      8 0xf0
      5 0x100
      5 0x110
      2 0x120
      1 0x130
      1 0x160
      2 0x170
      1 0x180
      2 0x190
      2 0x1a0
      1 0x260

after:

   2471 0x10
   1640 0x20
   1212 0x30
   1471 0x40
    791 0x50
    424 0x60
    273 0x70
    141 0x80
    101 0x90
     62 0xa0
     32 0xb0
     33 0xc0
     26 0xd0
     20 0xe0
     11 0xf0
      3 0x100
      6 0x110
      8 0x120
      8 0x130
      6 0x140
      1 0x150
      2 0x180
      1 0x190
      2 0x1a0
      1 0x1b0
      1 0x1c0
      1 0x1e0
      2 0x1f0
      2 0x200
      1 0x310
      2 0x3e0
      1 0x550
      1 0x620
      1 0x950

so we are seeing significantly larger stack frames in some cases.

My current hypothesis is that alias analysis is creating longer liveranges by doing what it does (finding aliases, using values from further up the function body), and so this is potentially causing more spilling. This is a pretty classic code-motion problem with the usual NP-hard flavor -- do we reuse a value, but pay for storing it, or do we recompute it (in this case reloading); the former is better if we have free registers, the latter if we don't, but compiler phasing puts this decision before we get to regalloc and know for sure.

The alias analysis got some benefits but they were middling at best (see #4163), because it actually would do better when combined with GVN and other analyses concurrently; I'm happy to keep it in-tree but turn it off by default if we think that's the more predictable path for now, and turn it back on as part of a bigger mid-end optimization effort. Thoughts?

@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

and with alias analysis disabled, the distribution is:

   2473 0x10
   1644 0x20
   1202 0x30
   1500 0x40
    807 0x50
    431 0x60
    255 0x70
    142 0x80
     83 0x90
     60 0xa0
     32 0xb0
     32 0xc0
     16 0xd0
     21 0xe0
     14 0xf0
      5 0x100
      6 0x110
      6 0x120
      8 0x130
      3 0x140
      1 0x150
      2 0x180
      3 0x190
      2 0x1b0
      1 0x1e0
      4 0x1f0
      1 0x360
      1 0x370
      1 0x3d0
      1 0x560
      1 0x640
      1 0x930

so there are still some large stack frames, but slightly fewer. On my x86-64 machine, with alias analysis disabled I still get a Wasm stack overflow trap with the repro above. I suspect the difference wrt @alexcrichton 's results on aarch64 is that aarch64 has twice as many registers so needs smaller stackframes generally.

So:

  • I can take a look at stackslot merging in RA2 to see if it's doing anything clearly wrong, or easy to fix;
  • We can turn off alias analysis for now.

@alexcrichton
Copy link
Member

I picked the biggest stack frame in the object which was for _wasm_function_3078 (_PyEval_EvalFrameDefault) and compared the sizes of it:

0.36.0 0.37.0
x86_64 608 2816
aarch64 528 1168

This function is indeed part of the recursive trace in the stack traces above, which means that inflating the stack size of this function indeed reduces the amount of recursion available.

Given that 0.37.0 doesn't have alias analysis while I'm sure it affects stack usage somewhat (as is evidenced from my bisection run on aarch64) this seems more worrisome. While different stack sizes are expected is it expected that ra2 generates a stack size that's 4x larger for equivalent code on x86_64?

@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

Indeed, that corresponds to the 0x260 (608) in the histogram above; 0x950 (2384) bytes on main.

Definitely not expected! I'm looking into the stackslot merging as mentioned above...

@tiran
Copy link
Author

tiran commented Jun 3, 2022

Thank you for looking into the issue. I appreciate your promptly and detailed feedback.

I'm in the process of adding experimental support for wasm32-emscripten and wasm32-wasi to CPython upstream. Basic Emscripten support is mostly done. We now have all downstream patches from Pyodide integrated into CPython upstream and are able to run the test suite successfully under Node. WASI support is in an earlier state. I had all crashes fixed with WASI SDK 15 and wasmtime 0.36. WASI SDK 16 fixed a bug in utimensat that affected CPython. Right now I'm going through all remaining test failures and added skip markers to test cases that do not work with wasmtime (e.g. symlinks with absolute paths, tests that depend on working chmod/umask, etc.).

Once I have WASI builds + wasmtime figured out and stable, I plan to look into other runtime. Does WASI offer an API to introspect the call stack at runtime so we could determinate the Python recursion limit dynamically?

@alexcrichton The function _PyEval_EvalFrameDefault is the core of Python's ceval loop and executes Python byte code. It's a very long function with lots of locals. On GCC and clang it normally uses computed gotos. Since WASM target does not support computed gotos, it falls back to a long switch statement.

@alexcrichton
Copy link
Member

alexcrichton commented Jun 3, 2022

Currently there is no way for a wasm module to determine its own stack limit or otherwise see how big stack frames are, so there's no way for the wasm module itself to detect what its recursion limit should be.

Otherwise though it definitely makes sense that _PyEval_EvalFrameDefault needs to be pretty optimized! @cfallin is investigating the possibility of an accidental bug being introduced when Wasmtime switched from one register allocator to another. The new register allocator seems to produce much larger stack sizes than before for _PyEval_EvalFrameDefault which is believed to be an unintended result of switching register allocators.

The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well

tiran added a commit to tiran/cpython_builddep that referenced this issue Jun 3, 2022
wasmtime 0.37.0 seems to have a much smaller call stack size. Several
test cases with deep recursion are crashing the runtime.

See: bytecodealliance/wasmtime#4214
cfallin added a commit to cfallin/regalloc2 that referenced this issue Jun 3, 2022
The old logic, which did some linked-list rearranging to try to probe
more-likely-to-be-free slots first and which was inherited straight from
the original IonMonkey allocator, was slightly broken (error in
translation and not in IonMonkey, to be clear): it did not get the
list-splicing right, so quite often dropped a slot on the floor and
failed to consider it for further reuse.

On some experimentation, it seems to work just as well to keep a
SmallVec of spillslot indices per size class instead, and save the last
probe-point in order to spread load throughout the allocated slots while
limiting the number of probes (to bound quadratic behavior).

This change moves the maximum slot count from 285 to 92 in `python.wasm`
from bytecodealliance/wasmtime#4214, and the maximum frame size from
2384 bytes to 752 bytes.
@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

Alright, I've discovered a truly boneheaded mistake of my own making in the spillslot allocation in regalloc2; PR incoming, but the histogram is now:

   2471 0x10
   1640 0x20
   1212 0x30
   1471 0x40
    829 0x50
    432 0x60
    274 0x70
    151 0x80
     97 0x90
     57 0xa0
     43 0xb0
     23 0xc0
     15 0xd0
     11 0xe0
      5 0xf0
      7 0x100
      5 0x110
      4 0x120
      3 0x130
      1 0x170
      3 0x190
      1 0x1b0
      1 0x1e0
      2 0x2f0

or a maximum frame size of 752 bytes (on x86-64). The remaining delta from regalloc.rs I believe can be explained by: (i) regalloc.rs reuses frame space between integer and float/vec freelists, while RA2 does not, for simplicity (they are different size classes); and (ii) RA2 bounds the probing (testing for overlap of liveranges) of spillslots to 10 attempts max, to avoid quadratic behavior, while regalloc.rs does not have such a bound, so gets more compaction sometimes at the cost of compile time.

@cfallin
Copy link
Member

cfallin commented Jun 3, 2022

Changes are up in a PR; maybe if we're lucky, we can get it reviewed, version-bumped, and pulled into Cranelift today before the 0.38 branch starts its two-week freeze/fuzzing period on Jun 5 (this weekend)!

cfallin added a commit to bytecodealliance/regalloc2 that referenced this issue Jun 3, 2022
* Fix spillslot allocation to actually reuse spillslots.

The old logic, which did some linked-list rearranging to try to probe
more-likely-to-be-free slots first and which was inherited straight from
the original IonMonkey allocator, was slightly broken (error in
translation and not in IonMonkey, to be clear): it did not get the
list-splicing right, so quite often dropped a slot on the floor and
failed to consider it for further reuse.

On some experimentation, it seems to work just as well to keep a
SmallVec of spillslot indices per size class instead, and save the last
probe-point in order to spread load throughout the allocated slots while
limiting the number of probes (to bound quadratic behavior).

This change moves the maximum slot count from 285 to 92 in `python.wasm`
from bytecodealliance/wasmtime#4214, and the maximum frame size from
2384 bytes to 752 bytes.
@cfallin
Copy link
Member

cfallin commented Jun 4, 2022

@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!

@tiran
Copy link
Author

tiran commented Jun 4, 2022

@tiran this just merged, and made the window for 0.38.0; assuming all goes well (i.e. we don't have to back anything out), it should be included in our release scheduled for Jun 20. Thanks again for the report!

Thank you for the quick analysis and fix. Much appreciated!

The work you're doing sounds great though! If you've got specific issues with wasmtime's WASI support feel free to open issues here as well, and additionally if you've got thoughts about WASI in general I'm sure the WASI repository would be welcome to issues as well

@alexcrichton Will do! :) As I mentioned before we are at a very early stage of WASI support. I'm new to WebAssembly world, too. I started with Emscripten support for CPython little more than half a year ago. I'll start to open feature requests once I have a better understanding of WASI. In the mean time you can find my notes at https://github.com/python/cpython/blob/main/Tools/wasm/README.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants