Implement support for tail calls in Wasmtime and Cranelift #1065

brunoczim · 2018-11-24T17:19:51Z

To write a functional language compiler using this IR, tail call eliminations would be desirable. Are there any plans to support this? I couldn't find any details in the docs.

lachlansneff · 2018-11-24T18:23:24Z

Webassembly will be getting explicit tail calls soon, so cranelift will have to support as well. At the moment, tce is not supported. I assume we'll add an instruction.

sunfishcode · 2018-11-25T13:07:20Z

Agreed. We'll need direct and indirect forms. And, we'll need a new calling convention, since native calling conventions don't support guaranteed tail calling in general.

For the calling convention, I imagine we can pick between implementing GHC's and/or HiPE's, and/or inventing something Cranelift-specific.

michalmuskala · 2018-12-25T20:41:07Z

I would be interested in this for a project I'm working on.

As far as I understand, the primary issue with calling conventions and tail calls are callee-saved registers. The issue is that there's generally no way of efficiently knowing if they were already spilt to stack or not by the time we're in a function prologue and similarly if we should restore them (and even from where) in the epilogue.

That's one of the reasons, why AFAIK all the calling conventions in LLVM that support tail calls don't have any callee-saved registers. That way the tail call can easily deallocate the current frame and just jump into the new function passing all the arguments in the registers or on the stack if needed, but does not keep anything on the stack that would need to be manipulated when finally returning.

In LLVM (at least on x64) the HiPE convention means that all integer types are "upgraded" to i64, all stack slots are 8-byte wide and 8-byte aligned. It additionally pins 6 registers - for the process struct (similar to the VM data Cranelift already supports), for the heap pointer and for 4 arguments.
The GHC convention also promotes all integer types to i64 and pins 10 registers - 4 for VM state and 6 for arguments. It additionally pins some floating point registers for arguments.

Looking through the code and some opened issues and PRs, I can see there's quite some work on special-casing some arguments to be pinned in specific registers, etc. I wonder if that could be somehow generalized for the front-end to provide those settings, if needed.

As far as what I would consider ideal support for my project - I'd be happy with instructions to execute the guaranteed tail calls as well as a way to pin certain registers for needed data structures. I believe it's not very scalable for the code generator itself to support all the various schemes - it probably would be better for the front-end to provide this information somehow. A separate issue is stackmaps that is already touched on in a separate PR.

wingo · 2019-08-01T14:47:33Z

I am not sure that implementing wasm tail calls will require a specific calling convention, as the set of functions that may be tail-called is not known to the compiler. It is up to the tail-caller to restore registers that it saved when it performs the call, along with popping off any arguments pushed on the stack and possibly shuffling the saved return address. It is always possible to do this, AFAIU.

sunfishcode · 2019-08-06T13:49:10Z

@wingo Cranelift will indeed need special calling conventions for tail calls. You're right that if the current tail call proposal advances in its current form, we won't know which wasm functions are tail callable and which are not, but that just means we'll need to mark all wasm functions with the special calling conventions.

wingo · 2019-08-08T09:26:59Z

Hah I am an idiot, indeed a new convention is needed -- callees always have to pop spilled arguments, because callers no longer have the information to do this.

wingo · 2019-08-08T11:57:30Z

See also https://bugzilla.mozilla.org/show_bug.cgi?id=1572400.

birktj · 2020-04-03T19:12:06Z

Has any work been done on this lately?

simvux · 2020-07-11T06:03:27Z

Could this also be implemented in the x86 codegen?

L-as · 2021-09-07T13:59:39Z

Any update on this? Is it easy enough that an outsider could do it?

cfallin · 2021-09-07T15:46:14Z

Hi @L-as -- no, there unfortunately hasn't been progress on this. I'd love for that to be different; if you or another motivated person wants to tackle it, it's possible, but it's likely a month or more of delicate surgery updating our ABI and runtime code to use a new calling convention that supports tail calls.

It's the sort of problem that feels simple-ish at first -- indeed if you only have args that go in registers, and no callee-saves or spillslots, it can be as simple as some moves to set up the tailcall args and then a jump... but imposing those requirements on the CLIF producer to obtain a guaranteed tailcall would lead to some surprisingly brittle behavior where e.g. changing some unrelated code requires a spill and suddenly breaks the tailcall. Basically, in the general case, one needs to clean up one's own stackframe, restore callee-saves, and remove one's stack args from the stack, and then set up args on the stack for the callee, but possibly some of those args need to come from your spillslots or stack args (and if there are enough of them you can't hold all of them in the register file at once) so you need to do the arbitrary-state-permute possibly with some temps somewhere, or maybe in the worst case you set up the new frame before you pop your own then memmove() it up, but then what about return-area pointers that need to be fixed up... ah and take care to mind any stack-limit checks we need to do as well... in short, it's a mess.

The Bugzilla bug for SpiderMonkey's Wasm tail call support here contains a link to 25-page Google Doc that outlines in some good detail new ABI that the SpiderMonkey folks have designed to support this... it's a lot of work, but understanding that (or at least the reasons for its decisions) would be a good starting point.

Anyway, if you or anyone else has the resources to tackle this and want to dig in more, I'd be happy to help work out details!

andrew-johnson-4 · 2022-12-29T01:13:43Z

@cfallin This may be really naïve of me, but what happens if the caller is made responsible for the tail-call elimination. In the simplest case of a recursive tail-call this is just a reentrant function call with new parameters. How much of a pain and/or performance hit would this be. I am just interested in this issue in general, even if it isn't "the best option possible".

cfallin · 2022-12-29T01:31:07Z

@andrew-johnson-4 can you give a bit more detail what you mean by "caller is made responsible"? E.g., do you mean the callee should return some value that means "now call the next function"? (So f calls g tail-calls h becomes: f calls g, g returns to f, f calls h, somehow.) Or something else?

The general problem to wrestle with is easiest to see (IMHO) in a minimized example: f and g tail-call each other, both have enough arguments that some of them are passed on the stack, and they have different numbers of arguments. E.g., on x86-64 where the first six i64 args are passed in registers, say that f has 8 i64 args and g has 10 i64 args.

Then when f tail-calls g, it needs 32 bytes of stack-arg space (the remaining 4 i64 args), but it itself only was given 16 bytes of stack-arg space. Furthermore, the original caller of f, in the standard SysV ABI, is responsible for popping those 16 bytes of space. There's no way to satisfy all the constraints for a true tail-call in this case in the SysV ABI. That's why the usual way to achieve true tail-calls in the arbitrary case is to adopt a "callee-pop" ABI instead: f and g both expect a certain number of args on the stack, and pop them before returning. This way, f's tail-call to g can do that pop, and then push g's args; and when g returns to the original caller of f, there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args.

It's also possible to make it work in a "caller-pop" scheme, but requires dynamic information to be passed around about the size of argument and return-value areas. In the general case, dynamic approaches are less efficient (at the very least they require an extra arg and retval).

I suspect that what you're thinking of with "caller is made responsible" is either something like caller-pop, or else is a sort of "trampoline" scheme where the callee returns either "next func to call" or "final return" states and the callsite runs a worklist/driver loop. I'm curious to hear more what you have in mind though!

andrew-johnson-4 · 2022-12-29T01:50:55Z

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.

"There's no way to satisfy all the constraints for a true tail-call in this case" yes, no unified calling convention. "there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args." that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

From a previously linked doc "Agreeing on common conventions allows SpiderMonkey to freely choose the compilation strategy for different WebAssembly functions, and to change that choice over time." This is a design constraint that I specifically don't care about right now, but probably wasmtime does.

cfallin · 2022-12-29T02:47:13Z

Either<TailCall,RetVal>

Indeed, that's the "trampoline" approach. It's definitely used in the wild; e.g. when targeting Wasm there's this comment about a Haskell compiler, and IIRC, Clojure provides this (via an explicit API exposed to the user?) as a replacement for tail calls on the JVM. It seems like by far the easiest option at the moment, and could be extended to the corecursive case (tail-call to another function) by returning a "function pointer" (index into function table) as part of the TailCall type...

that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

Probably about a month of effort, depending on who's doing the work and their familiarity with the codebase :-) Adding a new ABI is not too technically challenging on its own, but the cross-cutting interactions with everything else -- backtraces, trampolines elsewhere in the system, unwind info, etc. -- make for a likely somewhat gnarly project.

andrew-johnson-4 · 2022-12-29T02:52:32Z

I'll try to familiarize myself more with cranelift before I volunteer. There is lots of low-hanging fruit on my end before tce is important. I would be afraid of completely beggaring your devs if I tried now :O

cfallin · 2022-12-29T03:18:10Z

It's actually somewhat likely that @fitzgen will work on this sometime early next year, or at least, that was my most recent understanding :-) Having others interested in the issue as well is never a bad thing, of course!

bjorn3 · 2022-12-29T07:08:34Z

By the way tail calls to _Unwind_Resume will be necessary for exceptions. This should be less hard than the general case though as it doesn't require any args being passed on the stack.

andrew-johnson-4 · 2023-01-04T22:14:55Z

@cfallin Is it safe to assume that jumping to the entry block is one special case of a true tail-call? Does this violate SSA or other invariants potentially?

jameysharp · 2023-01-04T22:38:18Z

I don't know that we've documented this, but no, Cranelift doesn't allow branching back to the entry block. I know one reason is that there's no block to hoist loop-invariant code into if the loop's entry doesn't have a dominator. I feel like this has come up as a problem in other contexts too but I don't remember them.

That said, you can always emit an entry block that immediately jumps to another block with the same signature, forwarding the arguments, and branch back to that second block instead of tail-calling.

Either way this only works for a function that tail-calls itself. Cranelift has no way to express a jump to a block in a different function, since block IDs are function-scoped. But in that special case, yes, you could implement tail-calls today.

fitzgen · 2023-01-10T16:49:29Z

FWIW, I am working on an RFC for this at the moment.

bjorn3 · 2023-01-26T19:05:41Z

The RFC in question is bytecodealliance/rfcs#29.

fitzgen · 2023-04-21T18:33:46Z

The big blocker for tail calls was overhauling Wasmtime's trampolines, which I've just posted a PR for: #6262

Will start on Cranelift support for tail calls when that lands.

fitzgen · 2023-07-06T22:38:15Z

So @jameysharp and I did a little profiling/investigation of switching the internal Wasm calling convention over to tail on our sightglass benchmarks. I was really expecting this to have no measurable change, but unfortunately it looks like it has a ~7% overhead on bz2 and spidermonkey.wasm and ~1% overhead on pulldown-cmark. This is surprising! We think this means that we ~frequently call functions that don't have enough register pressure to clobber all callee-save registers, and since tail only has caller-save registers and zero callee-save registers, we are doing more spills than we used to. Enough more that it is really measurable.

Supporting caller-save registers with the tail calling convention is possible, but requires more work. Either we do parallel moves (and also avoid building the temporary stack frames for tail calls with stack arguments) or we add a bunch of fiddly offset computation for tail calls. Both possible, but I don't want to do either right this very moment. I'd really like to get a wasmtime::Config knob for Wasm tail calls shipping first.

So here is our updated plan:

Only use tail as the Wasm internal calling convention when the Wasm tail calls proposal is enabled. This avoids any impact on existing Wasm programs that don't use tail calls. However, it requires a bit of unfortunate plumbing, but that's fine.
Finish implementing the Wasm tail calls proposal on all platforms and make sure it is fully spec compliant and all that.
Fix the performance issues after that knob is exposed.
Switch all Wasm internal calling conventions to tail after that.
... profit

Additionally, we would not enable tail calls by default in Wasmtime until the performance issues are addressed.

fitzgen · 2023-07-25T21:10:33Z

FYI, Wasmtime is gaining a Config::wasm_tail_call method and --features tail-call CLI flag in #6774 and (unless you're on s390x) once that lands you can start running .wasms with tail calls on Wasmtime!

As a reminder, tail calls won't be enabled by default until we

resolve the tail calling convention performance regression (Cranelift: Get the tail calling convention as fast as fast on Sightglass #6759)
get s390x support for tail calls (Cranelift: get tail calls working on s390x #6530)
and meet all the other usual standards for enabling a Wasm feature by default: https://docs.wasmtime.dev/contributing-implementing-wasm-proposals.html#enabling-support-for-a-proposal-by-default

alexcrichton · 2024-05-10T02:16:18Z

Tail calls are now enabled by default in Wasmtime except for s390x, so I think this is done.

alexcrichton transferred this issue from bytecodealliance/cranelift Feb 28, 2020

alexcrichton added the cranelift Issues related to the Cranelift code generator label Feb 28, 2020

raviqqe mentioned this issue Sep 5, 2021

Fix tail call in wasm32 pen-lang/pen#325

Merged

cfallin added the enhancement label May 4, 2022

cfallin added this to Cranelift: feature support May 10, 2022

fitzgen self-assigned this Apr 21, 2023

fitzgen changed the title ~~Is there any support for tail call eliminations? Are there any plans about it?~~ Implement support tail calls in Wasmtime and Cranelift Apr 21, 2023

bjorn3 mentioned this issue May 5, 2023

Explicit Tail Calls rust-lang/rfcs#3407

Open

fitzgen changed the title ~~Implement support tail calls in Wasmtime and Cranelift~~ Implement support for tail calls in Wasmtime and Cranelift Jul 6, 2023

fitzgen mentioned this issue Jul 21, 2023

Cranelift: Get the tail calling convention as fast as fast on Sightglass #6759

Closed

yorickpeterse mentioned this issue Dec 21, 2023

Replace LLVM with Cranelift inko-lang/inko#674

Closed

alexcrichton closed this as completed May 10, 2024

github-project-automation bot moved this to Done in Cranelift: feature support May 10, 2024

cfallin mentioned this issue Aug 19, 2024

Add Tail Call Optimization #9142

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement support for tail calls in Wasmtime and Cranelift #1065

Implement support for tail calls in Wasmtime and Cranelift #1065

brunoczim commented Nov 24, 2018

lachlansneff commented Nov 24, 2018

sunfishcode commented Nov 25, 2018

michalmuskala commented Dec 25, 2018 •

edited

Loading

wingo commented Aug 1, 2019

sunfishcode commented Aug 6, 2019

wingo commented Aug 8, 2019

wingo commented Aug 8, 2019

birktj commented Apr 3, 2020

simvux commented Jul 11, 2020

L-as commented Sep 7, 2021

cfallin commented Sep 7, 2021

andrew-johnson-4 commented Dec 29, 2022

cfallin commented Dec 29, 2022

andrew-johnson-4 commented Dec 29, 2022 •

edited

Loading

cfallin commented Dec 29, 2022

andrew-johnson-4 commented Dec 29, 2022

cfallin commented Dec 29, 2022

bjorn3 commented Dec 29, 2022

andrew-johnson-4 commented Jan 4, 2023

jameysharp commented Jan 4, 2023

fitzgen commented Jan 10, 2023

bjorn3 commented Jan 26, 2023

fitzgen commented Apr 21, 2023

fitzgen commented Jul 6, 2023

fitzgen commented Jul 25, 2023

alexcrichton commented May 10, 2024

Implement support for tail calls in Wasmtime and Cranelift #1065

Implement support for tail calls in Wasmtime and Cranelift #1065

Comments

brunoczim commented Nov 24, 2018

lachlansneff commented Nov 24, 2018

sunfishcode commented Nov 25, 2018

michalmuskala commented Dec 25, 2018 • edited Loading

wingo commented Aug 1, 2019

sunfishcode commented Aug 6, 2019

wingo commented Aug 8, 2019

wingo commented Aug 8, 2019

birktj commented Apr 3, 2020

simvux commented Jul 11, 2020

L-as commented Sep 7, 2021

cfallin commented Sep 7, 2021

andrew-johnson-4 commented Dec 29, 2022

cfallin commented Dec 29, 2022

andrew-johnson-4 commented Dec 29, 2022 • edited Loading

cfallin commented Dec 29, 2022

andrew-johnson-4 commented Dec 29, 2022

cfallin commented Dec 29, 2022

bjorn3 commented Dec 29, 2022

andrew-johnson-4 commented Jan 4, 2023

jameysharp commented Jan 4, 2023

fitzgen commented Jan 10, 2023

bjorn3 commented Jan 26, 2023

fitzgen commented Apr 21, 2023

fitzgen commented Jul 6, 2023

fitzgen commented Jul 25, 2023

alexcrichton commented May 10, 2024

michalmuskala commented Dec 25, 2018 •

edited

Loading

andrew-johnson-4 commented Dec 29, 2022 •

edited

Loading