Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement support for tail calls in Wasmtime and Cranelift #1065

Closed
brunoczim opened this issue Nov 24, 2018 · 26 comments
Closed

Implement support for tail calls in Wasmtime and Cranelift #1065

brunoczim opened this issue Nov 24, 2018 · 26 comments
Assignees
Labels
cranelift Issues related to the Cranelift code generator enhancement

Comments

@brunoczim
Copy link

To write a functional language compiler using this IR, tail call eliminations would be desirable. Are there any plans to support this? I couldn't find any details in the docs.

@lachlansneff
Copy link
Contributor

Webassembly will be getting explicit tail calls soon, so cranelift will have to support as well. At the moment, tce is not supported. I assume we'll add an instruction.

@sunfishcode
Copy link
Member

Agreed. We'll need direct and indirect forms. And, we'll need a new calling convention, since native calling conventions don't support guaranteed tail calling in general.

For the calling convention, I imagine we can pick between implementing GHC's and/or HiPE's, and/or inventing something Cranelift-specific.

@michalmuskala
Copy link

michalmuskala commented Dec 25, 2018

I would be interested in this for a project I'm working on.

As far as I understand, the primary issue with calling conventions and tail calls are callee-saved registers. The issue is that there's generally no way of efficiently knowing if they were already spilt to stack or not by the time we're in a function prologue and similarly if we should restore them (and even from where) in the epilogue.

That's one of the reasons, why AFAIK all the calling conventions in LLVM that support tail calls don't have any callee-saved registers. That way the tail call can easily deallocate the current frame and just jump into the new function passing all the arguments in the registers or on the stack if needed, but does not keep anything on the stack that would need to be manipulated when finally returning.

In LLVM (at least on x64) the HiPE convention means that all integer types are "upgraded" to i64, all stack slots are 8-byte wide and 8-byte aligned. It additionally pins 6 registers - for the process struct (similar to the VM data Cranelift already supports), for the heap pointer and for 4 arguments.
The GHC convention also promotes all integer types to i64 and pins 10 registers - 4 for VM state and 6 for arguments. It additionally pins some floating point registers for arguments.

Looking through the code and some opened issues and PRs, I can see there's quite some work on special-casing some arguments to be pinned in specific registers, etc. I wonder if that could be somehow generalized for the front-end to provide those settings, if needed.

As far as what I would consider ideal support for my project - I'd be happy with instructions to execute the guaranteed tail calls as well as a way to pin certain registers for needed data structures. I believe it's not very scalable for the code generator itself to support all the various schemes - it probably would be better for the front-end to provide this information somehow. A separate issue is stackmaps that is already touched on in a separate PR.

@wingo
Copy link

wingo commented Aug 1, 2019

I am not sure that implementing wasm tail calls will require a specific calling convention, as the set of functions that may be tail-called is not known to the compiler. It is up to the tail-caller to restore registers that it saved when it performs the call, along with popping off any arguments pushed on the stack and possibly shuffling the saved return address. It is always possible to do this, AFAIU.

@sunfishcode
Copy link
Member

@wingo Cranelift will indeed need special calling conventions for tail calls. You're right that if the current tail call proposal advances in its current form, we won't know which wasm functions are tail callable and which are not, but that just means we'll need to mark all wasm functions with the special calling conventions.

@wingo
Copy link

wingo commented Aug 8, 2019

Hah I am an idiot, indeed a new convention is needed -- callees always have to pop spilled arguments, because callers no longer have the information to do this.

@wingo
Copy link

wingo commented Aug 8, 2019

@alexcrichton alexcrichton transferred this issue from bytecodealliance/cranelift Feb 28, 2020
@alexcrichton alexcrichton added the cranelift Issues related to the Cranelift code generator label Feb 28, 2020
@birktj
Copy link

birktj commented Apr 3, 2020

Has any work been done on this lately?

@simvux
Copy link
Contributor

simvux commented Jul 11, 2020

Could this also be implemented in the x86 codegen?

@L-as
Copy link
Contributor

L-as commented Sep 7, 2021

Any update on this? Is it easy enough that an outsider could do it?

@cfallin
Copy link
Member

cfallin commented Sep 7, 2021

Hi @L-as -- no, there unfortunately hasn't been progress on this. I'd love for that to be different; if you or another motivated person wants to tackle it, it's possible, but it's likely a month or more of delicate surgery updating our ABI and runtime code to use a new calling convention that supports tail calls.

It's the sort of problem that feels simple-ish at first -- indeed if you only have args that go in registers, and no callee-saves or spillslots, it can be as simple as some moves to set up the tailcall args and then a jump... but imposing those requirements on the CLIF producer to obtain a guaranteed tailcall would lead to some surprisingly brittle behavior where e.g. changing some unrelated code requires a spill and suddenly breaks the tailcall. Basically, in the general case, one needs to clean up one's own stackframe, restore callee-saves, and remove one's stack args from the stack, and then set up args on the stack for the callee, but possibly some of those args need to come from your spillslots or stack args (and if there are enough of them you can't hold all of them in the register file at once) so you need to do the arbitrary-state-permute possibly with some temps somewhere, or maybe in the worst case you set up the new frame before you pop your own then memmove() it up, but then what about return-area pointers that need to be fixed up... ah and take care to mind any stack-limit checks we need to do as well... in short, it's a mess.

The Bugzilla bug for SpiderMonkey's Wasm tail call support here contains a link to 25-page Google Doc that outlines in some good detail new ABI that the SpiderMonkey folks have designed to support this... it's a lot of work, but understanding that (or at least the reasons for its decisions) would be a good starting point.

Anyway, if you or anyone else has the resources to tackle this and want to dig in more, I'd be happy to help work out details!

@andrew-johnson-4
Copy link

@cfallin This may be really naïve of me, but what happens if the caller is made responsible for the tail-call elimination. In the simplest case of a recursive tail-call this is just a reentrant function call with new parameters. How much of a pain and/or performance hit would this be. I am just interested in this issue in general, even if it isn't "the best option possible".

@cfallin
Copy link
Member

cfallin commented Dec 29, 2022

@andrew-johnson-4 can you give a bit more detail what you mean by "caller is made responsible"? E.g., do you mean the callee should return some value that means "now call the next function"? (So f calls g tail-calls h becomes: f calls g, g returns to f, f calls h, somehow.) Or something else?

The general problem to wrestle with is easiest to see (IMHO) in a minimized example: f and g tail-call each other, both have enough arguments that some of them are passed on the stack, and they have different numbers of arguments. E.g., on x86-64 where the first six i64 args are passed in registers, say that f has 8 i64 args and g has 10 i64 args.

Then when f tail-calls g, it needs 32 bytes of stack-arg space (the remaining 4 i64 args), but it itself only was given 16 bytes of stack-arg space. Furthermore, the original caller of f, in the standard SysV ABI, is responsible for popping those 16 bytes of space. There's no way to satisfy all the constraints for a true tail-call in this case in the SysV ABI. That's why the usual way to achieve true tail-calls in the arbitrary case is to adopt a "callee-pop" ABI instead: f and g both expect a certain number of args on the stack, and pop them before returning. This way, f's tail-call to g can do that pop, and then push g's args; and when g returns to the original caller of f, there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args.

It's also possible to make it work in a "caller-pop" scheme, but requires dynamic information to be passed around about the size of argument and return-value areas. In the general case, dynamic approaches are less efficient (at the very least they require an extra arg and retval).

I suspect that what you're thinking of with "caller is made responsible" is either something like caller-pop, or else is a sort of "trampoline" scheme where the callee returns either "next func to call" or "final return" states and the callsite runs a worklist/driver loop. I'm curious to hear more what you have in mind though!

@andrew-johnson-4
Copy link

andrew-johnson-4 commented Dec 29, 2022

Yes, the f g recursion is the worst variant. I use that as a benchmark for testing that tail-calls are optimized.

To synthesize tail-call optimization right now (I want to use cranelift for a pet project) I am looking at the option of writing a function with an Either return type of Either<TailCall,RetVal>. Every marked tailcall function is executed in a loop that loads up the call, then either reenters some function or breaks the loop with a return value.

I am currently transpiling my interpreted toy language into JIT segments. I have the benefit of whole program optimization, which makes it a bit easier. This Either is the worst case scenario that would still potentially go into compiled code. If the f and g et. al. are finite and known then this can be optimized. Again, if it is only f recursion that makes this so much simpler.

I see this as primarily a space optimization problem, not so much a speed problem. For simple functional programs with loops this is a huge deal.

"There's no way to satisfy all the constraints for a true tail-call in this case" yes, no unified calling convention. "there's no need to somehow know that the actual returnee had 32 bytes rather than 16 of stack args." that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

From a previously linked doc "Agreeing on common conventions allows SpiderMonkey to freely choose the compilation strategy for different WebAssembly functions, and to change that choice over time." This is a design constraint that I specifically don't care about right now, but probably wasmtime does.

@cfallin
Copy link
Member

cfallin commented Dec 29, 2022

Either<TailCall,RetVal>

Indeed, that's the "trampoline" approach. It's definitely used in the wild; e.g. when targeting Wasm there's this comment about a Haskell compiler, and IIRC, Clojure provides this (via an explicit API exposed to the user?) as a replacement for tail calls on the JVM. It seems like by far the easiest option at the moment, and could be extended to the corecursive case (tail-call to another function) by returning a "function pointer" (index into function table) as part of the TailCall type...

that is certainly advantageous to have a unified interface. What would that entail development-wise to support?

Probably about a month of effort, depending on who's doing the work and their familiarity with the codebase :-) Adding a new ABI is not too technically challenging on its own, but the cross-cutting interactions with everything else -- backtraces, trampolines elsewhere in the system, unwind info, etc. -- make for a likely somewhat gnarly project.

@andrew-johnson-4
Copy link

I'll try to familiarize myself more with cranelift before I volunteer. There is lots of low-hanging fruit on my end before tce is important. I would be afraid of completely beggaring your devs if I tried now :O

@cfallin
Copy link
Member

cfallin commented Dec 29, 2022

It's actually somewhat likely that @fitzgen will work on this sometime early next year, or at least, that was my most recent understanding :-) Having others interested in the issue as well is never a bad thing, of course!

@bjorn3
Copy link
Contributor

bjorn3 commented Dec 29, 2022

By the way tail calls to _Unwind_Resume will be necessary for exceptions. This should be less hard than the general case though as it doesn't require any args being passed on the stack.

@andrew-johnson-4
Copy link

@cfallin Is it safe to assume that jumping to the entry block is one special case of a true tail-call? Does this violate SSA or other invariants potentially?

@jameysharp
Copy link
Contributor

I don't know that we've documented this, but no, Cranelift doesn't allow branching back to the entry block. I know one reason is that there's no block to hoist loop-invariant code into if the loop's entry doesn't have a dominator. I feel like this has come up as a problem in other contexts too but I don't remember them.

That said, you can always emit an entry block that immediately jumps to another block with the same signature, forwarding the arguments, and branch back to that second block instead of tail-calling.

Either way this only works for a function that tail-calls itself. Cranelift has no way to express a jump to a block in a different function, since block IDs are function-scoped. But in that special case, yes, you could implement tail-calls today.

@fitzgen
Copy link
Member

fitzgen commented Jan 10, 2023

FWIW, I am working on an RFC for this at the moment.

@bjorn3
Copy link
Contributor

bjorn3 commented Jan 26, 2023

The RFC in question is bytecodealliance/rfcs#29.

@fitzgen
Copy link
Member

fitzgen commented Apr 21, 2023

The big blocker for tail calls was overhauling Wasmtime's trampolines, which I've just posted a PR for: #6262

Will start on Cranelift support for tail calls when that lands.

@fitzgen fitzgen self-assigned this Apr 21, 2023
@fitzgen fitzgen changed the title Is there any support for tail call eliminations? Are there any plans about it? Implement support tail calls in Wasmtime and Cranelift Apr 21, 2023
@fitzgen
Copy link
Member

fitzgen commented Jul 6, 2023

So @jameysharp and I did a little profiling/investigation of switching the internal Wasm calling convention over to tail on our sightglass benchmarks. I was really expecting this to have no measurable change, but unfortunately it looks like it has a ~7% overhead on bz2 and spidermonkey.wasm and ~1% overhead on pulldown-cmark. This is surprising! We think this means that we ~frequently call functions that don't have enough register pressure to clobber all callee-save registers, and since tail only has caller-save registers and zero callee-save registers, we are doing more spills than we used to. Enough more that it is really measurable.

Supporting caller-save registers with the tail calling convention is possible, but requires more work. Either we do parallel moves (and also avoid building the temporary stack frames for tail calls with stack arguments) or we add a bunch of fiddly offset computation for tail calls. Both possible, but I don't want to do either right this very moment. I'd really like to get a wasmtime::Config knob for Wasm tail calls shipping first.

So here is our updated plan:

  • Only use tail as the Wasm internal calling convention when the Wasm tail calls proposal is enabled. This avoids any impact on existing Wasm programs that don't use tail calls. However, it requires a bit of unfortunate plumbing, but that's fine.
  • Finish implementing the Wasm tail calls proposal on all platforms and make sure it is fully spec compliant and all that.
  • Fix the performance issues after that knob is exposed.
  • Switch all Wasm internal calling conventions to tail after that.
  • ... profit

Additionally, we would not enable tail calls by default in Wasmtime until the performance issues are addressed.

@fitzgen fitzgen changed the title Implement support tail calls in Wasmtime and Cranelift Implement support for tail calls in Wasmtime and Cranelift Jul 6, 2023
@fitzgen
Copy link
Member

fitzgen commented Jul 25, 2023

FYI, Wasmtime is gaining a Config::wasm_tail_call method and --features tail-call CLI flag in #6774 and (unless you're on s390x) once that lands you can start running .wasms with tail calls on Wasmtime!

As a reminder, tail calls won't be enabled by default until we

@alexcrichton
Copy link
Member

Tail calls are now enabled by default in Wasmtime except for s390x, so I think this is done.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator enhancement
Projects
Development

No branches or pull requests