Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreter issues with let #35

Closed
titzer opened this issue Aug 2, 2020 · 18 comments
Closed

Interpreter issues with let #35

titzer opened this issue Aug 2, 2020 · 18 comments

Comments

@titzer
Copy link
Contributor

titzer commented Aug 2, 2020

The let instruction is difficult to implement in an interpreter without significant overhead. In particular, a let instruction introduces a new block scope and binds new local variable indices that are accessible (and writeable) in that scope. This requires the use of a second stack pointer if using the common implementation technique where local variables are preallocated before the operand stack on a single array that increases with the call stack. Secondly, leaving the scope of the let block must unbind or pop these variable indices. As the scope of a let is terminated with a normal end, there is no way for an interpreter to know where the scope ends unless by dynamically tracking control scopes. This is not the case with any of the existing control constructs in Wasm, and is specific to let. Dynamically tracking control scopes is not necessary for any reason currently, and would penalize all functions, even ones without let.

A simple solution to this problem is to require that functions pre-declare all of the space that they will need for let bindings in their bodies. This allows an interpreter to preallocate space for let bindings, doesn't require a second stack pointer, and also benefits JITs, since they must also use a dynamic data structure (e.g. to track SSA values) that would have to grow and shrink with lets. Since the preamble of a function is just a series of local variable declarations with value types, we will probably need to reserve a value from the binary encoding of the value type space to indicate a pre-declaration of let indexing space.

I think the other alternatives for avoiding this problem (e.g. not having let, but requiring definite initialization of locals with value types that have no default value such as non-null references) are worse.

@taralx
Copy link

taralx commented Aug 3, 2020

I'm sorry if this is a stupid question, but how does an interpreter handle branch instructions if it doesn't track control scopes? Doesn't it need to pop the stack?

@titzer
Copy link
Contributor Author

titzer commented Aug 3, 2020

There is no need for an explicit control stack. An interpreter has a side data structure to make branches efficient. The side data structure is a function of pc -> (target_pc, val_count, pop_count). It is only needed for if, else, and br*, i.e. there is no work for block, end, try, etc--they are no-ops in the interpreter. The data structure is very sparse, so it typically is only about 10% of the original code size. Naively, lookup is O(logn), but the entries can be arranged in control flow order and a second PC used so that there is no lookup necessary (i.e. O(1) cost for any branch).

The let instruction fundamentally breaks this strategy.

@taralx
Copy link

taralx commented Aug 3, 2020

Does it make sense to use a different block terminator for let?

@titzer
Copy link
Contributor Author

titzer commented Aug 3, 2020

I considered that as an option, but it still means that the interpreter would need an additional stack pointer (typically already has 2 stack pointers) because now the stack grows in an additional direction. For context, a typical interpreter will do the following:

storage = | locals A | operand stack A ---> .... |

with a "stack pointer" pointing to the top of the operand stack, moving up and down with each instruction, and a non-moving "frame pointer" pointing to the bottom of the locals. The storage may be part of the interpreter's actual call stack (i.e. allocated on the "C" or "machine stack") or another array (just another data structure). Either way, a call to B would push a new frame like so:

storage = | locals A | operand stack A | locals B | operand stack B --> ... |

That doesn't work if let can effectively grow the locals storage of the topmost frame. You can try to split the storage into two independent stacks, or try to reverse the direction of the local growth, like

storage = | .....<-- | B locals | locals A | operand stack A | operand stack B --> ... |

Or you can put the locals and stack at different ends of that array,

storage = | locals A | locals B | --> ... ... <-- operand stack B | operand stack A |

but you can't do either of those tricks with the intepreter's actual execution stack, which forces you to use a datastructure, which you now have to manage with 3 pointers.

Either way, it costs you, and it's complicated.

There is also the aesthetic aspect of having a special terminator for let just for an interpreter, which kind of bothers me.

Also, all of the above assumes that we don't do de Bruijn indexing for let-bound locals, because none of that would work anyway. Separate issues, but worth mentioning.

@taralx
Copy link

taralx commented Aug 3, 2020

Sounds reasonable. To be clear, I think asking for an upper bound on let-depth is perfectly reasonable.

@binji
Copy link
Member

binji commented Aug 3, 2020

If the interpreter is already making a pre-pass to generate the branch side table, can't it do the same to generate a local lookup side-table? That is, local.{get,set,tee} all have a mapping from pc -> frame offset.

@titzer
Copy link
Contributor Author

titzer commented Aug 4, 2020

@binji that'd be too much space.

I suppose the prepass could compute the maximum let size (or simply include it in the total number of locals), but the indexing issue still remains.

@rossberg
Copy link
Member

rossberg commented Aug 5, 2020

I agree that the prepass should be able to compute the maximum easily.

As for the relative indexing, that's designed to correspond to a stack-like handling of the local index space. So I think all you'd need to do is treat the locals array as a downward-growing stack (whose maximum size you've precomputed), initialise it with the function-locals in reverse order, and then index all locals from top of stack.

@titzer
Copy link
Contributor Author

titzer commented Aug 5, 2020

I understand the motivation for this numbering scheme for let. It is actually elegant and keeps small indices for recently-bound variables. Whatever scheme we choose doesn't make much of a difference for a JIT that can statically rearrange the entire frame. (A JIT's SSA environment or abstract stack data structure just needs to be growable for locals, whereas nowadays the SSA environment for locals is of fixed size. That's not a big issue.)

But for an interpreter or a simple JIT that would follow an interpreter's stack frame layout, this numbering scheme doesn't quite work no matter which way you point stacks, even if you know the maximum. I thought through multiple variants of what you suggest, but it only works if you copy arguments upon a function call. The interpreter can't overlap the stack of caller and callee like it can now, because "local 0" (i.e. the first argument) is deeper in the stack than "local 1". A new let bound variable would then inherit index 0 (knocking local 0 up to local 1) and thus a stack pointer adjustment would be necessary. But adjusting the stack pointer to be below the old local 0 upon (dynamic) execution of a let would incorrectly point further into the caller's frame, where there are values already. Thus the need to copy arguments (to the end of a preallocated locals region) to fix that.

But really, the biggest problem is with dynamic adjustment of the stack pointer with let. I described that earlier in the thread. An interpreter doesn't know which end matches with which let unless it tracks control scopes dynamically. That's a big cost and has to be paid even if there aren't any lets. As there is no other reason currently to track control scopes, it would be highly undesirable, IMO.

It's worth mentioning for posterity it would be possible to avoid that argument copy if there was a different instruction to access let-bound variables, like let.get let.set and let.tee. That way an interpreter would know to use a different indexing scheme that doesn't mix up local 0 (i.e. first argument) with let 0 (i.e. most-recent let-bound variable). Unfortunately it still requires dynamic adjustment of the stack pointer and thus has the same matching-end problem.

If let-bound variables extend the existing numbering scheme, then I think the prepass to compute the maximum is enough for an interpreter and there will be no overhead at all for using let or not using let. Another advantage of extending the existing numbering is that analyzing a function's code for uses of a particular local variable index (e.g. in a simple Wasm tool, or just eyeballing code) doesn't have to track let scopes, because the relative numbering of a given local variable doesn't change for the entire function.

Thus I would propose that let bound variables just get new indices at the end of the locals.

Sorry for the noise on this issue. I hate making an interpreter a design criteria, as it was never one in the past, but this one kind of blew up in my face as I was implementing it :-)

@rossberg
Copy link
Member

rossberg commented Aug 5, 2020

@titzer, to check that I understood this correctly, are you saying that you want to keep function arguments in place on the operand stack and push locals there as well, so that you can directly index args off their original location? And you'd want to reserve additional space on the operand stack to copy let-locals after those?

If so, then I'm afraid you'll eventually run into more problems than just let. For example, leaving arguments in place won't work with func.bind, because it would require inserting the bound arguments before the ones on stack when invoking call_ref. I suspect it also requires an extra round of copying for tail calls.

I was assuming an interpreter would pop a function's arguments and move them to an array of locals (which might still be allocated on the operand stack). That's almost as simple but with none of the problems. AFAICS, the only disadvantage is the need to move the arguments, but is that really a notable cost for an interpreter?

@titzer
Copy link
Contributor Author

titzer commented Aug 5, 2020

Yes to the first paragraph. That already works smoothly for not let-bound locals.

func.bind creates a closure (on the heap), so it's already in a different ballpark. It itself only ever pops one argument off the stack, not an entire argument list. call.ref has to deal not only with closures over the first argument, but actually a whole chain of closures over closures over closures. In general an argument can be inserted anywhere. It's impossible to avoid copying and rearranging arguments at least once, so there's no avoiding that one. It's alright if calling a bound function is a little slower.

It's not necessary to use two arrays of values in an interpreter; one will do, because frames can overlap in this way. That's how Java's operand stack works and our current numbering scheme allows that same trick.

Copying isn't necessary now. But copying is not the main problem, it's the dynamic adjustment of the stack pointer that would require tracking control scopes to find matching ends as I described in the first comment. Maybe I should write shorter comments.

@rossberg
Copy link
Member

rossberg commented Aug 5, 2020

Re the latter problem: how do you deal with adjusting the operand stack height when branching? For that, don't you need to record a stack height with each control construct in the prepass data? Couldn't you reuse this mechanism for the locals' height?

I realise this is slightly more work for an interpreter then indexing the other way round. OTOH, the nice property of relative indexing is that it is consistent with the treatment of labels. And that it composes: inner indices don't depend on outer context. This is particularly nice for code transformations and in a single-pass producer. Would be unfortunate to lose that.

For closures: if you have a locals array then dealing with nested closures is easy I think: you go depth first and copy each one's bindings to the array and finally pop the remaining operands off the stack. No stack twiddling required.

@binji
Copy link
Member

binji commented Aug 5, 2020

@binji that'd be too much space.

To what extent can this interpreter modify the original bytecode? If it can, then I can imagine other optimizations here. For example, you could rewrite the local index directly if there is space (which there typically should be). You can have a sentinel value to a side table when that's not possible.

@titzer
Copy link
Contributor Author

titzer commented Aug 5, 2020

@binji The interpreter is designed to never modify the original bytecode (think: the original module bytes are mapped read-only in memory). It would only copy and mutate a copy of the bytecode for debugging.

@rossberg Earlier in the thread I explained the side data structure and how it works for control scopes. It's space prohibitive to have side data structure entries for local variable access instructions. I get your point about local code transforms but it doesn't really work because any time you move any code that references a local index into or out of a let scope, you have to renumber anyway. It really only works if you move an entire scope at once, which is just a special case.

@binji
Copy link
Member

binji commented Aug 5, 2020

OK, if you can't modify the bytecode, then you can still collect additional information in the prepass:

  • which end instructions mark the end of a let
  • which local.{get,set,tee} are accessing locals that may be stack values
  • which branches leave a let block scope

Then your side table doesn't need to be stored for all local accesses (which I agree would be prohibitive -- I measured 2.5 million in a 19MB wasm file).

@titzer
Copy link
Contributor Author

titzer commented Aug 6, 2020

I have to mull your suggestions over a bit, but for context:

  • end doesn't do need to do any work now, so that's a regression
  • it's difficult to have optional sidetable entries if you want O(1) lookup, because sidetable entries are sorted in control flow order and managed with a second pc that is consulted only when needed and moved in tandem at branches. The seocnd pc essentially always points at the next sidetable entry that will be needed, so no lookup is ever necessary. That's not only a time savings, but a space savings as well, because you don't need to store the PC in entries. I suppose it would be possible to add the pc to the entry so that you could check if the current actual pc matches the next entry pc, and thus detect if there was an entry or not, but that's a regression, and a bad one if you need to do dynamic work for locals.
  • I don't see how marking which local operations may refer to stack values helps. lets effectively renumber all of the locals, so every local access in a let scope would be marked.
  • I suppose the answer could just be "penalize functions with lets--we'll just copy/overwrite their entire bytecode with something completely" because essentially everything about implementing this directly is worse. That really sucks because changing the offsets of any bytecodes ruins the debuggability gain of using the original bytecode.

At this point I'm not hopeful about finding a better solution and even more dubious that there is any material benefit to the current numbering scheme.

@rossberg
Copy link
Member

@titzer, what I was implying above is that, if you organise locals as a stack within the function's frame (an array with pre-computed max size), then all you need to add to your side table is another field local_pop_count, that is used at branches.

The only additional issue is that you'll also need this local_pop_count for end, at least those matching a let. To avoid a table entry for every end you could either make table entries optional for end, or if that is a problem, we could even use a separate opcode endlet.

There should be no need to store anything about individual local.get/set/tee instructions, nor to distinguish function- from let-locals.

any time you move any code that references a local index into or out of a let scope, you have to renumber anyway.

You only need to renumber indices that point to locals outside the code you move -- same as for labels. That's the usual shift operation for de Bruijn indices. It's notably simpler than a general index substitution.

@rossberg
Copy link
Member

Let was removed, so this is obsolete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants