Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Future of Binaryen in a stack machine world? #663

Open
kripken opened this issue Aug 4, 2016 · 78 comments
Open

Future of Binaryen in a stack machine world? #663

kripken opened this issue Aug 4, 2016 · 78 comments

Comments

@kripken
Copy link
Member

kripken commented Aug 4, 2016

The WebAssembly project recently decided to switch from being an AST to being a stack machine. This issue is to discuss the implications of that for the Binaryen project, which is AST-based.

History: Binaryen's initial design and goals were simple: WebAssembly was an AST. By building a compiler that uses that AST as its internal IR, we could stay very close to our output format, which allows

  • Fast compilation to the target, since almost no lowering is needed.
  • WebAssembly-to-WebAssembly optimization, as we can load it, optimize it, and write it.
  • Using WebAssembly as our IR means optimizations very specific to WebAssembly are easy to express (e.g., block return optimizations), which generic compilers might find harder to do.
  • ASTs are convenient as an input IR, allowing compilers to WebAssembly to use Binaryen as their backend. The asm2wasm and mir2wasm projects are examples of this.
  • ASTs are a reasonable optimization IR, allowing flexible optimizations to be written in nice ways.

But the foundations for all that are now in question with WebAssembly's pivot to a stack machine.

The first practical consequence is that things in the WebAssembly MVP will not be expressible as an AST, e.g.

call gimme-i32
call gimme-i32
call no-value-returned
i32.add

This stack machine code does two calls to get i32s, then a void call. The void call is not placed on the stack, and the add adds the i32s. In an AST that order of operations is not directly expressible, although a "first" or "reverse-comma" operator can work, <x, y> where x is done, then y, and then x is returned (which is the opposite of the comma operator in C, JS, etc.). So we can write

(i32.add
  (call gimme-i32)
  (first
    (call gimme-i32)
    (call no-value-returned)
  )
)

This technically works, but is awkward. In particular, the "first" operator vanishes when actually emitting WebAssembly. This puts us in the uncomfortable position of optimizations needing to take into account that more AST nodes might mean less code size. And the obvious simple IR for such optimizations is just the linear list of stack machine instructions.

That's just the beginning: A major justification for the stack machine approach is multi-values, which are not in the MVP but will be added later. As with "first", technically one could invent some AST construct, but it's awkward. Further possible future features already showing up in discussions include pick (copy a value from up the stack) and other stack machine tricks. It's clear that WebAssembly is moving in a direction that Binaryen's initial design is just not a good fit for.

The bottom line is that Binaryen was designed based on what WebAssembly was at the time. It seemed like an elegant approach to use the WebAssembly AST as a compiler IR. That tied Binaryen's internals to the WebAssembly design, which was a risk, but it paid off - until now, as WebAssembly has pivoted. So we need to decide what to do.

Options include:

  • Deprecate Binaryen: Consider a fresh start in another toolchain project, or shift efforts to an existing one.
  • Keep on trucking, for now: Add "first", and add other things later, despite Binaryen getting less effective at its purpose over time. The maintenance burden will rise as the IRs get more incompatible. This delays what I suspect is the inevitable.
  • Redesign Binaryen (1 IR): Switch to a stack machine IR. That would entail an almost complete rewrite, so it's not much different than the deprecate-binaryen option.
  • Redesign Binaryen (2 IRs): Add a stack machine IR alongside the AST. This seems very complex and risky, especially if the IRs are meant to be bidirectionally translatable, which is necessary if we can still both consume and emit WebAssembly.
  • Stop consuming WebAssembly: Keep the current AST, but it would be the "Binaryen IR" and no longer WebAssembly. This is sort of like forking the IR: WebAssembly is going in a stack machine direction, but the Binaryen IR is not following it there. Binaryen IR would be isomorphic to a subset of WebAssembly, but no more. In particular, this means we would no longer consume WebAssembly, we would only be able to emit it.

More details on the stop-consuming-WebAssembly option:

  • Binaryen would be a compiler to WebAssembly and nothing more: not a WebAssembly interpreter, not a WebAssembly linker, etc. etc.
  • Binaryen's IR would not be able to express all of WebAssembly directly. We would have an output layer that can translate that into WebAssembly. Since we would be a subset of WebAssembly, that should be a simple process, with little changes to our current code. In the short term we just wouldn't emit code not isomorphic to an AST (i.e. we would emit a subset of WebAssembly), but perhaps in time we could have our WebAssembly output layer emit stack machine-specific patterns as a final-layer optimization. (So we might eventually have 2 IRs, but not bidirectionally translatable ones.)
  • Binaryen's AST would stay representable in text format in the current s-expression format. As that format will likely end up being changed for the stack machine, this means forking the s-expression format, into something I guess we can call "Binaryen's IR text format". We would fork the spec test suite and add it to the Binaryen test suite. wasm-shell would run those tests, and should be renamed to binaryen-shell (or byn-shell?).
  • Binaryen's wasm-opt tool would be renamed binaryen-opt (or byn-opt?) as it would operate on Binaryen IR, not WebAssembly. Binaryen would stop being a wasm-to-wasm optimizer, which was a goal we thought could be useful for other projects - we would be dropping that goal.
  • Binaryen's wasm-as, wasm-dis tools would likewise be renamed as they would operate on a Binaryen binary format. In other words, as with the s-expression format, this would be a fork, of the binary format. Note also that this means that Binaryen would no longer be a tool people can use to disassemble and assemble wasm files for toolchain purposes.
  • Binaryen would still have an interpreter - in fact some optimization passes, like Precompute, need one - but it would be a Binaryen IR interpreter and not a WebAssembly interpreter. (Perhaps useful as an emterpreter replacement.)
  • No changes to asm2wasm and mir2wasm, as Binaryen's IR and C API are not changing. There should be no downside whatsoever for those compilers, and in particular not for the entire emscripten-asm2wasm path for compiling C++ to WebAssembly, which already works now.
  • s2wasm is tricky since its input is basically WebAssembly. We would need to modify or replace it, in tandem with the wasm backend. Several options here, with varying amounts of work.
  • No more wasm.js or binaryen.js, which could execute WebAssembly in JS for polyfilling purposes. You might say that wasm.js fulfilled its purpose of helping the toolchain side before JS engines got proper wasm support, and at this stage, we don't need it as much. So that is probably not a big deal. For binaryen.js, it could no longer work as a (slow) client-side polyfill for WebAssembly, but other interpreters could be compiled instead.
  • The bottom line is that Binaryen would be refocused from a general-purpose WebAssembly compiler and toolchain library - trying to do everything in that space - to a far more narrow niche of a specific compiler library for WebAssembly (that uses an AST designed to be isomorphic to a subset of WebAssembly).

As you can see from the amount of text, the stop-consuming-WebAssembly option is the one I've thought most about. Not because I like it necessarily, but because all the other options have downsides that worry me a lot more. But I have no definite opinion on any of this yet, hoping to hear what other people think.

Thoughts?

@ghost
Copy link

ghost commented Aug 5, 2016

Can you give an example of stack-machine code that can not be expressed (with only encoding loss) in a structured AST? There might not be any such examples, given that dropping stack elements is currently constrained to block boundaries.

An explicit block1 was considered some time ago see WebAssembly/design#650 As you note if it is explicit then it does not solve the problem as using it changes the code. While the text format might use it just for presentation, what if people use it in excess and this still need to be encoded. Also this is still limited to structured uses in the scope of the block.

But this can not express multiple uses of the function results, and does not work for functions returning multiple values in general. A general solution is needed, and I appeal to people to solve the general case first then optimize it for high frequency cases.

The example could also be presented as the following:

(letc (($c0 (gimme-i32))
       ($c1 (gimme-i32)))
  (no-value-returned)
  (i32.add $c0 $c1))

And when people code references to these local constants that do not fit the stack usage pattern they would simply be encoded with get_value aka pick/pick_and_void. We could also move the function arguments onto the stack. Why not consider this proposal, it was communicated some time ago yet seems to have been ignored and there has been no rationale given?

The other option is the 'Stop consuming WebAssembly' option but I would look at it from a different perspective and view it as conceding that the wasm binary code is just a compressed encoding of the language that developers will actually work with and that the language that binaryen consumed would be the wasm text format and debugging would step through that code not the stack code which would be a throwaway intermediate encoding, just as debuggers do not step through compressed encodings. This would focus on the SSA decoded format, which is really what this is all about, and it might be better for analysis and transforms, but distant from the stack machine code. This would be a last-resort strategy, working around the product of the wasm group, but why should the group be in such a position, it is our group.

@dschuff
Copy link
Member

dschuff commented Aug 5, 2016

In the past we've kicked around the idea of defining a flavor of wasm (an IR, if you like) which is not exactly wasm but more natural for many compilers to produce (aka a producer-oriented/subwasm/flat wasm). What you are describing as Binaryen's new non-isomorphic IR sounds a lot like that in some respects. As you have observed, Binaryen would then be a one-way tool (at least the compiler part would be; obviously from the user perspective Binaryen has always been a loosely-related collection of tools; e.g. we could still have wasm-as and wasm-dis if we wanted to, they would just not use the same IR).

I think you are basically correct about the consequenses of moving Binaryen to this style of IR. Presumably it would evolve a bit (since it doesn't have to precisely track the current flavor of wasm we could rethink pain points that exist or as we find them), and libbinaryen (C API, optimizers, etc) would track this evolution.
For s2wasm it would depend on what we wanted to do with LLVM; we could make LLVM target the "flatwasm" IR if we wanted to. Moving the compiler part of LLVM fully to a non-wasm IR might have interesting consequences for linking formats and the other parts of LLVM, but that's a decision we can make independently of what we want Binaryen to be.

Practically speaking, we may choose over some near-term period to keep on trucking and paper over the differences while we work out what we want the end product to be.

@kripken
Copy link
Member Author

kripken commented Aug 5, 2016

@JSStats: what you're suggesting sounds more like for the WebAssembly design than for Binaryen? This issue is just for Binaryen responding to a WebAssembly change.

@dschuff: I agree that trucking on for a while sounds reasonable, but it's not quite that easy. I've been working towards that in the (increasingly misnamed ;) since it's all about stack now) land-drop branch, and it's been quite a lot of work so far. In addition, it looks like we'd need to rewrite our s-expression parser from scratch. If it's indeed the case that this is a significant amount of work, it seems unwise to do it only to likely get rid of it later - so it makes sense to focus on the change in direction now.

@dschuff
Copy link
Member

dschuff commented Aug 5, 2016

That makes sense; If we do want to just go toward a new IR, we do have sexpr-wasm-prototype which can take the role of binary encoding, decoding, and standalone interpretation. Then the IR can be designed for ease of production from whatever class of compilers we care about, and for wasm-specific optimizations.

@kripken
Copy link
Member Author

kripken commented Aug 5, 2016

Yeah, sexpr-wasm-prototype is perfect for that, it has a proper stack-based interpreter in fact. And I think I saw @binji already has a web port so it could be used as a clientside polyfill (if not I'd be happy to help with that).

I think in the forked-IR path, if we go that route, binaryen might still end up with some wasm import facilities, but they might not be lossless. For example a "wasm2byn" could handle patterns that can't work in an AST by introducing new locals (sort of like the project whose name escapes me that translates wasm into LLVM IR in the sense that it's also not meant to be lossless). And of course we can emit wasm, so optimizing wasm to wasm in binaryen might still be useful, although that's an open question.

@kripken
Copy link
Member Author

kripken commented Aug 6, 2016

I am leaning more in the forked-IR direction. I considered in more detail the option of rewriting binaryen to use a single stack-based IR, but I just don't think it makes sense. The original design principles still relevant are

  • A single main IR has major simplicity and efficiency benefits.
  • That single main IR should be transformable so we can run optimization passes on it. (Without this requirement, the single main IR could just be the wasm binary format.)

But doing those with a stack machine IR seems very hard. On the one hand it's true that some optimizations are easy to write (e.g. removing dead code) while others become possible (stack-machine specific things). But on the other, consider a simple pass that reduces eqz(gt(x, y)) into lte(x, y) (i.e. !(x > y) into x <= y, which is true on integers). Right now in our AST an eqz node has a pointer to the child node. In a stack machine, options are

  • No child pointer. The child is then defined by the location in the array, and might be right before the parent, but also might not be (and even trickier with more than one child).
    • Optimization passes would therefore need to maintain an expression stack, and every optimization pass would need to consult that data structure for child operations. This might not be so bad when traversing in the natural direction, but anything else (other orders, random access, etc.) become tricky.
    • Code receiving a pointer to a node must also receive a "context" pointer to the array, as the node by itself is no longer enough (in our AST right now, we've intentionally made it so a node pointer is always enough; in fact expressions stand on their own even without knowing the module, which makes function parallelism trivial).
    • On the other hand, not having a child pointer would make each node smaller, which is a significant upside. But nodes would still need a "next" pointer (forming a linked list) if we want it to be efficient to move nodes around, insert them, etc., which seems like a strong requirement. So this would only help when the number of child nodes is >1. Still significant, but less.
  • Have a child pointer. This means
    • Node sizes strictly increase compared to our current AST, given the "next" pointer.
    • Each modification of the code needs to be updated in two places, the linked list and the child pointers, which adds overhead compared to now and also introduces bug risk.

@qwertie
Copy link

qwertie commented Aug 8, 2016

@kripken Do you know what the decision-making process is for WebAssembly? I'm increasingly getting the feeling that major decisions are made in private discussions behind closed doors, or other unknown places, with little important discussion happening in the design repo. Did the change to a stack machine catch you by surprise as well?

Based on what I know so far (which is no doubt less than what you know), "Keep on trucking, for now" looks like the best option (for now), because

  • I don't know if the stack machine is a final, irreversible decision.
  • It looks like there are AST equivalents for the new stack machine semantics (have you discovered anything in the stack machine that can't be expressed in AST form with a new operator like 'first'?)
  • By maintaining an AST form for Wasm code, you maintain knowledge about the two-way isomorphism between AST and stack machine. I think that's just useful knowledge from an academic perspective. It could be useful to other engineers in the future and useful to the computer science field.
  • Converting stack-based wasm to an AST is a potentially useful feature to many people. By keeping this ability, you reduce the need for people to migrate to other tools, making your tool more popular and by implication, you yourself ;)
  • It is probably easier to adapt gradually to the changes in Wasm as they happen. We're predicting certain things in the future like a pick operator and multi-values, but we don't know that these will actually happen in the form currently being considered. I think before ripping out functionality or engaging in a major rewrite, you should be certain it is necessary.
  • (edit) As you mentioned, ASTs seem easier to work with for some common optimizations.

This puts us in the uncomfortable position of optimizations needing to take into account that more AST nodes might mean less code size

Are "virtual" nodes really uncomfortable in terms of code complexity, or it is mainly uncomfortable psychologically? If it's the latter, I think you'll get used to it after awhile.

@qwertie
Copy link

qwertie commented Aug 8, 2016

I am leaning more in the forked-IR direction.

If "forked-IR" means "an AST-based IR that is very similar to, and easily convertible to, the Wasm stack machine, such that arbitrary Wasm stack-machine code can be converted easily to the IR," it sounds like a good plan to me. It's even better if Wasm-to-IR-to-Wasm and/or IR-to-Wasm-to-IR are (guaranteed to be) lossless roundtrips.

@kripken
Copy link
Member Author

kripken commented Aug 8, 2016

@qwertie: Yes, I was taken by surprise by the stack machine change. And from what I've seen online and off I was the only browser-vendor person that opposed it. And you can see work going on for it (e.g. in the spec repo). So for better or for worse, I think it's safe to assume it's happening, even if nothing is truly final until it lands, is specced, and ships.

there are AST equivalents for the new stack machine semantics (have you discovered anything in the stack machine that can't be expressed in AST form with a new operator like 'first'?)

Yes, a "first" can solve the current issues, with some awkwardness, but as discussed above multi-values are almost certain to happen, and "pick" and others seem very possible. It's clear the awkwardness for an AST will increase.

By maintaining an AST form for Wasm code, you maintain knowledge about the two-way isomorphism between AST and stack machine. I think that's just useful knowledge from an academic perspective. It could be useful to other engineers in the future and useful to the computer science field.

I agree and those are some of the reasons I opposed WebAssembly moving to a stack machine. (My main reason is the negative implications for the text format.)

Converting stack-based wasm to an AST is a potentially useful feature to many people.

I agree, and I think we might still support a form of that, but it wouldn't be lossless. For example, a wasm2binaryen tool might use additional locals for stack machine code we can't directly represent.

Are "virtual" nodes really uncomfortable in terms of code complexity, or it is mainly uncomfortable psychologically? If it's the latter, I think you'll get used to it after awhile.

I think there is a fundamental issue here. The optimizer currently has some useful principles like

  • AST nodes ~ work. That makes it easier to write optimization passes. In particular, it is obviously the "right thing" in most cases to replace some AST nodes with a smaller number of AST nodes.
  • Removing AST nodes is good. And not just individually as implied by the previous point: if optimization passes decrease the number of nodes, we have a good guarantee of repeated optimization reaching a fixed point (sort of like a loss function in gradient descent or a Lyapunov function in differential equations; i.e., we have a monotonically decreasing function in the number of nodes, and it can't decrease forever, so we must halt).
  • The same points also apply when replacing "work" with "code size", i.e., when optimizing for size over speed.

Now, there are exceptions to those principles. For example, a block does no work by itself, e.g. (block (block ..) x) is the same as (block .. x). So we lose the "AST nodes ~ work" property. But we do still have the principle that decreasing AST nodes is good (for AST size and code size). Another example is that some math operation might be faster with more nodes, like say turning an integer divide by a constant into a sequence of shifts and adds. That violates the first two principles, so the optimizer has to be careful about it. In this case, we can just not do the reverse of that operation, but it does mean we need to keep that in mind when writing additional integer math optimizations. At least, though, the third principle remains in that more AST nodes means more code size, so we clearly know not to do it when optimizing for size.

The problem with "first" is even trickier in that all three of those principles are no longer true: adding such nodes can reduce output code size and work, but the opposite is true in other cases. So we can't just not do the inverse, and it isn't trivial what to do when optimizing for size, and so all optimization passes that remove or create "first"s must be coordinated. Now, of course this isn't an insurmountable problem. But it (and multi-values, and pick, etc.) add complexity and awkwardness to our optimization IR, and not just in a few places. An AST is just not a good 1-to-1 IR for such stack machine code.

If "forked-IR" means "an AST-based IR that is very similar to, and easily convertible to, the Wasm stack machine, such that arbitrary Wasm stack-machine code can be converted easily to the IR,"

No, sorry, the name might be confusing. Forking here means that we diverge from WebAssembly by not adopting the stack machine elements that are awkward in our AST. A better name might be "only output wasm", as it implies wasm is only our output format, while the forked IR is our input format, internal IR, binary and text format, etc. But the direction of Binaryen IR => WebAssembly would remain a very easy and fast operation, of course (since we would be pretty much a subset of WebAssembly).

@qwertie
Copy link

qwertie commented Aug 9, 2016

OIC - "~" means "varies proportionally with". Aka x α y...

It seems to me you can model most of this with a per-node weight - say, most operations have weight 1, multiplications have weight 2-ish, divisions have weight 10-ish... and you can define three weights, one for "optimize for speed", one for "optimize for size" and one for "balanced". If an algorithm needs positive weights, that's okay: it might be better to give first a weight of 0.001 rather than 0 since w; x; y + z has the (small) virtue of being easier to understand than w; first (y, x) + z. On the other hand, another optimizer might have chosen w; first (y, x) + z intentionally, let's say because w might modify memory read by x... anyway, I've never written optimizers so I'm probably not helping :)

I see that multi-values are likely to happen as currently envisioned because they seem easy for the back-end to support. But local variables do everything we would use pick for, so I think it'll be awhile before they consider that - unless they eliminate local variables entirely (a slightly horrifying thought).

FWIW I support your opposition to the stack machine, or at least I think there should be a closely-related AST variant of Wasm for the text format and for binaryen. I recognize the (small) value of the stack machine in the final back-end, but still wonder if there's some way to achieve a similar benefit in the AST paradigm (I'm cursing myself for not having found the time to learn Wasm better, otherwise I'd love to investigate this.)

It sounds like you weren't really consulted about this decision, but do you know what the decision-making process is?

@ghost
Copy link

ghost commented Aug 9, 2016

@kripken A sketch of a single pass stack-machine to text format algorithm is at WebAssembly/design#753 This does not use a first operator, rather lexical constants, which would seem to work with the current proposal in the stack-machine to text direction, but be fragile in the other direction without the pick operator. Perhaps you can work it into something concrete, or if you find any show stoppers or loss then please report it so it can be understood and pondered. You could explore the utility of the pick operator, which would probably eliminate a lot of local variables.

@kripken
Copy link
Member Author

kripken commented Aug 9, 2016

OIC - "~" means "varies proportionally with". Aka x α y...

Yeah, sorry, I should have been more clear.

It seems to me you can model most of this with a per-node weight [..]

Yes, but the issue is that "first" would need to have a negative weight in the context of code size. That's already troubling, but in addition, even that isn't true as the weight depends on the surrounding code, it shouldn't be negative in other cases, so it doesn't even have an intrinsically definable weight.

But local variables do everything we would use pick for, so I think it'll be awhile before they consider that

It might be a while, but I hope it'll happen at least for the text format: with pick you can avoid oddities like a let variable that can only be used once and in certain places.

do you know what the decision-making process is?

I don't know that there is anything formal written up. For the informal stuff I'm not sure since I'm not very good at politics ;)

@qwertie
Copy link

qwertie commented Aug 9, 2016

@kripken I'm not seeing it. What's an example where first would have a negative weight?

@kripken
Copy link
Member Author

kripken commented Aug 9, 2016

Imagine that we want to do a call, set it to a local, a store, and then use the call result (and the store and the call can't be reordered):

(i32.eqz
  (first
    (tee_local $x (call ..))
    (store ..)
  )
)

vs.

(set_local $x (call ..))
(store ..)
(i32.eqz
  (get_local $x)
)

Both of those have 5 AST nodes, but the former has one less node in the stack machine output since it doesn't need the get_local.

@qwertie
Copy link

qwertie commented Aug 9, 2016

A negative weight isn't needed here: if first's weight is zero or near-zero, the first version has lower total weight so the optimizer should prefer it.

@ghost
Copy link

ghost commented Aug 9, 2016

@kripken If you introduce first now then the text format will probably be stuck with it and given that the lexical constants can also represent this pattern then there are now two text formats for the same stack machine code introducing a loss. I don't even think first aka block1 is a familiar operator to JS programmers so why bother introducing it, it's a dead end. A plan is needed.

@kripken
Copy link
Member Author

kripken commented Aug 9, 2016

@qwertie: sure, but it can't be zero in other cases, and a specific non-zero value might work in some but not others.

The underlying issue is that in a stack machine, it's easy to see the cost of this operator (or rather, what it does, since it doesn't exist there). In an AST an analysis is needed to calculate the cost.

@drom
Copy link

drom commented Aug 10, 2016

Agree with @JSStats that first not readable nor native to "stack machine" concept.

@kripken
Copy link
Member Author

kripken commented Aug 10, 2016

At this stage I'd like to propose that we go in the forked-IR, aka "only output wasm" direction:

  • The internal IR, our AST, does not change, but it is now called Binaryen IR. In time it will diverge from WebAssembly (e.g. on the stack machine issue) but it will remain a core goal to stay as close as possible (same types, semantics, etc.) and basically to have an IR that trivially compiles to WebAssembly with almost no work because it's practically WebAssembly already.
  • We no longer focus on WebAssembly as an input. It is an output for us, and it is our primary output aside from our internal IR. (But see wasm2byn below.)
  • Our main focus is on supporting compilers to WebAssembly, e.g., asm2wasm, mir2wasm, s2wasm, etc. The more specific goals when supporting such compilers remain unchanged: to make it easy to write a compiler to WebAssembly, to be fast, and to emit good code.
  • The current binary format support we have becomes the Binaryen IR Binary Format, perhaps with suffix .byn (pronounced "bine"; see previous discussion).
  • The current s-expression support we have becomes the Binaryen IR Text Format, perhaps with suffix .tyn (pronounced "tine"; spelling is parallel to .byn, replacing 'b' for binary with 't' for text). Our own .wast files in the test suite are renamed to the new suffix.
  • asm2wasm => asm2byn, wasm-as => byn-as, wasm-dis => byn-dis, wasm-opt => byn-opt, wasm-shell => byn-shell. No actual change but input and output where relevant are switched from WebAssembly to Binaryen IR.
  • Add a byn2wasm tool that emits WebAssembly. As our current binary format is identical to wasm, we can share almost all that code for now (just have a flag for the divergent points). emcc will call this tool to emit the wasm binary when compiling, etc.
  • wasm.js => byn-emcc.js, this integrates with emcc and allow running Binaryen IR in a compiled interpreter in JS. Useful for toolchain testing, testing the Binaryen IR interpreter (which is important since we use it in the compiler, e.g. --precompute so far), and ensuring our semantics don't diverge too much from WebAssembly. (I previously considered removing this, but changed my mind.)
  • binaryen.js was a standalone wasm interpreter. We no longer input wasm, so we leave that to other tools, and can remove it. (But see wasm2byn below.)
  • s2wasm => s2byn, it would now emit Binaryen IR. Note that s2wasm must track changes in the wasm backend, which for now doesn't emit stack machine code that is a problem for our IR, but if that happens, s2wasm might need to do lossy conversion (see wasm2byn below).
  • We can still run a subset of the wasm spec tests, basically everything but where stack machine notation pops up. This would let us keep coverage on all the rest of the suite, which is very useful for testing our interpreter and keeping our semantics as close as possible to WebAssembly. And it makes sense to keep our text format compatible with .wast files as much as possible anyhow (and as with the binary format almost all the parsing code could be shared).
  • wasm.h, wasm-validation.h and so forth become byn-*.h, etc.
  • Internally, we use the variable name "wasm" for a module in various places, that should be renamed to "module".

(Assuming we agree on this proposal, I am of course signing up to do all the above work; but if anyone wants to help that's great.)

Possible future steps, worth mentioning now since they might influence the discussion:

  • With our own text and binary formats, I think it makes sense to introduce optional basic block + branches support. We have that support in our C and C++ APIs already, in the form of the relooper API and its IR, so this would just reflect that in our IR's text and binary formats. This would be little work. It would help in internal testing as well as external compilers, e.g., right now mir2wasm uses the Binaryen C API with the relooper, and if we add basic blocks to the text and binary formats then it and similar projects could in principle generate one of those formats directly if they prefer. Basically, this would make our IR easy to target for a wide range of compilers: AST-based or CFG based, both would be supported.
  • Introduce a more sophisticated wasm output layer, that can e.g. reorder stack machine code into forms we can't directly represent in our IR. This might be a secondary IR.
  • Introduce a wasm2byn tool that imports WebAssembly. This would not be lossless as we can't represent all WebAssembly directly, so it might e.g. use more locals for some stack machine code, etc. But overall it would be a simple importer to write, and could be useful for various things even without being lossless, like a wasm-to-wasm optimizer, a standalone wasm interpreter, etc. (which would be trivial by just chaining wasm2byn to our byn* tools).

Thoughts?

@ghost
Copy link

ghost commented Aug 10, 2016

@kripken A fundamental problem with this plan is that it still does not address multiple values. Development needs to more forward, develop a counter proposal that addresses the issues, rather than just resisting development and shuffling the chairs. People care about wasm because it is a deployment platform.

I think the general point you are making about the disconnect between the language that producers want to easily emit and the twisted stack machine constraints is a key point. The same was seen with the x87, a numerical stack machine, where compilers just wanted to connect inputs and output but had to work around frustrating stack machine constraints, and to optimize for these constraints, and and to be able to annotate where values were for debugging. The pick operator helps a little as it can access values in general up the stack, but if the requirement that a block stack be empty except for the out values stands then this will require really frustrating swap and drop sequences to clean up the stack.

If wasm is just a target, and not optimized for code size, then the expressionless register based code is much simpler for producers.

@kripken
Copy link
Member Author

kripken commented Aug 11, 2016

@JSStats: I agree we need a plan for multi-values, and if we don't have a good one then we can't move forward here.

One option is to just not add them to our AST. This assumes we can still generate good code without it. I don't know if that's true.

Another is to add it to our AST with something like

(set_local_multi $x $y
  (multi
    (..expression to assign to $x..)
    (..expression to assign to $y..)
  )
)

i.e. multi creates a multiple value, and set_local_multi can destructure it into components and assign them into separate locals. And multiple values can flow through blocks, ifs, and function returns. This seems plausible, but I'm not sure.

@ghost
Copy link

ghost commented Aug 11, 2016

@kripken The lexical constants, the pick aka get_value proposal, seems to be able to address the multiple values use case with a minimum complexity for an expression based language, so function return values are immediately assigned to these lexical constants and are then handled by single value expressions. It is close to the stack-machine proposal, but with some significant differences from what has been proposed.

The other option is to depend on local variables, and this takes wasm down the expressionless encoding path. The encoding efficiency was not as good, and the live range of definitions is not as well defined, but it would be a simple compiler target.

A language with more general multiple values support was explored in WebAssembly/wabt#66 and the comments detail the design and operations.

@rossberg
Copy link
Member

On 8 August 2016 at 23:14, Alon Zakai notifications@github.com wrote:

@qwertie https://github.com/qwertie: Yes, I was taken by surprise by
the stack machine change. And from what I've seen online and off I was the
only browser-vendor person that opposed it. And you can see work going on
for it (e.g. in the spec repo). So for better or for worse, I think it's
safe to assume it's happening, even if nothing is truly final until it
lands, is specced, and ships.

The superficial history is that most of it "happened" in ongoing
conversations between a couple of Mozillians ad Googlers. Luke & Dan have
been arguing in the more stack-ish direction for a while. Ben & I were
rather skeptical at first, but feared "slow drift" (such as more
incremental changes of the drop kind). The pivot point was the suggestion
to "allow void-expressions anywhere", which quite plainly breaks the AST
semantics. So, to see to the end game, we prototyped a stack machine in
formal spec, ml-proto and V8. It turned out to be simple enough to convert
us over.

That said, I can definitely see the downsides. It is less structured, after
all. And structure is usually good.

@dschuff
Copy link
Member

dschuff commented Aug 11, 2016

@kripken
In general I'm on board with Binaryen's IR not needing to be wasm per se. Binaryen (as you are conceiving it here) is essentially a tool and optimization framework for compilers to more easily target wasm, and the IR you use for those optimizations doesn't need to correspond or translate 1:1 with wasm, even if you use it to implement a wasm->wasm optimizer.

I'm a bit confused as to why we need to formalize and have a binary format (and e.g. and full-blown interpreter) for this IR. I can buy having a text format (which is useful for writing tests). But the following things make me nervous:

It sounds like you are conflating 2 different things with this proposal: A compiler IR meant for optimizations, and machine format/VM that you want compilers to target. Do we want to try to make this IR stable? (my opinion: no) Do we want to write a linker for this format? (my opinion: no). Do we want to define ABIs in this format? (probably not). A wise soul once cautioned the LLVM community about doing this and even though LLVM IR might be an extreme example compared to wasm, the underlying lesson is that that the goals of an optimization IR are different from a platform (and here especially I mean "platform" in a broader sense than just a virtual machine, e.g. ecosystem, interchange format, etc).

For example, we know that it's likely that an optimizing compiler might want to use a CFG and not do its own control-flow restructuring, so Binaryen's library includes a relooper API. This is fine if we convert it to a structured IR before optimizing. But we wouldn't want to require that, so what do we do if we force everything through a binary format that matches the IR? Making a binary format that supports both seems like a bad idea because even aside from the complexity, it's incompatible with the stated goal of keeping the IR close to wasm.

Another way to say it is, the goal of keeping the IR as close to wasm as possible may be opposed to the goal of making it useful for optimizations, and making more like a "platform" (binary format,e tc) is opposed to the goal of making it flexible for different use cases.

Hopefully that wasn't too ramble-y.

@kripken
Copy link
Member Author

kripken commented Aug 11, 2016

I definitely don't want to define a new VM or machine format or platform here. The only goal I have in mind is a library that makes compiling to wasm easy, fast, and emits good code. Period.

And I believe an interpreter serves that goal very well:

  • In the --precompute pass, in just a few lines we find stuff the interpreter can run and replace it with the result. This is much, much easier than writing such a pass by hand, and it also has a guarantee of working on all of our IR.
  • This could also be used to remove global ctors by running them at compile time. Emscripten's asm.js optimizer does this now (by running the asm.js in a JS engine!). Compare this to LLVM's GlobalOpt which tries to do the same, but just handles the IR that it was written to handle, and aborts otherwise (e.g. it fails on removing the ctors added in a C++ iostream hello world).
  • A first-class interpreter is also very useful for testing. I've wished many times LLVM's lli worked better than it does.

And we already have a fully-functional interpreter now, so it makes sense to keep it.

Regarding a binary format: it's useful for the same reasons LLVM has a binary format, it's convenient and fast (in particular it will improve our compile times, unless we write all-in-one tools instead of chaining independent ones). And as with an interpreter, we already have one.

Linking: I don't feel strongly about this one. On the one hand it could be useful (like llvm-link is useful). But it seems very low priority at best.

Stability: I agree, not a goal. (While in theory it could help compilers targeting us, the burden on us would be far too great.)

CFG support: You're right that adding this to our IR formats diverges us more from wasm. But:

  • Not by a lot. Literally it's just adding something like (cfg (..block1..) (..) ..). It's not much of a divergence.
  • It's useful. As you said, many compilers want to emit CFGs. Why should they need to use our C API if the IR format would be easier for them?
  • It would also be easier for us for testing, look at the current relooper tests, it's a bunch of ugly C programs using the C API. It would be much nicer to use IR in text form. Just like in LLVM, you want to write tests in IR, you don't want to write C programs using the LLVM C API.

I'm not happy about diverging from wasm, and would still love to find a way to avoid it. But it does have upsides, like being able to make small but useful additions to our IR formats such as CFG support.

@kripken
Copy link
Member Author

kripken commented Aug 26, 2016

Another reason for Binaryen to have its own file format (and not keep using wasm) is that wasm's stack machine pivot causes some issues with unreachable code, namely that adding or removing a branch that alters code reachability can turn wasm code from valid to invalid and vice versa. Details are still being discussed last I heard, but that seems like an annoying property in an object file format (especially when bringing up a new compiler).

Overall, I think it's fair to say that wasm is focused on what browsers want to consume, while an object file format would have somewhat different objectives in my opinion:

  • Fast to link object files together (O(1) with low constant factor)
  • Fast to "link" an object file to a shippable wasm file that browsers can consume
  • As optimizable/LTOable as possible given those constraints (since on the web code size matters more than any other platform by a huge margin)
  • As easy as possible for compilers to emit

.byn files as described above should be able to achieve those goals (fast linking hasn't changed; "link" to wasm is an almost trivial operation since types&semantics are essentially identical; optimizability is pretty good by being AST-based, as evidence we have the current Binaryen IR optimization passes; relatively easy to emit by avoiding stack oddities as mentioned above, and also as discussed earlier we can add CFG support, etc.).

I had originally thought that the toolchain using raw wasm files for object files was an excellent opportunity for wasm - avoiding an extra format simplifies the ecosystem substantially. And that led to the original Binaryen design. But I think that's just not viable any more, wasm's design is evolving in ways that make sense for browsers but have downsides for toolchains. A single format can't fit all I suppose.

@lukehutch
Copy link

lukehutch commented Feb 1, 2017

I am building a compiler for a new language, and came to the conclusion that the most important backend I could build would be a wasm backend, given that it has the potential to become the standard way to ship processor-agnostic binaries for server, desktop and mobile, both inside and outside the browser.

I just now came back to get an update on the progress of wasm, for the first time in 6 months, and discovered the switch from AST to stack-based architecture. I consider this to be a huge mistake (and I think the binayen folks on this thread are being too diplomatic about how to cope with the upstream change!).

There are several reasons why switching to a stack-based architecture is a bad idea:

(1) As pointed out in other comments, this is a major step backwards for the transparency of code and ease of learning on the web: stack-based code is nearly impossible to read.

(2) Stack-based architectures are massively inefficient at runtime, which wastes battery on mobile, causes UI latency and jank (which incurs a cognitive burden in users), kills baby polar bears etc. etc.
The following paper shows how much overhead a stack-based VM incurs compared to a register-based VM, after significant effort was expended to optimize both:

https://www.usenix.org/legacy/events/vee05/full_papers/p153-yunhe.pdf

As far as I can tell, the reasons for switching to a stack-based architecture was initially the trigger of allowing void expressions anywhere, but ended up coming down to the increased ease of building the runtime (since you don't have to build as much compiler backend infrastructure if you're already at a lower level than an AST), as evidenced in the following comment by @rossberg-chromium from earlier in this issue:

The superficial history is that most of it "happened" in ongoing
conversations between a couple of Mozillians ad Googlers. Luke & Dan have
been arguing in the more stack-ish direction for a while. Ben & I were
rather skeptical at first, but feared "slow drift" (such as more
incremental changes of the drop kind). The pivot point was the suggestion
to "allow void-expressions anywhere", which quite plainly breaks the AST
semantics. So, to see to the end game, we prototyped a stack machine in
formal spec, ml-proto and V8. It turned out to be simple enough to convert
us over.

It is likely that even more overhead would be incurred between compilers forced to compile from a stack-based IR and an AST-based IR than even the difference between stack-based and register-based IR, since the higher level of an AST-based IR makes so many optimizations simpler. For an example of the types of hoops you have to jump through to safely perform even simple dead-code elimination in a stack-based IR, see this paper:

http://set.ee/publications/bytecode07.pdf

So many simple optimizations require careful thought, soundness proofs, and new abstractions (such as swapping virtual stack snapshots due to differences in control flow) when reasoning about a stack-based VM, and most peephole optimizations are simply undoing some ineffeciencies imposed on the code by lowering to a stack-based form (e.g. replacing DUP SWAP with DUP), because the representation is so low-level:

http://www.rigwit.co.uk/thesis/chap-5.pdf
https://users.ece.cmu.edu/~koopman/stack_compiler/stack_co.html

Some other reasons why stack-based representations can be more inefficient include the following (there are many -- the quoted points are taken from this Wikipedia page):

  • Stack machines must always properly unwind the stack before exiting an execution frame (they cannot leave unneeded values on the stack, the way that a register machine can simply leave unneeded values in registers, depending on the calling convention).
  • "Some in the industry believe that stack machines execute more data cache cycles for temporary values and local variables than do register machines."
  • "In register machines, a subexpression which is used multiple times with the same result value can be evaluated just once and its result saved in a fast register. . . With stack machines, in contrast, results can be stored in one of two ways: A temporary variable in memory. . . The second way leaves a computed value on the data stack, duplicating it as needed." (both are inefficient)
  • "Rigid code order: Scheduling memory accesses requires explicit, spare registers. (Out-of-order execution) is not possible on stack machines without exposing some aspect of the micro-architecture to the programmer."
  • "it takes about 1.88 stack-machine instructions to do the work of a register machine's RISC instruction."
  • Stack machines "hide a faster register machine inside . . . if the code stream instead had explicit register-select fields which directly manipulated the underlying register file, the compiler could make better use of all registers and the program would run faster."
  • "Interpreters for virtual stack machines are often slower than interpreters for other styles of virtual machine. This slowdown is worst when running on host machines with deep execution pipelines, such as current x86 chips. A program has to execute more instructions when compiled to a stack machine than when compiled to a register machine or memory-to-memory machine. Every variable load or constant requires its own separate Load instruction, instead of being bundled within the instruction which uses that value. "
  • "In some interpreters, the interpreter must execute a N-way switch jump to decode the next opcode and branch to its steps for that particular opcode. . . The host machine's prefetch mechanisms are unable to predict and fetch the target of that indexed or indirect jump. So the host machine's execution pipeline must restart each time the hosted interpreter decodes another virtual instruction. This happens more often for virtual stack machines than for other styles of virtual machine."
  • "Pure stack machines are quite inefficient for procedures which access multiple fields from the same object. The stack machine code must reload the object pointer for each pointer+offset calculation."

(3) AOT compilation from a stack-based IR is much more complicated than AOT compilation from an AST. This means that some implementers of wasm will simply choose to build an interpreter, rather than an AOT compiler, leading to great inefficiencies. (On the other hand, if an interpreter is what is wanted, building an interpreter for an AST-based IR would not be much more complicated than building an interpreter for a stack-based IR -- so compilation suffers with a stack-based design, but interpretation does not suffer either way.)

Basically, stack based IRs / bytecode systems are a terrible idea if you care about either interpreter efficiency, difficulty of writing optimizing AOT compilers, or about code transparency, readability or learnability. It is sad that four guys made this decision, perhaps without considering all aspects of these issues, which will affect both billions of devices and billions of users. This is a major step back for the web.

Is there any chance the decision could be revisited?

(I see this is a binaryen bug, but at least people are talking about the issues here -- is there a better venue for this discussion?)

@kripken
Copy link
Member Author

kripken commented Feb 2, 2017

@lukehutch Personally I agree with you - I think it was a mistake for wasm to make this change.

However, the specific arguments about perf are mostly incorrect - it's true that executing a stack machine can be very inefficient compared to a register machine etc., but the goal in wasm isn't to actually execute the code that way - VMs would translate it to something efficient first (optimized machine code in most cases).

With that out of the way, as I said, I definitely agree that the stack machine change was a bad idea for the rest of your reasons - code transparency, readability, learnability. And more generally, openness in the wide sense.

Sadly I don't think there is much of a chance to revisit this. The small group of people driving the wasm design are all in agreement on this matter (as you quoted), and everyone is working hard towards shipping. But, if you want to try, then opening an issue on the wasm design repo would be the right place.

@lukehutch
Copy link

@kripken Correct, wasm will typically be AOT-compiled down to native code. But most of the points I raised are also true of AOT compiled code, not just interpreted code.

From the Wikipedia page on stack machines: "The object code translators for the HP 3000 and Tandem T/16 are another example.(15)(16) They translated stack code sequences into equivalent sequences of RISC code. Minor 'local' optimizations removed much of the overhead of a stack architecture. Spare registers were used to factor out repeated address calculations. The translated code still retained plenty of emulation overhead from the mismatch between original and target machines. Despite that burden, the cycle efficiency of the translated code matched the cycle efficiency of the original stack code. And when the source code was recompiled directly to the register machine via optimizing compilers, the efficiency doubled. This shows that the stack architecture and its non-optimizing compilers were wasting over half of the power of the underlying hardware."

Refs:

(15) HP3000 Emulation on HP Precision Architecture Computers, Arndt Bergh, Keith Keilman, Daniel Magenheimer, and James Miller, Hewlett-Packard Journal, Dec 1987, p87-89. (PDF)

(16) Migrating a CISC Computer Family onto RISC via Object Code Translation, K. Andrews and D. Sand, Proceedings of ASPLOS-V, October, 1992

Both of these papers are old, but the point is still true today that optimizing stack-based IR code is exceptionally difficult. Stack-based IRs make the code less efficient, and most of the difficult work of building an optimizing compiler is just using local optimizations to undo the inefficiency of the stack machine. Once you undo that efficiency, you're left with code that is borderline obfuscated, difficult to optimize further, and much less efficient than code that has not passed through a stack IR stage.

Thanks on the pointer to the wasm design repo, I'll post there.

@kripken
Copy link
Member Author

kripken commented Feb 2, 2017

Yeah, it's true that

This shows that the stack architecture and its non-optimizing compilers were wasting over half of the power of the underlying hardware.

But in wasm we are talking about optimizing compilers, and specifically ones using SSA IRs. The wasm stack machine code is translated into the VM's SSA IR, at which point it doesn't matter if it came from a stack machine representation or register machine representation or AST or anything else - it's a bunch of operations on virtual registers in basic blocks in a CFG. Then it can be optimized exactly like the best native compilers do today.

We also have perf numbers on wasm running in current VMs - it's very close to native speed. asm.js was already close, and wasm improves on that. So throughput is just not an issue here.

@ghost
Copy link

ghost commented Feb 2, 2017

@lukehutch It's not so bad, sleep on it and take another look. There is not really a 'stack machine' rather wasm is designed to decode to validated SSA form quickly, and with a few extensions code could be encoded in SSA form, so the 'stack machine' could be viewed as just an efficient encoding. The stack encoding is frustrating to work with, so just don't, it's not intended to be an IR, it's an encoding, just decode it to SSA form and work with that, and wasm is designed to make this easy to work with untrusted code by being able to validate and decode it to SSA in a single pass - that is good isn't it? Even the local variables are just place holders for holding definitions while decoding and are expected to be gone at the end of the decoding stage.

The design is not even complete, we don't have high performance decode/compile implementations yet and I suspect that some more changes will be necessary to support this. Wasm is optimized to compile down to machine code AOT, before it runs anyway, not incrementally compiled as needed and not optimize for an interpreter which is still possible.

@lukehutch
Copy link

@Wllang then why not encode directly in SSA form? Adding the stack-based form only obfuscates the problem. That doesn't make any sense, there is no value added by the stack-based representation other than the intellectual curiosity of how to optimally solve the new problems that a stack machine introduces when you're trying to convert to a more useful form.

@kripken
Copy link
Member Author

kripken commented Feb 2, 2017

The problem is that SSA form is not compact - it's much larger than an AST or stack-based representation.

@ghost
Copy link

ghost commented Feb 2, 2017

@lukehutch I am trying to do just that, made some recent progress, but I don't have a conclusion on the encoding efficiency just yet. It is possible that @kripken is right, that the local variables will prove to be an efficient encoding and SSA 'much larger', but I am a long way from conceding just yet and if I do I'll be able to explain it. There are certain many functions that encode well in SSA form, but also see data flow patterns that are a challenge and I need to work through them.

@ghost
Copy link

ghost commented Feb 2, 2017

@lukehutch Here's a quick example of where I'm at with it, the main function from a zlib benchmark. Compare it to binareyen IR and I hope people see merit in this, perhaps imagine it in your preferred language format. Obviously the code below needs work, lots of patterns that are not optimal, just trying to get to the proof of concept stage. E.g. many of the br_case paths need jump threading and can just fall through, the loop could fall through here, and I have a long long list of other patterns that need work. You'll just have to wait on the encoding efficiency, or perhaps ask @kripken if he really knows.

(func $main ((n0 i32) (n1 i32)) (i32)
   (declare (export "main"))
   (set n2 (i32.sub (i32.load :offset 4 0) 16))
   (i32.store :offset 4 0 n2)
   (set n14
        (block $block-1
          (set n4
               (block $block-2
                 (br_if $block-2 (values 500) (i32.lt_s n0 2))
                 (set n3 (i32.add (i32.load8_s (i32.load :offset 4 n1)) -48))
                 (when (i32.gt_u n3 5)
                   (i32.store n2 n3)
                   (drop (call $printf 160 n2))
                   (br $block-1 (values -1)))
                 (br_case nil (0 1 2 3 4 5 0) n3
                      (case (br $block-1 (values 0)))
                      (case (br $block-2 (values 60)))
                      (case (br $block-2 (values 250)))
                      (case (br $block-2 (values 500)))
                      (case (br $block-2 (values 2500)))
                      (case (br $block-2 (values 5000))))))
          (set n5 (call $malloc 100000))
          (block $block-12
            (loop $repeat-8 (n6 n7 n8) (values 0 0 17)
               (set (n9 n10)
                    (block $block-9
                      (when (i32.lt_s n7 1)
                        (unless (i32.and n6 7)
                          (br $block-9 (values (i32.and n6 31) 0)))
                        (br $block-9 (values n7 (i32.rem_u (i32.mul n6 n6) 6714))))
                      (br $block-9 (values (i32.add n7 -1) n8))))
               (i32.store8 (i32.add n5 n6) n10)
               (set n11 (i32.add n6 1))
               (br_if $repeat-8 (values n11 n9 n10) (i32.ne n11 100000))
               (br $block-12 (values))))
          (block $block-14
            (loop $repeat-13 (n12) (values 0)
               (call $doit n5 100000 n12)
               (set n13 (i32.add n12 1))
               (br_if $repeat-13 (values n13) (i32.lt_s n13 n4))
               (br $block-14 (values))))
          (drop (call $puts 176))
          (values 0)))
   (i32.store :offset 4 0 (i32.add n2 16))
   (values n14))

@lukehutch
Copy link

@Wllang thanks for the example. What is this exactly? Is this some SSA code converted from the lowlevel stack representation, rendered into s-expression form, or is this a proposed alternative to the current stack machine model?

@ghost
Copy link

ghost commented Feb 2, 2017

@lukehutch It's binaryen output decoded to SSA retaining the control structure. There are already provisions in wasm to do encode much of this and it's a minor variation of the encoding:

  • provisions for blocks returning multiple values, presented as (set (v1 v2) (block ...))

  • provisions for loops having a signature which is used for the phi nodes here. The phi nodes and their values popped on loop entry are presented as (loop (n1 n2) (values v1 v2) ...) in the loop header here.

  • To encode the above requires adding the pick operator.

  • The above is presented in packed expression format, but it can be presented one operator per statement, it's just a different presentation. It's not clear above which definitions need pick and which are consumed in stack order, but that seems an encoding detail. Some redundant type information is also omitted in the presentation above.

  • The encoding benefits from having block ends unwind the stack.

  • The function arguments are pushed on the stack.

  • Changed br_table to a table-case operator as the constraint on br_table of passing a fixed set of values to all targets was trouble to work with. Need to revisit this.

  • The lack of a label and separate signature for the loop exit path was trouble, so wrapped them all in a block and need to revisit this.

  • The control structure need not be preserved, it can be convert to basic blocks, perhaps numbering them to retain some ordering.

  • It is not necessary to start with binaryen wasm output. E.g. it is possible to start with machine code and structure the CFG and convert to SSA then convert it to the above format, and I do that too, write out C and compile it and decode it back to wasm just to double check. Found much simpler algorithms than the relooper to structure the code, one or two pages of code.

Here's the above translate roughly to C, the C compiler cleans it up. Obviously a linear memory scheme needs to be chosen too, but for trusted code aligned to the target memory it would be as simple as follows. Have wasm code compiled in this way demonstrated running on IoT devices. Add pointer masking and good type derivation to minimise it and the code has a new layer of security and relatively good performance even without memory management. Add C type annotations and it can interoperate with C libraries cleanly. It's not valid C code due to all the casting, needs some compiler flags, perhaps there is a better approach, need to revisit.

uint32_t main(uint32_t n0, uint32_t n1) {
    const uint32_t n2 = (*(uint32_t *)(0 + 4)) - 0x10;
    *(uint32_t *)(0 + 4) = n2;
    uint32_t n14;
    {
        uint32_t n4;
        {
            if ((int32_t)n0 < (int32_t)2) {
                n4 = 0x1F4;
                goto $block_2;
            }
            const uint32_t n3 = ((int32_t)*(int8_t *)(*(uint32_t *)(n1 + 4))) + -48;
            if (n3 > 5) {
                *(uint32_t *)n2 = n3;
                _printf(0xA0, n2);
                n14 = -1;
                goto $block_1;
            }
            switch (n3) {
            case 1:
                    n4 = 0x3C;
                    goto $block_2;
            case 2:
                    n4 = 0xFA;
                    goto $block_2;
            case 3:
                    n4 = 0x1F4;
                    goto $block_2;
            case 4:
                    n4 = 0x9C4;
                    goto $block_2;
            case 5:
                    n4 = 0x1388;
                    goto $block_2;
            case 0:
            default:
                    n14 = 0;
                    goto $block_1;
            }
        }
    $block_2:;
        const uint32_t n5 = _malloc(0x186A0);
        {
            uint32_t n6 = 0;
            uint32_t n7 = 0;
            uint32_t n8 = 0x11;
        $repeat_8:;
            {
                uint32_t n9;
                uint32_t n10;
                {
                    if ((int32_t)n7 < (int32_t)1) {
                        if ((n6 & 7) == 0) {
                            n9 = n6 & 0x1F;
                            n10 = 0;
                            goto $block_9;
                        }
                        n9 = n7;
                        n10 = ((n6 * n6) % 0x1A3A);
                        goto $block_9;
                    }
                    n9 = n7 + -1;
                    n10 = n8;
                    goto $block_9;
                }
            $block_9:;
                *(uint8_t *)(n5 + n6) = n10;
                const uint32_t n11 = n6 + 1;
                if (n11 != 0x186A0) {
                    n6 = n11;
                    n7 = n9;
                    n8 = n10;
                    goto $repeat_8;
                }
                goto $block_12;
            }
        }
    $block_12:;
        {
            uint32_t n12 = 0;
        $repeat_13:;
            {
                doit(n5, 0x186A0, n12);
                const uint32_t n13 = n12 + 1;
                if ((int32_t)n13 < (int32_t)n4) {
                    n12 = n13;
                    goto $repeat_13;
                }
                goto $block_14;
            }
        }
    $block_14:;
        _puts(0xB0);
        n14 = 0;
    }
 $block_1:;
    *(uint32_t *)(0 + 4) = n2 + 0x10;
    return n14;
}

@lukehutch
Copy link

@Wllang Thanks for explaining. The thing I don't understand is: why encode in stack form at all, if all the consumers of wasm are going to have to undo the stack representation (as you have) in order to do anything useful?

(@kripken I don't think code size is the full answer here -- there must be a non-SSA form that is also non-stack-based, and much easier to work with than the current stack-based system.)

@ghost
Copy link

ghost commented Feb 2, 2017

@lukehutch The efficiency comes from the common case of definitions that have a single use and are consumed in a stack like order, or in other words code that can be represented as an expression and encoded pre-order or post-order. Serial encodings are just not a good representation for some 'useful things' to do with the code, pre-order or post-order.

@rossberg
Copy link
Member

rossberg commented Feb 2, 2017

@lukehutch, the most compact representation for expressions is a pre- or post-order serialisation. The latter turns out to be slightly more efficient to process on the consumer side. A stack machine essentially is just the generalisation of such a post-order encoding. As such, it is pretty much optimal. In particular, it keeps all "register indices" entirely implicit.

@sunfishcode
Copy link
Member

Of course, what's optimal for expressions may not be optimal in general. However, single-use values are very common, so the stack machine encoding is quite effective.

@lukehutch
Copy link

@Wllang @rossberg-chromium are you saying that for this particular stack machine, there is a guaranteed bijective mapping between the stack representation and the AST, and the stack representation is only used as an efficient serialization mechanism for the AST? If the stack representation is only used to serialize the AST, and it is neither intended to be interpreted nor to be displayed in stack-based form in browser debugging tools, then it makes sense that this is a non-issue.

(It is not true for stack machines in general that the stack representation can be unambiguously translated into an AST, depending on what operations are supported, and what the semantics of those operations are. For example, all deserialization bets are off if the stack machine supports a goto instruction that is able to jump to any instruction, or if it supports operations that can push a varying number of parameters onto the stack, depending on an input value.)

@ghost
Copy link

ghost commented Feb 3, 2017

@lukehutch I gave up long ago on seeking an encoding with a one-to-one mapping to a useful presentation language, and it took me too long to realize that it's not necessary as you can define a canonical decoding that should be sufficient, sorry to people on that one. So consider the wasm binary as having many possible encodings for the same canonical language, and variations on that canonical language can be encoded in an unknown section. Perhaps you'll come to the same realisation at some point, but it seems healthy to explore and challenge.

Practically the stack code is also interpreted and is even the default view-source format it seems. I would like to see some web browser hooks to allow custom presentation formats for view-source so the community can explore a range of text formats, and perhaps it will be possible some day.

The wasm code has structured control flow, which relates back to the validation. There are no operations that push a dynamic number of values and the stack depth is always uniquely defined (ignoring some differences in unreachable code).

@rossberg
Copy link
Member

rossberg commented Feb 3, 2017 via email

@kripken
Copy link
Member Author

kripken commented Feb 3, 2017

@rossberg-chromium, that's the current state now, but if as expected wasm ends up adding stacky operations like pick, multiple return values, etc., then the situation for the inverse will worsen, won't it? Or has something changed?

@rossberg
Copy link
Member

rossberg commented Feb 3, 2017

@kripken, true, then you need auxiliary let's in general -- one reason why I'm not particularly fond of pick or other stack hacking ops, and would rather prefer a destructuring let block as primitive, which is more structured.

@lukehutch
Copy link

Practically the stack code is also interpreted and is even the default view-source format it seems.

If the stack code is interpreted directly as a stack machine only when single-stepping in the debugger, that's fine (and, presumably, the debugger would also be free to single-step through an AST-ified representation of the stack format if it wanted to give the user the option of stepping at a higher level). However, if the fundamental stack-based execution model of wasm is preserved even in the AOT-compiled code, meaning that there is an actual stack machine (beyond the call stack) used to store all intermediate values even in compiled code, then there are enormous efficiency implications. See the links I provided in my initial comment for quantitative evidence and qualitative explanations for how badly the stack machine model can impact performance (and therefore battery life, etc.).

So I guess this is at the core of my concern: what will AOT compilers do with wasm stack machine code, and how will it be run outside of the debugger?

@kripken
Copy link
Member Author

kripken commented Feb 4, 2017

However, if the fundamental stack-based execution model of wasm is preserved even in the AOT-compiled code, meaning that there is an actual stack machine (beyond the call stack) used to store all intermediate values even in compiled code, then there are enormous efficiency implications.

As mentioned above, nothing to worry about: the stack machine representation is converted into standard compiler SSA form and optimized, just like clang or gcc would, into machine code. That is how SpiderMonkey and V8 work right now, so you can benchmark this if you want, no need to speculate. For example, here's emscripten's corrections benchmark:

test_corrections (test_benchmark.benchmark) ... 
        clang: mean: 14.877 (+-0.001) secs  median: 14.877  range: 14.876-14.878  (noise: 0.004%)  (3 runs)
     sm-asmjs: mean: 16.035 (+-0.105) secs  median: 15.961  range: 15.948-16.183  (noise: 0.658%)  (3 runs)   Relative: 1.08 X slower
      sm-wasm: mean: 15.561 (+-0.210) secs  median: 15.412  range: 15.408-15.857  (noise: 1.348%)  (3 runs)   Relative: 1.05 X slower
ok

That shows asm.js being 8% slower than native clang, and wasm is slightly faster, at 5% slower.

@lukehutch
Copy link

@kripken thanks for confirming this. What is the reason for the remaining 5% overhead? Does it come down to reduced opportunities for optimization in wasm SSA form code? Or overhead of sandboxing?

@kripken
Copy link
Member Author

kripken commented Feb 4, 2017

Could be those, but it could also be noise - gcc and clang have differences between them too, often much larger. Sometimes one register allocator happens to work better on a particular function, etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants