Fair reusing of wasm runtime instances #3011

pepyakin · 2019-07-03T19:41:05Z

Closes #2051
Closes #2967
Supersedes #2938

Background

In #1345 an optimization for wasm runtime instantiation was implemented. The idea was that we would track the areas using some heuristic (used_size and lowest_used) that were used in the course of runtime execution and then zero them. However, it turned out that:

It assumed that the area where rustc usually puts the data section (i.e. static data, globals, etc) doesn't change. Introduction of mutable static variables in the runtime violated this assumption and ultimately caused Fix wasm module reuse #2051.
Upon a closer inspection of the heuristic, it cleared only the stack leaving the heap untouched. This can potentially (however, unlikely) cause undefined behavior because a compiler can assume and take advantage of the fact that WebAssembly guarantees the memory to be zeroed.

There were a couple attempts to fix it:

Initially, we thought that full instantiation in the background will do and tried to implement this approach in Ensure clean wasm instances #2931, but it turned out that this would make us use Arc instead of Rc in wasmi which would considerably harm the execution performance.
Then we assumed that maybe the initial assumption that @arkpar is not correct and instantiation doesn't take a lot of time. So we came up with Ensure clean wasm instances via synchronous clone. #2938. Initial bencmark results showed potential in this approach, however, it turned out that the algorithm was flawed since it copied only up to used_size (the highest accessed memory location within the execution) and @arkpar was indeed right: there is a lot of unnecessary work being done. The current node-runtime allocates 1MB of stack space, has data section and on top of that we preallocate a lot of memory upfront, and copying and/or zeroing it takes a lot of time. The execution time on generating 1500 blocks almost doubled.

It turned out that it was hard to fix this issue with the given constraints (e.g. preserving determinism, allowing of execution of general wasm modules, etc) without regressing the performance.

One option was to actually ban mutable static variables in the runtime. However, IMO, I think it is hard to ensure that we don't accidentally bring some sort of mutable static globals into the runtime and also it seems that they can be indeed useful.

Solution

Because of the above I decided to take the following approach along with the following tradeoffs:

Implement the mmap backend for linear memory in wasmi Use mmap for allocation wasmi-labs/wasmi#190 and introduce a function for quickly zeroing the memory instance. This allowed us to allocate arbitrary amounts of memory which is zeroed lazily on the first access. erase function allows to basically reallocate. The downside is that mmap is inherently platform dependent and works only on unix systems. E.g. Windows will suffer noticable slowdowns. I also tried to leverage the GlobalAlloc::alloc_zeroed function, but it turns out that it falls back to bzero (analogue of memset specialized for 0) on macOS which gives noticeble slowdowns (approx. 40 secs vs 230 secs for transaction factory of master on MasterTo1 with 1500 blocks). However, we can do the similar trick on Windows since it has the similar APIs.
Extract the data segments (chunks of memory that are copied to the linear memory at the instantiation time) from the wasm module and cache them. At the time when the new wasm runtime instance is requested, the linear memory is zeroed and then the data segments are copied.
Scan the mutable global variables and restore their values when the new wasm runtime instance is requested. This solves the Wasm stack pointer is not restored to its initial value #2967.
Ban the start function in wasm runtime modules. start function is run as the last step of instantiation and theoretically can do arbitrary changes to the wasm memory. This is OK for rustc produced binaries since they don't actually leverage the start function and IMO it is unlikely that it will use them. However, it might pose a problem for languages such as C++ which can have "pre-main" logic such as constructor initialization used in globals. This is not a fundamential problem since we can still run the start function and scan the memory for chunks of memory instead of relying on the static data segments. So this restriction can be lifted in the follow-up PRs.
Require the __heap_base global variable. It turned out that the allocator also depended on the used_size to detect the starting position of the heap which can be problematic. Luckily for us, wasm-ld (LLD) has a convention to expose the __heap_base in the produced binaries which points to the one byte past from the data section which we can use to seed the allocator. This solves potential problems that the used_size heuristic can miss, like BSS. I am not sure how this restriction can be lifted in the future though.

The initial benchmarks shows that there is no noticeble regression introduced in this PR compared to the heuristic approach we were using. Although unexpectedly now substrate performs a little bit better on macOS than on Linux.

I hope that these tradeoffs are reasonable.

cc @arkpar @gavofyork @cmichi

TODO:

test the performance on block import (was blocked transaction factory generates blocks that can not be imported with import-blocks #2977)
deal with the broken libc dependency (see the CI failure)

@pepyakin

Original is from @pepyakin in 3d7b27f. I adapted it to work with the latest master.

…-wasm-instances-v2

# Conflicts: # node/runtime/src/lib.rs

Demi-Marie

This looks good to me.

Demi-Marie · 2019-07-13T03:16:44Z