Switch the runtime from wasmi to wasmtime + cranelift #1700
Conversation
First batch of comments. More to follow.
Another batch
The latest batch of comments. Sorry this review is taking a while, there's quite a lot to go over.
```rust
let ptr = self.arena_start_ptr as usize;

// Safe because we are accessing and immediately dropping the reference to the data.
unsafe { self.memory.data_unchecked_mut()[ptr..(ptr + bytes.len())].copy_from_slice(bytes) }
```
I audited this for safety and I think it checks out. The situation may be a bit more tenuous than the comment implies, though. The comment implies that the safety of this operation can be audited locally, as if only the code within the `unsafe` block matters. This is not true.

Unfortunately, all the code that has access to the path `WasmInstance::memory: Memory` (which includes the set of all code that might access `WasmInstance::instance: wasmtime::Instance` and `WasmInstance::memory_allocate`) is part of the surface area of the unsafety, because these are paths through which an additional reference (shared or not) could be created that aliases this borrow. Note that the `data_unchecked_mut` method doesn't even require `&mut self`! This is one of the most unsafe APIs I've seen in the wild in some time.

This is concerning because this data is accessible by host exports! It is critical to security because this is exactly the kind of place an attacker would target to try to read memory owned by the indexer (like a private key, which we need to make sure an indexer does not have).

Is there a way to sandbox all paths to this memory so that less code has to be audited? Ideally, we could drop the module and the memory into something rock-solid with less than 100 lines of surface area.
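One possible direction, sketched below, is a tiny wrapper that owns the `Memory` handle and only exposes copy-in/copy-out methods, so that no reference into the wasm linear memory can escape to safe code and the two methods are the entire unsafe surface to audit. This is not the PR's code: `ArenaMemory`, `write` and `read` are hypothetical names; `data_unchecked_mut` is the accessor from the diff above and `data_unchecked` is assumed to be its read-only counterpart.

```rust
// Sketch only: confine all access to the wasm linear memory behind two methods.
struct ArenaMemory {
    memory: wasmtime::Memory,
}

impl ArenaMemory {
    /// Copy `bytes` into linear memory starting at `ptr`.
    fn write(&self, ptr: usize, bytes: &[u8]) {
        // Safety: the mutable slice is created and dropped within this call;
        // no other code may hold or leak a reference into this memory.
        unsafe {
            self.memory.data_unchecked_mut()[ptr..ptr + bytes.len()].copy_from_slice(bytes);
        }
    }

    /// Copy `size` bytes out of linear memory starting at `offset`.
    fn read(&self, offset: usize, size: usize) -> Vec<u8> {
        // Safety: same invariant; the shared borrow never escapes this call.
        unsafe { self.memory.data_unchecked()[offset..offset + size].to_vec() }
    }
}
```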
This comment really is insufficient, so I'll explain my reasoning. First, `instance` and `memory` are `!Sync`, at least for now, so these unsafe blocks can't be called concurrently.

Given that, the only way this API can cause unsoundness outside an `unsafe` block is if that block leaks a reference to the memory to safe code. We can adopt as an invariant that no unsafe block does this, documenting and auditing this in the two unsafe blocks we have. The beauty is that if this invariant is locally true for all unsafe blocks, then all our code is safe.

Checking all the paths that use `Instance`, tracking usage of the memory along those paths, and concluding that the code really is sound is one way to audit this, but imo it's the more complicated way, even if we use privacy to further restrict the paths that can access the instance. We can take advantage of the particularities of this API to provide a tighter argument for soundness.
Let's start by implementing `!Sync` for `WasmiInstance` to make sure this situation doesn't change under our feet.
I was definitely taking this too lightly. I've added a doc test that `WasmInstance` is `!Sync`, and written a comment that describes our reasoning for why this is sound. Thanks for verifying this and helping me better elaborate on the safety here.
runtime/wasm/src/module/mod.rs
```rust
// This task is likely to outlive the instance, which is fine.
let interrupt_handle = instance.store().interrupt_handle().unwrap();
let should_interrupt = should_interrupt.clone();
graph::spawn(async move {
```
It is possible to construct a subgraph that intentionally avoids the timeout and enters an infinite loop. There are several problems here:

- Calling `ipfs_map` once is enough to disable the timeout forever. The timeout is never restored, so the subgraph could enter an infinite loop right after this.
- Even if the timeout was restored, this check never triggers a second time if it happens to run while the timeout was disabled.
- The use of atomics is overkill for something that is not on the hot path.
- The code that disables `should_interrupt` is limited to `ipfs_map`, but does not cover e.g. `ipfs_cat`, which may also want to disable the interrupt.
- Even if the interrupt were disabled for just the duration of the `ipfs_map` call and we turned it back on at the end, code spawned in other modules needs to count toward the limit.

We discussed outside the context of this PR that we will want to handle timeouts in a more deterministic way, but before I get to writing that RFC I think we need to do better here, since this is an attack vector as-is.
I suggest writing something like a shared `Arc<Mutex<Stopwatch>>` that can be paused only for the duration of time that is external to any possible wasm code executing (e.g. paused inside the ipfs reading code, but running during `ipfs_map` callbacks). Or maybe just don't disable the timeout at all.
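For concreteness, here is a minimal sketch of such a pausable stopwatch. It is not the PR's code; the type name and fields merely echo the `stopwatch.rs` file that shows up later in this review. The watchdog task would hold an `Arc<Mutex<TimeoutStopwatch>>` clone and only count time while the stopwatch is running.

```rust
use std::time::{Duration, Instant};

pub struct TimeoutStopwatch {
    // Time accumulated while running, up to the most recent `stop()`.
    elapsed: Duration,
    // `Some` while the stopwatch is running, `None` while it is paused.
    start_time: Option<Instant>,
}

impl TimeoutStopwatch {
    pub fn start_new() -> Self {
        TimeoutStopwatch {
            elapsed: Duration::from_secs(0),
            start_time: Some(Instant::now()),
        }
    }

    /// Total running time, excluding paused periods.
    pub fn elapsed(&self) -> Duration {
        match self.start_time {
            Some(start) => self.elapsed + start.elapsed(),
            None => self.elapsed,
        }
    }

    /// Pause: bank the time accrued so far.
    pub fn stop(&mut self) {
        self.elapsed = self.elapsed();
        self.start_time = None;
    }

    /// Resume after a pause; a no-op if already running.
    pub fn start(&mut self) {
        if self.start_time.is_none() {
            self.start_time = Some(Instant::now());
        }
    }
}
```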
Last batch of comments for this pass
```rust
}

impl<E> HostError for HostExportError<E> where E: fmt::Debug + fmt::Display + Send + Sync + 'static {}
fn get(&self, offset: u32, size: u32) -> Vec<u8> {
```
Are we sure that this is deterministic? I think it probably is, but I want to be sure. Imagine that a subgraph reads from a specific memory address that was not given to it by a previous call to `raw_new`. Does WebAssembly guarantee that all of the memory layout is deterministic? Will something like upgrading the wasmtime dependency possibly affect that memory layout? Is some of the memory returned by `instance.get_memory("memory")` controlled by wasmtime? Where did that name come from, and why didn't we use `Memory::new()` to limit this to just the arena?
Can you access the stack with this memory? Can the WebAssembly compiler introduce optimizations that would eliminate dead code, and thereby eliminate parts of the stack that might be read by some other function?
I went digging to find good answers to your questions.
The store represents the global state of a Wasm module, and this includes the memory. In the future there may be multiple memories, but currently there is only one memory per module, and at least AS calls it `"memory"`. There exists the option of having AS import a memory that we initialize ourselves instead of letting it initialize its own memory, but that's not what we've done so far and I'm not sure what difference that makes. There is no uninitialized memory in Wasm; memory is zeroed when initialized or grown.

The Wasm stack is separate from the store and its memory, so it cannot be accessed by the host.

Wasm is deterministic, and what we're doing from the host here is no different from what a Wasm load or store instruction does. I see no indication that accessing memory from the host should be any different from accessing it from Wasm.

And to your point on optimizations: Cranelift is a general-purpose optimizing compiler, and in principle that is scary for determinism. As a minor point, the optimization level is set to the default of `None`, so dead code elimination isn't performed, but that could change. The important thing is that even a general-purpose optimizing compiler should not violate the spec when compiling Wasm. From what I could see, Cranelift considers any store to be a side effect and therefore not a candidate for dead code elimination. I did find one mention of dead store removal, but I think that's just a mistake in the comment, which I asked about here: bytecodealliance/wasmtime#1828. It seems that while it's not implemented, it's theoretically possible. But we can assume that if a store is visible to a host function, then it's not a dead store.

All that said, there is an alternative backend for wasmtime called Lightbeam, which is overall a more conservative backend and is what Substrate uses. I didn't try it because it doesn't support NaN canonicalization.
I am satisfied that this is fully deterministic. Consider putting some of your learnings in the comment.
I've added some comments around memory and cranelift, and also explicitly set `OptLevel::None`.
runtime/wasm/src/host.rs
```rust
// (diff excerpt: `bail!` is replaced by `ensure!`)
bail!(
ensure!(
    !handlers.is_empty(),
    format!(
```
There is no need to trigger another round of review on account of this, but `ensure!` already supports formatting, so there is no need for the `format!` here.
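For illustration, a minimal sketch of what the reviewer means; the message and variable names are made up, and anyhow's `ensure!` is assumed (failure's version accepts format arguments the same way).

```rust
use anyhow::{ensure, Result};

// `ensure!` takes format arguments directly, so no inner `format!` is needed.
fn check_handlers(handlers: &[String], data_source: &str) -> Result<()> {
    ensure!(
        !handlers.is_empty(),
        "data source `{}` defines no event handlers",
        data_source
    );
    Ok(())
}
```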
```rust
let mut config = wasmtime::Config::new();
config.strategy(wasmtime::Strategy::Cranelift).unwrap();
config.interruptable(true); // For timeouts.
config.cranelift_nan_canonicalization(true); // For NaN determinism.
config.cranelift_opt_level(wasmtime::OptLevel::None);
```
I don't think this will be necessary to do, but I won't complain much. As a matter of curiosity, I wouldn't be surprised if this ended up being faster anyway, if a lot of the code in a trigger is run only once per compile.
Yeah, some mappings might be worse off overall in the runtime/compile-time tradeoff.
```rust
let flags = self.asc_get(flags);

// Pause the timeout while running ipfs_map
self.timeout_stopwatch.lock().unwrap().stop();
```
This needs to use a scope-based drop API (I've had good luck with the simplicity of the `defer` crate). As it is currently, if the inner function panics or returns an error then the stopwatch is never restarted, which will result in the timeout task being leaked.
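A minimal hand-rolled sketch of the scope-based pattern being asked for; the `defer` crate packages the same idea behind a nicer API. `TimeoutStopwatch` refers to the stopwatch sketched earlier in this review, and `with_timeout_paused` is a made-up name.

```rust
use std::sync::{Arc, Mutex};

// Restart the stopwatch when this guard is dropped, which also happens on an
// early `?` return or a panic inside the paused section.
struct RestartOnDrop(Arc<Mutex<TimeoutStopwatch>>);

impl Drop for RestartOnDrop {
    fn drop(&mut self) {
        // Tolerate a poisoned lock: we may already be unwinding from a panic.
        if let Ok(mut stopwatch) = self.0.lock() {
            stopwatch.start();
        }
    }
}

fn with_timeout_paused<T>(
    stopwatch: &Arc<Mutex<TimeoutStopwatch>>,
    work: impl FnOnce() -> Result<T, anyhow::Error>,
) -> Result<T, anyhow::Error> {
    stopwatch.lock().unwrap().stop();
    let _restart = RestartOnDrop(stopwatch.clone());
    work() // the stopwatch is restarted no matter how `work` exits
}
```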
```rust
let time_left = timeout.checked_sub(timeout_stopwatch.lock().unwrap().elapsed());
match time_left {
    None => break interrupt_handle.interrupt(), // Timed out.
    Some(time) => tokio::time::delay_for(time).await,
```
Consider having a minimum wait time. There is a diabolical case where the stopwatch pauses with very little time left and this becomes an expensive spin loop that starves other futures. The "right" way to implement this is to wake up when the stopwatch is unpaused, but I think you want to expend as little effort as possible, and the minimum delay seems simpler.
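A sketch of the watchdog loop with such a minimum wait, following the quoted snippet. The 100ms value is an arbitrary assumption, `TimeoutStopwatch` is the stopwatch sketched earlier, and `interrupt` stands in for `interrupt_handle.interrupt()`.

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

const MIN_DELAY: Duration = Duration::from_millis(100);

async fn timeout_watchdog(
    timeout: Duration,
    stopwatch: Arc<Mutex<TimeoutStopwatch>>,
    interrupt: impl Fn(),
) {
    loop {
        match timeout.checked_sub(stopwatch.lock().unwrap().elapsed()) {
            None => break interrupt(), // Timed out.
            // Never sleep for less than MIN_DELAY, so a paused stopwatch with
            // an almost exhausted budget cannot turn this into a busy loop.
            Some(time_left) => tokio::time::delay_for(time_left.max(MIN_DELAY)).await,
        }
    }
}
```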
runtime/wasm/src/module/stopwatch.rs
```rust
pub fn stop(&mut self) {
    self.elapsed = self.elapsed();
    self.start_time = None;
    self.split_time = None;
```
I don't care, but as long as we're making changes, note that `split_time` is unused. I'm surprised there is no warning for this.
@That3Percent Ok, that really wasn't hard, thanks to `defer` which is really nifty. Time for the final 🤞 review?
👏
runtime/wasm/src/module/test.rs
```rust
// Ipfs host functions use `block_on` which must be called from a sync context.
tokio::task::block_in_place(|| {
```
This is the problem with these APIs… there are at least 3 functions that sound like the thing you might want: `spawn_blocking`, `block_on`, and `block_in_place`. But all of these require global knowledge of everything happening both up the stack and down the stack, and have some very subtle problems and differences in their semantics.

I'm racking my brain trying to figure out what the right approach is, but I think the best we can get at the moment may be to replicate what graph-node is actually doing and put modules in their own thread, to replicate the circumstances that `ipfs.map` is called in, since that is very important to whether this test reflects how the implementation will behave in production.

Thinking longer-term, the way that web APIs have been designed is that there are no async functions with sync APIs. Full stop. There is no "readFile()", only "readFile(callback)". Because of this, calling async functions within a mapping is a no-no.

The only way we can fix this is to embrace this in our own APIs. With IPFS this is relatively easy: since the file becomes its own data source, the problem is solved automatically because the handler becomes the callback. The only sticky part will be the `entity_get` API. There are 2 approaches here. We could change the `entity_get` API to literally take a callback, but that would break a large number of existing subgraphs for reasons our community may find difficult to understand. The preferred but larger-effort way is that it should be possible to inspect and re-write the WebAssembly in terms of callbacks, in exactly the same way that would happen if we imagine that this was an async method and any call to `entity_get` had an implicit `.await`. It seems like a pretty big lift, but it's the only way I can think of to get this to behave correctly and retain the API we have.
@Jannis I'd be interested in your opinion of the above for a long-term solution.
What the ideal threading/task architecture for us is remains an open question, but for the purpose of fixing these tests I believe `block_in_place` not working is a bug: tokio-rs/tokio#2603. I've switched to using an OS thread in the ipfs tests, to replicate what we do in the real code.

Given the problems we just had with futures `block_on`, and on the principle of not mixing executor libraries, I've switched our functions in …
runtime/wasm/src/module/test/abi.rs
```rust
let res: Result<(), _> = func();
assert!(res.unwrap_err().to_string().contains(TRAP_TIMEOUT));
}
// #[tokio::test(threaded_scheduler)]
```
Now I'm really suspicious that this has been commented out.
Pure accident; it's a bit slower, so I commented it out to run all tests faster when attempting to reproduce the failure.
Fixed, thanks for catching.
```rust
// Ipfs host functions use `block_on` which must be called from a sync context,
// so we replicate what we do in `spawn_module`.
std::thread::spawn(move || {
```
I'm getting fatigued on this PR, so as a future improvement let's put the spawning of the runtime, module, and thread behind a single API to replace the `test_module` API, and just do this in one place for all tests.
That would make sense.
graph/src/task_spawn.rs
```rust
//! underlying executor implementation which currently is tokio. This serves a few purposes:
//! - Avoid depending directly on tokio APIs, making upgrades or a potential switch easier.
//! - Reflect our chosen default semantics of aborting on task panic, offering `*_allow_panic`
//!   functoins to opt out of that.
```
typo -> functions
🎊
@That3Percent Thanks for the intense review and help, merging!
This is primarily so we can use NaN canonicalization to be certain that wasm execution will be deterministic, but it has other advantages. wasmi was only passively maintained, while wasmtime is an incredibly active project, so we'll have access to any cutting-edge wasm features we might need. Wasmtime has backtraces on wasm traps, which should help with debugging subgraph errors.

On performance, wasmtime is faster since it's a JIT engine while wasmi is an interpreter. Timeout checks are also more efficient in wasmtime. Testing with the moloch subgraph, wasmtime spent a total of 150ms in compilation and 100ms in execution of wasm code, while wasmi spent 1 second in execution. This subgraph takes only a couple of minutes to sync, so it's not very representative, but it shows that wasm execution got way faster. It was already relatively fast, though, so this shouldn't be much of an overall sync-time gain for most subgraphs. But now I'd proclaim that the cost of wasm execution is totally negligible for most subgraphs. We'll see if this shows up in the index node CPU usage; I believe it should.
Wasmtime uses `anyhow` as an error library, so I started using it in the runtime and migrated some code in other crates from `failure` to `anyhow`. Working with anyhow has been great: you can use `context` to add information to errors or turn options into errors. It's just important to remember to format the error with `{:#}` so the full chain of causes is printed (see the short example at the end of this description).

With a bit of refactoring in `ipfs_map` and tests, `block_on_allow_panic` is now just a proxy to `futures::block_on`, as it should be.

The `big_int_to_i32` and `i32_to_big_int` host exports are removed since graph-ts stopped depending on those over a year ago.

On the implementation, the main interface to instantiate a wasmtime `Instance` is through a `Linker`. Unlike wasmi, it's not easy to give host exports access to the context they need, so the `WasmInstance` needed to be wrapped in an `Rc<RefCell<...>>` to be shared. This is returned as a `WasmInstanceHandle`, which manages the instance and makes sure it's dropped. The `link!` macro is a convenience for adding host exports. Timeouts are delivered through the asynchronous wasmtime interrupts, so each module has a tokio task which serves as a watchdog to interrupt the instance after the timeout elapses.
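A small sketch of the anyhow patterns mentioned in the first paragraph above; the file name and error messages are made up.

```rust
use anyhow::{anyhow, Context, Result};

fn read_manifest(path: &str) -> Result<String> {
    // `with_context` annotates the underlying I/O error.
    let contents = std::fs::read_to_string(path)
        .with_context(|| format!("failed to read manifest `{}`", path))?;
    // `anyhow!` turns an Option into an error.
    contents
        .lines()
        .next()
        .map(|line| line.to_string())
        .ok_or_else(|| anyhow!("manifest `{}` is empty", path))
}

fn main() {
    if let Err(e) = read_manifest("does-not-exist.yaml") {
        // `{}` prints only the outermost message; `{:#}` prints the full
        // chain of causes separated by ": ".
        println!("{:#}", e);
    }
}
```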