Wasmtime: add one-entry call-indirect caching. #8509
Conversation
(I'll run some benchmarks to see what the compile-time impact is like, and how this benefits non-SpiderMonkey workloads; TBD.)
In WebAssembly, an indirect call is somewhat slow because of the indirection required by CFI (control-flow integrity) sandboxing. In particular, a "function pointer" in most source languages compiled to Wasm is represented by an index into a table of funcrefs. The `call_indirect` instruction then has to do the following steps to invoke a function pointer:

- Load the funcref table's base and length values from the vmctx.
- Bounds-check the invoked index against the actual table size; trap if out-of-bounds.
- Apply a Spectre mitigation (cmove) to that bounds-check.
- Load the `vmfuncref` from the table given the base and index.
- For lazy table init, check whether this is a non-initialized funcref pointer, and initialize the entry if so.
- Load the signature from the funcref struct and compare it against the `call_indirect`'s expected signature; trap if it is wrong.
- Load the actual code pointer for the callee's Wasm-ABI entry point.
- Load the callee vmctx (which may be different for a cross-module call).
- Put that vmctx in arg 0 and our vmctx in arg 1, then invoke the loaded code pointer with an indirect call instruction.

Compare and contrast with the process involved in invoking a native function pointer:

- Invoke the code pointer with an indirect call instruction.

This overhead buys us something -- it is part of the SFI sandbox boundary -- but it is very repetitive and unnecessary work in *most* cases when indirect function calls are performed repeatedly (such as within an inner loop).

This PR introduces the idea of *caching*: if we know that the result of all the above checks won't change, then if we use the same index as "the last time" (for some definition), we can skip straight to the "invoke the code pointer" step, with a cached code pointer from that last time.

Concretely, it introduces a two-word struct inlined into the vmctx for each `call_indirect` instruction in the module (up to a limit):

- the last invoked index;
- the code pointer that index corresponded to.

When compiling the module, we check whether the table could possibly be mutable at a given index once read: whether there are any instructions like `table.set`, or whether the whole table is exported and thus writable from the outside. We also check whether index 0 is a non-null funcref. If neither of these things is true, then we know we can cache an index-to-code-pointer mapping, and we know we can use index 0 as a sentinel for "no cached value".

We then make use of the struct for each indirect call site and generate code to check whether the index matches; if so, we call the cached pointer; if not, we load the vmfuncref, check the signature, check that the callee vmctx is the same as the caller's (the intra-module case), stash the code pointer and index away (filling the cache), and then make the call; see the sketch below.

On an in-development branch of SpiderMonkey-in-Wasm with ICs (using indirect calls), this is about a 20% speedup; I haven't yet measured on other benchmarks. It is expected that this might be an instantiation-time slowdown due to a larger vmctx (but we could use madvise to zero it if needed).

This feature is off by default right now.
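To make the scheme concrete, here is a rough sketch of the cache slot layout and the generated fast/slow paths. This is an illustration under assumed names (`CallIndirectCache`, `invoked_index`, and the helper calls in the pseudocode are all hypothetical), not the PR's actual code:

```rust
/// Sketch of the two-word per-callsite cache slot inlined into the vmctx.
#[repr(C)]
struct CallIndirectCache {
    /// Last invoked table index; index 0 doubles as the "no cached value"
    /// sentinel (valid only because index 0 is known to be a null funcref).
    last_index: usize,
    /// Raw code pointer that `last_index` resolved to.
    code_ptr: usize,
}

// Pseudocode for the code generated at each `call_indirect` site:
//
//   if cache.last_index == invoked_index {
//       // Fast path: skip bounds check, lazy init, and signature check.
//       call cache.code_ptr
//   } else {
//       // Slow path: the full sequence of checks described above.
//       funcref = table_get_checked(invoked_index);  // bounds + lazy init
//       check_signature(funcref);                    // trap if mismatched
//       if funcref.vmctx == caller_vmctx {           // intra-module only
//           cache.last_index = invoked_index;        // fill the cache
//           cache.code_ptr   = funcref.code_ptr;
//       }
//       call funcref.code_ptr
//   }
```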
Benchmarks:
So compilation time does take a small hit, but my system is noisy: somewhere between 1% and 17%, with the mean around 8.7%. Instantiation is apparently unaffected. And execution time is... slightly slower? (But again, noisy system.) I tried running the next benchmark in the list (pulldown-cmark) and Sightglass crashed mysteriously, so on to trusty…

I suspect we'll see mostly in-the-noise results for most benchmarks, because indirect calls are somewhat rare in C-ish code; this will matter mostly when compiling dynamic languages, or something like Java with a lot of virtual calls. The speedup on weval'd SpiderMonkey is real, though, so I'd like to at least have this in tree as an option :-)
FWIW, I did a 100-iteration sightglass run of the default suite and got these results:
I think it is safe to say that this doesn't have an effect on our regular suite, and I expect that the JS speedups you are reporting rely on weval'ing first.
Overall looks great, but I have a few nitpicks, mostly related to improving things for future code readers. Thanks!
Updated, let me know what you think!
(Oh, and forgot to mention: thanks for the thorough benchmarks on this! I've gotta dig into why Sightglass is so finicky for me...)
Oh also, I mentioned this in the Cranelift meeting, but wanted to put it down here for posterity too: it seems like the limit on the number of caches isn't actually implemented.
Yep, I missed that entirely in this iteration of the patch -- sorry about that! Just updated. I ended up wiring the limit through to a new config option (with docs, CLI flag, fuzzing) so that I could easily test it without a huge test input. |
🚀
```rust
/// Get the address of the call-indirect cache slot for a given callsite.
pub fn call_indirect_cache_slot_addr(
    &mut self,
    call_site: CallIndirectSiteIndex,
    vmctx: ir::Value,
) -> ir::Value {
    let offset = self.env.offsets.vmctx_call_indirect_cache(call_site);
    self.builder.ins().iadd_imm(vmctx, i64::from(offset))
}
```
````rust
/// Load the cached index and code pointer for an indirect call.
///
/// Generates IR like:
///
/// ```ignore
/// v1 = load.i64 cache_ptr+0 ;; cached index (cache key)
/// v2 = load.i64 cache_ptr+8 ;; cached raw code pointer (cache value)
/// ```
///
/// and returns `(index, code_ptr)` (e.g. from above, `(v1, v2)`).
fn load_cached_indirect_index_and_code_ptr(
````
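The excerpt above cuts off at the signature. A minimal sketch of what the body plausibly does, per the doc comment, using Cranelift's builder API (the `cache_ptr` value and the `trusted` memory flags here are assumptions):

```rust
// Hypothetical sketch of the function body: two loads from the cache
// slot, matching the IR shown in the doc comment above.
let flags = ir::MemFlags::trusted();
let index = self.builder.ins().load(ir::types::I64, flags, cache_ptr, 0);
let code_ptr = self.builder.ins().load(ir::types::I64, flags, cache_ptr, 8);
(index, code_ptr)
```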
This is great, thank you. I really appreciate having this stuff factored out into bite-sized chunks.
(or @jameysharp feel free to review the last commit too, as it interacts with your change in #8515!)
```rust
/// Whether to enable call-indirect caching.
pub cache_call_indirects: Option<bool>,

/// The maximum call-indirect cache slot count.
///
/// One slot is allocated per indirect callsite; if the module
/// has more indirect callsites than this limit, then the
/// first callsites in linear order in the code section, up to
/// the limit, will receive a cache slot.
pub max_call_indirect_cache_slots: Option<usize>,
```
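These fields back the embedder-facing configuration. A hedged usage sketch, assuming `Config` methods that mirror the field names above (the method names and the slot count are assumptions, not confirmed API):

```rust
// Hypothetical usage sketch of the embedder-side configuration.
let mut config = wasmtime::Config::new();
config.cache_call_indirects(true);            // enable the feature (off by default)
config.max_call_indirect_cache_slots(10_000); // cap vmctx growth per module
let engine = wasmtime::Engine::new(&config)?;
```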
Could these be moved to `-O` options as "tuning" or "optimization" options? The `WasmOptions` group here is intended for semantic differences in the generated code, and these options shouldn't affect any generated code semantics. (Thanks for adding CLI flags, though!)
Ah, yeah, for sure -- will do in a followup PR.
As per [this comment], the call-indirect caching options should really be in the `-O` option namespace (for optimizations), not `-W` (for semantically-visible Wasm standards options); the original PR simply put them in the wrong place, and this PR moves them. [this comment]: #8509 (comment)