Wasmtime: add one-entry call-indirect caching. #8509

Merged
merged 9 commits into main from call-indirect-caching on May 2, 2024

Conversation


@cfallin cfallin commented Apr 29, 2024

In WebAssembly, an indirect call is somewhat slow, because of the indirection required by CFI (control-flow integrity) sandboxing. In particular, a "function pointer" in most source languages compiled to Wasm is represented by an index into a table of funcrefs. The call_indirect instruction then has to do the following steps to invoke a function pointer:

  • Load the funcref table's base and length values from the vmctx.
  • Bounds-check the invoked index against the actual table size; trap if out-of-bounds.
  • Spectre mitigation (cmove) on that bounds-check.
  • Load the vmfuncref from the table given base and index.
    • For lazy table init, check if this is a non-initialized funcref pointer, and initialize the entry.
  • Load the signature from the funcref struct and compare it against the call_indirect's expected signature; trap if wrong.
  • Load the actual code pointer for the callee's Wasm-ABI entry point.
  • Load the callee vmctx (which may be different for a cross-module call).
  • Put that vmctx in arg 0, our vmctx in arg 1, and invoke the loaded code pointer with an indirect call instruction.
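
As a rough illustration, here is the same sequence modeled as plain Rust. This is a sketch for exposition only: `SigId`, `VmFuncRef`, `Vmctx`, and `lazy_init` are simplified stand-ins, not Wasmtime's actual definitions, and the Spectre cmove has no direct analogue in safe Rust.

```rust
// Simplified stand-ins for Wasmtime's internal types (illustrative only).
#[derive(Clone, Copy, PartialEq, Eq)]
struct SigId(u32);

// Wasm-ABI entry point: (callee vmctx, caller vmctx) -> result.
type CodePtr = fn(usize, usize) -> i32;

#[derive(Clone, Copy)]
struct VmFuncRef {
    sig: SigId,    // signature id, compared against the callsite's expectation
    code: CodePtr, // code pointer for the callee's Wasm-ABI entry point
    vmctx: usize,  // callee vmctx (may differ for a cross-module call)
}

struct Vmctx {
    table: Vec<Option<VmFuncRef>>, // funcref table; None = not yet initialized
}

// Hypothetical stand-in for lazy table initialization.
fn lazy_init(_index: usize) -> VmFuncRef {
    VmFuncRef { sig: SigId(0), code: |_, _| 0, vmctx: 0 }
}

fn call_indirect(caller: &mut Vmctx, index: usize, expected_sig: SigId) -> Result<i32, &'static str> {
    let caller_vmctx = caller as *mut Vmctx as usize;
    // Load the table and bounds-check the invoked index; trap if out-of-bounds.
    let entry = caller.table.get_mut(index).ok_or("trap: table out of bounds")?;
    // Load the funcref, initializing the entry first if lazy init hasn't run.
    let funcref = *entry.get_or_insert_with(|| lazy_init(index));
    // Compare signatures; trap if wrong.
    if funcref.sig != expected_sig {
        return Err("trap: indirect call type mismatch");
    }
    // Invoke the loaded code pointer: callee vmctx in arg 0, ours in arg 1.
    Ok((funcref.code)(funcref.vmctx, caller_vmctx))
}
```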

Compare and contrast to the process involved in invoking a native function pointer:

  • Invoke the code pointer with an indirect call instruction.

This overhead buys us something -- it is part of the SFI sandbox boundary -- but it is very repetitive and unnecessary work in most cases when indirect function calls are performed repeatedly (such as within an inner loop).

This PR introduces the idea of caching: if we know that the result of all the above checks won't change, then if we use the same index as "the last time" (for some definition), we can skip straight to the "invoke the code pointer" step, with a cached code pointer from that last time.

Concretely, it introduces a two-word struct inlined into the vmctx for each call_indirect instruction in the module (up to a limit):

  • The last invoked index;
  • The code pointer that index corresponded to.
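
In Rust-ish terms (a sketch; the field names are illustrative, and the real slots are raw words laid out in the vmctx by the vmoffsets machinery, not a Rust struct):

```rust
/// One cache slot per `call_indirect` callsite, inlined into the vmctx.
#[repr(C)]
struct CallIndirectCacheSlot {
    last_index: usize, // cache key: the last invoked table index
    code_ptr: usize,   // cache value: raw code pointer for that index
}
```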

When compiling the module, we check whether the table could possibly be mutated at a given index after it is first read: mutation could come from an instruction like `table.set`, or from the outside if the whole table is exported (and thus writable). We also check whether index 0 is a non-null funcref. If neither of these things is true, then we know we can cache an index-to-code-pointer mapping, and we know we can use index 0 as a sentinel for "no cached value".

We then make use of this struct at each indirect call site, generating code that checks whether the index matches the cached one: if so, we call the cached pointer directly; if not, we load the vmfuncref, check the signature, check that the callee vmctx is the same as the caller's (the intra-module case), stash the code pointer and index away (filling the cache), and then make the call.
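
Continuing the illustrative sketch from earlier (reusing its stand-in types), the per-callsite logic is roughly as follows; in reality Wasmtime emits this as Cranelift IR at each callsite rather than calling a Rust function:

```rust
fn call_indirect_cached(
    caller: &mut Vmctx,
    slot: &mut CallIndirectCacheSlot, // really lives inside the vmctx itself
    index: usize,
    expected_sig: SigId,
) -> Result<i32, &'static str> {
    let caller_vmctx = caller as *mut Vmctx as usize;

    // Fast path: cache hit. No bounds, lazy-init, or signature checks needed:
    // the fill below runs only after they all passed, and the compile-time
    // analysis guarantees the table entry cannot have changed since. Index 0
    // is the "empty" sentinel, hence the extra `index != 0` test.
    if index == slot.last_index && index != 0 {
        let code: CodePtr = unsafe { std::mem::transmute(slot.code_ptr) };
        return Ok(code(caller_vmctx, caller_vmctx)); // intra-module: same vmctx
    }

    // Slow path: the full check sequence from the first sketch.
    let entry = caller.table.get_mut(index).ok_or("trap: table out of bounds")?;
    let funcref = *entry.get_or_insert_with(|| lazy_init(index));
    if funcref.sig != expected_sig {
        return Err("trap: indirect call type mismatch");
    }

    // Fill the cache only for intra-module callees (callee vmctx == caller
    // vmctx), so the fast path never needs to re-load a callee vmctx.
    if funcref.vmctx == caller_vmctx {
        slot.last_index = index;
        slot.code_ptr = funcref.code as usize;
    }
    Ok((funcref.code)(funcref.vmctx, caller_vmctx))
}
```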

On an in-development branch of SpiderMonkey-in-Wasm with ICs (using indirect calls), this is about a 20% speedup; I haven't yet measured on other benchmarks. There may also be an instantiation-time slowdown due to a larger vmctx (but we could use madvise to zero it if needed).

This feature is off by default right now.

@cfallin cfallin requested a review from a team as a code owner April 29, 2024 22:44
@cfallin cfallin requested review from fitzgen and removed request for a team April 29, 2024 22:44

cfallin commented Apr 29, 2024

(I'll run some benchmarks to see what the compile-time impact is like, and how this benefits non-SpiderMonkey workloads; TBD.)

@cfallin cfallin force-pushed the call-indirect-caching branch from 5981cc3 to 08d4364 Compare April 29, 2024 22:56

cfallin commented Apr 29, 2024

Benchmarks:

Sightglass measures the following for spidermonkey.wasm:
compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1456249767.40 ± 1315409020.02 (confidence = 99%)

  original.so is 1.01x to 1.17x faster than cache-indirect-calls.so!

  [17934023164 18070879867.60 18215783322] cache-indirect-calls.so
  [15377276916 16614630100.20 17840489432] original.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 27069455.80 ± 22219096.46 (confidence = 99%)

  original.so is 1.00x to 1.05x faster than cache-indirect-calls.so!

  [1079208474 1095097604.00 1142114396] cache-indirect-calls.so
  [1035034082 1068028148.20 1094035960] original.so

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [834480 856949.40 902120] cache-indirect-calls.so
  [831744 860156.60 978994] original.so

So compilation time does take a little hit, but my system is noisy: the slowdown is somewhere between 1% and 17%, with the mean around 8.7%. Instantiation is apparently unaffected. And execution time... slightly slower? (But again, noisy system.)

I tried running the next benchmark in the list (pulldown-cmark) and Sightglass crashed mysteriously, so on to trusty hyperfine: no deltas in runtime on spidermonkey or bz2, my two usual suspects.

I suspect we'll see mostly in-the-noise results on most benchmarks, because indirect calls are somewhat rare in C-ish code; it will matter mainly when compiling dynamic languages, or something like Java with a lot of virtual calls. The speedup on weval'd SpiderMonkey is real though, so I'd like to at least have this in tree as an option :-)

@github-actions github-actions bot added wasmtime:api Related to the API of the `wasmtime` crate itself wasmtime:config Issues related to the configuration of Wasmtime winch Winch issues or pull requests labels Apr 30, 2024



fitzgen commented Apr 30, 2024

FWIW, I did a 100-iteration sightglass run of the default suite and got these results:

instantiation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [472878 584750.13 1435913] cache.so
  [471975 647599.28 1762837] main.so

execution :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [7422842 8744920.24 28196476] cache.so
  [7439782 9281270.19 32569639] main.so

instantiation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [165456 245749.33 728069] cache.so
  [161764 251740.72 723740] main.so

execution :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [95484875 104221743.11 196171026] cache.so
  [95454637 105918836.28 208434699] main.so

compilation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [358282063 404458268.52 568579026] cache.so
  [367557812 399158971.64 551388853] main.so

compilation :: cycles :: benchmarks/bz2/benchmark.wasm

  No difference in performance.

  [194765165 225581880.12 437412507] cache.so
  [197703277 222917937.46 281571195] main.so

execution :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [895587497 923280792.22 972838288] cache.so
  [891917856 927697965.55 1121115203] main.so

compilation :: cycles :: benchmarks/spidermonkey/benchmark.wasm

  No difference in performance.

  [8325711261 8502522129.94 8732981273] cache.so
  [8370183347 8530524550.49 9984684780] main.so

instantiation :: cycles :: benchmarks/pulldown-cmark/benchmark.wasm

  No difference in performance.

  [214742 289658.22 800685] cache.so
  [204642 290351.57 854625] main.so

I think it is safe to say that this doesn't have an effect on our regular suite, and I expect that the JS speedups you are reporting rely on weval'ing first.

@fitzgen fitzgen left a comment

Overall looks great, but I have a few nitpicks, mostly related to improving things for future code readers. Thanks!

Files with review threads (all since resolved):

  • crates/environ/src/compile/module_environ.rs
  • crates/environ/src/vmoffsets.rs (two threads)
  • crates/wasmtime/src/config.rs
  • crates/cranelift/src/func_environ.rs (two threads)

cfallin commented Apr 30, 2024

Updated, let me know what you think!


cfallin commented Apr 30, 2024

(Oh, and forgot to mention: thanks for the thorough benchmarks on this! I've gotta dig into why Sightglass is so finicky for me...)

@github-actions github-actions bot added fuzzing Issues related to our fuzzing infrastructure wasmtime:ref-types Issues related to reference types and GC in Wasmtime labels Apr 30, 2024


fitzgen commented May 1, 2024

Oh also, I mentioned this in the Cranelift meeting, but wanted to put it down here for posterity too: it seems like the limit on the number of caches isn't actually implemented.


cfallin commented May 1, 2024

Yep, I missed that entirely in this iteration of the patch -- sorry about that! Just updated. I ended up wiring the limit through to a new config option (with docs, CLI flag, fuzzing) so that I could easily test it without a huge test input.

@cfallin cfallin force-pushed the call-indirect-caching branch from 3820843 to a238300 Compare May 1, 2024 16:55
@fitzgen fitzgen left a comment

🚀

Comment on lines +1152 to +1172

    /// Get the address of the call-indirect cache slot for a given callsite.
    pub fn call_indirect_cache_slot_addr(
        &mut self,
        call_site: CallIndirectSiteIndex,
        vmctx: ir::Value,
    ) -> ir::Value {
        let offset = self.env.offsets.vmctx_call_indirect_cache(call_site);
        self.builder.ins().iadd_imm(vmctx, i64::from(offset))
    }

    /// Load the cached index and code pointer for an indirect call.
    ///
    /// Generates IR like:
    ///
    /// ```ignore
    /// v1 = load.i64 cache_ptr+0 ;; cached index (cache key)
    /// v2 = load.i64 cache_ptr+8 ;; cached raw code pointer (cache value)
    /// ```
    ///
    /// and returns `(index, code_ptr)` (e.g. from above, `(v1, v2)`).
    fn load_cached_indirect_index_and_code_ptr(

This is great, thank you. I really appreciate having this stuff factored out into bite-sized chunks.

@fitzgen fitzgen added this pull request to the merge queue May 1, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks May 1, 2024

cfallin commented May 1, 2024

The merge queue did its job (nice!) and found a conflict with #8515, which merged after this PR's CI ran. I pushed a fix which is fairly small, but @fitzgen maybe PTAL at least at the last commit just to be sure?


cfallin commented May 1, 2024

(or @jameysharp feel free to review last commit too as it interacts with your change in #8515!)

Files with review threads (since resolved):

  • crates/types/src/lib.rs (two threads)
@cfallin cfallin enabled auto-merge May 2, 2024 06:12
@cfallin cfallin added this pull request to the merge queue May 2, 2024
Merged via the queue into bytecodealliance:main with commit 01c07a1 May 2, 2024
21 checks passed
@cfallin cfallin deleted the call-indirect-caching branch May 2, 2024 06:49
Comment on lines +247 to +255
/// Whether to enable call-indirect caching.
pub cache_call_indirects: Option<bool>,
/// The maximum call-indirect cache slot count.
///
/// One slot is allocated per indirect callsite; if the module
/// has more indirect callsites than this limit, then the
/// first callsites in linear order in the code section, up to
/// the limit, will receive a cache slot.
pub max_call_indirect_cache_slots: Option<usize>,

Could these be moved to -O options as "tuning" or "optimization" options?

The WasmOptions group here is intended for options with semantically visible effects on generated code, and these options shouldn't have any semantic effect.

(thanks for adding CLI flags though!)

cfallin replied:

Ah, yeah, for sure -- will do in a followup PR.
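
For embedders, opting in would look roughly like the sketch below. The `Config` method names here are assumptions mirroring the option names above (`cache_call_indirects`, `max_call_indirect_cache_slots`); check the `wasmtime::Config` documentation at this commit for the authoritative API:

```rust
use wasmtime::{Config, Engine};

fn main() -> wasmtime::Result<()> {
    let mut config = Config::new();
    // Assumed method names mirroring the options above; the feature is
    // off by default, so it must be enabled explicitly.
    config.cache_call_indirects(true);
    // Cap the number of per-callsite cache slots to bound vmctx growth.
    config.max_call_indirect_cache_slots(10_000);
    let _engine = Engine::new(&config)?;
    Ok(())
}
```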

cfallin added a commit to cfallin/wasmtime that referenced this pull request May 3, 2024
As per [this comment], the call-indirect caching options should really
be in the `-O` option namespace (for optimizations), not `-W` (for
semantically-visible Wasm standards options); the original PR simply put
them in the wrong place, and this PR moves them.

[this comment]: bytecodealliance#8509 (comment)
github-merge-queue bot pushed a commit that referenced this pull request May 3, 2024
As per [this comment], the call-indirect caching options should really
be in the `-O` option namespace (for optimizations), not `-W` (for
semantically-visible Wasm standards options); the original PR simply put
them in the wrong place, and this PR moves them.

[this comment]: #8509 (comment)
cfallin added a commit to cfallin/wasmtime that referenced this pull request May 6, 2024