Wasmtime: add one-entry call-indirect caching. #8509
Conversation
(I'll run some benchmarks to see what the compile-time impact is like, and how this benefits non-SpiderMonkey workloads; TBD.)
In WebAssembly, an indirect call is somewhat slow because of the indirection required by CFI (control-flow integrity) sandboxing. In particular, a "function pointer" in most source languages compiled to Wasm is represented by an index into a table of funcrefs. The `call_indirect` instruction then has to do the following steps to invoke a function pointer:

- Load the funcref table's base and length values from the vmctx.
- Bounds-check the invoked index against the actual table size; trap if out-of-bounds.
- Apply a Spectre mitigation (cmove) to that bounds-check.
- Load the `vmfuncref` from the table given the base and index.
- For lazy table init, check whether this is a non-initialized funcref pointer, and initialize the entry if so.
- Load the signature from the funcref struct and compare it against the `call_indirect`'s expected signature; trap if it is wrong.
- Load the actual code pointer for the callee's Wasm-ABI entry point.
- Load the callee vmctx (which may be different for a cross-module call).
- Put that vmctx in arg 0 and our vmctx in arg 1, then invoke the loaded code pointer with an indirect call instruction.

Compare and contrast with the process involved in invoking a native function pointer:

- Invoke the code pointer with an indirect call instruction.

This overhead buys us something -- it is part of the SFI sandbox boundary -- but it is very repetitive and unnecessary work in *most* cases when indirect function calls are performed repeatedly (such as within an inner loop).

This PR introduces the idea of *caching*: if we know that the result of all the above checks won't change, then if we use the same index as "the last time" (for some definition), we can skip straight to the "invoke the code pointer" step, with a cached code pointer from that last time.

Concretely, it introduces a two-word struct inlined into the vmctx for each `call_indirect` instruction in the module (up to a limit):

- the last invoked index;
- the code pointer that index corresponded to.

When compiling the module, we check whether the table could possibly be mutable at a given index once read: whether there are any instructions like `table.set`, or whether the whole table is exported and thus writable from the outside. We also check whether index 0 is a non-null funcref. If neither of these things is true, then we know we can cache an index-to-code-pointer mapping, and we know we can use index 0 as a sentinel for "no cached value".

We then make use of the struct for each indirect call site and generate code to check whether the index matches; if so, we call the cached pointer; if not, we load the vmfuncref, check the signature, check that the callee vmctx is the same as the caller's (the intra-module case), stash the code pointer and index away (filling the cache), and then make the call; see the sketch below.

On an in-development branch of SpiderMonkey-in-Wasm with ICs (using indirect calls), this is about a 20% speedup; I haven't yet measured on other benchmarks. It is expected that this might be an instantiation-time slowdown due to a larger vmctx (but we could use madvise to zero it if needed).

This feature is off by default right now.
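To make the scheme concrete, here is a rough sketch of the cache slot layout and the generated fast/slow paths. This is an illustration under assumed names (`CallIndirectCache`, `invoked_index`, and the helper calls in the pseudocode are all hypothetical), not the PR's actual code:

```rust
/// Sketch of the two-word per-callsite cache slot inlined into the vmctx.
#[repr(C)]
struct CallIndirectCache {
    /// Last invoked table index; index 0 doubles as the "no cached value"
    /// sentinel (valid only because index 0 is known to be a null funcref).
    last_index: usize,
    /// Raw code pointer that `last_index` resolved to.
    code_ptr: usize,
}

// Pseudocode for the code generated at each `call_indirect` site:
//
//   if cache.last_index == invoked_index {
//       // Fast path: skip bounds check, lazy init, and signature check.
//       call cache.code_ptr
//   } else {
//       // Slow path: the full sequence of checks described above.
//       funcref = table_get_checked(invoked_index);  // bounds + lazy init
//       check_signature(funcref);                    // trap if mismatched
//       if funcref.vmctx == caller_vmctx {           // intra-module only
//           cache.last_index = invoked_index;        // fill the cache
//           cache.code_ptr   = funcref.code_ptr;
//       }
//       call funcref.code_ptr
//   }
```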
Benchmarks:
So compilation time does take a small hit, but my system is noisy: somewhere between 1% and 17%, with the mean around 8.7%. Instantiation is apparently unaffected. And execution time is... slightly slower? (But again, noisy system.) I tried running the next benchmark in the list (pulldown-cmark) and Sightglass crashed mysteriously, so on to trusty…

I suspect we'll see mostly in-the-noise results for most benchmarks, because indirect calls are somewhat rare in C-ish code; this will matter mostly when compiling dynamic languages, or something like Java with a lot of virtual calls. The speedup on weval'd SpiderMonkey is real, though, so I'd like to at least have this in tree as an option :-)
FWIW, I did a 100-iteration sightglass run of the default suite and got these results:
I think it is safe to say that this doesn't have an effect on our regular suite, and I expect that the JS speedups you are reporting rely on weval'ing first.
Overall looks great, but I have a few nitpicks, mostly related to improving things for future code readers. Thanks!
Updated, let me know what you think!
(Oh, and forgot to mention: thanks for the thorough benchmarks on this! I've gotta dig into why Sightglass is so finicky for me...)
Oh also, I mentioned this in the Cranelift meeting, but wanted to put it down here for posterity too: it seems like the limit on the number of caches isn't actually implemented.
Yep, I missed that entirely in this iteration of the patch -- sorry about that! Just updated. I ended up wiring the limit through to a new config option (with docs, CLI flag, fuzzing) so that I could easily test it without a huge test input. |
🚀
```rust
/// Get the address of the call-indirect cache slot for a given callsite.
pub fn call_indirect_cache_slot_addr(
    &mut self,
    call_site: CallIndirectSiteIndex,
    vmctx: ir::Value,
) -> ir::Value {
    let offset = self.env.offsets.vmctx_call_indirect_cache(call_site);
    self.builder.ins().iadd_imm(vmctx, i64::from(offset))
}
```
````rust
/// Load the cached index and code pointer for an indirect call.
///
/// Generates IR like:
///
/// ```ignore
/// v1 = load.i64 cache_ptr+0 ;; cached index (cache key)
/// v2 = load.i64 cache_ptr+8 ;; cached raw code pointer (cache value)
/// ```
///
/// and returns `(index, code_ptr)` (e.g. from above, `(v1, v2)`).
fn load_cached_indirect_index_and_code_ptr(
````
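The excerpt above cuts off at the signature. A minimal sketch of what the body plausibly does, per the doc comment, using Cranelift's builder API (the `cache_ptr` value and the `trusted` memory flags here are assumptions):

```rust
// Hypothetical sketch of the function body: two loads from the cache
// slot, matching the IR shown in the doc comment above.
let flags = ir::MemFlags::trusted();
let index = self.builder.ins().load(ir::types::I64, flags, cache_ptr, 0);
let code_ptr = self.builder.ins().load(ir::types::I64, flags, cache_ptr, 8);
(index, code_ptr)
```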
This is great, thank you. I really appreciate having this stuff factored out into bite-sized chunks.
(or @jameysharp feel free to review the last commit too, as it interacts with your change in #8515!)
```rust
/// Whether to enable call-indirect caching.
pub cache_call_indirects: Option<bool>,

/// The maximum call-indirect cache slot count.
///
/// One slot is allocated per indirect callsite; if the module
/// has more indirect callsites than this limit, then the
/// first callsites in linear order in the code section, up to
/// the limit, will receive a cache slot.
pub max_call_indirect_cache_slots: Option<usize>,
```
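These fields back the embedder-facing configuration. A hedged usage sketch, assuming `Config` methods that mirror the field names above (the method names and the slot count are assumptions, not confirmed API):

```rust
// Hypothetical usage sketch of the embedder-side configuration.
let mut config = wasmtime::Config::new();
config.cache_call_indirects(true);            // enable the feature (off by default)
config.max_call_indirect_cache_slots(10_000); // cap vmctx growth per module
let engine = wasmtime::Engine::new(&config)?;
```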
Could these be moved to `-O` options as "tuning" or "optimization" options? The `WasmOptions` group here is intended for semantic differences in the generated code, and these options shouldn't affect any generated code semantics. (Thanks for adding CLI flags, though!)
Ah, yeah, for sure -- will do in a followup PR.
As per [this comment], the call-indirect caching options should really be in the `-O` option namespace (for optimizations), not `-W` (for semantically-visible Wasm standards options); the original PR simply put them in the wrong place, and this PR moves them. [this comment]: #8509 (comment)