
Avoid vector allocations in wasm->host calls #3294

Merged

Conversation

alexcrichton
Member

This commit improves the runtime support for wasm-to-host invocations
for functions created with `Func::new` or `wasmtime_func_new` in the C
API. Previously a `Vec` (sometimes a `SmallVec`) would be dynamically
allocated on each host call to store the arguments that are coming from
wasm and going to the host. In the case of the `wasmtime` crate we need
to decode the `u128`-stored values, and in the case of the C API we need
to decode the `Val` into the C API's `wasmtime_val_t`.

The technique used in this commit is to store a singular `Vec<T>` inside
the "store", be it the literal `Store<T>` or within the `T` in the case
of the C API, which can be reused across wasm->host calls. This means
that we're unlikely to actually perform dynamic memory allocation and
instead we should hit a faster path where the `Vec` always has enough
capacity.

Note that this is just a mild improvement for `Func::new`-based
functions. It's still the case that `Func::wrap` is much faster, but
unfortunately the C API doesn't have access to `Func::wrap`, so the main
motivation here is accelerating the C API.
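The reuse pattern described above can be sketched in plain Rust. Here `HostStore` and its field are hypothetical stand-ins for the store-owned scratch vector, not wasmtime's actual internals: the store owns one `Vec`, each wasm->host call takes it out, fills it with decoded arguments, and puts it back cleared so its capacity carries over to the next call.

```rust
// Illustrative sketch of amortizing argument storage across wasm->host
// calls; `HostStore` is a hypothetical stand-in for the real store type.
struct HostStore {
    // Reused across calls: allocated once, then only its length changes.
    hostcall_val_storage: Vec<u64>,
}

impl HostStore {
    fn new() -> Self {
        HostStore { hostcall_val_storage: Vec::new() }
    }

    // Simulates one wasm->host call: move the scratch Vec out of the
    // store, decode the raw arguments into it, invoke the host closure,
    // then return the (cleared) Vec so the next call reuses its capacity.
    fn host_call(&mut self, raw_args: &[u64], host: impl Fn(&[u64]) -> u64) -> u64 {
        let mut vals = std::mem::take(&mut self.hostcall_val_storage);
        vals.clear();
        vals.extend_from_slice(raw_args); // stand-in for the decode step
        let ret = host(&vals);
        vals.clear();
        self.hostcall_val_storage = vals; // hand the capacity back
        ret
    }
}

fn main() {
    let mut store = HostStore::new();
    let sum = store.host_call(&[1, 2, 3], |vals| vals.iter().sum());
    assert_eq!(sum, 6); // first call allocates
    let cap = store.hostcall_val_storage.capacity();
    store.host_call(&[4, 5], |vals| vals.iter().sum());
    // second call fit in the existing capacity: no reallocation
    assert_eq!(store.hostcall_val_storage.capacity(), cap);
}
```

Once the scratch `Vec` has grown to the largest argument count seen so far, subsequent calls hit the fast path with no allocation at all.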

crates/wasmtime/src/store.rs (review comment, outdated, resolved)
@github-actions bot added the labels wasmtime:api (Related to the API of the `wasmtime` crate itself) and wasmtime:c-api (Issues pertaining to the C API) on Sep 3, 2021
github-actions bot commented Sep 3, 2021

This pull request has been labeled "wasmtime:api" and "wasmtime:c-api"; cc @peterhuene, subscribed to those labels via the .github/subscribe-to-label.json configuration file.

alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 3, 2021
The fastest way to call a WebAssembly function with Wasmtime is to use
the `TypedFunc` API and methods. This is only available to Rust code,
however, due to its use of generics. The C API, as a result, can only
use `Func::call`, which is quite slow today. `Func::call` is slow for
several reasons; some major contributors are:

* Space must be allocated for the arguments/return values to call the
  trampoline with. This `u128` storage is allocated on all
  `Func::call`-based calls today.

* The function's type is loaded to typecheck the arguments, and this
  requires taking an rwlock in the `Engine` as well as cloning out the
  `FuncType` itself.

* For the C API the slice of inputs needs to be translated to a slice of
  `Val`, and the results are translated from a vector of `Val` back to a
  vector of `wasmtime_val_t`.

The first two operations are particularly costly, and the goal of this
commit is to address them. The solution implemented here is a new
structure, called `FuncStorage`, which can be created within an `Engine`
on a per-function-type basis. This storage is then used with a new API,
`Func::call_with_storage`, which removes the first two slowdowns mentioned
above. Each `FuncStorage` stores a copy of the `FuncType` it's intended
to be used with. Additionally it stores an appropriately-sized
`Vec<u128>` for storage of trampoline-encoded arguments.

The final bullet above is solved with tweaks to the
`Func::call_with_storage` API relative to `Func::call` where the
parameters/results are both iterators instead of slices.

This new API is intended to be a "power user" API for the Rust crate,
but is expected to be more commonly used with the C API since it's such
a large performance improvement to calling wasm functions.

Overall I'm not overly happy with this API. It solves much of the slow
`wasmtime_func_call` problem, but I think the APIs added here are
pretty unfortunate. Ideally we could solve this issue with no
additional API surface area. For example, the first bullet could be
solved along the lines of bytecodealliance#3294, where vectors are
stored in a `Store` and reused per call. The third bullet could
probably be fixed in the same style, along with changing `Func::call`
to take a `&mut [Val]` argument instead of returning a boxed slice. The
second bullet, though, is probably one of the harder ones to fix. Each
`Func` could store its fully-fleshed-out `FuncType`, but that's a
relatively large change and would also likely require changing
`FuncType` to internally use `Arc<[WasmType]>` or similar. In any case
I'm hoping that this can help spur on some creativity for someone to
find a better solution to this issue.
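The per-function-type storage proposed in this commit message can be sketched as follows. The names `FuncStorage` and `call_with_storage` come from the message above, but the shapes here are simplified guesses rather than wasmtime's shipped API; the point is that the signature copy and the `Vec<u128>` are created once, outside the per-call path.

```rust
// Simplified sketch of the proposed `FuncStorage`: a cached signature
// plus a reusable trampoline-argument buffer, built once per function
// type. `FuncSig` is an illustrative stand-in for `FuncType`.
#[derive(Clone, PartialEq, Debug)]
struct FuncSig {
    params: usize,
    results: usize,
}

struct FuncStorage {
    // Cached copy of the function's type: no Engine rwlock per call.
    sig: FuncSig,
    // Trampoline-encoded argument/result space, sized once and reused.
    values_vec: Vec<u128>,
}

impl FuncStorage {
    fn new(sig: FuncSig) -> Self {
        let slots = sig.params.max(sig.results);
        FuncStorage { sig, values_vec: vec![0; slots] }
    }

    // Type-check against the cached signature, then reuse the buffer.
    fn call_with_storage(&mut self, args: &[u128]) -> Result<&[u128], String> {
        if args.len() != self.sig.params {
            return Err(format!(
                "expected {} params, got {}",
                self.sig.params,
                args.len()
            ));
        }
        self.values_vec[..args.len()].copy_from_slice(args);
        // A real implementation would invoke the trampoline here, which
        // writes the results back over `values_vec`.
        Ok(&self.values_vec[..self.sig.results])
    }
}

fn main() {
    let mut storage = FuncStorage::new(FuncSig { params: 2, results: 1 });
    assert!(storage.call_with_storage(&[1, 2]).is_ok());
    assert!(storage.call_with_storage(&[1]).is_err()); // wrong arity
}
```

Because the buffer and signature live in `FuncStorage` rather than being rebuilt per call, repeated calls through the same storage avoid both the allocation and the type lookup.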
@cfallin (Member) left a comment

LGTM!

@alexcrichton alexcrichton merged commit c736735 into bytecodealliance:main Sep 3, 2021
@alexcrichton alexcrichton deleted the less-hostcall-alloc branch September 3, 2021 20:14
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 8, 2021
This commit is an alternative to bytecodealliance#3298 which achieves effectively the
same goal of optimizing the `Func::call` API as well as its C API
sibling of `wasmtime_func_call`. The strategy taken here is different
than bytecodealliance#3298 though where a new API isn't created, rather a small tweak to
an existing API is done. Specifically this commit handles the major
sources of slowness with `Func::call` with:

* Looking up the type of a function, to typecheck the arguments with and
  use to guide how the results should be loaded, no longer hits the
  rwlock in the `Engine` but instead each `Func` contains its own
  `FuncType`. This can be an unnecessary allocation for funcs not used
  with `Func::call`, so this is a downside of this implementation
  relative to bytecodealliance#3298. A mitigating factor, though, is that instance
  exports are loaded lazily into the `Store` and in theory not too many
  funcs are active in the store as `Func` objects.

* Temporary storage is amortized with a long-lived `Vec` in the `Store`
  rather than allocating a new vector on each call. This is basically
  the same strategy as bytecodealliance#3294 only applied to different types in
  different places. Specifically `wasmtime::Store` now retains a
  `Vec<u128>` for `Func::call`, and the C API retains a `Vec<Val>` for
  calling `Func::call`.

* Finally, an API breaking change is made to `Func::call` and its type
  signature (as well as `Func::call_async`). Instead of returning
  `Box<[Val]>` as it did before this function now takes a
  `results: &mut [Val]` parameter. This allows the caller to manage the
  allocation and we can amortize-remove it in `wasmtime_func_call` by
  using space after the parameters in the `Vec<Val>` we're passing in.
  This change is naturally a breaking change and we'll want to consider
  it carefully, but mitigating factors are that most embeddings are
  likely using `TypedFunc::call` instead and this signature taking a
  mutable slice better aligns with `Func::new` which receives a mutable
  slice for the results.

Overall this change, in the benchmark of "call a nop function from the C
API", is not quite as good as bytecodealliance#3298. It's still a bit
slower, on the order of 15ns, because there are many capacity checks
around vectors and the type checks are slightly less optimized than
before. Overall, though, this is still significantly better than today
because both the allocations and the rwlock acquired for the type
information are avoided. I personally feel this change is the best
option because it has less of an API impact than bytecodealliance#3298.
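The third bullet's trick of placing result slots in the space after the parameters of one reusable `Vec<Val>` can be sketched with a toy `Val` type. The real change gives `Func::call` a `results: &mut [Val]` parameter, per the commit message; everything below is a simplified stand-in, not wasmtime's actual types.

```rust
// Toy illustration of the amortization described above: one Vec holds
// the parameters followed by the result slots, and the callee writes
// results in place instead of returning a freshly boxed slice.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Val {
    I32(i32),
}

// Stand-in callee with the new-style signature: it reads `n_params`
// values from the front and writes its result into the space after them.
fn call(params_and_results: &mut [Val], n_params: usize) {
    let sum: i32 = params_and_results[..n_params]
        .iter()
        .map(|v| match v {
            Val::I32(x) => *x,
        })
        .sum();
    params_and_results[n_params] = Val::I32(sum);
}

fn main() {
    // One reusable buffer: two params plus one result slot.
    let mut vals = vec![Val::I32(2), Val::I32(40), Val::I32(0)];
    call(&mut vals, 2);
    assert_eq!(vals[2], Val::I32(42));

    // Reuse the same buffer for the next call: no per-call allocation.
    vals[0] = Val::I32(5);
    vals[1] = Val::I32(6);
    call(&mut vals, 2);
    assert_eq!(vals[2], Val::I32(11));
}
```

Because the caller owns the buffer, the embedder decides when (if ever) an allocation happens, which is what lets `wasmtime_func_call` amortize it away entirely.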
alexcrichton added a commit that referenced this pull request Sep 21, 2021
* Optimize `Func::call` and its C API

* Rebase issues