
Add a new Func::wrap_cabi API #3345

Closed

Conversation

alexcrichton
Member

This commit adds a new API to the `wasmtime::Func` type for wrapping a
C-ABI function with a well-defined signature derived from a wasm type
signature. The purpose of this API is to add the-most-optimized-we-can
path for using the C API and having wasm call host functions. Previously
when wasm called a host function it would perform these steps:

1. Using a trampoline, place all arguments into a `u128*` array on the
   stack.
2. Call `Func::invoke` which uses the type of the function (dynamically)
   to read values from this `u128*`.
3. Values are placed within a `Vec<Val>` after being read.
4. The C API receives `&[Val]` and translates this to
   `&[wasmtime_val_t]`, iterating over each value and copying its
   contents into a new vector.
5. Then the host function is actually called.
6. The above argument-shuffling steps are all performed in reverse for
   the results, shipping everything through `wasmtime_val_t` and `Val`.

PRs such as bytecodealliance#3319 and related attempts have made this sequence faster, but the numbers on bytecodealliance#3319 show that even after we get the allocations out of the way we're still spending quite a lot of time shuffling arguments back and forth relative to the `Func::wrap` API that Rust can use.

This commit fixes the issue by eliminating all of the steps above except 1 and 5. Although we still place all arguments on the stack and read them out again to call the C-defined function, this is much faster than pushing everything through the `Val` and `wasmtime_val_t` machinery. This gets the cost of a wasm->host call roughly on par with `Func::wrap`, although it's still not as optimized. Benchmarking the overhead of wasm->host calls, where `i64` returns one `i64` value and `many` takes 5 `i32` parameters and returns one `i64` value, the numbers I get are:

| Import | Rust | C before | C after |
|--------|------|----------|---------|
| `i64`  | 1ns  | 40ns     | 7ns     |
| `many` | 1ns  | 91ns     | 10ns    |

This commit is a clear win over the previous implementation, but it's still somewhat far from Rust. That said, I think I'm out of ideas for how to make this better. Without open-coding much larger portions of `wasmtime` I'm not sure how much more we can gain here. After this commit the time in C is almost entirely spent in trampolines storing the arguments to the stack and loading them later, and at this point I'm not sure how much more optimized we can get than that, since Rust needs to enter the picture somehow to handle the Wasmtime fiddly-bits of calling back into C. I'm hopeful, though, that this is a large enough improvement over the status quo that the question of optimizing this further in C is best left for another day.

The new `Func::wrap_cabi` method is unlikely to ever be used from Rust, but a `wasmtime_func_wrap` API was added to the C API to mirror `Func::wrap`: if the host function pointer has the expected ABI, this function can be called instead of `wasmtime_func_new`.
The github-actions bot added labels on Sep 13, 2021: lightbeam (Issues related to the Lightbeam compiler), wasmtime:api (Related to the API of the `wasmtime` crate itself), wasmtime:c-api (Issues pertaining to the C API).
@github-actions

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "lightbeam", "wasmtime:api", "wasmtime:c-api"

Thus the following users have been cc'd because of the following labels:

  • peterhuene: wasmtime:api, wasmtime:c-api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.


alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 14, 2021
This commit is what is hopefully going to be my last installment within
the saga of optimizing function calls in/out of WebAssembly modules in
the C API. This is yet another alternative approach to bytecodealliance#3345 (sorry) but
also contains everything necessary to make the C API fast. As in bytecodealliance#3345
the general idea is just moving checks out of the call path in the same
style of `TypedFunc`.

This new strategy takes inspiration from previous attempts and effectively "just" exposes how we previously passed `*mut u128` through trampolines for arguments/results. This storage format is formalized through a new `ValRaw` union that is exposed from the `wasmtime` crate. This made it relatively easy to expose two new APIs:

* `Func::new_unchecked`
* `Func::call_unchecked`

These are the same as their checked equivalents except that they're
`unsafe` and they work with `*mut ValRaw` rather than safe slices of
`Val`. Working with these skips type checks and requires
callers/embedders to do the right thing.

These two new functions are then exposed via the C API with new
functions, giving C a fast path for calling/defining functions.
This fast path is akin to `Func::wrap` in Rust, although that API can't
be replicated in C since C doesn't have generics in the way that Rust
does.

The benchmarks here are:

* `nop` - Call a wasm function from the host that does nothing and
  returns nothing.
* `i64` - Call a wasm function from the host, the wasm function calls a
  host function, and the host function returns an `i64` all the way out to
  the original caller.
* `many` - Call a wasm function from the host, the wasm calls a
  host function with 5 `i32` parameters, and an `i64` result is
  returned to the original host.
* `i64` host - just the overhead of the wasm calling the host, so the
  wasm calls the host function in a loop.
* `many` host - same as `i64` host, but calling the `many` host function.

All numbers in this table are in nanoseconds, and this is just one
measurement as well so there's bound to be some variation in the precise
numbers here.

| Name      | Rust | C (before) | C (after) |
|-----------|------|------------|-----------|
| nop       | 19   | 112        | 25        |
| i64       | 22   | 207        | 32        |
| many      | 27   | 189        | 34        |
| i64 host  | 2    | 38         | 5         |
| many host | 7    | 75         | 8         |

The main conclusion here is that the C API is significantly faster than
before when using the `*_unchecked` variants of APIs. The Rust
implementation is still the ceiling (or floor, I guess?) for performance.
The main reason that C is slower than Rust is that a little more has
to travel through memory, where on the Rust side of things we can
monomorphize and inline a bit more to get rid of that. Overall, though,
the costs are way down from where they were originally and I don't
plan on doing a whole lot more myself at this time. There are various
things we theoretically could do that I've considered, but implementation-wise
I think they'll be much more weighty.
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 14, 2021
@alexcrichton
Member Author

Ok, I apologize for all the runaround here, but to finish out optimizing host->wasm calls in the C API I've settled on #3350, which is both a bit faster than this implementation and takes a different approach, so I'm going to close this.

@alexcrichton alexcrichton deleted the add-c-api-func-wrap branch September 14, 2021 20:18
alexcrichton added a commit to alexcrichton/wasmtime that referenced this pull request Sep 21, 2021
alexcrichton added a commit that referenced this pull request Sep 24, 2021
* Add `*_unchecked` variants of `Func` APIs for the C API


* Tweak `wasmtime_externref_t` API comments