
RFC: Recycle value stacks to avoid allocation costs #184

Merged
merged 1 commit into from Jun 12, 2019

Conversation


@ghost ghost commented Jun 9, 2019

We hope to use wasmi to implement a policy engine for low-volume packet inspection in an embedded setting. Initial CPU profiles using toy policies showed almost 40% CPU time spent in Vec::extend_with to allocate and initialize a new value stack for each function call.

This MR tries to address this by adding an API that allows applications to reuse stack allocations (for both value and call stacks).

As a side effect, it also provides an API for applications to configure the stack size limits (which is rather useful for us, as we only need a few kB of value stack in our particular application).
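
For illustration, here is a minimal standalone sketch of the reuse pattern the API is meant to enable (the type and method names below are placeholders, not the actual API added by this MR): the application keeps a recycler object alive across calls, so the stack buffers are allocated once and reused.

struct RecyclerModel {
    value_stack: Vec<u64>,  // stand-in for the value stack buffer
    call_stack: Vec<usize>, // stand-in for the call stack buffer
}

impl RecyclerModel {
    fn with_limits(value_limit: usize, call_limit: usize) -> Self {
        RecyclerModel {
            value_stack: Vec::with_capacity(value_limit),
            call_stack: Vec::with_capacity(call_limit),
        }
    }

    // A call borrows the buffers and clears them afterwards, so the
    // capacity survives and the next call does not allocate again.
    fn invoke(&mut self, frames: usize) -> usize {
        for frame in 0..frames {
            self.call_stack.push(frame);
            self.value_stack.push(frame as u64);
        }
        let depth = self.call_stack.len();
        self.value_stack.clear();
        self.call_stack.clear();
        depth
    }
}

fn main() {
    let mut recycler = RecyclerModel::with_limits(4 * 1024, 64);
    for _ in 0..3 {
        // After the first call, no new allocation happens here.
        assert_eq!(recycler.invoke(10), 10);
    }
}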

A micro benchmark executing a minimal, size-optimized policy against a single packet showed the following results:

  • Originally, an iteration took 205 µs.
  • With this MR, but without doing any recycling, an iteration took 295 µs.
  • With this MR, and using the recycling, an iteration took 165 µs. (Also variance dropped from 1.7% to 0.6%.)
  • With this MR, using the recycling as well as keeping the stack "dirty", an iteration took 75 µs.
  • With this MR, using the recycling as well as a "clean" but much smaller stack of 32 kB, an iteration took 77 µs.

At least the following questions remain open from my point of view:

  • The performance dropped significantly without recycling, and I am not certain where this is coming from. I currently suspect that making the limits configurable prevents some micro-optimizations that add up to the observed slowdown.
  • Zeroing out the stack before reuse does not seem to be necessary to pass the test suite, and not doing it significantly improves performance. However, I am unsure whether this means that programs could observe "dirty" values on the stack, or whether the spec requires the stack to be zeroed before invoking an exported function (see the sketch after this list).
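
For what it's worth, the difference between the two reuse modes boils down to the following (a standalone sketch with a plain Vec<u64> standing in for the value stack; the helper names are illustrative, not code from this MR):

// Clean reuse: keep the allocation but re-zero the used region before each call.
fn reuse_clean(stack: &mut Vec<u64>, needed: usize) {
    stack.clear();
    stack.resize(needed, 0); // writes `needed` zeros on every call
}

// Dirty reuse: keep the allocation and whatever values are already in it;
// only grow (and zero) when the requested size exceeds the current length.
fn reuse_dirty(stack: &mut Vec<u64>, needed: usize) {
    if stack.len() < needed {
        stack.resize(needed, 0);
    }
}

fn main() {
    let mut stack = Vec::new();
    reuse_clean(&mut stack, 1024); // zeroes 1024 slots
    stack[0] = 42;
    reuse_dirty(&mut stack, 1024); // leaves the 42 in place
    assert_eq!(stack[0], 42);
}

The dirty variant skips the per-call zeroing, which appears to be where the 165 µs to 75 µs improvement above comes from, but values left over from a previous call stay observable unless the interpreter is guaranteed never to read a slot before writing it.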

@parity-cla-bot

It looks like @adam-rhebo hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

You can read and sign our full Contributor License Agreement at the following URL: https://cla.parity.io

Once you've signed, please reply to this thread with [clabot:check] to prove it.

Many thanks,

Parity Technologies CLA Bot

/// Same as [`invoke`].
///
/// [`invoke`]: #method.invoke
pub fn invoke_with_stack<E: Externals>(
Author

I considered folding this into invoke by adding stack_recycler: Option<&mut StackRecycler> as a parameter, but did not do so to avoid breaking the API.

let limit = this
    .as_ref()
    .map_or(DEFAULT_VALUE_STACK_LIMIT, |this| this.value_stack_limit)
    / ::core::mem::size_of::<RuntimeValueInternal>();

@ghost ghost Jun 9, 2019


This used to be size_of::<RuntimeValue>, but I suspect that was a typo: RuntimeValue includes the enum tag in addition to the 64 bits of payload, while the slice is made of RuntimeValueInternal.
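
A standalone illustration of the size difference (the two types below only mirror the shapes described here; they are not the wasmi definitions):

use core::mem::size_of;

// A tagged value: a discriminant plus up to 64 bits of payload.
// With alignment this comes out at 16 bytes, like RuntimeValue.
#[allow(dead_code)]
enum TaggedValue {
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
}

// An untagged 64-bit representation, like RuntimeValueInternal.
#[allow(dead_code)]
struct UntaggedValue(u64);

fn main() {
    assert_eq!(size_of::<TaggedValue>(), 16);
    assert_eq!(size_of::<UntaggedValue>(), 8);
    // Dividing a byte budget by 16 instead of 8 therefore halves
    // the number of stack entries the limit allows.
}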


ghost commented Jun 9, 2019

As an additional data point, these are the results of the included benchmarks on my machine before

test bench_regex_redux ... bench:   3,700,627 ns/iter (+/- 17,403)
test bench_rev_comp    ... bench:   8,217,647 ns/iter (+/- 11,277)
test bench_tiny_keccak ... bench:   4,467,537 ns/iter (+/- 153,282)
test fac_opt           ... bench:      15,351 ns/iter (+/- 1,622)
test fac_recursive     ... bench:      16,975 ns/iter (+/- 6,130)
test recursive_ok      ... bench:   1,099,639 ns/iter (+/- 22,430)
test recursive_trap    ... bench:     126,832 ns/iter (+/- 5,552)

and after these changes

test bench_regex_redux ... bench:   3,846,837 ns/iter (+/- 66,684)
test bench_rev_comp    ... bench:   8,497,919 ns/iter (+/- 28,065)
test bench_tiny_keccak ... bench:   4,388,131 ns/iter (+/- 39,272)
test fac_opt           ... bench:      30,699 ns/iter (+/- 11,836)
test fac_recursive     ... bench:      32,653 ns/iter (+/- 454)
test recursive_ok      ... bench:   1,192,462 ns/iter (+/- 4,996)
test recursive_trap    ... bench:     148,877 ns/iter (+/- 1,696)


ghost commented Jun 9, 2019

Ah, it seems that changing the default size of the value stack from (1024 * 1024) / ::core::mem::size_of::<RuntimeValue>() to (1024 * 1024) / ::core::mem::size_of::<RuntimeValueInternal>() is responsible for the performance regressions, as it effectively increased the default stack size from 512 kB to 1 MB; when I revert that change, performance without recycling returns to the previous values.

Should I adjust the default size back to 512 kB or leave it as is?
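
For concreteness, the arithmetic behind the two defaults looks roughly like this (a standalone sketch using the 16-byte and 8-byte sizes discussed further down in the thread):

fn main() {
    // Old default: the 1 MB budget divided by the 16-byte RuntimeValue,
    // while the stack itself stores 8-byte RuntimeValueInternal entries.
    let old_entries = (1024 * 1024) / 16;
    assert_eq!(old_entries * 8, 512 * 1024); // 512 kB of actual stack

    // New default: the same budget divided by the 8-byte entry size.
    let new_entries = (1024 * 1024) / 8;
    assert_eq!(new_entries * 8, 1024 * 1024); // 1 MB of actual stack
}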

@pepyakin pepyakin self-requested a review June 10, 2019 13:13

ghost commented Jun 11, 2019

It looks like @adam-rhebo hasn't signed our Contributor License Agreement, yet. […]

[clabot:check]

@parity-cla-bot

It looks like @adam-rhebo signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@pepyakin
Collaborator

Hello, @adam-rhebo ! First, thank you for the PR.

The code looks good; however, my local benchmarks report massive slowdowns on the latest commit:

 name               master-2960f1b.bench ns/iter  stack-recycling-2.bench ns/iter  diff ns/iter  diff %  speedup
 bench_regex_redux  2,708,982                     2,768,588                              59,606   2.20%   x 0.98
 bench_rev_comp     6,187,121                     6,209,233                              22,112   0.36%   x 1.00
 bench_tiny_keccak  3,296,202                     3,289,373                              -6,829  -0.21%   x 1.00
 fac_opt            21,475                        42,373                                 20,898  97.31%   x 0.51
 fac_recursive      23,583                        45,191                                 21,608  91.63%   x 0.52
 recursive_ok       996,758                       1,130,878                             134,120  13.46%   x 0.88
 recursive_trap     110,301                       157,495                                47,194  42.79%   x 0.70

Looking at these results, it is clear that the shorter the running time, the larger the slowdown. Can we fix it?

Should I adjust the default size back to 512 kB or leave it as is?

Hm, it was 1 MB before, wasn't it? My reasoning is as follows:

  1. pub const DEFAULT_VALUE_STACK_LIMIT: usize = (1024 * 1024) / ::core::mem::size_of::<RuntimeValue>();
  2. we used to resize the value stack to have DEFAULT_VALUE_STACK_LIMIT items.
  3. assuming that each item is of size_of::<RuntimeValue[Internal]> size and the items are laid out densely, the whole value stack should occupy ((1024 * 1024) / size_of::<RuntimeValue>()) * size_of::<RuntimeValue>() bytes or simply 1MB.


ghost commented Jun 11, 2019

The code looks good, however my local benchmark report massive slowdowns on the latest commit:

I think the slowdown is due to doubling the effective default stack size, at least that is what my local measurements suggest.

Hm, it was 1MB before, isn't it? My logic is as follows:

The stack is made up of values of type RuntimeValueInternal, which is smaller than RuntimeValue as it lacks the enum tag. Hence the actual size at the moment is

((1024 * 1024) / size_of::<RuntimeValue>()) * size_of::<RuntimeValueInternal>() ==
((1024 * 1024) / 16) * 8 ==
512 * 1024


ghost commented Jun 11, 2019

Looking at these results, it is clear that the shorter the running time, the larger the slowdown.

I think this is due to the stack initialization dominating the execution time for these small examples; hence doubling the stack size almost doubles the execution time.

@pepyakin
Collaborator

Ah, yeah, you are right! I am ok with leaving the stack size as is now.

@pepyakin pepyakin merged commit 284c907 into wasmi-labs:master Jun 12, 2019
@ghost ghost deleted the stack-recycling branch June 13, 2019 09:18