
RFC: Recycle value stacks to avoid allocation costs #184

Merged
merged 1 commit into from Jun 12, 2019

Conversation


@ghost ghost commented Jun 9, 2019

We hope to use wasmi to implement a policy engine for low-volume packet inspection in an embedded setting. Initial CPU profiles using toy policies showed almost 40% CPU time spent in Vec::extend_with to allocate and initialize a new value stack for each function call.

This MR tries to address this by adding an API that allows applications to reuse stack allocations (for both value and call stacks).

As a side effect, it also provides an API for applications to configure the stack size limits (which is rather useful for us, as we only need a few kB of value stack in our particular application).
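
For illustration, here is a minimal standalone sketch of the reuse pattern the API is meant to enable (the type and method names below are placeholders, not the actual API added by this MR): the application keeps a recycler object alive across calls, so the stack buffers are allocated once and reused.

struct RecyclerModel {
    value_stack: Vec<u64>,  // stand-in for the value stack buffer
    call_stack: Vec<usize>, // stand-in for the call stack buffer
}

impl RecyclerModel {
    fn with_limits(value_limit: usize, call_limit: usize) -> Self {
        RecyclerModel {
            value_stack: Vec::with_capacity(value_limit),
            call_stack: Vec::with_capacity(call_limit),
        }
    }

    // A call borrows the buffers and clears them afterwards, so the
    // capacity survives and the next call does not allocate again.
    fn invoke(&mut self, frames: usize) -> usize {
        for frame in 0..frames {
            self.call_stack.push(frame);
            self.value_stack.push(frame as u64);
        }
        let depth = self.call_stack.len();
        self.value_stack.clear();
        self.call_stack.clear();
        depth
    }
}

fn main() {
    let mut recycler = RecyclerModel::with_limits(4 * 1024, 64);
    for _ in 0..3 {
        // After the first call, no new allocation happens here.
        assert_eq!(recycler.invoke(10), 10);
    }
}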

A micro benchmark executing a minimal, size-optimized policy against a single packet showed the following results:

  • Originally, an iteration took 205 µs.
  • With this MR, but without doing any recycling, an iteration took 295 µs.
  • With this MR, and using the recycling, an iteration took 165 µs. (Also variance dropped from 1.7% to 0.6%.)
  • With this MR, using the recycling as well as keeping the stack "dirty", an iteration took 75 µs.
  • With this MR, using the recycling as well as a "clean" but much smaller stack of 32 kB, an iteration took 77 µs.

At least the following questions remain open from my point of view:

  • The performance dropped significantly without recycling, and I am not certain where this is coming from. I currently suspect that making the limits configurable prevents some micro-optimizations that add up to the observed slowdown.
  • Zeroing out the stack before reuse does not seem to be necessary to pass the test suite, and not doing it significantly improves performance. However, I am unsure whether this means that programs could observe "dirty" values on the stack, or whether the spec requires the stack to be zeroed before invoking an exported function (see the sketch after this list).
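
For what it's worth, the difference between the two reuse modes boils down to the following (a standalone sketch with a plain Vec<u64> standing in for the value stack; the helper names are illustrative, not code from this MR):

// Clean reuse: keep the allocation but re-zero the used region before each call.
fn reuse_clean(stack: &mut Vec<u64>, needed: usize) {
    stack.clear();
    stack.resize(needed, 0); // writes `needed` zeros on every call
}

// Dirty reuse: keep the allocation and whatever values are already in it;
// only grow (and zero) when the requested size exceeds the current length.
fn reuse_dirty(stack: &mut Vec<u64>, needed: usize) {
    if stack.len() < needed {
        stack.resize(needed, 0);
    }
}

fn main() {
    let mut stack = Vec::new();
    reuse_clean(&mut stack, 1024); // zeroes 1024 slots
    stack[0] = 42;
    reuse_dirty(&mut stack, 1024); // leaves the 42 in place
    assert_eq!(stack[0], 42);
}

The dirty variant skips the per-call zeroing, which appears to be where the 165 µs to 75 µs improvement above comes from, but values left over from a previous call stay observable unless the interpreter is guaranteed never to read a slot before writing it.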

@parity-cla-bot

It looks like @adam-rhebo hasn't signed our Contributor License Agreement, yet.

The purpose of a CLA is to ensure that the guardian of a project's outputs has the necessary ownership or grants of rights over all contributions to allow them to distribute under the chosen licence. (Wikipedia)

You can read and sign our full Contributor License Agreement at the following URL: https://cla.parity.io

Once you've signed, please reply to this thread with [clabot:check] to prove it.

Many thanks,

Parity Technologies CLA Bot

/// Same as [`invoke`].
///
/// [`invoke`]: #method.invoke
pub fn invoke_with_stack<E: Externals>(
Author

I considered folding this into invoke by adding stack_recycler: Option<&mut StackRecycler> as a parameter, but did not do so to avoid breaking the API.

let limit = this
    .as_ref()
    .map_or(DEFAULT_VALUE_STACK_LIMIT, |this| this.value_stack_limit)
    / ::core::mem::size_of::<RuntimeValueInternal>();

@ghost ghost Jun 9, 2019


This used to be size_of::<RuntimeValue>, but I suspect that was a typo: RuntimeValue includes the enum tag in addition to the 64 bits of payload, while the slice is made of RuntimeValueInternal.
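
A standalone illustration of the size difference (the two types below only mirror the shapes described here; they are not the wasmi definitions):

use core::mem::size_of;

// A tagged value: a discriminant plus up to 64 bits of payload.
// With alignment this comes out at 16 bytes, like RuntimeValue.
#[allow(dead_code)]
enum TaggedValue {
    I32(i32),
    I64(i64),
    F32(f32),
    F64(f64),
}

// An untagged 64-bit representation, like RuntimeValueInternal.
#[allow(dead_code)]
struct UntaggedValue(u64);

fn main() {
    assert_eq!(size_of::<TaggedValue>(), 16);
    assert_eq!(size_of::<UntaggedValue>(), 8);
    // Dividing a byte budget by 16 instead of 8 therefore halves
    // the number of stack entries the limit allows.
}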


ghost commented Jun 9, 2019

As an additional data point, these are the results of the included benchmarks on my machine before

test bench_regex_redux ... bench:   3,700,627 ns/iter (+/- 17,403)
test bench_rev_comp    ... bench:   8,217,647 ns/iter (+/- 11,277)
test bench_tiny_keccak ... bench:   4,467,537 ns/iter (+/- 153,282)
test fac_opt           ... bench:      15,351 ns/iter (+/- 1,622)
test fac_recursive     ... bench:      16,975 ns/iter (+/- 6,130)
test recursive_ok      ... bench:   1,099,639 ns/iter (+/- 22,430)
test recursive_trap    ... bench:     126,832 ns/iter (+/- 5,552)

and after these changes

test bench_regex_redux ... bench:   3,846,837 ns/iter (+/- 66,684)
test bench_rev_comp    ... bench:   8,497,919 ns/iter (+/- 28,065)
test bench_tiny_keccak ... bench:   4,388,131 ns/iter (+/- 39,272)
test fac_opt           ... bench:      30,699 ns/iter (+/- 11,836)
test fac_recursive     ... bench:      32,653 ns/iter (+/- 454)
test recursive_ok      ... bench:   1,192,462 ns/iter (+/- 4,996)
test recursive_trap    ... bench:     148,877 ns/iter (+/- 1,696)


ghost commented Jun 9, 2019

Ah, it seems that changing the default size of the value stack from (1024 * 1024) / ::core::mem::size_of::<RuntimeValue>() to (1024 * 1024) / ::core::mem::size_of::<RuntimeValueInternal>() is responsible for the performance regressions, as it effectively increased the default stack size from 512 kB to 1 MB; when I revert that change, performance without recycling returns to the previous values.

Should I adjust the default size back to 512 kB or leave it as is?
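
For concreteness, the arithmetic behind the two defaults looks roughly like this (a standalone sketch using the 16-byte and 8-byte sizes discussed further down in the thread):

fn main() {
    // Old default: the 1 MB budget divided by the 16-byte RuntimeValue,
    // while the stack itself stores 8-byte RuntimeValueInternal entries.
    let old_entries = (1024 * 1024) / 16;
    assert_eq!(old_entries * 8, 512 * 1024); // 512 kB of actual stack

    // New default: the same budget divided by the 8-byte entry size.
    let new_entries = (1024 * 1024) / 8;
    assert_eq!(new_entries * 8, 1024 * 1024); // 1 MB of actual stack
}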

@pepyakin pepyakin self-requested a review June 10, 2019 13:13

ghost commented Jun 11, 2019

It looks like @adam-rhebo hasn't signed our Contributor License Agreement, yet. […]

[clabot:check]

@parity-cla-bot

It looks like @adam-rhebo signed our Contributor License Agreement. 👍

Many thanks,

Parity Technologies CLA Bot

@pepyakin
Collaborator

Hello, @adam-rhebo ! First, thank you for the PR.

The code looks good; however, my local benchmarks report massive slowdowns on the latest commit:

 name               master-2960f1b.bench ns/iter  stack-recycling-2.bench ns/iter  diff ns/iter  diff %  speedup
 bench_regex_redux  2,708,982                     2,768,588                              59,606   2.20%   x 0.98
 bench_rev_comp     6,187,121                     6,209,233                              22,112   0.36%   x 1.00
 bench_tiny_keccak  3,296,202                     3,289,373                              -6,829  -0.21%   x 1.00
 fac_opt            21,475                        42,373                                 20,898  97.31%   x 0.51
 fac_recursive      23,583                        45,191                                 21,608  91.63%   x 0.52
 recursive_ok       996,758                       1,130,878                             134,120  13.46%   x 0.88
 recursive_trap     110,301                       157,495                                47,194  42.79%   x 0.70

Looking at these results, it is clear that the shorter the running time, the larger the slowdown. Can we fix it?

Should I adjust the default size back to 512 kB or leave it as is?

Hm, it was 1 MB before, wasn't it? My reasoning is as follows:

  1. pub const DEFAULT_VALUE_STACK_LIMIT: usize = (1024 * 1024) / ::core::mem::size_of::<RuntimeValue>();
  2. we used to resize the value stack to have DEFAULT_VALUE_STACK_LIMIT items.
  3. assuming that each item is of size_of::<RuntimeValue[Internal]> size and the items are laid out densely, the whole value stack should occupy ((1024 * 1024) / size_of::<RuntimeValue>()) * size_of::<RuntimeValue>() bytes or simply 1MB.


ghost commented Jun 11, 2019

The code looks good, however my local benchmark report massive slowdowns on the latest commit:

I think the slowdown is due to doubling the effective default stack size, at least that is what my local measurements suggest.

Hm, it was 1MB before, isn't it? My logic is as follows:

The stack is made up of values of type RuntimeValueInternal, which is smaller than RuntimeValue as it lacks the enum tag. Hence the actual size at the moment is

((1024 * 1024) / size_of::<RuntimeValue>()) * size_of::<RuntimeValueInternal>() ==
((1024 * 1024) / 16) * 8 ==
512 * 1024


ghost commented Jun 11, 2019

Looking at these results, it is clear that the shorter the running time, the larger the slowdown.

I think this is due to the stack initialization dominating the execution time for these small examples; hence doubling the stack size almost doubles the execution time.

@pepyakin
Collaborator

Ah, yeah, you are right! I am ok with leaving the stack size as is now.

@pepyakin pepyakin merged commit 284c907 into wasmi-labs:master Jun 12, 2019
@ghost ghost deleted the stack-recycling branch June 13, 2019 09:18