
refactor: Change to copy function in read_region #730

Closed

Conversation


@slave5vw slave5vw commented Jan 19, 2021

Motivation

Wasmer's memory type patch (wasmerio/wasmer#1299) changed the dynamic memory type to a static memory type, resulting in a large performance improvement:
static memories need far fewer bounds-check instructions than dynamic memories, which reduces code size and speeds up execution. This holds for all backends.

However, after applying the patch and measuring with erc20 in cosmwasm, the improvement was not recorded consistently
(ranging from about 30% up to 200%; e.g. with singlepass, 418 txs -> 532 ~ 870 txs).
So I had to find out where the difference came from.

The log below shows the timings when the JIT-compiled functions are called from wasmer.
For each transfer, a difference of about 50µs shows up irregularly.

address 3608BA37 is handle

debug: func "deallocate" | func:360B5DE8 | elapsed:4.838µs
debug: func "allocate" | func:360B5709 | elapsed:15.843µs
debug: func "allocate" | func:360B5709 | elapsed:4.115µs
debug: func "handle" | func:360B5709 | elapsed:5.189µs
func:3608BA37 | elapsed:199.764µs
debug: func "deallocate" | func:360B5DE8 | elapsed:4.826µs
debug: func "allocate" | func:360B5709 | elapsed:18.979µs
debug: func "allocate" | func:360B5709 | elapsed:3.872µs
debug: func "handle" | func:360B5709 | elapsed:5.222µs
func:3608BA37 | elapsed:207.639µs
debug: func "deallocate" | func:360B5DE8 | elapsed:4.667µs
debug: func "allocate" | func:360B5709 | elapsed:21.81µs
debug: func "allocate" | func:360B5709 | elapsed:4.543µs
debug: func "handle" | func:360B5709 | elapsed:5.699µs
func:3608BA37 | elapsed:236.879µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.159µs
debug: func "allocate" | func:360B5709 | elapsed:21.678µs
debug: func "allocate" | func:360B5709 | elapsed:4.599µs
debug: func "handle" | func:360B5709 | elapsed:6.005µs
func:3608BA37 | elapsed:297.624µs
debug: func "deallocate" | func:360B5DE8 | elapsed:6.274µs
debug: func "allocate" | func:360B5709 | elapsed:19.402µs
debug: func "allocate" | func:360B5709 | elapsed:4.457µs
debug: func "handle" | func:360B5709 | elapsed:5.347µs
func:3608BA37 | elapsed:213.839µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.012µs
debug: func "allocate" | func:360B5709 | elapsed:15.524µs
debug: func "allocate" | func:360B5709 | elapsed:4.157µs
debug: func "handle" | func:360B5709 | elapsed:5.215µs
func:3608BA37 | elapsed:197.728µs
debug: func "deallocate" | func:360B5DE8 | elapsed:4.924µs
debug: func "allocate" | func:360B5709 | elapsed:28.02µs
debug: func "allocate" | func:360B5709 | elapsed:5.18µs
debug: func "handle" | func:360B5709 | elapsed:8.691µs
func:3608BA37 | elapsed:360.173µs
debug: func "deallocate" | func:360B5DE8 | elapsed:7.916µs
debug: func "allocate" | func:360B5709 | elapsed:20.315µs
debug: func "allocate" | func:360B5709 | elapsed:4.221µs
debug: func "handle" | func:360B5709 | elapsed:5.81µs
func:3608BA37 | elapsed:244.343µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.193µs
debug: func "allocate" | func:360B5709 | elapsed:21.005µs
debug: func "allocate" | func:360B5709 | elapsed:4.536µs
debug: func "handle" | func:360B5709 | elapsed:5.448µs
func:3608BA37 | elapsed:236.151µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.057µs
debug: func "allocate" | func:360B5709 | elapsed:21.288µs
debug: func "allocate" | func:360B5709 | elapsed:4.263µs
debug: func "handle" | func:360B5709 | elapsed:5.594µs
func:3608BA37 | elapsed:238.434µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.078µs

This is by no means small compared to the total execution time.
However, from this information alone it is hard to tell which part of the handle call is responsible.

Analyzing with a profiler, I found that every time execution was slow, read_region was taking up the time.
Due to the profiler's low resolution, read_region was only captured when it was slow, not when it was fast.
(I'd like to write up the profiling process in more detail, but that would take a long time, so I'll skip it.)

(profiler screenshots)

Proposed Solution

As you know, copying byte by byte in a for statement is hard for the compiler to optimize.
Copying with a memcpy-like call instead enables optimizations such as SIMD copies; a rough sketch is shown below.
This pays off even when the copied region is not particularly large.
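
To illustrate the idea (a minimal standalone sketch, not the exact code of this PR; it only contrasts the two copy strategies over the &[Cell<u8>] slice that WasmPtr::deref returns):

use std::cell::Cell;

// For-loop version (roughly what the existing code does): one bounds-checked
// byte copy per iteration, which the optimizer struggles to vectorize.
fn copy_cells_loop(cells: &[Cell<u8>]) -> Vec<u8> {
    let mut result = vec![0u8; cells.len()];
    for (i, cell) in cells.iter().enumerate() {
        result[i] = cell.get();
    }
    result
}

// Copy-based version (the idea behind this PR): treat the cell slice as a raw
// byte pointer and copy it in one call, which can become a memcpy/SIMD copy.
// Cell<u8> has the same memory layout as u8 (repr(transparent)).
fn copy_cells_memcpy(cells: &[Cell<u8>]) -> Vec<u8> {
    let len = cells.len();
    let mut result = vec![0u8; len];
    let src = cells.as_ptr() as *const u8;
    unsafe {
        std::ptr::copy_nonoverlapping(src, result.as_mut_ptr(), len);
    }
    result
}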

After patching read_region,

  • performance became stable
  • and improved by up to 10%
    (with singlepass, 532 ~ 870 txs -> 876 ~ 913 txs)

From the comment in the code, I can see that this had already come up before.
So if there was a reason for sticking with the for-statement approach, I'd like to hear your opinion on it.

When using a for statement, optimizations such as SIMD copies are impossible, which adds overhead.
@webmaster128
Member

@maurolacy could you look into this proposal and let us know what you think? Also check how the latest Wasmer code implements the raw copy in WASI. If we want to use this or a similar implementation, we should be testing read_region heavily (I don't know to what degree this is done at the moment).

Base automatically changed from master to main January 19, 2021 22:43
@maurolacy
Contributor

maurolacy commented Jan 20, 2021

Will look into it in more detail during the day, but (except for the formatting issues) this looks good to me already.

The for loop seems to be there just because of the Cell encapsulation, i.e. cautiously avoiding premature optimization.

@ethanfrey
Member

ethanfrey commented Jan 20, 2021

Yeah, this was marked as possibly inefficient, but safe.

Good to revisit what wasmer has done, and maybe ask them what they recommend. The PR may open us up to panics/UB when given an overly large region size for which no memory was allocated (malicious creation of a Region on the contract side).

There is an equivalent call in write_region, which I would think is safer to optimize (the bounds of the region we copy come from trusted go/vm code, not untrusted contracts). We still need to sanity check that there is enough memory available for it (e.g. the contract implements a lying allocate function); a sketch of such a check follows below.
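
(For illustration only — this is not the actual cosmwasm-vm code; the Region capacity field and the error handling are simplified assumptions:)

// Sketch of the sanity check discussed above: before writing `data` into a
// Region returned by the contract's allocate(), verify the claimed capacity,
// since a malicious contract can report a Region that is too small or bogus.
fn check_target_region(region_capacity: u32, data: &[u8]) -> Result<(), String> {
    if data.len() > region_capacity as usize {
        return Err(format!(
            "region too small: need {} bytes, contract allocated {}",
            data.len(),
            region_capacity
        ));
    }
    Ok(())
}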

@slave5vw
Author

slave5vw commented Jan 21, 2021

The PR may open us up to panics/UB when given an overly large region size for which no memory was allocated (malicious creation of a Region on the contract side).

Thanks for the comments!
However, I would like to carefully offer a different opinion.

First, Cell::get() in the previous code is implemented as follows.
It is almost the same as an unsafe cast; the difference is that race-condition safety is guaranteed because Cell is !Sync.
https://doc.rust-lang.org/src/core/cell.rs.html#429-433
https://doc.rust-lang.org/src/core/cell.rs.html#1711
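
(A standalone sketch, not code from this PR, showing that reading through Cell::get and reading through the raw cast used here yield the same bytes — Cell::get itself is just a pointer dereference of the inner UnsafeCell, with no extra checks:)

use std::cell::Cell;

fn main() {
    // Four bytes wrapped in Cell, standing in for the &[Cell<u8>] that WasmPtr::deref returns.
    let bytes: [Cell<u8>; 4] = [Cell::new(1), Cell::new(2), Cell::new(3), Cell::new(4)];

    // Reading through Cell::get — internally just `unsafe { *self.value.get() }`.
    let via_cell: Vec<u8> = bytes.iter().map(|c| c.get()).collect();

    // Reading through a raw pointer cast, as this PR does.
    // Cell<u8> has the same memory layout as u8 (repr(transparent)).
    let ptr = bytes.as_ptr() as *const u8;
    let via_cast: Vec<u8> = (0..bytes.len()).map(|i| unsafe { *ptr.add(i) }).collect();

    assert_eq!(via_cell, via_cast);
}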

So I think that if the unallocated-memory or oversized-region problem can occur, the Cell-based approach has the same problem as a direct cast.
In a single-threaded setting this has nothing to do with race conditions, and Cell does not guard access to unallocated memory areas; I think WasmPtr is responsible for that.
Is there an example where this has happened before, or is there anything I have misunderstood? If so, please comment.

(By the way, I don't think there are any race-condition problems until the wasm threads feature is implemented, and even then I don't think we would use that feature.)

@maurolacy
Contributor

So I think that if the unallocated-memory or oversized-region problem can occur, the Cell-based approach has the same problem as a direct cast.

Yes, I agree.

I was taking a look yesterday, and the issues @ethanfrey mentions are related to potential attacks on the memory.
I think it would be very interesting to analyze the wasm / wasmer memory model from a security / safety standpoint, but that is clearly beyond the changes this PR is suggesting.

As you say, these changes are, as far as I can see, equivalent to the original version in terms of security / safety.

@maurolacy
Contributor

@slave5vw can you please fix the format (cargo fmt), and generally make the CI happy?

@slave5vw
Author

slave5vw commented Jan 21, 2021

@slave5vw can you please fix the format (cargo fmt), and generally make the CI happy?

Oh, I forgot. Thanks!

@ethanfrey
Member

One more comment. Are the benchmarking scripts used to show the performance gain in the repo already? Or part of this PR?

It would be good to start tracking that. And maybe we can also improve things further (with the write_memory part, for example).

@slave5vw
Author

slave5vw commented Jan 21, 2021

Oh, unfortunately, the company's current policy makes it difficult to share the benchmark tool. Sorry.
The operation is simple, though: after creating multiple accounts, it repeatedly executes erc20 transfers.

The profiler I used is Instruments on macOS, with its counter instrument.
With it, you can view profiling results for just the desired time interval via the timeline, which was helpful in identifying this issue. (Other profilers usually only show aggregate results from start to finish, which was not helpful here.)

@maurolacy
Contributor

maurolacy commented Jan 21, 2021

Well, we can add benchmarks to measure read/write_region directly, and we should be able to see the performance gain of this change.

@slave5vw if you want to do it, go ahead. Otherwise, I'll take it.

In any case, let's do it in a different PR, so we can easily see the impact of this.
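
(As a rough illustration of what such a micro-benchmark could look like — a criterion sketch comparing the two copy strategies over a plain Cell<u8> buffer rather than over a real wasmer Memory; the file name, buffer size and bench names are made up:)

// benches/region_copy.rs (illustrative only)
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::cell::Cell;

fn bench_copies(c: &mut Criterion) {
    let data: Vec<Cell<u8>> = (0..64 * 1024).map(|i| Cell::new(i as u8)).collect();

    c.bench_function("copy via for loop", |b| {
        b.iter(|| {
            let cells = black_box(&data[..]);
            let mut out = vec![0u8; cells.len()];
            for (i, cell) in cells.iter().enumerate() {
                out[i] = cell.get();
            }
            out
        })
    });

    c.bench_function("copy via ptr::copy_nonoverlapping", |b| {
        b.iter(|| {
            let cells = black_box(&data[..]);
            let mut out = vec![0u8; cells.len()];
            unsafe {
                std::ptr::copy_nonoverlapping(cells.as_ptr() as *const u8, out.as_mut_ptr(), cells.len());
            }
            out
        })
    });
}

criterion_group!(benches, bench_copies);
criterion_main!(benches);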

@slave5vw
Author

@maurolacy I would appreciate it if you could take on the benchmark work.
As for write_region, the change made little difference there, so I didn't include it in the PR. But that was a limited test with erc20 only, so it may still make sense for other samples.

@webmaster128
Member

We have higher-level benchmarks that can benchmark all public interfaces of cosmwasm-vm. Those include contract executions. I would not go into micro-benchmarking at this point, because what does it help if the call gets 20x faster but that speedup is not relevant in the bigger picture?

@ethanfrey
Member

ethanfrey commented Jan 21, 2021

We have higher-level benchmarks that can benchmark all public interfaces of cosmwasm-vm.

@slave5vw or @maurolacy can someone run those benchmarks on main and then on this branch and post the results here? (Or do we track performance inside the repo in a markdown file?)

It would be nice to demo the speedup of call_handle (1%? 20%?).

@maurolacy
Contributor

can someone run those benchmarks on main and then on this branch

I'll do it.

@slave5vw
Author

slave5vw commented Jan 21, 2021

but that speedup is not relevant in the bigger picture?

@webmaster128 I'm not sure what you mean by that.
The exact purpose of this PR is to stabilize the irregular performance that appears after wasmer's static-bounds patch,
wasmerio/wasmer#1299. (Note that this patch works regardless of the backend, not just for cranelift.)
If you want to check the irregular performance independently, without side effects, I recommend applying that patch to cosmwasm 0.10.1 and wasmer 0.17.0.

maurolacy added a commit that referenced this pull request Jan 21, 2021
@maurolacy
Contributor

maurolacy commented Jan 21, 2021

Here are some results:

"execute handle" (calls call_handle) in main:
singlepass: [47.191 us 48.384 us 50.100 us]
cranelift: [44.035 us 45.701 us 47.007 us]

"execute handle" (calls call_handle) in read_region-copy (branch I created for this):
singlepass: [56.238 us 57.568 us 59.715 us]
cranelift: [43.016 us 45.889 us 49.848 us]

So, this is in fact worse for singlepass(!) and "irrelevant" (within the measured calls, of course) for cranelift. I'm assuming these benchmarks are comparable. I'll run them again locally, to see if there are spurious factors or variations here.

References:
main latest benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2172/workflows/38460158-6bb6-4d6f-87ab-75b6a0bd3d6b/jobs/21182
read_region-copy benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2174/workflows/59a2fb13-860d-4e5e-9d69-11fec9e158d3/jobs/21209

@slave5vw
Author

slave5vw commented Jan 21, 2021

Hmm, the CI benchmark result looks strange. I'll post my benchmark results (as Tendermint logs) in a comment tomorrow; I'm already off work.

Could you tell me which contract was used in the test? It looks a lot shorter than the erc20 execution.

@maurolacy
Contributor

maurolacy commented Jan 21, 2021

OK, I've run the benchmarks again in both branches, and now the results make more sense.

"execute handle" (calls call_handle) in main:
singlepass: [49.432 us 51.169 us 52.996 us]
cranelift: [42.637 us 43.724 us 44.907 us]

"execute handle" (calls call_handle) in read_region-copy (branch I created for this):
singlepass: [47.547 us 48.810 us 50.255 us]
cranelift: [41.843 us 42.505 us 42.989 us]

That's an improvement of ~5% for read_region with copy. So, regarding these benchmarks / use cases, this change is not that relevant.

My opinion is that it wouldn't hurt either. wasmer's read_bytes does something similar (using casts over an array of regions).

References:
main benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2172/workflows/1820727a-8858-4ebc-b176-8988c0ea1ec0/jobs/21223
read_region-copy benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2174/workflows/4107fb3b-8821-4666-a584-20d840a82df4/jobs/21237

@maurolacy
Contributor

maurolacy commented Jan 21, 2021

Could you tell me which contract was used in the test? It looks a lot shorter than the erc20 execution.

We're using the (optimized) hackatom contract. You can take a look at these benchmarks here: https://github.com/CosmWasm/cosmwasm/blob/main/packages/vm/benches/main.rs#L56

@webmaster128
Member

webmaster128 commented Jan 21, 2021

My opinion is that it wouldn't hurt either.

Fine with me

wasmer's read_bytes does something similar (using casts over an array of regions).

Can we please add a reference comment pointing to the latest version of that code at the place where we do the unsafe operation? Wasmer changed a lot between 0.13.1 and 1.0.

@maurolacy
Contributor

Can we please add a reference comment pointing to the latest version of that code at the place where we do the unsafe operation? Wasmer changed a lot between 0.13.1 and 1.0.

Good point. There you go: https://github.com/wasmerio/wasmer/blob/1.0.0/lib/wasi/src/syscalls/mod.rs#L82-L84

@@ -39,12 +40,11 @@ pub fn read_region(memory: &wasmer::Memory, ptr: u32, max_length: usize) -> VmRe

match WasmPtr::<u8, Array>::new(region.offset).deref(memory, 0, region.length) {
    Some(cells) => {
        // In case you want to do some premature optimization, this shows how to cast a `&'mut [Cell<u8>]` to `&mut [u8]`:
        // https://github.com/wasmerio/wasmer/blob/0.13.1/lib/wasi/src/syscalls/mod.rs#L79-L81
        let raw_cells = cells as *const [_] as *const u8;
        let len = region.length as usize;
        let mut result = vec![0u8; len];
Member

Does Rust allow us to create a vector of the correct length without zeroing it? There is no point in writing all zeros here and then overwriting them in the next line.

Contributor

@maurolacy maurolacy Jan 21, 2021

Seems that's the preferred, and one of the fastest, ways. See rust-lang/rust#54628. From what they say, vec! has special behaviour when its first argument is zero.

Member

It might be the best option for initialized memory. But we only need uninitialized memory that can hold whatever data was there before, instead of writing zeros into it first.

We can do

// Allocate vector big enough for len elements.
let mut result = Vec::with_capacity(len);

// write into the vector

// Mark the first len elements of the vector as being initialized.
unsafe {
    result.set_len(len);
}

See also this example: https://doc.rust-lang.org/std/vec/struct.Vec.html#examples-18.

Then we write to this memory only once instead of twice.
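
(Put together as a standalone sketch of that pattern; the source here is a plain byte slice for simplicity — in read_region it would be the raw cell pointer from the cast above:)

fn copy_to_uninit_vec(src: &[u8]) -> Vec<u8> {
    let len = src.len();
    // Allocate a vector with enough capacity, but don't zero-fill it
    // (unlike vec![0u8; len]).
    let mut result: Vec<u8> = Vec::with_capacity(len);
    unsafe {
        // Write into the still-uninitialized buffer exactly once.
        std::ptr::copy_nonoverlapping(src.as_ptr(), result.as_mut_ptr(), len);
        // Mark the first len elements as initialized; sound because we just
        // wrote all of them and the capacity is at least len.
        result.set_len(len);
    }
    result
}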

Contributor

Nice.

I was thinking, on a related note: do we want to do the same for write_region?

Member

@webmaster128 webmaster128 Jan 21, 2021

I think we should do both in parallel, in one PR. Feel free to cherry pick the work from here to give credit to the author and open a new PR including both.

Contributor

@maurolacy maurolacy Jan 21, 2021

OK, I can do that, and in fact I already have the branch created for the benchmarks.

Another option would be for @slave5vw to add these changes, along with the ones requested above, so that he gets the credit directly.

@slave5vw, what do you think?

Contributor

@maurolacy maurolacy Jan 21, 2021

In any case, I've already implemented these in the read_region-copy branch, in order to benchmark them.

@webmaster128
Member

webmaster128 commented Jan 21, 2021

Thank you for bringing this up and proposing a solution. After careful consideration we decided this should really be implemented in Wasmer, not here. I opened wasmerio/wasmer#2035. The reason is that if we put so much effort into reasoning about whether or not this is a safe operation, it should be available to all users of Wasmer, not just us. In the unlikely case that the feature request is rejected by Wasmer, we'll come back to it here.

@slave5vw
Author

That's a good decision. If necessary, I will comment in the wasmer issue.

@webmaster128
Member

webmaster128 commented Jan 22, 2021

That's a good decision. If necessary, I will comment in the wasmer issue.

If you can support the motivation with benchmarks from Wasmer 1.0, that is definitely helpful. Otherwise just subscribe. We will probably start working on an implementation next week.
