
refactor: Change to copy function in read_region #730

Closed

Conversation


@slave5vw slave5vw commented Jan 19, 2021

Motivation

Wasmer's memory type patch (wasmerio/wasmer#1299) changed the dynamic memory type to a static memory type, resulting in a large performance improvement:
static memories need far fewer bounds-check instructions than dynamic memories, which reduces code size and speeds up execution. This holds for all backends.

However, after applying the patch and measuring with erc20 in cosmwasm, the improvement was not recorded consistently
(ranging from about 30% up to 200%; e.g. with singlepass, 418 txs -> 532 ~ 870 txs).
So I had to find out where the difference came from.

The log below shows the timings when the JIT-compiled functions are called from wasmer.
For each transfer, a difference of about 50µs shows up irregularly.

address 3608BA37 is handle

debug: func "deallocate" | func:360B5DE8 | elapsed:4.838µs
debug: func "allocate" | func:360B5709 | elapsed:15.843µs
debug: func "allocate" | func:360B5709 | elapsed:4.115µs
debug: func "handle" | func:360B5709 | elapsed:5.189µs
func:3608BA37 | elapsed:199.764µs
debug: func "deallocate" | func:360B5DE8 | elapsed:4.826µs
debug: func "allocate" | func:360B5709 | elapsed:18.979µs
debug: func "allocate" | func:360B5709 | elapsed:3.872µs
debug: func "handle" | func:360B5709 | elapsed:5.222µs
func:3608BA37 | elapsed:207.639µs
debug: func "deallocate" | func:360B5DE8 | elapsed:4.667µs
debug: func "allocate" | func:360B5709 | elapsed:21.81µs
debug: func "allocate" | func:360B5709 | elapsed:4.543µs
debug: func "handle" | func:360B5709 | elapsed:5.699µs
func:3608BA37 | elapsed:236.879µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.159µs
debug: func "allocate" | func:360B5709 | elapsed:21.678µs
debug: func "allocate" | func:360B5709 | elapsed:4.599µs
debug: func "handle" | func:360B5709 | elapsed:6.005µs
func:3608BA37 | elapsed:297.624µs
debug: func "deallocate" | func:360B5DE8 | elapsed:6.274µs
debug: func "allocate" | func:360B5709 | elapsed:19.402µs
debug: func "allocate" | func:360B5709 | elapsed:4.457µs
debug: func "handle" | func:360B5709 | elapsed:5.347µs
func:3608BA37 | elapsed:213.839µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.012µs
debug: func "allocate" | func:360B5709 | elapsed:15.524µs
debug: func "allocate" | func:360B5709 | elapsed:4.157µs
debug: func "handle" | func:360B5709 | elapsed:5.215µs
func:3608BA37 | elapsed:197.728µs
debug: func "deallocate" | func:360B5DE8 | elapsed:4.924µs
debug: func "allocate" | func:360B5709 | elapsed:28.02µs
debug: func "allocate" | func:360B5709 | elapsed:5.18µs
debug: func "handle" | func:360B5709 | elapsed:8.691µs
func:3608BA37 | elapsed:360.173µs
debug: func "deallocate" | func:360B5DE8 | elapsed:7.916µs
debug: func "allocate" | func:360B5709 | elapsed:20.315µs
debug: func "allocate" | func:360B5709 | elapsed:4.221µs
debug: func "handle" | func:360B5709 | elapsed:5.81µs
func:3608BA37 | elapsed:244.343µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.193µs
debug: func "allocate" | func:360B5709 | elapsed:21.005µs
debug: func "allocate" | func:360B5709 | elapsed:4.536µs
debug: func "handle" | func:360B5709 | elapsed:5.448µs
func:3608BA37 | elapsed:236.151µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.057µs
debug: func "allocate" | func:360B5709 | elapsed:21.288µs
debug: func "allocate" | func:360B5709 | elapsed:4.263µs
debug: func "handle" | func:360B5709 | elapsed:5.594µs
func:3608BA37 | elapsed:238.434µs
debug: func "deallocate" | func:360B5DE8 | elapsed:5.078µs

This is by no means small compared to the total execution time.
However, from this information alone it is hard to tell which part of the handle call is responsible.

Analyzing with a profiler, I found that every time execution was slow, read_region was taking up the time.
Due to the profiler's low resolution, read_region was only captured when it was slow, not when it was fast.
(I'd like to write up the profiling process in more detail, but that would take a long time, so I'll skip it.)

(profiler screenshots)

Proposed Solution

As you know, copying byte by byte in a for statement is hard for the compiler to optimize.
Copying with a memcpy-like call instead enables optimizations such as SIMD copies; a rough sketch is shown below.
This pays off even when the copied region is not particularly large.
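
To illustrate the idea (a minimal standalone sketch, not the exact code of this PR; it only contrasts the two copy strategies over the &[Cell<u8>] slice that WasmPtr::deref returns):

use std::cell::Cell;

// For-loop version (roughly what the existing code does): one bounds-checked
// byte copy per iteration, which the optimizer struggles to vectorize.
fn copy_cells_loop(cells: &[Cell<u8>]) -> Vec<u8> {
    let mut result = vec![0u8; cells.len()];
    for (i, cell) in cells.iter().enumerate() {
        result[i] = cell.get();
    }
    result
}

// Copy-based version (the idea behind this PR): treat the cell slice as a raw
// byte pointer and copy it in one call, which can become a memcpy/SIMD copy.
// Cell<u8> has the same memory layout as u8 (repr(transparent)).
fn copy_cells_memcpy(cells: &[Cell<u8>]) -> Vec<u8> {
    let len = cells.len();
    let mut result = vec![0u8; len];
    let src = cells.as_ptr() as *const u8;
    unsafe {
        std::ptr::copy_nonoverlapping(src, result.as_mut_ptr(), len);
    }
    result
}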

After patching read_region,

  • performance became stable
  • and improved by up to 10%
    (with singlepass, 532 ~ 870 txs -> 876 ~ 913 txs)

From the comment in the code, I can see that this had already come up before.
So if there was a reason for sticking with the for-statement approach, I'd like to hear your opinion on it.

When using a for statement, optimizations such as SIMD copies are impossible, which adds overhead.
@webmaster128
Member

@maurolacy could you look into this proposal and let us know what you think? Also check how the latest Wasmer code implements the raw copy in WASI. If we want to use this or a similar implementation, we should be testing read_region heavily (I don't know to what degree this is done at the moment).

Base automatically changed from master to main January 19, 2021 22:43
@maurolacy
Contributor

maurolacy commented Jan 20, 2021

Will look into it in more detail during the day, but (except for the formatting issues) this looks good to me already.

The for loop seems to be there just because of the Cell encapsulation, i.e. cautiously avoiding premature optimization.

@ethanfrey
Member

ethanfrey commented Jan 20, 2021

Yeah, this was marked as possibly inefficient, but safe.

Good to revisit what wasmer has done, and maybe ask them what they recommend. The PR may open us up to panics/UB when given an overly large region size for which no memory was allocated (malicious creation of a Region on the contract side).

There is an equivalent call in write_region, which I would think is safer to optimize (the bounds of the region we copy come from trusted go/vm code, not untrusted contracts). We still need to sanity check that there is enough memory available for it (e.g. the contract implements a lying allocate function); a sketch of such a check follows below.
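
(For illustration only — this is not the actual cosmwasm-vm code; the Region capacity field and the error handling are simplified assumptions:)

// Sketch of the sanity check discussed above: before writing `data` into a
// Region returned by the contract's allocate(), verify the claimed capacity,
// since a malicious contract can report a Region that is too small or bogus.
fn check_target_region(region_capacity: u32, data: &[u8]) -> Result<(), String> {
    if data.len() > region_capacity as usize {
        return Err(format!(
            "region too small: need {} bytes, contract allocated {}",
            data.len(),
            region_capacity
        ));
    }
    Ok(())
}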

@slave5vw
Author

slave5vw commented Jan 21, 2021

The PR may open us up to panics/UB when given an overly large region size for which no memory was allocated (malicious creation of a Region on the contract side).

Thanks for the comments!
However, I would like to carefully offer a different opinion.

First, Cell::get() in the previous code is implemented as follows.
It is almost the same as an unsafe cast; the difference is that race-condition safety is guaranteed because Cell is !Sync.
https://doc.rust-lang.org/src/core/cell.rs.html#429-433
https://doc.rust-lang.org/src/core/cell.rs.html#1711
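
(A standalone sketch, not code from this PR, showing that reading through Cell::get and reading through the raw cast used here yield the same bytes — Cell::get itself is just a pointer dereference of the inner UnsafeCell, with no extra checks:)

use std::cell::Cell;

fn main() {
    // Four bytes wrapped in Cell, standing in for the &[Cell<u8>] that WasmPtr::deref returns.
    let bytes: [Cell<u8>; 4] = [Cell::new(1), Cell::new(2), Cell::new(3), Cell::new(4)];

    // Reading through Cell::get — internally just `unsafe { *self.value.get() }`.
    let via_cell: Vec<u8> = bytes.iter().map(|c| c.get()).collect();

    // Reading through a raw pointer cast, as this PR does.
    // Cell<u8> has the same memory layout as u8 (repr(transparent)).
    let ptr = bytes.as_ptr() as *const u8;
    let via_cast: Vec<u8> = (0..bytes.len()).map(|i| unsafe { *ptr.add(i) }).collect();

    assert_eq!(via_cell, via_cast);
}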

So I think that if the unallocated-memory or oversized-region problem can occur, the Cell-based approach has the same problem as a direct cast.
In a single-threaded setting this has nothing to do with race conditions, and Cell does not guard access to unallocated memory areas; I think WasmPtr is responsible for that.
Is there an example where this has happened before, or is there anything I have misunderstood? If so, please comment.

(By the way, I don't think there are any race-condition problems until the wasm threads feature is implemented, and even then I don't think we would use that feature.)

@maurolacy
Contributor

So I think that if the unallocated-memory or oversized-region problem can occur, the Cell-based approach has the same problem as a direct cast.

Yes, I agree.

I was taking a look yesterday, and the issues @ethanfrey mentions are related to potential attacks on the memory.
I think it would be very interesting to analyze the wasm / wasmer memory model from a security / safety standpoint, but that is clearly beyond the changes this PR is suggesting.

As you say, these changes are, as far as I can see, equivalent to the original version in terms of security / safety.

@maurolacy
Contributor

@slave5vw can you please fix the format (cargo fmt), and generally make the CI happy?

@slave5vw
Author

slave5vw commented Jan 21, 2021

@slave5vw can you please fix the format (cargo fmt), and generally make the CI happy?

Oh, I forgot. Thanks!

@ethanfrey
Member

One more comment. Are the benchmarking scripts used to show the performance gain in the repo already? Or part of this PR?

It would be good to start tracking that. And maybe we can also improve things further (with the write_memory part, for example).

@slave5vw
Author

slave5vw commented Jan 21, 2021

Oh, unfortunately, the company's current policy makes it difficult to share the benchmark tool. Sorry.
The operation is simple, though: after creating multiple accounts, it repeatedly executes erc20 transfers.

The profiler I used is Instruments on macOS, with its counter instrument.
With it, you can view profiling results for just the desired time interval via the timeline, which was helpful in identifying this issue. (Other profilers usually only show aggregate results from start to finish, which was not helpful here.)

@maurolacy
Contributor

maurolacy commented Jan 21, 2021

Well, we can add benchmarks to measure read/write_region directly, and we should be able to see the performance gain of this change.

@slave5vw if you want to do it, go ahead. Otherwise, I'll take it.

In any case, let's do it in a different PR, so we can easily see the impact of this.
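
(As a rough illustration of what such a micro-benchmark could look like — a criterion sketch comparing the two copy strategies over a plain Cell<u8> buffer rather than over a real wasmer Memory; the file name, buffer size and bench names are made up:)

// benches/region_copy.rs (illustrative only)
use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::cell::Cell;

fn bench_copies(c: &mut Criterion) {
    let data: Vec<Cell<u8>> = (0..64 * 1024).map(|i| Cell::new(i as u8)).collect();

    c.bench_function("copy via for loop", |b| {
        b.iter(|| {
            let cells = black_box(&data[..]);
            let mut out = vec![0u8; cells.len()];
            for (i, cell) in cells.iter().enumerate() {
                out[i] = cell.get();
            }
            out
        })
    });

    c.bench_function("copy via ptr::copy_nonoverlapping", |b| {
        b.iter(|| {
            let cells = black_box(&data[..]);
            let mut out = vec![0u8; cells.len()];
            unsafe {
                std::ptr::copy_nonoverlapping(cells.as_ptr() as *const u8, out.as_mut_ptr(), cells.len());
            }
            out
        })
    });
}

criterion_group!(benches, bench_copies);
criterion_main!(benches);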

@slave5vw
Author

@maurolacy I would appreciate it if you could take on the benchmark work.
As for write_region, the change made little difference there, so I didn't include it in the PR. But that was a limited test with erc20 only, so it may still make sense for other samples.

@webmaster128
Member

We have higher-level benchmarks that can benchmark all public interfaces of cosmwasm-vm. Those include contract executions. I would not go into micro-benchmarking at this point, because what does it help if the call gets 20x faster but that speedup is not relevant in the bigger picture?

@ethanfrey
Member

ethanfrey commented Jan 21, 2021

We have higher-level benchmarks that can benchmark all public interfaces of cosmwasm-vm.

@slave5vw or @maurolacy can someone run those benchmarks on main and then on this branch and post the results here? (Or do we track performance inside the repo in a markdown file?)

It would be nice to demo the speedup of call_handle (1%? 20%?).

@maurolacy
Contributor

can someone run those benchmarks on main and then on this branch

I'll do it.

@slave5vw
Author

slave5vw commented Jan 21, 2021

but that speedup is not relevant in the bigger picture?

@webmaster128 I'm not sure what you mean by that.
The exact purpose of this PR is to stabilize the irregular performance that appears after wasmer's static-bounds patch,
wasmerio/wasmer#1299. (Note that this patch works regardless of the backend, not just for cranelift.)
If you want to check the irregular performance independently, without side effects, I recommend applying that patch to cosmwasm 0.10.1 and wasmer 0.17.0.

maurolacy added a commit that referenced this pull request Jan 21, 2021
@maurolacy
Contributor

maurolacy commented Jan 21, 2021

Here are some results:

"execute handle" (calls call_handle) in main:
singlepass: [47.191 us 48.384 us 50.100 us]
cranelift: [44.035 us 45.701 us 47.007 us]

"execute handle" (calls call_handle) in read_region-copy (branch I created for this):
singlepass: [56.238 us 57.568 us 59.715 us]
cranelift: [43.016 us 45.889 us 49.848 us]

So, this is in fact worse for singlepass(!) and "irrelevant" (within the measured calls, of course) for cranelift. I'm assuming these benchmarks are comparable. I'll run them again locally, to see if there are spurious factors or variations here.

References:
main latest benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2172/workflows/38460158-6bb6-4d6f-87ab-75b6a0bd3d6b/jobs/21182
read_region-copy benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2174/workflows/59a2fb13-860d-4e5e-9d69-11fec9e158d3/jobs/21209

@slave5vw
Author

slave5vw commented Jan 21, 2021

Hmm, the CI benchmark result looks strange. I'll post my benchmark results (as Tendermint logs) in a comment tomorrow; I'm already off work.

Could you tell me which contract was used in the test? It looks a lot shorter than the erc20 execution.

@maurolacy
Contributor

maurolacy commented Jan 21, 2021

OK, I've run the benchmarks again in both branches, and now the results make more sense.

"execute handle" (calls call_handle) in main:
singlepass: [49.432 us 51.169 us 52.996 us]
cranelift: [42.637 us 43.724 us 44.907 us]

"execute handle" (calls call_handle) in read_region-copy (branch I created for this):
singlepass: [47.547 us 48.810 us 50.255 us]
cranelift: [41.843 us 42.505 us 42.989 us]

That's an improvement of ~5% for read_region with copy. So, regarding these benchmarks / use cases, this change is not that relevant.

My opinion is that it wouldn't hurt either. wasmer's read_bytes does something similar (using casts over an array of regions).

References:
main benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2172/workflows/1820727a-8858-4ebc-b176-8988c0ea1ec0/jobs/21223
read_region-copy benchs: https://app.circleci.com/pipelines/github/CosmWasm/cosmwasm/2174/workflows/4107fb3b-8821-4666-a584-20d840a82df4/jobs/21237

@maurolacy
Contributor

maurolacy commented Jan 21, 2021

Could you tell me which contract was used in the test? It looks a lot shorter than the erc20 execution.

We're using the (optimized) hackatom contract. You can take a look at these benchmarks here: https://github.com/CosmWasm/cosmwasm/blob/main/packages/vm/benches/main.rs#L56

@webmaster128
Member

webmaster128 commented Jan 21, 2021

My opinion is that it wouldn't hurt either.

Fine with me

wasmer's read_bytes does something similar (using casts over an array of regions).

Can we please add a reference comment pointing to the latest version of that code at the place where we do the unsafe operation? Wasmer changed a lot between 0.13.1 and 1.0.

@maurolacy
Contributor

Can we please add a reference comment pointing to the latest version of that code at the place where we do the unsafe operation? Wasmer changed a lot between 0.13.1 and 1.0.

Good point. There you go: https://github.com/wasmerio/wasmer/blob/1.0.0/lib/wasi/src/syscalls/mod.rs#L82-L84

@@ -39,12 +40,11 @@ pub fn read_region(memory: &wasmer::Memory, ptr: u32, max_length: usize) -> VmRe

match WasmPtr::<u8, Array>::new(region.offset).deref(memory, 0, region.length) {
    Some(cells) => {
        // In case you want to do some premature optimization, this shows how to cast a `&'mut [Cell<u8>]` to `&mut [u8]`:
        // https://github.com/wasmerio/wasmer/blob/0.13.1/lib/wasi/src/syscalls/mod.rs#L79-L81
        let raw_cells = cells as *const [_] as *const u8;
        let len = region.length as usize;
        let mut result = vec![0u8; len];
Member

Does Rust allow us to create a vector of the correct length without zeroing it? There is no point in writing all zeros here and then overwriting them in the next line.

Contributor

@maurolacy maurolacy Jan 21, 2021

Seems that's the preferred, and one of the fastest, ways. See rust-lang/rust#54628. From what they say, vec! has special behaviour when its first argument is zero.

Member

It might be the best option for initialized memory. But we only need uninitialized memory that can hold whatever data was there before, instead of writing zeros into it first.

We can do

// Allocate vector big enough for len elements.
let mut result = Vec::with_capacity(len);

// write into the vector

// Mark the first len elements of the vector as being initialized.
unsafe {
    result.set_len(len);
}

See also this example: https://doc.rust-lang.org/std/vec/struct.Vec.html#examples-18.

Then we write to this memory only once instead of twice.
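
(Put together as a standalone sketch of that pattern; the source here is a plain byte slice for simplicity — in read_region it would be the raw cell pointer from the cast above:)

fn copy_to_uninit_vec(src: &[u8]) -> Vec<u8> {
    let len = src.len();
    // Allocate a vector with enough capacity, but don't zero-fill it
    // (unlike vec![0u8; len]).
    let mut result: Vec<u8> = Vec::with_capacity(len);
    unsafe {
        // Write into the still-uninitialized buffer exactly once.
        std::ptr::copy_nonoverlapping(src.as_ptr(), result.as_mut_ptr(), len);
        // Mark the first len elements as initialized; sound because we just
        // wrote all of them and the capacity is at least len.
        result.set_len(len);
    }
    result
}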

Contributor

Nice.

I was thinking, on a related note: do we want to do the same for write_region?

Member

@webmaster128 webmaster128 Jan 21, 2021

I think we should do both in parallel, in one PR. Feel free to cherry pick the work from here to give credit to the author and open a new PR including both.

Contributor

@maurolacy maurolacy Jan 21, 2021

OK, I can do that, and in fact I already have the branch created for the benchmarks.

Another option would be for @slave5vw to add these changes, along with the ones requested above, so that he gets the credit directly.

@slave5vw, what do you think?

Contributor

@maurolacy maurolacy Jan 21, 2021

In any case, I've already implemented these in the read_region-copy branch, in order to benchmark them.

@webmaster128
Member

webmaster128 commented Jan 21, 2021

Thank you for bringing this up and proposing a solution. After careful consideration we decided this should really be implemented in Wasmer, not here. I opened wasmerio/wasmer#2035. The reason is that if we put so much effort into reasoning about whether or not this is a safe operation, it should be available to all users of Wasmer, not just us. In the unlikely case that the feature request is rejected by Wasmer, we'll come back to it here.

@slave5vw
Author

That's a good decision. If necessary, I will comment in the wasmer issue.

@webmaster128
Member

webmaster128 commented Jan 22, 2021

That's a good decision. If necessary, I will comment in the wasmer issue.

If you can support the motivation with benchmarks from Wasmer 1.0, that is definitely helpful. Otherwise just subscribe. We will probably start working on an implementation next week.
