refactor: Change to copy function in read_region #730
Conversation
When using a for statement, optimizations such as SIMD copies are impossible, which adds overhead.
@maurolacy could you look into this proposal and let us know what you think? Also check how the latest Wasmer code implements the raw copy in WASI. If we want to use this or a similar implementation, we should be heavily testing read_region (I don't know to what degree this is done at the moment).
Will look into it in more detail during the day, but (except for the formatting issues) this looks good to me already. The for loop seems to be there just because of the
Yeah, this was marked as possibly inefficient, but safe. Good to revisit what wasmer has done, and maybe ask them what they recommend. The PR may open us up to panics/UB when getting an overly large region size for which no memory was allocated (malicious creation of a Region on the contract side). There is an equivalent call in
Thanks for the comments! First, Cell::get() in the previous code is implemented as follows. So I think that if the mentioned unallocated-memory or oversized-region problem can occur, the Cell-based approach has the same problem as the direct cast, because this is an issue that has nothing to do with race conditions when single-threaded. Cells do not guarantee access to unallocated memory areas; I think WasmPtr is responsible for this. (Btw, I think there are no problems with race conditions until the wasm threads feature spec is implemented. And even after the threads feature is implemented, I think we will not use that feature.)
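As a side note, a minimal self-contained sketch (illustration only, not CosmWasm code) of why the two read paths are equivalent: `Cell<u8>` is `repr(transparent)` over `u8`, so a `&[Cell<u8>]` points at exactly the same bytes as a `&[u8]`, and element-wise `Cell::get` reads touch the same memory as a raw-pointer cast does:

```rust
use std::cell::Cell;

// Read element by element through the Cell API (what the old for loop did).
fn read_via_cells(cells: &[Cell<u8>]) -> Vec<u8> {
    cells.iter().map(|c| c.get()).collect()
}

// Read through a raw pointer cast. `Cell<u8>` is repr(transparent) over u8,
// so the pointer cast is layout-compatible and reads the same bytes.
fn read_via_cast(cells: &[Cell<u8>]) -> Vec<u8> {
    let ptr = cells.as_ptr() as *const u8;
    unsafe { std::slice::from_raw_parts(ptr, cells.len()) }.to_vec()
}

fn main() {
    let data: Vec<Cell<u8>> = vec![1u8, 2, 3, 4].into_iter().map(Cell::new).collect();
    assert_eq!(read_via_cells(&data), read_via_cast(&data));
}
```

Either way, both paths dereference whatever memory `WasmPtr::deref` handed out, so neither adds or removes bounds safety on its own.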
Yes, I agree. I was taking a look yesterday, and the issues @ethanfrey mentions are related to potential attacks on the memory. As you say, these changes are equivalent, as far as I can see, to the original version in terms of security / safety.
@slave5vw can you please fix the format (
Oh, I forgot. Thanks!
One more comment. Are the benchmarking scripts used to show the performance gain in the repo already? Or part of this PR? It would be good to start tracking that. And maybe we can improve further as well (the write_memory section, for example).
Oh, unfortunately, the company's current policy makes it difficult to share the benchmark tool. Sorry. The profiler tool I used is Instruments on macOS. You can use the counter library.
Well, we can add benchmarks to measure. @slave5vw if you want to do it, go ahead. Otherwise, I'll take it. In any case, let's do it in a different PR, so we can easily see the impact of this.
@maurolacy I would appreciate it if you could take on the benchmark work.
We have higher-level benchmarks that can benchmark all public interfaces of cosmwasm-vm. Those include contract executions. I would not go into microbenchmarking at this point, because what does it help if the call gets 20x faster but that speedup is not relevant in the bigger picture?
@slave5vw or @maurolacy can someone run those benchmarks on It would be nice to demo the speedup of
I'll do it.
@webmaster128 It's hard to know what it means,
Here are some results:

"execute handle" (calls
"execute handle" (calls

So, this is in fact worse for singlepass(!) and "irrelevant" (within the measured calls, of course) for cranelift. I'm assuming these benchmarks are comparable. I'll run again locally, to see if there are spurious factors / variations here.

References:
Hmm, the CI test result is strange. The benchmark result, as a Tendermint log, will be left as a comment tomorrow; I'm already off work. And could you tell me which contract was used in the test? It looks a lot shorter than erc20's execution.
OK, I've run the benchmarks again on both branches, and now the results make more sense.

"execute handle" (calls call_handle) in main:
"execute handle" (calls call_handle) in read_region-copy (a branch I created for this):

That's an improvement of ~5% for My opinion is that it wouldn't hurt either. wasmer's

References:
We're using the (optimized)
Fine with me
Can we please add a reference comment to the latest version of that code at the place where we do the unsafe operation? Wasmer changed a lot between 0.13.1 and 1.0.
Good point. There you go: https://github.com/wasmerio/wasmer/blob/1.0.0/lib/wasi/src/syscalls/mod.rs#L82-L84
@@ -39,12 +40,11 @@ pub fn read_region(memory: &wasmer::Memory, ptr: u32, max_length: usize) -> VmRe

```rust
match WasmPtr::<u8, Array>::new(region.offset).deref(memory, 0, region.length) {
    Some(cells) => {
        // In case you want to do some premature optimization, this shows how to cast a `&'mut [Cell<u8>]` to `&mut [u8]`:
        // https://github.com/wasmerio/wasmer/blob/0.13.1/lib/wasi/src/syscalls/mod.rs#L79-L81
        let raw_cells = cells as *const [_] as *const u8;
        let len = region.length as usize;
        let mut result = vec![0u8; len];
```
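For context, a self-contained sketch (an assumption about the intent, not the exact PR code) of how the copy is then completed from a `&[Cell<u8>]` such as the one returned by `WasmPtr::deref`: cast the cell slice to a raw byte pointer and copy it into the result vector in one memcpy-style call.

```rust
use std::cell::Cell;

// Copy the bytes behind a cell slice into a freshly allocated Vec in a single
// bulk copy instead of a per-byte loop. The cast is sound because Cell<u8>
// is repr(transparent) over u8.
fn copy_cells(cells: &[Cell<u8>]) -> Vec<u8> {
    let raw_cells = cells as *const [Cell<u8>] as *const u8;
    let len = cells.len();
    let mut result = vec![0u8; len];
    unsafe {
        std::ptr::copy_nonoverlapping(raw_cells, result.as_mut_ptr(), len);
    }
    result
}

fn main() {
    let cells: Vec<Cell<u8>> = b"hello".iter().map(|&b| Cell::new(b)).collect();
    assert_eq!(copy_cells(&cells), b"hello".to_vec());
}
```

The review comments below discuss whether the zero-initialization of `result` can be avoided entirely.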
Does Rust allow us to create a vector of the correct length without zeroing it? There is no point in writing all zeros here and then overwriting them in the next line.
Seems that's the preferred, and one of the fastest, ways. See rust-lang/rust#54628. From what they say, vec! has special behaviour when its first argument is zero.
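A tiny illustration of that point (a sketch, not PR code): `vec![0u8; len]` goes through a specialized zeroed-allocation path rather than writing zeros element by element, which is why it is one of the faster ways to get an initialized buffer.

```rust
fn main() {
    let len = 1 << 20;
    // The zero-element case of vec! can request already-zeroed pages from the
    // allocator instead of looping over the buffer to write zeros.
    let result = vec![0u8; len];
    assert_eq!(result.len(), len);
    assert!(result.iter().all(|&b| b == 0));
}
```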
It might be the best option for initialized memory. But we only need uninitialized memory, which just holds whatever data was there before, instead of writing zeros there.
We can do

```rust
// Allocate a vector big enough for `len` elements.
let mut result = Vec::with_capacity(len);
// write into the vector
// Mark the first `len` elements of the vector as being initialized.
unsafe {
    result.set_len(len);
}
```
See also this example: https://doc.rust-lang.org/std/vec/struct.Vec.html#examples-18.
Then we write to this memory only once instead of twice.
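Putting the pieces together, a minimal self-contained sketch of that write-once pattern (with `src` standing in for the bytes read out of Wasm memory, a simplification of the real code):

```rust
// Copy `src` into a new Vec, touching the destination memory exactly once:
// allocate uninitialized capacity, memcpy into it, then mark it initialized.
fn copy_region(src: &[u8]) -> Vec<u8> {
    let len = src.len();
    // Allocate a vector big enough for `len` elements, without zeroing.
    let mut result: Vec<u8> = Vec::with_capacity(len);
    unsafe {
        // Write into the uninitialized capacity exactly once.
        std::ptr::copy_nonoverlapping(src.as_ptr(), result.as_mut_ptr(), len);
        // Mark the first `len` elements of the vector as being initialized.
        // Sound only because the copy above initialized all `len` bytes.
        result.set_len(len);
    }
    result
}

fn main() {
    let src = vec![7u8; 1000];
    assert_eq!(copy_region(&src), src);
}
```

The `set_len` call must come after the copy; calling it on untouched capacity would expose uninitialized memory.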
Nice.
I was thinking, on a related note: do we want to do the same for write_region?
I think we should do both in parallel, in one PR. Feel free to cherry pick the work from here to give credit to the author and open a new PR including both.
In any case, I've already implemented these, in the read_region-copy branch, in order to benchmark it.
Thank you for bringing this up and proposing a solution. After careful consideration we decided this should really be implemented in Wasmer, not here. I opened wasmerio/wasmer#2035. The reason is that if we put so much effort into reasoning whether or not this is a safe operation, it should be available to all users of Wasmer, not just us. In the unlikely case that the feature request is rejected by Wasmer, we'll come back to it here. |
That's a good decision. If necessary, I will comment in the wasmer issue. |
If you can support the motivation with benchmarks from Wasmer 1.0, that is definitely helpful. Otherwise just subscribe. We will probably start working on an implementation next week.
Motivation
Due to wasmer's memory type patch, the dynamic memory type was changed to a static memory type, resulting in a large performance improvement. wasmerio/wasmer#1299
As a result, fewer bounds-check instructions are emitted than with dynamic memory types, which reduces code size and improves performance. It works for all backends.
However, after it was applied, the performance improvement measured with erc20 in cosmwasm was not recorded consistently
(at least 30%, up to 200%; i.e. with singlepass, 418 txs -> 532 ~ 870 txs).
So, I had to find out where the difference occurred.
The log below is the measurement result when a JIT-compiled function is called from wasmer.
For each transfer, a difference of about 50us occurs irregularly.
This is by no means small compared to the total execution time.
However, it is difficult to know from this information alone which part of the handle function is the cause.
Analyzing with the profiler tool, I found that every time execution was slow, read_region took up the time.
Due to the low resolution of the profiler, read_region was only captured when it was slow, not when it was fast.
(I'd like to write more details about the profiling process, but it would take a long time, so I'll skip it.)
Proposed Solution
As you know well, optimization is difficult if you copy byte by byte with a for statement.
Using a method like memcpy can lead to optimizations such as SIMD copying.
It is effective even if the memory size is not particularly large.
As a result of patching read_region, throughput improved
(with singlepass, 532 ~ 870 txs -> 876 ~ 913 txs).
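To make the difference concrete, a hypothetical micro-example (not the PR code): the element-wise loop is the pattern the old code used, while `copy_from_slice` lowers to a memcpy that backends can vectorize.

```rust
// Per-byte copy: each iteration carries a bounds check and a single-byte
// store, which is hard for the backend to turn into SIMD.
fn copy_with_loop(src: &[u8], dst: &mut [u8]) {
    for i in 0..src.len() {
        dst[i] = src[i];
    }
}

// Bulk copy: compiles down to a memcpy; panics if the lengths differ.
fn copy_with_memcpy(src: &[u8], dst: &mut [u8]) {
    dst.copy_from_slice(src);
}

fn main() {
    let src = vec![42u8; 64];
    let mut a = vec![0u8; 64];
    let mut b = vec![0u8; 64];
    copy_with_loop(&src, &mut a);
    copy_with_memcpy(&src, &mut b);
    assert_eq!(a, b);
}
```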
In the existing code comment, I found that this issue was already known.
So, in case there was a reason for the for-statement method, I'd like to hear your opinion on the change.