How to reconstruct your benchmarks? #1

Open
Robbepop opened this issue Feb 19, 2022 · 44 comments

@Robbepop

Robbepop commented Feb 19, 2022

Hi, I am one of the authors of wasmi and got curious about your s1vm, which claims to be significantly (approx. 2.2 times) faster than wasmi for some fibonacci test case.
I got seriously interested in your approach because of this claim and tried to reproduce your benchmarks.

My results so far with fib(25):

  • s1vm: 9.668629ms
  • wasmi_v0: 21.343 ms ~120% slower
  • wasmi_v1: 12.966 ms ~30% slower

Note that wasmi_v0 uses an old engine whereas wasmi_v1 is the most recent engine. Both use stack machine based bytecode, which is naturally inferior performance-wise to the register machine bytecode used in both Wasm3 and s1vm. This paper indicates roughly a 50%-150% performance improvement from switching over to register machine based bytecode. So if wasmi_v1 were to use register machine based bytecode it would probably not be ~30% slower but rather ~35% faster than s1vm. In fact there is an issue open for wasmi to implement register machine based bytecode, however, we are not yet sure which instruction dispatching technique to use. For this we are currently researching the most effective dispatching techniques, and closure-based approaches so far do not seem to be very promising, unfortunately.

To reconstruct my benchmarks you can check out wasmi using this branch, and a fork of your s1vm using this link.

For wasmi simply use cargo bench fib_recursive and for the s1vm fork use cargo run --release -- fib.wasm fib_recursive 30.

@Neopallium
Owner

I think my results are from about 2 years ago. I will try to make some time this week to re-run the benchmarks. It is possible that wasmi has improved a lot since then.

My main goal with s1vm was to just try out some ideas for writing a VM in Rust. I have experience with VMs written in C. Since Rust doesn't guarantee tail-calls, the method used by wasm3 can't be used in Rust.

s1vm does more than just use closures for dispatch. If I remember correctly, switching to closures didn't make a big improvement on their own, but they do allow for some interesting optimizations/specializations.

  1. During the bytecode -> closure compiling step, track inputs to each opcode (local, constants, stack, etc..).
  2. Most stack values can be traced back to the instruction that output it. Most of these stack writes/reads can be eliminated. This decreases usage of the wasm runtime value stack.

Looking at your closure based tests they don't use the same nesting that s1vm does:

  1. They still pop inputs and push results (using the wasm runtime stack).
  2. Still increments the PC counter (even for the nested closures which shouldn't need it?).

A virtual stack is built during the compile phase to track the inputs/outputs for each opcode:

enum Input {
  Local(u32), // Normally a function argument.
  Const(StackValue), // Constant value from the wasm bytecode.
  Op(OpFunc), // Closure that produces a value (the opcode pushes a stack value).
}

This is code cut from the bytecode -> closure compiler here

// For binops (add, sub, etc.), get the two `Input` values from the virtual stack (compile time).
let right = state.pop()?;
let left = state.pop()?;
match left {
    /* ... CUT OTHER LEFT matches: local/constant. */
    Input::Op(left_closure) => {
        match right {
            /* ... CUT OTHER RIGHT matches: local/constant. */
            Input::Op(right_closure) => {
                // Push the new closure onto the virtual stack (compile time).
                state.push(Input::Op(Box::new(
                    ///////////////// BINOP closure:
                    // Two parameters: `state` is the immutable WASM state (functions, code),
                    // `store` is the mutable WASM instance state (memory, runtime stack).
                    move |state: &vm::State, store: &mut Store| -> Trap<StackValue> {
                        let left = left_closure(state, store)?;   // <-- call the closure that produces the left value.
                        let right = right_closure(state, store)?; // <-- call the closure that produces the right value.

                        let res = (left.0 as $type).$op(right.0 as $type); // <-- the opcode's code.

                        Ok(StackValue(res as _)) // <-- return the result (native stack, not the wasm runtime stack).
                    },
                    /////////////////
                )));
                return Ok(());
            },
        }
    },
    /* ... CUT OTHER LEFT matches. */
}

@Robbepop
Author

Robbepop commented Feb 21, 2022

Since Rust doesn't guarantee tail-calls, the method used by wasm3 can't be used in Rust.

While Rust does not support guaranteed tail calls we can still make use of them when compiling for --release, since the optimizer will then usually apply tail-call optimization if the code is written in a particular way. I did that in the linked research repo and found that the performance of tail-call dispatch is pretty much the same as the simple switch-dispatch. Maybe there will be some gains once proper tail calls via the become keyword are implemented in Rust, but so far it doesn't look like it.
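
To illustrate the pattern (a minimal, self-contained sketch with made-up instructions, not code from the research repo): the call that dispatches the next instruction is the last expression of each handler, which LLVM usually turns into a jump in optimized builds.

#[derive(Copy, Clone)]
enum Op {
    AddImm(i64),
    Halt,
}

// The recursive call sits in tail position. Rust gives no guarantee, but in
// --release builds LLVM will usually compile this into a plain jump.
fn dispatch(pc: usize, code: &[Op], acc: i64) -> i64 {
    match code[pc] {
        Op::AddImm(imm) => dispatch(pc + 1, code, acc + imm), // tail call
        Op::Halt => acc,
    }
}

fn main() {
    let code = [Op::AddImm(2), Op::AddImm(3), Op::Halt];
    assert_eq!(dispatch(0, &code, 0), 5);
}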

You are right that my closure based approaches for instruction dispatch are not entirely similar to yours, however, the closure-block dispatch still has plenty of nested closure calls very similar to your approach, which should be comparable. It is also by far the slowest dispatch. However, it could very well be that I made some mistakes in the implementation.

I will probably try out a fully nested closure based dispatch soon.

Looking at your closure based tests they don't use the same nesting that s1vm does:

  • They still pop inputs and push results (using the wasm runtime stack).
  • Still increments the PC counter (even for the nested closures which shouldn't need it?).

All the dispatching approaches in the research repo use a register machine as the backbone so no push or pop operations. I wonder what you mean. Maybe there is some confusion.
The tail-call based dispatches do use a pc counter since they are still dispatching over a slice of instructions, i.e. goto based branching.

The main question in the room to me is: Are closures actually worth it for dispatching?
Again: In my research the two fastest dispatching techniques (by far) were the simple switch dispatch and the switch dispatch using tail calls. The closure based dispatch using tail calls (which is the most comparable to your approach) was super slow in comparison.
I see what you are trying to do with them and it would certainly be challenging to model those semantics using an enum representing all the different combinations (which I btw also tried out in the research repo under the fused/ct files).

You could go even further with this approach because inputs and outputs can also be global variables or even linear memory accesses. So you could for example define an add operation that has a global variable as its input and output, or even some linear memory region, so you'd combine the linear memory store or load into the instruction. I tried this out and it did not bear fruit in my case. The reason is probably that with more and more fused instructions (and their combinations) you overflow the instruction cache of the interpreter.

When you benchmark the new v1 engine of wasmi make sure to check out the master branch and use --package wasmi_v1, since otherwise you refer to the old engine. We do have fibonacci benchmarks in wasmi so you could as well just do: cargo bench fib_recursive/v1.

@Neopallium
Owner

Since Rust doesn't guarantee tail-calls, the method used by wasm3 can't be used in Rust.

While Rust does not support guaranteed tail calls we can still make use of them when compiling for --release, since the optimizer will then usually apply tail-call optimization if the code is written in a particular way. I did that in the linked research repo and found that the performance of tail-call dispatch is pretty much the same as the simple switch-dispatch. Maybe there will be some gains once proper tail calls via the become keyword are implemented in Rust, but so far it doesn't look like it.

I am looking forward to Rust adding support for proper tail-calls. The main reason I don't try using tail-calls in Rust is because they only work in release mode, but also because they will not work with async fn as far as I know.

Another reason I had started s1vm was to try building a VM that could support async (suspend/resume) calling. Haven't gotten to that yet with s1vm.

You are right that my closure based approaches for instruction dispatch are not entirely similar to yours, however, the closure-block dispatch still has plenty of nested closure calls very similar to your approach, which should be comparable. It is also by far the slowest dispatch. However, it could very well be that I made some mistakes in the implementation.

It also looks like the test code for the different designs are not using the same set of instructions:
https://github.com/Robbepop/interpreter-dispatch-research/blob/dc4c31f1462b4a8ef2061a7661231513dd705e81/src/closure_tail.rs#L93
vs:
https://github.com/Robbepop/interpreter-dispatch-research/blob/dc4c31f1462b4a8ef2061a7661231513dd705e81/src/switch.rs#L131

The closure designs only have a sub in the loop, the switch designs have a mul and two sub instructions.

I will probably try out a fully nested closure based dispatch soon.

Looking at your closure based tests they don't use the same nesting that s1vm does:

  • They still pop inputs and push results (using the wasm runtime stack).
  • Still increments the PC counter (even for the nested closures which shouldn't need it?).

All the dispatching approaches in the research repo use a register machine as the backbone so no push or pop operations. I wonder what you mean. Maybe there is some confusion. The tail-call based dispatches do use a pc counter since they are still dispatching over a slice of instructions, i.e. goto based branching.

I should have said read/write instead of push/pop. Even with the unsafe code in context.set_reg and context.get_reg those are still memory accesses that the Rust compiler can't eliminate (convert to native registers). While developing s1vm I profiled memory/cache accesses using cachegrind. The closure design used in s1vm is able to eliminate a lot of memory accesses and that is what makes it fast. Also note that there is no unsafe code in s1vm.

It looked like the Instruction handlers are still doing a PC increment here:
https://github.com/Robbepop/interpreter-dispatch-research/blob/dc4c31f1462b4a8ef2061a7661231513dd705e81/src/lib.rs#L71

The main question in the room to me is: Are closures actually worth it for dispatching? Again: In my research the two fastest dispatching techniques (by far) were the simple switch dispatch and the switch dispatch using tail calls. The closure based dispatch using tail calls (which is the most comparable to your approach) was super slow in comparison. I see what you are trying to do with them and it would certainly be challenging to model those semantics using an enum representing all the different combinations (which I btw also tried out in the research repo under the fused/ct files).

It has been a few years, but I think that when I created the first "closure" version it was either the same speed or slower than a loop+switch design. It wasn't until I started to add optimizations/specializations to eliminate the reads/writes to the wasm runtime stack that it started to be faster. At one point I had both VM designs in the repo to make it easier to compare.

You could go even further with this approach because inputs and outputs can also be global variables or even linear memory accesses. So you could for example define an add operation that has a global variable as its input and output, or even some linear memory region, so you'd combine the linear memory store or load into the instruction. I tried this out and it did not bear fruit in my case. The reason is probably that with more and more fused instructions (and their combinations) you overflow the instruction cache of the interpreter.

If you mean the fused folder, it seems to still do memory accesses for the registers.

I think the closure based design will allow for more optimizations:

  1. Detect a loop with a counter (conditional branching) and use a closure that handles the loop logic in Rust. JIT compilers do this to make compiled loops fast.
  2. Merge common patterns of opcodes (add + mul, sub + mul, etc....) into one closure.

@Neopallium
Owner

I just tried out the master branch of wasmi and tested both v0 and v1.

My original benchmarks were just using the time command for total run time. For wasmi I used the wasmi/examples/invoke.rs example to run the fib.wasm file.

Here is a quickly modified invoke.rs to get it to work with wasmi_v1:

extern crate parity_wasm;
extern crate wasmi;

use std::env::args;

use wasmi::RuntimeValue;
use wasmi_v1 as v1;

fn main() {
    let args: Vec<_> = args().collect();
    if args.len() < 3 {
        println!("Usage: {} <wasm file> <exported func> [<arg>...]", args[0]);
        return;
    }
    let (_, program_args) = args.split_at(3);

    let module = load_module(&args[1]);
    let mut linker = <v1::Linker<()>>::default();
    let mut store = v1::Store::new(module.engine(), ());
    let instance = linker.instantiate(&mut store, &module)
        .unwrap()
        .start(&mut store)
        .unwrap();

    let fib_func = instance.get_export(&store, "fib")
        .and_then(v1::Extern::into_func)
        .unwrap();
    let mut result = [RuntimeValue::I32(0)];

    let args = [
        RuntimeValue::I32(
            program_args[0]
                .parse::<i32>()
                .unwrap_or_else(|_| panic!("Can't parse arg #{} as i32", program_args[0])),
        ),
    ];
    fib_func.call(&mut store, &args, &mut result).expect("Failed to call 'fib'");
    println!("{:?}", result[0]);
}

fn load_module(file: &str) -> v1::Module {
    let buf = std::fs::read(file).expect("Read file");
    let engine = v1::Engine::default();
    v1::Module::new(&engine, &buf[..]).expect("Failed to load")
}

Run on my AMD Ryzen 9 5900HX laptop (not the same as last time).

wasmi v0:

time ./target/release/examples/invoke ./fib.wasm fib 40
Result: Some(I32(165580141))

real    0m19.313s
user    0m19.302s
sys     0m0.001s

wasmi v1:

time ./target/release/examples/invoke_fib ./fib.wasm fib 40
I32(165580141)

real    0m29.395s
user    0m29.380s
sys     0m0.001s

s1vm:

time ./target/release/main fib.wasm fib 40
165580141

real    0m8.873s
user    0m8.863s
sys     0m0.005s

Cachegrind results. Just the "I refs" (CPU instructions executed) and "D refs" (memory accesses).
wasmi v0:

valgrind --tool=cachegrind  ./target/release/examples/invoke ./fib.wasm fib 30
Some(I32(1346269))
==746407==
==746407== I   refs:      2,169,232,440
==746407==
==746407== D   refs:      1,112,018,749  (704,242,633 rd   + 407,776,116 wr)

wasmi v1:

valgrind --tool=cachegrind ./target/release/examples/invoke_fib ./fib.wasm fib 30
I32(1346269)
==746185==
==746185== I   refs:      2,819,702,274
==746185==
==746185== D   refs:      1,590,154,734  (947,924,143 rd   + 642,230,591 wr)

s1vm:

valgrind --tool=cachegrind ./target/release/main fib.wasm fib 30
1346269
==746282==
==746282== I   refs:      808,427,721
==746282==
==746282== D   refs:      468,791,439  (293,623,787 rd   + 175,167,652 wr)

@Neopallium
Owner

Also this is the fib.wasm file I used: https://github.com/Neopallium/s1vm/blob/master/fib.wasm
Forgot to add it before.

@Neopallium
Owner

Not sure why wasmi v1 was slower than v0 for me. I didn't have the time to fully port the invoke.rs example from v0 to v1, so maybe I made a mistake in how to call into the wasm module.

@Robbepop
Author

Robbepop commented Feb 21, 2022

Not sure why wasmi v1 was slower than v0 for me. I didn't have the time to fully port the invoke.rs example from v0 to v1, so maybe I made a mistake in how to call into the wasm module.

It seems you were not using the following Cargo profile for compiling your invoke binary:

[profile.release]
lto = "fat"
codegen-units = 1

In experiments we saw regressions of up to 400% for wasmi_v1 when this profile was not set.
Note though that with your way of benchmarking you distort the results since you do not only benchmark the execution time but also the entire setup (module compilation, validation, instantiation etc.), which is much more complex on the wasmi side. This is why I was using the cargo bench command on wasmi, due to it not being a POC. :)

For a fair comparison of the actual execution time (which we are mainly talking about) you should wrap the execution only:

let before = std::time::Instant::now();
execute(fib);
let elapsed = before.elapsed();
println!("time = {:?}", elapsed);

@Robbepop
Author

I am looking forward to Rust adding support for proper tail-calls. The main reason I don't try using tail-calls in Rust is because they only work in release mode, but also because they will not work with async fn as far as I know.

Oh yes! I too wait for tail call support in Rust. However, it is stated by the devs to have low priority. So I guess if we ever want to see it soon(TM) we should become active ourselves and support the development actively. There seems to be an underwhelming number of people in the Rust community who appreciate the importance of proper tail call support atm.

Another reason I had started s1vm was to try building a VM that could support async (suspend/resume) calling. Haven't gotten to that yet with s1vm.

This sounds interesting. Can you elaborate why your closure based approach is well suited for this? You mean without performance penalties in the non-resume execution? The old wasmi engine supports resuming execution too, however, for the new engine we did not implement it for performance reasons.

It also looks like the test code for the different designs are not using the same set of instructions:

Hehe, you are comparing two different test cases there. ;) I just benchmarked the counter_loop tests, which are not available for every tested architecture because I was lazy and did not always deem it worth the effort.

I should have said read/write instead of push/pop. Even with the unsafe code in context.set_reg and context.get_reg those are still memory accesses that the Rust compiler can't eliminate (convert to native registers). While developing s1vm I profiled memory/cache accesses using cachegrind. The closure design used in s1vm is able to eliminate a lot of memory accesses and that is what makes it fast. Also note that there is no unsafe code in s1vm.

This is interesting. I have heard a lot about "moving stuff to registers" by the Wasm3 authors which is probably what you are saying there but I have to admit that I was not able to fully grasp what they meant in their artificial descriptions of how they achieve it. I will have another look at your code again. Will definitely update my research repo once understood.

If you mean the fused folder, it seems to still do memory accesses for the registers.

Yeah, sorry for the bad description of what I mean by "fused": with fused I mean to not only allow instructions such as add to operate on registers and constants, but also on global variables directly or even on linear memory regions. So instead of

local.get 0
i32.load 0
i32.const 1
i32.add
local.get 0
i32.store 0

You'd be able to fuse all those instructions into a single one:

add (mem (local 0) 0) (mem (local 0) 0) (i32.const 1)

This does the entire loading, adding and storing in a single instruction without calling any other function or dispatching into a closure for any of the sub procedures. This is especially possible with closures. We simply regard not only local.get and local.set as well as ty.const as value providers, but anything that actually provides a value, such as linear memory loads and global variable loads. And everything that updates values, such as local.set, global.set or ty.store, follows the same pattern.
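
To sketch what such a fused closure could look like (a made-up, self-contained example; the Store type and its fields are illustration only, not s1vm's real types):

struct Store {
    memory: Vec<u8>,
    locals: Vec<u64>,
}

type OpFunc = Box<dyn Fn(&mut Store)>;

// Fuses `local.get 0; i32.load 0; i32.const 1; i32.add; local.get 0; i32.store 0`
// into a single closure: no intermediate value ever touches the runtime value stack.
fn fuse_load_add_store(local: usize, offset: usize, imm: i32) -> OpFunc {
    Box::new(move |store: &mut Store| {
        let addr = store.locals[local] as usize + offset; // local.get 0 + address calculation
        let val = i32::from_le_bytes([
            store.memory[addr],
            store.memory[addr + 1],
            store.memory[addr + 2],
            store.memory[addr + 3],
        ]); // i32.load 0
        let res = val.wrapping_add(imm); // i32.add with the fused i32.const 1
        store.memory[addr..addr + 4].copy_from_slice(&res.to_le_bytes()); // i32.store 0
    })
}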

@Robbepop changed the title from "How to reconstruct your performance claims?" to "How to reconstruct your benchmarks?" on Feb 21, 2022
@Neopallium
Owner

It seems you were not using the following Cargo profile for compiling your invoke binary:

[profile.release]
lto = "fat"
codegen-units = 1

Correct. That helped a lot for the wasmi v1. I re-ran my simple benchmark with all 3 compiled with those options.

wasmi v0:

time  ./target/release/examples/invoke ./fib.wasm fib 40
Result: Some(I32(165580141))

real    0m18.467s
user    0m18.452s
sys     0m0.006s

valgrind --tool=cachegrind  ./target/release/examples/invoke ./fib.wasm fib 30
Some(I32(1346269))
==788933==
==788933== I   refs:      1,918,812,388
==788933==
==788933== D   refs:        916,803,710  (550,763,865 rd   + 366,039,845 wr)

wasmi v1

time ./target/release/examples/invoke_fib ./fib.wasm fib 40
I32(165580141)

real    0m15.008s
user    0m14.998s
sys     0m0.002s

valgrind --tool=cachegrind ./target/release/examples/invoke_fib ./fib.wasm fib 30
I32(1346269)
==788893==
==788893== I   refs:      1,575,732,238
==788893==
==788893== D   refs:        856,430,067  (471,339,573 rd   + 385,090,494 wr)

No improvement for s1vm:

time ./target/release/main fib.wasm fib 40
165580141

real    0m8.969s
user    0m8.959s
sys     0m0.003s

valgrind --tool=cachegrind ./target/release/main fib.wasm fib 30
1346269
==788852==
==788852== I   refs:      811,101,388
==788852==
==788852== D   refs:      466,091,055  (288,233,468 rd   + 177,857,587 wr)

In experiments we saw regressions of up to 400% for wasmi_v1 when this profile was not set. Note though that with your way of benchmarking you distort the results since you do not only benchmark the execution time but also the entire setup (module compilation, validation, instantiation etc.), which is much more complex on the wasmi side. This is why I was using the cargo bench command on wasmi, due to it not being a POC. :)

For a fair comparison of the actual execution time (which we are mainly talking about) you should wrap the execution only:

let before = std::time::Instant::now();
execute(fib);
let elapsed = before.elapsed();
println!("time = {:?}", elapsed);

For benchmarking VMs I like to include the setup time too, since some VMs (Java) can have high startup times. Also, s1vm should have a more costly setup time (compiling wasm bytecode to closures and related optimizations) than VMs that mostly just load the wasm bytecode and execute that or translate those simple opcodes.

It could be useful to split the start & execution times. One thing that I haven't checked is the memory cost of using closures in s1vm for complex wasm modules. I think it will be the main disadvantage of the design in s1vm.

@Robbepop
Author

Robbepop commented Feb 22, 2022

Thank you for the updated benchmarks! Yet, wasmi_v1 still looks a bit slow imo but at least these results are more realistic.

It could be useful to split the start & execution times. One thing that I haven't checked is the memory cost of using closures in s1vm for complex wasm modules. I think it will be the main disadvantage of the design in s1vm.

If this ever becomes a problem you can use Bumpalo or another ultra fast bump allocator. Another advantage of bump allocation is that your closure data then resides close together and won't be spilled anywhere.
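
For illustration, a tiny self-contained sketch of bump-allocating compiled closures with bumpalo (assuming the bumpalo crate as a dependency and its Bump::new/alloc API; note that a bump arena does not run Drop, so the closures here only capture Copy data):

use bumpalo::Bump;

fn main() {
    let bump = Bump::new();
    // Allocate the compiled "ops" in one arena so their captured data sits close together.
    let ops: Vec<&dyn Fn(i64) -> i64> = (1..=3)
        .map(|imm: i64| {
            let op: &dyn Fn(i64) -> i64 = &*bump.alloc(move |acc: i64| acc + imm);
            op
        })
        .collect();
    let result = ops.iter().fold(0, |acc, op| op(acc));
    assert_eq!(result, 6); // ((0 + 1) + 2) + 3
}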

For benchmarking VMs I like to include the setup time too, since some VMs (Java) can have high startup times. Also, s1vm should have a more costly setup time (compiling wasm bytecode to closures and related optimizations) than VMs that mostly just load the wasm bytecode and execute that or translate those simple opcodes.

I generally agree, but having both separate (setup and execution) would be a very useful information here. Also as I have mentioned earlier it is not entirely fair to compare a full fledged implementation with a POC that lacks most of the functionality (of Wasm).

@Neopallium
Owner

I am looking forward to Rust adding support for proper tail-calls. The main reason I don't try using tail-calls in Rust is because they only work in release mode, but also because they will not work with async fn as far as I know.

Oh yes! I too wait for tail call support in Rust. However, it is stated by the devs to have low priority. So I guess if we ever want to see it soon(TM) we should become active ourselves and support the development actively. There seems to be an underwhelming number of people in the Rust community who appreciate the importance of proper tail call support atm.

Is there an open issue or RFC about the become keyword?

Another reason I had started s1vm was to try building a VM that could support async (suspend/resume) calling. Haven't gotten to that yet with s1vm.

This sounds interesting. Can you elaborate why your closure based approach is well suited for this? You mean without performance penalties in the non-resume execution? The old wasmi engine supports resuming execution too, however, for the new engine we did not implement it for performance reasons.

The closure design might make it a little bit more difficult. A loop+switch design is much easier to suspend and resume.

To support async with the closure based compiler, I think it would be best to add "yield" points to instruction blocks. This would allow most of the closures to be normal non-async functions. Only opcodes that call other functions (internal or external) would need to support async and the compiler could do some static analysis to avoid adding overhead to simple functions that don't have loops or complex branching. There is a wasm module translator that can instrument a wasm module with yield calls (external calls).

https://emscripten.org/docs/porting/asyncify.html
https://web.dev/asyncify/

I should have said read/write instead of push/pop. Even with the unsafe code in context.set_reg and context.get_reg those are still memory accesses that the Rust compiler can't eliminate (convert to native registers). While developing s1vm I profiled memory/cache accesses using cachegrind. The closure design used in s1vm is able to eliminate a lot of memory accesses and that is what makes it fast. Also note that there is no unsafe code in s1vm.

This is interesting. I have heard a lot about "moving stuff to registers" by the Wasm3 authors which is probably what you are saying there but I have to admit that I was not able to fully grasp what they meant in their artificial descriptions of how they achieve it. I will have another look at your code again. Will definitely update my research repo once understood.

It has been a few years since I read about the wasm3 VM. I think they used a fixed set of parameters for each opcode function which allows them to jmp from one instruction directly to the next using tail-calls. This allows the VM to avoid using the native stack and the parameters stay in the same native registers (C ABI calling convention, the first X parameters use native registers, not the native stack). When I started s1vm I wanted to see if I could find a way to emulate that same design in Rust. Maybe with become support it would be possible.
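
Translated to Rust, the shape I mean looks roughly like this (purely a sketch with made-up opcodes, and Rust gives no guarantee that these calls actually become jumps): every handler shares one fixed signature, and the "register" value r0 is just passed along as a parameter, so it can stay in a native register from one handler to the next.

// ops[pc] = (index into HANDLERS, immediate value)
type Handler = fn(pc: usize, ops: &[(u8, u64)], r0: u64) -> u64;

const HANDLERS: [Handler; 2] = [op_add_imm, op_halt];

fn dispatch(pc: usize, ops: &[(u8, u64)], r0: u64) -> u64 {
    HANDLERS[ops[pc].0 as usize](pc, ops, r0)
}

fn op_add_imm(pc: usize, ops: &[(u8, u64)], r0: u64) -> u64 {
    let imm = ops[pc].1;
    dispatch(pc + 1, ops, r0.wrapping_add(imm)) // tail position, r0 is never written to memory
}

fn op_halt(_pc: usize, _ops: &[(u8, u64)], r0: u64) -> u64 {
    r0
}

fn main() {
    let ops = [(0u8, 2u64), (0, 3), (1, 0)];
    assert_eq!(dispatch(0, &ops, 0), 5);
}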

If you mean the fused folder, it seems to still do memory accesses for the registers.

Yeah, sorry for the bad description of what I mean by "fused": with fused I mean to not only allow instructions such as add to operate on registers and constants, but also on global variables directly or even on linear memory regions. So instead of

local.get 0
i32.load 0
i32.const 1
i32.add
local.get 0
i32.store 0

You'd be able to fuse all those instructions into a single one:

add (mem (local 0) 0) (mem (local 0) 0) (i32.const 1)

This does the entire loading, adding and storing in a single instruction without calling any other function or dispatching into a closure for any of the sub procedures. This is especially possible with closures. We simply regard not only local.get and local.set as well as ty.const as value providers, but anything that actually provides a value, such as linear memory loads and global variable loads. And everything that updates values, such as local.set, global.set or ty.store, follows the same pattern.

It would be easy to add Memory/Global support to the Input enum in s1vm, but that would only help with the input parameter case. local.set, global.set and store are still compiled to a closure that gets the value from the compiler's virtual stack of Inputs.

@Robbepop
Author

Robbepop commented Feb 22, 2022

Is there an open issue or RFC about the become keyword?

The most recent activities are

The closure design might make it a little bit more difficult. A loop+switch design is much easier to suspend and resume.

I have not yet figured out a way that would not introduce additional overhead for cases where you do not need to resume execution. For wasmi resumable execution is a nice-to-have but not a must-have feature, therefore we should not regress performance of the main path of execution.

I think they used a fixed set of parameters for each opcode function which allows them to jmp from one instruction directly to the next using tail-calls. This allows the VM to avoid using the native stack and the parameters stay in the same native registers (C ABI calling convention, the first X parameters use native registers, not the native stack).

These parameters are exactly what I could never wrap my head around, namely how they are actually used. Unfortunately the Wasm3 implementation is rather cryptic to me with all of its macros.

It would be easy to add Memory/Global support to the Input enum in s1vm, but that would only help with the input parameter case. local.set, global.set and store are still compiled to a closure that gets the value from the compiler's virtual stack of Inputs.

Yeah exactly, this is where the closure approach actually could shine. Just look at this research for an example of how NOT to do it. :D

@Neopallium
Owner

Thank you for the updated benchmarks! Yet, wasmi_v1 still looks a bit slow imo but at least these results are more realistic.

My "simple" benchmarks are not high-quality. I used a high input value for fib to try getting the results to mostly cover the execution time and to help with warm-up.

I think using criterion while developing the VM would help a lot. 2 years ago I didn't have as much experience with Rust, so I didn't know about all the tooling that was available.

It could be useful to split the start & execution times. One thing that I haven't checked is the memory cost of using closures in s1vm for complex wasm modules. I think it will be the main disadvantage of the design in s1vm.

If this ever becomes a problem you can use Bumpalo or another ultra fast bump allocator. Another advantage of bump allocation is that your closure data then resides close together and won't be spilled anywhere.

I agree that keeping the closure data close together should improve things for larger wasm module, but I would wait until there is a test case that shows high memory usage or high cache misses before trying to improve that.

For benchmarking VMs I like to include the setup time too, since some VMs (Java) can have high startup times. Also, s1vm should have a more costly setup time (compiling wasm bytecode to closures and related optimizations) than VMs that mostly just load the wasm bytecode and execute that or translate those simple opcodes.

I generally agree, but having both separate (setup and execution) would be a very useful information here. Also as I have mentioned earlier it is not entirely fair to compare a full fledged implementation with a POC that lacks most of the functionality (of Wasm).

I agree that benchmarking s1vm against a full wasm VM is not fair. That is part of the reason I haven't really tried publishing the results, other than just releasing the code. I only made s1vm public as a showcase for my Rust experience.

@Robbepop
Author

Robbepop commented Feb 22, 2022

I agree that benchmarking s1vm against a full wasm VM is not fair. That is part of the reason I haven't really tried publishing the results, other than just releasing the code. I only made s1vm public as a showcase for my Rust experience.

I am super thankful for you publishing s1vm so I was able to take a look! :)
A goal of mine for wasmi is to somehow manage to pull together a Wasm interpreter that is actually competitive with Wasm3 performance-wise. If your proposed interpreter architecture does the best possible job, I will happily implement it. There is just more research needed, since reimplementing the internals of wasmi won't be done over a weekend or two.

Some questions that arise from the top of my head about the POC:

  • How will multi-value Wasm proposal support look like?

    • Looking at your OpFunc type definition that defines closures that optionally return a single StackValue I wonder how this could be efficiently extended to work with multi-value Wasm proposal that allows a function, block, loop or any kind of instruction to take and return multiple values (which is what we support in wasmi). Returning a Vec<StackValue> will pretty much not scale very well here. Maybe an output parameter will do?
  • What will the different control flow constructs look like, e.g. loop and if/else? Currently your POC only shows that for basic blocks.

  • How does performance compare to a switch/loop interpreter architecture when we use the same access patterns (register machine etc.)?

  • How much does the performance regress once this architecture supports more than just a handful of instructions? High chance for performance regression due to bloated instruction cache dependency.

  • Will tail calls in Rust impact performance positively once they exist?

@Neopallium
Owner

I agree that benchmarking s1vm against a full wasm VM is not fair. That is part of the reason I haven't really tried publishing the results, other than just releasing the code. I only made s1vm public as a showcase for my Rust experience.

I am super thankful for you publishing s1vm so I was able to take a look! :) A goal of mine for wasmi is to somehow manage to pull together a Wasm interpreter that is actually competitive with Wasm3 performance-wise. If your proposed interpreter architecture does the best possible job, I will happily implement it. There is just more research needed, since reimplementing the internals of wasmi won't be done over a weekend or two.

Originally I wasn't even going to support WASM. I was just going to mock up a simple bytecode based VM to try out different ideas for a Rust based VM, much like your research repo. But after reading the wasm specs, I decided it was a great starting point and the parity-wasm crate would save a lot of time.

I also decided against trying to rewrite the internal wasmi VM, because like you said it can't be done over a weekend.

Looking at your OpFunc type definition that defines closures that optionally return a single StackValue I wonder how this could be efficiently extended to work with multi-value Wasm proposal that allows a function, block, loop or any kind of instruction to take and return multiple values (which is what we support in wasmi). Returning a Vec<StackValue> will pretty much not scale very well here. Maybe an output parameter will do?

A simple way to support multi-value would be to just fall back to the wasm runtime stack, which still exists in s1vm and is used by some non-optimized opcodes if I remember correctly. I would do this first to support the feature, then look for some wasm modules that use the feature before trying to optimize support.

@Neopallium
Owner

The most recent activities are

Thanks for the links.

The closure design might make it a little bit more difficult. A loop+switch design is much easier to suspend and resume.

I have not yet figured out a way that would not introduce additional overhead for cases where you do not need to resume execution. For wasmi resumable execution is a nice-to-have but not a must-have feature, therefore we should not regress performance of the main path of execution.

I think the compiler could be made to only add async support when requested. I try to push as much work as possible on the compiler, to keep execution/runtime as fast as possible.

I think they used a fixed set of parameters for each opcode function which allows them to jmp from one instruction directly to the next using tail-calls. This allows the VM to avoid using the native stack and the parameters stay in the same native registers (C ABI calling convention, the first X parameters use native registers, not the native stack).

These parameters are exactly what I could never wrap my head around, namely how they are actually used. Unfortunately the Wasm3 implementation is rather cryptic to me with all of its macros.

Not sure if you have seen this document about the wasm3 interpreter:
https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md#tightly-chained-operations

It does require some understanding of C ABI calling conventions and assembly to understand why it helps with performance.

@Robbepop
Author

Robbepop commented Feb 22, 2022

Not sure if you have seen this document about the wasm3 interpreter:
https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md#tightly-chained-operations

I have seen and read this and understood the basic architecture. The only detail that I do not yet grasp is how the reg_t r0, f64 fp0 parameters are used precisely, which is obviously one of the fundamentals of putting things into registers.

I think the compiler could be made to only add async support when requested. I try to push as much work as possible on the compiler, to keep execution/runtime as fast as possible.

Hmmm, I see a problem there since it is not really known at compile time if resumability of the code is actually used after compilation. But overall I like the approach of putting as many things as possible into the compilation step. A drawback though (as you mentioned earlier) is that it is still a trade-off. For example, for Parity's use case we usually have to compile relatively small to medium sized Wasm binaries that tend to execute pretty fast. Compilation overhead therefore tends to play a bigger role since we do not use the compiled code more than once. So generally for interpreters it is about finding the sweet spot between compilation and execution time.

@Robbepop
Author

Robbepop commented Feb 22, 2022

I re-implemented your interpreter architecture in my research repo under the name closure-tree. In fact it achieves the best performance of all solutions so far. 🤯 I hope I did a fair comparison there, though.
The next step is to see if it scales equally well to many instructions. And we will especially need some solution for the aforementioned multi-value Wasm proposal since it is ubiquitous.

@Neopallium
Owner

  • What will the different control flow constructs look like, e.g. loop and if/else? Currently your POC only shows that for basic blocks.

Loops and if/else are supported. Only a few Br, BrIf and BrTable are unimplemented. It should be possible to optimize conditional loops like how if/else blocks are handled.

Each block (branch point, loop, if/else, etc...) has a Vec of EvalFunc that return an Action (End, Return, Branch). These are like the OpFunc but don't produce a value directly and are just for control flow.
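
Roughly the shapes involved (a sketch following the description above, reusing the State/Store/Trap/StackValue types from the compiler excerpt earlier; the exact definitions in s1vm differ):

// Control-flow outcome of evaluating one closure inside a block.
enum Action {
    End,         // fall through to the end of the current block
    Return,      // return from the current wasm function
    Branch(u32), // unwind enclosing blocks until the target depth is reached
}

// OpFunc produces a value, EvalFunc is only run for its control-flow effect.
type OpFunc = Box<dyn Fn(&vm::State, &mut Store) -> Trap<StackValue>>;
type EvalFunc = Box<dyn Fn(&vm::State, &mut Store) -> Trap<Action>>;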

  • How does performance compare to a switch/loop interpreter architecture when we use the same access patterns (register machine etc.)?

Not sure I fully understand your question here. For a "register machine" it is important to know how the registers are stored. If they are just a Vec in the runtime state, then the memory access overhead will be high. I don't think there is a way in Rust to map the registers in the VM opcodes to native registers without assembly, which would also make the code non-portable.

  • How much does the performance regress once this architecture supports more than just a handful of instructions? High chance for performance regression due to bloated instruction cache dependency.

The closures have two parts: static code (generated at Rust compile time) and data (created at wasm compile time). I don't think the generated machine code will be that much bigger than a VM with a lot of specialized opcodes.

A loop+switch based VM with simple opcodes would have the smallest machine code size and could possibly run without many Instruction L1 cache misses. But the number of instructions executed (loop overhead) would be higher.

  • Will tail calls in Rust impact performance positively once they exist?

For the current design of s1vm, no I don't think so. I would try a design closer to what wasm3 does.

@Robbepop
Author

Robbepop commented Feb 22, 2022

I don't think the generated machine code will be that much bigger than a VM with a lot of specialized opcodes.

That's exactly what I mean. In my research repo I tried out the switch based interpreter with no specialized op-codes, with just a few, and with many runtime specialized ops as enum variants, and the more specialization there was, the slower the execution became. So at least in the case of switch dispatch, too many specialized op-codes regressed execution performance dramatically. This is why I wonder how this behaves for the closure based approach.

Loops and if/else are supported. Only a few Br, BrIf and BrTable are unimplemented. It should be possible to optimize conditional loops like how if/else blocks are handled.

In my implementation at the research repo loops only have a single body instruction which could be a block again having multiple instructions. I guess the difference is not too big.

@Neopallium
Owner

I re-implemented your interpreter architecture in my research repo under the name closure-tree. In fact it achieves the best performance of all solutions so far. 🤯 I hope I did a fair comparison there, though. The next step is to see if it scales equally well to many instructions. And we will especially need some solution for the aforementioned multi-value Wasm proposal since it is ubiquitous.

It still seems to pass values in the register's Vec (context.get_reg and context.set_reg). I am a little bit surprised that it was faster.
https://github.com/Robbepop/interpreter-dispatch-research/blob/fb11eb998fdc299c7d1ffdf0a2ce239013e361f7/src/lib.rs#L61

I haven't looked at the specs for multi-value, so I don't know for sure if it uses the wasm runtime value stack (i.e. pushing multiple values). But I would think that would be an easy way to get it to work and then try improving the performance later.

@Robbepop
Author

Robbepop commented Feb 22, 2022

In my implementation at the research repo loops only have a single body instruction which could be a block again having multiple instructions. I guess the difference is not too big.

I was wrong. I just tried out loop having a Vec<Inst> instead of a single Inst, and this severely regressed performance; it is now a bit slower than the switch based approach.

It still seems to pass values in the register's Vec (context.get_reg and context.set_reg). I am a little bit surprised that it was faster.
https://github.com/Robbepop/interpreter-dispatch-research/blob/fb11eb998fdc299c7d1ffdf0a2ce239013e361f7/src/lib.rs#L61

Yes, that's how the sub_imm instruction updates the value of the register - how else would it? However, what provides the speed-up is that sub_imm also returns the result, which is immediately used by the outer br_if construct. So br_if does not load the value from the register again and instead uses the return value directly, which should have the effect of using a native register here. Your solution also requires updating the stack for local.set and local.get.

@Neopallium
Owner

I don't think the generated machine code will be that much bigger than a VM with a lot of specialized opcodes.

That's exactly what I mean. In my research repo I tried out the switch based interpreter with no specialized op-codes, with just a few, and with many runtime specialized ops as enum variants, and the more specialization there was, the slower the execution became. So at least in the case of switch dispatch, too many specialized op-codes regressed execution performance dramatically. This is why I wonder how this behaves for the closure based approach.

The switch design requires some place to store the values produced/consumed by the opcodes (stack/registers), which has a high cost since it is hard for the Rust compiler to eliminate these reads/writes and use native registers. With nested closures like those used in s1vm, the Rust compiler can/should be able to use native registers (function return values); that is why the memory accesses are lower for s1vm, and this outweighs the larger native code size.

I am sure there will be some upper limit on how much larger the native code size for all the specialized closures can be before performance starts to drop.

For any opcode that already has a high cost (load/store memory, get/set global), it might be best to keep those as simple unspecialized cases. It is the simpler opcodes (add, sub, mul, etc...) or control flow that would get the biggest improvement from specialization.

Loops and if/else are supported. Only a few Br, BrIf and BrTable are unimplemented. It should be possible to optimize conditional loops like how if/else blocks are handled.

In my implementation at the research repo loops only have a single body instruction which could be a block again having multiple instructions. I guess the difference is not too big.

I had the most difficulty with nested blocks and control flow in the closure based compiler. It basically has to keep track of the PC counter while compiling wasm opcodes to handle jump targets.

Your loop_block could take two parameters, the conditional op and a Vec like the basic_block.

Also you might want to make sure that the Rust compiler doesn't get too "smart" and do too much optimization. Using something like black_box around the input opcodes (inst) makes sure that the compiler can't optimize based on the values there.

@Neopallium
Owner

It still seems to pass values in the register's Vec (context.get_reg and context.set_reg). I am a little bit surprised that it was faster.
https://github.com/Robbepop/interpreter-dispatch-research/blob/fb11eb998fdc299c7d1ffdf0a2ce239013e361f7/src/lib.rs#L61

Yes, that's how the sub_imm instruction updates the value of the register - how else would it? However, what provides the speed-up is that sub_imm also returns the result, which is immediately used by the outer br_if construct. So br_if does not load the value from the register again and instead uses the return value directly, which should have the effect of using a native register here. Your solution also requires updating the stack for local.set and local.get.

For this simple research VM, there is no need for registers. Just like how the branch_eqz closure executes an Expr to get the conditional value, all other opcodes add, sub, mul can do the same for their inputs.

s1vm still uses the runtime stack for locals (function parameters). It would be interesting to find a way to optimize those. Maybe as a third parameter to the opcode closures, then the closure for the main function block could allocate/store the locals on the native stack. Not sure if it is possible to eliminate the bound checks without using unsafe Rust code. I have been trying to avoid unsafe code in s1vm.

@Robbepop
Author

For any opcode that already has a high cost (load/store memory, get/set global), it might be best to keep those as simple unspecialized cases. It is the simpler opcodes (add, sub, mul, etc...) or control flow that would get the biggest improvement from specialization.

That is an interesting thought but you might be right with this.

I had the most difficulty with nested blocks and control flow in the closure based compiler. It basically has to keep track of the PC counter while compiling wasm opcodes to handle jump targets.

I imagine this to work with the Wasm branching depth that is decreased by 1 every time you return a branch. At a depth of 0 the outer control structure can take over again.

Your loop_block could take two parameters, the conditional op and a Vec like the basic_block.

I don't think this would work well for Wasm execution since in Wasm you could have multiple break points out of a loop. It is probably best to either use a list of op codes directly (as you did) or to have a single block body instruction that might also be a block instruction (containing multiple instructions again).

Also you might want to make sure that the Rust compiler doesn't get too "smart" and do too much optimization. Using something like black_box around the input opcodes (inst) makes sure that the compiler can't optimize based on the values there.

This is interesting. Can you elaborate when/why I would want to prevent optimizations from happening in this interpreter architecture?

For this simple research VM, there is no need for registers. Just like how the branch_eqz closure executes an Expr to get the conditional value, all other opcodes add, sub, mul can do the same for their inputs.

I think this is true until you try to support Wasm operands such as local.set, global.set or i32.store etc. Especially local.set more or less requires you to save state somewhere on a stack or in a register which is what I did there. Or how else would your architecture support a counter loop as I do in my research repo where you increase or decrease a single value? You could do it entirely functional as you did so far but I guess that has its own complications with respect to recursive depth and compilation etc.

s1vm still uses the runtime stack for locals (function parameters). It would be interesting to find a way to optimize those. Maybe as a third parameter to the opcode closures, then the closure for the main function block could allocate/store the locals on the native stack. Not sure if it is possible to eliminate the bound checks without using unsafe Rust code. I have been trying to avoid unsafe code in s1vm.

So you'd say that s1vm is a stack machine still? I thought it operates on registers after compilation. Trying to stay with safe Rust is good imo.

@Neopallium
Owner

Not sure if you have seen this document about the wasm3 interpreter:
https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md#tightly-chained-operations

I have seen and read this and understood the basic architecture. The only detail that I do not yet grasp is how the reg_t r0, f64 fp0 paramters are used precisely which is obviously one of the fundamentals of putting things into registers.

r0 can only be used by the integer opcodes op_u64_* and fp0 by the float/double opcodes. So there is basically just one register that can be used by the wasm3 opcodes. The op_u64_Or_sr is a specialized OR of the value at the top of the stack and the register (_sr: s for stack and r for register), with the result left in the r0 register. I am sure there are other opcodes for pushing/popping values between the stack and the r0 or fp0 registers.

The mapping of those parameters is the standard C x86 ABI calling convention (for 32-bit code; 64-bit code passes more parameters in registers). Other architectures like ARM use different conventions for which parameters are passed in registers. Also, Windows has its own convention.

I think the compiler could be made to only add async support when requested. I try to push as much work as possible on the compiler, to keep execution/runtime as fast as possible.

Hmmm, I see a problem there since it is not really known at compile time if resumability of the code is actually used after compilation. But overall I like the approach of putting as many things as possible into the compilation step. A drawback though (as you mentioned earlier) is that it is still a trade-off. For example, for Parity's use case we usually have to compile relatively small to medium sized Wasm binaries that tend to execute pretty fast. Compilation overhead therefore tends to play a bigger role since we do not use the compiled code more than once. So generally for interpreters it is about finding the sweet spot between compilation and execution time.

@Neopallium
Owner

For any opcode that already has a high cost (load/store memory, get/set global), it might be best to keep those as simple unspecialized cases. It is the simpler opcodes (add, sub, mul, etc...) or control flow that would get the biggest improvement from specialization.

That is an interesting thought but you might be right with this.

I had the most difficulty with nested blocks and control flow in the closure based compiler. It basically has to keep track of the PC counter while compiling wasm opcodes to handle jump targets.

I imagine this to work with the Wasm branching depth that is decreased by 1 every time you return a branch. At a depth of 0 the outer control structure can take over again.

Your loop_block could take two parameters, the conditional op and a Vec like the basic_block.

I don't think this would work well for Wasm execution since in Wasm you could have multiple break points out of a loop. It is probably best to either use a list of op codes directly (as you did) or to have a single block body instruction that might also be a block instruction (containing multiple instructions again).

To handle multiple breakpoints or breaking out of a nested block, there is the Action::Branch(depth) that is returned from blocks until self.depth == depth. The compiler tracks block depth to convert branch targets into block depths.

Also you might want to make sure that the Rust compiler doesn't get too "smart" and do too much optimization. Using something like black_box around the input opcodes (inst) makes sure that the compiler can't optimize based on the values there.

This is interesting. Can you elaborate when/why I would want to prevent optimizations from happening in this interpreter architecture?

Normally the wasm instructions get loaded from an external source (file, buffer) and the Rust compiler can't see them. But with micro benchmarks, where the input values (the instructions to execute) are visible to the compiler, it can produce machine code optimized for those inputs, or even calculate the results at compile time. The compiler can also optimize away loops.

There is also the black_box provided by criterion:
https://docs.rs/criterion/0.3.5/criterion/fn.black_box.html
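
For example (a self-contained sketch where Inst and execute are stand-ins for the research repo's own types; std::hint::black_box needs a recent Rust, criterion::black_box works the same way):

use std::hint::black_box;

#[derive(Copy, Clone)]
enum Inst {
    SubImm(i64),
    Halt,
}

fn execute(insts: &[Inst], mut acc: i64) -> i64 {
    for inst in insts {
        match *inst {
            Inst::SubImm(imm) => acc -= imm,
            Inst::Halt => break,
        }
    }
    acc
}

fn main() {
    // Without black_box the compiler can see the whole "program" and may
    // constant-fold the interpretation away, so the benchmark measures nothing.
    let insts = black_box(vec![Inst::SubImm(1), Inst::SubImm(2), Inst::Halt]);
    let result = execute(black_box(&insts), black_box(100));
    black_box(result);
}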

For this simple research VM, there is no need for registers. Just like how the branch_eqz closure executes an Expr to get the conditional value, all other opcodes add, sub, mul can do the same for their inputs.

I think this is true until you try to support Wasm operands such as local.set, global.set or i32.store etc. Especially local.set more or less requires you to save state somewhere on a stack or in a register which is what I did there. Or how else would your architecture support a counter loop as I do in my research repo where you increase or decrease a single value? You could do it entirely functional as you did so far but I guess that has its own complications with respect to recursive depth and compilation etc.

local.set, global.set and store can still stay as an opcode/closure that consumes a value (Input in s1vm) from another opcode.

If you look at how s1vm compiles if blocks, that design can be used for counter loops.

let val = closure(state, store)?;

It might require some detection of conditional loops from wasm opcodes. s1vm doesn't do this optimization yet, so I don't know if it would have a big impact.

s1vm still uses the runtime stack for locals (function parameters). It would be interesting to find a way to optimize those. Maybe as a third parameter to the opcode closures, then the closure for the main function block could allocate/store the locals on the native stack. Not sure if it is possible to eliminate the bound checks without using unsafe Rust code. I have been trying to avoid unsafe code in s1vm.

So you'd say that s1vm is a stack machine still? I thought it operates on registers after compilation. Trying to stay with safe Rust is good imo.

Basically yes. The compiler tries to eliminate as much of the runtime stack usage as it can. While compiling the wasm instructions it uses its virtual stack to track which opcodes produce values (push a result) and which consume values (pop values). It resolves those dependencies at compile time.
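Heavily simplified, that compile step could look like this (hypothetical code, not s1vm's actual compiler, which also handles blocks, calls, and more): each opcode pops its operands as Inputs from a compile-time virtual stack and pushes back a closure that already knows where its operands come from, so local.get and i64.const never touch a runtime stack:

type StackValue = i64;
type Locals = Vec<StackValue>;
type ExprFn = Box<dyn Fn(&Locals) -> StackValue>;

enum Input {
  Local(usize),
  Const(StackValue),
  Op(ExprFn),
}

fn to_expr(input: Input) -> ExprFn {
  match input {
    Input::Local(idx) => Box::new(move |locals: &Locals| locals[idx]),
    Input::Const(v) => Box::new(move |_locals: &Locals| v),
    Input::Op(f) => f,
  }
}

enum Wasm {
  LocalGet(usize),
  I64Const(StackValue),
  I64Add,
}

fn compile(code: &[Wasm]) -> ExprFn {
  // The virtual stack only exists while compiling.
  let mut virtual_stack: Vec<Input> = Vec::new();
  for op in code {
    match op {
      Wasm::LocalGet(idx) => virtual_stack.push(Input::Local(*idx)),
      Wasm::I64Const(v) => virtual_stack.push(Input::Const(*v)),
      Wasm::I64Add => {
        let rhs = to_expr(virtual_stack.pop().unwrap());
        let lhs = to_expr(virtual_stack.pop().unwrap());
        let add: ExprFn = Box::new(move |locals: &Locals| lhs(locals) + rhs(locals));
        virtual_stack.push(Input::Op(add));
      }
    }
  }
  to_expr(virtual_stack.pop().unwrap())
}

fn main() {
  // local.get 0 ; i64.const 2 ; i64.add
  let f = compile(&[Wasm::LocalGet(0), Wasm::I64Const(2), Wasm::I64Add]);
  assert_eq!(f(&vec![40]), 42);
}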

@Neopallium
Owner

I found an interesting optimization while thinking about how wasm3 has a "register" parameter in its opcodes and that s1vm was still using the stack for function parameters.

On branch local_opt1 I added a third parameter l0 to the compiled closure for passing the local at index 0 (parameter 1 for wasm functions).

I am not sure how well this method would extend to handle more locals this way, but other locals (index > 0) can still be handled on the runtime stack like normal.
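The signature change looks roughly like this as a simplified sketch (not the actual local_opt1 code): local 0 is threaded through as its own &mut parameter, so the calling opcode can keep it in a plain Rust variable on the native stack:

type StackValue = u64;

struct Store; // globals, memories, ... elided

// The compiled closures get local 0 as an extra mutable parameter.
type OpFn = Box<dyn Fn(&mut Store, &mut StackValue) -> StackValue>;

// `local.get 0` becomes a read of the extra parameter.
fn compile_local0_get() -> OpFn {
  Box::new(|_store: &mut Store, l0: &mut StackValue| *l0)
}

fn call_with_l0(op: &OpFn, store: &mut Store, arg: StackValue) -> StackValue {
  // The calling opcode keeps local 0 in a native variable and hands out
  // a mutable reference to it; other locals stay on the runtime stack.
  let mut l0 = arg;
  op(store, &mut l0)
}

fn main() {
  let op = compile_local0_get();
  let mut store = Store;
  assert_eq!(call_with_l0(&op, &mut store, 7), 7);
}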

The improvement wasn't as much as I was hoping, but it did lower the memory accesses and instruction counts:

time ./target/release/main fib.wasm fib 40
165580141

real    0m8.643s
user    0m8.637s
sys     0m0.001s

callgrind.sh  ./target/release/main fib.wasm fib 30
1346269
==893984==
==893984== Events    : Ir Dr Dw I1mr D1mr D1mw ILmr DLmr DLmw
==893984== Collected : 689938357 238420444 150933544 2727 4473 2114 2565 3104 1874
==893984==
==893984== I   refs:      689,938,357
==893984==
==893984== D   refs:      389,353,988  (238,420,444 rd + 150,933,544 wr)

@Robbepop
Author

Robbepop commented Feb 23, 2022


Sounds interesting and probably explains how Wasm3 uses this reg_t parameter. One riddle solved for me. :)
The only thing I wonder is how this optimization works in combination with local.set 0 <new_value>, where a single instruction within a function body changes the value of this register parameter for the rest of the instructions. Somehow the effect of local.set 0 has to be propagated to the following instructions in order to properly support Wasm, e.g. within loops or if/else branches that may store different values for the first parameter.

Having support for local.set in s1vm would also allow us to compare benchmarks of the simple counter_loop example in my repo with your architecture.

@Neopallium
Owner

Sounds interesting and probably explains how Wasm3 uses this reg_t parameter. One riddle solved for me. :) The only thing I wonder is how this optimization works in combination with local.set 0 <new_value>, where a single instruction within a function body changes the value of this register parameter for the rest of the instructions. Somehow the effect of local.set 0 has to be propagated to the following instructions in order to properly support Wasm, e.g. within loops or if/else branches that may store different values for the first parameter.

The new parameter is a mutable reference, so it can be updated by any opcode in that function. The Call opcode uses a new Rust variable for holding the value for the l0 parameter. So wasm local 0 is on the native stack and other wasm locals are on the wasm runtime stack.

Having support for local.set in s1vm would also allow us to compare benchmarks of the simple counter_loop example in my repo with your architecture.

I was going to try implementing local.set, but ran into a problem with the current s1vm compiler. Basically it assumes every opcode produces one value, which is wrapped in an Input before being pushed onto the virtual (compile time) stack. That doesn't work for opcodes that don't produce a value. This can be solved by adding an Input::RunOp(OpFunc) variant and rolling up/merging those entries from the virtual stack when another opcode pops a value from it.
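A sketch of what that could look like with simplified types (not the actual compiler): a RunOp entry holds a closure that runs only for its side effect, and it gets merged into the next value that is popped so that the original wasm ordering is preserved:

type StackValue = i64;

struct State {
  locals: Vec<StackValue>,
}

type ExprFn = Box<dyn Fn(&mut State) -> StackValue>;
type InstFn = Box<dyn Fn(&mut State)>; // runs only for its side effect

enum Input {
  Const(StackValue),
  Op(ExprFn),
  // New: an opcode such as local.set that produces no value.
  RunOp(InstFn),
}

// When a later opcode pops an operand and finds deferred side-effect ops
// above it on the virtual stack, merge them into the operand: the value is
// computed first, then the side effects run, matching the wasm order.
fn merge(value: Input, side_effects: Vec<InstFn>) -> ExprFn {
  let value: ExprFn = match value {
    Input::Const(v) => Box::new(move |_state: &mut State| v),
    Input::Op(f) => f,
    Input::RunOp(_) => unreachable!("RunOp is never used as a value operand"),
  };
  Box::new(move |state: &mut State| {
    let v = value(&mut *state);
    for op in &side_effects {
      op(&mut *state);
    }
    v
  })
}

fn main() {
  let mut state = State { locals: vec![1, 0] };
  // Value `local.get 0` is pushed first, then `local.set 1 (i64.const 7)`
  // runs, then the consumer uses the value.
  let value = Input::Op(Box::new(|state: &mut State| state.locals[0]));
  let side_effect: InstFn = Box::new(|state: &mut State| state.locals[1] = 7);
  let merged = merge(value, vec![side_effect]);
  assert_eq!(merged(&mut state), 1);
  assert_eq!(state.locals[1], 7);
}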

Yesterday while profiling s1vm I was looking at all of the memory accesses (there are still a lot) and wondering if there was a way to decrease them. Right now with fib.wasm those memory accesses seem to come mostly (>90%) from the closure data. The closures are basically just a chunk of heap memory (from Box::new) that contains the captured data and a function pointer. It should be possible to do about the same with an enum. It might not be the closures themselves, but the way values are passed from opcode to opcode, that makes s1vm fast.

Something like this, which is similar to your closure_nest but doesn't use registers. It replaces the closure function pointer with switch dispatch.

type LocalIdx = u32;
enum Input {
  Local(LocalIdx),
  Const(StackValue),
  Op(Box<Opcode>), // Need to box `Opcode` to avoid recursive type.
  // Might need a variant for opcodes that don't produce values: SetLocal, SetGlobal.
}

// the `Box<Opcode>` parameters would provide the nested execution of opcodes like what is done in s1vm
// with the current closures.
enum Opcode {
  // Specialize some opcodes for the different `Input` variants, which is the same that the closure specialization does.
  I32Add_L_L(LocalIdx, LocalIdx),
  I32Add_L_C(LocalIdx, StackValue),
  I32Add_L_O(LocalIdx, Box<Opcode>),
  I32Add_O_C(Box<Opcode>, StackValue),
  I32Add_O_O(Box<Opcode>, Box<Opcode>),

  I32Sub_L_L(LocalIdx, LocalIdx),
  I32Sub_L_C(LocalIdx, StackValue),
  I32Sub_L_O(LocalIdx, Box<Opcode>),
  I32Sub_O_C(Box<Opcode>, StackValue),
  I32Sub_O_O(Box<Opcode>, Box<Opcode>),

  I32Lts_L(LocalIdx),
  I32Lts_C(StackValue),
  I32Lts_O(Box<Opcode>),

  Call0(FuncIdx), // Zero parameter funcs.
  Call1(FuncIdx, Input), // One parameter funcs.
  // Generic call op for funcs with > 1 parameters.
  Call(FuncIdx, Vec<Input>),

  // ..... many more opcodes.
}

Would need to do profiling to see if it had better memory access patterns.

@Robbepop
Author

Robbepop commented Feb 23, 2022

Looking forward to your benchmark results with this enum based approach. :)
You might want to replace I32Add_O_O(Box<Opcode>, Box<Opcode>) with I32Add_O_O(Box<[Opcode; 2]>) for more cache-friendly heap memory allocations.
In general I view your approach as a division between instructions (which do not return a value) and expressions (which do). In my own experiments I named those things accordingly, e.g. your Opcode is an Expr for me since it can be evaluated to a value.
Viewed this way, operations like local.set and global.set act as instructions and are explicitly not expressions.
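In code, that naming split might look like this (just an illustration of the naming, not code from either project):

type LocalIdx = u32;
type Value = i64;

// Produces a value when evaluated.
enum Expr {
  Local(LocalIdx),
  Const(Value),
  I32Add(Box<Expr>, Box<Expr>),
}

// Executed for its effect only; produces no value.
enum Inst {
  LocalSet(LocalIdx, Expr),
  GlobalSet(u32, Expr),
  Branch(u32),
}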

@Robbepop
Author

Robbepop commented Feb 23, 2022

I tried out a variant of your proposed dispatch technique (I did not put a lot of effort into it though) and it is unfortunately super slow:
https://github.com/Robbepop/interpreter-dispatch-research/blob/main/src/enum_tree.rs

@Robbepop
Author

Robbepop commented Feb 23, 2022

The new switch_tail_2 dispatch technique in the research repo is now the fastest I have so far.
It uses an enum for the instructions, has specialized instructions that read and write local 0 through a register instead, and makes use of tail calls: https://github.com/Robbepop/interpreter-dispatch-research/blob/main/src/switch_tail_2.rs

edit: Added another switch_2 that is the same as switch_tail_2 but uses a simple switch-loop based dispatch and it is equally fast.

On my system both switch and switch_tail benchmarks run in roughly 380ms, and both switch_2 and switch_tail_2 run in roughly 320ms. So that is a significant speedup just from using more specialized instructions for the first function parameter. I am not entirely sure whether I made some mistakes in the implementation, though. However, what is remarkable is that switch-loop and switch-tail dispatch are equally fast in both cases.
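For reference, a condensed illustration of such a switch-loop with local 0 kept in a register-like variable (my own toy version; the actual switch_2/switch_tail_2 code in the research repo differs):

#[derive(Clone, Copy)]
enum Inst {
  AddImmReg0(i64),          // reg0 += imm
  BranchReg0NonZero(usize), // if reg0 != 0 { pc = target }
  Return,
}

fn run(insts: &[Inst], mut reg0: i64) -> i64 {
  let mut pc = 0;
  loop {
    match insts[pc] {
      Inst::AddImmReg0(imm) => {
        reg0 += imm;
        pc += 1;
      }
      Inst::BranchReg0NonZero(target) => {
        if reg0 != 0 {
          pc = target;
        } else {
          pc += 1;
        }
      }
      Inst::Return => return reg0,
    }
  }
}

fn main() {
  // Count down from 1_000_000 to 0: loop { reg0 += -1; br_if reg0 != 0 }
  let insts = [
    Inst::AddImmReg0(-1),
    Inst::BranchReg0NonZero(0),
    Inst::Return,
  ];
  assert_eq!(run(&insts, 1_000_000), 0);
}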

@Robbepop
Author

Robbepop commented Apr 3, 2022

Hi, have you had some time and energy to experiment on these approaches more?
I am nearly done with the new wasmi codegen and it has been a hell of a ride so far.
Looking forward to the benchmarks once the new execution engine is ready. Will update you.

@Neopallium
Owner

Hi, @Robbepop

I had some time this weekend, so I did some testing with your research repo. I also fixed loops/branching in s1vm, and there is now a for_loop.wasm module for testing/benching for loops (./target/release/main ./for_loop.wasm for_loop 100000000).

One thing I have noticed is that the benchmark numbers are not stable if multiple threads are used. The test suite can be forced to use one thread with --test-threads 1. I added a script to run the different counter_loop tests and sort the results.

See my fork here: https://github.com/Neopallium/interpreter-dispatch-research

I added two new implementations, closure_reg0 and closure_s1vm (the latter tries to mimic the s1vm design).

Here are the numbers I get on my laptop (AMD Ryzen 9 5900HX) with Rust 1.60.0 (7737e0b5c 2022-04-04):

taskset -c 0 ./time_tests.sh ./target/release/deps/interpreter_dispatch_research-503a3bfd9fb23604
201.540135ms closure_tree::counter_loop
204.615313ms closure_reg0::counter_loop
227.409807ms fused::rt3::counter_loop
237.641822ms switch_2::counter_loop
286.590728ms switch::counter_loop
294.793438ms fused::rt2::counter_loop
330.544453ms fused::rt::counter_loop
384.129139ms closure_s1vm::counter_loop
408.361034ms fused::ct3::counter_loop
440.942165ms closure_tail::counter_loop
453.473982ms enum_tree_2::counter_loop
461.200585ms fused::ct2::counter_loop
480.623633ms closure_block::counter_loop
492.725507ms closure_loop::counter_loop
640.735641ms enum_tree::counter_loop
994.112301ms fused::ct::counter_loop

@Robbepop
Author

Robbepop commented Jul 14, 2022

Hi @Neopallium, sorry for the long pause, I was on a break from coding.

Thanks a lot for the interesting additions. It is great to see how much potential there still is!

I especially looked into the closure_reg0 implementation and was wondering about a few things with respect to translating Wasm to it. For example, I currently do not see a proper mapping in your current design for Wasm loops and blocks that can break out to any parent block from their middle. Please point me to it in case I missed something.

In the meantime I have more or less finalized the initial implementation of the register-machine approach for wasmi. It currently yields improvements of up to 40% for some real-world compute-intense and iteration-heavy benchmarks such as tiny_keccak; however, for call-intense benchmarks such as your baseline example, the recursive fibonacci, it does not yield any major improvements. I actually figured out that, with Wasm multi-value, a stack machine can technically be extremely efficient at calling other code, whereas a register machine ends up copying a lot of registers. I also implemented an instruction scheduling improvement for the stack-machine based wasmi with nice gains.

Still, even as a register machine wasmi is not even close to the performance of Wasm3, and I am starting to wonder what else the implementation of wasmi is seriously missing besides the reg0 optimization and a better instruction scheduling approach, because those optimizations cannot explain the huge gap that still exists.

I am referring to the coremark.wasm benchmark where wasmi currently sits at 600 points on my machine whereas Wasm3 sits at roughly 2.2k points, so still about 3.5 times faster. Even with 30% gains from a closure-based instruction scheduler and the reg0 optimization we won't get close. Maybe I am underestimating the gains, but I get the feeling that other parts are probably much slower than expected.

Do you have any thoughts on this topic?

@Neopallium
Owner

Have you tried running wasmi under profilers like valgrind --tool=callgrind? I used that a lot to see data reads/writes in s1vm.

The "research" vms don't have logic for early break out of parent blocks like wasm has. But s1vm does have support of that now using the Action::Branch(depth) return value from instruction closures.

@Robbepop
Author

Robbepop commented Jul 15, 2022

Have you tried running wasmi under profilers like valgrind --tool=callgrind? I used that a lot to see data reads/writes in s1vm.

I was using perf and flamegraph visualizations of its output, which were a great help in finding slow parts; however, they won't show estimates for cache misses etc. I completely forgot about the valgrind tools. I'm going to give callgrind and cachegrind a try later.

The "research" vms don't have logic for early break out of parent blocks like wasm has. But s1vm does have support of that now using the Action::Branch(depth) return value from instruction closures.

Gonna have to take another look at this functionality.

With respect to the performance characteristics, I think the difference between Wasm3 and our tools is that Wasm3 was heavily tuned for speed. From what I see, Wasm3 does not support certain things such as multiple memories (or multiple instances), which obviously makes memory accesses faster since the lookup is simpler than what we can afford in wasmi, where we are trying to have a very conformant interpreter.

@Neopallium
Owner

Good luck. Let me know if you have any questions. I am subscribed to your wasmi PR, so I have seen the increase in commits. When I have some free time I may give a go at profiling it to see if anything pops out.

@Robbepop
Author

Robbepop commented Jul 15, 2022

Thank you!

I just did some performance comparisons between your s1vm and the register-machine based wasmi.
The results are as follows:

  • fibonacci_recursive(35): s1vm won by roughly 30% (1.5s vs 2.2s)
  • fibonacci_iterative(100_000_000): wasmi was roughly 50% faster (3.5s vs 1.6s)
    • Note: I didn't care about integer overflows for benchmarks purposes.
  • for_loop_64(100_000_000): wasmi was roughly 10% faster (1.35s vs 1.2s)

Tested on rust 1.62 (current stable).

So it seems your approach wins significantly for call-heavy workloads, while the new register-machine based wasmi seems to win for compute-intense workloads. I can't guarantee that these measurements were carried out correctly.

Even with those wins for wasmi in some of the runs, I am super impressed by your s1vm because it is way simpler and seems more elegant. It probably also has a lot of undiscovered potential for further optimizations. It makes me wonder whether all the work on the register-machine based approach was worth it. Maybe it is possible to combine both approaches in a productive way.

Note: For fibonacci_iterative I had to use i64 instead of i32 but the changes required to s1vm were trivial.

I used this .wat file for the fibonacci tests:

(module
  (func $fib_recursive (export "fib_recursive") (param $N i64) (result i64)
    (if
      (i64.le_s (local.get $N) (i64.const 1))
      (then (return (local.get $N)))
    )
    (return
      (i64.add
        (call $fib_recursive
          (i64.sub (local.get $N) (i64.const 1))
        )
        (call $fib_recursive
          (i64.sub (local.get $N) (i64.const 2))
        )
      )
    )
  )

  (func $fib_iterative (export "fib_iterative") (param $N i64) (result i64)
    (local $n1 i64)
    (local $n2 i64)
    (local $tmp i64)
    (local $i i64)
    ;; return $N for N <= 1
    (if
      (i64.le_s (local.get $N) (i64.const 1))
      (then (return (local.get $N)))
    )
    (local.set $n1 (i64.const 1))
    (local.set $n2 (i64.const 1))
    (local.set $i (i64.const 2))
    ;;since we normally return n2, handle n=1 case specially
    (loop $again
      (if
        (i64.lt_s (local.get $i) (local.get $N))
        (then
          (local.set $tmp (i64.add (local.get $n1) (local.get $n2)))
          (local.set $n1 (local.get $n2))
          (local.set $n2 (local.get $tmp))
          (local.set $i (i64.add (local.get $i) (i64.const 1)))
          (br $again)
        )
      )
    )
    (local.get $n2)
  )
)

edit: I forgot to enable opt-level=3, lto="fat" for s1vm so the comparison above was unfair.

The new results are as follows:

  • fibonacci_recursive(35): s1vm won by roughly 60% (1.0s vs 2.2s)
  • fibonacci_iterative(100_000_000): wasmi was roughly 45% faster (3.1s vs 1.6s)
  • for_loop_64(100_000_000): s1vm was roughly 25% faster (0.9s vs 1.2s)

In conclusion the results are even more devastating for the register machine based wasmi ...

What also interests me a lot is how the compilation times from Wasm to bytecode (or closures) compare between wasmi and s1vm, given all the small heap allocations required by s1vm. I am afraid this won't be easy to evaluate on real use cases without extending the feature set of s1vm.

I also thought about how to efficiently add Wasm multi-value support to your s1vm interpreter design. While wasmi supports it, the proposal adds quite a lot of complexity and in some cases even runtime overhead to the interpreter.

@Neopallium
Owner

Yeah, it would be interesting to benchmark the "compile" overhead for both VMs, i.e. just load, but don't execute. I have a feeling that s1vm might not do well with larger wasm modules, at least not right now. It might be possible to do lazy compiling of each wasm function the first time it is called, but that would require mutable code storage.

I haven't tried optimizing loop code yet (i.e. hoisting the loop counter, which should be possible with the closure based design).
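The rough idea, as a hand-written sketch of what a compiled loop closure could look like (not s1vm code): if the compiler detects that a given local is the loop counter, the loop closure keeps it in a native variable while the loop runs and only writes it back to the locals when the loop exits:

type StackValue = i64;

struct State {
  locals: Vec<StackValue>,
}

type LoopBodyFn = Box<dyn Fn(&mut State, &mut StackValue)>;

// Compile a counted loop whose counter is local `idx`: the counter is
// hoisted into a native variable for the duration of the loop and only
// written back when the loop exits.
fn compile_counted_loop(idx: usize, limit: StackValue, body: LoopBodyFn) -> Box<dyn Fn(&mut State)> {
  Box::new(move |state: &mut State| {
    let mut counter = state.locals[idx]; // hoist
    while counter < limit {
      body(&mut *state, &mut counter);
      counter += 1;
    }
    state.locals[idx] = counter; // write back
  })
}

fn main() {
  let mut state = State { locals: vec![0, 0] };
  // Body: local 1 += counter (a simple summing loop).
  let body: LoopBodyFn = Box::new(|state: &mut State, counter: &mut StackValue| {
    state.locals[1] += *counter;
  });
  let looped = compile_counted_loop(0, 5, body);
  looped(&mut state);
  assert_eq!(state.locals[1], 0 + 1 + 2 + 3 + 4);
  assert_eq!(state.locals[0], 5);
}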

@Robbepop
Author

Robbepop commented Jul 21, 2022

I have optimized wasmi_v1 and have some new results to share:

The new results are as follows:

  • fibonacci_recursive(35): s1vm wins by roughly 20% (1.0s vs 1.2s)
  • fibonacci_iterative(100_000_000): wasmi wins by roughly 45% (3.1s vs 1.6s)
  • for_loop_64(100_000_000): s1vm wins by roughly 10% (0.9s vs 1.0s)

I have primarily added a bunch of new optimized variants such as i32.AddImm that take an immediate value, and have implemented optimizations for the common cases of returning no value or just a single value back to the caller. There is even more that can be done, but so far I am okay with the results. Repetitive memory accesses have been optimized, too, and some unnecessary bounds checks have been eliminated.
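For illustration, an immediate-operand variant in a register-machine style bytecode could look roughly like this (illustrative only; wasmi's actual instruction encoding differs):

type Reg = u16;

#[derive(Clone, Copy)]
enum Instr {
  // General form: result <- lhs + rhs, all operands in registers.
  I32Add { result: Reg, lhs: Reg, rhs: Reg },
  // Specialized form: the right-hand side is a compile-time immediate,
  // so the executor skips one register read per addition.
  I32AddImm { result: Reg, lhs: Reg, imm: i32 },
}

fn execute(regs: &mut [i32], instr: Instr) {
  match instr {
    Instr::I32Add { result, lhs, rhs } => {
      regs[result as usize] = regs[lhs as usize].wrapping_add(regs[rhs as usize]);
    }
    Instr::I32AddImm { result, lhs, imm } => {
      regs[result as usize] = regs[lhs as usize].wrapping_add(imm);
    }
  }
}

fn main() {
  let mut regs = vec![0i32; 4];
  regs[1] = 40;
  execute(&mut regs, Instr::I32AddImm { result: 0, lhs: 1, imm: 2 });
  assert_eq!(regs[0], 42);
}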

Not all of those optimizations have been pushed to GitHub, yet.

The optimizations regarding recursive calls roughly boil down to the fact that so far wasmi has always done the more general work required to support Wasm multi-value return values. However, in many common cases multi-value is not needed, so I created some new bytecode instructions for those cases and the gains were quite good. From what I can tell, s1vm currently does not support multi-value and therefore never did this extra work in the first place.

@Robbepop
Author

Robbepop commented Aug 21, 2022

Hi @Neopallium, long time no update.
Have you had time to tinker with the loop counter hoisting for s1vm's loop execution? :)
I just implemented some more optimizations on the older stack machine based wasmi engine with some surprising results.

  • fibonacci_recursive(35): 0.95s (~5% faster than s1vm)
  • fibonacci_iterative(100_000_000): 1.9s (~40% faster than s1vm)
  • for_loop_64(100_000_000): 0.9s (as fast as s1vm)

So the old wasmi stack machine based engine is starting to become faster than s1vm.

edit: Updated the values several times after implementation of new optimizations.

@Robbepop
Author

Robbepop commented Mar 17, 2023

I know it isn't much but I just wanted to let you know that wasmi is going to use the same execution mechanism as s1vm for constant Wasm expressions introduced in the extended-const Wasm proposal.
We also link to s1vm as a source of inspiration for this technique. :)

PR: wasmi-labs/wasmi#707
