Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce performance benchmarks for Stage 1 ZK WASM implementation #90

Closed
aborg-dev opened this issue Nov 17, 2023 · 14 comments
Closed
Assignees

Comments

@aborg-dev
Copy link

As a part of Stage 1 plan we aimed for at most 1000x slowdown compared to native execution.

We need to introduce a few representative benchmarks that allow to:

  • Measure slowdown compared to native execution
  • Simple enough to work with Stage 1 instruction set
  • [Bonus] Comparable to other ZK proving systems

A few ideas:

@aborg-dev
Copy link
Author

I took a simple program computing SHA256 of a constant string: https://github.com/akashin/bench/blob/sha256/src/main.rs

And generated WASM for it https://gist.github.com/akashin/b038a9e9fcc9bb4ccf0dc5b990f5d9df with following commands:

cargo build --target wasm32-unknown-unknown --release
# Before strip: 1.8M
# After strip: 37K
wasm-strip cranelift-bench.wasm

The final program has ~16k WASM ops.
It uses:

  • Single memory region and 1 memory.grow instruction
  • 700 i32.xor ops, 400 i32.and ops
  • 300 i32.shl and i32.shr_u instructions; 600 i32.rot
  • 600 i32.load and 600 i32.store ops
  • 434 call ops
  • 1250 i32.add ops
  • 37 call_indirect ops

@aborg-dev
Copy link
Author

aborg-dev commented Nov 21, 2023

Another very simple program computing fibonacci: https://github.com/akashin/bench/blob/fibo/src/main.rs

Here is generated WAT: https://gist.github.com/akashin/326d3273ec98436e80813ad83f24dfda

It contains ~10k WASM ops, which is much more than I expected.
I will post the breakdown of instructions below.

UPD: The actual fibonacci code is relatively small, so we need to find a way to trim the rest functions that we don't need:

  (func (;4;) (type 0)
    (local i32 i64 i64 i32)
    global.get 0
    i32.const 32
    i32.sub
    local.tee 0
    global.set 0
    i64.const 0
    local.set 1
    i64.const 1
    local.set 2
    i32.const 10000
    local.set 3
    loop  ;; label = @1
      local.get 2
      local.get 1
      local.get 2
      i64.add
      local.tee 1
      i64.add
      local.tee 2
      local.get 1
      local.get 2
      i64.add
      local.tee 2
      i64.add
      local.tee 1
      local.get 2
      local.get 1
      i64.add
      local.tee 2
      i64.add
      local.tee 1
      local.get 2
      local.get 1
      i64.add
      local.tee 2
      i64.add
      local.tee 1
      local.get 2
      local.get 1
      i64.add
      local.tee 1
      i64.add
      local.set 2
      local.get 3
      i32.const -10
      i32.add
      local.tee 3
      br_if 0 (;@1;)
    end
    local.get 0
    local.get 1
    i64.store
    block  ;; label = @1
      local.get 1
      i64.const -2872092127636481573
      i64.ne
      br_if 0 (;@1;)
      local.get 0
      i32.const 32
      i32.add
      global.set 0
      return
    end
    local.get 0
    i32.const 0
    i32.store offset=8
    i32.const 0
    local.get 0
    i32.const 1048592
    local.get 0
    i32.const 8
    i32.add
    i32.const 1048612
    call 3
    unreachable)

@aborg-dev
Copy link
Author

aborg-dev commented Nov 22, 2023

I've managed to convince Rust to generate the small program for fibonacci by using no_std incantations:

#![no_main]
#![no_std]

use panic_halt as _;

extern "C" {
    fn assert_eq(actual: i64, expected: i64);
}

#[no_mangle]
pub fn main() {
    let mut a = 0i64;
    let mut b = 1i64;

    for _ in 0..10000 {
        let c = a.wrapping_add(b);
        a = b;
        b = c;
    }
    unsafe {
        assert_eq(a, -2872092127636481573);
    }
}

The generated WAT file is 61 lines long and only contains the relevant Fibonacci implementation:

(module
  (type (;0;) (func (param i64 i64)))
  (type (;1;) (func))
  (import "env" "assert_eq" (func (;0;) (type 0)))
  (func (;1;) (type 1)
    (local i64 i64 i32)
    i64.const 1
    local.set 0
    i64.const 0
    local.set 1
    i32.const 10000
    local.set 2
    loop  ;; label = @1
      local.get 0
      local.get 1
      i64.add
      local.tee 1
      local.get 0
      i64.add
      local.tee 0
      local.get 1
      i64.add
      local.tee 1
      local.get 0
      i64.add
      local.tee 0
      local.get 1
      i64.add
      local.tee 1
      local.get 0
      i64.add
      local.tee 0
      local.get 1
      i64.add
      local.tee 1
      local.get 0
      i64.add
      local.tee 0
      local.get 1
      i64.add
      local.tee 1
      local.get 0
      i64.add
      local.set 0
      local.get 2
      i32.const -10
      i32.add
      local.tee 2
      br_if 0 (;@1;)
    end
    local.get 1
    i64.const -2872092127636481573
    call 0)
  (memory (;0;) 16)
  (global (;0;) (mut i32) (i32.const 1048576))
  (global (;1;) i32 (i32.const 1048576))
  (global (;2;) i32 (i32.const 1048576))
  (export "memory" (memory 0))
  (export "main" (func 1))
  (export "__data_end" (global 1))
  (export "__heap_base" (global 2)))

I think this would be good enough for our first benchmark and I believe we already support all necessary opcodes.

As a next step, I will see if I can reuse the same techniques to build SHA2 benchmark and see which instructions it uses.

Related to Fibonacci benchmark there is one outstanding questions: How to generate a start function for the WASM module?

I've seen wasm_bindgen(start) annotation but that requires STD in my understanding.
@nagisa , do you have any ideas how to address this?

@nagisa
Copy link
Collaborator

nagisa commented Nov 22, 2023

There’s a (start) declaration in wasm modules, but it is a little different in concept to the _start entry in native code. WASM’s start is more like the life-before-main code that is used to initialize various data at runtime.

What you will need to do in this case is to look at the (export "main") declarations and decide which function becomes zkASM’s start from there.

@aborg-dev
Copy link
Author

There’s a (start) declaration in wasm modules, but it is a little different in concept to the _start entry in native code. WASM’s start is more like the life-before-main code that is used to initialize various data at runtime.

What you will need to do in this case is to look at the (export "main") declarations and decide which function becomes zkASM’s start from there.

Makes sense, we can always do this as a part of .zkasm file generation as long as we keep some convention that is easy to follow programmatically.

@MCJOHN974
Copy link

MCJOHN974 commented Nov 23, 2023

I think for first benchmarks it will be easier to use only 64-bit operations, it will be much easier to implement. i32 now have a problem with jumps, and maybe will contain more. Maybe it is even worth it to write benchmark in .wat, since we can run it natively as well?

@MCJOHN974
Copy link

Also, on which opcodes should we focus now? From list above shifts and rotations are still not implemented, should we start work on them or chose another benchmark? Should we better ship 32-bit computations ASAP?

@aborg-dev
Copy link
Author

Also, on which opcodes should we focus now? From list above shifts and rotations are still not implemented, should we start work on them or chose another benchmark? Should we better ship 32-bit computations ASAP?

I think we should start with this Fibonacci benchmark in #90 (comment). It only uses basic opcodes with i32 and i64 and would be a good stepping stone for a more complex ECDSA benchmark. There is no arguing that it would be easier to use i64 for counter in this specific case, but we will very soon need basic i32 operations anyway for other benchmarks, so I think we should include this in Stage 1.

To sum up, for the first Fibonacci benchmark generated from Rust we need:

  • i64.const, i64.add
  • i32.const, i32.add
  • loop, br_if

@aborg-dev
Copy link
Author

aborg-dev commented Nov 24, 2023

Here is SHA256 benchmark with no_std approach + using existing crate sha2: https://github.com/akashin/bench/blob/9cf8fc4c25e685f8b842db5ad2e57f9df2b0b125/src/main.rs

It results in 5.5k lines of WAT, with a few new instructions on top of the set for Fibonacci:

  • i32.sub, i32.lt_u, i32.xor, i32.or, i32.and, i32.rotl
  • i64.shl, i64.or, i64.and, i64.shr_u, i64.extend_i32_u
  • i32.load offset=1048615 align=1, i32.store8
  • i64.store align=1, i64.store offset=8, i64.load offset=1048600

This seems like a sizable enough target to me to declare as a cutoff for the instruction set for Stage 1 as it touches most of the integer operations and also involves memory loads and stores.

I'll repeat the same experiment with hand-written SHA256 to see if this helps trim the program size and look at a similar Keccac benchmark for which we already have dedicated instruction in ZK ASM.

@aborg-dev
Copy link
Author

Handwritten SHA256 benchmark: https://github.com/akashin/bench/blob/21264dfaad1e0b4788e19bb25b5a0314bc879eb9/src/main.rs and associated WAT: https://gist.github.com/akashin/9804635dc299f7fdc727eaefeb94f1df

Very similar results to using sha2, which is a very good sign, meaning that using no_std crates does not impose that much of an overhead.
It results in ~3k WAT and all the same WASM instructions.

It would actually be really interesting how the actual number of ZK ASM counters differs between these two programs, given that SHA2 is supposed to be more "optimized", even though it has more lines in WAT.

aborg-dev added a commit that referenced this issue Nov 27, 2023
This PR changes how run-tests.zkasm.js communicates results to the zkasm-result.py.
Before, we used to parse log lines, now we pass a json object with a test result for each of the tests.
This will allow storing more detailed information like performance counters for #90 and error messages.
aborg-dev added a commit that referenced this issue Nov 27, 2023
This PR changes how run-tests.zkasm.js communicates results to the zkasm-result.py.
Before, we used to parse log lines, now we pass a json object with a test result for each of the tests.
This will allow storing more detailed information like performance counters for #90 and error messages.
@aborg-dev
Copy link
Author

Here is the final version of sha256 that I'm importing:

#![no_main]
#![no_std]

use panic_halt as _;

use sha2::{Sha256, Digest};

extern "C" {
    fn assert_eq(actual: u64, expected: u64);
}
// fn assert_eq(actual: u64, expected: u64) {
//     assert_eq!(actual, expected);
// }

fn subslice_to_u64(slice: &[u8], begin: usize, end: usize) -> u64 {
    let mut result = 0u64;
    for i in begin..end {
        result <<= 8;
        result += slice[i] as u64;
    }
    result
}

#[no_mangle]
pub fn main() {
    let mut hasher = Sha256::new();
    hasher.update(b"hello world");
    let result = hasher.finalize();


    unsafe {
        assert_eq(subslice_to_u64(&result, 0, 8), 13352372148217134600);
        assert_eq(subslice_to_u64(&result, 8, 16), 11902541952223915002);
        assert_eq(subslice_to_u64(&result, 16, 24), 14160706888648589550);
        assert_eq(subslice_to_u64(&result, 24, 32), 10414846460208074217);
    }
}

The hashes were generated with

import hashlib

def subslice_to_u64(s, begin, end):
    result = 0
    for i in range(begin, end):
        result <<= 8;
        result += s[i]
    return result

s = b"hello world"
h = hashlib.sha256(s)
d = h.digest()

print(subslice_to_u64(d, 0, 8))
print(subslice_to_u64(d, 8, 16))
print(subslice_to_u64(d, 16, 24))
print(subslice_to_u64(d, 24, 32))

@aborg-dev
Copy link
Author

Now that SHA256 is submitted (#136) I think we can declare this as done.
The future work for making SHA256 run correctly will be tracked in #143

@mooori
Copy link

mooori commented Mar 18, 2024

@Akashin Is the Cargo.toml that was used when compiling the sha256 snippet above publicly available? I'm trying to reproduce sha256/from_rust.wat based on this repo and the commands contained in its Makefile. However I end up with different WAT code and that WAT compiled to zkASM triggers an error in zkevm-proverjs.

@aborg-dev
Copy link
Author

@Akashin Is the Cargo.toml that was used when compiling the sha256 snippet above publicly available? I'm trying to reproduce sha256/from_rust.wat based on this repo and the commands contained in its Makefile. However I end up with different WAT code and that WAT compiled to zkASM triggers an error in zkevm-proverjs.

Yes, you can find it here: https://github.com/akashin/bench/tree/sha256_final
The WAT can be built by running prepare.sh and adding a start section in the resulting WAT file that points to main function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants