Big jump tables #288

Open · wants to merge 17 commits into base: edubart
Conversation

edubart (Contributor) commented Oct 29, 2024

This optimizes our RISC-V instruction decoder by using big jump tables, through token threading, to the point that decoding takes just 2 host instructions for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction takes just 12 host instructions with both GCC on AMD64 and Clang on ARM64.

Here is the GCC x86_64 trace as proof:

//// FENCE GCC x86_64 (2/12 instructions)
// increment mcycle (3 instructions)
=> 0x7ffff7a2e98c <loop+28108>:   add    $0x1,%r15                     // mcycle += 1
=> 0x7ffff7a2e990 <loop+28112>:   cmp    %r13,%r15                     // mcycle < mcycle_tick_end
=> 0x7ffff7a2e993 <loop+28115>:   jae    0x7ffff7a2f230 <loop+30320>   // -> break loop
// fetch (5 instructions)
=> 0x7ffff7a2e999 <loop+28121>:   mov    %r10,%rbx                     // pc
=> 0x7ffff7a2e99c <loop+28124>:   xor    %rbp,%rbx                     // pc ^ fetch_vaddr_page
=> 0x7ffff7a2e99f <loop+28127>:   cmp    $0xffd,%rbx                   // check fetch page
=> 0x7ffff7a2e9a6 <loop+28134>:   ja     0x7ffff7a27d00 <loop+320>     // -> miss fetch
=> 0x7ffff7a2e9ac <loop+28140>:   mov    (%r14,%rbp,1),%ebx            // insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode (2 instructions)
=> 0x7ffff7a2e9b0 <loop+28144>:   movzwl %bx,%ecx                      // insn & 0b1111111111111111
=> 0x7ffff7a2e9b3 <loop+28147>:   jmp    *(%r11,%rcx,8)                // -> jump to instruction
// execute (2 instructions)
=> 0x7ffff7a2ea3b <loop+28283>:   add    $0x4,%rbp                     // pc += 4
=> 0x7ffff7a2ea3f <loop+28287>:   jmp    0x7ffff7a2e98c <loop+28108>   // -> jump to loop begin

And here is the Clang arm64 trace:

//// FENCE Clang arm64 (2/12 instructions)
// increment mcycle
=> 0xfffff7b8a328 <loop+4568>:    add x25, x25, $0x1
=> 0xfffff7b8a32c <loop+4572>:    cmp x25, x27
=> 0xfffff7b8a330 <loop+4576>:    b.cs    0xfffff7b8e7a8 <loop+22104>
// fetch
=> 0xfffff7b8a334 <loop+4580>:    eor x19, x20, x28
=> 0xfffff7b8a338 <loop+4584>:    cmp x19, $0xffd
=> 0xfffff7b8a33c <loop+4588>:    b.hi    0xfffff7b89264 <loop+276>
=> 0xfffff7b8a340 <loop+4592>:    ldr w19, [x20, x22]
// decode
=> 0xfffff7b8a344 <loop+4596>:    and w10, w19, $0xffff
=> 0xfffff7b8a348 <loop+4600>:    ldr x16, [x24, x10, lsl $3]
=> 0xfffff7b8a34c <loop+4604>:    br  x16
// execute
=> 0xfffff7b8dde8 <loop+19608>:   add x20, x20, $0x4
=> 0xfffff7b8ddec <loop+19612>:   b   0xfffff7b8a328 <loop+4568>

In emulator v0.18.x, the trace for the same FENCE RISC-V instruction took about 40 x86_64 instructions.

Overall, the speedup ranges from about 1.1x up to 2.6x across many benchmarks relative to emulator v0.18.1. Here are the results for various benchmarks with stress-ng:

Benchmarks

Times faster   Benchmark
2.56 ± 0.03    stress-ng --no-rand-seed --syscall 1 --syscall-ops 4000
2.15 ± 0.02    stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
1.95 ± 0.00    stress-ng --no-rand-seed --cpu 1 --cpu-method fibonacci --cpu-ops 400
1.94 ± 0.01    stress-ng --no-rand-seed --cpu 1 --cpu-method int64 --cpu-ops 400
1.90 ± 0.01    stress-ng --no-rand-seed --memcpy 1 --memcpy-ops 50
1.88 ± 0.02    stress-ng --no-rand-seed --crypt 1 --crypt-method SHA-256 --crypt-ops 400000
1.87 ± 0.01    stress-ng --no-rand-seed --qsort 1 --qsort-ops 5
1.83 ± 0.01    stress-ng --no-rand-seed --memrate 1 --memrate-bytes 2M --memrate-ops 200
1.82 ± 0.03    stress-ng --no-rand-seed --hash 1 --hash-ops 40000
1.75 ± 0.00    stress-ng --no-rand-seed --heapsort 1 --heapsort-ops 3
1.72 ± 0.01    stress-ng --no-rand-seed --zlib 1 --zlib-ops 20
1.66 ± 0.00    stress-ng --no-rand-seed --matrix 1 --matrix-method mult --matrix-ops 20000
1.49 ± 0.02    stress-ng --no-rand-seed --hdd 1 --hdd-ops 2000
1.41 ± 0.00    stress-ng --no-rand-seed --fp 1 --fp-method floatadd --fp-ops 1000
1.33 ± 0.01    stress-ng --no-rand-seed --fma 1 --fma-ops 40000
1.24 ± 0.01    stress-ng --no-rand-seed --trig 1 --trig-ops 50
1.16 ± 0.01    stress-ng --no-rand-seed --fork 1 --fork-ops 1000
1.14 ± 0.01    stress-ng --no-rand-seed --malloc 1 --malloc-ops 40000

You can see a 1.94x speedup for integer operations. Notably, I am able to reach GHz speeds for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than native host execution.

The benchmark table was created by running hyperfine with stress-ng, for example:

$ hyperfine -w 1 'cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400' '/usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400'
Benchmark 1: cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      2.225 s ±  0.021 s    [User: 2.213 s, System: 0.010 s]
  Range (min … max):    2.197 s …  2.257 s    10 runs
 
Benchmark 2: /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      4.615 s ±  0.041 s    [User: 4.602 s, System: 0.009 s]
  Range (min … max):    4.561 s …  4.682 s    10 runs
 
Summary
  cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400 ran
    2.07 ± 0.03 times faster than /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400

This PR is an evolution of #226.

@edubart added the enhancement label Oct 29, 2024
@edubart self-assigned this Oct 29, 2024