Big jump tables #288

Open · wants to merge 17 commits into base: edubart
Conversation

edubart (Contributor) commented Oct 29, 2024

This optimizes our RISC-V instruction decoder by using big jump tables, through token threading, to the point that decoding takes just 2 host instructions for most RISC-V instructions, even compressed ones. Overall, the FENCE instruction takes just 12 host instructions with both GCC on AMD64 and Clang on ARM64.

Here is the GCC x86_64 trace as proof:

//// FENCE GCC x86_64 (2/12 instructions)
// increment mcycle (3 instructions)
=> 0x7ffff7a2e98c <loop+28108>:   add    $0x1,%r15                     // mcycle += 1
=> 0x7ffff7a2e990 <loop+28112>:   cmp    %r13,%r15                     // mcycle < mcycle_tick_end
=> 0x7ffff7a2e993 <loop+28115>:   jae    0x7ffff7a2f230 <loop+30320>   // -> break loop
// fetch (5 instructions)
=> 0x7ffff7a2e999 <loop+28121>:   mov    %r10,%rbx                     // pc
=> 0x7ffff7a2e99c <loop+28124>:   xor    %rbp,%rbx                     // pc ^ fetch_vaddr_page
=> 0x7ffff7a2e99f <loop+28127>:   cmp    $0xffd,%rbx                   // check fetch page
=> 0x7ffff7a2e9a6 <loop+28134>:   ja     0x7ffff7a27d00 <loop+320>     // -> miss fetch
=> 0x7ffff7a2e9ac <loop+28140>:   mov    (%r14,%rbp,1),%ebx            // insn = *(uint32_t*)(pc + fetch_vh_offset)
// decode (2 instructions)
=> 0x7ffff7a2e9b0 <loop+28144>:   movzwl %bx,%ecx                      // insn & 0b1111111111111111
=> 0x7ffff7a2e9b3 <loop+28147>:   jmp    *(%r11,%rcx,8)                // -> jump to instruction
// execute (2 instructions)
=> 0x7ffff7a2ea3b <loop+28283>:   add    $0x4,%rbp                     // pc += 4
=> 0x7ffff7a2ea3f <loop+28287>:   jmp    0x7ffff7a2e98c <loop+28108>   // -> jump to loop begin

And here is the Clang arm64 trace:

//// FENCE Clang arm64 (2/12 instructions)
// increment mcycle
=> 0xfffff7b8a328 <loop+4568>:    add x25, x25, $0x1
=> 0xfffff7b8a32c <loop+4572>:    cmp x25, x27
=> 0xfffff7b8a330 <loop+4576>:    b.cs    0xfffff7b8e7a8 <loop+22104>
// fetch
=> 0xfffff7b8a334 <loop+4580>:    eor x19, x20, x28
=> 0xfffff7b8a338 <loop+4584>:    cmp x19, $0xffd
=> 0xfffff7b8a33c <loop+4588>:    b.hi    0xfffff7b89264 <loop+276>
=> 0xfffff7b8a340 <loop+4592>:    ldr w19, [x20, x22]
// decode
=> 0xfffff7b8a344 <loop+4596>:    and w10, w19, $0xffff
=> 0xfffff7b8a348 <loop+4600>:    ldr x16, [x24, x10, lsl $3]
=> 0xfffff7b8a34c <loop+4604>:    br  x16
// execute
=> 0xfffff7b8dde8 <loop+19608>:   add x20, x20, $0x4
=> 0xfffff7b8ddec <loop+19612>:   b   0xfffff7b8a328 <loop+4568>

In emulator v0.18.x, the trace for the same FENCE RISC-V instruction took about 40 x86_64 instructions.

Overall, the speedup ranges from about 1.1x up to 2.6x across many benchmarks relative to emulator v0.18.1. Here are the results for various benchmarks with stress-ng:

Benchmarks

Times faster   Benchmark
2.56 ± 0.03    stress-ng --no-rand-seed --syscall 1 --syscall-ops 4000
2.15 ± 0.02    stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
1.95 ± 0.00    stress-ng --no-rand-seed --cpu 1 --cpu-method fibonacci --cpu-ops 400
1.94 ± 0.01    stress-ng --no-rand-seed --cpu 1 --cpu-method int64 --cpu-ops 400
1.90 ± 0.01    stress-ng --no-rand-seed --memcpy 1 --memcpy-ops 50
1.88 ± 0.02    stress-ng --no-rand-seed --crypt 1 --crypt-method SHA-256 --crypt-ops 400000
1.87 ± 0.01    stress-ng --no-rand-seed --qsort 1 --qsort-ops 5
1.83 ± 0.01    stress-ng --no-rand-seed --memrate 1 --memrate-bytes 2M --memrate-ops 200
1.82 ± 0.03    stress-ng --no-rand-seed --hash 1 --hash-ops 40000
1.75 ± 0.00    stress-ng --no-rand-seed --heapsort 1 --heapsort-ops 3
1.72 ± 0.01    stress-ng --no-rand-seed --zlib 1 --zlib-ops 20
1.66 ± 0.00    stress-ng --no-rand-seed --matrix 1 --matrix-method mult --matrix-ops 20000
1.49 ± 0.02    stress-ng --no-rand-seed --hdd 1 --hdd-ops 2000
1.41 ± 0.00    stress-ng --no-rand-seed --fp 1 --fp-method floatadd --fp-ops 1000
1.33 ± 0.01    stress-ng --no-rand-seed --fma 1 --fma-ops 40000
1.24 ± 0.01    stress-ng --no-rand-seed --trig 1 --trig-ops 50
1.16 ± 0.01    stress-ng --no-rand-seed --fork 1 --fork-ops 1000
1.14 ± 0.01    stress-ng --no-rand-seed --malloc 1 --malloc-ops 40000

You can see a 1.94x speedup for integer operations. Notably, I am able to reach GHz speeds for some simple integer arithmetic benchmarks, with the interpreter being only 10~20x slower than native host execution.

The benchmark table was created by running hyperfine with stress-ng, for example:

$ hyperfine -w 1 'cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400' '/usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400'
Benchmark 1: cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      2.225 s ±  0.021 s    [User: 2.213 s, System: 0.010 s]
  Range (min … max):    2.197 s …  2.257 s    10 runs
 
Benchmark 2: /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400
  Time (mean ± σ):      4.615 s ±  0.041 s    [User: 4.602 s, System: 0.009 s]
  Range (min … max):    4.561 s …  4.682 s    10 runs
 
Summary
  cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400 ran
    2.07 ± 0.03 times faster than /usr/bin/cartesi-machine -- stress-ng --no-rand-seed --cpu 1 --cpu-method loop --cpu-ops 400

This PR is an evolution of #226.

@edubart added the enhancement label Oct 29, 2024
@edubart self-assigned this Oct 29, 2024