Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New dispatch strategy: Add array of instruction handles #310

Closed
rakita opened this issue Dec 24, 2022 · 9 comments
Closed

New dispatch strategy: Add array of instruction handles #310

rakita opened this issue Dec 24, 2022 · 9 comments
Labels
good first issue Good for newcomers

Comments

@rakita
Copy link
Member

rakita commented Dec 24, 2022

I experimented a little bit with this, as in having [InstructionFn;256] but I had a lot of changes and I am not sure if this was good for perf or not. Either way having this Array can be beneficial in multiple ways as in adding custom handles that could potentially replace the inspector, or allowing an additional ways to access the internal state of the interpreter.

Blocked by: #307

@rakita rakita added the good first issue Good for newcomers label Dec 24, 2022
@andy-thomason
Copy link
Contributor

Most CPUs have difficulty with dynamic branching as if the address changes or the density of branches is too high the pipeline will flush.

However, the current match statement probably has the same issue.

Many interpreters get round this by dispatching at the end of every instruction block to create a unique site for the branch in ICache. This means that if a common sequence of instructions such as:

PUSH4
EQ
JUMPI

Occurs, the PUSH4's ICache will likely dispatch to EQ without a stall as will EQ to JUMPI.

This is a vast and interesting topic. I think that:

https://craftinginterpreters.com/

May be useful.

@andy-thomason
Copy link
Contributor

Rust does not have a variable goto, but tail calls (part of LLVM) may be helpful.

@andy-thomason
Copy link
Contributor

https://github.com/fast-rust/tailcall-interpreter/blob/main/src/main.rs

@rakita
Copy link
Member Author

rakita commented Jan 25, 2023

https://github.com/fast-rust/tailcall-interpreter/blob/main/src/main.rs

I see, but generating this array in runtime has its own cost and maybe I am mistaken but this would get optimized only if we have static DISPATCH if it is not static, this would be increasing call stack with every instruction, but I am not the expert on the matter?

This was shared with me previously: https://docs.rs/tailcall/latest/tailcall/

@andy-thomason
Copy link
Contributor

Yes. Without the tail call optimisation, this would be a performance disaster.
This will happen if the state is too large or if LLVM was having a bad day.
Sadly, LLVM does not distinguish between "works" and "works but is too slow to use".

DISPATCH does not need to be static, I'm just being lazy.

I was going to include this example in my book.

@andy-thomason
Copy link
Contributor

This is the good codegen:

tailcall_interpreter::add:
 inc     dword, ptr, [rsi]
 movzx   eax, byte, ptr, [rdi]
 inc     rdi
 lea     rcx, [rip, +, _ZN20tailcall_interpreter8DISPATCH17hbd87f1336b41e07aE]
 jmp     qword, ptr, [rcx, +, 8*rax]

@DaniPopes
Copy link
Collaborator

DaniPopes commented Aug 3, 2023

This is possible with the following patch (at 967ac6c, #582), but it requires the inline_const and const_mut_refs nightly features:

diff --git a/crates/interpreter/src/instructions/opcode.rs b/crates/interpreter/src/instructions/opcode.rs
index 342e18b..1c5e9c8 100644
--- a/crates/interpreter/src/instructions/opcode.rs
+++ b/crates/interpreter/src/instructions/opcode.rs
@@ -25,13 +25,26 @@ macro_rules! opcodes {
             map
         };
 
+        type Instruction = fn(&mut Interpreter, &mut dyn Host);
+        type InstructionTable = [Instruction; 256];
+
+        const fn make_instruction_table<SPEC: Spec>() -> InstructionTable {
+            let mut table: InstructionTable = [control::not_found; 256];
+            let mut i = 0usize;
+            while i < 256 {
+                table[i] = match i as u8 {
+                    $($name => $f,)*
+                    _ => control::not_found,
+                };
+                i += 1;
+            }
+            table
+        }
+
         /// Evaluates the opcode in the given context.
         #[inline(always)]
         pub(crate) fn eval<SPEC: Spec>(opcode: u8, interpreter: &mut Interpreter, host: &mut dyn Host) {
-            match opcode {
-                $($name => $f(interpreter, host),)*
-                _ => control::not_found(interpreter, host),
-            }
+            (const { make_instruction_table::<SPEC>() })[opcode as usize](interpreter, host)
         }
     };
 }
diff --git a/crates/interpreter/src/lib.rs b/crates/interpreter/src/lib.rs
index 5256d0a..27ef2ec 100644
--- a/crates/interpreter/src/lib.rs
+++ b/crates/interpreter/src/lib.rs
@@ -1,4 +1,5 @@
 #![cfg_attr(not(feature = "std"), no_std)]
+#![feature(inline_const, const_mut_refs)]
 
 extern crate alloc;
 

Generates the following (with a manual wrapper that specifies SPEC):

revm_interpreter::eval:

	.cfi_startproc
	movzx eax, dil
	lea r8, [rip + .Lanon.34442afdd17b17c672be37f934c5b32c.226]
	mov rdi, rsi

	mov rsi, rdx

	mov rdx, rcx

	jmp qword ptr [r8 + 8*rax]

Edit: We can get much closer to the look-up table performance by casting all functions to a fn pointer before calling it, and it's doable in current stable (number is instructions count in the benchmark in #558):

// 1. initial: 490,072,689
match opcode {
    $($name => $f(interpreter, host),)*
    _ => control::not_found(interpreter, host),
}

// 2. cast to fn: 454,501,135
let f: Instruction = match opcode {
    $($name => $f as Instruction,)*
    _ => control::not_found as Instruction,
};
f(interpreter, host);

// 3. static lookup table (nightly): 446,286,962
(const { make_instruction_table::<SPEC>() })[opcode as usize](interpreter, host);

DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 4, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 7, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
@andy-thomason
Copy link
Contributor

Looks good.

Next step in the JIT journey after a threaded interpreter is an "all calls" interpreter where every opcode is converted to
a call.

You want to generate something like this:

https://godbolt.org/z/abfhW69zq

pub struct State;

type Call = fn (state: &mut State) -> Result<(), i32>;

pub fn contract(dispatch: &[Call; 256], state: &mut State) -> Result<(), i32> {
    dispatch[0x00](state)?;
    dispatch[0x01](state)?;
    dispatch[0x02](state)?;
    Ok(())
}

Where the numbers like 0x01 are the opcodes OP_ADD, OP_SSTORE etc.

Use as -al to get the X86 opcodes:

   1              
   2                        .text
   3 0000 4157                  pushq   %r15
   4 0002 4156                  pushq   %r14
   5 0004 53                    pushq   %rbx
   6 0005 4989FE                movq    %rdi, %r14
   7 0008 4889F7                movq    %rsi, %rdi
   8 000b 4989F7                movq    %rsi, %r15
   9 000e 41FF16                callq   *(%r14)
  10 0011 BB010000              movl    $1, %ebx
  10      00
  11 0016 85C0                  testl   %eax, %eax
  12 0018 7519                  jne     .LBB0_3
  13 001a 4C89FF                movq    %r15, %rdi
  14 001d 41FF5608              callq   *8(%r14)
  15 0021 85C0                  testl   %eax, %eax
  16 0023 750E                  jne     .LBB0_3
  17 0025 4C89FF                movq    %r15, %rdi
  18 0028 41FF5610              callq   *16(%r14)
  19 002c 31DB                  xorl    %ebx, %ebx
  20 002e 85C0                  testl   %eax, %eax
  21 0030 0F95C3                setne   %bl
  22                    .LBB0_3:
  23 0033 89D8                  movl    %ebx, %eax
  24 0035 5B                    popq    %rbx
  25 0036 415E                  popq    %r14
  26 0038 415F                  popq    %r15
  27 003a C3                    retq...

Your very simple JIT can generate this without using something complex (and heavy) like LLVM
by allocating an executable page and poking in the X86 code and calling it. On X86 you don't
need an ICache flush.

Enhance this with N-gram instruction coalescence:

OP_PUSH 1
OP_ADD

Becomes

OP_ADDCONST 1

And you start to get something within 50% of the optimal performance
with almost zero compile time.

Note: More sophisticated JITs exist, but they have diminishing returns.

DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 9, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 15, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 16, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 16, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 22, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 24, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
DaniPopes added a commit to DaniPopes/revm that referenced this issue Aug 25, 2023
The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy#310 (comment)>
for more information.
rakita pushed a commit that referenced this issue Sep 20, 2023
* perf: refactor interpreter internals (take 2)

* perf: cast instruction functions to `fn`

The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <#310 (comment)>
for more information.

* chore: remove prelude

* perf: remove stack and memory bound checks on release

* chore: re-add and deprecate `Memory::get_slice`

* readd BLOBHASH

* fix: TSTORE and TLOAD order

* some cleanup

* nits
@rakita
Copy link
Member Author

rakita commented Dec 25, 2023

This is finally considered finished

@rakita rakita closed this as completed Dec 25, 2023
rising9719 pushed a commit to rising9719/revm-v-rust that referenced this issue Oct 25, 2024
* perf: refactor interpreter internals (take 2)

* perf: cast instruction functions to `fn`

The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy/revm#310 (comment)>
for more information.

* chore: remove prelude

* perf: remove stack and memory bound checks on release

* chore: re-add and deprecate `Memory::get_slice`

* readd BLOBHASH

* fix: TSTORE and TLOAD order

* some cleanup

* nits
migramirez2 pushed a commit to migramirez2/Revm that referenced this issue Nov 5, 2024
* perf: refactor interpreter internals (take 2)

* perf: cast instruction functions to `fn`

The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy/revm#310 (comment)>
for more information.

* chore: remove prelude

* perf: remove stack and memory bound checks on release

* chore: re-add and deprecate `Memory::get_slice`

* readd BLOBHASH

* fix: TSTORE and TLOAD order

* some cleanup

* nits
hondairoban added a commit to hondairoban/revm that referenced this issue Feb 7, 2025
* perf: refactor interpreter internals (take 2)

* perf: cast instruction functions to `fn`

The compiler generates much more favorable assembly if all the
functions are casted to a `fn(_, _)` pointer before calling them.

See <bluealloy/revm#310 (comment)>
for more information.

* chore: remove prelude

* perf: remove stack and memory bound checks on release

* chore: re-add and deprecate `Memory::get_slice`

* readd BLOBHASH

* fix: TSTORE and TLOAD order

* some cleanup

* nits
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants