New dispatch strategy: Add array of instruction handles #310
Most CPUs have difficulty with dynamic branching: if the target address changes, or the density of branches is too high, the pipeline will flush. However, the current match statement probably has the same issue. Many interpreters get around this by dispatching at the end of every instruction block, which creates a unique site for the branch in the instruction cache, so a common sequence of instructions gets its own branch-prediction history at each dispatch site.

This is a vast and interesting topic. I think that https://craftinginterpreters.com/ may be useful.
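As a sketch of the end-of-block dispatch described above, here is a minimal toy interpreter (hypothetical opcodes and types, not revm's actual API) where every handler finishes by looking up and calling the next handler itself, so each instruction kind gets its own indirect-branch site:

```rust
/// Threaded-dispatch sketch. All names here are illustrative, not revm's.
type Handler = fn(&mut Vm);

struct Vm {
    pc: usize,
    code: Vec<u8>,
    stack: Vec<u64>,
    table: [Handler; 256],
}

const OP_HALT: u8 = 0x00;
const OP_ADD: u8 = 0x01;
const OP_PUSH1: u8 = 0x60;

// Each handler ends by dispatching the next opcode itself, so the indirect
// branch after PUSH1 lives at a different address than the one after ADD,
// giving the branch predictor separate history per instruction kind.
// Note: without guaranteed tail calls this recursion grows the stack,
// which is exactly why tail-call optimisation comes up in this thread.
fn dispatch(vm: &mut Vm) {
    // Programs are assumed to end with OP_HALT, so pc stays in bounds.
    let op = vm.code[vm.pc];
    vm.pc += 1;
    let f = vm.table[op as usize];
    f(vm);
}

fn op_push1(vm: &mut Vm) {
    let v = vm.code[vm.pc] as u64;
    vm.pc += 1;
    vm.stack.push(v);
    dispatch(vm);
}

fn op_add(vm: &mut Vm) {
    let b = vm.stack.pop().unwrap();
    let a = vm.stack.pop().unwrap();
    vm.stack.push(a.wrapping_add(b));
    dispatch(vm);
}

fn op_halt(_vm: &mut Vm) {} // returning unwinds the whole dispatch chain

fn run(code: &[u8]) -> Option<u64> {
    let mut table: [Handler; 256] = [op_halt as Handler; 256];
    table[OP_PUSH1 as usize] = op_push1;
    table[OP_ADD as usize] = op_add;
    table[OP_HALT as usize] = op_halt;
    let mut vm = Vm { pc: 0, code: code.to_vec(), stack: Vec::new(), table };
    dispatch(&mut vm);
    vm.stack.pop()
}

fn main() {
    // PUSH1 3, PUSH1 5, ADD, HALT
    assert_eq!(run(&[0x60, 3, 0x60, 5, 0x01, 0x00]), Some(8));
}
```

This is only a sketch of the dispatch shape; a production interpreter would also need gas accounting, bounds checks, and real tail calls for deep programs.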
Rust does not have a variable goto, but tail calls (part of LLVM) may be helpful.
I see, but generating this array at runtime has its own cost, and maybe I am mistaken, but this would get optimized only if we have a static table.

This was shared with me previously: https://docs.rs/tailcall/latest/tailcall/
Yes. Without the tail call optimisation, this would be a performance disaster. DISPATCH does not need to be static; I'm just being lazy. I was going to include this example in my book.
This is the good codegen:
This is possible with the following patch (at 967ac6c, #582), but it requires nightly features:

diff --git a/crates/interpreter/src/instructions/opcode.rs b/crates/interpreter/src/instructions/opcode.rs
index 342e18b..1c5e9c8 100644
--- a/crates/interpreter/src/instructions/opcode.rs
+++ b/crates/interpreter/src/instructions/opcode.rs
@@ -25,13 +25,26 @@ macro_rules! opcodes {
map
};
+ type Instruction = fn(&mut Interpreter, &mut dyn Host);
+ type InstructionTable = [Instruction; 256];
+
+ const fn make_instruction_table<SPEC: Spec>() -> InstructionTable {
+ let mut table: InstructionTable = [control::not_found; 256];
+ let mut i = 0usize;
+ while i < 256 {
+ table[i] = match i as u8 {
+ $($name => $f,)*
+ _ => control::not_found,
+ };
+ i += 1;
+ }
+ table
+ }
+
/// Evaluates the opcode in the given context.
#[inline(always)]
pub(crate) fn eval<SPEC: Spec>(opcode: u8, interpreter: &mut Interpreter, host: &mut dyn Host) {
- match opcode {
- $($name => $f(interpreter, host),)*
- _ => control::not_found(interpreter, host),
- }
+ (const { make_instruction_table::<SPEC>() })[opcode as usize](interpreter, host)
}
};
}
diff --git a/crates/interpreter/src/lib.rs b/crates/interpreter/src/lib.rs
index 5256d0a..27ef2ec 100644
--- a/crates/interpreter/src/lib.rs
+++ b/crates/interpreter/src/lib.rs
@@ -1,4 +1,5 @@
#![cfg_attr(not(feature = "std"), no_std)]
+#![feature(inline_const, const_mut_refs)]
extern crate alloc;
This generates the following assembly (with a manual wrapper around `revm_interpreter::eval`):
.cfi_startproc
movzx eax, dil
lea r8, [rip + .Lanon.34442afdd17b17c672be37f934c5b32c.226]
mov rdi, rsi
mov rsi, rdx
mov rdx, rcx
jmp qword ptr [r8 + 8*rax]

Edit: We can get much closer to the look-up table performance by casting all functions to a `fn` pointer:

// 1. initial: 490,072,689
match opcode {
$($name => $f(interpreter, host),)*
_ => control::not_found(interpreter, host),
}
// 2. cast to fn: 454,501,135
let f: Instruction = match opcode {
$($name => $f as Instruction,)*
_ => control::not_found as Instruction,
};
f(interpreter, host);
// 3. static lookup table (nightly): 446,286,962
(const { make_instruction_table::<SPEC>() })[opcode as usize](interpreter, host);
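Variant 2 above can be reproduced on stable Rust with a tiny standalone sketch (toy opcodes, not revm's macro). Each Rust fn item has its own unique zero-sized type; the `as Instruction` cast erases it, so every match arm yields the same fn-pointer type and the call becomes a single indirect call instead of per-arm inlined code:

```rust
type Instruction = fn(&mut u64);

fn inc(x: &mut u64) { *x += 1; }
fn dbl(x: &mut u64) { *x *= 2; }
fn nop(_x: &mut u64) {}

// Select the handler first, then make one indirect call. Without the
// `as Instruction` casts, each arm would call a distinctly-typed fn item
// and the compiler tends to inline each arm separately.
fn eval(opcode: u8, state: &mut u64) {
    let f: Instruction = match opcode {
        0x01 => inc as Instruction,
        0x02 => dbl as Instruction,
        _ => nop as Instruction,
    };
    f(state);
}

fn main() {
    let mut s = 3u64;
    eval(0x01, &mut s); // inc
    eval(0x02, &mut s); // dbl
    assert_eq!(s, 8);
}
```

The benchmark numbers above suggest this stable-Rust form recovers most of the win of the nightly-only static table.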
The compiler generates much more favorable assembly if all the functions are casted to a `fn(_, _)` pointer before calling them. See <bluealloy#310 (comment)> for more information.
Looks good. Next step in the JIT journey after a threaded interpreter is an "all calls" interpreter, where every opcode is converted to a call. You want to generate something like this: https://godbolt.org/z/abfhW69zq
Where the numbers like 0x01 are the opcodes OP_ADD, OP_SSTORE, etc.

Your very simple JIT can generate this without using something complex (and heavy) like LLVM. Enhance this with N-gram instruction coalescence:

OP_PUSH 1
OP_ADD

becomes

OP_ADDCONST 1

and you start to get something within 50% of the optimal performance. Note: more sophisticated JITs exist, but they have diminishing returns.
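A minimal sketch of that coalescing idea (opcode values are illustrative: 0x60 matches EVM's PUSH1, while OP_ADDCONST is a hypothetical fused superinstruction): scan the bytecode once and rewrite each PUSH1 n, ADD pair into ADDCONST n:

```rust
const OP_PUSH1: u8 = 0x60;
const OP_ADD: u8 = 0x01;
const OP_ADDCONST: u8 = 0xB0; // hypothetical fused superinstruction

// One peephole pass over the bytecode. PUSH1 operands are skipped
// explicitly so an immediate byte of 0x60 is never misread as an opcode;
// a real pass would handle all PUSH1..PUSH32 operand widths the same way.
fn coalesce(code: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(code.len());
    let mut i = 0;
    while i < code.len() {
        match code[i] {
            OP_PUSH1 if i + 2 < code.len() && code[i + 2] == OP_ADD => {
                // PUSH1 n, ADD  ->  ADDCONST n
                out.push(OP_ADDCONST);
                out.push(code[i + 1]);
                i += 3;
            }
            OP_PUSH1 if i + 1 < code.len() => {
                out.push(OP_PUSH1);
                out.push(code[i + 1]);
                i += 2;
            }
            op => {
                out.push(op);
                i += 1;
            }
        }
    }
    out
}

fn main() {
    // PUSH1 7, ADD, HALT  ->  ADDCONST 7, HALT
    assert_eq!(coalesce(&[0x60, 7, 0x01, 0x00]), vec![0xB0, 7, 0x00]);
}
```

The interpreter (or simple JIT) then needs one extra handler for ADDCONST that adds the immediate to the stack top, saving a push, a pop, and a dispatch per fused pair.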
* perf: refactor interpreter internals (take 2)
* perf: cast instruction functions to `fn`. The compiler generates much more favorable assembly if all the functions are casted to a `fn(_, _)` pointer before calling them. See <#310 (comment)> for more information.
* chore: remove prelude
* perf: remove stack and memory bound checks on release
* chore: re-add and deprecate `Memory::get_slice`
* readd BLOBHASH
* fix: TSTORE and TLOAD order
* some cleanup
* nits
This is finally considered finished.
I experimented a little bit with this, as in having

[InstructionFn; 256]

but I had a lot of changes and I am not sure if this was good for perf or not. Either way, having this array can be beneficial in multiple ways, such as adding custom handles that could potentially replace the inspector, or allowing additional ways to access the internal state of the interpreter.

Blocked by: #307
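A sketch of the custom-handles idea mentioned above (hypothetical standalone types, not revm's actual interpreter API): once dispatch goes through an owned `[Instruction; 256]`, a caller can swap an entry for a wrapper, e.g. for tracing, without touching the interpreter core:

```rust
type Instruction = fn(&mut Vec<u64>);

fn op_add(stack: &mut Vec<u64>) {
    let b = stack.pop().expect("stack underflow");
    let a = stack.pop().expect("stack underflow");
    stack.push(a.wrapping_add(b));
}

fn op_nop(_stack: &mut Vec<u64>) {}

fn default_table() -> [Instruction; 256] {
    let mut t: [Instruction; 256] = [op_nop as Instruction; 256];
    t[0x01] = op_add; // ADD
    t
}

// A custom handle: wraps the stock ADD with tracing, sketching how a
// swappable table could stand in for parts of the inspector.
fn traced_add(stack: &mut Vec<u64>) {
    println!("ADD, stack depth {}", stack.len());
    op_add(stack);
}

fn main() {
    let mut table = default_table();
    table[0x01] = traced_add; // override without touching the core loop
    let mut stack: Vec<u64> = vec![2, 40];
    table[0x01](&mut stack);
    assert_eq!(stack, vec![42]);
}
```

The same mechanism gives external code a hook into interpreter state: any handle receives the full mutable state, so observation and modification both go through one table.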