Address Translation & Memory Model #193
Has there been any profiling of what fraction of time RBPF spends in address translation on some benchmark? (Not the percentage of memory-access instructions in a specific program, but the actual wall-clock time spent in RBPF's address-translation code relative to total program execution time.) What would be the maximum theoretical improvement if the address-translation overhead were zero?
bench_jit_vs_interpreter_address_translation: 1016299 ns per loop block. Each loop block is 65536 iterations; timing according to this recent CI benchmark run.
I don't think this is an apples-to-apples comparison: the first loop does a load from memory plus address translation, while the second loop does neither the load nor the address translation. It would be better to measure the time address translation alone takes and then compute the percentage of total execution time spent in it; that would show how much running time could improve if the address-translation overhead were zero. As shown, your 10x improvement includes the time saved by not performing a load operation on every loop iteration.
It does not matter much if you subtract the rest of the loop: I disabled the emission of the x86 load instructions in the JIT, so that only the address translation happens. In other words, (1016299 - 96902) / 1016299 ≈ 90% of the time is spent in address translation in that particular example.
This comparison is not very useful on its own. If you benchmark a program consisting only of load instructions this way, the speedup will appear unbounded, because without address translation the program's running time approaches zero. That is why I think it is important to take benchmarks representative of on-chain programs and estimate the expected speedup on those.
except that maybe it can be used as a baseline for the time a single address translation takes: roughly 1016299 ns / 65536 iterations, i.e. less than 16 ns each.
Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure? These benchmarks are far from perfect, but they still allow a rough estimate of the ballpark we are talking about.
You instrument the address-translation code and measure the time it takes. If you want to be utterly precise, you also measure the overhead of the time measurement itself and subtract it, although in this case the instrumentation overhead may be negligible.
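For concreteness, here is a minimal sketch of that kind of instrumentation in Rust. The `translate` function is a hypothetical stand-in for RBPF's actual translation routine, and the atomic accumulator is just one way to do it:

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Nanoseconds accumulated inside address translation.
static TRANSLATION_NS: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for the VM's real translation routine.
fn translate(vm_addr: u64) -> u64 {
    vm_addr // placeholder: the real region lookup would go here
}

/// Timed wrapper: call this wherever the VM would call `translate`.
/// For full precision, measure the cost of `Instant::now()` itself
/// in a separate loop and subtract it from the accumulated total.
fn translate_timed(vm_addr: u64) -> u64 {
    let t0 = Instant::now();
    let host_addr = translate(vm_addr);
    TRANSLATION_NS.fetch_add(t0.elapsed().as_nanos() as u64, Ordering::Relaxed);
    host_addr
}

fn main() {
    let run_start = Instant::now();
    for i in 0..1_000_000u64 {
        black_box(translate_timed(0x4000_0000 + i));
    }
    let total_ns = run_start.elapsed().as_nanos() as u64;
    let in_translation = TRANSLATION_NS.load(Ordering::Relaxed);
    println!(
        "{:.1}% of wall-clock time inside address translation",
        100.0 * in_translation as f64 / total_ns as f64
    );
}
```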
A rough estimate of what exactly? Of one load per four arithmetic instructions? As I said, it may be taken as a baseline for the time it takes to do a single address translation. If, for example, 90% of your address translations are for addresses that were previously translated, then simple software caching of translated addresses could speed up address translation dramatically. It's important to know what you're optimizing and what the potential benefits are...
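To make the caching idea concrete, here is a minimal sketch of a direct-mapped software TLB keyed on the guest page number. All names, the page size, and the `slow_region_lookup` fallback are hypothetical simplifications, not RBPF's actual code; a real version would also have to check access size and permissions, and invalidate entries whenever the region map changes:

```rust
const PAGE_SHIFT: u64 = 12; // 4 KiB pages (an assumption)
const TLB_ENTRIES: usize = 64;

#[derive(Clone, Copy)]
struct TlbEntry {
    guest_page: u64, // guest virtual page number; u64::MAX marks an empty slot
    host_base: u64,  // host address of the start of that page
}

struct SoftTlb {
    entries: [TlbEntry; TLB_ENTRIES],
}

impl SoftTlb {
    fn new() -> Self {
        Self {
            entries: [TlbEntry { guest_page: u64::MAX, host_base: 0 }; TLB_ENTRIES],
        }
    }

    /// Fast path on a repeated translation within the same page.
    fn translate(&mut self, guest_addr: u64) -> u64 {
        let page = guest_addr >> PAGE_SHIFT;
        let offset = guest_addr & ((1 << PAGE_SHIFT) - 1);
        let slot = (page as usize) % TLB_ENTRIES;
        if self.entries[slot].guest_page == page {
            return self.entries[slot].host_base + offset; // cache hit
        }
        // Cache miss: do the expensive region walk once, then fill the slot.
        let host_base = slow_region_lookup(page << PAGE_SHIFT);
        self.entries[slot] = TlbEntry { guest_page: page, host_base };
        host_base + offset
    }
}

/// Placeholder for the existing (expensive) per-access region lookup.
fn slow_region_lookup(page_addr: u64) -> u64 {
    page_addr // identity mapping, just for the sketch
}

fn main() {
    let mut tlb = SoftTlb::new();
    // The second access to the same page takes the fast path.
    assert_eq!(tlb.translate(0x1010), tlb.translate(0x1010));
}
```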
My worry is that the measured address-translation overhead of 12-16 ns (I've measured this locally too) doesn't explain the measured median CU/us of Serum on mainnet, which corresponds to about 150 ns/opcode. The median across programs is about 20 ns per opcode (50 CU/us), and those programs are probably not dominated by loads and stores...
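For reference, the CU/us-to-ns/opcode conversion used here is just a reciprocal, assuming one compute unit per executed BPF instruction (the 6.7 CU/us input below is simply 1000/150, back-derived from the quoted Serum figure):

```rust
/// Convert a throughput in compute units per microsecond into ns per opcode,
/// assuming one compute unit is charged per executed BPF instruction.
fn ns_per_opcode(cu_per_us: f64) -> f64 {
    1_000.0 / cu_per_us
}

fn main() {
    println!("{:.0} ns/opcode", ns_per_opcode(50.0)); // mainnet median -> 20 ns
    println!("{:.0} ns/opcode", ns_per_opcode(6.7)); // Serum -> ~150 ns
}
```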
@jon-chuang Keep in mind that this was measuring a tight loop with all loads always staying in cache. This is why I am working on a better benchmark solution here: #197. Actually, you could also run some of your own by checking out that branch and adding:
@Lichtso
../../solana/target/release/rbpf-cli --use jit --input input.json target/deploy/ec_math.so --trace
Program input:
accounts []
insndata [1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0] where the insndata corresponds to
The time taken is 2839670 ns (2839 ns/op). This is a slowdown over native (20 ns/op) of approximately 140x. Assuming a 21 ns ldx/stx time and a 4.0 GHz CPU, we obtain the following slowdown over zero-overhead ldx/stx, assuming IPC = 2 (for a predominantly ALU workload):
The slowdown from a lightweight ldx/stx of 8 cycles (2 ns at 4.0 GHz) is
This suggests a 7.1x performance boost is possible, depending on how fast the lightweight ldx/stx can be made. That still leaves a roughly 20x performance gap (140 / 7.1 ≈ 20) that cannot be closed this way...
Btw, this is the CU/us for the above run, which is unbelievably high compared to mainnet's 50 CU/us median:
Here is the CU/us for
This leads me to believe that somehow, the VM running on mainnet is not particularly optimised... perhaps the VM there is somehow built as debug rather than release. But it still doesn't explain why Serum has such poor CU/us...
What does the time measurement of mainnet include? Just VM execution or account serialization & deserialization as well? |
The data is obtained from: https://github.com/solana-labs/solana/pull/16984/files Here is what is stated:
So it may not be an accurate reflection, as you say. Still, according to several sources, memcpy does about 20 GB/s, so even a Serum memcpy of 1 MB of data would take only 50 us. That's a small fraction of the total wall-clock time. I measured this with an 8 KiB load; it comes out to about 120 us/MB.
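If anyone wants to reproduce the copy-throughput number on their own machine, a rough micro-benchmark sketch (build with --release; results vary a lot with buffer size and cache behavior) could look like this:

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    const SIZE: usize = 1 << 20; // 1 MiB per copy (an arbitrary choice)
    const REPS: u32 = 1_000;
    let src = vec![1u8; SIZE];
    let mut dst = vec![0u8; SIZE];

    let t0 = Instant::now();
    for _ in 0..REPS {
        dst.copy_from_slice(&src); // compiles down to a memcpy
        black_box(&mut dst); // keep the copy from being optimized away
    }
    let us_per_mib = t0.elapsed().as_micros() as f64 / REPS as f64;
    let gb_per_s = SIZE as f64 / (us_per_mib * 1e-6) / 1e9;
    println!("~{:.0} us per MiB copied (~{:.1} GB/s)", us_per_mib, gb_per_s);
}
```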
If you want to compare the standalone run with the one deployed on mainnet, then you should only use the execution time and explicitly exclude parameter serialization/deserialization. That might also explain where the "missing" performance gap comes from.
Here is the result of using the flattened, non-Rust-call version of address translation, alongside turning off stack frame gaps, immediate sanitisation, and the compute meter:
Turning off environment register encryption:
That's a 5x improvement from turning off the security features. Still about 25x slower than native, but hey, we're making progress discovering sources of slowdown.
The address translation is known to be the bottleneck of most workloads in the BPF VM at the moment.
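To illustrate why it dominates: conceptually, every guest load or store must first locate the memory region containing the guest address and bounds-check the access before the real memory access can happen. The sketch below is a deliberately simplified stand-in, not RBPF's actual MemoryMapping code:

```rust
/// Simplified stand-in for a guest memory region (the real RBPF
/// MemoryRegion carries more state, e.g. writability).
struct MemoryRegion {
    vm_addr: u64,   // start of the region in guest address space
    host_addr: u64, // where that region lives in host memory
    len: u64,       // region length in bytes
}

/// Per-access work that every guest load/store pays before the actual
/// memory access: find the containing region and bounds-check.
fn translate(regions: &[MemoryRegion], vm_addr: u64, size: u64) -> Result<u64, ()> {
    for region in regions {
        let end = region.vm_addr + region.len;
        if vm_addr >= region.vm_addr && vm_addr.saturating_add(size) <= end {
            return Ok(region.host_addr + (vm_addr - region.vm_addr));
        }
    }
    Err(()) // access violation: no region contains [vm_addr, vm_addr + size)
}

fn main() {
    let regions = [MemoryRegion {
        vm_addr: 0x400000000,
        host_addr: 0x7f0000000000,
        len: 4096,
    }];
    assert_eq!(translate(&regions, 0x400000010, 8), Ok(0x7f0000000010));
}
```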
Ideas to improve the situation: