
Address Translation & Memory Model #193

Closed
Lichtso opened this issue Jul 9, 2021 · 16 comments
Labels
enhancement New feature or request

Comments

@Lichtso

Lichtso commented Jul 9, 2021

The address translation is known to be the bottleneck of most workloads in the BPF VM at the moment.

Ideas to improve the situation:

  • Hoist address translations out of loop bodies so that they happen less frequently. This would require the static analyzer to be reliable and secure in order to detect loops. It would also need an efficient way to pass the translated results into the loop body and to check during execution that they are still in bounds, plus a fallback for when that bounds check fails. Furthermore, dynamic calls would have to be restricted so that they cannot jump into a loop, either by excluding loops or by only allowing registered symbols as the jump table.
  • Manually optimize the address translation at the x86 instruction level to reduce redundant bounds checks and the need for register spilling. This will probably not yield much, maybe a few percent improvement.
  • Change the memory model to something simpler, e.g. what WASM does. This would be the most radical change, but also the one which might help the most.
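
For illustration, a minimal sketch of the difference between region-based translation and a WASM-style linear memory (types and names are simplified stand-ins, not rbpf's actual implementation):

```rust
/// Hypothetical region descriptor, loosely modeled on what region-based
/// translation needs: a guest start address, a length, and a host base address.
struct Region {
    vm_addr: u64,
    len: u64,
    host_addr: u64,
}

/// Region-based translation: every load/store has to find the containing
/// region and bounds-check against it before the host address can be formed.
fn translate_region_based(regions: &[Region], vm_addr: u64, size: u64) -> Option<u64> {
    regions.iter().find_map(|r| {
        let end = vm_addr.checked_add(size)?;
        if vm_addr >= r.vm_addr && end <= r.vm_addr.checked_add(r.len)? {
            Some(r.host_addr + (vm_addr - r.vm_addr))
        } else {
            None
        }
    })
}

/// WASM-style linear memory: one contiguous allocation, so translation
/// collapses to a single bounds check plus an add.
fn translate_linear(host_base: u64, mem_len: u64, offset: u64, size: u64) -> Option<u64> {
    if offset.checked_add(size)? <= mem_len {
        Some(host_base + offset)
    } else {
        None
    }
}
```

The per-access cost difference between the region walk and the single compare-and-add is essentially what the benchmarks below are measuring.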
@Lichtso Lichtso added the enhancement New feature or request label Jul 9, 2021
@dmakarov
Collaborator

dmakarov commented Jul 9, 2021

Has there been any profiling done on what fraction of time RBPF spends in address translation for some benchmark? (Not the percentage of memory access instructions in a specific program, but the actual wall-clock time spent in RBPF's address translation code relative to the whole program execution time.)

What would be the maximum theoretical improvement if address translation overhead were 0?

@Lichtso
Author

Lichtso commented Jul 12, 2021

bench_jit_vs_interpreter_address_translation: 1016299 ns per loop block
bench_jit_vs_interpreter_empty_for_loop: 96902 ns per loop block

Each loop block is 65536 iterations.
So we are talking about a possible ~10x improvement.

Timing according to this recent CI benchmark run.

@dmakarov
Collaborator

I think this is not exactly a correct comparison, because the first loop does a load from memory plus address translation, while the second loop does neither the address translation nor the load from memory. It would be better to measure the time it takes to do the address translation alone and then compute the percentage of the whole execution time spent in address translation; that would show how much running time would improve if the address translation overhead were 0. As shown, your 10x improvement also includes the time saved by not performing a load operation on every loop iteration.

@Lichtso
Author

Lichtso commented Jul 12, 2021

It does not matter that much if you subtract the rest of the loop:
1016299 / 96902 ~ 10.5
(1016299 - 96902) / 96902 ~ 9.5

I disabled the emission of x86 load instructions in the JIT, so that only the address translation happens.
I tested it a few times with and without load instructions and the result remains the same (within the variance of the benchmark). Probably because it stays in the cache.

In other words: (1016299 - 96902) / 1016299 ~ 90% of the time is spent in address translation in that particular example.

@dmakarov
Collaborator

This comparison is not very useful. If you compare, in this way, a program that does only load instructions, your speedup will be infinite, because without address translation the program's running time approaches 0. This is why I think it is important to take characteristic benchmarks representative of on-chain programs and estimate the expected speedup on such benchmarks.

@dmakarov
Collaborator

Except maybe it can be used as a baseline for the time it takes to execute a single address translation: roughly 1 ms / 65536, or less than 16 ns.

@Lichtso
Author

Lichtso commented Jul 12, 2021

This is why I think it is important to take characteristic benchmarks representative of on-chain programs

Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure? These benchmarks here are far from perfect, but they still allow a rough estimate of the ballpark we are talking about.

@dmakarov
Collaborator

This is why I think it is important to take characteristic benchmarks representative of on-chain programs

Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure?

You instrument the address translation code and measure the time it takes. You also measure the overhead of taking the measurement itself and subtract that, if you want to be utterly precise, although in this case the instrumentation overhead may be negligible.

These benchmarks here are far from perfect, but they still allow a rough estimate of the ballpark we are talking about.

A rough estimate of what exactly? Of one load per 4 arithmetic instructions? Yes, like I said, it may be taken as a baseline for the time it takes to do a single address translation. If, for example, 90% of your address translations happen for addresses that were previously translated, then simple software caching of translated addresses could speed up the address translation dramatically. It's important to know what you're optimizing and what the potential benefits are...
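
For illustration, a minimal sketch of what such a software cache of translated addresses could look like, as a tiny direct-mapped table keyed by guest page (names and layout are made up for this sketch, not rbpf's actual API):

```rust
/// A tiny direct-mapped cache of previously translated addresses, keyed by
/// guest page number. Purely illustrative; a real version would also have to
/// handle accesses that straddle a page boundary, track permissions, and be
/// flushed whenever the memory regions change.
const PAGE_SHIFT: u64 = 12; // assume 4 KiB "pages" for the cache
const OFFSET_MASK: u64 = (1 << PAGE_SHIFT) - 1;
const SLOTS: usize = 16;

#[derive(Clone, Copy, Default)]
struct Entry {
    vm_page: u64,
    host_page_base: u64, // host address of the start of that guest page
    valid: bool,
}

struct SoftTlb {
    entries: [Entry; SLOTS],
}

impl SoftTlb {
    fn new() -> Self {
        Self { entries: [Entry::default(); SLOTS] }
    }

    /// Fast path: return the cached host address, or None on a miss
    /// (the caller would then do the full region walk and call `insert`).
    fn lookup(&self, vm_addr: u64) -> Option<u64> {
        let page = vm_addr >> PAGE_SHIFT;
        let entry = self.entries[page as usize % SLOTS];
        if entry.valid && entry.vm_page == page {
            Some(entry.host_page_base + (vm_addr & OFFSET_MASK))
        } else {
            None
        }
    }

    /// Cache the result of a full (slow-path) translation.
    fn insert(&mut self, vm_addr: u64, host_addr: u64) {
        let page = vm_addr >> PAGE_SHIFT;
        self.entries[page as usize % SLOTS] = Entry {
            vm_page: page,
            host_page_base: host_addr - (vm_addr & OFFSET_MASK),
            valid: true,
        };
    }
}
```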

@jon-chuang

jon-chuang commented Jul 13, 2021

My worry is that the measured address translation overhead of 12-16 ns (I've measured this locally too) doesn't explain the measured median CU/us of Serum on mainnet, which corresponds to about 150 ns/opcode.

The median across programs is about 20 ns per opcode (50 CU/us), and those programs are probably not dominated by loads and stores...

@Lichtso
Author

Lichtso commented Jul 13, 2021

@jon-chuang Keep in mind that this was measuring a tight loop with all loads always staying in cache. This is why I am working on a better benchmark solution here: #197

Actually, you could also run some of your own by checking out that branch and adding:
emit_stopwatch(jit, true)?; here
and
?; emit_stopwatch(jit, false) here
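
The general idea (sketched here on the host side rather than as the JIT-emitted x86, with a dummy loop standing in for the measured region) is to bracket the region of interest with timestamp-counter reads and subtract the overhead of the measurement itself, as suggested above; assuming x86_64:

```rust
fn main() {
    use std::arch::x86_64::_rdtsc;

    // Estimate the cost of taking a measurement so it can be subtracted.
    let overhead = unsafe {
        let start = _rdtsc();
        let end = _rdtsc();
        end - start
    };

    let mut sink = 0u64;
    let cycles = unsafe {
        let start = _rdtsc();
        // Region of interest; this dummy loop stands in for the code
        // (e.g. the address translation) that would actually be measured.
        for i in 0..65536u64 {
            sink = sink.wrapping_add(i);
        }
        let end = _rdtsc();
        (end - start).saturating_sub(overhead)
    };

    println!("~{} cycles total, ~{} per iteration (sink={})", cycles, cycles / 65536, sink);
}
```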

@jon-chuang

jon-chuang commented Jul 14, 2021

@Lichtso
To follow up on the results produced here: solana-labs/solana#17393 (comment)
Here is the result of tracing and counting the opcodes in trace.out from bpf-ristretto with the following inputs:

../../solana/target/release/rbpf-cli --use jit  --input input.json target/deploy/ec_math.so --trace
Program input:
accounts []
insndata [1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0]

where the insndata corresponds to instruction::field_mul(two, minus_one), performing the multiplication 1000 times.

trace:
and 6016
call 26039
stx 97494
ldx 116495
lsh 130027
mul 155007
rsh 231033
add 249108
mov 361187

Total: 1340351

Total insn: 1455581

The time taken is 2839670 ns, or 2839 ns per field mul. This is a slowdown over native (about 20 ns per field mul) of approximately 140x.

Assuming 21 ns per translated ldx/stx and a 4.0 GHz CPU with IPC = 2 (for a predominantly ALU workload), each load/store costs roughly 21 * 4 * 2 = 168 instruction slots, so the slowdown over zero-overhead ldx/stx is about

(97494 + 116495) * 21 * 4 * 2 / 1455581 ≈ 24.

The slowdown with a lightweight ldx/stx of 8 cycles would be

((97494 + 116495) * 8 * 2 + (1455581 - (97494 + 116495))) / 1455581 ≈ 3.2.

This suggests a performance boost of roughly 7-8x is possible, depending on how fast the lightweight ldx/stx can be made.

This still leaves a 20x performance gap that cannot be closed...
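
The same estimate as a small script, so the assumptions (21 ns per translated access, 4.0 GHz, IPC = 2, a hypothetical 8-cycle lightweight path) are explicit and easy to vary:

```rust
fn main() {
    let ldx = 97_494u64;
    let stx = 116_495u64;
    let total_insns = 1_455_581u64;
    let loads_stores = ldx + stx;

    // Assumptions from the estimate above.
    let ns_per_translation = 21.0;
    let ghz = 4.0;
    let ipc = 2.0;

    // Each translated access costs ns * GHz * IPC "instruction slots".
    // (As above, the non-load/store instructions are neglected here.)
    let slots_per_access = ns_per_translation * ghz * ipc;
    let slowdown_current = loads_stores as f64 * slots_per_access / total_insns as f64;

    // Hypothetical lightweight ldx/stx of 8 cycles.
    let slots_lightweight = 8.0 * ipc;
    let slowdown_lightweight = (loads_stores as f64 * slots_lightweight
        + (total_insns - loads_stores) as f64)
        / total_insns as f64;

    println!("slowdown with current translation: ~{:.1}x", slowdown_current);
    println!("slowdown with lightweight ldx/stx: ~{:.1}x", slowdown_lightweight);
    println!("possible speedup: ~{:.1}x", slowdown_current / slowdown_lightweight);
}
```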

@jon-chuang

jon-chuang commented Jul 14, 2021

Btw, this is the CU/us for the above run, which is unbelievably high compared to mainnet's 50CU/us median:
546.389264264

Here is the CU/us for
edwards_add: 444.989488437
edwards_mul: 432.879176471

This leads me to believe that somehow, the VM running on mainnet is not particularly optimised... perhaps the VM there is somehow built as debug rather than release...
Or, mainnet programs are dominated by loads and stores...

But it still doesn't explain why Serum has such poor CU/us...

@Lichtso
Author

Lichtso commented Jul 14, 2021

What does the time measurement of mainnet include? Just VM execution or account serialization & deserialization as well?
Because the copying of accounts is also known to be a huge time sink.

@jon-chuang

jon-chuang commented Jul 14, 2021

The data is obtained from: https://github.com/solana-labs/solana/pull/16984/files

Here is what is stated:

These numbers only include program execution (message processor, vm
setup/takedown, parameter ser/de) but do not include account loading/storing

So it may not be an accurate reflection, as you say

Still, according to several sources, memcpy does about 20GB/s. So even a Serum memcpy with 1MB data would take 50us. That's a small fraction of the total wallclock time.

I measured this on an 8KiB load. It takes about 120us/MB.
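
For reference, a rough way to reproduce that kind of number locally (buffer size and iteration count are arbitrary, and the result depends heavily on cache residency):

```rust
use std::time::Instant;

fn main() {
    // Rough copy-throughput check.
    let size = 8 * 1024; // 8 KiB, as in the measurement above
    let iterations = 100_000u64;
    let src = vec![0xAAu8; size];
    let mut dst = vec![0u8; size];

    let start = Instant::now();
    for _ in 0..iterations {
        dst.copy_from_slice(&src);
        std::hint::black_box(&mut dst); // keep the copy from being optimized away
    }
    let elapsed = start.elapsed();

    let bytes = size as f64 * iterations as f64;
    let gb_per_s = bytes / elapsed.as_secs_f64() / 1e9;
    println!("copied {:.2} GB in {:?} (~{:.1} GB/s)", bytes / 1e9, elapsed, gb_per_s);
}
```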

@Lichtso
Author

Lichtso commented Jul 14, 2021

If you want to compare the standalone numbers with the ones from mainnet, then you should only use the execution time and explicitly exclude parameter ser/de. That might also explain where the "missing" performance gap comes from.

@jon-chuang

jon-chuang commented Jul 17, 2021

Here is the result of using the flattened, non-Rust-call version of address translation, alongside turning off stack frame gaps, immediate sanitisation, and the compute meter:
865838ns, or 865ns per field mul
That's a 3.28x improvement.

Turning off environment register encryption:
559782ns, or 560ns per field mul.

That's a 5x improvement from turning off the security features.

Still about 25x slower than native, but hey, we're making progress discovering sources of slowdown.
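
For context, this is roughly the kind of configuration such a run would use; the field names below are assumptions based on the rbpf Config of that era and may not match the exact version:

```rust
use solana_rbpf::vm::Config;

fn benchmark_config() -> Config {
    // Field names are assumptions and may differ between rbpf versions;
    // the point is only which knobs the measurement above turned off.
    Config {
        enable_stack_frame_gaps: false,       // no gaps between stack frames
        sanitize_user_provided_values: false, // no immediate sanitisation
        enable_instruction_meter: false,      // no compute meter
        encrypt_environment_registers: false, // no environment register encryption
        ..Config::default()
    }
}
```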

@Lichtso Lichtso closed this as completed Sep 27, 2021