
Address Translation & Memory Model #193

Closed
Lichtso opened this issue Jul 9, 2021 · 16 comments
Labels
enhancement New feature or request

Comments

@Lichtso

Lichtso commented Jul 9, 2021

The address translation is known to be the bottleneck of most workloads in the BPF VM at the moment.

Ideas to improve the situation:

  • Hoist address translations out of loop bodies so that they happen less frequently. This would require the static analyzer to be reliable and secure in order to detect loops. It would also need an efficient way to pass the translated results into the loop body and to check during execution that they are still in bounds, plus a fallback for when that bounds check fails. Furthermore, dynamic calls would have to be restricted so that they cannot jump into a loop, either by excluding loops or by only allowing registered symbols as the jump table.
  • Manually optimize the address translation at the x86 instruction level to reduce redundant bounds checks and the need for register spilling. This will probably not yield much, maybe a few percent improvement.
  • Change the memory model to something simpler, e.g. what WASM does. This would be the most radical change, but also the one which might help the most.
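
For illustration, a minimal sketch of the difference between region-based translation and a WASM-style linear memory (types and names are simplified stand-ins, not rbpf's actual implementation):

```rust
/// Hypothetical region descriptor, loosely modeled on what region-based
/// translation needs: a guest start address, a length, and a host base address.
struct Region {
    vm_addr: u64,
    len: u64,
    host_addr: u64,
}

/// Region-based translation: every load/store has to find the containing
/// region and bounds-check against it before the host address can be formed.
fn translate_region_based(regions: &[Region], vm_addr: u64, size: u64) -> Option<u64> {
    regions.iter().find_map(|r| {
        let end = vm_addr.checked_add(size)?;
        if vm_addr >= r.vm_addr && end <= r.vm_addr.checked_add(r.len)? {
            Some(r.host_addr + (vm_addr - r.vm_addr))
        } else {
            None
        }
    })
}

/// WASM-style linear memory: one contiguous allocation, so translation
/// collapses to a single bounds check plus an add.
fn translate_linear(host_base: u64, mem_len: u64, offset: u64, size: u64) -> Option<u64> {
    if offset.checked_add(size)? <= mem_len {
        Some(host_base + offset)
    } else {
        None
    }
}
```

The per-access cost difference between the region walk and the single compare-and-add is essentially what the benchmarks below are measuring.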
@Lichtso Lichtso added the enhancement New feature or request label Jul 9, 2021
@dmakarov
Collaborator

dmakarov commented Jul 9, 2021

Has there been any profiling done on what fraction of time RBPF spends in address translation for some benchmark? (Not the percentage of memory access instructions in a specific program, but the actual wall-clock time spent in RBPF's address translation code relative to the whole program execution time.)

What would be the maximum theoretical improvement if address translation overhead were 0?

@Lichtso
Author

Lichtso commented Jul 12, 2021

bench_jit_vs_interpreter_address_translation: 1016299 ns per loop block
bench_jit_vs_interpreter_empty_for_loop: 96902 ns per loop block

Each loop block is 65536 iterations.
So we are talking about a possible ~10x improvement.

Timing according to this recent CI benchmark run.

@dmakarov
Collaborator

I think this is not exactly a correct comparison, because the first loop does a load from memory plus address translation, while the second loop does neither the address translation nor the load from memory. It would be better to measure the time it takes to do the address translation alone and then compute the percentage of the whole execution time spent in address translation; that would show how much running time would improve if the address translation overhead were 0. As shown, your 10x improvement also includes the time saved by not performing a load operation on every loop iteration.

@Lichtso
Author

Lichtso commented Jul 12, 2021

It does not matter that much if you subtract the rest of the loop:
1016299 / 96902 ~ 10.5
(1016299 - 96902) / 96902 ~ 9.5

I disabled the emission of x86 load instructions in the JIT, so that only the address translation happens.
I tested it a few times with and without load instructions and the result remains the same (within the variance of the benchmark). Probably because it stays in the cache.

In other words: (1016299 - 96902) / 1016299 ~ 90% of the time is spent in address translation in that particular example.

@dmakarov
Collaborator

This comparison is not very useful. If you compare, in this way, a program that does only load instructions, your speedup will be infinite, because without address translation the program's running time approaches 0. This is why I think it is important to take characteristic benchmarks representative of on-chain programs and estimate the expected speedup on such benchmarks.

@dmakarov
Collaborator

Except maybe it can be used as a baseline for the time it takes to execute a single address translation: roughly 1 ms / 65536, or less than 16 ns.

@Lichtso
Author

Lichtso commented Jul 12, 2021

This is why I think it is important to take characteristic benchmarks representative of on-chain programs

Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure? These benchmarks here are far from perfect, but they still allow a rough estimate of the ballpark we are talking about.

@dmakarov
Collaborator

This is why I think it is important to take characteristic benchmarks representative of on-chain programs

Sure, but how do you measure the time spent in address translation and nothing else, without changing the timing of what you measure?

You instrument the address translation code and measure the time it takes. You also measure the overhead of taking the measurement itself and subtract that, if you want to be utterly precise, although in this case the instrumentation overhead may be negligible.

These benchmarks here are far from perfect, but they still allow a rough estimate of the ballpark we are talking about.

A rough estimate of what exactly? Of one load per 4 arithmetic instructions? Yes, like I said, it may be taken as a baseline for the time it takes to do a single address translation. If, for example, 90% of your address translations happen for addresses that were previously translated, then simple software caching of translated addresses could speed up the address translation dramatically. It's important to know what you're optimizing and what the potential benefits are...
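
For illustration, a minimal sketch of what such a software cache of translated addresses could look like, as a tiny direct-mapped table keyed by guest page (names and layout are made up for this sketch, not rbpf's actual API):

```rust
/// A tiny direct-mapped cache of previously translated addresses, keyed by
/// guest page number. Purely illustrative; a real version would also have to
/// handle accesses that straddle a page boundary, track permissions, and be
/// flushed whenever the memory regions change.
const PAGE_SHIFT: u64 = 12; // assume 4 KiB "pages" for the cache
const OFFSET_MASK: u64 = (1 << PAGE_SHIFT) - 1;
const SLOTS: usize = 16;

#[derive(Clone, Copy, Default)]
struct Entry {
    vm_page: u64,
    host_page_base: u64, // host address of the start of that guest page
    valid: bool,
}

struct SoftTlb {
    entries: [Entry; SLOTS],
}

impl SoftTlb {
    fn new() -> Self {
        Self { entries: [Entry::default(); SLOTS] }
    }

    /// Fast path: return the cached host address, or None on a miss
    /// (the caller would then do the full region walk and call `insert`).
    fn lookup(&self, vm_addr: u64) -> Option<u64> {
        let page = vm_addr >> PAGE_SHIFT;
        let entry = self.entries[page as usize % SLOTS];
        if entry.valid && entry.vm_page == page {
            Some(entry.host_page_base + (vm_addr & OFFSET_MASK))
        } else {
            None
        }
    }

    /// Cache the result of a full (slow-path) translation.
    fn insert(&mut self, vm_addr: u64, host_addr: u64) {
        let page = vm_addr >> PAGE_SHIFT;
        self.entries[page as usize % SLOTS] = Entry {
            vm_page: page,
            host_page_base: host_addr - (vm_addr & OFFSET_MASK),
            valid: true,
        };
    }
}
```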

@jon-chuang

jon-chuang commented Jul 13, 2021

My worry is that the measured address translation overhead of 12-16 ns (I've measured this locally too) doesn't explain the measured median CU/us of Serum on mainnet, which corresponds to about 150 ns/opcode.

The median across programs is about 20 ns per opcode (50 CU/us), and those programs are probably not dominated by loads and stores...

@Lichtso
Author

Lichtso commented Jul 13, 2021

@jon-chuang Keep in mind that this was measuring a tight loop with all loads always staying in cache. This is why I am working on a better benchmark solution here: #197

Actually, you could also run some of your own by checking out that branch and adding:
emit_stopwatch(jit, true)?; here
and
?; emit_stopwatch(jit, false) here
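
The general idea (sketched here on the host side rather than as the JIT-emitted x86, with a dummy loop standing in for the measured region) is to bracket the region of interest with timestamp-counter reads and subtract the overhead of the measurement itself, as suggested above; assuming x86_64:

```rust
fn main() {
    use std::arch::x86_64::_rdtsc;

    // Estimate the cost of taking a measurement so it can be subtracted.
    let overhead = unsafe {
        let start = _rdtsc();
        let end = _rdtsc();
        end - start
    };

    let mut sink = 0u64;
    let cycles = unsafe {
        let start = _rdtsc();
        // Region of interest; this dummy loop stands in for the code
        // (e.g. the address translation) that would actually be measured.
        for i in 0..65536u64 {
            sink = sink.wrapping_add(i);
        }
        let end = _rdtsc();
        (end - start).saturating_sub(overhead)
    };

    println!("~{} cycles total, ~{} per iteration (sink={})", cycles, cycles / 65536, sink);
}
```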

@jon-chuang

jon-chuang commented Jul 14, 2021

@Lichtso
To follow up on the results produced here: solana-labs/solana#17393 (comment)
Here is the result of tracing and counting the opcodes in trace.out from bpf-ristretto with the following inputs:

../../solana/target/release/rbpf-cli --use jit  --input input.json target/deploy/ec_math.so --trace
Program input:
accounts []
insndata [1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 236, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0, 255, 255, 255, 255, 255, 255, 7, 0]

where the insndata corresponds to instruction::field_mul(two, minus_one), performing the multiplication 1000 times.

trace:
and 6016
call 26039
stx 97494
ldx 116495
lsh 130027
mul 155007
rsh 231033
add 249108
mov 361187

Total: 1340351

Total insn: 1455581

The time taken is 2839670 ns, or 2839 ns per field mul. This is a slowdown over native (about 20 ns per field mul) of approximately 140x.

Assuming 21 ns per translated ldx/stx and a 4.0 GHz CPU with IPC = 2 (for a predominantly ALU workload), each load/store costs roughly 21 * 4 * 2 = 168 instruction slots, so the slowdown over zero-overhead ldx/stx is about

(97494 + 116495) * 21 * 4 * 2 / 1455581 ≈ 24.

The slowdown with a lightweight ldx/stx of 8 cycles would be

((97494 + 116495) * 8 * 2 + (1455581 - (97494 + 116495))) / 1455581 ≈ 3.2.

This suggests a performance boost of roughly 7-8x is possible, depending on how fast the lightweight ldx/stx can be made.

This still leaves a 20x performance gap that cannot be closed...
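
The same estimate as a small script, so the assumptions (21 ns per translated access, 4.0 GHz, IPC = 2, a hypothetical 8-cycle lightweight path) are explicit and easy to vary:

```rust
fn main() {
    let ldx = 97_494u64;
    let stx = 116_495u64;
    let total_insns = 1_455_581u64;
    let loads_stores = ldx + stx;

    // Assumptions from the estimate above.
    let ns_per_translation = 21.0;
    let ghz = 4.0;
    let ipc = 2.0;

    // Each translated access costs ns * GHz * IPC "instruction slots".
    // (As above, the non-load/store instructions are neglected here.)
    let slots_per_access = ns_per_translation * ghz * ipc;
    let slowdown_current = loads_stores as f64 * slots_per_access / total_insns as f64;

    // Hypothetical lightweight ldx/stx of 8 cycles.
    let slots_lightweight = 8.0 * ipc;
    let slowdown_lightweight = (loads_stores as f64 * slots_lightweight
        + (total_insns - loads_stores) as f64)
        / total_insns as f64;

    println!("slowdown with current translation: ~{:.1}x", slowdown_current);
    println!("slowdown with lightweight ldx/stx: ~{:.1}x", slowdown_lightweight);
    println!("possible speedup: ~{:.1}x", slowdown_current / slowdown_lightweight);
}
```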

@jon-chuang

jon-chuang commented Jul 14, 2021

Btw, this is the CU/us for the above run, which is unbelievably high compared to mainnet's 50CU/us median:
546.389264264

Here is the CU/us for
edwards_add: 444.989488437
edwards_mul: 432.879176471

This leads me to believe that somehow, the VM running on mainnet is not particularly optimised... perhaps the VM there is somehow built as debug rather than release...
Or, mainnet programs are dominated by loads and stores...

But it still doesn't explain why Serum has such poor CU/us...

@Lichtso
Author

Lichtso commented Jul 14, 2021

What does the time measurement of mainnet include? Just VM execution or account serialization & deserialization as well?
Because the copying of accounts is also known to be a huge time sink.

@jon-chuang

jon-chuang commented Jul 14, 2021

The data is obtained from: https://github.com/solana-labs/solana/pull/16984/files

Here is what is stated:

These numbers only include program execution (message processor, vm
setup/takedown, parameter ser/de) but do not include account loading/storing

So it may not be an accurate reflection, as you say

Still, according to several sources, memcpy does about 20GB/s. So even a Serum memcpy with 1MB data would take 50us. That's a small fraction of the total wallclock time.

I measured this on an 8KiB load. It takes about 120us/MB.
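
For reference, a rough way to reproduce that kind of number locally (buffer size and iteration count are arbitrary, and the result depends heavily on cache residency):

```rust
use std::time::Instant;

fn main() {
    // Rough copy-throughput check.
    let size = 8 * 1024; // 8 KiB, as in the measurement above
    let iterations = 100_000u64;
    let src = vec![0xAAu8; size];
    let mut dst = vec![0u8; size];

    let start = Instant::now();
    for _ in 0..iterations {
        dst.copy_from_slice(&src);
        std::hint::black_box(&mut dst); // keep the copy from being optimized away
    }
    let elapsed = start.elapsed();

    let bytes = size as f64 * iterations as f64;
    let gb_per_s = bytes / elapsed.as_secs_f64() / 1e9;
    println!("copied {:.2} GB in {:?} (~{:.1} GB/s)", bytes / 1e9, elapsed, gb_per_s);
}
```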

@Lichtso
Author

Lichtso commented Jul 14, 2021

If you want to compare the standalone numbers with the ones from mainnet, then you should only use the execution time and explicitly exclude parameter ser/de. That might also explain where the "missing" performance gap comes from.

@jon-chuang

jon-chuang commented Jul 17, 2021

Here is the result of using the flattened, non-Rust-call version of address translation, alongside turning off stack frame gaps, immediate sanitisation, and the compute meter:
865838ns, or 865ns per field mul
That's a 3.28x improvement.

Turning off environment register encryption:
559782ns, or 560ns per field mul.

That's a 5x improvement from turning off the security features.

Still about 25x slower than native, but hey, we're making progress discovering sources of slowdown.
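
For context, this is roughly the kind of configuration such a run would use; the field names below are assumptions based on the rbpf Config of that era and may not match the exact version:

```rust
use solana_rbpf::vm::Config;

fn benchmark_config() -> Config {
    // Field names are assumptions and may differ between rbpf versions;
    // the point is only which knobs the measurement above turned off.
    Config {
        enable_stack_frame_gaps: false,       // no gaps between stack frames
        sanitize_user_provided_values: false, // no immediate sanitisation
        enable_instruction_meter: false,      // no compute meter
        encrypt_environment_registers: false, // no environment register encryption
        ..Config::default()
    }
}
```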

@Lichtso Lichtso closed this as completed Sep 27, 2021