Replies: 1 comment

In regards to the second point of the conclusion: you can bypass the fs cache for writes and reads by setting the direct IO flags that RocksDB provides.
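For illustration, these are the direct-IO knobs as exposed by the Rust `rocksdb` crate; whether and where nearcore would set them is not shown here, so treat this as a sketch rather than the actual configuration:

```rust
// Sketch: enabling direct IO so RocksDB bypasses the OS page cache.
// Values and usage here are illustrative only.
use rocksdb::Options;

fn direct_io_options() -> Options {
    let mut opts = Options::default();
    // Bypass the fs cache for user reads (gets, iterators).
    opts.set_use_direct_reads(true);
    // Bypass the fs cache for flushes and compactions (i.e. SST file writes).
    opts.set_use_direct_io_for_flush_and_compaction(true);
    opts
}
```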
Introduction

This is an attempt to understand the performance of a single RocksDB State column `get` request on mainnet data, as well as to analyze potential room for improvement.

Methodology

In order to simulate a real-world workload we need to make `get` requests for existing keys in random order. We can use FlatState as a source of State keys. One caveat is that we want to skip values that we have already read previously, as well as large (> 4KB) values. A sketch of such a measurement loop is shown below.

The test is performed on GCP with a 1000GB persistent SSD. This is the standard setup for a Pagoda RPC node.
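As a rough illustration, a measurement loop along these lines could be used. The DB path, column family name, and key-loading step below are hypothetical placeholders, not the actual measure_get_perf code:

```rust
// Sketch of the benchmark loop, not the actual measure_get_perf implementation.
// Assumes the `rocksdb` crate and a pre-collected, shuffled list of State keys
// (e.g. extracted from FlatState).
use std::collections::HashSet;
use std::time::Instant;

use rocksdb::{Options, DB};

const MAX_VALUE_SIZE: usize = 4 * 1024; // skip values larger than 4KB

fn run_benchmark(db: &DB, cf_name: &str, keys: &[Vec<u8>]) {
    let cf = db.cf_handle(cf_name).expect("missing column family");
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut latencies_us: Vec<u128> = Vec::new();

    for key in keys {
        // Skip keys we already read so repeated reads don't hit caches artificially.
        if !seen.insert(key.clone()) {
            continue;
        }
        let start = Instant::now();
        let value = db.get_cf(cf, key).expect("get failed");
        let elapsed = start.elapsed().as_micros();
        // Skip large values so a single get roughly maps to one data block read.
        match value {
            Some(v) if v.len() <= MAX_VALUE_SIZE => latencies_us.push(elapsed),
            _ => continue,
        }
    }

    let avg = latencies_us.iter().sum::<u128>() / latencies_us.len().max(1) as u128;
    println!("requests: {}, avg observed_latency: {}us", latencies_us.len(), avg);
}

fn main() {
    let opts = Options::default();
    // Hypothetical path and column family name, for illustration only.
    let db = DB::open_cf_for_read_only(&opts, "/data/rocksdb", ["col_state"], false)
        .expect("failed to open DB");
    let keys: Vec<Vec<u8>> = Vec::new(); // TODO: load shuffled State keys from FlatState
    run_benchmark(&db, "col_state", &keys);
}
```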
Another important note is that we are more interested in RocksDB performance relative to raw SSD performance, as opposed to absolute latency numbers. The reason is that we still expect RocksDB to read data from the disk for every `get` request, since we cannot assume any data locality for the State column and we don't rely on the RocksDB block cache to store raw data. So in an ideal situation we want RocksDB `get` latency to be as close as possible to a single raw SSD read latency. See this discussion for notes on raw SSD performance analysis.

RocksDB Perf Context can be used to get insights into the performance of a single request. In particular we are interested in the following RocksDB metrics:

- `block_read_count` for the number of reads from the disk
- `block_read_time` for the total time spent reading data from disk

On top of that we measure `observed_latency` as the total request latency observed by our code. We want to look at the distribution of requests grouped by `block_read_count` to understand RocksDB read amplification (see the sketch below).

Benchmark implementation: measure_get_perf
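The following is a minimal sketch of how these counters can be read per request via the `rocksdb` crate's perf API (assuming a recent crate version; the actual measure_get_perf benchmark may collect them differently):

```rust
// Sketch of per-request Perf Context collection with the `rocksdb` crate.
// The metric names mirror RocksDB's block_read_count / block_read_time.
use std::time::Instant;

use rocksdb::perf::{set_perf_stats, PerfContext, PerfMetric, PerfStatsLevel};
use rocksdb::DB;

fn timed_get(db: &DB, key: &[u8]) -> (Option<Vec<u8>>, u64, u64, u128) {
    // Enable timing so block_read_time is populated, not just counters.
    set_perf_stats(PerfStatsLevel::EnableTime);
    let mut ctx = PerfContext::default();
    ctx.reset();

    let start = Instant::now();
    let value = db.get(key).expect("get failed");
    let observed_latency_us = start.elapsed().as_micros();

    let block_read_count = ctx.metric(PerfMetric::BlockReadCount);
    let block_read_time = ctx.metric(PerfMetric::BlockReadTime); // in nanoseconds
    (value, block_read_count, block_read_time, observed_latency_us)
}
```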
Measurements

First of all we need to establish the baseline for disk read latency. We use the fio tool for that, see the Setup section for more details.

lat (usec): avg=515.89, stdev=138.35

Also let's run the same benchmark but with the block size matching our RocksDB config (achieved by setting `bs=16k` in the fio job file).

lat (usec): avg=593.16, stdev=143.09
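For reference, a fio job along these lines could produce such numbers; the file name, size, and runtime below are assumptions for illustration, not the exact job file from the Setup section:

```ini
; Random-read latency baseline. The baseline run above presumably used a
; smaller block size (fio defaults to 4k); bs=16k matches the RocksDB block
; size used for the State column.
[global]
ioengine=libaio
direct=1
rw=randread
bs=16k
iodepth=1
runtime=60
time_based=1

[read-latency]
filename=/data/fio-testfile
size=10G
```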
Now let’s execute our benchmark:
Raw benchmark output
Interesting observations:

- `block_read_time` for a single block is close enough to the latency number reported by fio.
- By comparing `observed_latency` to `block_read_time` we can see that the RocksDB overhead is around 15%.

Now let's increase the State column block cache size 10x via the `col_state_cache_size` config parameter.

Raw benchmark output with 10x block cache size
Overall average latency is only slightly improved, but more importantly 92% of requests are now executed with a single disk read. This indicates that our default block cache size is not sufficient to keep all filter and index blocks, which is critical for RocksDB read performance. Also note that the larger cache doesn't have a drastic effect on overall latency, because filter and index blocks are accessed frequently enough to stay in the OS page cache anyway.
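For context, this is roughly how a per-column block cache and `cache_index_and_filter_blocks` are wired up through the Rust `rocksdb` crate; the cache size and block size here are illustrative, and this is not the actual nearcore store code:

```rust
// Sketch of block-cache related options with the `rocksdb` crate.
// Sizes are illustrative only.
use rocksdb::{BlockBasedOptions, Cache, Options};

fn state_column_options(cache_size_bytes: usize) -> Options {
    let mut block_opts = BlockBasedOptions::default();
    // Shared LRU cache for data, filter, and index blocks of this column.
    let cache = Cache::new_lru_cache(cache_size_bytes);
    block_opts.set_block_cache(&cache);
    // Keep filter and index blocks in the block cache; with a too-small cache
    // they get evicted, causing extra disk reads per get (see the measurements above).
    block_opts.set_cache_index_and_filter_blocks(true);
    block_opts.set_block_size(16 * 1024); // matches the 16k block size used above

    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```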
Conclusion

[…] (`cache_index_and_filter_blocks=true`). […] `set` operation is used). Kudos to @Longarithm for discovering that. Even though with sufficient memory for filter and index blocks we go to disk more than once only in < 6% of cases, this can still be an issue since it makes `get` performance very sensitive to compaction.