Replies: 1 comment

In regards to the second point of the conclusion: you can bypass the fs cache for writes and reads by setting the direct IO flags that RocksDB provides.
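For illustration, these are the direct-IO knobs as exposed by the Rust `rocksdb` crate; whether and where nearcore would set them is not shown here, so treat this as a sketch rather than the actual configuration:

```rust
// Sketch: enabling direct IO so RocksDB bypasses the OS page cache.
// Values and usage here are illustrative only.
use rocksdb::Options;

fn direct_io_options() -> Options {
    let mut opts = Options::default();
    // Bypass the fs cache for user reads (gets, iterators).
    opts.set_use_direct_reads(true);
    // Bypass the fs cache for flushes and compactions (i.e. SST file writes).
    opts.set_use_direct_io_for_flush_and_compaction(true);
    opts
}
```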
Introduction

This is an attempt to understand the performance of a single RocksDB State column `get` request on mainnet data, as well as to analyze potential room for improvement.

Methodology

In order to simulate a real-world workload we need to make `get` requests for existing keys in random order. We can use FlatState as a source of State keys. One caveat is that we want to skip values that we have already read previously, as well as large (> 4KB) values. A sketch of such a measurement loop is shown below.

The test is performed on GCP with a 1000GB persistent SSD. This is the standard setup for a Pagoda RPC node.
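As a rough illustration, a measurement loop along these lines could be used. The DB path, column family name, and key-loading step below are hypothetical placeholders, not the actual measure_get_perf code:

```rust
// Sketch of the benchmark loop, not the actual measure_get_perf implementation.
// Assumes the `rocksdb` crate and a pre-collected, shuffled list of State keys
// (e.g. extracted from FlatState).
use std::collections::HashSet;
use std::time::Instant;

use rocksdb::{Options, DB};

const MAX_VALUE_SIZE: usize = 4 * 1024; // skip values larger than 4KB

fn run_benchmark(db: &DB, cf_name: &str, keys: &[Vec<u8>]) {
    let cf = db.cf_handle(cf_name).expect("missing column family");
    let mut seen: HashSet<Vec<u8>> = HashSet::new();
    let mut latencies_us: Vec<u128> = Vec::new();

    for key in keys {
        // Skip keys we already read so repeated reads don't hit caches artificially.
        if !seen.insert(key.clone()) {
            continue;
        }
        let start = Instant::now();
        let value = db.get_cf(cf, key).expect("get failed");
        let elapsed = start.elapsed().as_micros();
        // Skip large values so a single get roughly maps to one data block read.
        match value {
            Some(v) if v.len() <= MAX_VALUE_SIZE => latencies_us.push(elapsed),
            _ => continue,
        }
    }

    let avg = latencies_us.iter().sum::<u128>() / latencies_us.len().max(1) as u128;
    println!("requests: {}, avg observed_latency: {}us", latencies_us.len(), avg);
}

fn main() {
    let opts = Options::default();
    // Hypothetical path and column family name, for illustration only.
    let db = DB::open_cf_for_read_only(&opts, "/data/rocksdb", ["col_state"], false)
        .expect("failed to open DB");
    let keys: Vec<Vec<u8>> = Vec::new(); // TODO: load shuffled State keys from FlatState
    run_benchmark(&db, "col_state", &keys);
}
```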
Another important note is that we are more interested in RocksDB performance relative to raw SSD performance, as opposed to absolute latency numbers. The reason is that we still expect RocksDB to read data from the disk for every `get` request, since we cannot assume any data locality for the State column and we don't rely on the RocksDB block cache to store raw data. So in an ideal situation we want RocksDB `get` latency to be as close as possible to a single raw SSD read latency. See this discussion for notes on raw SSD performance analysis.

RocksDB Perf Context can be used to get insights into the performance of a single request. In particular we are interested in the following RocksDB metrics:

- `block_read_count` for the number of reads from the disk
- `block_read_time` for the total time spent reading data from disk

On top of that we measure `observed_latency` as the total request latency observed by our code. We want to look at the distribution of requests grouped by `block_read_count` to understand RocksDB read amplification (see the sketch below).

Benchmark implementation: measure_get_perf
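The following is a minimal sketch of how these counters can be read per request via the `rocksdb` crate's perf API (assuming a recent crate version; the actual measure_get_perf benchmark may collect them differently):

```rust
// Sketch of per-request Perf Context collection with the `rocksdb` crate.
// The metric names mirror RocksDB's block_read_count / block_read_time.
use std::time::Instant;

use rocksdb::perf::{set_perf_stats, PerfContext, PerfMetric, PerfStatsLevel};
use rocksdb::DB;

fn timed_get(db: &DB, key: &[u8]) -> (Option<Vec<u8>>, u64, u64, u128) {
    // Enable timing so block_read_time is populated, not just counters.
    set_perf_stats(PerfStatsLevel::EnableTime);
    let mut ctx = PerfContext::default();
    ctx.reset();

    let start = Instant::now();
    let value = db.get(key).expect("get failed");
    let observed_latency_us = start.elapsed().as_micros();

    let block_read_count = ctx.metric(PerfMetric::BlockReadCount);
    let block_read_time = ctx.metric(PerfMetric::BlockReadTime); // in nanoseconds
    (value, block_read_count, block_read_time, observed_latency_us)
}
```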
Measurements

First of all we need to establish the baseline for disk read latency. We use the fio tool for that, see the Setup section for more details.

lat (usec): avg=515.89, stdev=138.35

Also let's run the same benchmark but with the block size matching our RocksDB config (achieved by setting `bs=16k` in the fio job file).

lat (usec): avg=593.16, stdev=143.09
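For reference, a fio job along these lines could produce such numbers; the file name, size, and runtime below are assumptions for illustration, not the exact job file from the Setup section:

```ini
; Random-read latency baseline. The baseline run above presumably used a
; smaller block size (fio defaults to 4k); bs=16k matches the RocksDB block
; size used for the State column.
[global]
ioengine=libaio
direct=1
rw=randread
bs=16k
iodepth=1
runtime=60
time_based=1

[read-latency]
filename=/data/fio-testfile
size=10G
```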
Now let’s execute our benchmark:
Raw benchmark output
Interesting observations:

- `block_read_time` for a single block is close enough to the latency number reported by fio.
- By comparing `observed_latency` to `block_read_time` we can see that the RocksDB overhead is around 15%.

Now let's increase the State column block cache size 10x via the `col_state_cache_size` config parameter.

Raw benchmark output with 10x block cache size
Overall average latency is only slightly improved, but more importantly 92% of requests are now executed with a single disk read. This indicates that our default block cache size is not sufficient to keep all filter and index blocks, which is critical for RocksDB read performance. Also note that the larger cache doesn't have a drastic effect on overall latency, because filter and index blocks are accessed frequently enough to stay in the OS page cache anyway.
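For context, this is roughly how a per-column block cache and `cache_index_and_filter_blocks` are wired up through the Rust `rocksdb` crate; the cache size and block size here are illustrative, and this is not the actual nearcore store code:

```rust
// Sketch of block-cache related options with the `rocksdb` crate.
// Sizes are illustrative only.
use rocksdb::{BlockBasedOptions, Cache, Options};

fn state_column_options(cache_size_bytes: usize) -> Options {
    let mut block_opts = BlockBasedOptions::default();
    // Shared LRU cache for data, filter, and index blocks of this column.
    let cache = Cache::new_lru_cache(cache_size_bytes);
    block_opts.set_block_cache(&cache);
    // Keep filter and index blocks in the block cache; with a too-small cache
    // they get evicted, causing extra disk reads per get (see the measurements above).
    block_opts.set_cache_index_and_filter_blocks(true);
    block_opts.set_block_size(16 * 1024); // matches the 16k block size used above

    let mut opts = Options::default();
    opts.set_block_based_table_factory(&block_opts);
    opts
}
```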
Conclusion

[…] (`cache_index_and_filter_blocks=true`). […] `set` operation is used). Kudos to @Longarithm for discovering that. Even though with sufficient memory for filter and index blocks we go to disk more than once only in < 6% of cases, this can still be an issue since it makes `get` performance very sensitive to compaction.