Investigate WAL ingest performance for logical messages #9642
Notes from the initial investigation.
Interesting findings:
I wonder if the delayed flushes due to the
Results of the benchmark:
Some findings:
Adding another variant that also commits records doesn't show any significant difference. We only flush the control file after each WAL segment, so that checks out (see neon/safekeeper/src/safekeeper.rs, lines 931 to 938 at e287f36).
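To illustrate the pattern (a minimal sketch with hypothetical names, not the actual code at the permalink above): the control-file flush is only paid when a write crosses a segment boundary, rather than on every appended record.

```rust
// Sketch of the segment-boundary flush pattern only; names are hypothetical
// and the segment size is assumed to be 16 MiB.
const SEGMENT_SIZE: u64 = 16 * 1024 * 1024;

struct WalWriter {
    write_lsn: u64, // bytes of WAL written so far
}

impl WalWriter {
    fn append(&mut self, record: &[u8]) -> std::io::Result<()> {
        let old_segment = self.write_lsn / SEGMENT_SIZE;
        self.write_lsn += record.len() as u64;
        self.write_record(record)?;
        if self.write_lsn / SEGMENT_SIZE != old_segment {
            // Pay the control-file flush only once per segment.
            self.persist_control_file()?;
        }
        Ok(())
    }

    // Stand-ins for the real IO paths.
    fn write_record(&mut self, _record: &[u8]) -> std::io::Result<()> { Ok(()) }
    fn persist_control_file(&mut self) -> std::io::Result<()> { Ok(()) }
}

fn main() -> std::io::Result<()> {
    let mut w = WalWriter { write_lsn: 0 };
    for _ in 0..10_000 {
        w.append(&[0u8; 8192])?; // control file flushed ~every 2,048 appends
    }
    Ok(())
}
```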
The fixed 8 µs cost is likely Tokio scheduling overhead; see flamegraph.svg. Tokio's file IO is not efficient. One option is to explore tokio-epoll-uring. Another is to increase batching along with
Just to get a baseline, I added benchmarks for simple stdlib and Tokio writes. This confirms that the majority of the fixed 8 µs cost is just Tokio. However, it also shows that we should be able to saturate disks with both write paths at large enough write sizes, although the throughput is highly unreliable beyond 1 GB/s. This is unlike the
For completeness, here are `fsync=true` results -- as expected, Tokio and stdlib are roughly equal here.
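For reference, a sketch of the kind of stdlib-vs-Tokio comparison described above -- this is not the benchmark added in the repo, just the same idea with assumed 8 KiB sequential writes and fsync disabled:

```rust
// Sketch of a stdlib-vs-Tokio sequential write comparison (assumed 8 KiB
// writes, fsync disabled). Not the benchmark from the repo, just the idea.
use std::io::Write;
use std::time::Instant;
use tokio::io::AsyncWriteExt;

fn bench_std(path: &str, buf: &[u8], iters: usize) -> std::io::Result<f64> {
    let mut file = std::fs::File::create(path)?;
    let start = Instant::now();
    for _ in 0..iters {
        file.write_all(buf)?; // direct write syscall per iteration
    }
    Ok(start.elapsed().as_secs_f64())
}

async fn bench_tokio(path: &str, buf: &[u8], iters: usize) -> std::io::Result<f64> {
    let mut file = tokio::fs::File::create(path).await?;
    let start = Instant::now();
    for _ in 0..iters {
        file.write_all(buf).await?; // hops through Tokio's blocking thread pool
    }
    file.flush().await?; // drain Tokio's internal buffer before stopping the clock
    Ok(start.elapsed().as_secs_f64())
}

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let buf = vec![0u8; 8 * 1024];
    let iters = 100_000;
    let std_s = bench_std("/tmp/bench_std.dat", &buf, iters)?;
    let tokio_s = bench_tokio("/tmp/bench_tokio.dat", &buf, iters).await?;
    println!("std:   {:.0} MB/s", (buf.len() * iters) as f64 / std_s / 1e6);
    println!("tokio: {:.0} MB/s", (buf.len() * iters) as f64 / tokio_s / 1e6);
    Ok(())
}
```

The per-iteration gap between the two loops approximates the fixed async-scheduling cost; at large enough write sizes it gets amortized and both paths can saturate the disk.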
It's segments all right:
Much of the segment cost was addressed in the issues listed above. We can get further improvements by increasing the segment size in #9687, but that only yields 8% on the instances we currently use for Safekeepers and requires system-wide changes. Next, we should look at AppendRequest batching.
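As a sketch of the batching idea (hypothetical types, not the actual AppendRequest handling): coalesce small appends in memory and hit the disk only once a size threshold is reached.

```rust
// Hypothetical sketch of coalescing small appends before hitting the disk;
// the real AppendRequest path in the Safekeeper is more involved than this.
struct AppendBatcher {
    buf: Vec<u8>,
    flush_threshold: usize,
}

impl AppendBatcher {
    fn new(flush_threshold: usize) -> Self {
        Self { buf: Vec::with_capacity(flush_threshold), flush_threshold }
    }

    /// Buffer the payload; write to disk only once enough bytes accumulate.
    fn append(&mut self, payload: &[u8]) -> std::io::Result<()> {
        self.buf.extend_from_slice(payload);
        if self.buf.len() >= self.flush_threshold {
            self.flush()?;
        }
        Ok(())
    }

    fn flush(&mut self) -> std::io::Result<()> {
        write_and_fsync(&self.buf)?; // stand-in for the real WAL write path
        self.buf.clear();
        Ok(())
    }
}

fn write_and_fsync(_bytes: &[u8]) -> std::io::Result<()> { Ok(()) }

fn main() -> std::io::Result<()> {
    let mut batcher = AppendBatcher::new(1024 * 1024); // assumed 1 MiB threshold
    for _ in 0..10_000 {
        batcher.append(&[0u8; 1024])?; // many 1 KB appends, one write per ~1024
    }
    batcher.flush() // flush the tail
}
```

A real implementation would also have to flush on commit records and on a timer so small transactions aren't delayed indefinitely.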
With the above fixes, logical message throughput on the instance type we currently use for Safekeepers has increased by 13% at 1 KB writes and 246% at 1 MB writes. Batching of smaller writes will close the gap. We're maxing out at 734 MB/s, with a hardware capacity of 1.1 GB/s.
With fsync disabled, ingestion hits the hardware capacity at 1.1 GB/s:
The compute already performs sufficient batching: the Safekeeper receives 128 KB appends. I added an end-to-end benchmark in #9749, which ingests 10 GB of logical message WAL into the compute, Safekeeper, and Pageserver (with fsync enabled). Results:
The compute also does the same volume of writes to its local WAL, and I see the disk peak at 1.9 GB/s (capacity 5.5 GB/s). So the bottleneck here is definitely the Safekeeper → Pageserver path, which only does 506 MB/s. The Pageserver shouldn't be doing much processing here, so I suspect this is due to inefficient IO along this path. Interestingly, disk IO graphs show that disk reads only pick up once WAL ingestion completes on the Safekeeper. Logging I added in the Safekeeper WAL sender confirms this: throughput is only about 300 MB/s while WAL is being ingested, then increases to 600 MB/s once ingestion completes. Why?
The above is an issue with the benchmark (or the compute?). Postgres returns once the logical messages have been written to its local WAL, and does not wait for them to be committed to Safekeepers. I thought
This also explains the bimodal nature of the write graph in #9642 (comment): it's first writing both to Postgres and the Safekeeper, then only to the Safekeeper. Safekeeper throughput is about 50% lower when both Postgres and the Safekeeper are writing to disk -- perhaps expected, although the disk should have plenty of capacity to accommodate them both. I'll fix the benchmark by having it wait for Safekeeper commits, then investigate performance further.
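One possible shape for that fix (a sketch only -- the connection string, sizes, and iteration counts are placeholders, and the actual benchmark change may differ): emit the bulk of the WAL as non-transactional logical messages, then finish with a committed transaction, on the assumption that the commit does not return until the Safekeepers have acknowledged the WAL.

```rust
// Sketch only: pg_logical_emit_message(transactional, prefix, content) is a
// standard Postgres function; connection string and sizes are placeholders.
use tokio_postgres::NoTls;

#[tokio::main]
async fn main() -> Result<(), tokio_postgres::Error> {
    let (client, conn) =
        tokio_postgres::connect("host=localhost user=postgres dbname=postgres", NoTls).await?;
    tokio::spawn(conn); // drive the connection in the background

    // Bulk of the WAL: non-transactional messages return once written to the
    // compute's local WAL, without waiting for Safekeeper acknowledgment.
    for _ in 0..10_000 {
        // ~10 GB total at ~1 MB per message
        client
            .execute(
                "SELECT pg_logical_emit_message(false, 'bench', repeat('x', 1000000))",
                &[],
            )
            .await?;
    }

    // Barrier: a committed transaction should not return until the WAL is
    // durable on the Safekeepers (assumption about Neon's synchronous commit
    // path), so stopping the timer here would include Safekeeper ingest.
    client
        .batch_execute("BEGIN; SELECT pg_logical_emit_message(true, 'bench', 'done'); COMMIT;")
        .await?;
    Ok(())
}
```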
With the benchmark fixed, the Safekeeper is the bottleneck:
I think the Safekeeper tends to fall behind because it fsyncs more aggressively than the compute -- it fsyncs on every segment boundary and at least every second, while Postgres doesn't really need to fsync until the end here. However, with fsync disabled we also see the Safekeeper fall behind (although less so):
There's probably too much cross-talk between the compute and Safekeeper here, since they're using the same disk. I'll try a multi-node benchmark, and also a run on a Linux node with cheaper fsyncs.
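For reference, a minimal sketch of the fsync cadence described above (illustrative names and interval; not the Safekeeper's actual configuration handling):

```rust
// Sketch of the fsync cadence: fsync on segment boundaries plus a periodic
// timer. Postgres, by contrast, can defer all fsyncs to the end of this
// particular workload.
use std::time::{Duration, Instant};

struct FsyncPolicy {
    last_fsync: Instant,
    interval: Duration, // assumed 1 second
}

impl FsyncPolicy {
    fn should_fsync(&self, crossed_segment_boundary: bool) -> bool {
        crossed_segment_boundary || self.last_fsync.elapsed() >= self.interval
    }

    fn note_fsync(&mut self) {
        self.last_fsync = Instant::now();
    }
}

fn main() {
    let mut policy = FsyncPolicy { last_fsync: Instant::now(), interval: Duration::from_secs(1) };
    for crossed_boundary in [false, true, false] {
        if policy.should_fsync(crossed_boundary) {
            // fsync the WAL file here, then reset the timer
            policy.note_fsync();
        }
    }
}
```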
On a Linux machine (Hetzner), we are within 10% of the microbenchmark throughput with fsync enabled:
With fsync disabled we're within 15%:
The disk can do 1.1 GB/s:
I'll see if I can find some more low-hanging fruit to improve Safekeeper ingestion. Otherwise, we should move on to a multi-node setup (with separate disks and network latency), and other workloads.
Closing this out for now. Logical message ingestion appears to be good enough, and we should focus on Pageserver performance in #9789.
Adds a benchmark for logical message WAL ingestion throughput end-to-end. Logical messages are essentially noops, and thus ignored by the Pageserver.

Example results from my MacBook, with fsync enabled:

```
postgres_ingest: 14.445 s
safekeeper_ingest: 29.948 s
pageserver_ingest: 30.013 s
pageserver_recover_ingest: 8.633 s
wal_written: 10,340 MB
message_count: 1310720 messages
postgres_throughput: 715 MB/s
safekeeper_throughput: 345 MB/s
pageserver_throughput: 344 MB/s
pageserver_recover_throughput: 1197 MB/s
```

See #9642 (comment) for running analysis. Touches #9642.
Ingestion of logical messages (noops) should be able to saturate the hardware, especially when fsync is disabled. It doesn't. Why?
See Slack thread and waltest.txt.