Investigate WAL ingest performance for logical messages #9642

Closed
Tracked by #9624
erikgrinaker opened this issue Nov 5, 2024 · 13 comments
@erikgrinaker

Ingestion of logical messages (noops) should be able to saturate the hardware, especially when fsync is disabled. It doesn't. Why?

See Slack thread and waltest.txt.

@erikgrinaker erikgrinaker added a/performance Area: relates to performance of the system c/storage Component: storage labels Nov 5, 2024
@erikgrinaker erikgrinaker self-assigned this Nov 5, 2024

erikgrinaker commented Nov 5, 2024

Notes from the initial waltest runs (local, with fsync disabled):

  • WAL throughput caps out at ~300 MB/s, with ~1.2 GB/s disk throughput.
  • fio can do sustained 5 GB/s writes with large enough block sizes (1M).
  • Each append request results in an individual Safekeeper disk write.
  • Each commit advancement results in a control file write and rename.
  • Increasing the logical message size from 10 KB to 1 MB did not significantly improve throughput.
  • Increasing MAX_SEND_SIZE from 128 KB to 1 MB did not significantly improve throughput.

Interesting findings:

  • Buffering/batching of AppendRequest::wal_data before dispatching writes significantly reduced throughput (below 100 MB/s).
  • Disabling flush batching in the WAL receiver significantly improved throughput to ~400 MB/s (i.e. don't use NoFlushAppendRequest).

I wonder if the delayed flushes due to NoFlushAppendRequest batching cause delays in commit_lsn advancement, and whether this somehow affects the pipelining (similar to a bandwidth-delay product effect). Will investigate.
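
For intuition (a back-of-the-envelope model, not something measured here): if the sender keeps at most W bytes of unacknowledged WAL in flight and flush/commit acknowledgements lag by t seconds, throughput is capped at roughly W / t regardless of disk speed -- which is exactly the bandwidth-delay-product shape hinted at above.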


erikgrinaker commented Nov 6, 2024

Results of the process_msg and wal_acceptor_throughput benchmarks from #9531 with fsync=false on my local MacBook. This only measures payload throughput, and considers segment/page/record/message headers to be overhead.
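
In Criterion terms, that payload-only accounting looks roughly like the sketch below (my illustration, not the actual benchmark code from #9531): Throughput::Bytes is set to the payload size, so WAL segment/page/record/message headers count as overhead.

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion, Throughput};

fn bench_process_msg(c: &mut Criterion) {
    let mut group = c.benchmark_group("process_msg/fsync=false");
    for size in [8usize, 1024, 8192, 131072, 1048576] {
        // Only the logical message payload counts towards throughput;
        // WAL segment/page/record headers are treated as overhead.
        group.throughput(Throughput::Bytes(size as u64));
        group.bench_with_input(BenchmarkId::from_parameter(size), &size, |b, &size| {
            let payload = vec![0u8; size];
            b.iter(|| {
                // Stand-in for appending one logical-message WAL record via process_msg.
                criterion::black_box(&payload);
            });
        });
    }
    group.finish();
}

criterion_group!(benches, bench_process_msg);
criterion_main!(benches);
```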

process_msg/fsync=false/size=8
                        time:   [8.4812 µs 8.5148 µs 8.5460 µs]
                        thrpt:  [914.17 KiB/s 917.52 KiB/s 921.16 KiB/s]
process_msg/fsync=false/size=1024
                        time:   [9.4852 µs 9.5408 µs 9.5931 µs]
                        thrpt:  [101.80 MiB/s 102.36 MiB/s 102.96 MiB/s]
process_msg/fsync=false/size=8192
                        time:   [15.714 µs 16.296 µs 16.824 µs]
                        thrpt:  [464.36 MiB/s 479.41 MiB/s 497.17 MiB/s]
process_msg/fsync=false/size=131072
                        time:   [147.03 µs 153.72 µs 160.07 µs]
                        thrpt:  [780.93 MiB/s 813.19 MiB/s 850.15 MiB/s]
process_msg/fsync=false/size=1048576
                        time:   [1.3300 ms 1.3565 ms 1.3812 ms]
                        thrpt:  [723.99 MiB/s 737.21 MiB/s 751.90 MiB/s]

wal_acceptor_throughput/fsync=false/size=1024
                        time:   [14.618 s 14.714 s 14.828 s]
                        thrpt:  [69.058 MiB/s 69.591 MiB/s 70.049 MiB/s]
wal_acceptor_throughput/fsync=false/size=8192
                        time:   [3.3334 s 3.3466 s 3.3611 s]
                        thrpt:  [304.66 MiB/s 305.98 MiB/s 307.19 MiB/s]
wal_acceptor_throughput/fsync=false/size=131072
                        time:   [1.9624 s 1.9828 s 2.0100 s]
                        thrpt:  [509.46 MiB/s 516.44 MiB/s 521.82 MiB/s]
wal_acceptor_throughput/fsync=false/size=1048576
                        time:   [1.9293 s 1.9515 s 1.9787 s]
                        thrpt:  [517.51 MiB/s 524.74 MiB/s 530.75 MiB/s]

Some findings:

  • Raw process_msg caps out at ~800 MB/s. The disk can do 6 GB/s.
  • There is a fixed cost of ~8 µs, or 125,000 messages/s.
    • At 8 KB/message, that's about 1 GB/s.
    • Comparing size=1024 to size=8192, the additional IO cost is also about 8 µs.
    • The IO cost appears to scale linearly at about 1.2 µs/KB.
    • At 8 KB writes, optimizing or batching away the fixed cost would provide at most a ~2x throughput improvement (worked through after this list).
  • The WAL acceptor loop reduces throughput from ~800 MB/s to ~500 MB/s, or ~40%, even at large message sizes. Why?
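
Back-of-the-envelope check of that cost model (my own arithmetic based on the numbers above):

```
time(size) ≈ 8 µs fixed + 1.2 µs/KB × size

fixed cost alone: 1 / 8 µs = 125,000 msg/s × 8 KB ≈ 1 GB/s
size = 8 KB:      8 + 8 × 1.2 ≈ 17.6 µs → 8 KiB / 17.6 µs ≈ 440 MiB/s   (measured: ~480 MiB/s)
no fixed cost:        8 × 1.2 ≈  9.6 µs → 8 KiB /  9.6 µs ≈ 815 MiB/s   (≈ 1.8x, i.e. at most ~2x)
```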

@erikgrinaker

Adding another variant that also commits records doesn't show any significant difference. We only flush the control file after each WAL segment, so that checks out.

// Update truncate and commit LSN in control file.
// To avoid negative impact on performance of extra fsync, do it only
// when commit_lsn delta exceeds WAL segment size.
if self.state.commit_lsn + (self.state.server.wal_seg_size as u64)
    < self.state.inmem.commit_lsn
{
    self.state.flush().await?;
}

process_msg/fsync=false/commit=false/size=8
                        time:   [8.8677 µs 8.9545 µs 9.0712 µs]
                        thrpt:  [861.24 KiB/s 872.46 KiB/s 881.01 KiB/s]
process_msg/fsync=false/commit=false/size=1024
                        time:   [9.3791 µs 9.4232 µs 9.4647 µs]
                        thrpt:  [103.18 MiB/s 103.63 MiB/s 104.12 MiB/s]
process_msg/fsync=false/commit=false/size=8192
                        time:   [16.107 µs 16.785 µs 17.405 µs]
                        thrpt:  [448.86 MiB/s 465.45 MiB/s 485.02 MiB/s]
process_msg/fsync=false/commit=false/size=131072
                        time:   [153.33 µs 159.96 µs 166.16 µs]
                        thrpt:  [752.29 MiB/s 781.47 MiB/s 815.22 MiB/s]
process_msg/fsync=false/commit=false/size=1048576
                        time:   [1.3596 ms 1.3936 ms 1.4289 ms]
                        thrpt:  [699.83 MiB/s 717.57 MiB/s 735.51 MiB/s]

process_msg/fsync=false/commit=true/size=8
                        time:   [8.7410 µs 8.8093 µs 8.8824 µs]
                        thrpt:  [879.55 KiB/s 886.85 KiB/s 893.78 KiB/s]
process_msg/fsync=false/commit=true/size=1024
                        time:   [9.5369 µs 9.8461 µs 10.228 µs]
                        thrpt:  [95.479 MiB/s 99.183 MiB/s 102.40 MiB/s]
process_msg/fsync=false/commit=true/size=8192
                        time:   [15.810 µs 16.456 µs 17.041 µs]
                        thrpt:  [458.44 MiB/s 474.75 MiB/s 494.14 MiB/s]
process_msg/fsync=false/commit=true/size=131072
                        time:   [143.30 µs 154.28 µs 164.11 µs]
                        thrpt:  [761.67 MiB/s 810.21 MiB/s 872.30 MiB/s]
process_msg/fsync=false/commit=true/size=1048576
                        time:   [1.4108 ms 1.4383 ms 1.4624 ms]
                        thrpt:  [683.79 MiB/s 695.29 MiB/s 708.80 MiB/s]

@erikgrinaker

The fixed ~8 µs cost is likely Tokio scheduling overhead, see flamegraph.svg. Tokio's file IO is not particularly efficient, since each write is dispatched to a blocking thread pool. One option is to explore tokio-epoll-uring. Another is to increase batching, along with tokio::fs::File::set_max_buf_size, to amortize the scheduling costs.
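
A minimal sketch of that batching idea (my illustration, not the actual Safekeeper write path; it assumes a Tokio version recent enough to have File::set_max_buf_size):

```rust
use tokio::fs::OpenOptions;
use tokio::io::AsyncWriteExt;

/// Hypothetical helper: coalesce many small appends into large writes so the
/// per-write Tokio scheduling cost is paid once per batch, not once per append.
async fn write_batched(path: &str, appends: &[&[u8]]) -> std::io::Result<()> {
    const BATCH: usize = 4 * 1024 * 1024;
    let mut file = OpenOptions::new().create(true).append(true).open(path).await?;
    // Let the whole batch cross into the blocking thread pool as a single write.
    file.set_max_buf_size(BATCH);

    let mut buf = Vec::with_capacity(BATCH);
    for data in appends {
        buf.extend_from_slice(data);
        if buf.len() >= BATCH {
            file.write_all(&buf).await?;
            buf.clear();
        }
    }
    if !buf.is_empty() {
        file.write_all(&buf).await?;
    }
    file.flush().await?;
    Ok(())
}
```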


erikgrinaker commented Nov 6, 2024

Just to get a baseline, I added benchmarks for simple stdlib and Tokio file writes. This confirms that the majority of the fixed 8 µs cost is just Tokio. However, it also shows that we should be able to saturate the disk with either write path at large enough write sizes, although the throughput measurements are highly variable beyond 1 GB/s. This is unlike the process_msg path, where we cap out at around 800 MB/s -- worth exploring. Could be the segment size.
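
The baseline boils down to timing plain sequential writes through both APIs, roughly like this simplified sketch (not the committed benchmark code; an fsync=true variant would additionally call sync_all() after each write):

```rust
use std::io::Write;
use std::time::Instant;
use tokio::io::AsyncWriteExt;

// Time `iters` sequential writes of `size` bytes through std::fs.
fn stdlib_writes(path: &str, size: usize, iters: u32) -> std::io::Result<std::time::Duration> {
    let mut file = std::fs::File::create(path)?;
    let buf = vec![0u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        file.write_all(&buf)?;
    }
    Ok(start.elapsed())
}

// Same writes through tokio::fs; the difference to the stdlib numbers is
// (mostly) the runtime's scheduling overhead, not the disk.
async fn tokio_writes(path: &str, size: usize, iters: u32) -> std::io::Result<std::time::Duration> {
    let mut file = tokio::fs::File::create(path).await?;
    let buf = vec![0u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        file.write_all(&buf).await?;
    }
    Ok(start.elapsed())
}
```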

file_write/stdlib/fsync=false/size=8
                        time:   [1.1057 µs 1.1126 µs 1.1211 µs]
                        thrpt:  [6.8055 MiB/s 6.8572 MiB/s 6.9002 MiB/s]
file_write/stdlib/fsync=false/size=1024
                        time:   [1.6529 µs 1.6643 µs 1.6787 µs]
                        thrpt:  [581.72 MiB/s 586.75 MiB/s 590.83 MiB/s]
file_write/stdlib/fsync=false/size=8192
                        time:   [3.6048 µs 3.7885 µs 4.0441 µs]
                        thrpt:  [1.8866 GiB/s 2.0139 GiB/s 2.1165 GiB/s]
file_write/stdlib/fsync=false/size=131072
                        time:   [33.349 µs 40.166 µs 46.905 µs]
                        thrpt:  [2.6025 GiB/s 3.0391 GiB/s 3.6604 GiB/s]
file_write/stdlib/fsync=false/size=1048576
                        time:   [180.84 µs 206.91 µs 234.39 µs]
                        thrpt:  [4.1665 GiB/s 4.7198 GiB/s 5.4003 GiB/s]

file_write/tokio/fsync=false/size=8
                        time:   [7.4669 µs 7.5745 µs 7.6736 µs]
                        thrpt:  [1018.1 KiB/s 1.0072 MiB/s 1.0218 MiB/s]
file_write/tokio/fsync=false/size=1024
                        time:   [8.2489 µs 8.2887 µs 8.3289 µs]
                        thrpt:  [117.25 MiB/s 117.82 MiB/s 118.39 MiB/s]
file_write/tokio/fsync=false/size=8192
                        time:   [8.7655 µs 9.3938 µs 10.119 µs]
                        thrpt:  [772.06 MiB/s 831.66 MiB/s 891.28 MiB/s]
file_write/tokio/fsync=false/size=131072
                        time:   [23.198 µs 23.475 µs 23.824 µs]
                        thrpt:  [5.1239 GiB/s 5.2001 GiB/s 5.2622 GiB/s]
file_write/tokio/fsync=false/size=1048576
                        time:   [287.18 µs 340.62 µs 402.53 µs]
                        thrpt:  [2.4260 GiB/s 2.8671 GiB/s 3.4005 GiB/s]

For completeness, here are `fsync=true` results -- as expected, Tokio and stdlib are roughly equal here.

file_write/stdlib/fsync=true/size=8
                        time:   [4.2782 ms 4.3225 ms 4.3734 ms]
                        thrpt:  [1.7864 KiB/s 1.8074 KiB/s 1.8261 KiB/s]
file_write/stdlib/fsync=true/size=1024
                        time:   [4.3500 ms 4.3826 ms 4.4172 ms]
                        thrpt:  [226.39 KiB/s 228.18 KiB/s 229.89 KiB/s]
file_write/stdlib/fsync=true/size=8192
                        time:   [4.7115 ms 4.7774 ms 4.8438 ms]
                        thrpt:  [1.6129 MiB/s 1.6353 MiB/s 1.6582 MiB/s]
file_write/stdlib/fsync=true/size=131072
                        time:   [4.8032 ms 4.8609 ms 4.9180 ms]
                        thrpt:  [25.417 MiB/s 25.715 MiB/s 26.024 MiB/s]
file_write/stdlib/fsync=true/size=1048576
                        time:   [5.1204 ms 5.1468 ms 5.1752 ms]
                        thrpt:  [193.23 MiB/s 194.30 MiB/s 195.30 MiB/s]

file_write/tokio/fsync=true/size=8
                        time:   [4.1335 ms 4.1950 ms 4.2764 ms]
                        thrpt:  [1.8269 KiB/s 1.8623 KiB/s 1.8901 KiB/s]
file_write/tokio/fsync=true/size=1024
                        time:   [4.2537 ms 4.3002 ms 4.3485 ms]
                        thrpt:  [229.97 KiB/s 232.55 KiB/s 235.09 KiB/s]
file_write/tokio/fsync=true/size=8192
                        time:   [4.9080 ms 4.9594 ms 5.0100 ms]
                        thrpt:  [1.5594 MiB/s 1.5753 MiB/s 1.5918 MiB/s]
file_write/tokio/fsync=true/size=131072
                        time:   [4.9253 ms 4.9811 ms 5.0378 ms]
                        thrpt:  [24.812 MiB/s 25.095 MiB/s 25.379 MiB/s]
file_write/tokio/fsync=true/size=1048576
                        time:   [5.2555 ms 5.2914 ms 5.3267 ms]
                        thrpt:  [187.73 MiB/s 188.99 MiB/s 190.28 MiB/s]


erikgrinaker commented Nov 13, 2024

Much of the segment cost was addressed in the issues listed above. We can get further improvements by increasing the segment size in #9687, but this only yields 8% on the instances we currently use for Safekeepers, and requires system-wide changes.

Next, we should look at AppendRequest batching.


erikgrinaker commented Nov 13, 2024

With the above fixes, logical message throughput on the instance type we use for Safekeepers has increased by 13% at 1 KB writes and by 246% at 1 MB writes. Batching of smaller writes should close the remaining gap. We're maxing out at 734 MB/s, against a hardware capacity of 1.1 GB/s.

wal_acceptor_throughput/fsync=true/commit=false/size=1024
                        time:   [25.490 s 25.558 s 25.621 s]
                        thrpt:  [39.968 MiB/s 40.065 MiB/s 40.172 MiB/s]
                 change:
                        time:   [-11.833% -11.444% -11.080%] (p = 0.00 < 0.05)
                        thrpt:  [+12.461% +12.923% +13.422%]

wal_acceptor_throughput/fsync=true/commit=false/size=8192
                        time:   [4.3186 s 4.3455 s 4.3710 s]
                        thrpt:  [234.27 MiB/s 235.65 MiB/s 237.11 MiB/s]
                 change:
                        time:   [-43.972% -43.545% -43.115%] (p = 0.00 < 0.05)
                        thrpt:  [+75.792% +77.133% +78.482%]

wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [1.4305 s 1.4513 s 1.4727 s]
                        thrpt:  [695.30 MiB/s 705.57 MiB/s 715.84 MiB/s]
                 change:
                        time:   [-70.501% -70.069% -69.607%] (p = 0.00 < 0.05)
                        thrpt:  [+229.02% +234.10% +238.99%]

wal_acceptor_throughput/fsync=true/commit=false/size=1048576
                        time:   [1.3878 s 1.3951 s 1.4049 s]
                        thrpt:  [728.89 MiB/s 734.01 MiB/s 737.87 MiB/s]
                 change:
                        time:   [-71.279% -71.108% -70.884%] (p = 0.00 < 0.05)
                        thrpt:  [+243.45% +246.12% +248.17%]

With fsync disabled, ingestion hits the hardware capacity at 1.1 GB/s:

wal_acceptor_throughput/fsync=false/commit=false/size=1024
                        time:   [24.939 s 25.065 s 25.196 s]
                        thrpt:  [40.642 MiB/s 40.854 MiB/s 41.060 MiB/s]
                 change:
                        time:   [-10.570% -10.013% -9.4540%] (p = 0.00 < 0.05)
                        thrpt:  [+10.441% +11.127% +11.819%]

wal_acceptor_throughput/fsync=false/commit=false/size=8192
                        time:   [3.8126 s 3.8494 s 3.8813 s]
                        thrpt:  [263.83 MiB/s 266.02 MiB/s 268.58 MiB/s]
                 change:
                        time:   [-43.303% -42.680% -42.150%] (p = 0.00 < 0.05)
                        thrpt:  [+72.860% +74.458% +76.376%]

wal_acceptor_throughput/fsync=false/commit=false/size=131072
                        time:   [904.23 ms 920.37 ms 936.57 ms]
                        thrpt:  [1.0677 GiB/s 1.0865 GiB/s 1.1059 GiB/s]
                 change:
                        time:   [-76.685% -76.266% -75.778%] (p = 0.00 < 0.05)
                        thrpt:  [+312.86% +321.33% +328.90%]

wal_acceptor_throughput/fsync=false/commit=false/size=1048576
                        time:   [879.75 ms 886.35 ms 895.54 ms]
                        thrpt:  [1.1166 GiB/s 1.1282 GiB/s 1.1367 GiB/s]
                 change:
                        time:   [-77.261% -76.960% -76.626%] (p = 0.00 < 0.05)
                        thrpt:  [+327.83% +334.02% +339.77%]

@erikgrinaker

> Next, we should look at AppendRequest batching

The compute already performs sufficient batching: the Safekeeper receives 128 KB appends (MAX_SEND_SIZE).

I added an end-to-end benchmark in #9749, which ingests 10 GB of logical message WAL into the compute, Safekeeper, and Pageserver (with fsync enabled). Results:

wal_ingest: 12.349 s
pageserver_ingest: 20.415 s
wal_written: 10,340 MB
message_count: 1310720 messages
wal_throughput: 837 MB/s
pageserver_throughput: 506 MB/s

wal_throughput is in line with the microbenchmarks, which show ~730 MiB/s payload throughput on this machine:

wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [1.3638 s 1.4025 s 1.4411 s]
                        thrpt:  [710.56 MiB/s 730.10 MiB/s 750.83 MiB/s]

The compute also writes the same amount of WAL to its local disk, and I see the disk peak at 1.9 GB/s (capacity 5.5 GB/s).

So the bottleneck here is definitely the Safekeeper → Pageserver path, which only does 506 MB/s. The Pageserver shouldn't be doing much processing here, so I suspect this is due to inefficient IO along this path.

Interestingly, disk IO graphs show that disk reads only pick up once WAL ingestion completes on the Safekeeper:

[Disk IO graph: Safekeeper disk reads stay near zero during ingestion and only pick up once ingestion completes]

I added some logging in the Safekeeper WAL sender which confirms this: the throughput is only about 300 MB/s while WAL is being ingested, and then increases to 600 MB/s when ingestion completes. Why?

2024-11-14T10:33:30.896460Z WAL sender: throughput 17 MB/s
2024-11-14T10:33:31.896940Z WAL sender: throughput 364 MB/s
2024-11-14T10:33:32.937164Z WAL sender: throughput 292 MB/s
2024-11-14T10:33:33.942417Z WAL sender: throughput 382 MB/s
2024-11-14T10:33:34.955909Z WAL sender: throughput 333 MB/s
2024-11-14T10:33:35.989857Z WAL sender: throughput 339 MB/s
2024-11-14T10:33:36.989873Z WAL sender: throughput 423 MB/s
2024-11-14T10:33:37.989872Z WAL sender: throughput 409 MB/s
2024-11-14T10:33:38.989894Z WAL sender: throughput 317 MB/s
2024-11-14T10:33:40.000935Z WAL sender: throughput 224 MB/s
2024-11-14T10:33:41.000937Z WAL sender: throughput 609 MB/s
2024-11-14T10:33:42.000989Z WAL sender: throughput 598 MB/s
2024-11-14T10:33:43.001002Z WAL sender: throughput 583 MB/s
2024-11-14T10:33:44.001006Z WAL sender: throughput 672 MB/s
2024-11-14T10:33:45.001018Z WAL sender: throughput 665 MB/s
2024-11-14T10:33:46.001015Z WAL sender: throughput 597 MB/s
2024-11-14T10:33:47.001041Z WAL sender: throughput 714 MB/s
2024-11-14T10:33:48.001046Z WAL sender: throughput 614 MB/s
2024-11-14T10:33:49.001069Z WAL sender: throughput 522 MB/s
2024-11-14T10:33:50.001084Z WAL sender: throughput 549 MB/s
2024-11-14T10:33:51.001086Z WAL sender: throughput 470 MB/s
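
The per-second logging added for this is conceptually just a byte counter; a hypothetical sketch (names and placement are mine, not the actual patch, assuming the tracing crate):

```rust
use std::time::{Duration, Instant};

/// Counts bytes sent by the WAL sender and logs throughput roughly once per second.
struct ThroughputLogger {
    window_start: Instant,
    bytes: u64,
}

impl ThroughputLogger {
    fn new() -> Self {
        Self { window_start: Instant::now(), bytes: 0 }
    }

    /// Call after each send with the number of WAL bytes written to the socket.
    fn record(&mut self, sent: u64) {
        self.bytes += sent;
        let elapsed = self.window_start.elapsed();
        if elapsed >= Duration::from_secs(1) {
            let mbps = self.bytes as f64 / elapsed.as_secs_f64() / 1_000_000.0;
            tracing::info!("WAL sender: throughput {:.0} MB/s", mbps);
            self.window_start = Instant::now();
            self.bytes = 0;
        }
    }
}
```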

@erikgrinaker

> the throughput is only about 300 MB/s while WAL is being ingested, and then increases to 600 MB/s when ingestion completes. Why?

The above is an issue with the benchmark (or the compute?). Postgres returns once the logical messages have been written to its local WAL, and does not wait for them to be committed to the Safekeepers. I assumed the LSN returned by pg_current_wal_flush_lsn() would already be committed on the Safekeepers, but it isn't.

2024-11-14 12:53:13.656 INFO [test_ingest_logical_message.py:55] Ingest done: wal_lsn=2/8795E000 flush_lsn=2/8795E000

2024-11-14T11:53:14.015562Z  INFO WAL acceptor{cid=3 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb}: flushed 1/3C03E0D8
2024-11-14T11:53:14.058077Z  INFO WAL acceptor{cid=3 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb}: committed 1/19FFE7B8 → 1/3C03E0D8
2024-11-14T11:53:14.112243Z  INFO {cid=1 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb application_name=Some("pageserver")}:WAL sender: throughput 454 MB/s
2024-11-14T11:53:14.265083Z  INFO {cid=1 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb application_name=Some("pageserver")}:WAL sender: starved, waiting for WAL at 1/3C03E0D8
2024-11-14T11:53:15.028431Z  INFO WAL acceptor{cid=3 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb}: flushed 1/6A07E698
2024-11-14T11:53:15.073533Z  INFO WAL acceptor{cid=3 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb}: committed 1/3C03E0D8 → 1/6A07E698
2024-11-14T11:53:15.112267Z  INFO {cid=1 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb application_name=Some("pageserver")}:WAL sender: throughput 493 MB/s
2024-11-14T11:53:15.398301Z  INFO {cid=1 ttid=8d3c227168b4dd7293767240a2070a1e/28e7b224c5309a7eb3e8ba3ef57e61fb application_name=Some("pageserver")}:WAL sender: starved, waiting for WAL at 1/6A07E698

This also explains the bimodal nature of the write graph in #9642 (comment): it's first writing to both Postgres and the Safekeeper, then only to the Safekeeper. Safekeeper throughput is about 50% lower while both Postgres and the Safekeeper are writing to disk -- perhaps expected, although the disk should have plenty of capacity to accommodate them both.

I'll fix the benchmark by having it wait for Safekeeper commits, then investigate performance further.

@erikgrinaker

With the benchmark fixed, the Safekeeper is the bottleneck:

postgres_ingest: 12.524 s
safekeeper_ingest: 26.732 s
pageserver_ingest: 26.770 s
wal_written: 10,340 MB
message_count: 1310720 messages
postgres_throughput: 825 MB/s
safekeeper_throughput: 386 MB/s
pageserver_throughput: 386 MB/s

I think the Safekeeper tends to fall behind because it's fsyncing more aggressively than the compute -- it fsyncs on every segment boundary and every second, while Postgres doesn't really need to fsync until the end here. However, with fsync disabled we also see the Safekeeper fall behind (although less so):

postgres_ingest: 8.983 s
safekeeper_ingest: 12.333 s
pageserver_ingest: 12.419 s
wal_written: 10,340 MB
message_count: 1310720 messages
postgres_throughput: 1151 MB/s
safekeeper_throughput: 838 MB/s
pageserver_throughput: 832 MB/s

There's probably too much cross-talk between the compute and Safekeeper here, since they're using the same disk. I'll try a multi-node benchmark, and also a run on a Linux node with cheaper fsyncs.

@erikgrinaker

On a Linux machine (Hetzner), we are within 10% of the microbenchmark throughput with fsync enabled:

postgres_ingest: 7.191 s
safekeeper_ingest: 15.310 s
pageserver_ingest: 15.629 s
wal_written: 10,340 MB
postgres_throughput: 1437 MB/s
safekeeper_throughput: 675 MB/s
pageserver_throughput: 661 MB/s

wal_acceptor_throughput/fsync=true/commit=false/size=131072
                        time:   [1.3588 s 1.3787 s 1.3979 s]
                        thrpt:  [732.52 MiB/s 742.70 MiB/s 753.61 MiB/s]

With fsync disabled we're within 15%:

postgres_ingest: 6.527 s
safekeeper_ingest: 9.743 s
pageserver_ingest: 10.072 s
wal_written: 10,340 MB
postgres_throughput: 1584 MB/s
safekeeper_throughput: 1061 MB/s
pageserver_throughput: 1026 MB/s

wal_acceptor_throughput/fsync=false/commit=false/size=131072
                        time:   [804.79 ms 835.63 ms 860.47 ms]
                        thrpt:  [1.1622 GiB/s 1.1967 GiB/s 1.2426 GiB/s]

The disk can do 1.1 GB/s:

$ dd if=/dev/zero of=zero bs=1M count=4096 conv=fsync
4294967296 bytes (4.3 GB, 4.0 GiB) copied, 4.08711 s, 1.1 GB/s

I'll see if I can find some more low-hanging fruit to improve Safekeeper ingestion. Otherwise, we should move on to a multi-node setup (with separate disks and network latency), and other workloads.

@erikgrinaker

Closing this out for now. Logical message ingestion appears to be good enough, and we should focus on Pageserver performance in #9789.

github-merge-queue bot pushed a commit that referenced this issue Nov 29, 2024
Adds a benchmark for logical message WAL ingestion throughput
end-to-end. Logical messages are essentially noops, and thus ignored by
the Pageserver.

Example results from my MacBook, with fsync enabled:

```
postgres_ingest: 14.445 s
safekeeper_ingest: 29.948 s
pageserver_ingest: 30.013 s
pageserver_recover_ingest: 8.633 s
wal_written: 10,340 MB
message_count: 1310720 messages
postgres_throughput: 715 MB/s
safekeeper_throughput: 345 MB/s
pageserver_throughput: 344 MB/s
pageserver_recover_throughput: 1197 MB/s
```

See #9642 (comment) for running analysis.

Touches #9642.
awarus pushed a commit that referenced this issue Dec 5, 2024