page_service: add benchmark for batching #9820

problame · 2024-11-20T13:19:20Z

This PR adds two benchmarks to demonstrate the effect of server-side
getpage request batching added in #9321.

For the CPU usage, I found that the prometheus crate's built-in CPU usage metric accounts seconds at integer granularity. That's not precise enough when you reduce the target benchmark runtime for local iteration. So, add a new libmetrics metric and report that.
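
To see why whole-second accounting is too coarse, here is a minimal Python sketch. It is not the Rust metric added in `libs/metrics/src/more_process_metrics.rs`; `process_cpu_seconds` and `cpu_delta` are hypothetical helpers, and `resource.getrusage` (Unix-only standard library) is used only as an example of a sub-second CPU clock.

```python
import math
import resource


def process_cpu_seconds() -> float:
    """User + system CPU time consumed by this process, with sub-second resolution."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_utime + usage.ru_stime


def cpu_delta(before: float, after: float, integer_granularity: bool) -> float:
    """CPU seconds spent between two samples of a cumulative CPU counter.

    With integer_granularity=True, each sample is truncated to whole seconds
    first, mimicking a counter that only advances in 1s steps.
    """
    if integer_granularity:
        return float(math.floor(after) - math.floor(before))
    return after - before


# The same 0.8s of CPU work, sampled from two different starting points.
for before, after in [(12.1, 12.9), (12.3, 13.1)]:
    print(
        f"coarse={cpu_delta(before, after, True):.1f}s "
        f"precise={cpu_delta(before, after, False):.2f}s"
    )
```

Both intervals represent 0.8s of CPU, yet the truncated counter reports 0.0s for the first and 1.0s for the second, while the high-resolution delta is 0.80s in both cases. For a short benchmark run, that error dominates any CPU-per-throughput comparison.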

The benchmarks are disabled because on our benchmark nodes, timer resolution isn't high enough.
They work (no statement about quality) on my bare-metal devbox.
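
A rough way to check whether a host can honor microsecond-scale waits, which a short batching timeout would need, is to measure how far short sleeps overshoot. This is a hypothetical diagnostic, not part of the benchmark suite, and it times Python's `time.sleep` rather than tokio's timer, so it only indicates what the OS and clock on a given machine can deliver.

```python
import statistics
import time
from typing import List


def sleep_overshoot_us(requested_us: float, samples: int = 200) -> List[float]:
    """Return, per sample, how many microseconds longer than requested a sleep took."""
    overshoots = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(requested_us / 1e6)
        elapsed_us = (time.perf_counter() - start) * 1e6
        overshoots.append(elapsed_us - requested_us)
    return overshoots


info = time.get_clock_info("perf_counter")
print(f"perf_counter: {info.implementation}, reported resolution {info.resolution}s")
for requested in (10, 100, 1000):
    o = sleep_overshoot_us(requested)
    print(
        f"sleep({requested}us): median overshoot {statistics.median(o):.1f}us, "
        f"max {max(o):.1f}us"
    )
```

If the overshoot is comparable to or larger than the requested duration, a benchmark of a 10us batching timeout on that machine mostly measures the timer rather than batching.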

The benchmarks will be refined and enabled once we find a fix. Candidates at the time of writing are:

Refs:

- Epic: get page throughput improvements #9376
- Extracted from page_service: getpage batching: refactor & minor fixes #9792

github-actions · 2024-11-20T14:11:51Z

5607 tests run: 5371 passed, 0 failed, 236 skipped (full report)

Flaky tests (1)

Postgres 17

test_timeline_archival_chaos: release-x86-64

Code coverage* (full report)

functions: 31.0% (7972 of 25721 functions)
lines: 48.8% (63294 of 129732 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results
18f2964 at 2024-11-25T09:13:22.691Z :recycle:

# Problem

The timeout-based batching adds latency to unbatchable workloads. We can choose a short batching timeout (e.g. 10us), but that requires high-resolution timers, which tokio doesn't have. I thoroughly explored options to use OS timers (see [this](#9822) abandoned PR). In short, it's not an attractive option because any timer implementation adds non-trivial overhead.

# Solution

The insight is that, in the steady state of a batchable workload, the time we spend in `get_vectored` will be hundreds of microseconds anyway. If we prepare the next batch concurrently with `get_vectored`, we will have a sizeable batch ready once `get_vectored` of the current batch is done, and we do not need an explicit timeout. This can reasonably be described as **pipelining of the protocol handler**.

# Implementation

We model the sub-protocol handler for pagestream requests (`handle_pagerequests`) as two futures that form a pipeline:

1. Batching: read requests from the connection and fill the current batch.
2. Execution: `take` the current batch, execute it using `get_vectored`, and send the response.

The two stages are connected through a new type of channel called `spsc_fold`. See the long comment in `handle_pagerequests_pipelined` for details.
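
A minimal asyncio sketch of the two-stage pipeline, assuming `spsc_fold` behaves like a one-slot channel whose sender folds each new value into the pending one, or waits for the receiver to take the pending value when folding fails. `SpscFold`, `Batch`, `make_merge`, `batcher`, `executor`, and the merge rule (same timeline, at most `max_batch_size` pages) are hypothetical illustrations, not the pageserver's Rust implementation; `asyncio.sleep(0.001)` stands in for `get_vectored`.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Optional, Tuple, TypeVar

T = TypeVar("T")


class SpscFold(Generic[T]):
    """Hypothetical one-slot SPSC channel: the sender folds into the pending value."""

    def __init__(self) -> None:
        self._pending: Optional[T] = None
        self._closed = False
        self._cond = asyncio.Condition()

    async def send(self, value: T, fold: Callable[[T, T], Optional[T]]) -> None:
        async with self._cond:
            while True:
                if self._pending is None:
                    self._pending = value  # slot empty: this value starts a new batch
                    self._cond.notify_all()
                    return
                if fold(self._pending, value) is None:
                    return  # merged into the pending batch
                await self._cond.wait()  # cannot merge: wait until the receiver takes it

    async def close(self) -> None:
        async with self._cond:
            self._closed = True
            self._cond.notify_all()

    async def recv(self) -> Optional[T]:
        async with self._cond:
            while self._pending is None and not self._closed:
                await self._cond.wait()
            value, self._pending = self._pending, None  # None once closed and drained
            self._cond.notify_all()
            return value


@dataclass
class Batch:
    timeline: str
    pages: List[int] = field(default_factory=list)


def make_merge(max_batch_size: int) -> Callable[[Batch, Batch], Optional[Batch]]:
    def merge(pending: Batch, incoming: Batch) -> Optional[Batch]:
        """Return None if merged into `pending`, else hand `incoming` back."""
        if pending.timeline != incoming.timeline:
            return incoming
        if len(pending.pages) + len(incoming.pages) > max_batch_size:
            return incoming
        pending.pages.extend(incoming.pages)
        return None

    return merge


async def batcher(requests, chan: SpscFold[Batch], merge) -> None:
    """Batching stage: read requests one by one and fold them into the pending batch."""
    for timeline, page in requests:
        await chan.send(Batch(timeline, [page]), merge)
        await asyncio.sleep(0)  # yield, as reading the next message from a socket would
    await chan.close()


async def executor(chan: SpscFold[Batch], executed: List[Batch]) -> None:
    """Execution stage: take the pending batch and execute it."""
    while (batch := await chan.recv()) is not None:
        await asyncio.sleep(0.001)  # stands in for get_vectored latency
        executed.append(batch)


async def run_pipeline(requests: List[Tuple[str, int]], max_batch_size: int) -> List[Batch]:
    chan: SpscFold[Batch] = SpscFold()
    executed: List[Batch] = []
    await asyncio.gather(
        batcher(requests, chan, make_merge(max_batch_size)),
        executor(chan, executed),
    )
    return executed


batches = asyncio.run(run_pipeline([("tl", p) for p in range(10)], max_batch_size=4))
print([b.pages for b in batches])
```

The first request is executed on its own; while the simulated `get_vectored` runs, later requests keep arriving and are folded into the next batch, so batches form without any timeout. When a request cannot be merged (different timeline, or the batch is full), the batching stage simply waits for the execution stage to take the pending batch.
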

# Changes

- Refactor `handle_pagerequests`
  - separate functions for
    - reading one protocol message; produces a `BatchedFeMessage` with just one page request in it
    - batching; tries to merge an incoming `BatchedFeMessage` into an existing `BatchedFeMessage`; returns `None` on success and returns back the incoming message in case merging isn't possible
    - execution of a batched message
  - unify the timeline handle acquisition & request span construction; it now happens in the function that reads the protocol message
- Implement serial and pipelined models
  - serial: what we had before any of the batching changes
    - read one protocol message
    - execute protocol messages
  - pipelined: the design described above
    - optionality for execution of the pipeline: either via concurrent futures or tokio tasks
- Pageserver config
  - remove batching timeout field
  - add ability to configure pipelining mode
  - add ability to limit max batch size for pipelined configurations (required for the rollout, cf neondatabase/cloud#20620)
  - add ability to configure execution mode
- Tests
  - remove `batch_timeout` parametrization
  - rename `test_getpage_merge_smoke` to `test_throughput`
  - add parametrization to test different max batch sizes and execution modes
  - rename `test_timer_precision` to `test_latency`
  - rename the test case file to `test_page_service_batching.py`
  - better descriptions of what the tests actually do

## On holding the `TimelineHandle` in the pending batch

While batching, we hold the `TimelineHandle` in the pending batch. Therefore, the timeline will not finish shutting down while we're batching. This is not a problem in practice because the concurrently ongoing `get_vectored` call will fail quickly with an error indicating that the timeline is shutting down. This results in the Execution stage returning a `QueryError::Shutdown`, which causes the pipeline / entire page service connection to shut down. This drops all references to the `Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the contained `TimelineHandle`s.

=> fixes #9850

# Performance

Local run of the benchmarks, results in [this empty commit](1cf5b14) in the PR branch.

Key take-aways:

* `concurrent-futures` and `tasks` deliver identical `batching_factor`
* tail latency impact unknown, cf #9837
* `concurrent-futures` has higher throughput than `tasks` in all workloads (= lower `time` metric)
* In unbatchable workloads, `concurrent-futures` has 5% higher `CPU-per-throughput` than that of `tasks`, and 15% higher than that of `serial`.
* In the batchable-32 workload, `concurrent-futures` has 8% lower `CPU-per-throughput` than that of `tasks` (comparison to tput of `serial` is irrelevant)
* In unbatchable workloads, mean and tail latencies of `concurrent-futures` are practically identical to `serial`, whereas `tasks` adds 20-30us of overhead

Overall, `concurrent-futures` seems like a slightly more attractive choice.

# Rollout

This change is disabled by default.

Rollout plan:

- neondatabase/cloud#20620

# Refs

- epic: #9376
- this sub-task: #9377
- the abandoned attempt to improve batching timeout resolution: #9822
- closes #9850
- fixes #9835

page_service: add benchmark for batching

b695907

problame added the run-benchmarks label Nov 20, 2024

This was referenced Nov 20, 2024

Epic: get page throughput improvements #9376

Open

pageserver: batch get page requests and serve them with one vectored get #9377

Open

problame requested a review from VladLazar November 20, 2024 13:53

problame mentioned this pull request Nov 20, 2024

WIP: page_service: higher-resolution timer for batching #9822

Closed

bayandin reviewed Nov 20, 2024

test_runner/performance/pageserver/test_pageserver_getpage_merge.py (outdated, resolved)

VladLazar approved these changes Nov 20, 2024

test_runner/performance/pageserver/test_pageserver_getpage_merge.py (resolved)

test_runner/performance/pageserver/test_pageserver_getpage_merge.py (resolved)

problame commented Nov 20, 2024

test_runner/performance/pageserver/test_pageserver_getpage_merge.py (outdated, resolved)

problame added 5 commits November 21, 2024 11:16

high-resolution CPU usage

e82deb2

pytest.approx; #9820 (comment)

3375f28

Merge remote-tracking branch 'origin/main' into problame/batching-benchmark

ff0aa15

add benchmark for making batching timeout precision apparent

70608b0

add another benchmark

0b79f13

VladLazar approved these changes Nov 21, 2024

test_runner/performance/pageserver/test_pageserver_getpage_merge.py (resolved)

libs/metrics/src/more_process_metrics.rs (resolved)

problame mentioned this pull request Nov 21, 2024

page_service: measure tail latency impact in batchable workload #9837

Open

problame added 2 commits November 21, 2024 17:07

allowed_errors fix

56cbbc3

Merge remote-tracking branch 'origin/main' into problame/batching-benchmark

ad8dbcb

problame mentioned this pull request Nov 22, 2024

page_service: rewrite batching to work without a timeout #9851

Merged

problame removed the run-benchmarks label Nov 22, 2024

skip tests

3eba1ba

problame enabled auto-merge (squash) November 22, 2024 11:37

Merge branch 'main' into problame/batching-benchmark

18f2964

bayandin approved these changes Nov 25, 2024

problame disabled auto-merge November 25, 2024 15:52

problame added this pull request to the merge queue Nov 25, 2024

Merged via the queue into main with commit 5c23569 Nov 25, 2024
79 checks passed

problame deleted the problame/batching-benchmark branch November 25, 2024 15:54
