Epic: revamp pageserver backpressure #8390

Open
2 of 5 tasks
Tracked by #10160
skyzh opened this issue Jul 15, 2024 · 2 comments
Labels
a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged

Comments


skyzh commented Jul 15, 2024

Follow-up to https://neondb.slack.com/archives/C03F5SM1N02/p1721058880447979 and #10095.

Updated proposal 2024-12-12 by @erikgrinaker:

Recall the current backpressure mechanism, based on these compute knobs:

  • max_replication_write_lag: 500 MB (based on Pageserver last_received_lsn).
  • max_replication_flush_lag: 10 GB (based on Pageserver disk_consistent_lsn).
  • max_replication_apply_lag: disabled (based on Pageserver remote_consistent_lsn).

If the compute's WAL write position leads the corresponding Pageserver LSN by more than one of these thresholds, the compute injects a 10 ms sleep after every WAL record.
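
For illustration, a minimal sketch of this check, assuming hypothetical names (`BackpressureConfig`, `PageserverFeedback`, `backpressure_delay`) rather than the actual compute code:

```rust
// Hypothetical sketch of the current throttling decision; the struct and
// function names are illustrative, not the actual compute/neon extension code.
use std::time::Duration;

struct BackpressureConfig {
    max_replication_write_lag: Option<u64>, // vs. last_received_lsn (default 500 MB)
    max_replication_flush_lag: Option<u64>, // vs. disk_consistent_lsn (default 10 GB)
    max_replication_apply_lag: Option<u64>, // vs. remote_consistent_lsn (disabled)
}

struct PageserverFeedback {
    last_received_lsn: u64,
    disk_consistent_lsn: u64,
    remote_consistent_lsn: u64,
}

/// Returns the sleep to inject after each WAL record, if any.
fn backpressure_delay(
    cfg: &BackpressureConfig,
    compute_write_lsn: u64,
    fb: &PageserverFeedback,
) -> Option<Duration> {
    let exceeds = |limit: Option<u64>, lsn: u64| {
        limit.is_some_and(|l| compute_write_lsn.saturating_sub(lsn) > l)
    };
    let throttle = exceeds(cfg.max_replication_write_lag, fb.last_received_lsn)
        || exceeds(cfg.max_replication_flush_lag, fb.disk_consistent_lsn)
        || exceeds(cfg.max_replication_apply_lag, fb.remote_consistent_lsn);
    // Binary behavior: either no throttling at all, or a fixed 10 ms sleep.
    throttle.then(|| Duration::from_millis(10))
}
```

Note the all-or-nothing behavior: there is no gradual slowdown between "no delay" and "10 ms per record", which is one of the issues listed below.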

There are three aspects we don't backpressure on, but should:

  • L0/compaction: if compaction falls behind, read amplification and compaction debt increase without bound (#5415).
  • S3 uploads: if uploads fall behind, disk usage and crash recovery time increase without bound (#5897).
  • Sharding: different shards can have different amounts of debt, so e.g. remote_consistent_lsn is misleading (#10095 comment).

With sharding, disk_consistent_lsn and remote_consistent_lsn are misleading, because they don't scale with shard count: 8 shards lagging by 1 GB of LSN is very different from 1 shard lagging by 1 GB of LSN -- we should bound the outstanding amount of work per shard, not the total outstanding work.

Additionally, the current backpressure protocol has a few issues:

  • Calculation changes require a compute release and restart (which can take weeks or months).
  • Protocol changes must be backwards compatible until all computes have restarted.
  • Backpressure is binary (either off or 10 ms per WAL record).

Sketch for a new backpressure protocol (see the code sketch after this list):

  • Each Pageserver shard computes a per-shard WAL target rate based on:
    • Safekeeper commit LSN: Pageserver ingestion should keep up.
    • In-memory layers: disk flushing should keep up.
    • L0 bytes and files: compaction should keep up.
    • Upload queue size: uploads should keep up.
  • The Safekeeper aggregates a single WAL target rate based on min/average/sum across shards (needs experiments).
    • Alternatively, just have a stop or slow down signal from each shard.
  • Send a single WAL target rate to the compute (0 to stall, -1 to disable throttling).
  • Sleep on compute WAL appends based on target WAL rate.
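
A rough sketch of what the per-shard computation and Safekeeper aggregation could look like; the names, thresholds, and linear scaling below are assumptions for illustration, not a settled design:

```rust
// Hypothetical sketch of the proposed protocol. Each shard derives a target
// WAL ingest rate from its own backlogs, and the Safekeeper aggregates across
// shards (here: the minimum) before sending one rate to the compute.

/// Per-shard backlog signals a Pageserver shard could report.
struct ShardBacklog {
    ingest_lag_bytes: u64,   // Safekeeper commit LSN vs. last_received_lsn
    inmem_layer_bytes: u64,  // in-memory layers awaiting disk flush
    l0_files: usize,         // L0 layer files awaiting compaction
    upload_queue_bytes: u64, // bytes queued for S3 upload
}

/// Target WAL rate in bytes/s: Some(0) means stall, None means "no throttling".
fn shard_target_rate(b: &ShardBacklog) -> Option<u64> {
    // Purely illustrative: scale the allowed rate down linearly as each
    // backlog approaches an assumed limit, and take the most restrictive one.
    const BASE_RATE: u64 = 512 * 1024 * 1024; // assumed 512 MB/s ceiling
    let factor = |used: u64, limit: u64| -> f64 {
        (1.0 - used as f64 / limit as f64).clamp(0.0, 1.0)
    };
    let f = factor(b.ingest_lag_bytes, 256 * 1024 * 1024)
        .min(factor(b.inmem_layer_bytes, 512 * 1024 * 1024))
        .min(factor(b.l0_files as u64, 20))
        .min(factor(b.upload_queue_bytes, 1024 * 1024 * 1024));
    if f >= 1.0 {
        None // no backlog anywhere: disable throttling (-1 on the wire)
    } else {
        Some((BASE_RATE as f64 * f) as u64) // 0 => stall
    }
}

/// Safekeeper-side aggregation: the most restrictive shard wins.
fn aggregate_target_rate(shards: &[Option<u64>]) -> Option<u64> {
    shards.iter().copied().fold(None, |acc, rate| match (acc, rate) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (Some(a), None) => Some(a),
        (None, r) => r,
    })
}
```

Whether the aggregate should be the minimum, average, or sum across shards is exactly the open question noted above and would need experiments.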

Tasks

  1. a/tech_debt c/storage/pageserver (arpad-m)
  2. a/performance c/storage/pageserver (erikgrinaker)
  3. a/performance c/storage/pageserver (erikgrinaker)
  4. c/storage/pageserver t/feature
  5. c/storage/pageserver t/feature (erikgrinaker)
@skyzh skyzh added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Jul 15, 2024

jcsp commented Jul 15, 2024

@skyzh skyzh changed the title pageserver backpressure Epic: pageserver backpressure Jul 15, 2024

jcsp commented Jul 25, 2024

Plan:

Our existing mitigation for L0 compaction (only compact 10 at once) makes us safe.

@jcsp jcsp added the triaged bugs that were already triaged label Jul 25, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 12, 2024
@erikgrinaker erikgrinaker changed the title Epic: pageserver backpressure Epic: revamp pageserver backpressure Dec 12, 2024
@erikgrinaker erikgrinaker pinned this issue Dec 12, 2024
@erikgrinaker erikgrinaker added the a/performance Area: relates to performance of the system label Dec 12, 2024
@erikgrinaker erikgrinaker unpinned this issue Dec 12, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 15, 2024
## Problem

In #8550, we made the flush loop wait for uploads after every layer.
This was to avoid unbounded buildup of uploads, and to reduce compaction
debt. However, the approach has several problems:

* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via
`disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read
amplification.

An alternative solution to these problems is proposed in #8390.

In the meantime, we revert the change to reduce the impact on ingest
throughput. This does reintroduce some risk of unbounded
upload/compaction buildup. Until
#8390 lands, this can be addressed
in other ways:

* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which
will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more
Pageservers and move the bottleneck to Safekeeper.

Touches #10095.

## Summary of changes

Remove waiting on the upload queue in the flush loop.
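
For concreteness, a hypothetical sketch of the flush loop before and after this change; the `Timeline` trait and its methods are stand-ins, not the real pageserver interface:

```rust
// Hypothetical sketch of the flush-loop behavior removed here. Previously,
// each iteration waited for the upload queue after flushing one layer,
// coupling WAL ingest to S3 upload latency; dropping the wait restores
// flush/upload pipelining.

#[allow(async_fn_in_trait)]
pub trait Timeline {
    type Layer;
    type Error;
    async fn next_frozen_layer(&self) -> Option<Self::Layer>;
    async fn flush_frozen_layer(&self, layer: Self::Layer) -> Result<Self::Layer, Self::Error>;
    fn schedule_layer_upload(&self, layer: Self::Layer) -> Result<(), Self::Error>;
    fn schedule_index_upload(&self) -> Result<(), Self::Error>;
    async fn wait_upload_queue_empty(&self) -> Result<(), Self::Error>;
}

pub async fn flush_loop<T: Timeline>(tl: &T) -> Result<(), T::Error> {
    while let Some(frozen) = tl.next_frozen_layer().await {
        let layer = tl.flush_frozen_layer(frozen).await?; // write layer to local disk
        tl.schedule_layer_upload(layer)?;                 // enqueue S3 upload
        tl.schedule_index_upload()?;                      // enqueue index upload
        // Removed by this change:
        // tl.wait_upload_queue_empty().await?;
        // Waiting here serialized flush and upload and slowed ingest even
        // when no backpressure was needed.
    }
    Ok(())
}
```
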
github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
This reverts commit f3ecd5d.

It is
[suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759)
to have caused significant read amplification in the [ingest
benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs)
(specifically during index creation).

We will revisit an intermediate improvement here to unblock [upload
parallelism](#10096) before
properly addressing [compaction
backpressure](#8390).
erikgrinaker added a commit that referenced this issue Jan 3, 2025
This reverts commit f3ecd5d.

It is
[suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759)
to have caused significant read amplification in the [ingest
benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs)
(specifically during index creation).

We will revisit an intermediate improvement here to unblock [upload
parallelism](#10096) before
properly addressing [compaction
backpressure](#8390).
github-merge-queue bot pushed a commit that referenced this issue Jan 14, 2025
## Problem

The upload queue currently sees significant head-of-line blocking. For
example, index uploads act as upload barriers, and for every layer flush
we schedule a layer and index upload, which effectively serializes layer
uploads.

Resolves #10096.

## Summary of changes

Allow upload queue operations to bypass the queue if they don't conflict
with preceding operations, increasing parallelism.

NB: the upload queue currently schedules an explicit barrier after every
layer flush as well (see #8550). This must be removed to enable
parallelism. This will require a better mechanism for compaction
backpressure, see e.g. #8390 or #5415.
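
As an illustration of the bypass idea, a conflict check could look roughly like this; the `UploadOp` variants and rules are assumptions for the sketch, not the actual remote client code:

```rust
// Hypothetical sketch of conflict-based bypass in an upload queue; the types
// and rules are illustrative, not the actual remote_timeline_client logic.

enum UploadOp {
    UploadLayer(String), // layer file name
    DeleteLayer(String),
    UploadIndex, // index_part.json references all layers, so it acts as a barrier
}

/// May `op` start even though `earlier` is still queued or in flight?
fn conflicts(op: &UploadOp, earlier: &UploadOp) -> bool {
    use UploadOp::*;
    match (op, earlier) {
        // Index uploads must observe all preceding operations, and nothing
        // may overtake a queued index upload.
        (UploadIndex, _) | (_, UploadIndex) => true,
        // Operations touching the same layer file must stay ordered.
        (UploadLayer(a) | DeleteLayer(a), UploadLayer(b) | DeleteLayer(b)) => a == b,
    }
}

/// Return the queued operations that can run concurrently right now.
fn ready_ops(queue: &[UploadOp]) -> Vec<&UploadOp> {
    let mut ready = Vec::new();
    for (i, op) in queue.iter().enumerate() {
        if queue[..i].iter().all(|earlier| !conflicts(op, earlier)) {
            ready.push(op);
        }
    }
    ready
}
```

This also shows why scheduling an index upload after every layer flush serializes the queue: each index acts as a barrier for everything behind it.
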
@erikgrinaker erikgrinaker removed their assignment Jan 20, 2025