Epic: revamp pageserver backpressure #8390

Open
2 of 5 tasks
Tracked by #10160
skyzh opened this issue Jul 15, 2024 · 2 comments
Labels
a/performance Area: relates to performance of the system c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug triaged bugs that were already triaged

Comments


skyzh commented Jul 15, 2024

Follow-up to https://neondb.slack.com/archives/C03F5SM1N02/p1721058880447979 and #10095.

Updated proposal 2024-12-12 by @erikgrinaker:

Recall the current backpressure mechanism, based on these compute knobs:

  • max_replication_write_lag: 500 MB (based on Pageserver last_received_lsn).
  • max_replication_flush_lag: 10 GB (based on Pageserver disk_consistent_lsn).
  • max_replication_apply_lag: disabled (based on Pageserver remote_consistent_lsn).

If the compute's WAL write position leads the corresponding Pageserver LSN by more than one of these thresholds, the compute injects a 10 ms sleep after every WAL record.
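
For illustration, a minimal sketch of this check, assuming hypothetical names (`BackpressureConfig`, `PageserverFeedback`, `backpressure_delay`) rather than the actual compute code:

```rust
// Hypothetical sketch of the current throttling decision; the struct and
// function names are illustrative, not the actual compute/neon extension code.
use std::time::Duration;

struct BackpressureConfig {
    max_replication_write_lag: Option<u64>, // vs. last_received_lsn (default 500 MB)
    max_replication_flush_lag: Option<u64>, // vs. disk_consistent_lsn (default 10 GB)
    max_replication_apply_lag: Option<u64>, // vs. remote_consistent_lsn (disabled)
}

struct PageserverFeedback {
    last_received_lsn: u64,
    disk_consistent_lsn: u64,
    remote_consistent_lsn: u64,
}

/// Returns the sleep to inject after each WAL record, if any.
fn backpressure_delay(
    cfg: &BackpressureConfig,
    compute_write_lsn: u64,
    fb: &PageserverFeedback,
) -> Option<Duration> {
    let exceeds = |limit: Option<u64>, lsn: u64| {
        limit.is_some_and(|l| compute_write_lsn.saturating_sub(lsn) > l)
    };
    let throttle = exceeds(cfg.max_replication_write_lag, fb.last_received_lsn)
        || exceeds(cfg.max_replication_flush_lag, fb.disk_consistent_lsn)
        || exceeds(cfg.max_replication_apply_lag, fb.remote_consistent_lsn);
    // Binary behavior: either no throttling at all, or a fixed 10 ms sleep.
    throttle.then(|| Duration::from_millis(10))
}
```

Note the all-or-nothing behavior: there is no gradual slowdown between "no delay" and "10 ms per record", which is one of the issues listed below.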

There are three aspects we don't backpressure on, but should:

  • L0/compaction: if compaction falls behind, read amplification and compaction debt increase without bound (#5415).
  • S3 uploads: if uploads fall behind, disk usage and crash recovery time increase without bound (#5897).
  • Sharding: different shards can have different amounts of debt, so e.g. remote_consistent_lsn is misleading (#10095 comment).

With sharding, disk_consistent_lsn and remote_consistent_lsn are misleading, because they don't scale with shard count: 8 shards lagging by 1 GB of LSN is very different from 1 shard lagging by 1 GB of LSN -- we should bound the outstanding amount of work per shard, not the total outstanding work.

Additionally, the current backpressure protocol has a few issues:

  • Calculation changes require a compute release and restart (which can take weeks or months).
  • Protocol changes must be backwards compatible until all computes have restarted.
  • Backpressure is binary (either off or 10 ms per WAL record).

Sketch for a new backpressure protocol (see the code sketch after this list):

  • Each Pageserver shard computes a per-shard WAL target rate based on:
    • Safekeeper commit LSN: Pageserver ingestion should keep up.
    • In-memory layers: disk flushing should keep up.
    • L0 bytes and files: compaction should keep up.
    • Upload queue size: uploads should keep up.
  • The Safekeeper aggregates a single WAL target rate based on min/average/sum across shards (needs experiments).
    • Alternatively, just have a stop or slow down signal from each shard.
  • Send a single WAL target rate to the compute (0 to stall, -1 to disable throttling).
  • Sleep on compute WAL appends based on target WAL rate.
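
A rough sketch of what the per-shard computation and Safekeeper aggregation could look like; the names, thresholds, and linear scaling below are assumptions for illustration, not a settled design:

```rust
// Hypothetical sketch of the proposed protocol. Each shard derives a target
// WAL ingest rate from its own backlogs, and the Safekeeper aggregates across
// shards (here: the minimum) before sending one rate to the compute.

/// Per-shard backlog signals a Pageserver shard could report.
struct ShardBacklog {
    ingest_lag_bytes: u64,   // Safekeeper commit LSN vs. last_received_lsn
    inmem_layer_bytes: u64,  // in-memory layers awaiting disk flush
    l0_files: usize,         // L0 layer files awaiting compaction
    upload_queue_bytes: u64, // bytes queued for S3 upload
}

/// Target WAL rate in bytes/s: Some(0) means stall, None means "no throttling".
fn shard_target_rate(b: &ShardBacklog) -> Option<u64> {
    // Purely illustrative: scale the allowed rate down linearly as each
    // backlog approaches an assumed limit, and take the most restrictive one.
    const BASE_RATE: u64 = 512 * 1024 * 1024; // assumed 512 MB/s ceiling
    let factor = |used: u64, limit: u64| -> f64 {
        (1.0 - used as f64 / limit as f64).clamp(0.0, 1.0)
    };
    let f = factor(b.ingest_lag_bytes, 256 * 1024 * 1024)
        .min(factor(b.inmem_layer_bytes, 512 * 1024 * 1024))
        .min(factor(b.l0_files as u64, 20))
        .min(factor(b.upload_queue_bytes, 1024 * 1024 * 1024));
    if f >= 1.0 {
        None // no backlog anywhere: disable throttling (-1 on the wire)
    } else {
        Some((BASE_RATE as f64 * f) as u64) // 0 => stall
    }
}

/// Safekeeper-side aggregation: the most restrictive shard wins.
fn aggregate_target_rate(shards: &[Option<u64>]) -> Option<u64> {
    shards.iter().copied().fold(None, |acc, rate| match (acc, rate) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (Some(a), None) => Some(a),
        (None, r) => r,
    })
}
```

Whether the aggregate should be the minimum, average, or sum across shards is exactly the open question noted above and would need experiments.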

Tasks

  1. a/tech_debt c/storage/pageserver (arpad-m)
  2. a/performance c/storage/pageserver (erikgrinaker)
  3. a/performance c/storage/pageserver (erikgrinaker)
  4. c/storage/pageserver t/feature
  5. c/storage/pageserver t/feature (erikgrinaker)
@skyzh skyzh added t/bug Issue Type: Bug c/storage/pageserver Component: storage: pageserver labels Jul 15, 2024

jcsp commented Jul 15, 2024

@skyzh skyzh changed the title pageserver backpressure Epic: pageserver backpressure Jul 15, 2024

jcsp commented Jul 25, 2024

Plan:

Our existing mitigation for L0 compaction (only compact 10 at once) makes us safe.

@jcsp jcsp added the triaged bugs that were already triaged label Jul 25, 2024
@erikgrinaker erikgrinaker self-assigned this Dec 12, 2024
@erikgrinaker erikgrinaker changed the title Epic: pageserver backpressure Epic: revamp pageserver backpressure Dec 12, 2024
@erikgrinaker erikgrinaker pinned this issue Dec 12, 2024
@erikgrinaker erikgrinaker added the a/performance Area: relates to performance of the system label Dec 12, 2024
@erikgrinaker erikgrinaker unpinned this issue Dec 12, 2024
github-merge-queue bot pushed a commit that referenced this issue Dec 15, 2024
## Problem

In #8550, we made the flush loop wait for uploads after every layer.
This was to avoid unbounded buildup of uploads, and to reduce compaction
debt. However, the approach has several problems:

* It prevents upload parallelism.
* It prevents flush and upload pipelining.
* It slows down ingestion even when there is no need to backpressure.
* It does not directly backpressure WAL ingestion (only via
`disk_consistent_lsn`), and will build up in-memory layers.
* It does not directly backpressure based on compaction debt and read
amplification.

An alternative solution to these problems is proposed in #8390.

In the meantime, we revert the change to reduce the impact on ingest
throughput. This does reintroduce some risk of unbounded
upload/compaction buildup. Until
#8390 lands, this can be addressed
in other ways:

* Use `max_replication_apply_lag` (aka `remote_consistent_lsn`), which
will more directly limit upload debt.
* Shard the tenant, which will spread the flush/upload work across more
Pageservers and move the bottleneck to Safekeeper.

Touches #10095.

## Summary of changes

Remove waiting on the upload queue in the flush loop.
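
For concreteness, a hypothetical sketch of the flush loop before and after this change; the `Timeline` trait and its methods are stand-ins, not the real pageserver interface:

```rust
// Hypothetical sketch of the flush-loop behavior removed here. Previously,
// each iteration waited for the upload queue after flushing one layer,
// coupling WAL ingest to S3 upload latency; dropping the wait restores
// flush/upload pipelining.

#[allow(async_fn_in_trait)]
pub trait Timeline {
    type Layer;
    type Error;
    async fn next_frozen_layer(&self) -> Option<Self::Layer>;
    async fn flush_frozen_layer(&self, layer: Self::Layer) -> Result<Self::Layer, Self::Error>;
    fn schedule_layer_upload(&self, layer: Self::Layer) -> Result<(), Self::Error>;
    fn schedule_index_upload(&self) -> Result<(), Self::Error>;
    async fn wait_upload_queue_empty(&self) -> Result<(), Self::Error>;
}

pub async fn flush_loop<T: Timeline>(tl: &T) -> Result<(), T::Error> {
    while let Some(frozen) = tl.next_frozen_layer().await {
        let layer = tl.flush_frozen_layer(frozen).await?; // write layer to local disk
        tl.schedule_layer_upload(layer)?;                 // enqueue S3 upload
        tl.schedule_index_upload()?;                      // enqueue index upload
        // Removed by this change:
        // tl.wait_upload_queue_empty().await?;
        // Waiting here serialized flush and upload and slowed ingest even
        // when no backpressure was needed.
    }
    Ok(())
}
```
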
github-merge-queue bot pushed a commit that referenced this issue Jan 3, 2025
This reverts commit f3ecd5d.

It is
[suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759)
to have caused significant read amplification in the [ingest
benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs)
(specifically during index creation).

We will revisit an intermediate improvement here to unblock [upload
parallelism](#10096) before
properly addressing [compaction
backpressure](#8390).
erikgrinaker added a commit that referenced this issue Jan 3, 2025
This reverts commit f3ecd5d.

It is
[suspected](https://neondb.slack.com/archives/C033RQ5SPDH/p1735907405716759)
to have caused significant read amplification in the [ingest
benchmark](https://neonprod.grafana.net/d/de3mupf4g68e8e/perf-test3a-ingest-benchmark?orgId=1&from=now-30d&to=now&timezone=utc&var-new_project_endpoint_id=ep-solitary-sun-w22bmut6&var-large_tenant_endpoint_id=ep-holy-bread-w203krzs)
(specifically during index creation).

We will revisit an intermediate improvement here to unblock [upload
parallelism](#10096) before
properly addressing [compaction
backpressure](#8390).
github-merge-queue bot pushed a commit that referenced this issue Jan 14, 2025
## Problem

The upload queue currently sees significant head-of-line blocking. For
example, index uploads act as upload barriers, and for every layer flush
we schedule a layer and index upload, which effectively serializes layer
uploads.

Resolves #10096.

## Summary of changes

Allow upload queue operations to bypass the queue if they don't conflict
with preceding operations, increasing parallelism.

NB: the upload queue currently schedules an explicit barrier after every
layer flush as well (see #8550). This must be removed to enable
parallelism. This will require a better mechanism for compaction
backpressure, see e.g. #8390 or #5415.
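
As an illustration of the bypass idea, a conflict check could look roughly like this; the `UploadOp` variants and rules are assumptions for the sketch, not the actual remote client code:

```rust
// Hypothetical sketch of conflict-based bypass in an upload queue; the types
// and rules are illustrative, not the actual remote_timeline_client logic.

enum UploadOp {
    UploadLayer(String), // layer file name
    DeleteLayer(String),
    UploadIndex, // index_part.json references all layers, so it acts as a barrier
}

/// May `op` start even though `earlier` is still queued or in flight?
fn conflicts(op: &UploadOp, earlier: &UploadOp) -> bool {
    use UploadOp::*;
    match (op, earlier) {
        // Index uploads must observe all preceding operations, and nothing
        // may overtake a queued index upload.
        (UploadIndex, _) | (_, UploadIndex) => true,
        // Operations touching the same layer file must stay ordered.
        (UploadLayer(a) | DeleteLayer(a), UploadLayer(b) | DeleteLayer(b)) => a == b,
    }
}

/// Return the queued operations that can run concurrently right now.
fn ready_ops(queue: &[UploadOp]) -> Vec<&UploadOp> {
    let mut ready = Vec::new();
    for (i, op) in queue.iter().enumerate() {
        if queue[..i].iter().all(|earlier| !conflicts(op, earlier)) {
            ready.push(op);
        }
    }
    ready
}
```

This also shows why scheduling an index upload after every layer flush serializes the queue: each index acts as a barrier for everything behind it.
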
@erikgrinaker erikgrinaker removed their assignment Jan 20, 2025