
Epic: sharded pageserver ingest #9329

Open
6 of 13 tasks
VladLazar opened this issue Oct 9, 2024 · 2 comments
Assignees
Labels
c/storage Component: storage t/Epic Issue type: Epic

Comments

@VladLazar
Contributor

VladLazar commented Oct 9, 2024

Background

We currently send all WAL to all pageserver shards, and each shard filters out the data that it needs. In this epic we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs.

RFC in #8754.
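
For intuition, the per-shard filter itself is just a predicate over the page keys a decoded record touches; the question this epic answers is where that predicate runs. Below is a minimal sketch of that idea, with a placeholder hash-based key-to-shard mapping standing in for the pageserver's real striped `ShardIdentity`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified stand-in for the pageserver's shard identity.
#[derive(Clone, Copy)]
struct ShardIdentity {
    shard_number: u64,
    shard_count: u64,
}

impl ShardIdentity {
    /// Does the page identified by `key` belong to this shard?
    /// (Placeholder hash mapping; the real mapping is stripe based.)
    fn is_key_local<K: Hash>(&self, key: &K) -> bool {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        hasher.finish() % self.shard_count == self.shard_number
    }
}

/// Keep only the blocks of a decoded record that this shard must ingest.
/// Today this runs on every pageserver shard; the epic moves it to the safekeeper.
fn filter_for_shard<K: Hash>(shard: &ShardIdentity, blocks: Vec<K>) -> Vec<K> {
    blocks
        .into_iter()
        .filter(|key| shard.is_key_local(key))
        .collect()
}
```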

Sub-issues

  1. a/observability c/storage/safekeeper
    erikgrinaker
  2. a/scalability c/storage/pageserver
    VladLazar
  3. a/scalability c/storage/safekeeper
    VladLazar
  4. a/scalability c/storage/safekeeper
    VladLazar
  5. a/scalability c/storage/safekeeper
  6. a/benchmark c/storage/safekeeper
    erikgrinaker
  7. a/performance a/scalability c/storage/safekeeper
  8. a/benchmark c/storage
    erikgrinaker

Nice to Haves

  1. c/storage/pageserver c/storage/safekeeper t/Epic
  2. 1 of 3
    c/storage c/storage/pageserver t/bug triaged

Review follow-ups

  1. a/tech_debt c/storage c/storage/pageserver
    VladLazar
@VladLazar VladLazar added c/storage Component: storage t/Epic Issue type: Epic labels Oct 9, 2024
@jcsp
Collaborator

jcsp commented Oct 10, 2024

Can we reconcile this with #6345? (I don't mind replacing that issue, just want to avoid duplication)

VladLazar added a commit that referenced this issue Oct 24, 2024
…#9472)

## Problem

WAL ingest couples decoding of special records with their handling (updates to the storage engine, mostly). This is a roadblock for our plan to move WAL filtering (and implicitly decoding) to safekeepers, since they cannot do writes to the storage engine.

## Summary of changes

This PR decouples the decoding of the special WAL records from their application. The changes are done in place; I've done my best to refrain from refactoring and to preserve the original code as much as possible.

Related: #9335
Epic: #9329
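
The shape of that decoupling, in an illustrative sketch (types and names here are placeholders, not the actual pageserver API): decoding becomes a pure transformation from WAL bytes into plain data, and only the application step touches the storage engine.

```rust
/// Placeholder for a pageserver key.
struct PageKey {
    rel: u32,
    blkno: u32,
}

/// Plain data describing what a WAL record changes; holds no storage handles,
/// so it can be produced anywhere, including on a safekeeper.
struct DecodedRecord {
    lsn: u64,
    pages: Vec<(PageKey, Vec<u8>)>,
}

/// Pure decoding step.
fn decode(_raw_wal: &[u8], lsn: u64) -> DecodedRecord {
    // ... parse the raw record into page images / deltas ...
    DecodedRecord { lsn, pages: Vec::new() }
}

/// Application step: only the pageserver does this, against its storage engine.
trait Storage {
    fn put(&mut self, lsn: u64, key: PageKey, value: Vec<u8>);
}

fn apply(storage: &mut dyn Storage, rec: DecodedRecord) {
    let lsn = rec.lsn;
    for (key, value) in rec.pages {
        storage.put(lsn, key, value);
    }
}
```
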
VladLazar added a commit that referenced this issue Oct 29, 2024
## Problem

We wish to have the high-level WAL decoding logic in the `wal_decoder::decoder` module.

## Summary of Changes

For this we need the `Value` and `NeonWalRecord` types accessible there, so:
1. Move `Value` and `NeonWalRecord` to `pageserver::value` and `pageserver::record` respectively.
2. Get rid of `pageserver::repository` (follow-up from (1)).
3. Move PG-specific WAL record types to `postgres_ffi::walrecord`. In theory they could live in `wal_decoder`, but it would create a circular dependency between `wal_decoder` and `postgres_ffi`. Long term it makes sense for those types to be PG-version specific, so that will work out nicely.
4. Move higher-level WAL record types (to be ingested by the pageserver) into `wal_decoder::models`.

Related: #9335
Epic: #9329
VladLazar added a commit that referenced this issue Oct 31, 2024
…9524)

## Problem

Decoding and ingestion are still coupled in `pageserver::WalIngest`.

## Summary of changes

A new type, `InterpretedWalRecord`, is added to `wal_decoder::models`. This type contains everything that the pageserver requires in order to ingest a WAL record. The highlights are `metadata_record`, an optional special record type to be handled, and `blocks`, which stores the key, value pairs to be persisted to storage.

This type is produced by `wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG WAL record.

The rest of this commit separates decoding and interpretation of the PG WAL record from its application in `WalIngest::ingest_record`.

Related: #9335
Epic: #9329
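
As a rough sketch of the shape described above (field names follow the commit description; this is not the exact upstream definition in `wal_decoder::models`):

```rust
/// Placeholder key/value types standing in for the pageserver's own.
struct Key;
struct Value;

/// A "special" record (relation/database level operations, checkpoints, ...)
/// that needs dedicated handling during ingest.
enum MetadataRecord {
    // variants elided in this sketch
}

/// Everything the pageserver needs in order to ingest one WAL record.
struct InterpretedWalRecord {
    /// Optional special record to be handled explicitly.
    metadata_record: Option<MetadataRecord>,
    /// (key, value) pairs to be persisted to storage.
    blocks: Vec<(Key, Value)>,
    /// LSN of the original PG record.
    lsn: u64,
}

impl InterpretedWalRecord {
    /// Decode and interpret a raw PG WAL record.
    fn from_bytes(_raw: &[u8], lsn: u64) -> Self {
        // Sketch only: the real decoding lives in wal_decoder.
        InterpretedWalRecord {
            metadata_record: None,
            blocks: Vec::new(),
            lsn,
        }
    }
}
```
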
@erikgrinaker erikgrinaker removed their assignment Nov 8, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 25, 2024
…#9746)

## Problem

For any given tenant shard, pageservers receive all of the tenant's WAL from the safekeeper. This soft-blocks us from using larger shard counts due to bandwidth concerns and the CPU overhead of filtering out the records.

## Summary of changes

This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper.

A customised PG replication protocol is used where, instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so, on the pageserver side, usage of the new protocol is gated by a new pageserver config: `wal_receiver_protocol`.

More granularly, the changes are:
1. Optionally inject the protocol and shard identity into the arguments used for starting replication.
2. On the safekeeper side, implement a new WAL sending primitive which decodes and interprets records before sending them over.
3. On the pageserver side, implement the ingestion of this new replication message type. It's very similar to what we already have for raw WAL (minus decoding and interpreting).
 
## Notes

* This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support) which includes the deserialization logic for the new replication message type. PR for that is open [here](neondatabase/rust-postgres#32).
* This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases.
* CI tests are running on #9747

## Links

Related: #9336
Epic: #9329
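
A hedged sketch of how receiver-driven protocol selection can look (the enum mirrors the `wal_receiver_protocol` config mentioned above; the variant names and replication option strings are illustrative assumptions, not the exact upstream ones):

```rust
/// Which WAL receiver protocol the pageserver asks the safekeeper for.
#[derive(Clone, Copy, Debug)]
enum WalReceiverProtocol {
    /// Safekeeper streams raw WAL; the pageserver decodes and filters (old behaviour).
    Vanilla,
    /// Safekeeper decodes, interprets and filters; the pageserver ingests directly.
    Interpreted,
}

/// Build the extra options the pageserver passes when starting replication.
/// The receiver drives the choice, so an old pageserver simply never asks for
/// the new protocol and the safekeeper keeps sending raw WAL.
fn replication_options(
    protocol: WalReceiverProtocol,
    shard_number: u32,
    shard_count: u32,
) -> Vec<(String, String)> {
    match protocol {
        WalReceiverProtocol::Vanilla => vec![],
        WalReceiverProtocol::Interpreted => vec![
            ("proto_version".into(), "interpreted".into()),
            ("shard_number".into(), shard_number.to_string()),
            ("shard_count".into(), shard_count.to_string()),
        ],
    }
}
```
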
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
…#9821)

## Problem

#9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single-sharded tenants.

Also, #9746 uses bincode, which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome.

## Summary of changes

This patch set addresses both problems by adding protobuf support for the interpreted WAL records and adding compression support. Compressed protobuf reduced the ingested amount by 100x in the 32-shard `test_sharded_ingest` case (compared to non-interpreted proto). For the 1-shard case the reduction is 5x.

Sister change to `rust-postgres` is
[here](neondatabase/rust-postgres#33).

## Links

Related: #9336
Epic: #9329
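
A minimal sketch of the serialize-then-compress step, assuming the `prost` and `zstd` crates; the message definition and compression level here are stand-ins, not the real wire schema or the settings used in the PR:

```rust
use prost::Message;

/// Stand-in protobuf message for a batch of interpreted records.
#[derive(Clone, PartialEq, prost::Message)]
struct InterpretedWalRecordsProto {
    #[prost(uint64, tag = "1")]
    next_record_lsn: u64,
    #[prost(bytes = "vec", repeated, tag = "2")]
    records: Vec<Vec<u8>>,
}

fn encode_batch(batch: &InterpretedWalRecordsProto) -> std::io::Result<Vec<u8>> {
    // Protobuf gives us schema evolution (unlike bincode); zstd shrinks the
    // repetitive page images/deltas considerably.
    let proto_bytes = batch.encode_to_vec();
    zstd::encode_all(proto_bytes.as_slice(), /* level */ 3)
}

fn decode_batch(wire: &[u8]) -> Result<InterpretedWalRecordsProto, Box<dyn std::error::Error>> {
    let proto_bytes = zstd::decode_all(wire)?;
    Ok(InterpretedWalRecordsProto::decode(proto_bytes.as_slice())?)
}
```
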
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
## Problem

Can't change protocol at tenant granularity.

## Summary of changes

Add tenant config level override for wal receiver protocol.

## Links

Related: #9336
Epic: #9329
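
The override resolution follows the usual "tenant setting wins, otherwise fall back to the pageserver default" pattern; a minimal sketch with illustrative field names (not the exact config structs):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum WalReceiverProtocol {
    Vanilla,
    Interpreted,
}

/// Node-wide pageserver configuration (sketch).
struct PageserverConf {
    wal_receiver_protocol: WalReceiverProtocol,
}

/// Per-tenant configuration (sketch): unset means "use the node default".
struct TenantConf {
    wal_receiver_protocol_override: Option<WalReceiverProtocol>,
}

fn effective_protocol(node: &PageserverConf, tenant: &TenantConf) -> WalReceiverProtocol {
    tenant
        .wal_receiver_protocol_override
        .unwrap_or(node.wal_receiver_protocol)
}
```
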
@VladLazar
Contributor Author

VladLazar commented Dec 20, 2024

Short pre-winter-holiday update.

Once we're back:

  • Continue deployment of sharded ingest to all other regions
  • Tests and tweaks to the fan-out wal reader (e.g. use futures when not sharded to minimise task spawning)
  • Separate roll-out of fan-out wal reader

github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
## Problem

Safekeepers currently decode and interpret WAL for each shard separately. This is wasteful in terms of CPU and memory usage; we've seen this in profiles.

## Summary of changes

Fan-out interpreted WAL to multiple shards. The basic idea is that WAL decoding and interpretation happen in a separate tokio task and senders attach to it. Senders only receive batches concerning their shard, and only past the LSN they've last seen.

Fan-out is gated behind the `wal_reader_fanout` safekeeper flag (disabled by default for now).

When fan-out is enabled, it might be desirable to control the absolute delta between the current position and a new shard's desired position (i.e. how far behind or ahead a shard may be). `max_delta_for_fanout` is a new optional safekeeper flag which dictates whether to create a new WAL reader or attach to the existing one. By default, this behaviour is disabled. Let's consider enabling it if we spot the need for it in the field.

## Testing

Tests passed [here](#10301) with wal reader fanout enabled as of 34f6a71.

Related: #9337
Epic: #9329
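
A hedged sketch of the fan-out shape described above, assuming tokio mpsc channels (the real implementation also manages subscriptions, back-pressure and the `max_delta_for_fanout` decision; this only shows the decode-once, send-per-shard idea):

```rust
use tokio::sync::mpsc;

/// One batch of interpreted records destined for a single shard (sketch).
#[derive(Clone)]
struct InterpretedBatch {
    shard: u32,
    end_lsn: u64,
    // ... interpreted records for this shard ...
}

/// A shard's sender, attached to the shared reader task.
struct ShardSubscriber {
    shard: u32,
    last_seen_lsn: u64,
    tx: mpsc::Sender<InterpretedBatch>,
}

/// One reader/decoder task fans interpreted batches out to per-shard senders,
/// each of which only gets its own shard's data past its start LSN.
async fn fanout_reader(
    mut wal_batches: mpsc::Receiver<Vec<InterpretedBatch>>, // decoded once, split per shard
    subscribers: Vec<ShardSubscriber>,
) {
    while let Some(per_shard) = wal_batches.recv().await {
        for batch in per_shard {
            for sub in &subscribers {
                // Only this shard's data, and only WAL the shard hasn't seen yet.
                if sub.shard == batch.shard && batch.end_lsn > sub.last_seen_lsn {
                    // Ignore send errors: a disconnected shard just stops receiving.
                    let _ = sub.tx.send(batch.clone()).await;
                }
            }
        }
    }
}
```
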
github-merge-queue bot pushed a commit that referenced this issue Jan 20, 2025
## Summary

Whereas currently we send all WAL to all pageserver shards, and each shard filters out the data that it needs, in this RFC we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs.

This will place some extra CPU load on the safekeepers, in exchange for reducing the network bandwidth for ingesting WAL back to scaling as O(1) with shard count, rather than O(N_shards).

Touches #9329.


Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
awarus pushed a commit that referenced this issue Jan 24, 2025