Epic: sharded pageserver ingest #9329
Comments
Can we reconcile this with #6345? (I don't mind replacing that issue, just want to avoid duplication)
VladLazar added a commit that referenced this issue on Oct 24, 2024
…#9472)

## Problem

WAL ingest couples decoding of special records with their handling (mostly updates to the storage engine). This is a roadblock for our plan to move WAL filtering (and implicitly decoding) to safekeepers, since they cannot write to the storage engine.

## Summary of changes

This PR decouples the decoding of the special WAL records from their application. The changes are done in place; I've refrained from refactorings and preserved the original code as much as possible.

Related: #9335
Epic: #9329
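The decode/apply split described above can be sketched as two independent stages: a pure decoding function that any node (including a safekeeper) can run, and an application function that only the pageserver runs against its storage engine. This is an illustrative sketch; the type and function names are hypothetical, not the actual Neon codebase API.

```rust
use std::collections::BTreeMap;

/// A decoded record: pure data, no storage-engine side effects.
/// (Toy stand-in for the real decoded WAL record types.)
enum DecodedRecord {
    Put { key: u64, value: Vec<u8> },
}

/// Decoding: can run anywhere (e.g. on a safekeeper); needs no write access.
fn decode(raw: &[u8]) -> Option<DecodedRecord> {
    let key = *raw.first()? as u64; // toy key derivation
    Some(DecodedRecord::Put { key, value: raw.to_vec() })
}

/// Application: only the pageserver runs this, against its storage engine
/// (modeled here as a plain BTreeMap).
fn apply(store: &mut BTreeMap<u64, Vec<u8>>, rec: DecodedRecord) {
    match rec {
        DecodedRecord::Put { key, value } => {
            store.insert(key, value);
        }
    }
}
```

Because `decode` takes only bytes and returns data, it can be moved to another process or host without touching `apply`, which is the property the PR is after.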
VladLazar added a commit that referenced this issue on Oct 29, 2024
## Problem

We wish to have the high-level WAL decoding logic in the `wal_decoder::decoder` module.

## Summary of changes

For this we need the `Value` and `NeonWalRecord` types accessible there, so:
1. Move `Value` and `NeonWalRecord` to `pageserver::value` and `pageserver::record` respectively.
2. Get rid of `pageserver::repository` (follow-up from (1)).
3. Move PG-specific WAL record types to `postgres_ffi::walrecord`. In theory they could live in `wal_decoder`, but that would create a circular dependency between `wal_decoder` and `postgres_ffi`. Long term it makes sense for those types to be PG-version-specific, so that will work out nicely.
4. Move higher-level WAL record types (to be ingested by the pageserver) into `wal_decoder::models`.

Related: #9335
Epic: #9329
VladLazar added a commit that referenced this issue on Oct 31, 2024
…9524)

## Problem

Decoding and ingestion are still coupled in `pageserver::WalIngest`.

## Summary of changes

A new type, `InterpretedWalRecord`, is added to `wal_decoder::models`. It contains everything the pageserver requires in order to ingest a WAL record. The highlights are `metadata_record`, an optional special record type to be handled, and `blocks`, which stores the key/value pairs to be persisted to storage.

This type is produced by `wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG WAL record. The rest of this commit separates decoding and interpretation of the PG WAL record from its application in `WalIngest::ingest_record`.

Related: #9335
Epic: #9329
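Per the PR text, the interpreted record carries an optional metadata record plus key/value pairs to persist. A minimal sketch of that shape, with illustrative field types and a toy `from_bytes` (the real `wal_decoder` types are more involved):

```rust
/// Toy model of an interpreted WAL record: everything the pageserver
/// needs to ingest it, with decoding already done.
struct InterpretedWalRecord {
    /// Optional special record to handle (e.g. a relation truncation);
    /// a plain String here purely for illustration.
    metadata_record: Option<String>,
    /// (key, value) pairs to be persisted to storage.
    blocks: Vec<(u64, Vec<u8>)>,
}

impl InterpretedWalRecord {
    /// Stand-in for the real `from_bytes`: chops the raw record into
    /// fixed-size chunks and keys them by position.
    fn from_bytes(raw: &[u8]) -> Self {
        InterpretedWalRecord {
            metadata_record: None,
            blocks: raw
                .chunks(2)
                .enumerate()
                .map(|(i, chunk)| (i as u64, chunk.to_vec()))
                .collect(),
        }
    }
}
```

The point of the shape is that ingestion (`WalIngest::ingest_record`) can consume `metadata_record` and `blocks` directly, with no PG WAL parsing left to do.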
github-merge-queue bot pushed a commit that referenced this issue on Nov 25, 2024
…#9746)

## Problem

For any given tenant shard, pageservers receive all of the tenant's WAL from the safekeeper. This soft-blocks us from using larger shard counts due to bandwidth concerns and the CPU overhead of filtering out the records.

## Summary of changes

This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper. A customised PG replication protocol is used where, instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so, on the pageserver side, usage of the new protocol is gated by a new pageserver config: `wal_receiver_protocol`.

More granularly, the changes are:
1. Optionally inject the protocol and shard identity into the arguments used for starting replication.
2. On the safekeeper side, implement a new WAL sending primitive which decodes and interprets records before sending them over.
3. On the pageserver side, implement ingestion of this new replication message type. It's very similar to what we already have for raw WAL (minus decoding and interpreting).

## Notes

* This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support), which includes the deserialization logic for the new replication message type. The PR for that is open [here](neondatabase/rust-postgres#32).
* This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases.
* CI tests are running on #9747.

## Links

Related: #9336
Epic: #9329
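Since the receiver drives protocol selection, the pageserver config effectively picks which replication arguments get sent to the safekeeper. A hedged sketch of that gating; the enum variants, the argument string, and `replication_args` are all illustrative, not the real `wal_receiver_protocol` values or wire format:

```rust
/// Hypothetical protocol choice mirroring the PR description:
/// raw WAL (the default) vs. filtered, interpreted records.
#[derive(Clone, Copy, PartialEq, Debug)]
enum WalReceiverProtocol {
    Vanilla,
    Interpreted,
}

/// Build the replication start arguments; the shard identity is only
/// injected when the new protocol is selected (step 1 above).
fn replication_args(proto: WalReceiverProtocol, shard: u32) -> String {
    match proto {
        WalReceiverProtocol::Vanilla => "START_REPLICATION".to_string(),
        WalReceiverProtocol::Interpreted => {
            format!("START_REPLICATION (protocol 'interpreted', shard '{shard}')")
        }
    }
}
```

Keeping `Vanilla` as the default is what makes the PR safe to merge: safekeepers gain the new send path, but no pageserver asks for it until the config is flipped.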
github-merge-queue bot pushed a commit that referenced this issue on Nov 27, 2024
…#9821)

## Problem

#9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single-sharded tenants. Also, #9746 uses bincode, which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome.

## Summary of changes

This patch set addresses both problems by adding protobuf support for the interpreted WAL records and adding compression support. Compressed protobuf reduced the ingested amount by 100x in the 32-shard `test_sharded_ingest` case (compared to non-interpreted proto). For the 1-shard case the reduction is 5x. The sister change to `rust-postgres` is [here](neondatabase/rust-postgres#33).

## Links

Related: #9336
Epic: #9329
Short pre-winter-holiday update.
Once we're back:
github-merge-queue bot pushed a commit that referenced this issue on Jan 15, 2025
## Problem Safekeepers currently decode and interpret WAL for each shard separately. This is wasteful in terms of CPU memory usage - we've seen this in profiles. ## Summary of changes Fan-out interpreted WAL to multiple shards. The basic is that wal decoding and interpretation happens in a separate tokio task and senders attach to it. Senders only receive batches concerning their shard and only past the Lsn they've last seen. Fan-out is gated behind the `wal_reader_fanout` safekeeper flag (disabled by default for now). When fan-out is enabled, it might be desirable to control the absolute delta between the current position and a new shard's desired position (i.e. how far behind or ahead a shard may be). `max_delta_for_fanout` is a new optional safekeeper flag which dictates whether to create a new WAL reader or attach to the existing one. By default, this behaviour is disabled. Let's consider enabling it if we spot the need for it in the field. ## Testing Tests passed [here](#10301) with wal reader fanout enabled as of 34f6a71. Related: #9337 Epic: #9329
github-merge-queue bot pushed a commit that referenced this issue on Jan 20, 2025
## Summary

Whereas currently we send all WAL to all pageserver shards, and each shard filters out the data that it needs, in this RFC we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs. This will place some extra CPU load on the safekeepers, in exchange for reducing the network bandwidth for ingesting WAL back to scaling as O(1) with shard count, rather than O(N_shards).

Touches #9329.

Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
awarus pushed a commit that referenced this issue on Jan 24, 2025
Background

We currently send all WAL to all pageserver shards, and each shard filters out the data that it needs.
In this epic we add a mechanism to filter the WAL on the safekeeper, so that each shard receives
only the data it needs.
RFC in #8754.
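The bandwidth argument above is simple arithmetic: broadcasting raw WAL costs the full WAL volume per shard, while safekeeper-side filtering delivers each byte to roughly one shard. A back-of-envelope model (the figures are illustrative, not measured):

```rust
/// Total ingest bytes when every shard receives the full WAL stream:
/// scales as O(N_shards).
fn broadcast_bytes(wal_bytes: u64, n_shards: u64) -> u64 {
    wal_bytes * n_shards
}

/// Total ingest bytes when the safekeeper filters per shard: each byte
/// goes to (roughly) one shard, so the total stays O(1) in shard count.
fn filtered_bytes(wal_bytes: u64, _n_shards: u64) -> u64 {
    wal_bytes
}
```

For a hypothetical 100 MB of WAL and 8 shards, broadcasting ships 800 MB to the pageservers while filtering ships about 100 MB - the gap that motivates this epic, and it widens linearly with shard count.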
Sub-issues

Nice to Haves

Review follow-ups:
- walingest.rs #9491
- `ShardIdentity` in `postgres_client.rs` #9823