
Epic: sharded pageserver ingest #9329

Open
6 of 13 tasks
VladLazar opened this issue Oct 9, 2024 · 2 comments
Assignees
Labels
c/storage Component: storage t/Epic Issue type: Epic

Comments

@VladLazar
Contributor

VladLazar commented Oct 9, 2024

Background

We currently send all WAL to all pageserver shards, and each shard filters out the data that it needs. In this epic we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs.

RFC in #8754.
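
For intuition, the per-shard filter itself is just a predicate over the page keys a decoded record touches; the question this epic answers is where that predicate runs. Below is a minimal sketch of that idea, with a placeholder hash-based key-to-shard mapping standing in for the pageserver's real striped `ShardIdentity`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified stand-in for the pageserver's shard identity.
#[derive(Clone, Copy)]
struct ShardIdentity {
    shard_number: u64,
    shard_count: u64,
}

impl ShardIdentity {
    /// Does the page identified by `key` belong to this shard?
    /// (Placeholder hash mapping; the real mapping is stripe based.)
    fn is_key_local<K: Hash>(&self, key: &K) -> bool {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        hasher.finish() % self.shard_count == self.shard_number
    }
}

/// Keep only the blocks of a decoded record that this shard must ingest.
/// Today this runs on every pageserver shard; the epic moves it to the safekeeper.
fn filter_for_shard<K: Hash>(shard: &ShardIdentity, blocks: Vec<K>) -> Vec<K> {
    blocks
        .into_iter()
        .filter(|key| shard.is_key_local(key))
        .collect()
}
```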

Sub-issues

  1. a/observability c/storage/safekeeper
    erikgrinaker
  2. a/scalability c/storage/pageserver
    VladLazar
  3. a/scalability c/storage/safekeeper
    VladLazar
  4. a/scalability c/storage/safekeeper
    VladLazar
  5. a/scalability c/storage/safekeeper
  6. a/benchmark c/storage/safekeeper
    erikgrinaker
  7. a/performance a/scalability c/storage/safekeeper
  8. a/benchmark c/storage
    erikgrinaker

Nice to Haves

  1. c/storage/pageserver c/storage/safekeeper t/Epic
  2. 1 of 3
    c/storage c/storage/pageserver t/bug triaged

Review follow-ups

  1. a/tech_debt c/storage c/storage/pageserver
    VladLazar
@VladLazar VladLazar added c/storage Component: storage t/Epic Issue type: Epic labels Oct 9, 2024
@jcsp
Collaborator

jcsp commented Oct 10, 2024

Can we reconcile this with #6345? (I don't mind replacing that issue, just want to avoid duplication)

VladLazar added a commit that referenced this issue Oct 24, 2024
…#9472)

## Problem

WAL ingest couples decoding of special records with their handling (updates to the storage engine, mostly). This is a roadblock for our plan to move WAL filtering (and implicitly decoding) to safekeepers, since they cannot do writes to the storage engine.

## Summary of changes

This PR decouples the decoding of the special WAL records from their application. The changes are done in place; I've done my best to refrain from refactoring and to preserve the original code as much as possible.

Related: #9335
Epic: #9329
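
The shape of that decoupling, in an illustrative sketch (types and names here are placeholders, not the actual pageserver API): decoding becomes a pure transformation from WAL bytes into plain data, and only the application step touches the storage engine.

```rust
/// Placeholder for a pageserver key.
struct PageKey {
    rel: u32,
    blkno: u32,
}

/// Plain data describing what a WAL record changes; holds no storage handles,
/// so it can be produced anywhere, including on a safekeeper.
struct DecodedRecord {
    lsn: u64,
    pages: Vec<(PageKey, Vec<u8>)>,
}

/// Pure decoding step.
fn decode(_raw_wal: &[u8], lsn: u64) -> DecodedRecord {
    // ... parse the raw record into page images / deltas ...
    DecodedRecord { lsn, pages: Vec::new() }
}

/// Application step: only the pageserver does this, against its storage engine.
trait Storage {
    fn put(&mut self, lsn: u64, key: PageKey, value: Vec<u8>);
}

fn apply(storage: &mut dyn Storage, rec: DecodedRecord) {
    let lsn = rec.lsn;
    for (key, value) in rec.pages {
        storage.put(lsn, key, value);
    }
}
```
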
VladLazar added a commit that referenced this issue Oct 29, 2024
## Problem

We wish to have the high-level WAL decoding logic in the `wal_decoder::decoder` module.

## Summary of Changes

For this we need the `Value` and `NeonWalRecord` types accessible there, so:
1. Move `Value` and `NeonWalRecord` to `pageserver::value` and `pageserver::record` respectively.
2. Get rid of `pageserver::repository` (follow-up from (1)).
3. Move PG-specific WAL record types to `postgres_ffi::walrecord`. In theory they could live in `wal_decoder`, but it would create a circular dependency between `wal_decoder` and `postgres_ffi`. Long term it makes sense for those types to be PG-version specific, so that will work out nicely.
4. Move higher-level WAL record types (to be ingested by the pageserver) into `wal_decoder::models`.

Related: #9335
Epic: #9329
VladLazar added a commit that referenced this issue Oct 31, 2024
…9524)

## Problem

Decoding and ingestion are still coupled in `pageserver::WalIngest`.

## Summary of changes

A new type, `InterpretedWalRecord`, is added to `wal_decoder::models`. This type contains everything that the pageserver requires in order to ingest a WAL record. The highlights are `metadata_record`, an optional special record type to be handled, and `blocks`, which stores the key, value pairs to be persisted to storage.

This type is produced by `wal_decoder::models::InterpretedWalRecord::from_bytes` from a raw PG WAL record.

The rest of this commit separates decoding and interpretation of the PG WAL record from its application in `WalIngest::ingest_record`.

Related: #9335
Epic: #9329
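
As a rough sketch of the shape described above (field names follow the commit description; this is not the exact upstream definition in `wal_decoder::models`):

```rust
/// Placeholder key/value types standing in for the pageserver's own.
struct Key;
struct Value;

/// A "special" record (relation/database level operations, checkpoints, ...)
/// that needs dedicated handling during ingest.
enum MetadataRecord {
    // variants elided in this sketch
}

/// Everything the pageserver needs in order to ingest one WAL record.
struct InterpretedWalRecord {
    /// Optional special record to be handled explicitly.
    metadata_record: Option<MetadataRecord>,
    /// (key, value) pairs to be persisted to storage.
    blocks: Vec<(Key, Value)>,
    /// LSN of the original PG record.
    lsn: u64,
}

impl InterpretedWalRecord {
    /// Decode and interpret a raw PG WAL record.
    fn from_bytes(_raw: &[u8], lsn: u64) -> Self {
        // Sketch only: the real decoding lives in wal_decoder.
        InterpretedWalRecord {
            metadata_record: None,
            blocks: Vec::new(),
            lsn,
        }
    }
}
```
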
@erikgrinaker erikgrinaker removed their assignment Nov 8, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 25, 2024
…#9746)

## Problem

For any given tenant shard, pageservers receive all of the tenant's WAL from the safekeeper. This soft-blocks us from using larger shard counts due to bandwidth concerns and the CPU overhead of filtering out the records.

## Summary of changes

This PR lifts the decoding and interpretation of WAL from the pageserver into the safekeeper.

A customised PG replication protocol is used where, instead of sending raw WAL, the safekeeper sends filtered, interpreted records. The receiver drives the protocol selection, so, on the pageserver side, usage of the new protocol is gated by a new pageserver config: `wal_receiver_protocol`.

More granularly, the changes are:
1. Optionally inject the protocol and shard identity into the arguments used for starting replication.
2. On the safekeeper side, implement a new WAL sending primitive which decodes and interprets records before sending them over.
3. On the pageserver side, implement the ingestion of this new replication message type. It's very similar to what we already have for raw WAL (minus decoding and interpreting).
 
## Notes

* This PR currently uses my [branch of rust-postgres](https://github.com/neondatabase/rust-postgres/tree/vlad/interpreted-wal-record-replication-support) which includes the deserialization logic for the new replication message type. PR for that is open [here](neondatabase/rust-postgres#32).
* This PR contains changes for both pageservers and safekeepers. It's safe to merge because the new protocol is disabled by default on the pageserver side. We can gradually start enabling it in subsequent releases.
* CI tests are running on #9747

## Links

Related: #9336
Epic: #9329
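
A hedged sketch of how receiver-driven protocol selection can look (the enum mirrors the `wal_receiver_protocol` config mentioned above; the variant names and replication option strings are illustrative assumptions, not the exact upstream ones):

```rust
/// Which WAL receiver protocol the pageserver asks the safekeeper for.
#[derive(Clone, Copy, Debug)]
enum WalReceiverProtocol {
    /// Safekeeper streams raw WAL; the pageserver decodes and filters (old behaviour).
    Vanilla,
    /// Safekeeper decodes, interprets and filters; the pageserver ingests directly.
    Interpreted,
}

/// Build the extra options the pageserver passes when starting replication.
/// The receiver drives the choice, so an old pageserver simply never asks for
/// the new protocol and the safekeeper keeps sending raw WAL.
fn replication_options(
    protocol: WalReceiverProtocol,
    shard_number: u32,
    shard_count: u32,
) -> Vec<(String, String)> {
    match protocol {
        WalReceiverProtocol::Vanilla => vec![],
        WalReceiverProtocol::Interpreted => vec![
            ("proto_version".into(), "interpreted".into()),
            ("shard_number".into(), shard_number.to_string()),
            ("shard_count".into(), shard_count.to_string()),
        ],
    }
}
```
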
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
…#9821)

## Problem

#9746 lifted decoding and interpretation of WAL to the safekeeper. This reduced the ingested amount on the pageservers by around 10x for a tenant with 8 shards, but doubled the ingested amount for single-sharded tenants.

Also, #9746 uses bincode, which doesn't support schema evolution. Technically the schema can be evolved, but it's very cumbersome.

## Summary of changes

This patch set addresses both problems by adding protobuf support for the interpreted WAL records and adding compression support. Compressed protobuf reduced the ingested amount by 100x in the 32-shard `test_sharded_ingest` case (compared to non-interpreted proto). For the 1-shard case the reduction is 5x.

Sister change to `rust-postgres` is
[here](neondatabase/rust-postgres#33).

## Links

Related: #9336
Epic: #9329
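
A minimal sketch of the serialize-then-compress step, assuming the `prost` and `zstd` crates; the message definition and compression level here are stand-ins, not the real wire schema or the settings used in the PR:

```rust
use prost::Message;

/// Stand-in protobuf message for a batch of interpreted records.
#[derive(Clone, PartialEq, prost::Message)]
struct InterpretedWalRecordsProto {
    #[prost(uint64, tag = "1")]
    next_record_lsn: u64,
    #[prost(bytes = "vec", repeated, tag = "2")]
    records: Vec<Vec<u8>>,
}

fn encode_batch(batch: &InterpretedWalRecordsProto) -> std::io::Result<Vec<u8>> {
    // Protobuf gives us schema evolution (unlike bincode); zstd shrinks the
    // repetitive page images/deltas considerably.
    let proto_bytes = batch.encode_to_vec();
    zstd::encode_all(proto_bytes.as_slice(), /* level */ 3)
}

fn decode_batch(wire: &[u8]) -> Result<InterpretedWalRecordsProto, Box<dyn std::error::Error>> {
    let proto_bytes = zstd::decode_all(wire)?;
    Ok(InterpretedWalRecordsProto::decode(proto_bytes.as_slice())?)
}
```
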
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
## Problem

Can't change protocol at tenant granularity.

## Summary of changes

Add tenant config level override for wal receiver protocol.

## Links

Related: #9336
Epic: #9329
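
The override resolution follows the usual "tenant setting wins, otherwise fall back to the pageserver default" pattern; a minimal sketch with illustrative field names (not the exact config structs):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum WalReceiverProtocol {
    Vanilla,
    Interpreted,
}

/// Node-wide pageserver configuration (sketch).
struct PageserverConf {
    wal_receiver_protocol: WalReceiverProtocol,
}

/// Per-tenant configuration (sketch): unset means "use the node default".
struct TenantConf {
    wal_receiver_protocol_override: Option<WalReceiverProtocol>,
}

fn effective_protocol(node: &PageserverConf, tenant: &TenantConf) -> WalReceiverProtocol {
    tenant
        .wal_receiver_protocol_override
        .unwrap_or(node.wal_receiver_protocol)
}
```
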
@VladLazar
Contributor Author

VladLazar commented Dec 20, 2024

Short pre-winter-holiday update.

Once we're back:

  • Continue deployment of sharded ingest to all other regions
  • Tests and tweaks to the fan-out wal reader (e.g. use futures when not sharded to minimise task spawning)
  • Separate roll-out of fan-out wal reader

github-merge-queue bot pushed a commit that referenced this issue Jan 15, 2025
## Problem

Safekeepers currently decode and interpret WAL for each shard separately. This is wasteful in terms of CPU and memory usage; we've seen this in profiles.

## Summary of changes

Fan-out interpreted WAL to multiple shards. The basic idea is that WAL decoding and interpretation happen in a separate tokio task and senders attach to it. Senders only receive batches concerning their shard, and only past the LSN they've last seen.

Fan-out is gated behind the `wal_reader_fanout` safekeeper flag (disabled by default for now).

When fan-out is enabled, it might be desirable to control the absolute delta between the current position and a new shard's desired position (i.e. how far behind or ahead a shard may be). `max_delta_for_fanout` is a new optional safekeeper flag which dictates whether to create a new WAL reader or attach to the existing one. By default, this behaviour is disabled. Let's consider enabling it if we spot the need for it in the field.

## Testing

Tests passed [here](#10301) with wal reader fanout enabled as of 34f6a71.

Related: #9337
Epic: #9329
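
A hedged sketch of the fan-out shape described above, assuming tokio mpsc channels (the real implementation also manages subscriptions, back-pressure and the `max_delta_for_fanout` decision; this only shows the decode-once, send-per-shard idea):

```rust
use tokio::sync::mpsc;

/// One batch of interpreted records destined for a single shard (sketch).
#[derive(Clone)]
struct InterpretedBatch {
    shard: u32,
    end_lsn: u64,
    // ... interpreted records for this shard ...
}

/// A shard's sender, attached to the shared reader task.
struct ShardSubscriber {
    shard: u32,
    last_seen_lsn: u64,
    tx: mpsc::Sender<InterpretedBatch>,
}

/// One reader/decoder task fans interpreted batches out to per-shard senders,
/// each of which only gets its own shard's data past its start LSN.
async fn fanout_reader(
    mut wal_batches: mpsc::Receiver<Vec<InterpretedBatch>>, // decoded once, split per shard
    subscribers: Vec<ShardSubscriber>,
) {
    while let Some(per_shard) = wal_batches.recv().await {
        for batch in per_shard {
            for sub in &subscribers {
                // Only this shard's data, and only WAL the shard hasn't seen yet.
                if sub.shard == batch.shard && batch.end_lsn > sub.last_seen_lsn {
                    // Ignore send errors: a disconnected shard just stops receiving.
                    let _ = sub.tx.send(batch.clone()).await;
                }
            }
        }
    }
}
```
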
github-merge-queue bot pushed a commit that referenced this issue Jan 20, 2025
## Summary

Whereas currently we send all WAL to all pageserver shards, and each shard filters out the data that it needs, in this RFC we add a mechanism to filter the WAL on the safekeeper, so that each shard receives only the data it needs.

This will place some extra CPU load on the safekeepers, in exchange for reducing the network bandwidth for ingesting WAL back to scaling as O(1) with shard count, rather than O(N_shards).

Touches #9329.


Co-authored-by: Vlad Lazar <vlalazar.vlad@gmail.com>
Co-authored-by: Vlad Lazar <vlad@neon.tech>
awarus pushed a commit that referenced this issue Jan 24, 2025