pageserver: refactor ingest inplace to decouple decoding and handling #9472

VladLazar · 2024-10-21T17:20:29Z

Problem

WAL ingest couples decoding of special records with their handling (updates to the storage engine mostly).
This is a roadblock for our plan to move WAL filtering (and implicitly decoding) to safekeepers since they cannot
do writes to the storage engine.

Summary of changes

This PR decouples the decoding of the special WAL records from their application. The changes are done in place
and I've done my best to refrain from refactorings and attempted to preserve the original code as much as possible.
The commit messages flag deviations.

Reviewer Note

You will notice that this is a somewhat transient state. This is intentional. I'd like this review to focus on ensuring
that I've not changed the handling of these special records.

Future PRs will:

Move all the new types and decoding logic into a crate of its own
Create separate instances for these types for each supported pg version
Further refactor ingest to take a SerializedBatch and a series of special records for handling (types defined in this PR)

Related: #9335
Epic: #9329

Checklist before requesting a review

I have performed a self-review of my code.
If it is a core feature, I have added thorough tests.
Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

Do not forget to reformat commit message to not include the above checklist

github-actions · 2024-10-21T18:23:11Z

5264 tests run: 5050 passed, 0 failed, 214 skipped (full report)

Flaky tests (1)

Postgres 15

test_s3_wal_replay: release-arm64

Code coverage* (full report)

functions: 31.4% (7671 of 24445 functions)
lines: 48.8% (60374 of 123675 lines)

* collected from Rust tests only

_{The comment gets automatically updated with the latest test results
02fbac3 at 2024-10-24T11:37:52.072Z :recycle:}

This one is a bit less obvious than the previous ones. I merged some of the logic that was previously in `WalIngest::ingest_record` to `WalIngest::ingest_xact_record`.

This is an odd one. It requires the current checkpoint value to decide what to do. That can't trivially be moved to the SK. It's possible with protocol change, but deferring decision for now. Hence, send the raw record and let the pageserver figure it out.

This will give us a nice evolution path when we want to add new actions for a record.

The goal of this commit is to make it clearer when we are ingesting the whole record versus when we are ingesting an action for the record. I also merged the VM bits clearing into one function since they were exactly the same.

problame

No good tooling to post review comments for individual commits (I reviewed commit by commit).
So, dumping comments here

         return Ok(Some(SmgrRecord::Truncate(SmgrTruncate {
                    rel,
                    to: truncate.blkno,
                })));
            }

            if (truncate.flags & pg_constants::SMGR_TRUNCATE_FSM) != 0 {
                let rel = RelTag {
                    spcnode,

In the original code, theoretically SMGR_TRUNCATE_HEAP and SMGR_TRUNCATE_FSM could both be set, so, it's not technically equivalent to return on the first matching bit. I guess in practice they're exclusive though? => should anyhow::ensure! that at the top of the function.

   let (xact_common, is_commit, is_prepared) = match record {
            XactRecord::Prepare(XactPrepare { xl_xid, data }) => {
                let xid: u64 = if modification.tline.pg_version >= 17 {
                    self.adjust_to_full_transaction_id(xl_xid)?
                } else {
                    xl_xid as u64
                };
                return modification.put_twophase_file(xid, data, ctx).await;
            }

Before this PR, this special case was for the twophase PREPARED records, i.e., XLOG_XACT_COMMIT_PREPARED and XLOG_XACT_ABORT_PREPARED. This PR moves that handling to the XactRecord::Prepare, which afaict is the wrong place. Needs to be in the cases for XactRecord::CommitPrepared and XactRecord::AbortPrepared.

Please create a follow-up task to address the .unwrap().unwrap() business that multiple commits added, and link it from the epic & this PR.
If we don't address this, inevitably eventually we'll make stuff fallible and cause a panic.


    fn decode_logical_message_record(
        buf: &mut Bytes,
        decoded: &DecodedWALRecord,
        _pg_version: u32,
    ) -> anyhow::Result<Option<LogicalMessageRecord>> {
        let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
        if info == pg_constants::XLOG_LOGICAL_MESSAGE {
            let xlrec = crate::walrecord::XlLogicalMessage::decode(buf);
            let buf_size = xlrec.prefix_size + xlrec.message_size;
            // TODO change decode function interface to take ownership of buf

I think it would be more natural, robust, and efficient, to filter the prefix during decode instead of during ingest, which is what this PR does right now.
It would also allow us to have an enum for the two variants (neon-test and neon-file:).

Independent of that discussion, did you check the curren tusers of neon-test wrt where the failpoint happens?
Like, probably all the tests that use it just expect that they can hold up pageserver walreceiver via the wal-ingest-logical-message-sleep .
But would make sense to check that / maybe rename the failpoint to pageserver-wal-ingest-... to make things a bit clearer.

Have you thought through what happens if the decoding sides pg_version (=SK's view of the world) gets out of sync with the ingesting side's pg_version (=PS's view of the world?
Do we want to protect against that by including the pg_version in each of the decoded records?
=> ok to do this later

 async fn ingest_heapam_record(
        &mut self,
        clear_vm_bits: ClearVmBits,
        modification: &mut DatadirModification<'_>,
        ctx: &RequestContext,
    ) -> anyhow::Result<()> {
        let ClearVmBits {

IMO this should still take a HeapamRecord to get more type safety.
Even though that enum only has one variant, ClearVmBits.
You can still destructure it painfree.

Same for ingest_neonrmgr_record

Hmm, the last commit that DRYs the clear VM bits code should be a separate PR IMO.
Can't speak to whether it's a wise choice or not to DRY the code.
Matthias would know, I think.

VladLazar · 2024-10-23T17:18:35Z

         return Ok(Some(SmgrRecord::Truncate(SmgrTruncate {
                    rel,
                    to: truncate.blkno,
                })));
            }

            if (truncate.flags & pg_constants::SMGR_TRUNCATE_FSM) != 0 {
                let rel = RelTag {
                    spcnode,
In the original code, theoretically SMGR_TRUNCATE_HEAP and SMGR_TRUNCATE_FSM could both be set, so, it's not technically equivalent to return on the first matching bit. I guess in practice they're exclusive though? => should anyhow::ensure! that at the top of the function.

Thank you for spotting this. I've adjusted the handling to allow for truncation of multiple fork types in the same record in 4580b54.

   let (xact_common, is_commit, is_prepared) = match record {
            XactRecord::Prepare(XactPrepare { xl_xid, data }) => {
                let xid: u64 = if modification.tline.pg_version >= 17 {
                    self.adjust_to_full_transaction_id(xl_xid)?
                } else {
                    xl_xid as u64
                };
                return modification.put_twophase_file(xid, data, ctx).await;
            }
Before this PR, this special case was for the twophase PREPARED records, i.e., XLOG_XACT_COMMIT_PREPARED and XLOG_XACT_ABORT_PREPARED. This PR moves that handling to the XactRecord::Prepare, which afaict is the wrong place. Needs to be in the cases for XactRecord::CommitPrepared and XactRecord::AbortPrepared.

I don't see it, but perhaps I'm missing something here. Let's work through it.

XLOG_XACT_PREPARE
a. Previous code: It was handled here by: adjusting the transaction id and calling put_twophase_file.
b. New code: It is now handled here with exactly the same code as previously
XLOG_XACT_ABORT_PREPARED or XLOG_XACT_COMMIT_PREPARED
a. Previous code: It was handled here by: ingesting the xact record, adjusting xid and, finally, dropping twophase files
b. New code: It is now handled here with is_prepared=true. It's hard to tell from the diff but the code block linked
previously merges two pre-existint code blocks: previous body of ingest_xact_record and epilogue of the old handling which does the xid adjustment and dropping of two phase files. In the new code, we do xid adjustment and dropping of two phase files at the end of ingest_xact_record for XLOG_XACT_ABORT_PREPARED or XLOG_XACT_COMMIT_PREPARED.

To me, this looks equiavalent to the old code, but let me know if you're not convinced.

Please create a follow-up task to address the .unwrap().unwrap() business that multiple commits added, and link it from the epic & this PR. If we don't address this, inevitably eventually we'll make stuff fallible and cause a panic.

I already have a draft of a follow-follow-up PR where this is fixed, but sure: #9491.

    fn decode_logical_message_record(
        buf: &mut Bytes,
        decoded: &DecodedWALRecord,
        _pg_version: u32,
    ) -> anyhow::Result<Option<LogicalMessageRecord>> {
        let info = decoded.xl_info & pg_constants::XLR_RMGR_INFO_MASK;
        if info == pg_constants::XLOG_LOGICAL_MESSAGE {
            let xlrec = crate::walrecord::XlLogicalMessage::decode(buf);
            let buf_size = xlrec.prefix_size + xlrec.message_size;
            // TODO change decode function interface to take ownership of buf
I think it would be more natural, robust, and efficient, to filter the prefix during decode instead of during ingest, which is what this PR does right now. It would also allow us to have an enum for the two variants (neon-test and neon-file:).

I like the suggestion. Done in 4f8c733.

Independent of that discussion, did you check the curren tusers of neon-test wrt where the failpoint happens? Like, probably all the tests that use it just expect that they can hold up pageserver walreceiver via the wal-ingest-logical-message-sleep . But would make sense to check that / maybe rename the failpoint to pageserver-wal-ingest-... to make things a bit clearer.

Renamed the failpoint in 9c7baec.

Have you thought through what happens if the decoding sides pg_version (=SK's view of the world) gets out of sync with the ingesting side's pg_version (=PS's view of the world? Do we want to protect against that by including the pg_version in each of the decoded records? => ok to do this later

Subtle breakage. Will work fine for 90% of records until you get to one with PG specific parsing or handling.
If we detect that (SK can include it's own pg version for the timeline), we should kill the wal connection and drop our current in-memory file (like any other error on the ingest path).
This is not really consistent right now, so will have to be thought through more and tested.

 async fn ingest_heapam_record(
        &mut self,
        clear_vm_bits: ClearVmBits,
        modification: &mut DatadirModification<'_>,
        ctx: &RequestContext,
    ) -> anyhow::Result<()> {
        let ClearVmBits {
IMO this should still take a HeapamRecord to get more type safety. Even though that enum only has one variant, ClearVmBits. You can still destructure it painfree.

Same for ingest_neonrmgr_record

I don't see how that's more type safe. Are you alluding to the scenario where we call ingest_heapam_record from the neonrmgr handling? That's why I want to keep the DRI
change for ClearVmBits.

Hmm, the last commit that DRYs the clear VM bits code should be a separate PR IMO. Can't speak to whether it's a wise choice or not to DRY the code. Matthias would know, I think.

See above.

problame

To me, this looks equiavalent to the old code, but let me know if you're not convinced.

Ok, I hadn't seen the inlining / DRYing of the body of ingest_xact_record.

pageserver/src/walingest.rs

VladLazar added 3 commits October 16, 2024 17:11

pagesver: refactor heapam records

9ea2f32

pageserver: refactor neonrmgr records

ee2a962

pageserver: refactor smgr records

ceadcc4

VladLazar added 3 commits October 22, 2024 11:46

pageserver: refactor dbase records

6355aac

pageserver: refactor clog records

6fdba1a

pageserver: refactor xact records

813bd02

This one is a bit less obvious than the previous ones. I merged some of the logic that was previously in `WalIngest::ingest_record` to `WalIngest::ingest_xact_record`.

VladLazar force-pushed the vlad/refactor-ingest branch from c3f31fa to 4b155c1 Compare October 22, 2024 09:46

VladLazar added 8 commits October 22, 2024 12:37

pageserver: refactor multixact records

79bd6f8

pageserver: refactor relmap record

2dc64ae

pageserer: refactor logical message record

e2d008b

pageserver: refactor standby record

d37d6f5

pageserver: refactor replorigin record

345ae45

pageserver: wrap all special records in an enum

84099e4

This will give us a nice evolution path when we want to add new actions for a record.

pageserver: rename ingest functions

991a4c0

The goal of this commit is to make it clearer when we are ingesting the whole record versus when we are ingesting an action for the record. I also merged the VM bits clearing into one function since they were exactly the same.

VladLazar force-pushed the vlad/refactor-ingest branch from 4b155c1 to 991a4c0 Compare October 22, 2024 10:37

VladLazar changed the title ~~Vlad/refactor ingest~~ pageserver: refactor ingest inplace to decouple decoding and handling Oct 22, 2024

VladLazar marked this pull request as ready for review October 22, 2024 10:47

VladLazar requested a review from a team as a code owner October 22, 2024 10:47

VladLazar requested review from arpad-m and problame and removed request for arpad-m October 22, 2024 10:47

problame reviewed Oct 22, 2024

View reviewed changes

jcsp requested a review from skyzh October 23, 2024 14:24

review: allow for multiple types of smgr truncation

4580b54

VladLazar mentioned this pull request Oct 23, 2024

pageserver/walingest: clean up post-interpretation unwraps in walingest.rs #9491

Closed

VladLazar added 2 commits October 23, 2024 19:01

review: parse logical message prefix on decode

4f8c733

review: rename failpoint

9c7baec

ci: format python code

dcd2e20

VladLazar requested a review from problame October 23, 2024 17:21

problame approved these changes Oct 23, 2024

View reviewed changes

hlinnaka reviewed Oct 23, 2024

View reviewed changes

pageserver/src/walingest.rs Show resolved Hide resolved

VladLazar added 2 commits October 24, 2024 12:05

pageserver/walingest: fix failpoint name typo

3f550db

review: add a comment for the types

02fbac3

VladLazar merged commit 5069123 into main Oct 24, 2024
81 checks passed

VladLazar deleted the vlad/refactor-ingest branch October 24, 2024 16:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pageserver: refactor ingest inplace to decouple decoding and handling #9472

pageserver: refactor ingest inplace to decouple decoding and handling #9472

VladLazar commented Oct 21, 2024 •

edited

Loading

github-actions bot commented Oct 21, 2024 •

edited

Loading

Postgres 15

problame left a comment

VladLazar commented Oct 23, 2024

problame left a comment

pageserver: refactor ingest inplace to decouple decoding and handling #9472

pageserver: refactor ingest inplace to decouple decoding and handling #9472

Conversation

VladLazar commented Oct 21, 2024 • edited Loading

Problem

Summary of changes

Reviewer Note

Checklist before requesting a review

Checklist before merging

github-actions bot commented Oct 21, 2024 • edited Loading

5264 tests run: 5050 passed, 0 failed, 214 skipped (full report)

Postgres 15

Code coverage* (full report)

problame left a comment

Choose a reason for hiding this comment

VladLazar commented Oct 23, 2024

problame left a comment

Choose a reason for hiding this comment

VladLazar commented Oct 21, 2024 •

edited

Loading

github-actions bot commented Oct 21, 2024 •

edited

Loading