Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix init of WAL page header at startup #8914

Merged
merged 4 commits into from
Sep 21, 2024
Merged

Conversation

hlinnaka
Copy link
Contributor

@hlinnaka hlinnaka commented Sep 4, 2024

If the primary is started at an LSN within the first of a 16 MB WAL segment, the "long XLOG page header" at the beginning of the segment was not initialized correctly. That has gone unnnoticed, because under normal circumstances, nothing looks at the page header. The WAL that is streamed to the safekeepers starts at the new record's LSN, not at the beginning of the page, so that bogus page header didn't propagate elsewhere, and a primary server doesn't normally read the WAL its written. Which is good because the contents of the page would be bogus anyway, as it wouldn't contain any of the records before the LSN where the new record is written.

Except that in the following cases a primary does read its own WAL:

  1. When there are two-phase transactions in prepared state at checkpoint. The checkpointer reads the two-phase state from the XLOG_XACT_PREPARE record, and writes it to a file in pg_twophase/.

  2. Logical decoding reads the WAL starting from the replication slot's restart LSN.

This PR fixes the problem with two-phase transactions. For that, it's sufficient to initialize the page header correctly. The checkpointer only needs to read XLOG_XACT_PREPARE records that were generated after the server startup, so it's still OK that older WAL is missing / bogus.

I have not investigated if we have a problem with logical decoding, however. Let's deal with that separately.

@hlinnaka hlinnaka requested a review from a team as a code owner September 4, 2024 11:54
@hlinnaka hlinnaka requested review from ololobus and removed request for a team September 4, 2024 11:54
@hlinnaka
Copy link
Contributor Author

hlinnaka commented Sep 4, 2024

I happened to notice this bug while I was writing a new test case for an issue with prepared transactions on the v17 branch. It turned out to be a pre-existing bug. (There were other issues with prepared transactions on the v17 branch too, though)

@hlinnaka hlinnaka requested a review from MMeent September 4, 2024 11:55
@hlinnaka hlinnaka force-pushed the fix-twophase-checkpoint branch from b8698da to 86a7ced Compare September 4, 2024 11:58
@MMeent
Copy link
Contributor

MMeent commented Sep 4, 2024

@hlinnaka is the target branch the one you want this to be merged into? It seems like a fairly arbitrary branch to me.

Copy link

github-actions bot commented Sep 4, 2024

4977 tests run: 4813 passed, 0 failed, 164 skipped (full report)


Flaky tests (6)

Postgres 17

Postgres 16

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 31.9% (7435 of 23317 functions)
  • lines: 49.9% (59921 of 120118 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
caea18d at 2024-09-20T23:57:30.168Z :recycle:

hlinnaka added a commit that referenced this pull request Sep 4, 2024
This doesn't include the new test from that PR, however, because it
fails without the corresponding code changes, which we don't have on
this branch yet.
Base automatically changed from fix-v14-submodule-reference to main September 4, 2024 14:41
@hlinnaka hlinnaka force-pushed the fix-twophase-checkpoint branch from 86a7ced to 94d0f00 Compare September 4, 2024 17:08
@hlinnaka
Copy link
Contributor Author

hlinnaka commented Sep 4, 2024

@hlinnaka is the target branch the one you want this to be merged into? It seems like a fairly arbitrary branch to me.

Right, it was just because I made those other change first and knew they were going to be merged soon. It's been merged now, so the point is moot.

Copy link
Contributor

@MMeent MMeent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Please update the submodules after merging the PG PRs but before merging this one.

@hlinnaka hlinnaka force-pushed the fix-twophase-checkpoint branch from 0df5cf1 to 07613c3 Compare September 7, 2024 07:02
hlinnaka added a commit that referenced this pull request Sep 9, 2024
This doesn't include the new test from that PR, however, because it
fails without the corresponding code changes, which we don't have on
this branch yet.
hlinnaka added a commit that referenced this pull request Sep 9, 2024
This doesn't include the new test from that PR, however, because it
fails without the corresponding code changes, which we don't have on
this branch yet.
@clarkbw
Copy link

clarkbw commented Sep 9, 2024

vendor/revisions.json Outdated Show resolved Hide resolved
@hlinnaka
Copy link
Contributor Author

Corresponding PRs in the postgres repo:

v17: neondatabase/postgres#506
v16: neondatabase/postgres#481
v15: neondatabase/postgres#482
v14: neondatabase/postgres#483

@hlinnaka
Copy link
Contributor Author

Rebased, this is now ready to be merged again. I need approval for the v17 PR: neondatabase/postgres#506

@hlinnaka hlinnaka force-pushed the fix-twophase-checkpoint branch 2 times, most recently from 86ac9c5 to a4f35fb Compare September 20, 2024 09:46
@hlinnaka
Copy link
Contributor Author

CI failure on one branch, seems to be flaky https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8914/10957027921/index.html#suites/572721f46de364f1302c76c19e8023c4/c4814c8a998655dd/:

PG:2024-09-20 10:41:59.644 GMT [313604] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:18364, voteGiven=1, epoch=2, flushLsn=0/2000000, truncateLsn=0/14EE280, timelineStartLsn=0/14EE280
PG:2024-09-20 10:41:59.644 GMT [313604] LOG:  [WP] got votes from majority (1) of nodes, term 3, epochStartLsn 0/2000000, donor 127.0.0.1:18364, truncate_lsn 0/14EE280
2024-09-20T10:41:59.668726Z  INFO starting activity monitor for postgresql://cloud_admin@127.0.0.1:18368/postgres?application_name=compute_activity_monitor
PG:2024-09-20 10:41:59.644 GMT [313604] PANIC:  [WP] collected propEpochStartLsn 0/2000000, but basebackup LSN 0/14EE280
PG:2024-09-20 10:41:59.673 GMT [313607] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:18361' with protocol version 2
PG:2024-09-20 10:41:59.753 GMT [313594] LOG:  background worker "WAL proposer" (PID 313604) was terminated by signal 6: Aborted

@arssher is this a known issue?

@hlinnaka
Copy link
Contributor Author

hlinnaka commented Sep 20, 2024

CI failure on one branch, seems to be flaky https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8914/10957027921/index.html#suites/572721f46de364f1302c76c19e8023c4/c4814c8a998655dd/:

PG:2024-09-20 10:41:59.644 GMT [313604] LOG:  [WP] got VoteResponse from acceptor 127.0.0.1:18364, voteGiven=1, epoch=2, flushLsn=0/2000000, truncateLsn=0/14EE280, timelineStartLsn=0/14EE280
PG:2024-09-20 10:41:59.644 GMT [313604] LOG:  [WP] got votes from majority (1) of nodes, term 3, epochStartLsn 0/2000000, donor 127.0.0.1:18364, truncate_lsn 0/14EE280
2024-09-20T10:41:59.668726Z  INFO starting activity monitor for postgresql://cloud_admin@127.0.0.1:18368/postgres?application_name=compute_activity_monitor
PG:2024-09-20 10:41:59.644 GMT [313604] PANIC:  [WP] collected propEpochStartLsn 0/2000000, but basebackup LSN 0/14EE280
PG:2024-09-20 10:41:59.673 GMT [313607] LOG:  [NEON_SMGR] [shard 0] libpagestore: connected to 'postgresql://no_user@localhost:18361' with protocol version 2
PG:2024-09-20 10:41:59.753 GMT [313594] LOG:  background worker "WAL proposer" (PID 313604) was terminated by signal 6: Aborted

@arssher is this a known issue?

The test case up to the failure is basically just:

  1. Create branch, start an endpoint on it
  2. SELECT pg_switch_wal()
  3. Immediately stop the endpoint
  4. Start a new endpoint

@arssher
Copy link
Contributor

arssher commented Sep 20, 2024

is this a known issue?

I haven't seen exactly this sequence with xlog switch, but it is kind of expected behaviour. CI failure misses first ep logs (it is better to do endpoint.stop(mode="immediate") to preserve them instead of endpoint.stop_and_destroy()), but I think the following happens:

  1. SELECT pg_switch_wal() is done at first compute;
  2. first compute is SIGKILL'ed
  3. second compute starts and observers that sk position is 0/14EE280, nothing to sync, and fetches basebackup at this pos
  4. safekeeper gets from TCP stream WAL switch record and bumps flush position to 0/2000000
  5. basebackup is now spoiled because it is taken at wrong LSN position.

So I'd suggest to add sleep/retry until endpoint starts well. Alternatively safekeeper exposes list of walreceivers and we can poll sk until walreceiver on it is gone.

@hlinnaka
Copy link
Contributor Author

is this a known issue?

I haven't seen exactly this sequence with xlog switch, but it is kind of expected behaviour.

I was able to reproduce this locally with:

    for i in range(1, 2000):
        branch_name = f"test_twophase{i}"
        env.neon_cli.create_branch(branch_name)
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()
        endpoint = env.endpoints.create_start(branch_name)
        endpoint.safe_psql("SELECT pg_switch_wal()")
        endpoint.stop_and_destroy()

failed after about ~100 iterations

CI failure misses first ep logs (it is better to do endpoint.stop(mode="immediate") to preserve them instead of endpoint.stop_and_destroy()), but I think the following happens:

1. SELECT pg_switch_wal() is done at first compute;

2. first compute is SIGKILL'ed

3. second compute starts and observers that sk position is 0/14EE280, nothing to sync, and fetches basebackup at this pos

4. safekeeper gets from TCP stream WAL switch record and bumps flush position to 0/2000000

5. basebackup is now spoiled because it is taken at wrong LSN position.

So I'd suggest to add sleep/retry until endpoint starts well. Alternatively safekeeper exposes list of walreceivers and we can poll sk until walreceiver on it is gone.

Hmm, isn't this a potential problem in production too?

@hlinnaka
Copy link
Contributor Author

Opened #9079 to track that flakiness, since it seems unrelated to this PR. In this PR, to work around this, I'll add a call to wait for the WAL to be flushed before stopping the compute.

@hlinnaka hlinnaka force-pushed the fix-twophase-checkpoint branch 2 times, most recently from 34978e2 to c42856d Compare September 20, 2024 14:27
@hlinnaka hlinnaka enabled auto-merge (squash) September 20, 2024 14:30
@hlinnaka hlinnaka disabled auto-merge September 20, 2024 14:57
If the primary is started at an LSN within the first of a 16 MB WAL
segment, the "long XLOG page header" at the beginning of the segment
was not initialized correctly. That has gone unnnoticed, because under
normal circumstances, nothing looks at the page header. The WAL that
is streamed to the safekeepers starts at the new record's LSN, not at
the beginning of the page, so that bogus page header didn't propagate
elsewhere, and a primary server doesn't normally read the WAL its
written. Which is good because the contents of the page would be
bogus anyway, as it wouldn't contain any of the records before the LSN
where the new record is written.

Except that in the following cases a primary does read its own WAL:

1. When there are two-phase transactions in prepared state at checkpoint.
   The checkpointer reads the two-phase state from the XLOG_XACT_PREPARE
   record, and writes it to a file in pg_twophase/.

2. Logical decoding reads the WAL starting from the replication slot's
   restart LSN.

This PR fixes the problem with two-phase transactions. For that, it's
sufficient to initialize the page header correctly. The checkpointer
only needs to read XLOG_XACT_PREPARE records that were generated after
the server startup, so it's still OK that older WAL is missing / bogus.

I have not investigated if we have a problem with logical decoding,
however. Let's deal with that separately.

Special thanks to @Lzjing-1997, who independently found the same bug
and opened a PR to fix it, although I did not use that PR.
@hlinnaka hlinnaka force-pushed the fix-twophase-checkpoint branch from c42856d to 437b09a Compare September 20, 2024 17:30
I hope I got them right now...
@hlinnaka hlinnaka merged commit 9a32aa8 into main Sep 21, 2024
79 checks passed
@hlinnaka hlinnaka deleted the fix-twophase-checkpoint branch September 21, 2024 01:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants