Logical Replication transactions applied multiple times #6977

save-buffer · 2024-02-29T23:11:49Z

Steps to reproduce

Make Neon a subscriber to some workload that produces several inserts/sec on a table with a primary key. Then restart it. Replication will fail because of duplicate key insert.

Expected result

No duplicate key insert

Actual result

Duplicate key insert

Environment

Logs, links

Discussion: https://neondb.slack.com/archives/C04DGM6SMTM/p1708363190710839

It seems that pageserver isn't applying advancing replorigin, and so the compute's origin when it restarts is whatever was in the last checkpoint.

kelvich · 2024-03-12T08:28:35Z

To achieve exactly once semantics for incoming logical replication postgres stamps each commit record with a 10-byte tuple of (origin_id, origin_lsn), where origin_id is two byte unique identifier of the replication source and origin_lsn is 8-byte lsn on the source side. Postgres maintains in-memory mapping of origin_id => origin_lsn and writes it to disk on each checkpoint including a shutdown checkpoint. On a normal start postgres would just read this disk state. During start after a crash postgres will load last on-disk state and on WAL replay it will update in-memory values on each commit record to end up in a consistent state.

That way postgres always has knowledge of remote lsn of last applied record and in case of replication reconnect it can ask to continue replication from a proper point. For context: postgres replication is somewhat pull model -- it is on replica to reconnect and ask for data starting with some lsn.

How can we support it in Neon:

There are already wal records for creating and dropping replication origins (xl_replorigin_set / xl_replorigin_drop). No need to generate aux file logical messages for repl. origins.
We store all repl. origins in one more directory-like structure. Could be sorted array of [[origin_id, lsn],...] (denser) or hashmap as before.
Commit records serve as update records for repl. origins. We maintain in-memory commit counter and each 1k commits we write out new state of repl. origins in case there were some changes.
On basebackup:
- read disk state of repl. origins
- find last 1k of commit records (IIUC they are now attached as updates to CLOG and we somewhat simulate range scan?) and replay them to get latest origin_lsn values
- include up to date files in basebackup archive

knizhnik · 2024-03-12T13:19:36Z

To achieve exactly once semantics for incoming logical replication postgres stamps each commit record with a 10-byte tuple of (origin_id, origin_lsn), where origin_id is two byte unique identifier of the replication source and origin_lsn is 8-byte lsn on the source side.

Formally speaking it is not precisely true.
origin_id is not specified I commit record.
It is special kind of block_id (XLR_BLOCK_ID_DATA_SHORT, XLR_BLOCK_ID_DATA_LONG,... XLR_BLOCK_ID_DATA_SHORT).

Commit record may include origin_lsn&origin_timestamp (16 bytes totally) if XACT_XINFO_HAS_HAS_ORIGIN but is set in xinfo. But it is not principle.

knizhnik · 2024-03-12T13:41:08Z

There are already wal records for creating and dropping replication origins (xl_replorigin_set / xl_replorigin_drop). No need to generate aux file logical messages for repl. origins.

Well, presence of WAL record doesn't exclude necessity to have key with which this WAL record will be associated. Please notice that our storage is key-value storage. For example now commit records are associated with CLOG pages.
AUX_FILE is such key with which xl_replorigin_set / xl_replorigin_drop can be associated (right now them are not stored as WAL records but just as single-file entry, but in principle it is the same).

We store all repl. origins in one more directory-like structure. Could be sorted array of [[origin_id, lsn],...] (denser) or hashmap as before.

No problem with it. I have already implemented this part.

Commit records serve as update records for repl. origins. We maintain in-memory commit counter and each 1k commits we write out new state of repl. origins in case there were some changes.

Do you mean that we should add REPL_ORIGIN key and associate commit records with it? It means that we need to write commit in two places: CLOG and repl. origin.

On basebackup:
read disk state of repl. origins
find last 1k of commit records (IIUC they are now attached as updates to CLOG and we somewhat simulate range scan?) > and replay them to get latest origin_lsn values
include up to date files in basebackup archiveConcerning your proposal (1-4).

There is no any efficient way to locate N last commits.
Right now original commit records are not stored in KV storage at all. Walingest stores its own CLOG update records.

But I do not understand what do we need to load N last commits at all. If we introduce REPL_ORIGIN(id) key, then we can associate repl origin updates with this page. And we need to load just one value (preceding basebackup LSN).

The question is how to find all origins. In Postgres each slot has its own origin and we just iterate through all slots. PS knows nothing about replication slots. But it has AUX file with slot state. In principle it can get this files, parse it and so get array of origins. Alternatively we can use range scan to find all available REPL_ORIGIN(id) keys. Last approach seems to be less efficient but easier to implement and doesn't require PS to know format of replication slot state file.

Store logical replication origin in KV storage

Store logical replication origin in KV storage ## Problem See #6977 ## Summary of changes * Extract origin_lsn from commit WAl record * Add ReplOrigin key to KV storage and store origin_lsn * In basebackup replace snapshot origin_lsn with last committed origin_lsn at basebackup LSN ## Checklist before requesting a review - [ ] I have performed a self-review of my code. - [ ] If it is a core feature, I have added thorough tests. - [ ] Do we need to implement analytics? if so did you add the relevant metrics to the dashboard? - [ ] If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section. ## Checklist before merging - [ ] Do not forget to reformat commit message to not include the above checklist --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech> Co-authored-by: Alex Chi Z <chi@neon.tech>

hlinnaka · 2024-06-11T21:03:09Z

This was fixed by #7099

save-buffer added the t/bug Issue Type: Bug label Feb 29, 2024

kelvich added the c/cloud/compute label Mar 12, 2024

kelvich assigned knizhnik Mar 12, 2024

knizhnik pushed a commit that referenced this issue Mar 12, 2024

refer #6977

a6e7b7d

Store logical replication origin in KV storage

knizhnik mentioned this issue Mar 12, 2024

Store logical replication origin in KV storage #7099

Merged

5 tasks

knizhnik pushed a commit that referenced this issue May 8, 2024

refer #6977

40af963

Store logical replication origin in KV storage

knizhnik pushed a commit that referenced this issue May 9, 2024

refer #6977

8f47edc

Store logical replication origin in KV storage

knizhnik pushed a commit that referenced this issue Jun 1, 2024

refer #6977

4811917

Store logical replication origin in KV storage

knizhnik pushed a commit that referenced this issue Jun 3, 2024

refer #6977

35f72a3

Store logical replication origin in KV storage

hlinnaka closed this as completed Jun 11, 2024

stepashka added the c/compute Component: compute, excluding postgres itself label Jun 21, 2024

ololobus mentioned this issue Jul 12, 2024

Epic: inbound logical replication (Public Beta) #1632

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Logical Replication transactions applied multiple times #6977

Logical Replication transactions applied multiple times #6977

save-buffer commented Feb 29, 2024 •

edited

Loading

kelvich commented Mar 12, 2024

knizhnik commented Mar 12, 2024

knizhnik commented Mar 12, 2024 •

edited

Loading

hlinnaka commented Jun 11, 2024

Logical Replication transactions applied multiple times #6977

Logical Replication transactions applied multiple times #6977

Comments

save-buffer commented Feb 29, 2024 • edited Loading

Steps to reproduce

Expected result

Actual result

Environment

Logs, links

kelvich commented Mar 12, 2024

knizhnik commented Mar 12, 2024

knizhnik commented Mar 12, 2024 • edited Loading

hlinnaka commented Jun 11, 2024

save-buffer commented Feb 29, 2024 •

edited

Loading

knizhnik commented Mar 12, 2024 •

edited

Loading