Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assorted streaming fixes, including skipping of empty transactions. #257

Merged
merged 3 commits into from
Apr 20, 2023

Conversation

dimitri
Copy link
Owner

@dimitri dimitri commented Apr 19, 2023

Should fix #254. Or at least help with it.

Also should fix issues with SWITCH WAL messages found in the middle of JSON files, when it comes from a situation where streaming has been restarted.

Copy link
Contributor

@shubhamdhama shubhamdhama left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimitri With a mild load and an idle system, the lag is almost zero, which is a major improvement. However, when the system experiences higher loads, the problem still remains, as you can expect. Currently, this improvement seems adequate considering the scope of this PR for the messages that are empty transactions, but the problem will be entirely solved once the issue raised in this comment #254 (comment) is addressed.

@shubhamdhama
Copy link
Contributor

shubhamdhama commented Apr 19, 2023

I have an interesting case I observed earlier (migration using main pgcopydb, not based on this PR),

{"action":"C","xid":"64395","lsn":"16/A601C0B0","timestamp":"2023-04-18 08:45:03.608755+0530","message":{"action":"C","xid":64395}}
{"action":"B","xid":"64396","lsn":"16/A601C0E8","timestamp":"2023-04-18 08:45:13.624321+0530","message":{"action":"B","xid":64396}}
{"action":"C","xid":"64396","lsn":"16/A601C190","timestamp":"2023-04-18 08:45:13.624321+0530","message":{"action":"C","xid":64396}}
{"action":"X","lsn":"16/A7000028"}
{"action":"B","xid":"64397","lsn":"16/A601D618","timestamp":"2023-04-18 08:45:29.842951+0530","message":{"action":"B","xid":64397}}
{"action":"X","lsn":"16/A7003600"}

which led to,

COMMIT; -- {"xid":64395,"lsn":"16/A601C0B0","timestamp":"2023-04-18 08:45:03.608755+0530"}
BEGIN; -- {"xid":64396,"lsn":"16/A601C0E8","timestamp":"2023-04-18 08:45:13.624321+0530"}
COMMIT; -- {"xid":64396,"lsn":"16/A601C190","timestamp":"2023-04-18 08:45:13.624321+0530"}
-- SWITCH WAL {"lsn":"16/A7000028"}
-- SWITCH WAL {"lsn":"16/A7003600"}

and hence migration failed,

2023-04-18 14:15:34 32860 INFO   ld_apply.c:266            The pgcopydb sentinel has enabled applying changes
2023-04-18 14:15:34 32860 INFO   ld_apply.c:93             Catching-up with changes from LSN 16/A60146F0
2023-04-18 14:15:34 32860 INFO   ld_apply.c:334            Replaying changes from file "pg/pgcopydb/Debug/date-23-04-18--time-14-13-44-all_types/cdc/0000000100000016000000A6.sql"
2023-04-18 14:15:34 32860 ERROR  ld_apply.c:364            SWITCH command for LSN 16/A7000028 found in line 37, before last line 38
2023-04-18 14:15:34 32857 ERROR  follow.c:748              Subprocess catchup with pid 32860 has exited with error code 12

I can't understand how this could happen.

Also, can you please elaborate a bit what was earlier missed around SWITCH messages that is fixed in this PR?

@dimitri
Copy link
Owner Author

dimitri commented Apr 20, 2023

Also, can you please elaborate a bit what was earlier missed around SWITCH messages that is fixed in this PR?

We missed updating the “latest” symlink at close file time, meaning that when restarting streaming later we would re-fetch LSN positions that made it to disk locally already in the previous round. I believe that could explain double SWITCH entries in the JSON files.

Now that we need to keep around the previous LogicalMessageMetadata, we also
need to keep around its JSON buffer. This means we can't re-use the pointer
to the current context buffer, and we need to duplicate memory even when
using wal2json and keeping the same JSON buffer as received.

In passing, adjust the JSON files in the test suite according to the new
empty transactions filtering and synthetic KEEPALIVE situation.
The exit code 141 is 128 + 13, and 13 is SIGPIPE. It means that in our
cdc-low-level test case when doing receive | transform | apply, one of the
processes is trying to write to its output pipe when the next process has
already exited, or at least closed its pipe input file descriptor.

By ignoring SIGPIPE we ask the system to return EPIPE at write time. Let's
now see how our code handles that.
@dimitri dimitri merged commit 0c4e433 into main Apr 20, 2023
@dimitri dimitri deleted the fix/assorted-streaming-fixes branch April 20, 2023 10:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lag between stream and apply during replay phase keeps getting worse
2 participants