Problem: Resume fails after interrupted replication connection #707

Merged
merged 2 commits into from
Mar 21, 2024

arajkumar
Contributor

When the replication connection is interrupted by a network or server issue, the existing retry logic reconnects, but the resume still fails. The interruption can leave the last transaction partially written, so once the stream restarts, the ld_transform module fails when it attempts to transform the next message.

Solution: Add a ROLLBACK message when the last transaction is partially written and the stream is in retry mode.
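The idea behind the fix can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pgcopydb implementation (which is written in C): the function names `open_transaction_xid` and `resume_stream` are hypothetical, the message shape is reduced to the `action` and `xid` fields seen in the logs, and the action codes `"C"` (commit) and `"R"` (rollback) are assumed for the sake of the example.

```python
import json


def open_transaction_xid(lines):
    """Return the xid of a transaction left open by `lines`, or None.

    Assumption: "B" opens a transaction, "C" and "R" close it.
    """
    open_xid = None
    for line in lines:
        msg = json.loads(line)
        if msg["action"] == "B":
            open_xid = msg["xid"]
        elif msg["action"] in ("C", "R"):
            open_xid = None
    return open_xid


def resume_stream(written, retrying):
    """Return the messages to keep before streaming resumes.

    If we are retrying and the last transaction was only partially
    written, append a ROLLBACK so the next BEGIN starts cleanly.
    """
    out = list(written)
    xid = open_transaction_xid(out)
    if retrying and xid is not None:
        out.append(json.dumps({"action": "R", "xid": xid}))
    return out
```

Calling `resume_stream` on a list that ends mid-transaction with `retrying=True` appends one rollback message; a list that ends with a commit is returned unchanged.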

To reproduce the interruption, run the following query in psql while DML is running on the source. With `\watch 5`, it cancels the pgcopydb replication backend every 5 seconds.

SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE application_name LIKE '%pgcopydb%' AND query ILIKE '%START_REPL%';

\watch 5

The following errors are logged to the console when the stream is interrupted:

11:40:54.061 94512 ERROR unexpected termination of replication stream:
11:40:54.061 94512 ERROR ERROR: canceling statement due to user request
11:40:54.061 94512 ERROR CONTEXT: slot "pgcopydb", output plugin "test_decoding", in the change callback, associated LSN 7B/704E6A18
11:40:54.061 94512 WARN Streaming got interrupted at 7B/704E6A18, reconnecting in 1s

Without this fix, ld_transform fails with the following errors:

2024-02-28 18:44:44.785 37 ERROR ld_transform.c:1099 Failed to parse BEGIN: transaction already in progress
2024-02-28 18:44:44.785 37 ERROR ld_transform.c:924 Failed to parse JSON message: {"action":"B","xid":"3560764890","lsn":"DD68/8ADC210","timestamp":"2024-02-28 17:59:31.756879+0000","message":{"action":"B","xid":3560764890}}
2024-02-28 18:44:44.789 37 ERROR ld_transform.c:686 Stream transform worker encountered 1 errors, see above for details
2024-02-28 18:44:44.789 37 INFO follow.c:758 Transform process has terminated
2024-02-28 18:44:44.855 19 ERROR follow.c:993 Subprocess transform with pid 37 has exited with error code 12
2024-02-28 18:44:45.016 19 ERROR follow.c:471 Failed to transform 2 messages from the queue, see above for details
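The failure mode in these logs can be reproduced with a toy model of the transform step. The sketch below is an assumption-laden illustration, not the code in `ld_transform.c`: the `transform` function and `TransformError` class are hypothetical, and the `"C"` and `"R"` action codes are assumed. It shows why a `BEGIN` that arrives while an earlier transaction is still open is rejected, and why a rollback message in between lets processing continue.

```python
import json


class TransformError(Exception):
    """Raised when the message sequence is not a valid transaction stream."""


def transform(lines):
    """Walk the messages and return how many were accepted.

    Assumption: "B" opens a transaction, "C" and "R" close it,
    and any other action is a change inside the transaction.
    """
    in_transaction = False
    accepted = 0
    for line in lines:
        action = json.loads(line)["action"]
        if action == "B":
            if in_transaction:
                raise TransformError(
                    "Failed to parse BEGIN: transaction already in progress"
                )
            in_transaction = True
        elif action in ("C", "R"):
            in_transaction = False
        accepted += 1
    return accepted
```

Feeding it a `BEGIN`, an insert, and then a second `BEGIN` raises `TransformError`; inserting a rollback message before the second `BEGIN` makes the same sequence pass.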

Signed-off-by: Arunprasad Rajkumar <ar.arunprasad@gmail.com>
@dimitri dimitri merged commit 0ce70cb into dimitri:main Mar 21, 2024
18 checks passed