Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lsn for KEEPALIVE action. #164

Merged
merged 2 commits into from
Jan 16, 2023

Conversation

shubhamdhama
Copy link
Contributor

@shubhamdhama shubhamdhama commented Dec 9, 2022

It fixes an issue under the following conditions,

  1. Source doesn't have any activity.
  2. endpos is set to current LSN.
  3. The action that crosses the endpos is KEEPALIVE.

in this case, the written lsn that corresponds to KEEPALIVE isn't correct, i.e. less than endpos. This ultimately leads to Apply process waiting for a statement >= endpos which never happens, hence it never terminates.

@shubhamdhama
Copy link
Contributor Author

shubhamdhama commented Dec 9, 2022

Draft as I will write a unit test when no activity on source happens (sort of like edge case).
I'm not able to come up with a test case where I can deterministically set the endpos right after a K action because we also have B/C action due to activity on the sentinel table. So I was thinking the best we can do is create a test case where we have no activity for 10 seconds, prefetch, and put a timeout on apply for 10s so that migration should finish quickly.

Also, I have a question about some other tests (cdc/*) I see that Keepalive is in the expected JSON. How is every test run emitting the Keepalive at the same position relative to other actions?

@dimitri
Copy link
Owner

dimitri commented Jan 11, 2023

My reading of the Postgres source code makes me think that this PR is okay, we actually want to advance to WalEnd when receiving a keepalive message. See function WalSndUpdateProgress in src/backend/replication/walsender.c:

	 * When skipping empty transactions in synchronous replication, we send a
	 * keepalive message to avoid delaying such transactions.

About the test cases and the Keepalive (and other messages) position, we just skip comparing them because each run may provide a different one...

@shubhamdhama shubhamdhama marked this pull request as ready for review January 13, 2023 08:10
@shubhamdhama
Copy link
Contributor Author

Thanks for confirming that this fix is correct.

I've published the PR since this change is important and I'm couldn't finish the unit test I mentioned. I will share them in a separate PR once I am done.

@dimitri
Copy link
Owner

dimitri commented Jan 13, 2023

It seems you have added an extra KeepAlive message in some of the tests output, and the expected file has not been updated by your PR so tests are failing. Please see about updating the expected tests results.

@shubhamdhama
Copy link
Contributor Author

Oh right, I have updated the PR.

@dimitri dimitri merged commit 2598444 into dimitri:main Jan 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants