[PG15] Feature/replicas #279

MMeent · 2023-04-11T11:44:20Z

Add a condition variable for WAL recovery, allowing backends to wait until recovery has replayed up to a given record pointer.

This fixes some test failures that showed up after updating the Neon code to handle the request_lsn of replicas' get_page_at_lsn requests more precisely.

* Recovery requirements: Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.
* Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts. This fixes some test failures that showed up after updating Neon code to do more precise handling of replica's get_page_at_lsn's request_lsn lsns.

---------

Co-authored-by: Matthias van de Meent <boekewurm+postgres@gmail.com>

alexanderlaw · 2024-11-29T12:00:01Z

@MMeent, sorry for bumping this old PR, but I've stumbled upon something dubious that was introduced here.
Namely, this call in XLogWaitForReplayOf(), e.g., in postgres-v15:

	timeout = ConditionVariableTimedSleep(&XLogRecoveryCtl->replayProgressCV,
										  10000000, /* 10 seconds */
										  WAIT_EVENT_RECOVERY_WAL_STREAM);

Does 10000000 really mean 10 seconds for that function?

I'm seeing inside that function:

            cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);

Maybe the value passed is incorrect?

(I discovered this when I noticed that this sleep lasts for more than 5 minutes.)

MMeent · 2024-11-29T12:17:59Z

Yep, it does indeed seem to be off by a factor of 1000.

MMeent · 2024-11-29T12:45:10Z

I've created PRs for the issue for all PG versions and the Neon repo, so this won't remain an issue for much longer. Thank you for reporting this issue!

alexanderlaw · 2024-11-29T12:47:53Z

Thank you for paying attention to this!

The previous value assumed usec precision, while the timeout used is in milliseconds, causing replica backends to wait for (potentially) many hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

## Problem

neondatabase/postgres#279 (comment)

The timeout value was configured with the assumption the indicated value would be microseconds, where it's actually milliseconds. That causes the backend to wait for much longer (2h46m40s) before it emits the "I'm waiting for recovery" message. While we do have wait events configured on this, it's not great to have stuck backends without clear logs, so this fixes the timeout value in all our PostgreSQL branches.

## PG PRs

* PG14: neondatabase/postgres#542
* PG15: neondatabase/postgres#543
* PG16: neondatabase/postgres#544
* PG17: neondatabase/postgres#545
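The 2h46m40s figure quoted above can be checked with a little arithmetic. The following is a minimal standalone sketch, not code from any of the PRs: the helper names (`seconds_to_ms`, `split_ms`, `describe_timeout`) and the `OLD_TIMEOUT_MS` macro are hypothetical, and only the literal `10000000` and the intended 10-second wait come from the discussion. It assumes, as the thread concludes, that the timed sleep interprets its timeout in milliseconds.

```c
#include <stdio.h>

/* Hypothetical names for illustration; only the literal value below
 * and the intended 10-second wait come from the discussion. */
#define MS_PER_SECOND 1000L

/* The value originally passed, read as milliseconds by the timed sleep. */
#define OLD_TIMEOUT_MS 10000000L

/* Convert a wait expressed in seconds into milliseconds. */
long seconds_to_ms(long seconds)
{
    return seconds * MS_PER_SECOND;
}

/* Split a millisecond duration into hours, minutes and seconds. */
void split_ms(long ms, long *hours, long *minutes, long *seconds)
{
    long total_seconds = ms / MS_PER_SECOND;

    *hours = total_seconds / 3600;
    *minutes = (total_seconds % 3600) / 60;
    *seconds = total_seconds % 60;
}

/* Print how long a given millisecond timeout actually lasts. */
void describe_timeout(const char *label, long ms)
{
    long h, m, s;

    split_ms(ms, &h, &m, &s);
    printf("%s: %ld ms = %ldh%02ldm%02lds\n", label, ms, h, m, s);
}
```

Under these assumptions, `describe_timeout("old", OLD_TIMEOUT_MS)` reports 2h46m40s, while `seconds_to_ms(10)` yields 10000, the millisecond value that corresponds to the 10 seconds named in the original code comment.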

MMeent added 2 commits April 11, 2023 13:36

Recovery requirements:

873a843

Add condition variable for WAL recovery; allowing backends to wait for recovery up to some record pointer.

Fix issues w.r.t. WAL when LwLsn is initiated and when recovery starts.

8650bd8

This fixes some test failures that showed up after updating Neon code to do more precise handling of replica's get_page_at_lsn's request_lsn lsns.

MMeent requested a review from knizhnik April 11, 2023 11:47

knizhnik approved these changes Apr 13, 2023

MMeent merged commit aee72b7 into REL_15_STABLE_neon Apr 13, 2023

MMeent deleted the feature/replicas-v15 branch April 13, 2023 20:42

MMeent mentioned this pull request Nov 29, 2024

Fix timeout value used in XLogWaitForReplayOf neondatabase/neon#9937

Merged
