Fix timeout value used in XLogWaitForReplayOf #9937

MMeent · 2024-11-29T12:37:11Z

The previous value assumed usec precision, while the timeout used is in milliseconds, causing replica backends to wait for (potentially) many hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

Problem

neondatabase/postgres#279 (comment)

The timeout value was configured on the assumption that it would be interpreted as microseconds, whereas the timeout is actually in milliseconds. As a result, the backend waits much longer (2h46m40s) before it emits the "I'm waiting for recovery" message. While we do have wait events configured for this, it's not great to have apparently stuck backends without clear logs, so this fixes the timeout value in all our PostgreSQL branches.
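To make the unit mix-up concrete, here is a minimal sketch of the arithmetic. The literal `10000000` is an assumption: it is not quoted anywhere in this PR, but it is the value that matches the 2h46m40s figure when read as milliseconds. The names `ASSUMED_OLD_TIMEOUT`, `Duration`, `duration_from_ms`, `duration_from_us` and `print_duration` are hypothetical helpers for illustration only and are not part of PostgreSQL or of this patch.

```c
#include <assert.h>
#include <stdio.h>

/*
 * ASSUMPTION: inferred from the 2h46m40s figure in the PR description,
 * not copied from the actual source code.
 */
#define ASSUMED_OLD_TIMEOUT 10000000L

/* Hypothetical helper type: a duration split into hours, minutes, seconds. */
typedef struct
{
    long hours;
    long minutes;
    long seconds;
} Duration;

/* Interpret the value as milliseconds (what the timeout actually expects). */
static Duration
duration_from_ms(long ms)
{
    Duration d;
    long     total_seconds;

    assert(ms >= 0);
    total_seconds = ms / 1000;

    d.hours = total_seconds / 3600;
    d.minutes = (total_seconds % 3600) / 60;
    d.seconds = total_seconds % 60;
    return d;
}

/* Interpret the value as microseconds (what the old value assumed). */
static Duration
duration_from_us(long us)
{
    assert(us >= 0);
    return duration_from_ms(us / 1000);
}

/* Print a duration in the same "2h46m40s" style used in the PR text. */
static void
print_duration(const char *label, Duration d)
{
    printf("%s: %ldh%ldm%lds\n", label, d.hours, d.minutes, d.seconds);
}
```

Calling `print_duration("as ms", duration_from_ms(ASSUMED_OLD_TIMEOUT))` and `print_duration("as us", duration_from_us(ASSUMED_OLD_TIMEOUT))` shows the gap between the wait that actually happened and the wait that was presumably intended, a factor of 1000.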

PG PRs

* PG14: neondatabase/postgres#542
* PG15: neondatabase/postgres#543
* PG16: neondatabase/postgres#544
* PG17: neondatabase/postgres#545

github-actions · 2024-11-29T13:38:19Z

6952 tests run: 6644 passed, 0 failed, 308 skipped (full report)

Flaky tests (1)

Postgres 17

test_location_conf_churn[3]: debug-x86-64

Code coverage* (full report)

functions: 30.3% (8186 of 27044 functions)
lines: 47.7% (64837 of 135929 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results: a9d616d at 2024-11-29T16:14:47.762Z

The previous value assumed usec precision, while the timeout used is in milliseconds, causing replica backends to wait for (potentially) many hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

## Problem

neondatabase/postgres#279 (comment)

The timeout value was configured with the assumption the indicated value would be microseconds, where it's actually milliseconds. That causes the backend to wait for much longer (2h46m40s) before it emits the "I'm waiting for recovery" message. While we do have wait events configured on this, it's not great to have stuck backends without clear logs, so this fixes the timeout value in all our PostgreSQL branches.

## PG PRs

* PG14: neondatabase/postgres#542
* PG15: neondatabase/postgres#543
* PG16: neondatabase/postgres#544
* PG17: neondatabase/postgres#545

Fix timeout value used in XLogWaitForReplayOf

d01d46c

The previous value assumed usec precision, while the timeout used is in milliseconds, causing replica backends to wait for (potentially) many hours for WAL replay without the expected progress reports in logs. This fixes the issue.

MMeent requested a review from a team as a code owner November 29, 2024 12:37

MMeent requested a review from tristan957 November 29, 2024 12:37

hlinnaka approved these changes Nov 29, 2024

Fix timeout value used in XLogWaitForReplayOf

a9d616d

The previous value assumed usec precision, while the timeout used is in milliseconds, causing apparently stuck backends to wait for WAL replay. This fixes the issue.

MMeent enabled auto-merge November 29, 2024 14:32

MMeent added this pull request to the merge queue Nov 29, 2024

Merged via the queue into main with commit 973a8d2 Nov 29, 2024
84 checks passed

MMeent deleted the MMeent/fix/xlog-replay-wait-timeout branch November 29, 2024 19:11
