Fix timeout value used in XLogWaitForReplayOf #9937

Merged
MMeent merged 2 commits into main from MMeent/fix/xlog-replay-wait-timeout
Nov 29, 2024


@MMeent MMeent commented Nov 29, 2024

The previous value assumed usec precision, while the timeout used is in milliseconds, causing replica backends to wait for (potentially) many hours for WAL replay without the expected progress reports in logs.

This fixes the issue.

Reported-By: Alexander Lakhin <exclusion@gmail.com>

Problem

neondatabase/postgres#279 (comment)

The timeout value was set on the assumption that it would be interpreted as microseconds, when it is actually interpreted as milliseconds. As a result, the backend waits far longer (2h46m40s) before it emits the "I'm waiting for recovery" message. We do have wait events configured for this, but stuck backends without clear logs are not great, so this fixes the timeout value in all our PostgreSQL branches.
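The 2h46m40s figure can be checked with a short calculation. The sketch below is illustrative only and is not the patched code: it assumes the intended reporting interval was 10 seconds, which is inferred from the 2h46m40s figure rather than quoted from the patch, and the helper names are hypothetical.

```python
from datetime import timedelta

# Assumption: the intended interval is inferred from the 2h46m40s figure
# in the description, not taken from the actual PostgreSQL patch.
INTENDED_INTERVAL = timedelta(seconds=10)


def as_microseconds(interval: timedelta) -> int:
    """Express an interval as a whole number of microseconds."""
    return interval // timedelta(microseconds=1)


def as_milliseconds(interval: timedelta) -> int:
    """Express an interval as a whole number of milliseconds."""
    return interval // timedelta(milliseconds=1)


def effective_wait(raw_timeout: int) -> timedelta:
    """Hypothetical stand-in for a wait call that reads its argument as milliseconds."""
    return timedelta(milliseconds=raw_timeout)


# Buggy: a microsecond count handed to an API that expects milliseconds.
buggy = effective_wait(as_microseconds(INTENDED_INTERVAL))

# Fixed: the same interval expressed in the unit the API expects.
fixed = effective_wait(as_milliseconds(INTENDED_INTERVAL))

print(buggy)  # 2:46:40
print(fixed)  # 0:00:10
```

Under that assumption, the mismatch stretches the wait by a factor of 1000, from 10 seconds to 10,000 seconds, which matches the duration reported above.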

PG PRs


github-actions bot commented Nov 29, 2024

6952 tests run: 6644 passed, 0 failed, 308 skipped (full report)


Flaky tests (1)

Postgres 17

Code coverage* (full report)

  • functions: 30.3% (8186 of 27044 functions)
  • lines: 47.7% (64837 of 135929 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
a9d616d at 2024-11-29T16:14:47.762Z :recycle:

The previous value assumed usec precision, while the timeout used is in
milliseconds, causing apparently stuck backends to wait for WAL replay.

This fixes the issue.
@MMeent MMeent enabled auto-merge November 29, 2024 14:32
@MMeent MMeent added this pull request to the merge queue Nov 29, 2024
Merged via the queue into main with commit 973a8d2 Nov 29, 2024
84 checks passed
@MMeent MMeent deleted the MMeent/fix/xlog-replay-wait-timeout branch November 29, 2024 19:11
awarus pushed a commit that referenced this pull request Dec 5, 2024

## PG PRs

* PG14: neondatabase/postgres#542
* PG15: neondatabase/postgres#543
* PG16: neondatabase/postgres#544
* PG17: neondatabase/postgres#545