Suppress wal lag timeout warnings right after tenant attachment #9232

Merged: 5 commits, Oct 3, 2024


arpad-m (Member) commented Oct 2, 2024

As seen in https://github.com/neondatabase/cloud/issues/17335, during releases we can have ingest lags that are above the limits for warnings. However, such lags are part of normal pageserver startup.

Therefore, calculate a cooldown timestamp until which we accept lags. The heuristic grows the later we get to fully loading the tenant, and we add 60 seconds as a grace period on top of that term.
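
A minimal sketch of the cooldown idea described above, in Rust. This is not the code from this PR: the function names and the exact form of the load-dependent term are assumptions made for illustration (here the term is simply the time the tenant took to become fully loaded). Only the 60-second grace period is taken from the description.

```rust
use std::time::Duration;

/// Grace period added after the load-dependent term (60 seconds, per the PR description).
const GRACE_PERIOD: Duration = Duration::from_secs(60);

/// Hypothetical helper: given how long after process start the tenant became
/// fully loaded, return the offset (also measured from process start) until
/// which ingest lag is accepted without a warning.
///
/// ASSUMPTION: the load-dependent term equals the load time itself, so a tenant
/// that finished loading at offset `t` gets a cooldown until `t + t + 60s`.
/// The real heuristic may use a different formula.
fn lag_warning_cooldown(loaded_after: Duration) -> Duration {
    loaded_after
        .saturating_add(loaded_after)
        .saturating_add(GRACE_PERIOD)
}

/// Hypothetical helper: true while we are still inside the cooldown window.
fn suppress_lag_warning(since_start: Duration, loaded_after: Duration) -> bool {
    since_start < lag_warning_cooldown(loaded_after)
}
```

Under this assumption, a tenant that finished loading 30 seconds after startup would have lag warnings suppressed until 120 seconds after startup, and a tenant that loaded later would get a proportionally longer window.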

@arpad-m arpad-m requested a review from a team as a code owner October 2, 2024 00:35
@arpad-m arpad-m requested review from skyzh and problame October 2, 2024 00:35
@arpad-m arpad-m force-pushed the arpad/wal_lag_timeout branch from c85be97 to 8337205 Compare October 2, 2024 01:00

github-actions bot commented Oct 2, 2024

5058 tests run: 4872 passed, 0 failed, 186 skipped (full report)


Flaky tests (5)

Postgres 17

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 31.3% (7487 of 23883 functions)
  • lines: 49.5% (60117 of 121349 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
1038b60 at 2024-10-03T01:19:13.068Z :recycle:

skyzh (Member) left a comment

LGTM, but this patch seems to break backward compatibility?

Review thread on pageserver/src/tenant.rs (outdated, resolved)
arpad-m (Member, Author) commented Oct 2, 2024

> this patch seems to break backward compatibility?

Yeah, the test expects the warning to be present, but it is still within the 60-second grace period. It actually makes sense for the warning to fire in that case.

I think the heuristic needs to be adjusted. Instead of blanket-suppressing all lag warnings until a certain timestamp, it might make more sense to suppress, until that timestamp, only the warnings for lags below a certain duration. I will adjust the PR accordingly.
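
A rough sketch of the adjusted heuristic proposed in this comment, in Rust. The struct, its field names, and the threshold values in the usage note are hypothetical; the actual change in pageserver/src/tenant.rs may be structured differently.

```rust
use std::time::Duration;

/// Hypothetical policy object; none of these names come from the PR.
struct LagWarnPolicy {
    /// Lag above this always warrants a warning once the cooldown is over.
    normal_threshold: Duration,
    /// More lenient threshold applied while the cooldown is active.
    startup_threshold: Duration,
    /// Length of the cooldown window, measured from tenant attachment.
    cooldown: Duration,
}

impl LagWarnPolicy {
    /// Returns true if a lag of `lag`, observed `since_attach` after the
    /// tenant was attached, should produce a warning.
    ///
    /// During the cooldown only lags above the lenient startup threshold are
    /// reported, so very large lags still surface right after attachment.
    fn should_warn(&self, since_attach: Duration, lag: Duration) -> bool {
        let threshold = if since_attach < self.cooldown {
            self.startup_threshold
        } else {
            self.normal_threshold
        };
        lag > threshold
    }
}
```

For example, with a normal threshold of 10 s, a startup threshold of 120 s, and a 60 s cooldown, a 60 s lag seen 30 s after attachment is not reported, while a 300 s lag at the same moment still is. This keeps a test that expects a warning for a very large lag passing even inside the grace period.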

@arpad-m arpad-m enabled auto-merge (squash) October 3, 2024 00:52
@arpad-m arpad-m merged commit 2d8f6d7 into main Oct 3, 2024
79 checks passed
@arpad-m arpad-m deleted the arpad/wal_lag_timeout branch October 3, 2024 01:33