Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: lsn lease logic needs to properly handle restart #9754

Closed
yliang412 opened this issue Nov 13, 2024 · 0 comments · Fixed by #9833
Closed

pageserver: lsn lease logic needs to properly handle restart #9754

yliang412 opened this issue Nov 13, 2024 · 0 comments · Fixed by #9833
Assignees
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug

Comments

@yliang412
Copy link
Contributor

yliang412 commented Nov 13, 2024

Problem

test_readonly_node_gc has become very flaky, and we root caused it to edge cases around pageserver restart / tenant migration. In particular, we will fail with Bad request: tried to request a page version that was garbage collected. if the GetPage request arrives before the first lease request after pageserver restart / tenant migration. The fix in #9055 does not eliminate bad requests.

To fix this permanently, we need to modify the lease logic to be aware of the what is guarded even after pageserver restart / tenant migration, likely through persistence.

Related: #8817, Notion

Potential Solution

Solution 1: Persisting state

  • Persist lease information for each timeline in TimelineMetadata.
    • Persist max lsn with valid lease: not strictly necessary, but it is used in current GC logic for both retain lsns (branch points) and leases.
      • invariant: keep if layer.start_lsn < max_lsn_with_valid_lease (same for retain lsns)
    • Don't need timestamp, we can auto-renew the leases once we finishes pageserver restart / tenant migration
    • TODO: might need to persist all lsns, since there are gaps in between.
  • Keep the existing logic: GC loop is only place we remove leases, lsn lease handler is the only place we add / renew leases.
    • TODO: synthetic size calculation handler currently also refreshes gc info? What should we do with those?

Solution 2: Prevent gc cutoff from proceeding

  • Change the logic so that gc cutoff does not proceed pass lsn lease. (Pessimistic).
  • Need to keep more data around.
@yliang412 yliang412 added c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug labels Nov 13, 2024
yliang412 added a commit that referenced this issue Nov 14, 2024
## Problem

After investigation, we think to make `test_readonly_node_gc` less
flaky, we need to make a proper fix (likely involving persisting part of
the lease state). See #9754
for details.

## Summary of changes

- skip the test until proper fix.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
@jcsp jcsp self-assigned this Nov 19, 2024
github-merge-queue bot pushed a commit that referenced this issue Nov 22, 2024
)

## Problem

In #9754 and the flakiness of
`test_readonly_node_gc`, we saw that although our logic for controlling
GC was sound, the validation of getpage requests was not, because it
could not consider LSN leases when requests arrived shortly after
restart.

Closes #9754

## Summary of changes

This is the "Option 3" discussed verbally -- rather than holding back gc
cutoff, we waive the usual validation of request LSN if we are still
waiting for leases to be sent after startup

- When validating LSN in `wait_or_get_last_lsn`, skip the validation
relative to GC cutoff if the timeline is still in its LSN lease grace
period
- Re-enable test_readonly_node_gc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c/storage/pageserver Component: storage: pageserver t/bug Issue Type: Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants