pageserver: lsn lease logic needs to properly handle restart #9754

yliang412 · 2024-11-13T22:10:34Z

Problem

`test_readonly_node_gc` has become very flaky, and we root-caused it to edge cases around pageserver restart / tenant migration. In particular, a GetPage request fails with `Bad request: tried to request a page version that was garbage collected.` if it arrives before the first lease request after a pageserver restart / tenant migration. The fix in #9055 does not eliminate these bad requests.

To fix this permanently, we need to modify the lease logic so that it knows what is guarded even after pageserver restart / tenant migration, likely through persistence.

Related: #8817, Notion

Potential Solution

Solution 1: Persisting state

Persist lease information for each timeline in TimelineMetadata.
- Persist max lsn with valid lease: not strictly necessary, but it is used in current GC logic for both retain lsns (branch points) and leases.
  - invariant: keep if layer.start_lsn < max_lsn_with_valid_lease (same for retain lsns)
- Don't need timestamp: we can auto-renew the leases once we finish pageserver restart / tenant migration.
- TODO: might need to persist all lsns, since there are gaps in between.
Keep the existing logic: the GC loop is the only place we remove leases, and the lsn lease handler is the only place we add / renew leases.
- TODO: synthetic size calculation handler currently also refreshes gc info? What should we do with those?
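
To make Solution 1 concrete, here is a minimal Python sketch of what a persisted lease record and the retention invariant could look like. It is an illustration only: `PersistedLeases`, its JSON layout, and `must_keep_layer` are hypothetical names and do not reflect the actual `TimelineMetadata` format or pageserver code. It stores all leased LSNs (per the TODO about gaps) and derives the max from them, and it covers only the lease-related part of the GC decision.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PersistedLeases:
    """Hypothetical per-timeline lease record (not the real TimelineMetadata)."""

    # All leased LSNs, kept as plain integers. No timestamps are stored,
    # on the assumption that leases are auto-renewed after restart.
    leased_lsns: List[int] = field(default_factory=list)

    @property
    def max_lsn_with_valid_lease(self) -> Optional[int]:
        # None means no lease is currently recorded.
        return max(self.leased_lsns, default=None)

    def to_json(self) -> str:
        return json.dumps({"leased_lsns": sorted(self.leased_lsns)})

    @classmethod
    def from_json(cls, raw: str) -> "PersistedLeases":
        return cls(leased_lsns=list(json.loads(raw)["leased_lsns"]))


def must_keep_layer(layer_start_lsn: int, leases: PersistedLeases) -> bool:
    """Invariant from the list above: keep if start_lsn < max_lsn_with_valid_lease."""
    max_lsn = leases.max_lsn_with_valid_lease
    return max_lsn is not None and layer_start_lsn < max_lsn
```

Round-tripping the record through `to_json` / `from_json` stands in for surviving a restart: the reloaded record yields the same `max_lsn_with_valid_lease`, so the invariant can be evaluated before any lease request arrives.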

Solution 2: Prevent gc cutoff from proceeding

Change the logic so that gc cutoff does not proceed past lsn lease. (Pessimistic).
Need to keep more data around.
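
One possible reading of Solution 2, sketched in Python: clamp the proposed gc cutoff so it never moves past the lowest leased LSN. The function name `effective_gc_cutoff` and the integer representation of LSNs are assumptions made for illustration, not existing pageserver code.

```python
from typing import Iterable


def effective_gc_cutoff(proposed_cutoff: int, leased_lsns: Iterable[int]) -> int:
    """Return a cutoff that does not advance past any leased LSN.

    Hypothetical helper: with no leases the proposed cutoff is used as-is;
    otherwise the cutoff is held back at the lowest leased LSN.
    """
    return min([proposed_cutoff, *leased_lsns])
```

This shows why the approach is pessimistic: a single old lease holds the cutoff back for the whole timeline, so everything above that LSN is retained.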


## Problem

After investigation, we think to make `test_readonly_node_gc` less flaky, we need to make a proper fix (likely involving persisting part of the lease state). See #9754 for details.

## Summary of changes

- skip the test until proper fix.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>

## Problem

In #9754 and the flakiness of `test_readonly_node_gc`, we saw that although our logic for controlling GC was sound, the validation of getpage requests was not, because it could not consider LSN leases when requests arrived shortly after restart.

Closes #9754

## Summary of changes

This is the "Option 3" discussed verbally -- rather than holding back gc cutoff, we waive the usual validation of request LSN if we are still waiting for leases to be sent after startup

- When validating LSN in `wait_or_get_last_lsn`, skip the validation relative to GC cutoff if the timeline is still in its LSN lease grace period
- Re-enable `test_readonly_node_gc`
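
The validation rule described for "Option 3" can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual `wait_or_get_last_lsn` implementation: `TimelineState`, `in_lease_grace_period`, `validate_request_lsn`, and the exception type are hypothetical names, and LSNs are modelled as plain integers.

```python
from dataclasses import dataclass


class PageVersionGarbageCollected(Exception):
    """Stand-in for the 'page version that was garbage collected' bad request."""


@dataclass
class TimelineState:
    gc_cutoff_lsn: int
    leased_lsns: frozenset = frozenset()
    # Hypothetical flag: True while the timeline is still waiting for
    # leases to be (re)sent after startup.
    in_lease_grace_period: bool = False


def validate_request_lsn(request_lsn: int, state: TimelineState) -> None:
    """Raise if the request LSN is behind the GC cutoff and not otherwise allowed."""
    if request_lsn >= state.gc_cutoff_lsn:
        return  # at or above the cutoff: always readable
    if request_lsn in state.leased_lsns:
        return  # explicitly protected by a lease
    if state.in_lease_grace_period:
        return  # leases may not have arrived yet, so waive the cutoff check
    raise PageVersionGarbageCollected(
        "tried to request a page version that was garbage collected"
    )
```

In this sketch a request behind the cutoff succeeds while `in_lease_grace_period` is set and is rejected once the grace period ends, unless a lease for that LSN has been registered in the meantime.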

yliang412 added the c/storage/pageserver (Component: storage: pageserver) and t/bug (Issue Type: Bug) labels Nov 13, 2024

yliang412 mentioned this issue Nov 13, 2024

test: disable test_readonly_node_gc until proper fix #9755

Merged

yliang412 mentioned this issue Nov 15, 2024

[TESTING] fix(test): make test_readonly_node_gc more robust #9709

Closed

5 tasks

jcsp self-assigned this Nov 19, 2024

jcsp mentioned this issue Nov 21, 2024

pageserver: permit reads behind GC cutoff during LSN grace period #9833

Merged

jcsp closed this as completed in #9833 Nov 22, 2024
