test_readonly_node_gc has become very flaky, and we root caused it to edge cases around pageserver restart / tenant migration. In particular, we will fail with Bad request: tried to request a page version that was garbage collected. if the GetPage request arrives before the first lease request after pageserver restart / tenant migration. The fix in #9055 does not eliminate bad requests.
To fix this permanently, we need to modify the lease logic to be aware of what is guarded even after pageserver restart / tenant migration, likely through persistence.
## Problem
After investigation, we concluded that making `test_readonly_node_gc`
less flaky requires a proper fix (likely involving persisting part of
the lease state). See #9754 for details.
## Summary of changes
- Skip the test until a proper fix lands.
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
## Problem
In #9754 and the flakiness of
`test_readonly_node_gc`, we saw that although our logic for controlling
GC was sound, the validation of getpage requests was not, because it
could not consider LSN leases when requests arrived shortly after
restart.
Closes #9754
## Summary of changes
This is the "Option 3" discussed verbally -- rather than holding back gc
cutoff, we waive the usual validation of request LSN if we are still
waiting for leases to be sent after startup.
- When validating LSN in `wait_or_get_last_lsn`, skip the validation
relative to GC cutoff if the timeline is still in its LSN lease grace
period
- Re-enable test_readonly_node_gc
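
The waiver described above can be illustrated with a minimal sketch. This is not the pageserver's actual code: the class `TimelineSketch`, its fields, and the exception name are hypothetical, LSNs are modeled as plain integers, and leases are modeled as an in-memory set that starts empty after a restart or migration.

```python
from dataclasses import dataclass, field


class PageVersionGarbageCollected(Exception):
    """Hypothetical error: request LSN is below the GC cutoff and nothing protects it."""


@dataclass
class TimelineSketch:
    # Hypothetical stand-in for a timeline; not the real pageserver type.
    gc_cutoff_lsn: int
    # Leases live in memory only, so this is empty right after restart/migration.
    leased_lsns: set = field(default_factory=set)
    # True from startup until computes have had a chance to resend their leases.
    in_lease_grace_period: bool = True

    def validate_request_lsn(self, request_lsn: int) -> None:
        # Requests at or above the GC cutoff are always acceptable.
        if request_lsn >= self.gc_cutoff_lsn:
            return
        # A valid lease on this LSN protects it from GC.
        if request_lsn in self.leased_lsns:
            return
        # Still waiting for leases after startup: waive the GC cutoff check,
        # because a lease may exist that has not been resent yet.
        if self.in_lease_grace_period:
            return
        raise PageVersionGarbageCollected(
            "tried to request a page version that was garbage collected"
        )
```

In this sketch, a request below the cutoff succeeds while `in_lease_grace_period` is true, and fails afterwards unless a lease for that LSN has been registered in the meantime. This only covers request validation; it assumes GC itself is already held back during the grace period.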
Problem
`test_readonly_node_gc` has become very flaky, and we root caused it to edge cases around pageserver restart / tenant migration. In particular, we will fail with `Bad request: tried to request a page version that was garbage collected.` if the GetPage request arrives before the first lease request after pageserver restart / tenant migration. The fix in #9055 does not eliminate bad requests.
To fix this permanently, we need to modify the lease logic to be aware of what is guarded even after pageserver restart / tenant migration, likely through persistence.
Related: #8817, Notion
Potential Solution
Solution 1: Persisting state
TimelineMetadata
.layer.start_lsn < max_lsn_with_valid_lease
(same for retain lsns)
Solution 2: Prevent gc cutoff from proceeding