-
Notifications
You must be signed in to change notification settings - Fork 457
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
pageserver: fix secondary progress stats when layers are 404 (#7814)
## Problem Noticed this issue in staging. When a tenant is under somewhat heavy timeline creation/deletion thrashing, it becomes quite common for secondary downloads to encounter 404s downloading layers. This is tolerated by design, because heatmaps are not guaranteed to be up to date with what layers/timelines actually exist. However, we were not updating the SecondaryProgress structure in this case, so after such a download pass, we would leave a SecondaryProgress state with lower "downloaded" stats than "total" stats. This causes the storage controller to consider this secondary location inelegible for optimization actions such as we do after shard splits This issue has relative low impact because a typical tenant will eventually upload a heatmap where we do download all the layers and thereby enable the controller to progress with migrations -- the heavy thrashing of timeline creation/deletion is an artifact of our nightly stress tests. ## Summary of changes - In the layer 404 case, subtract the skipped layer's stats from the totals, so that at the end of this download pass we should still end up in a complete state. - When updating `last_downloaded`, do a sanity check that our progress is complete. In debug builds, assert out if this is not the case. In prod builds, correct the stats and log a warning.
- Loading branch information
Showing
1 changed file
with
63 additions
and
29 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
4ce6e2d
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3166 tests run: 3023 passed, 3 failed, 140 skipped (full report)
Failures on Postgres 14
test_storage_controller_many_tenants[github-actions-selfhosted]
: releasetest_basebackup_with_high_slru_count[github-actions-selfhosted-sequential-10-13-30]
: releasetest_basebackup_with_high_slru_count[github-actions-selfhosted-vectored-10-13-30]
: releaseFlaky tests (2)
Postgres 15
test_vm_bit_clear_on_heap_lock
: debugPostgres 14
test_secondary_background_downloads
: releaseCode coverage* (full report)
functions
:31.4% (6412 of 20427 functions)
lines
:48.0% (49275 of 102559 lines)
* collected from Rust tests only
4ce6e2d at 2024-05-21T14:07:32.585Z :recycle: