-
Notifications
You must be signed in to change notification settings - Fork 463
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pageserver: wait for lsn lease duration after transition into AttachedSingle #9024
pageserver: wait for lsn lease duration after transition into AttachedSingle #9024
Conversation
…dSingle Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
4968 tests run: 4804 passed, 0 failed, 164 skipped (full report)Flaky tests (5)Postgres 17
Postgres 16
Postgres 15
Postgres 14
Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
d1b2bb1 at 2024-09-19T15:58:13.456Z :recycle: |
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
(Removing myself from this review) |
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I think this is looking good.
I am surprised how many gc sensitive tests we have ... but which were not sensitive to lsn lease initial wait before? I'll try to understand these more and possibly chat with you...
…dSingle (#9024) Part of #7497, closes #8890. ## Problem Since leases are in-memory objects, we need to take special care of them after pageserver restarts and while doing a live migration. The approach we took for pageserver restart is to wait for at least lease duration before doing first GC. We want to do the same for live migration. Since we do not do any GC when a tenant is in `AttachedStale` or `AttachedMulti` mode, only the transition from `AttachedMulti` to `AttachedSingle` requires this treatment. ## Summary of changes - Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is temporarily blocked from GC until we reach the deadline. This information does not persist to S3. - In `GCBlock::start`, skip the GC iteration if we are blocked by the lsn lease deadline. - In `TenantManager::upsert_location`, set the lsn_lease_deadline to `Instant::now() + lsn_lease_length` so the granted leases have a chance to be renewed before we run GC for the first time after transitioned from AttachedMulti to AttachedSingle. Signed-off-by: Yuchen Liang <yuchen@neon.tech> Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Part of #7497, closes #8817. ## Problem See #8817. ## Summary of changes **compute_ctl** - Renew lsn lease as soon as `/configure` updates pageserver_connstr, use `state_changed` Condvar for synchronization. **pageserver** As mentioned in #8817 (comment), we still want some permanent error reported if a lease cannot be granted. By considering attachment mode and the added `lsn_lease_deadline` when processing lease requests, we can also bound the case of bad requests to a very short period after migration/restart. - Refactor #9024 and move `lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily access it. - Have separate HTTP `init_lsn_lease` and libpq `renew_lsn_lease` API. - Always do LSN verification for the initial HTTP lease request. - LSN verification for the renewal is **still done** when tenants are not in `AttachedSingle` and we have pass the `lsn_lease_deadline`, which give plenty of time for compute to renew the lease. **neon_local** - add and call `timeline_init_lsn_lease` mgmt_api at static endpoint start. The initial lsn lease http request is sent when we run `cargo neon endpoint start <static endpoint>`. ## Testing - Extend `test_readonly_node_gc` to do pageserver restarts and migration. ## Future Work - The control plane should make the initial lease request through HTTP when creating a static endpoint. This is currently only done in `neon_local`. Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #7497, closes #8817. ## Problem See #8817. ## Summary of changes **compute_ctl** - Renew lsn lease as soon as `/configure` updates pageserver_connstr, use `state_changed` Condvar for synchronization. **pageserver** As mentioned in #8817 (comment), we still want some permanent error reported if a lease cannot be granted. By considering attachment mode and the added `lsn_lease_deadline` when processing lease requests, we can also bound the case of bad requests to a very short period after migration/restart. - Refactor #9024 and move `lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily access it. - Have separate HTTP `init_lsn_lease` and libpq `renew_lsn_lease` API. - Always do LSN verification for the initial HTTP lease request. - LSN verification for the renewal is **still done** when tenants are not in `AttachedSingle` and we have pass the `lsn_lease_deadline`, which give plenty of time for compute to renew the lease. **neon_local** - add and call `timeline_init_lsn_lease` mgmt_api at static endpoint start. The initial lsn lease http request is sent when we run `cargo neon endpoint start <static endpoint>`. ## Testing - Extend `test_readonly_node_gc` to do pageserver restarts and migration. ## Future Work - The control plane should make the initial lease request through HTTP when creating a static endpoint. This is currently only done in `neon_local`. Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Part of #7497, closes #8890.
Problem
Since leases are in-memory objects, we need to take special care of them after pageserver restarts and while doing a live migration. The approach we took for pageserver restart is to wait for at least lease duration before doing first GC. We want to do the same for live migration. Since we do not do any GC when a tenant is in
AttachedStale
orAttachedMulti
mode, only the transition fromAttachedMulti
toAttachedSingle
requires this treatment.Summary of changes
lsn_lease_deadline
field inGcBlock::reasons
: the tenant is temporarily blocked from GC until we reach the deadline. This information does not persist to S3.GCBlock::start
, skip the GC iteration if we are blocked by the lsn lease deadline.TenantManager::upsert_location
, set the lsn_lease_deadline toInstant::now() + lsn_lease_length
so the granted leases have a chance to be renewed before we run GC for the first time after transitioned from AttachedMulti to AttachedSingle.Checklist before requesting a review
Checklist before merging