pageserver: wait for lsn lease duration after transition into AttachedSingle #9024

Merged
yliang412 merged 17 commits into main from yuchen/lsn-lease-attached-multi-to-single-safety on Sep 19, 2024

Conversation

yliang412
Contributor

@yliang412 yliang412 commented Sep 17, 2024

Part of #7497, closes #8890.

Problem

Since leases are in-memory objects, we need to take special care of them after pageserver restarts and during live migration. The approach we took for pageserver restart is to wait for at least the lease duration before doing the first GC. We want to do the same for live migration. Since we do not do any GC when a tenant is in AttachedStale or AttachedMulti mode, only the transition from AttachedMulti to AttachedSingle requires this treatment.
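The transition rule above can be sketched as a small predicate. This is only an illustration under stated assumptions: the `AttachmentMode` enum and the `gc_allowed` and `needs_lease_wait` functions are hypothetical names, not the pageserver's actual types.

```rust
/// Hypothetical mirror of the attachment modes named in the description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AttachmentMode {
    Single,
    Multi,
    Stale,
}

/// Assumption taken from the description: no GC runs in
/// AttachedStale or AttachedMulti, so only AttachedSingle allows GC.
pub fn gc_allowed(mode: AttachmentMode) -> bool {
    matches!(mode, AttachmentMode::Single)
}

/// Only the AttachedMulti -> AttachedSingle transition needs the
/// "wait for the lease duration before the first GC" treatment.
pub fn needs_lease_wait(old: AttachmentMode, new: AttachmentMode) -> bool {
    matches!((old, new), (AttachmentMode::Multi, AttachmentMode::Single))
}
```

For example, `needs_lease_wait(AttachmentMode::Multi, AttachmentMode::Single)` is true, while a transition into AttachedMulti or AttachedStale needs no wait because GC does not run in those modes at all.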

Summary of changes

  • Added lsn_lease_deadline field in GcBlock::reasons: the tenant is temporarily blocked from GC until we reach the deadline. This information is not persisted to S3.
  • In GcBlock::start, skip the GC iteration if we are blocked by the lsn lease deadline.
  • In TenantManager::upsert_location, set the lsn_lease_deadline to Instant::now() + lsn_lease_length so the granted leases have a chance to be renewed before we run GC for the first time after transitioning from AttachedMulti to AttachedSingle.
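The deadline gate described in the list above might look roughly like the following. This is a minimal sketch, not the code from this PR: `GcBlockSketch`, `set_lsn_lease_deadline`, and `should_skip_gc` are hypothetical names, and the real `GcBlock` tracks other block reasons as well. The current time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for the in-memory GC block state.
/// The deadline lives only in memory and is never written to remote storage.
#[derive(Debug, Default)]
pub struct GcBlockSketch {
    lsn_lease_deadline: Option<Instant>,
}

impl GcBlockSketch {
    /// Called on the AttachedMulti -> AttachedSingle transition:
    /// block GC until `now + lsn_lease_length`.
    pub fn set_lsn_lease_deadline(&mut self, now: Instant, lsn_lease_length: Duration) {
        self.lsn_lease_deadline = Some(now + lsn_lease_length);
    }

    /// Checked at the start of a GC iteration: skip it while the
    /// deadline has not been reached yet.
    pub fn should_skip_gc(&self, now: Instant) -> bool {
        match self.lsn_lease_deadline {
            Some(deadline) => now < deadline,
            None => false,
        }
    }
}
```

With a ten-minute lease length set at time `t0`, `should_skip_gc` returns true for any instant before `t0 + 10 min` and false from the deadline onward, which gives computes one full lease period to renew before the first GC.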

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

…dSingle

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
@yliang412 yliang412 marked this pull request as ready for review September 17, 2024 14:31
@yliang412 yliang412 requested a review from a team as a code owner September 17, 2024 14:32
@yliang412 yliang412 added the c/storage/pageserver Component: storage: pageserver label Sep 17, 2024
@koivunej koivunej changed the title from "pageserver: wait for lsn lease duration after transition into AttachedSignle" to "pageserver: wait for lsn lease duration after transition into AttachedSingle" Sep 17, 2024

github-actions bot commented Sep 17, 2024

4968 tests run: 4804 passed, 0 failed, 164 skipped (full report)


Flaky tests (5)

Postgres 17

  • test_ondemand_wal_download_in_replication_slot_funcs: release-x86-64

Postgres 16

Postgres 15

Postgres 14

Code coverage* (full report)

  • functions: 31.9% (7425 of 23298 functions)
  • lines: 49.9% (59745 of 119838 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
d1b2bb1 at 2024-09-19T15:58:13.456Z :recycle:

@problame
Contributor

(Removing myself from this review)

@problame problame removed their request for review September 18, 2024 13:33
Member

@koivunej koivunej left a comment

Thanks, I think this is looking good.

I am surprised how many gc sensitive tests we have ... but which were not sensitive to lsn lease initial wait before? I'll try to understand these more and possibly chat with you...

@yliang412 yliang412 enabled auto-merge (squash) September 19, 2024 15:05
@yliang412 yliang412 merged commit 1708743 into main Sep 19, 2024
78 checks passed
@yliang412 yliang412 deleted the yuchen/lsn-lease-attached-multi-to-single-safety branch September 19, 2024 16:27
yliang412 added a commit that referenced this pull request Sep 20, 2024
davidgomes pushed a commit that referenced this pull request Sep 21, 2024
…dSingle (#9024)

Part of #7497, closes #8890.

## Problem

Since leases are in-memory objects, we need to take special care of them
after pageserver restarts and while doing a live migration. The approach
we took for pageserver restart is to wait for at least the lease duration
before doing the first GC. We want to do the same for live migration. Since
we do not do any GC when a tenant is in `AttachedStale` or
`AttachedMulti` mode, only the transition from `AttachedMulti` to
`AttachedSingle` requires this treatment.

## Summary of changes

- Added `lsn_lease_deadline` field in `GcBlock::reasons`: the tenant is
temporarily blocked from GC until we reach the deadline. This
information does not persist to S3.
- In `GcBlock::start`, skip the GC iteration if we are blocked by the
lsn lease deadline.
- In `TenantManager::upsert_location`, set the `lsn_lease_deadline` to
`Instant::now() + lsn_lease_length` so the granted leases have a chance
to be renewed before we run GC for the first time after transitioning
from `AttachedMulti` to `AttachedSingle`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
yliang412 added a commit that referenced this pull request Sep 27, 2024
Part of #7497, closes #8817.

## Problem

See #8817. 

## Summary of changes

**compute_ctl**

- Renew the lsn lease as soon as `/configure` updates `pageserver_connstr`,
using the `state_changed` Condvar for synchronization.

**pageserver**

As mentioned in
#8817 (comment),
we still want some permanent error reported if a lease cannot be
granted. By considering attachment mode and the added
`lsn_lease_deadline` when processing lease requests, we can also bound
the case of bad requests to a very short period after migration/restart.

- Refactor #9024 and move
`lsn_lease_deadline` to `AttachedTenantConf` so timeline can easily
access it.
- Have separate HTTP `init_lsn_lease` and libpq `renew_lsn_lease` APIs.
  - Always do LSN verification for the initial HTTP lease request.
- LSN verification for the renewal is **still done** when tenants are
not in `AttachedSingle` and we have passed the `lsn_lease_deadline`, which
gives plenty of time for compute to renew the lease.
 
**neon_local**

- Add and call the `timeline_init_lsn_lease` mgmt_api at static endpoint
start. The initial lsn lease HTTP request is sent when we run `cargo
neon endpoint start <static endpoint>`.


## Testing

- Extend `test_readonly_node_gc` to do pageserver restarts and
migration.

## Future Work

- The control plane should make the initial lease request through HTTP
when creating a static endpoint. This is currently only done in
`neon_local`.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
bayandin pushed a commit that referenced this pull request Sep 29, 2024
Labels
c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wait for the lease duration after transitioning to AttachedSingle, before doing any GC
3 participants