feat(pageserver): use leases to temporarily block gc #8084

Merged 8 commits on Jun 18, 2024

@yliang412 yliang412 commented Jun 17, 2024

Part of #7497, extracts from #7996, closes #8063.

Problem

With the LSN lease API introduced in #7808, we want to implement the real lease logic that prevents GC from proceeding past a leased LSN when garbage collecting layers in a timeline.

To do so, we keep an additional in-memory mapping of LSNs to leases in GCInfo. This mapping is updated when

  • a request is made to grant a new lease (insertion of new leases/renewal of existing leases)
  • GCInfo is refreshed during GC (removal of expired leases)
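
A minimal sketch of such a mapping, assuming a simplified `Lsn` newtype and hypothetical `GcInfo`, `LsnLease`, `grant`, and `remove_expired` names (the real pageserver types and signatures may differ). The rule that a renewal never shortens an existing lease is also an assumption made for illustration:

```rust
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// A lease only records the time at which it stops being valid.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LsnLease {
    pub valid_until: SystemTime,
}

/// Illustrative subset of the GC bookkeeping: leased LSN -> lease.
#[derive(Debug, Default)]
pub struct GcInfo {
    pub leases: BTreeMap<Lsn, LsnLease>,
}

impl GcInfo {
    /// Grant a new lease or renew an existing one.
    /// Assumption: a renewal never shortens a lease that is already held.
    pub fn grant(&mut self, lsn: Lsn, now: SystemTime, length: Duration) -> LsnLease {
        let requested = LsnLease { valid_until: now + length };
        let entry = self.leases.entry(lsn).or_insert(requested);
        if requested.valid_until > entry.valid_until {
            *entry = requested;
        }
        *entry
    }

    /// Drop every lease that has expired as of `now`.
    /// Intended to run whenever the GC info is refreshed.
    pub fn remove_expired(&mut self, now: SystemTime) {
        self.leases.retain(|_, lease| lease.valid_until > now);
    }
}
```

Passing `now` explicitly keeps the sketch deterministic, which is convenient when a test needs to simulate lease expiry.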

We use this mapping during GC in the same way we use retain_lsns for branches: a layer is kept if its start LSN is less than or equal to any leased LSN. This guarantees that we keep all the layers needed to reconstruct all pages at every LSN that holds a valid lease at that time.
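
The retention rule can be sketched as follows. `Lsn`, `retained_by_lease`, and `retained_by_max_lease` are hypothetical names for illustration, not the functions used in the PR:

```rust
/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Returns true if GC must keep a layer whose LSN range starts at
/// `layer_start`, given the LSNs that currently hold valid leases.
pub fn retained_by_lease(layer_start: Lsn, leased_lsns: &[Lsn]) -> bool {
    leased_lsns.iter().any(|leased| layer_start <= *leased)
}

/// Equivalent formulation: only the maximum leased LSN matters.
pub fn retained_by_max_lease(layer_start: Lsn, leased_lsns: &[Lsn]) -> bool {
    leased_lsns
        .iter()
        .max()
        .map_or(false, |max| layer_start <= *max)
}
```

The second function makes explicit that this rule keeps every layer starting at or below the highest leased LSN.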

Future Task

  • For the current lease implementation, we are actually keeping all the layers below the maximum LSN with valid leases (same problem for retain_lsns). Theoretically, for each LSN with valid lease, we only need to keep down to the most recent image layer at that lease/branch LSN. This can be a win for reducing the amount of data we retain.

Summary of changes

  • Maintain an in-memory mapping of LSN to lease expiration time in GCInfo.
  • Requesting a lease for an LSN below latest_gc_cutoff_lsn will error out.
  • Remove expired leases from the map when refreshing GCInfo.
  • Block GC on leased LSNs (similar to how we use retain_lsns).
  • Add a unit test for the make_lsn_lease API and for how leases are used during GC.
  • Delay GC by lease period at startup.
  • Lease duration config in TenantConf (useful for running tests)
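
Two of these rules can be sketched in isolation, assuming a simplified `Lsn` type and the hypothetical names `LeaseError`, `check_lease_request`, and `gc_allowed_after_startup`. The startup delay presumably exists because the lease map is held only in memory, so clients need one lease period to renew after a restart; that reading is an inference, not something the PR states:

```rust
use std::time::Duration;

/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Hypothetical error for a lease request that can no longer be honored.
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    BelowGcCutoff { requested: Lsn, cutoff: Lsn },
}

/// Reject a lease on an LSN that GC may already have passed.
pub fn check_lease_request(requested: Lsn, latest_gc_cutoff_lsn: Lsn) -> Result<(), LeaseError> {
    if requested < latest_gc_cutoff_lsn {
        Err(LeaseError::BelowGcCutoff {
            requested,
            cutoff: latest_gc_cutoff_lsn,
        })
    } else {
        Ok(())
    }
}

/// GC is allowed only once a full lease period has elapsed since startup.
pub fn gc_allowed_after_startup(since_startup: Duration, lease_length: Duration) -> bool {
    since_startup >= lease_length
}
```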

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@yliang412 yliang412 marked this pull request as ready for review June 17, 2024 20:25
@yliang412 yliang412 requested a review from a team as a code owner June 17, 2024 20:25
@yliang412 yliang412 requested review from jcsp and skyzh June 17, 2024 20:25

github-actions bot commented Jun 17, 2024

3222 tests run: 3105 passed, 0 failed, 117 skipped (full report)


Code coverage* (full report)

  • functions: 32.4% (6842 of 21098 functions)
  • lines: 50.0% (53363 of 106812 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
cf488a8 at 2024-06-18T17:25:02.644Z :recycle:

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
yliang412 and others added 2 commits June 18, 2024 10:51
Signed-off-by: Yuchen Liang <yuchen@neon.tech>
@yliang412 yliang412 enabled auto-merge (squash) June 18, 2024 15:04
@yliang412 yliang412 disabled auto-merge June 18, 2024 15:40
yliang412 and others added 2 commits June 18, 2024 12:28
@yliang412 yliang412 enabled auto-merge (squash) June 18, 2024 16:31
@yliang412 yliang412 merged commit 30b890e into main Jun 18, 2024
57 checks passed
@yliang412 yliang412 deleted the yuchen/lsn-lease-block-gc branch June 18, 2024 17:37
yliang412 added a commit that referenced this pull request Jun 24, 2024
…PI (#8104)

Part of #7497, closes #8072.

## Problem

Currently the `get_lsn_by_timestamp` and branch creation pageserver APIs do not provide a pleasant client experience, because the looked-up LSN might be GC-ed between the two API calls.

This PR attempts to prevent common races between GC and branch creation by making use of LSN leases provided in #8084. A lease can be optionally granted to a looked-up LSN. With the lease, GC will not touch layers needed to reconstruct all pages at this LSN for the duration of the lease.
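
A heavily simplified sketch of the idea, where the lookup and the lease grant happen in one call so that no GC run can fall between them. The `Timeline` struct, its fields, and this `get_lsn_by_timestamp` signature are hypothetical and do not mirror the real pageserver API:

```rust
use std::collections::BTreeMap;
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Toy timeline: commit timestamp (seconds) -> LSN, plus the lease map.
pub struct Timeline {
    pub commits: BTreeMap<u64, Lsn>,
    pub leases: BTreeMap<Lsn, SystemTime>,
}

impl Timeline {
    /// Find the latest LSN committed at or before `ts`. If `lease` is
    /// `Some((now, length))`, also record a lease on that LSN and return
    /// its expiration time alongside the LSN.
    pub fn get_lsn_by_timestamp(
        &mut self,
        ts: u64,
        lease: Option<(SystemTime, Duration)>,
    ) -> Option<(Lsn, Option<SystemTime>)> {
        let lsn = *self.commits.range(..=ts).next_back()?.1;
        let valid_until = lease.map(|(now, length)| {
            let until = now + length;
            // Simplification: overwrite any existing lease on this LSN.
            self.leases.insert(lsn, until);
            until
        });
        Some((lsn, valid_until))
    }
}
```

A client would then create the branch at the returned LSN before the returned expiration time passes.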

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
conradludgate pushed a commit that referenced this pull request Jun 27, 2024
…PI (#8104)

yliang412 added a commit that referenced this pull request Jul 4, 2024
Part of #7497, closes #8071. (accidentally closed #8208, reopened here)

## Problem

After the changes in #8084, we need synthetic size to also account for
leased LSNs so that users do not get free retention by running a small
ephemeral endpoint for a long time.

## Summary of changes

This PR integrates LSN leases into the synthetic size calculation. We
model a lease as a read-only branch started at the leased LSN (except
that a lease does not have a timeline ID).

Other changes:
- Add new unit tests testing whether a lease behaves like a read-only
branch.
- Change `/size_debug` response to include lease point in the SVG
visualization.
- Fix `/lsn_lease` HTTP API to do proper parsing for POST.
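
One way to picture the modeling choice is shown below. This is only a data-shape sketch, not the synthetic size algorithm; `RetentionPoint` and `retention_points` are hypothetical names:

```rust
/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// A point on a timeline that forces history to be retained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RetentionPoint {
    pub lsn: Lsn,
    /// `Some(id)` for a real child branch, `None` for a lease.
    pub timeline_id: Option<u64>,
}

/// Merge branch points and leased LSNs into one list sorted by LSN,
/// so that later size accounting can treat both kinds uniformly.
pub fn retention_points(branches: &[(Lsn, u64)], leased: &[Lsn]) -> Vec<RetentionPoint> {
    let mut points: Vec<RetentionPoint> = branches
        .iter()
        .map(|&(lsn, id)| RetentionPoint { lsn, timeline_id: Some(id) })
        .chain(leased.iter().map(|&lsn| RetentionPoint { lsn, timeline_id: None }))
        .collect();
    points.sort_by_key(|p| p.lsn);
    points
}
```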



Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
Co-authored-by: Christian Schwarz <christian@neon.tech>
yliang412 added a commit that referenced this pull request Jul 8, 2024
…8254)

## Problem

The LSN lease API introduced in #8084 is a new API that has been
shard-aware from day 1. To support ephemeral endpoints in #7994 without
linking the Postgres C API against `compute_ctl`, part of the sharding
code needs to reside in `utils`.

## Summary of changes

- Create a new `shard` module in the utils crate.
- Move the interface-related parts of the tenant sharding API to utils
and re-export them in pageserver_api.

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
VladLazar pushed a commit that referenced this pull request Jul 8, 2024
Part of #7497, closes #8071. (accidentally closed #8208, reopened here)

skyzh pushed a commit that referenced this pull request Jul 15, 2024
…8254)

yliang412 pushed a commit that referenced this pull request Aug 28, 2024
…tes (#7994)

Part of #7497

## Problem

Static computes pinned at some fixed LSN could be created initially
within the PITR interval but eventually fall out of it. To make sure
that static computes are not affected by GC, we need to start using the
LSN lease API (introduced in #8084) in compute_ctl.

## Summary of changes

**compute_ctl**
- When a static compute starts, spawn a thread that periodically pings
the pageserver(s) to make LSN lease requests.
- Add `test_readonly_node_gc` to test if static compute can read all
pages without error.
  - (test will fail on main without the code change here)
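
The renewal thread could look roughly like the sketch below. It uses only the standard library; `spawn_lease_renewal` and the `renew` callback are hypothetical, and the actual request to the pageserver is left to the caller:

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::Duration;

/// Spawn a thread that calls `renew` immediately and then once per
/// `interval` until `renew` returns false or a stop message arrives.
/// The thread returns the number of successful renewals.
pub fn spawn_lease_renewal<F>(
    interval: Duration,
    mut renew: F,
) -> (mpsc::Sender<()>, thread::JoinHandle<u32>)
where
    F: FnMut() -> bool + Send + 'static,
{
    let (stop_tx, stop_rx) = mpsc::channel();
    let handle = thread::spawn(move || {
        let mut renewals = 0;
        loop {
            if !renew() {
                break;
            }
            renewals += 1;
            // Sleep for `interval`, waking early if asked to stop.
            match stop_rx.recv_timeout(interval) {
                Err(RecvTimeoutError::Timeout) => continue,
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        }
        renewals
    });
    (stop_tx, handle)
}
```

In this sketch the interval would need to be shorter than the lease length so that each renewal lands before the previous lease expires.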

**page_service**
- `wait_or_get_last_lsn` will now allow `request_lsn` less than
`latest_gc_cutoff_lsn` to proceed if there is a lease on `request_lsn`.
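
The relaxed check might be sketched like this. `check_request_lsn` is a hypothetical helper rather than the body of `wait_or_get_last_lsn`, and treating an expired lease as absent is an assumption:

```rust
use std::collections::BTreeMap;
use std::time::SystemTime;

/// Hypothetical stand-in for the pageserver's LSN type.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Lsn(pub u64);

/// Allow a read at `request_lsn` if it is at or above the GC cutoff,
/// or if it is below the cutoff but protected by an unexpired lease.
pub fn check_request_lsn(
    request_lsn: Lsn,
    latest_gc_cutoff_lsn: Lsn,
    leases: &BTreeMap<Lsn, SystemTime>,
    now: SystemTime,
) -> Result<(), String> {
    if request_lsn >= latest_gc_cutoff_lsn {
        return Ok(());
    }
    match leases.get(&request_lsn) {
        // Assumption: an expired lease offers no protection.
        Some(valid_until) if *valid_until > now => Ok(()),
        _ => Err(format!(
            "request LSN {:?} is below GC cutoff {:?} and has no valid lease",
            request_lsn, latest_gc_cutoff_lsn
        )),
    }
}
```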

Signed-off-by: Yuchen Liang <yuchen@neon.tech>
Co-authored-by: Alexey Kondratov <kondratov.aleksey@gmail.com>