docs: rolling storage controller restarts RFC #8310

Merged: 6 commits merged into main from vlad/storcon-improved-restarts-rfc on Aug 28, 2024

Conversation

VladLazar
Contributor

@VladLazar VladLazar commented Jul 8, 2024

Problem

Storage controller upgrades (restarts, more generally) can cause multi-second availability gaps.
While the storage controller does not sit on the main data path, it's generally not acceptable
to block management requests for extended periods of time (e.g. #8034).

Summary of changes

This RFC describes the issues with the current storage controller restart procedure
and proposes an implementation that reduces downtime to a few milliseconds on the happy path.

Related #7797

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? If so, did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@VladLazar VladLazar requested review from jcsp and yliang412 July 8, 2024 12:08

github-actions bot commented Jul 8, 2024

3806 tests run: 3700 passed, 0 failed, 106 skipped (full report)


Flaky tests (5): Postgres 16, Postgres 15, Postgres 14

Code coverage* (full report)

  • functions: 32.3% (7306 of 22610 functions)
  • lines: 50.4% (59080 of 117260 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
bfb0c59 at 2024-08-28T14:04:56.984Z :recycle:

@VladLazar VladLazar requested a review from jcsp July 10, 2024 08:05
Four review threads on docs/rfcs/034-storage-controller-restarts.md (outdated, resolved)
@jcsp
Collaborator

jcsp commented Jul 12, 2024

Thinking of extra safety measures: we might in future like to carry an HTTP header on controller requests to pageservers, which would change for new leaders, so that pageservers could refuse requests from stale leaders. Might be worth embedding some counter in the leader table for that purpose.
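
A minimal sketch of the stale-leader check suggested above, assuming each controller request carries a monotonically increasing leader counter (for example in an HTTP header) and the pageserver remembers the highest value it has seen. `LeaderFence`, `FenceError` and the `term` naming are hypothetical and not part of the RFC or the pageserver API.

```rust
/// Hypothetical pageserver-side guard. The counter and how it is
/// transported (e.g. an HTTP header) are assumptions for illustration.
#[derive(Debug, Default)]
struct LeaderFence {
    /// Highest leader counter observed on any accepted request.
    highest_seen_term: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum FenceError {
    /// The request came from a leader older than one already seen.
    StaleLeader {
        request_term: u64,
        highest_seen_term: u64,
    },
}

impl LeaderFence {
    /// Accepts a request whose counter is at least the highest seen so far,
    /// and records it. Requests with a lower counter are refused.
    fn check(&mut self, request_term: u64) -> Result<(), FenceError> {
        if request_term < self.highest_seen_term {
            return Err(FenceError::StaleLeader {
                request_term,
                highest_seen_term: self.highest_seen_term,
            });
        }
        self.highest_seen_term = request_term;
        Ok(())
    }
}
```

Once a request from a newer leader has been accepted, any request from the previous leader is refused, even if the old leader has not yet noticed that it was replaced.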

VladLazar added a commit that referenced this pull request Jul 26, 2024
## Problem
We are missing the step-down primitive required to implement rolling
restarts of the storage controller.

## Summary of changes
Add a `/control/v1/step_down` endpoint which puts the storage controller
into a state where it rejects all API requests apart from
`/control/v1/step_down`, `/status` and `/metrics`. On receiving the
request, the storage controller cancels all pending reconciles and waits
for them to exit gracefully. The response contains a snapshot of the
in-memory observed state.

Related:
* neondatabase/cloud#14701
* #7797
* #8310
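
The step-down behaviour described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the storage controller's code: `Controller`, `LeadershipState`, `admits`, and the string and integer stand-ins for shards and nodes are all hypothetical. Only the three endpoint paths come from the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical leadership state, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum LeadershipState {
    Leader,
    SteppedDown,
}

/// Paths that remain reachable after stepping down (from the commit message).
const ALWAYS_ALLOWED: [&str; 3] = ["/control/v1/step_down", "/status", "/metrics"];

struct Controller {
    state: LeadershipState,
    /// Stand-in for in-flight reconciles, keyed by a made-up shard name.
    pending_reconciles: Vec<String>,
    /// Stand-in for the in-memory observed state: shard name -> node id.
    observed: HashMap<String, u64>,
}

impl Controller {
    fn new(observed: HashMap<String, u64>, pending_reconciles: Vec<String>) -> Self {
        Controller {
            state: LeadershipState::Leader,
            pending_reconciles,
            observed,
        }
    }

    /// Returns true if a request for `path` should be served in the current state.
    fn admits(&self, path: &str) -> bool {
        match self.state {
            LeadershipState::Leader => true,
            LeadershipState::SteppedDown => ALWAYS_ALLOWED.iter().any(|p| *p == path),
        }
    }

    /// Steps down and returns a snapshot of the observed state.
    /// A real implementation would signal cancellation and wait for each
    /// reconcile to exit; here the pending list is simply cleared.
    fn step_down(&mut self) -> HashMap<String, u64> {
        self.state = LeadershipState::SteppedDown;
        self.pending_reconciles.clear();
        self.observed.clone()
    }
}
```

In this sketch, calling `step_down` a second time is harmless and returns the same snapshot, which is consistent with `/control/v1/step_down` staying reachable after the controller has stepped down.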
VladLazar added a commit to neondatabase/helm-charts that referenced this pull request Aug 22, 2024
Add a startAsCandidate setting for the storcon helm chart (default false).
When set to true, the service restarts gracefully (see neondatabase/neon#8310).

On its own, this changes nothing. Changes to neondatabase/infra will stage the roll-out.
@VladLazar VladLazar force-pushed the vlad/storcon-improved-restarts-rfc branch from 91b4533 to bfb0c59 Compare August 28, 2024 11:13
@VladLazar VladLazar enabled auto-merge (squash) August 28, 2024 11:22
@VladLazar VladLazar merged commit 5eb7322 into main Aug 28, 2024
64 of 67 checks passed
@VladLazar VladLazar deleted the vlad/storcon-improved-restarts-rfc branch August 28, 2024 13:56
3 participants