graceful reconfiguration v1 design doc #26785

Merged

Conversation

@jubrad jubrad commented Apr 23, 2024

Motivation

Design doc for graceful reconfiguration of managed clusters.
Currently limited to timeout based reconfiguration delay (v1).

https://github.com/MaterializeInc/database-issues/issues/5976

Tips for reviewer

Checklist

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 3 times, most recently from 73b3833 to 7502d6c Compare April 23, 2024 21:53

```
CONCERN:
Allowing delay=false and a timeout=0 provides two similar mechanisms of
using the old behavior. I think this is what we want as it ensures the old
```

Contributor

How about special-casing timeout=0 to just invoke the old code, i.e., to do exactly the same as with delay=false?

### How do we handle multiple schedules?
If this feature is introducing multiple schedules, we'll need some rules governing the interaction between schedules.
There's a world where each schedule needs to consider its interaction with all other schedules, and that seems kind of
scary.

Contributor

I think there could be a division of schedules into 2 categories:

  1. Schedules that turn clusters On/Off.
  2. Schedules that affect the set of replicas of a cluster, if the cluster is On.

The interaction among all the schedules in the 1. category is already handled by the current code. (We wait for all schedules to have their say about a cluster before touching a cluster, and the cluster will be On if at least one schedule says that it should be On.)

The interaction between 1. and 2. would be:

  • If 1. decided that the cluster is Off, then 2. has nothing to do, because there are no replicas.
  • If 1. decided that the cluster is On, then 2. can decide on the exact set of replicas.

This leaves open the interactions among different schedules in category 2. What would be another example of a schedule in 2.? Maybe auto-scaling? I guess that would need to have some custom-defined interaction with the graceful reconfiguration timeout schedule.
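
To make the proposed split concrete, here is a minimal Rust sketch of how category-1 (on/off) decisions could gate category-2 (replica-set) decisions. All of these types and names are hypothetical, purely for discussion, and are not existing Materialize code.

```rust
// Hypothetical sketch: category 1 is resolved first, and only when the
// cluster is On does a category-2 schedule get to pick the replica set.
#[derive(Clone, Copy, PartialEq)]
enum OnOffDecision {
    On,
    Off,
}

enum ReplicaSetDecision {
    /// Keep the current replica set unchanged.
    Keep,
    /// Replace the replica set, e.g. finalize a graceful reconfiguration.
    Replace(Vec<String>),
}

/// The cluster is On if at least one on/off schedule says so; if it is Off,
/// there are no replicas for category 2 to decide about.
fn combine(
    on_off: &[OnOffDecision],
    replica_set: ReplicaSetDecision,
) -> Option<ReplicaSetDecision> {
    if on_off.iter().any(|d| *d == OnOffDecision::On) {
        Some(replica_set)
    } else {
        None
    }
}

fn main() {
    let decision = combine(
        &[OnOffDecision::Off, OnOffDecision::On],
        ReplicaSetDecision::Replace(vec!["r1-new".into()]),
    );
    assert!(matches!(decision, Some(ReplicaSetDecision::Replace(_))));
}
```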


### How does a multi-subscriber coord work with schedules?
We definitely don't want two schedule decisions to be acting at the same time. Currently this isn't
an issue, but it seems a bit scary. Is this a concern?

Contributor

I'd guess that stuff like altering clusters will always be a centralized thing, and only 1 instance of handle_scheduling_decisions would be allowed to run at a time. (But I'm not an expert on PlatformV2 stuff.)

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 0ab3bb4 to fbb5b42 Compare April 30, 2024 17:43
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 7 times, most recently from a2f965f to 1d2b0a2 Compare May 10, 2024 17:27
@jubrad jubrad changed the title WIP graceful reconfiguration v1 design doc graceful reconfiguration v1 design doc May 10, 2024
@jubrad jubrad marked this pull request as ready for review May 10, 2024 17:31
@jubrad jubrad requested a review from maddyblue May 13, 2024 21:29
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 1d2b0a2 to 9688a82 Compare May 14, 2024 19:15

Contributor
@maddyblue maddyblue left a comment

I haven't read over the implementation yet because I need to read up on how cluster scheduling works today first. Left some thoughts about how this should work from a user perspective.

### Details

#### Invariants:
- Only one reconfiguration will be allowed at a time for a cluster.

Contributor

What happens when a user tries this exactly? Error? The running recon is stopped and the new one occurs instead?

Contributor Author
@jubrad jubrad May 17, 2024

yeah, I'm not really sure what the implications are of switching the cluster_schedule from a single value to a vec.
Having two refresh schedules doesn't make much sense, nor does having two reconfigure schedules. Perhaps we're able to say that the list of schedules must be unique with respect to schedule type, and for reconfigure this will be resolved by overwriting the existing reconfigure.

by respecifying the statement with a TIMEOUT of 0.
Ex:
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP [AUTO DETECT] TIMEOUT 5 minutes );
```

Contributor

What happens when CLEANUP is not specified? It acts the same as TIMEOUT 0?

Contributor Author

I would expect that there's a difference between these two statements
```sql
ALTER CLUSTER c1 SET (SIZE 'small');
ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP TIMEOUT 0 );
```

The former should use the old mechanism of drop/create.
The latter should create a reconfigure that fires a decision on the next cycle of handle_cluster_scheduling.

Reasoning:
If you have an ongoing reconfiguration but you want to preemptively end it, setting the timeout to 0 could provide the mechanism, if we're able to detect that there aren't differences and modify a reconfigure schedule.

Keeping the former syntax and functionality alive seems useful for clusters with sources/sinks, as well as a very cost-sensitive protection against something like coordinator stalls or an envd crash, or even a poorly timed reconfigure which occurs during a maintenance window. I could be convinced that this isn't a real concern and we should always do a scheduled cleanup, except in the sink/source case, which the cluster sequencer could figure out internally.

are replacing.

## V1 Scope
- Timeout based reconfiguration cleanup scheduling

Contributor

Is this useful for users? It will still lead to unpredictable downtime. (Perhaps some of this has been discussed in places I'm not aware of but:) I'm interested in not building the timeout feature and only building the hydration-detection feature.

Contributor Author
@jubrad jubrad May 17, 2024

I guess I see the timeout as bounds you're setting on the cost of zero downtime during reconfiguration and hydration-detection as our way of helping reduce those costs.

If we just used hydration-detection I worry that there would be no upper bound to the cost.

Member

Is this useful for users? It will still lead to unpredictable downtime. (Perhaps some of this has been discussed in places I'm not aware of but:) I'm interested in not building the timeout feature and only building the hydration-detection feature.

The WAIT = FOR <interval> is definitely not the ideal user experience, but I think we can ship it much faster than we can ship WAIT = UNTIL CAUGHT UP. No one has yet proposed a clear implementation for computing "is this replica caught up?", and I'd hate to block this work.

Maybe the right goal is to aim for getting WAIT = FOR <interval> merged and in "private preview", with the expectation that it's unlikely that folks will want to use it. But we can at least check our assumption there! And while we're getting feedback on WAIT = FOR <interval> we can try to shake out the question of computing "is this replica caught up?", and it should drop into place pretty cleanly with the foundation we lay with WAIT = FOR <interval>.



#### Guard Rails
- Only one reconfigure can occur at a time for a given cluster. If a cluster

Contributor

Can you specify exactly what you want to happen here? My thought: overwrites the in-progress recon with the new one. Possibly issues a NOTICE that this occurred.

Contributor Author

This is exactly what I was thinking. There's a small bit in the cluster sequencing portion about this flow, but I'll make it clear throughout that this will be an overwrite of the existing reconfigure.

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 6 times, most recently from 311e2bf to 8425a7c Compare May 17, 2024 20:47
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 8425a7c to 40ef0f7 Compare May 17, 2024 20:50

Member
@benesch benesch left a comment

Thanks very much for writing this up, @jubrad! Sorry it took me a whole week to get to looking at this.

One major comment around making ALTER CLUSTER blocking that I think has the potential to majorly speed up the implementation time. Maybe it would be helpful to circle up live this week? I'll follow up on Slack.

Comment on lines 101 to 113
Introduce new SQL to define a delay duration for alter cluster.

`ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP TIMEOUT 5 minutes );`

This will alter the cluster and create new cluster replicas, but will not
remove existing replicas. Instead, it will set the schedule for the cluster
to `ClusterSchedule::Reconfigure` and set the deadline to `now + the provided
duration`. When CLEANUP TIMEOUT is not provided the cluster will behave as it
previously did, immediately tearing down and then creating the new replicas.

Initially, all reconfigurations will wait for the full timeout; however, when we
move to a hydration/parity cleanup detection mechanism, the reconfiguration may
finish before the timeout. The same SQL can be used in both cases.

Member

I think it's worth looking ahead a smidge and seeing how we'll extend this syntax when we support a better policy for doing the cutover. Seeing it written out, I'm worried that CLEANUP TIMEOUT is too specific.

Here's an alternative using an option named WAIT:

WAIT = FOR <interval> | UNTIL HYDRATED | UNTIL CAUGHT UP

The FOR variant is the v1: just wait a fixed duration to do the cutover.

The UNTIL HYDRATED is the v1.5: wait until the new replicas are hydrated (but not necessarily caught up to the old replicas). We may want to skip this based on @teskje's comments here about the pitfalls and difficulties of assessing whether a replica is hydrated.

The UNTIL CAUGHT UP variant is the v2: wait until the new replicas are "caught up" to the old replicas, for some as-yet-undetermined notion of caught-upedness.

The other big benefit of this phrasing is that it avoids using the term "timeout." On reflection, I think we should reserve "timeout" for things going wrong—i.e., aborting the entire operation because it took too long. Asking ALTER CLUSTER to wait 5m for the new replicas to catch up is less of a "time out" after 5m, and more of an intentional delay/wait.
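
As a sketch of how the parsed option might be represented, the three proposed variants map naturally onto a single enum. This is purely a hypothetical discussion aid, not the actual plan/AST representation in Materialize.

```rust
// Hypothetical modeling of the proposed WAIT option once parsed.
use std::time::Duration;

#[allow(dead_code)]
enum AlterClusterWait {
    /// WAIT = FOR <interval>: cut over after a fixed delay (the v1 behavior).
    For(Duration),
    /// WAIT = UNTIL HYDRATED: cut over once the new replicas are hydrated.
    UntilHydrated,
    /// WAIT = UNTIL CAUGHT UP: cut over once the new replicas have caught up
    /// to the old ones, for some yet-to-be-defined notion of "caught up".
    UntilCaughtUp,
}

fn main() {
    // WAIT = FOR '0m' still gives a way to force an immediate cutover.
    let _immediate = AlterClusterWait::For(Duration::ZERO);
    let _graceful = AlterClusterWait::UntilCaughtUp;
}
```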

Contributor Author

I really like this. Initially I had a
WITH [<condition name>] CLEANUP TIMEOUT = <interval>
but UNTIL HYDRATED WAIT FOR 10 minutes sounds way better.

@maddyblue had questioned whether we need to expose the HYDRATION/CAUGHT UP mechanism at all. I think this is a super valid point.

  • Would a user care about the mechanism being used to trigger cleanup?
  • Would a user ever want to wait for the full duration when we have a mechanism to trigger the cleanup early?

If we don't want to provide these dials, we could consider something like

WITH ( CREATE BEFORE DELETE, MAX OVERLAP = 10m )

This is generated from the part of my brain that works in terraform/pulumi, but... those are also tools that are likely to be directly interacting with creating/resizing clusters.

Contributor

Agree with Nikhil that CLEANUP TIMEOUT is too specific. Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

Member

Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

Omg, I went looking for prior art, and somehow forgot to look at ALTER WAREHOUSE. Psyched that my proposal here lines up!

but UNTIL HYDRATED WAIT FOR 10 minutes sounds way better.

Wait, you'd only use one of the options at a time though!

```sql
CREATE CLUSTER (WAIT FOR '10m');
CREATE CLUSTER (WAIT UNTIL HYDRATED);
CREATE CLUSTER (WAIT UNTIL CAUGHT UP);
```

Would a user ever want to wait for the full duration when we have a mechanism to trigger the cleanup early?

No, I don't think so—I think that's why you want FOR ... and UNTIL HYDRATED to be mutually exclusive options.

Would a user care about the mechanism being used to trigger cleanup?

I mused on this here: #26785 (comment)

Supporting WAIT = FOR ... is more about allowing us to ship something incrementally here, vs nailing the user experience from day one. In the fullness of time, we definitely want WAIT UNTIL CAUGHT UP to be the default, so most users get the good behavior by default. In that world, I think WAIT = FOR '0m' is still useful as a way to force a change through immediately.

Contributor Author
@jubrad jubrad May 20, 2024

No, I don't think so—I think that's why you want FOR ... and UNTIL HYDRATED to be mutually exclusive options.

I'm not entirely convinced that we want FOR <duration> and UNTIL CAUGHT UP to be mutually exclusive. It's possible we do, but we would need to be pretty confident in our UNTIL CAUGHT UP mechanism if we go that route. My preference would be to provide the user the knob to set a hard limit on the overlap duration, potentially failing forward.

Agree with Nikhil that CLEANUP TIMEOUT is too specific. Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

WAIT_FOR_COMPLETION isn't quite what we're doing.
The usage description I see is: "To block the immediate return of the ALTER WAREHOUSE command until the resize is complete, add the WAIT_FOR_COMPLETION parameter.", which is really about changing the return strategy of the query, not the underlying mechanism of the alter. We'd discussed having a BACKGROUND=true|false option as part of this, and that is directly comparable to WAIT_FOR_COMPLETION.

Comment on lines 115 to 215

In introducing a second `ClusterSchedule` variant, we now have to account for
the interaction between these variants. To do so, `ClusterSchedule::Manual`
will be special-cased. We will allow multiple `ClusterSchedules` to be set
concurrently on a given cluster, however, when `ClusterSchedule::Reconfigure`
is one of them, it will be the only schedule that will cause a decision to be
emitted.

Benefits of this approach:
- No other `ClusterSchedule` could interact with a cluster being reconfigured,
which may reduce the complexity of behavior between interacting schedules.
- Interactions with refresh. A resize on an active refresh to a smaller cluster
size would complete (or timeout) rather than be quiesced by the refresh
schedule decision. This should give a better indication of whether the new
size would work on subsequent refreshes.

Downsides of this approach:
- Replicas may be alive for the entire delay period rather than just the period
it takes to perform the refresh, which could lead to additional/unexpected
billing.

#### Catalog Changes
__Cluster__
Add a new `ClusterSchedule` enum variant, `Reconfigure`, which would
be applied temporarily during a reconfiguration.
```rust
ClusterSchedule::Reconfigure {
    // Timestamp for the user-provided timeout.
    timeout: Option<u64>,
    // The V2 hydration config may end up here once we determine how it should
    // be used to decide that reconfiguration is complete. This could also
    // become a `detection_mechanism` enum if we want different detection
    // mechanisms that the user can specify.
    // auto_detect: bool
}
```

#### Scheduling
`cluster_scheduling`'s `check_schedule_policies` will be updated
to additionally run `check_reconfiguration_policy`, which sends
`SchedulingDecisions` with the decision `FinalizeReconfigure` when the reconfigure timeout deadline has passed.

For V1, `check_reconfiguration_policy` will trigger a `FinalizeReconfigure` `SchedulingDecision` for clusters that
have a `Reconfigure` `ClusterSchedule` whose timeout deadline has passed. In V2, we will add logic to this to also check for parity/hydration. At that point, it may need to be backgrounded, as it's likely a much more costly check.

`Message::SchedulingDecisions` will be adjusted from
`SchedulingDecisions(Vec<(&'static str, Vec<(ClusterId, bool)>)>)` to be
`SchedulingDecisions(Vec<SchedulingPolicyDecision>)`
```rust
enum SchedulingPolicyDecision {
    Refresh(Vec<(ClusterId, bool)>),
    FinalizeReconfigure(Vec<ClusterId>),
}
```

The logic for `handle_schedule_decision` will be dispatched to the correct
function based on the decision variant: `handle_refresh_decision` or
`handle_finalize_reconfigure_decision`.

For each cluster, `handle_finalize_reconfigure_decisions` will check the current
catalog to avoid conflicts with catalog updates made since the `FinalizeReconfigure`
decision was sent. If there's a collision that should prevent the finalization, such as a
new size adjustment, it will skip that cluster; otherwise, it will remove the active replicas and move
the reconfiguration replicas to active. Finally, it will remove the `Reconfigure`
schedule from the cluster.
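
A minimal Rust sketch of the dispatch and the finalize-time collision check described above; the types and helpers are illustrative stand-ins, not the real coordinator code.

```rust
// Hypothetical sketch of the decision dispatch and finalize-time collision check.
type ClusterId = u64;

enum SchedulingPolicyDecision {
    Refresh(Vec<(ClusterId, bool)>),
    FinalizeReconfigure(Vec<ClusterId>),
}

fn handle_schedule_decision(decision: SchedulingPolicyDecision) {
    match decision {
        SchedulingPolicyDecision::Refresh(clusters) => handle_refresh_decision(clusters),
        SchedulingPolicyDecision::FinalizeReconfigure(clusters) => {
            handle_finalize_reconfigure_decision(clusters)
        }
    }
}

fn handle_refresh_decision(_clusters: Vec<(ClusterId, bool)>) {
    // Existing refresh handling, unchanged.
}

fn handle_finalize_reconfigure_decision(clusters: Vec<ClusterId>) {
    for cluster_id in clusters {
        // Re-read the catalog: if the cluster's config changed since the
        // decision was sent (e.g. a new size adjustment), skip finalization.
        if catalog_changed_since_decision(cluster_id) {
            continue;
        }
        // Otherwise: drop the active replicas, promote the reconfigure
        // replicas to active, and remove the Reconfigure schedule (elided).
    }
}

// Stand-in for the collision check against the current catalog state.
fn catalog_changed_since_decision(_cluster_id: ClusterId) -> bool {
    false
}

fn main() {
    handle_schedule_decision(SchedulingPolicyDecision::FinalizeReconfigure(vec![1]));
}
```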


#### Cluster Sequencer
The following roughly defines the new logic for the cluster sequencer when
performing an alter cluster on a managed cluster.

There are two states to look at:
1. the newly provided config
2. the reconfigure config (what is in the catalog)

First scenario, the cluster being altered has no `Reconfigure` `cluster_schedule`:
- If `cleanup_timeout` is None:
  - use the old mechanism (drop and replace)
- If the new alter statement provides a `cleanup_timeout`:
  - check for sources/sinks
  - update the catalog with the changes and add a `Reconfigure` cluster schedule
  - deploy new reconfigure replicas

Second scenario, the cluster being altered is undergoing a reconfigure:
- If `cleanup_timeout` is None:
  - use the old mechanism: drop and replace, including all reconfigure replicas, and remove the schedule
- If the new alter statement provides a `cleanup_timeout`:
  - If the new config matches the current config:
    - bump the timeout
  - If the new config doesn't match the current config:
    - check for sources/sinks
    - update the cluster catalog with the new schedule and config
    - drop reconfigure replicas
    - launch new reconfigure replicas

Note:
An `ALTER CLUSTER` statement that is run twice (double shot) either needs to
push out the timeout of an ongoing reconfiguration or needs to entirely replace
the existing reconfigure schedule, dropping and recreating any reconfigure
replicas. I believe the former is the more useful behavior.
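
The two scenarios above can be restated as a small decision function. The following Rust sketch uses hypothetical types and helper names and is only meant to summarize the branching, not the actual sequencer API.

```rust
// Hypothetical sketch restating the sequencer branching described above.
use std::time::Duration;

#[derive(Clone, PartialEq)]
struct ClusterConfig {
    size: String,
}

struct Cluster {
    // Some(_) when a graceful reconfiguration is already in progress.
    ongoing_reconfigure: Option<ClusterConfig>,
}

enum AlterAction {
    /// Old behavior: drop existing replicas and create new ones immediately.
    DropAndReplace,
    /// Extend the deadline of the in-progress reconfiguration.
    BumpTimeout(Duration),
    /// Start (or restart) a graceful reconfiguration with reconfigure replicas.
    StartReconfigure { new_config: ClusterConfig, timeout: Duration },
}

fn sequence_alter_cluster(
    cluster: &Cluster,
    new_config: ClusterConfig,
    cleanup_timeout: Option<Duration>,
) -> AlterAction {
    match (cleanup_timeout, &cluster.ongoing_reconfigure) {
        // No timeout supplied: fall back to the old mechanism, dropping any
        // in-flight reconfigure replicas as well.
        (None, _) => AlterAction::DropAndReplace,
        // Timeout supplied and the new config matches the in-flight one:
        // just push the deadline out.
        (Some(timeout), Some(ongoing)) if *ongoing == new_config => {
            AlterAction::BumpTimeout(timeout)
        }
        // Timeout supplied, no matching in-flight reconfigure: after a
        // source/sink check (elided), drop any reconfigure replicas and
        // start a new graceful reconfiguration.
        (Some(timeout), _) => AlterAction::StartReconfigure { new_config, timeout },
    }
}

fn main() {
    let cluster = Cluster { ongoing_reconfigure: None };
    let action = sequence_alter_cluster(
        &cluster,
        ClusterConfig { size: "small".into() },
        Some(Duration::from_secs(300)),
    );
    assert!(matches!(action, AlterAction::StartReconfigure { .. }));
}
```
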
Member

I'd like to propose an alternative v1: just block the ALTER CLUSTER command until the cutover is complete. I mentioned this briefly on the issue:

Should the ALTER CLUSTER statement block until the reconfiguration is complete? This is the "correct" behavior from the PostgreSQL perspective. For example, CREATE INDEX and ALTER TABLE both block until the potentially very slow operation is complete. This has the appealing property of allowing us to report failures to complete the operation back to the user. It also seems a good bit simpler to build, because we don't need to build any job scheduling infrastructure.

I think it'd be muuuuch simpler to start like this, since you basically wouldn't need any durable state to record that the alteration was in progress. Here's how it'd work:

  1. ALTER CLUSTER command is planned normally, resulting in a call to sequence_alter_cluster on the coordinator main loop.
  2. If ALTER CLUSTER command involves a reconfiguration, add the new replicas with the -pending suffix (and pending = true in the mz_cluster_replicas table) in a catalog transaction. In the same catalog transaction, ensure there are no existing pending replicas—if there are, send back an error, as another ALTER CLUSTER operation on the same cluster is already in progress. (In the future, we might prefer to have subsequent ALTER CLUSTER operations just block until the earlier operation completes, but that's more work to rig up.)
  3. Off of the coordinator main loop, wait out the WAIT policy.
  4. Once the WAIT policy completes, issue another catalog transaction (on the coordinator main thread) that drops the old replicas and removes the pending bit from the new replicas.
  5. Report success back to the user.

If the client goes away while the ALTER is running, we can have the coordinator abort the operation: just drop all the outstanding pending replicas.

I don't think you'd need to interact with the cluster scheduler at all, other than to make sure that the schedule policy wasn't ON REFRESH. The interaction with SCHEDULE = ON REFRESH seems real tricky, and since REFRESH EVERY views are generally for cold/warm data rather than hot data, the lack of graceful cluster reconfiguration seems quite tolerable.

There's only one subtle bit: what if envd crashes in the middle of handling an ALTER CLUSTER command? E.g., say the ALTER CLUSTER is being executed during the maintenance window. I think the answer is pretty clean: just have envd drop all pending replicas when it boots. We know that if we have pending replicas, the old envd 1) was in the middle of processing an ALTER CLUSTER command and 2) returned a "connection closed" error or somesuch that did not indicate success, and so we can simply drop the pending replicas and pretend the alteration never happened.

And, if blocking ALTER CLUSTER statements does turn out to be a pain point, we can roll out support for ALTER CLUSTER ... (BACKGROUND = TRUE) in some future release of Materialize.
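
A rough Rust sketch of that blocking flow; the catalog and replica operations are stand-ins rather than Materialize's actual APIs, and the real implementation would wait out the WAIT policy off the coordinator main loop instead of sleeping.

```rust
// Hypothetical sketch of the blocking ALTER CLUSTER flow proposed above.
use std::{thread, time::Duration};

#[derive(Debug)]
enum AlterError {
    ReconfigureAlreadyInProgress,
}

struct Replica {
    name: String,
    pending: bool,
}

struct Catalog {
    replicas: Vec<Replica>,
}

fn sequence_alter_cluster(catalog: &mut Catalog, wait: Duration) -> Result<(), AlterError> {
    // Step 2: refuse to start if another ALTER CLUSTER is already in flight.
    if catalog.replicas.iter().any(|r| r.pending) {
        return Err(AlterError::ReconfigureAlreadyInProgress);
    }
    // Add the new replicas with a `-pending` suffix and pending = true.
    catalog.replicas.push(Replica { name: "r1-pending".into(), pending: true });

    // Step 3: wait out the WAIT policy (a blocking sleep only for the sketch).
    thread::sleep(wait);

    // Step 4: drop the old replicas and clear the pending bit on the new ones.
    catalog.replicas.retain(|r| r.pending);
    for r in &mut catalog.replicas {
        r.pending = false;
    }
    // Step 5: success is reported back to the user by returning Ok.
    Ok(())
}

// Crash/restart safety: if envd finds pending replicas at boot, the previous
// ALTER never reported success, so the pending replicas can simply be dropped.
fn cleanup_on_boot(catalog: &mut Catalog) {
    catalog.replicas.retain(|r| !r.pending);
}

fn main() {
    let mut catalog = Catalog {
        replicas: vec![Replica { name: "r1".into(), pending: false }],
    };
    sequence_alter_cluster(&mut catalog, Duration::from_millis(10)).unwrap();
    cleanup_on_boot(&mut catalog);
    assert_eq!(catalog.replicas.len(), 1);
}
```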

Contributor Author

I have some concerns about implementing non-blocking as follow up. As someone who regularly deals with cloud infra I find it much harder to work with things that must be handled over a single connection. Networks are flaky, laptops go to sleep, and connection timeouts threaten longer lived calls.

This might work fine over PGWIRE, where it's not abnormal to have long running queries (10s of minutes), but HTTP calls via the console may not be as resilient here. The connections themselves may be fine if they don't hit timeouts, but we may find ourselves fighting with user interactions with long-lived connections in the browser.

configuration matches the `prior_config`. This seems somewhat convoluted and
I don't know that allowing for cancellation is necessary as long as we allow for
adjustment of the current reconfigure timeout, and we allow for overwriting the
current reconfiguration entirely.

Contributor
@ggevay ggevay May 20, 2024

So, what would be the exact workflow if the new size is not big enough? E.g., let's say the user is trying to size down, but it turns out that they can't size down, because the new size is not able to hydrate. Therefore, the user would like to go back to the original size. In this case, should the user give an ALTER CLUSTER to the original size? And then I would expect that the reconfiguration replicas simply disappear, and the active replicas stay active. Is this what will happen?

Contributor Author

I still need to update the doc based on the discussion earlier today. Here's where I think we're leaning: if the user believes the new smaller size will not work, they would cancel the existing query being run, assuming they are running a blocking query. If they are running the query in the background, they would issue a cancel command like cancel alter cluster <x>. In either case the existing reconfiguration replicas disappear and the active replicas stay active.

Member

That's my understanding too!


#### Reconfiguration Task

The reconfiguration task will perform the following actions:

Contributor

What is the task exactly? A tokio task? A coordinator message that gets sequenced and can possibly send other coord messages?

Contributor Author

Addressed in new update, and discussed in huddle, but I'll do a quick outline here.

Initially the thought was a tokio task, however, it seems like it'll be cleaner to do this through a cluster pending reconfiguration message handler that performs the check, finalizes the reconfiguration and returns to the user.

We'll have to adjust the sequencer to know it shouldn't retire the connection if it's a graceful reconfiguration.

we will treat any disruption in connection as a failed DDL which will need to
be aborted.

For V2, for each cluster, we will want to inspect whether an active

Contributor

This is less a v2 thing and more a BACKGROUND = true thing yeah? Can it talk about background instead of v2 then? I kinda think since background isn't fully defined here, we should just omit this altogether maybe with a note somewhere saying the background or v2 stuff will be in a later design doc.

Contributor Author

I'm guessing the rollout of v1/v2 is actually:

1. support for WAIT FOR (foreground)
2. support for WAIT UNTIL (foreground)
3. support for background

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 3c60a14 to 295995d Compare June 3, 2024 21:24
@jubrad jubrad requested review from maddyblue and benesch June 4, 2024 16:49
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 295995d to 60a363a Compare June 5, 2024 02:59
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 60a363a to a02b643 Compare June 6, 2024 16:38
@antiguru
Member

antiguru commented Jun 6, 2024

Thanks for writing this design doc! I want to capture one thought I had, but don't feel obliged in any way to incorporate it in your design! In the past we had the idea of replica sets, which group replicas in a cluster based on properties and allow reasoning about each independently of the cluster.

In the case of graceful reconfiguration, this could reason about replica sets instead of specific replicas, which might make the reasoning simpler. For example, it avoids having to think about how to name pending replicas, because each replica set provides a namespace.

For context, we didn't implement replica sets last year because it was a large change that complicated a user-facing feature even further, but maybe it's time to reconsider this eventually? (#21351)

@jubrad
Contributor Author

jubrad commented Jun 8, 2024

@antiguru very cool! Conceptually I like the idea of using replica sets to manage groups of replicas for graceful reconfiguration. I suspect we'd have to do something like swapping out the default/implicit replica set for the reconfiguration replica set once we've met our check/finalization conditions, then tear down the pre-alter replica set.

It seems like, for graceful reconfiguration, we would only want to act on the implicit replica set, and we would not support any graceful reconfiguration feature for replica sets themselves.

For the sake of this design doc, and roadmap of the feature, I think it's likely a good idea to hold off on incorporating this at least for v1. I also suspect incorporating replica sets wouldn't broadly change the design, but would clean up some of the individual replica management.

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 5980f1e to e5108fd Compare June 8, 2024 03:49
@jubrad jubrad requested a review from maddyblue June 14, 2024 18:03
prior version of the configuration. If a reconfiguration is not ongoing they
should be the only replicas.
- Reconfiguration Replica: These replicas only exist during a reconfiguration,
their names will be suffixed with `reconfig-`. They will move to active

Member

(See comment below about using a -pending suffix instead: https://github.com/MaterializeInc/materialize/pull/26785/files#r1606149023)

For v2, we will introduce backgrounded reconfiguration which will need some
new semantics in order to cancel. Something like the following should suffice.
```sql
CANCEL ALTER CLUSTER <cluster>
```

Member

Alternative strawman syntax for discussion: ALTER CLUSTER ... CANCEL.

Contributor Author

The main benefit I see is that it'd be easier to search for and keeps the cancel statement scoped to ALTER CLUSTER, but it also seems like it'd be odd to have the action at the end of the DDL statement.

The argument for cancel alter cluster: if we do introduce more cancelable statements, i.e., jobs, then the syntax there would likely be create job and cancel job, and a leading cancel could align more with that. It also begins the DDL with the action being performed, which I like.

This syntax is for v2 and we can probably refine it there, but I'm also fine with ALTER CLUSTER <cluster> CANCEL; essentially you could either provide the set or cancel keywords after the cluster name, which seems easy.

Comment on lines +54 to +70
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = FOR <interval>,
    BACKGROUND = {true|false},
)
```
- `UNTIL CAUGHT UP`, which will wait for a "catch up" mechanism to return true
or for a timeout to be met. It will roll back or forward depending on the value
provided to `ON TIMEOUT`.
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = UNTIL CAUGHT UP (
        TIMEOUT = <interval>,
        ON TIMEOUT = {CONTINUE | ABORT}
    )
    BACKGROUND = {true|false},
)
```

Member

I really like how this shook out!

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 3cdfca3 to 6e3492f Compare June 21, 2024 14:51
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 6e3492f to 09a2720 Compare June 27, 2024 18:00
replicas.

For v2, we will need to store the AlterClusterPlan durably for all backgrounded reconfigurations.
We will then need the coordinator to attempt to continue an ongoing reconfiguration on restart.

Contributor

There's no need to continue anything here! On restart the catalog will remove the non-pending replicas, promote the pending to non-pending, then proceed with compute controller initialization like normal, which will cause the new desired state to take effect.

Contributor Author

Is that the behavior we want? I imagine if environmentd were to get restarted, leading to a small amount of downtime, we wouldn't want to remove all non-pending, fully hydrated replicas and promote replicas that have an unknown hydration state? I would expect the behavior we want to strive for here is to continue with the alter and only perform the finalization once the alter strategy condition is met.

Contributor

My understanding is that on envd restart we always get a whole new set of replicas. The new envd does not hook in to the in-progress hydrating replicas, so they give us no benefit. Recommend verifying this behavior with the cluster team.

Contributor

If envd is restarted but some clusterds are not restarted, we aim to not disturb these clusterds, and reuse them as much as possible. That is, the coordinator tries to bootstrap timestamps in a way that existing dataflows on the running clusterds can be reused, and the clusterds try to recognize when compute commands that they receive from a new controller reflect already running things. See also platform/reconciliation.md.

Contributor Author

If the v2 mechanism is not based on rehydration but on parity of the frontiers between old and new replicas, then we should be able to handle both the case where existing dataflows are kept alive and the case where full rehydration is required. In the former, we'd wait for new replicas to catch up, and would only finalize the alter once that had occurred. In the latter, since both sets of replicas had their dataflows restarted, we'd expect to have parity immediately.
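
As a sketch of what such a frontier-parity check could look like (illustrative only; real frontiers are antichains, not single timestamps, and the "within lag" rule here is an assumption, not a decided design):

```rust
// Hypothetical sketch of a frontier-parity check between old and new replicas.
type Timestamp = u64;

/// The new replicas are considered "caught up" once, for every collection,
/// their write frontier is within `max_lag` of the old replicas' frontier.
fn caught_up(
    old_frontiers: &[Timestamp],
    new_frontiers: &[Timestamp],
    max_lag: Timestamp,
) -> bool {
    old_frontiers
        .iter()
        .zip(new_frontiers)
        .all(|(old, new)| old.saturating_sub(*new) <= max_lag)
}

fn main() {
    // Old and new replicas restarted together (e.g. after an envd restart):
    // both sets report the same frontiers, so parity holds immediately.
    assert!(caught_up(&[100, 200], &[100, 200], 0));
    // New replicas still hydrating: not yet caught up.
    assert!(!caught_up(&[100, 200], &[10, 20], 5));
}
```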

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 48443dc to 715f763 Compare June 28, 2024 14:32
@jubrad jubrad requested a review from maddyblue June 28, 2024 19:32
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 715f763 to b33374a Compare July 2, 2024 15:23
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from b33374a to ab3623d Compare July 3, 2024 04:38
@jubrad jubrad enabled auto-merge July 3, 2024 04:38
@jubrad jubrad merged commit bf57081 into MaterializeInc:main Jul 3, 2024
7 checks passed