graceful reconfiguration v1 design doc #26785

Merged

Conversation

@jubrad jubrad commented Apr 23, 2024

Motivation

Design doc for graceful reconfiguration of managed clusters.
Currently limited to timeout based reconfiguration delay (v1).

https://github.com/MaterializeInc/database-issues/issues/5976

Tips for reviewer

Checklist

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 3 times, most recently from 73b3833 to 7502d6c Compare April 23, 2024 21:53

```
CONCERN:
Allowing delay=false and a timeout=0 provides two similar mechanisms of
using the old behavior. I think this is what we want as it ensures the old
```

Contributor

How about special-casing timeout=0 to just invoke the old code, i.e., to do exactly the same as with delay=false?

### How do we handle multiple schedules?
If this feature is introducing multiple schedules, we'll need some rules governing the interaction between schedules.
There's a world where each schedule needs to consider its interaction with all other schedules, and that seems kind of
scary.

Contributor

I think there could be a division of schedules into 2 categories:

  1. Schedules that turn clusters On/Off.
  2. Schedules that affect the set of replicas of a cluster, if the cluster is On.

The interaction among all the schedules in the 1. category is already handled by the current code. (We wait for all schedules to have their say about a cluster before touching a cluster, and the cluster will be On if at least one schedule says that it should be On.)

The interaction between 1. and 2. would be:

  • If 1. decided that the cluster is Off, then 2. has nothing to do, because there are no replicas.
  • If 1. decided that the cluster is On, then 2. can decide on the exact set of replicas.

This leaves open the interactions among different schedules in category 2. What would be another example of a schedule in 2.? Maybe auto-scaling? I guess that would need to have some custom-defined interaction with the graceful reconfiguration timeout schedule.
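
To make the proposed split concrete, here is a minimal Rust sketch of how category-1 (on/off) decisions could gate category-2 (replica-set) decisions. All of these types and names are hypothetical, purely for discussion, and are not existing Materialize code.

```rust
// Hypothetical sketch: category 1 is resolved first, and only when the
// cluster is On does a category-2 schedule get to pick the replica set.
#[derive(Clone, Copy, PartialEq)]
enum OnOffDecision {
    On,
    Off,
}

enum ReplicaSetDecision {
    /// Keep the current replica set unchanged.
    Keep,
    /// Replace the replica set, e.g. finalize a graceful reconfiguration.
    Replace(Vec<String>),
}

/// The cluster is On if at least one on/off schedule says so; if it is Off,
/// there are no replicas for category 2 to decide about.
fn combine(
    on_off: &[OnOffDecision],
    replica_set: ReplicaSetDecision,
) -> Option<ReplicaSetDecision> {
    if on_off.iter().any(|d| *d == OnOffDecision::On) {
        Some(replica_set)
    } else {
        None
    }
}

fn main() {
    let decision = combine(
        &[OnOffDecision::Off, OnOffDecision::On],
        ReplicaSetDecision::Replace(vec!["r1-new".into()]),
    );
    assert!(matches!(decision, Some(ReplicaSetDecision::Replace(_))));
}
```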


### How does a multi-subscriber coord work with schedules?
We definitely don't want two schedule decisions to be acting at the same time. Currently this isn't
an issue, but it seems a bit scary. Is this a concern?

Contributor

I'd guess that stuff like altering clusters will always be a centralized thing, and only 1 instance of handle_scheduling_decisions would be allowed to run at a time. (But I'm not an expert on PlatformV2 stuff.)

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 0ab3bb4 to fbb5b42 Compare April 30, 2024 17:43
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 7 times, most recently from a2f965f to 1d2b0a2 Compare May 10, 2024 17:27
@jubrad jubrad changed the title WIP graceful reconfiguration v1 design doc graceful reconfiguration v1 design doc May 10, 2024
@jubrad jubrad marked this pull request as ready for review May 10, 2024 17:31
@jubrad jubrad requested a review from maddyblue May 13, 2024 21:29
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 1d2b0a2 to 9688a82 Compare May 14, 2024 19:15

Contributor
@maddyblue maddyblue left a comment

I haven't read over the implementation yet because I need to read up on how cluster scheduling works today first. Left some thoughts about how this should work from a user perspective.

### Details

#### Invariants:
- Only one reconfiguration will be allowed at a time for a cluster.

Contributor

What happens when a user tries this exactly? Error? The running recon is stopped and the new one occurs instead?

Contributor Author
@jubrad jubrad May 17, 2024

yeah, I'm not really sure what the implications are of switching the cluster_schedule from a single value to a vec.
Having two refresh schedules doesn't make much sense, nor does having two reconfigure schedules. Perhaps we're able to say that the list of schedules must be unique with respect to schedule type, and for reconfigure this will be resolved by overwriting the existing reconfigure.

by respecifying the statement with a TIMEOUT of 0.
Ex:
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP [AUTO DETECT] TIMEOUT 5 minutes );
```

Contributor

What happens when CLEANUP is not specified? It acts the same as TIMEOUT 0?

Contributor Author

I would expect that there's a difference between these two statements
```sql
ALTER CLUSTER c1 SET (SIZE 'small');
ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP TIMEOUT 0 );
```

The former should use the old mechanism of drop/create.
The latter should create a reconfigure that fires a decision on the next cycle of handle_cluster_scheduling.

Reasoning:
If you have an ongoing reconfiguration but you want to preemptively end it, setting the timeout to 0 could provide the mechanism, if we're able to detect that there aren't differences and modify a reconfigure schedule.

Keeping the former syntax and functionality alive seems useful for clusters with sources/sinks, as well as a very cost-sensitive protection against something like coordinator stalls or an envd crash, or even a poorly timed reconfigure which occurs during a maintenance window. I could be convinced that this isn't a real concern and we should always do a scheduled cleanup, except in the sink/source case, which the cluster sequencer could figure out internally.

are replacing.

## V1 Scope
- Timeout based reconfiguration cleanup scheduling

Contributor

Is this useful for users? It will still lead to unpredictable downtime. (Perhaps some of this has been discussed in places I'm not aware of but:) I'm interested in not building the timeout feature and only building the hydration-detection feature.

Contributor Author
@jubrad jubrad May 17, 2024

I guess I see the timeout as bounds you're setting on the cost of zero downtime during reconfiguration and hydration-detection as our way of helping reduce those costs.

If we just used hydration-detection I worry that there would be no upper bound to the cost.

Member

Is this useful for users? It will still lead to unpredictable downtime. (Perhaps some of this has been discussed in places I'm not aware of but:) I'm interested in not building the timeout feature and only building the hydration-detection feature.

The WAIT = FOR <interval> is definitely not the ideal user experience, but I think we can ship it much faster than we can ship WAIT = UNTIL CAUGHT UP. No one has yet proposed a clear implementation for computing "is this replica caught up?", and I'd hate to block this work.

Maybe the right goal is to aim for getting WAIT = FOR <interval> merged and in "private preview", with the expectation that it's unlikely that folks will want to use it. But we can at least check our assumption there! And while we're getting feedback on WAIT = FOR <interval> we can try to shake out the question of computing "is this replica caught up?", and it should drop into place pretty cleanly with the foundation we lay with WAIT = FOR <interval>.



#### Guard Rails
- Only one reconfigure can occur at a time for a given cluster. If a cluster

Contributor

Can you specify exactly what you want to happen here? My thought: overwrites the in-progress recon with the new one. Possibly issues a NOTICE that this occurred.

Contributor Author

This is exactly what I was thinking. There's a small bit in the cluster sequencing portion about this flow, but I'll make it clear throughout that this will be an overwrite of the existing reconfigure.

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 6 times, most recently from 311e2bf to 8425a7c Compare May 17, 2024 20:47
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 8425a7c to 40ef0f7 Compare May 17, 2024 20:50

Member
@benesch benesch left a comment

Thanks very much for writing this up, @jubrad! Sorry it took me a whole week to get to looking at this.

One major comment around making ALTER CLUSTER blocking that I think has the potential to majorly speed up the implementation time. Maybe it would be helpful to circle up live this week? I'll follow up on Slack.

Comment on lines 101 to 113
Introduce new SQL to define a delay duration for alter cluster.

`ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP TIMEOUT 5 minutes );`

This will alter the cluster and create new cluster replicas, but will not
remove existing replicas. Instead, it will set the schedule for the cluster
to `ClusterSchedule::Reconfigure` and set the deadline to `now + the provided
duration`. When CLEANUP TIMEOUT is not provided the cluster will behave as it
previously did, immediately tearing down and then creating the new replicas.

Initially, all reconfigurations will wait for the full timeout; however, when we
move to a hydration/parity cleanup detection mechanism, the reconfiguration may
finish before the timeout. The same SQL can be used in both cases.

Member

I think it's worth looking ahead a smidge and seeing how we'll extend this syntax when we support a better policy for doing the cutover. Seeing it written out, I'm worried that CLEANUP TIMEOUT is too specific.

Here's an alternative using an option named WAIT:

WAIT = FOR <interval> | UNTIL HYDRATED | UNTIL CAUGHT UP

The FOR variant is the v1: just wait a fixed duration to do the cutover.

The UNTIL HYDRATED is the v1.5: wait until the new replicas are hydrated (but not necessarily caught up to the old replicas). We may want to skip this based on @teskje's comments here about the pitfalls and difficulties of assessing whether a replica is hydrated.

The UNTIL CAUGHT UP variant is the v2: wait until the new replicas are "caught up" to the old replicas, for some as-yet-undetermined notion of caught-upedness.

The other big benefit of this phrasing is that it avoids using the term "timeout." On reflection, I think we should reserve "timeout" for things going wrong—i.e., aborting the entire operation because it took too long. Asking ALTER CLUSTER to wait 5m for the new replicas to catch up is less of a "time out" after 5m, and more of an intentional delay/wait.
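
As a sketch of how the parsed option might be represented, the three proposed variants map naturally onto a single enum. This is purely a hypothetical discussion aid, not the actual plan/AST representation in Materialize.

```rust
// Hypothetical modeling of the proposed WAIT option once parsed.
use std::time::Duration;

#[allow(dead_code)]
enum AlterClusterWait {
    /// WAIT = FOR <interval>: cut over after a fixed delay (the v1 behavior).
    For(Duration),
    /// WAIT = UNTIL HYDRATED: cut over once the new replicas are hydrated.
    UntilHydrated,
    /// WAIT = UNTIL CAUGHT UP: cut over once the new replicas have caught up
    /// to the old ones, for some yet-to-be-defined notion of "caught up".
    UntilCaughtUp,
}

fn main() {
    // WAIT = FOR '0m' still gives a way to force an immediate cutover.
    let _immediate = AlterClusterWait::For(Duration::ZERO);
    let _graceful = AlterClusterWait::UntilCaughtUp;
}
```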

Contributor Author

I really like this. Initially I had a
WITH [<condition name>] CLEANUP TIMEOUT = <interval>
but UNTIL HYDRATED WAIT FOR 10 minutes sounds way better.

@maddyblue had questioned whether we need to expose the HYDRATION/CAUGHT UP mechanism at all. I think this is a super valid point.

  • Would a user care about the mechanism being used to trigger cleanup?
  • Would a user ever want to wait for the full duration when we have a mechanism to trigger the cleanup early?

If we don't want to provide these dials, we could consider something like

WITH ( CREATE BEFORE DELETE, MAX OVERLAP = 10m )

This is generated from the part of my brain that works in terraform/pulumi, but... those are also tools that are likely to be directly interacting with creating/resizing clusters.

Contributor

Agree with Nikhil that CLEANUP TIMEOUT is too specific. Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

Member

Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

Omg, I went looking for prior art, and somehow forgot to look at ALTER WAREHOUSE. Psyched that my proposal here lines up!

but UNTIL HYDRATED WAIT FOR 10 minutes sounds way better.

Wait, you'd only use one of the options at a time though!

```sql
CREATE CLUSTER (WAIT FOR '10m');
CREATE CLUSTER (WAIT UNTIL HYDRATED);
CREATE CLUSTER (WAIT UNTIL CAUGHT UP);
```

Would a user ever want to wait for the full duration when we have a mechanism to trigger the cleanup early?

No, I don't think so—I think that's why you want FOR ... and UNTIL HYDRATED to be mutually exclusive options.

Would a user care about the mechanism being used to trigger cleanup?

I mused on this here: #26785 (comment)

Supporting WAIT = FOR ... is more about allowing us to ship something incrementally here, vs nailing the user experience from day one. In the fullness of time, we definitely want WAIT UNTIL CAUGHT UP to be the default, so most users get the good behavior by default. In that world, I think WAIT = FOR '0m' is still useful as a way to force a change through immediately.

Contributor Author
@jubrad jubrad May 20, 2024

No, I don't think so—I think that's why you want FOR ... and UNTIL HYDRATED to be mutually exclusive options.

I'm not entirely convinced that we want FOR <duration> and UNTIL CAUGHT UP to be mutually exclusive. It's possible we do, but we would need to be pretty confident in our UNTIL CAUGHT UP mechanism if we go that route. My preference would be to provide the user the knob to set a hard limit on the overlap duration, potentially failing forward.

Agree with Nikhil that CLEANUP TIMEOUT is too specific. Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

WAIT_FOR_COMPLETION isn't quite what we're doing.
The usage description I see is: "To block the immediate return of the ALTER WAREHOUSE command until the resize is complete, add the WAIT_FOR_COMPLETION parameter.", which is really about changing the return strategy of the query, not the underlying mechanism of the alter. We'd discussed having a BACKGROUND=true|false option as part of this, and that is directly comparable to WAIT_FOR_COMPLETION.

Comment on lines 115 to 215

In introducing a second `ClusterSchedule` variant, we now have to account for
the interaction between these variants. To do so, `ClusterSchedule::Manual`
will be special-cased. We will allow multiple `ClusterSchedules` to be set
concurrently on a given cluster, however, when `ClusterSchedule::Reconfigure`
is one of them, it will be the only schedule that will cause a decision to be
emitted.

Benefits of this approach:
- No other `ClusterSchedule` could interact with a cluster being reconfigured,
which may reduce the complexity of behavior between interacting schedules.
- Interactions with refresh. A resize on an active refresh to a smaller cluster
size would complete (or timeout) rather than be quiesced by the refresh
schedule decision. This should give a better indication of whether the new
size would work on subsequent refreshes.

Downsides of this approach:
- Replicas may be alive for the entire delay period rather than just the period
it takes to perform the refresh, which could lead to additional/unexpected
billing.

#### Catalog Changes
__Cluster__
Add a new `ClusterSchedule` enum variant, `Reconfigure`, which would
be applied temporarily during a reconfiguration.
```rust
ClusterSchedule::Reconfigure {
    // Timestamp for the user-provided timeout.
    timeout: Option<u64>,
    // The V2 hydration config may end up here once we determine how it should
    // be used to decide that reconfiguration is complete. This could also
    // become a `detection_mechanism` enum if we want different detection
    // mechanisms that the user can specify.
    // auto_detect: bool
}
```

#### Scheduling
`cluster_scheduling`'s `check_schedule_policies` will be updated
to additionally run `check_reconfiguration_policy`, which sends
`SchedulingDecisions` with the decision `FinalizeReconfigure` when the reconfigure timeout deadline has passed.

For V1, `check_reconfiguration_policy` will trigger a `FinalizeReconfigure` `SchedulingDecision` for clusters that
have a `Reconfigure` `ClusterSchedule` whose timeout deadline has passed. In V2, we will add logic to this to also check for parity/hydration. At that point, it may need to be backgrounded, as it's likely a much more costly check.

`Message::SchedulingDecisions` will be adjusted from
`SchedulingDecisions(Vec<(&'static str, Vec<(ClusterId, bool)>)>)` to be
`SchedulingDecisions(Vec<SchedulingPolicyDecision>)`
```rust
enum SchedulingPolicyDecision {
    Refresh(Vec<(ClusterId, bool)>),
    FinalizeReconfigure(Vec<ClusterId>),
}
```

The logic for `handle_schedule_decision` will be dispatched to the correct
function based on the decision variant: `handle_refresh_decision` or
`handle_finalize_reconfigure_decision`.

For each cluster, `handle_finalize_reconfigure_decisions` will check the current
catalog to avoid conflicts with catalog updates made since the `FinalizeReconfigure`
decision was sent. If there's a collision that should prevent the finalization, such as a
new size adjustment, it will skip that cluster; otherwise, it will remove the active replicas and move
the reconfiguration replicas to active. Finally, it will remove the `Reconfigure`
schedule from the cluster.
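
A minimal Rust sketch of the dispatch and the finalize-time collision check described above; the types and helpers are illustrative stand-ins, not the real coordinator code.

```rust
// Hypothetical sketch of the decision dispatch and finalize-time collision check.
type ClusterId = u64;

enum SchedulingPolicyDecision {
    Refresh(Vec<(ClusterId, bool)>),
    FinalizeReconfigure(Vec<ClusterId>),
}

fn handle_schedule_decision(decision: SchedulingPolicyDecision) {
    match decision {
        SchedulingPolicyDecision::Refresh(clusters) => handle_refresh_decision(clusters),
        SchedulingPolicyDecision::FinalizeReconfigure(clusters) => {
            handle_finalize_reconfigure_decision(clusters)
        }
    }
}

fn handle_refresh_decision(_clusters: Vec<(ClusterId, bool)>) {
    // Existing refresh handling, unchanged.
}

fn handle_finalize_reconfigure_decision(clusters: Vec<ClusterId>) {
    for cluster_id in clusters {
        // Re-read the catalog: if the cluster's config changed since the
        // decision was sent (e.g. a new size adjustment), skip finalization.
        if catalog_changed_since_decision(cluster_id) {
            continue;
        }
        // Otherwise: drop the active replicas, promote the reconfigure
        // replicas to active, and remove the Reconfigure schedule (elided).
    }
}

// Stand-in for the collision check against the current catalog state.
fn catalog_changed_since_decision(_cluster_id: ClusterId) -> bool {
    false
}

fn main() {
    handle_schedule_decision(SchedulingPolicyDecision::FinalizeReconfigure(vec![1]));
}
```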


#### Cluster Sequencer
The following roughly defines the new logic for the cluster sequencer when
performing an alter cluster on a managed cluster.

There are two states to look at:
1. the newly provided config
2. the reconfigure config (what is in the catalog)

First scenario, the cluster being altered has no `Reconfigure` `cluster_schedule`:
- If `cleanup_timeout` is None:
  - use the old mechanism (drop and replace)
- If the new alter statement provides a `cleanup_timeout`:
  - check for sources/sinks
  - update the catalog with the changes and add a `Reconfigure` cluster schedule
  - deploy new reconfigure replicas

Second scenario, the cluster being altered is undergoing a reconfigure:
- If `cleanup_timeout` is None:
  - use the old mechanism: drop and replace, including all reconfigure replicas, and remove the schedule
- If the new alter statement provides a `cleanup_timeout`:
  - If the new config matches the current config:
    - bump the timeout
  - If the new config doesn't match the current config:
    - check for sources/sinks
    - update the cluster catalog with the new schedule and config
    - drop reconfigure replicas
    - launch new reconfigure replicas

Note:
An `ALTER CLUSTER` statement that is run twice (double shot) either needs to
push out the timeout of an ongoing reconfiguration or needs to entirely replace
the existing reconfigure schedule, dropping and recreating any reconfigure
replicas. I believe the former is the more useful behavior.
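
The two scenarios above can be restated as a small decision function. The following Rust sketch uses hypothetical types and helper names and is only meant to summarize the branching, not the actual sequencer API.

```rust
// Hypothetical sketch restating the sequencer branching described above.
use std::time::Duration;

#[derive(Clone, PartialEq)]
struct ClusterConfig {
    size: String,
}

struct Cluster {
    // Some(_) when a graceful reconfiguration is already in progress.
    ongoing_reconfigure: Option<ClusterConfig>,
}

enum AlterAction {
    /// Old behavior: drop existing replicas and create new ones immediately.
    DropAndReplace,
    /// Extend the deadline of the in-progress reconfiguration.
    BumpTimeout(Duration),
    /// Start (or restart) a graceful reconfiguration with reconfigure replicas.
    StartReconfigure { new_config: ClusterConfig, timeout: Duration },
}

fn sequence_alter_cluster(
    cluster: &Cluster,
    new_config: ClusterConfig,
    cleanup_timeout: Option<Duration>,
) -> AlterAction {
    match (cleanup_timeout, &cluster.ongoing_reconfigure) {
        // No timeout supplied: fall back to the old mechanism, dropping any
        // in-flight reconfigure replicas as well.
        (None, _) => AlterAction::DropAndReplace,
        // Timeout supplied and the new config matches the in-flight one:
        // just push the deadline out.
        (Some(timeout), Some(ongoing)) if *ongoing == new_config => {
            AlterAction::BumpTimeout(timeout)
        }
        // Timeout supplied, no matching in-flight reconfigure: after a
        // source/sink check (elided), drop any reconfigure replicas and
        // start a new graceful reconfiguration.
        (Some(timeout), _) => AlterAction::StartReconfigure { new_config, timeout },
    }
}

fn main() {
    let cluster = Cluster { ongoing_reconfigure: None };
    let action = sequence_alter_cluster(
        &cluster,
        ClusterConfig { size: "small".into() },
        Some(Duration::from_secs(300)),
    );
    assert!(matches!(action, AlterAction::StartReconfigure { .. }));
}
```
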
Member

I'd like to propose an alternative v1: just block the ALTER CLUSTER command until the cutover is complete. I mentioned this briefly on the issue:

Should the ALTER CLUSTER statement block until the reconfiguration is complete? This is the "correct" behavior from the PostgreSQL perspective. For example, CREATE INDEX and ALTER TABLE both block until the potentially very slow operation is complete. This has the appealing property of allowing us to report failures to complete the operation back to the user. It also seems a good bit simpler to build, because we don't need to build any job scheduling infrastructure.

I think it'd be muuuuch simpler to start like this, since you basically wouldn't need any durable state to record that the alteration was in progress. Here's how it'd work:

  1. ALTER CLUSTER command is planned normally, resulting in a call to sequence_alter_cluster on the coordinator main loop.
  2. If ALTER CLUSTER command involves a reconfiguration, add the new replicas with the -pending suffix (and pending = true in the mz_cluster_replicas table) in a catalog transaction. In the same catalog transaction, ensure there are no existing pending replicas—if there are, send back an error, as another ALTER CLUSTER operation on the same cluster is already in progress. (In the future, we might prefer to have subsequent ALTER CLUSTER operations just block until the earlier operation completes, but that's more work to rig up.)
  3. Off of the coordinator main loop, wait out the WAIT policy.
  4. Once the WAIT policy completes, issue another catalog transaction (on the coordinator main thread) that drops the old replicas and removes the pending bit from the new replicas.
  5. Report success back to the user.

If the client goes away while the ALTER is running, we can have the coordinator abort the operation: just drop all the outstanding pending replicas.

I don't think you'd need to interact with the cluster scheduler at all, other than to make sure that the schedule policy wasn't ON REFRESH. The interaction with SCHEDULE = ON REFRESH seems real tricky, and since REFRESH EVERY views are generally for cold/warm data rather than hot data, the lack of graceful cluster reconfiguration seems quite tolerable.

There's only one subtle bit: what if envd crashes in the middle of handling an ALTER CLUSTER command? E.g., say the ALTER CLUSTER is being executed during the maintenance window. I think the answer is pretty clean: just have envd drop all pending replicas when it boots. We know that if we have pending replicas, the old envd 1) was in the middle of processing an ALTER CLUSTER command and 2) returned a "connection closed" error or somesuch that did not indicate success, and so we can simply drop the pending replicas and pretend the alteration never happened.

And, if blocking ALTER CLUSTER statements does turn out to be a pain point, we can roll out support for ALTER CLUSTER ... (BACKGROUND = TRUE) in some future release of Materialize.
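
A rough Rust sketch of that blocking flow; the catalog and replica operations are stand-ins rather than Materialize's actual APIs, and the real implementation would wait out the WAIT policy off the coordinator main loop instead of sleeping.

```rust
// Hypothetical sketch of the blocking ALTER CLUSTER flow proposed above.
use std::{thread, time::Duration};

#[derive(Debug)]
enum AlterError {
    ReconfigureAlreadyInProgress,
}

struct Replica {
    name: String,
    pending: bool,
}

struct Catalog {
    replicas: Vec<Replica>,
}

fn sequence_alter_cluster(catalog: &mut Catalog, wait: Duration) -> Result<(), AlterError> {
    // Step 2: refuse to start if another ALTER CLUSTER is already in flight.
    if catalog.replicas.iter().any(|r| r.pending) {
        return Err(AlterError::ReconfigureAlreadyInProgress);
    }
    // Add the new replicas with a `-pending` suffix and pending = true.
    catalog.replicas.push(Replica { name: "r1-pending".into(), pending: true });

    // Step 3: wait out the WAIT policy (a blocking sleep only for the sketch).
    thread::sleep(wait);

    // Step 4: drop the old replicas and clear the pending bit on the new ones.
    catalog.replicas.retain(|r| r.pending);
    for r in &mut catalog.replicas {
        r.pending = false;
    }
    // Step 5: success is reported back to the user by returning Ok.
    Ok(())
}

// Crash/restart safety: if envd finds pending replicas at boot, the previous
// ALTER never reported success, so the pending replicas can simply be dropped.
fn cleanup_on_boot(catalog: &mut Catalog) {
    catalog.replicas.retain(|r| !r.pending);
}

fn main() {
    let mut catalog = Catalog {
        replicas: vec![Replica { name: "r1".into(), pending: false }],
    };
    sequence_alter_cluster(&mut catalog, Duration::from_millis(10)).unwrap();
    cleanup_on_boot(&mut catalog);
    assert_eq!(catalog.replicas.len(), 1);
}
```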

Contributor Author

I have some concerns about implementing non-blocking as follow up. As someone who regularly deals with cloud infra I find it much harder to work with things that must be handled over a single connection. Networks are flaky, laptops go to sleep, and connection timeouts threaten longer lived calls.

This might work fine over PGWIRE, where it's not abnormal to have long running queries (10s of minutes), but HTTP calls via the console may not be as resilient here. The connections themselves may be fine if they don't hit timeouts, but we may find ourselves fighting with user interactions with long-lived connections in the browser.

configuration matches the `prior_config`. This seems somewhat convoluted and
I don't know that allowing for cancellation is necessary as long as we allow for
adjustment of the current reconfigure timeout, and we allow for overwriting the
current reconfiguration entirely.

Contributor
@ggevay ggevay May 20, 2024

So, what would be the exact workflow if the new size is not big enough? E.g., let's say the user is trying to size down, but it turns out that they can't size down, because the new size is not able to hydrate. Therefore, the user would like to go back to the original size. In this case, should the user give an ALTER CLUSTER to the original size? And then I would expect that the reconfiguration replicas simply disappear, and the active replicas stay active. Is this what will happen?

Contributor Author

I still need to update the doc based on the discussion earlier today. Here's where I think we're leaning: if the user believes the new smaller size will not work, they would cancel the existing query being run, assuming they are running a blocking query. If they are running the query in the background, they would issue a cancel command like cancel alter cluster <x>. In either case the existing reconfiguration replicas disappear and the active replicas stay active.

Member

That's my understanding too!


#### Reconfiguration Task

The reconfiguration task will perform the following actions:

Contributor

What is the task exactly? A tokio task? A coordinator message that gets sequenced and can possibly send other coord messages?

Contributor Author

Addressed in new update, and discussed in huddle, but I'll do a quick outline here.

Initially the thought was a tokio task, however, it seems like it'll be cleaner to do this through a cluster pending reconfiguration message handler that performs the check, finalizes the reconfiguration and returns to the user.

We'll have to adjust the sequencer to know it shouldn't retire the connection if it's a graceful reconfiguration.

we will treat any disruption in connection as a failed DDL which will need to
be aborted.

For V2, for each cluster, we will want to inspect whether an active

Contributor

This is less a v2 thing and more a BACKGROUND = true thing yeah? Can it talk about background instead of v2 then? I kinda think since background isn't fully defined here, we should just omit this altogether maybe with a note somewhere saying the background or v2 stuff will be in a later design doc.

Contributor Author

I'm guessing the rollout of v1/v2 is actually:

1. support for WAIT FOR (foreground)
2. support for WAIT UNTIL (foreground)
3. support for background

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 3c60a14 to 295995d Compare June 3, 2024 21:24
@jubrad jubrad requested review from maddyblue and benesch June 4, 2024 16:49
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 295995d to 60a363a Compare June 5, 2024 02:59
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 60a363a to a02b643 Compare June 6, 2024 16:38
@antiguru
Member

antiguru commented Jun 6, 2024

Thanks for writing this design doc! I want to capture one thought I had, but don't feel obliged in any way to incorporate it in your design! In the past we had the idea of replica sets, which group replicas in a cluster based on properties and allow reasoning about each independently of the cluster.

In the case of graceful reconfiguration, this could reason about replica sets instead of specific replicas, which might make the reasoning simpler. For example, it avoids having to think about how to name pending replicas, because each replica set provides a namespace.

For context, we didn't implement replica sets last year because it was a large change that complicated a user-facing feature even further, but maybe it's time to reconsider this eventually? (#21351)

@jubrad
Contributor Author

jubrad commented Jun 8, 2024

@antiguru very cool! Conceptually I like the idea of using replica sets to manage groups of replicas for graceful reconfiguration. I suspect we'd have to do something like swapping out the default/implicit replica set for the reconfiguration replica set once we've met our check/finalization conditions, then tear down the pre-alter replica set.

It seems like, for graceful reconfiguration, we would only want to act on the implicit replica set, and we would not support any graceful reconfiguration feature for replica sets themselves.

For the sake of this design doc, and roadmap of the feature, I think it's likely a good idea to hold off on incorporating this at least for v1. I also suspect incorporating replica sets wouldn't broadly change the design, but would clean up some of the individual replica management.

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 5980f1e to e5108fd Compare June 8, 2024 03:49
@jubrad jubrad requested a review from maddyblue June 14, 2024 18:03
prior version of the configuration. If a reconfiguration is not ongoing they
should be the only replicas.
- Reconfiguration Replica: These replicas only exist during a reconfiguration,
their names will be suffixed with `reconfig-`. They will move to active

Member

(See comment below about using a -pending suffix instead: https://github.com/MaterializeInc/materialize/pull/26785/files#r1606149023)

For v2, we will introduce backgrounded reconfiguration which will need some
new semantics in order to cancel. Something like the following should suffice.
```sql
CANCEL ALTER CLUSTER <cluster>
```

Member

Alternative strawman syntax for discussion: ALTER CLUSTER ... CANCEL.

Contributor Author

The main benefit I see is that it'd be easier to search for and keeps the cancel statement scoped to ALTER CLUSTER, but it also seems like it'd be odd to have the action at the end of the DDL statement.

The argument for cancel alter cluster: if we do introduce more cancelable statements, i.e., jobs, then the syntax there would likely be create job and cancel job, and a leading cancel could align more with that. It also begins the DDL with the action being performed, which I like.

This syntax is for v2 and we can probably refine it there, but I'm also fine with ALTER CLUSTER <cluster> CANCEL; essentially you could either provide the set or cancel keywords after the cluster name, which seems easy.

Comment on lines +54 to +70
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = FOR <interval>,
    BACKGROUND = {true|false},
)
```
- `UNTIL CAUGHT UP`, which will wait for a "catch up" mechanism to return true
or for a timeout to be met. It will roll back or forward depending on the value
provided to `ON TIMEOUT`.
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = UNTIL CAUGHT UP (
        TIMEOUT = <interval>,
        ON TIMEOUT = {CONTINUE | ABORT}
    )
    BACKGROUND = {true|false},
)
```

Member

I really like how this shook out!

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 3cdfca3 to 6e3492f Compare June 21, 2024 14:51
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 6e3492f to 09a2720 Compare June 27, 2024 18:00
replicas.

For v2, we will need to store the AlterClusterPlan durably for all backgrounded reconfigurations.
We will then need the coordinator to attempt to continue an ongoing reconfiguration on restart.

Contributor

There's no need to continue anything here! On restart the catalog will remove the non-pending replicas, promote the pending to non-pending, then proceed with compute controller initialization like normal, which will cause the new desired state to take effect.

Contributor Author

Is that the behavior we want? I imagine if environmentd were to get restarted, leading to a small amount of downtime, we wouldn't want to remove all non-pending, fully hydrated replicas and promote replicas that have an unknown hydration state? I would expect the behavior we want to strive for here is to continue with the alter and only perform the finalization once the alter strategy condition is met.

Contributor

My understanding is that on envd restart we always get a whole new set of replicas. The new envd does not hook in to the in-progress hydrating replicas, so they give us no benefit. Recommend verifying this behavior with the cluster team.

Contributor

If envd is restarted but some clusterds are not restarted, we aim to not disturb these clusterds, and reuse them as much as possible. That is, the coordinator tries to bootstrap timestamps in a way that existing dataflows on the running clusterds can be reused, and the clusterds try to recognize when compute commands that they receive from a new controller reflect already running things. See also platform/reconciliation.md.

Contributor Author

If the v2 mechanism is not based on rehydration but on parity of the frontiers between old and new replicas, then we should be able to handle both the case where existing dataflows are kept alive and the case where full rehydration is required. In the former, we'd wait for new replicas to catch up, and would only finalize the alter once that had occurred. In the latter, since both sets of replicas had their dataflows restarted, we'd expect to have parity immediately.
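
As a sketch of what such a frontier-parity check could look like (illustrative only; real frontiers are antichains, not single timestamps, and the "within lag" rule here is an assumption, not a decided design):

```rust
// Hypothetical sketch of a frontier-parity check between old and new replicas.
type Timestamp = u64;

/// The new replicas are considered "caught up" once, for every collection,
/// their write frontier is within `max_lag` of the old replicas' frontier.
fn caught_up(
    old_frontiers: &[Timestamp],
    new_frontiers: &[Timestamp],
    max_lag: Timestamp,
) -> bool {
    old_frontiers
        .iter()
        .zip(new_frontiers)
        .all(|(old, new)| old.saturating_sub(*new) <= max_lag)
}

fn main() {
    // Old and new replicas restarted together (e.g. after an envd restart):
    // both sets report the same frontiers, so parity holds immediately.
    assert!(caught_up(&[100, 200], &[100, 200], 0));
    // New replicas still hydrating: not yet caught up.
    assert!(!caught_up(&[100, 200], &[10, 20], 5));
}
```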

@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch 2 times, most recently from 48443dc to 715f763 Compare June 28, 2024 14:32
@jubrad jubrad requested a review from maddyblue June 28, 2024 19:32
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from 715f763 to b33374a Compare July 2, 2024 15:23
@jubrad jubrad force-pushed the graceful-reconfiguration-v1-design-doc branch from b33374a to ab3623d Compare July 3, 2024 04:38
@jubrad jubrad enabled auto-merge July 3, 2024 04:38
@jubrad jubrad merged commit bf57081 into MaterializeInc:main Jul 3, 2024
7 checks passed