graceful reconfiguration v1 design doc #26785
Conversation
```
CONCERN:
Allowing delay=false and a timeout=0 provides two similar mechanisms of
using the old behavior. I think this is what we want as it ensures the old
```
How about special-casing timeout=0 to just invoke the old code, i.e., to do exactly the same as with delay=false?
### How do we handle multiple schedules?
If this feature is introducing multiple schedules we'll need some rules governing the interaction between schedules.
There's a world where each schedule needs to consider its interaction with all other schedules, and that seems kind of
scary.
I think there could be a division of schedules into 2 categories:
1. Schedules that turn clusters On/Off.
2. Schedules that affect the set of replicas of a cluster, if the cluster is On.

The interactions among all the schedules in category 1 are already handled by the current code. (We wait for all schedules to have their say about a cluster before touching it, and the cluster will be On if at least one schedule says that it should be On.)
The interaction between 1 and 2 would be:
- If 1 decided that the cluster is Off, then 2 has nothing to do, because there are no replicas.
- If 1 decided that the cluster is On, then 2 can decide on the exact set of replicas.

This leaves open the interactions among different schedules in category 2. What would be another example of a schedule in category 2? Maybe auto-scaling? I guess that would need some custom-defined interaction with the graceful reconfiguration timeout schedule.
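The two-category interaction described above could be sketched roughly as follows. This is an illustrative sketch only; the type and function names are hypothetical, not the actual Materialize scheduling code.

```rust
// Hypothetical sketch: category-1 (On/Off) schedules are combined first,
// and a category-2 (replica-set) decision applies only if the cluster is On.

#[derive(Debug, PartialEq)]
enum ClusterState {
    Off,
    // The replica count that a category-2 schedule (e.g. auto-scaling) settled on.
    On { replicas: u32 },
}

// Category 1: the cluster is On if at least one schedule says it should be.
fn combine_on_off(decisions: &[bool]) -> bool {
    decisions.iter().any(|&on| on)
}

fn decide(on_off_decisions: &[bool], replica_decision: u32) -> ClusterState {
    if combine_on_off(on_off_decisions) {
        ClusterState::On { replicas: replica_decision }
    } else {
        // Category 2 has nothing to do: there are no replicas.
        ClusterState::Off
    }
}
```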
### How does a multi-subscriber coord work with schedules?
We definitely don't want two schedule decisions to be acting at the same time. Currently this isn't
an issue, but it seems a bit scary. Is this a concern?
I'd guess that stuff like altering clusters will always be a centralized thing, and only 1 instance of `handle_scheduling_decisions` would be allowed to run at a time. (But I'm not an expert on PlatformV2 stuff.)
I haven't read over the implementation yet because I need to read up on how cluster scheduling works today first. Left some thoughts about how this should work from a user perspective.
### Details

#### Invariants:
- Only one reconfiguration will be allowed at a time for a cluster.
What happens when a user tries this exactly? Error? The running recon is stopped and the new one occurs instead?
Yeah, I'm not really sure what the implications are of switching the cluster_schedule from a single value to a vec.
Having two refresh schedules doesn't make much sense, nor does having two reconfigure schedules. Perhaps we're able to say that the list of schedules must be unique with respect to schedule type, and for reconfigure this will be resolved by overwriting the existing reconfigure.
by respecifying the statement with a TIMEOUT of 0.
Ex:
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP [AUTO DETECT] TIMEOUT 5 minutes );
```
What happens when `CLEANUP` is not specified? It acts the same as `TIMEOUT 0`?
I would expect that there's a difference between these two statements:
```sql
ALTER CLUSTER c1 SET (SIZE 'small');
ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP TIMEOUT 0 );
```
The former should use the old mechanism of drop/create.
The latter should create a reconfigure that fires a decision on the next cycle of handle_cluster_scheduling.
Reasoning:
If you have an ongoing reconfiguration but you want to preemptively end it, setting the timeout to 0 could provide the mechanism, if we're able to detect that there aren't differences and modify a reconfigure schedule.
Keeping the former syntax and functionality alive seems useful for clusters with sources/sinks, as well as very cost-sensitive protection against something like coordinator stalls or an envd crash, or even a poorly timed reconfigure which occurs during a maintenance window. I could be convinced that this isn't a real concern and we should always do a scheduled cleanup, except in the sink/source case, which the cluster sequencer could figure out internally.
are replacing.

## V1 Scope
- Timeout based reconfiguration cleanup scheduling
Is this useful for users? It will still lead to unpredictable downtime. (Perhaps some of this has been discussed in places I'm not aware of but:) I'm interested in not building the timeout feature and only building the hydration-detection feature.
I guess I see the timeout as bounds you're setting on the cost of zero downtime during reconfiguration and hydration-detection as our way of helping reduce those costs.
If we just used hydration-detection I worry that there would be no upper bound to the cost.
> Is this useful for users? It will still lead to unpredictable downtime. (Perhaps some of this has been discussed in places I'm not aware of but:) I'm interested in not building the timeout feature and only building the hydration-detection feature.

The `WAIT = FOR <interval>` is definitely not the ideal user experience, but I think we can ship it much faster than we can ship `WAIT = UNTIL CAUGHT UP`. No one has yet proposed a clear implementation for computing "is this replica caught up?", and I'd hate to block this work.
Maybe the right goal is to aim for getting `WAIT = FOR <interval>` merged and in "private preview", with the expectation that it's unlikely that folks will want to use it. But we can at least check our assumption there! And while we're getting feedback on `WAIT = FOR <interval>` we can try to shake out the question of computing "is this replica caught up?", and it should drop into place pretty cleanly with the foundation we lay with `WAIT = FOR <interval>`.
#### Guard Rails
- Only one reconfigure can occur at a time for a given cluster. If a cluster
Can you specify exactly what you want to happen here? My thought: overwrites the in-progress recon with the new one. Possibly issues a NOTICE that this occurred.
This is exactly what I was thinking. There's a small bit in the cluster sequencing portion on this flow, but I'll make it clear throughout that this will be an overwrite of the existing reconfigure.
Thanks very much for writing this up, @jubrad! Sorry it took me a whole week to get to looking at this.
One major comment around making `ALTER CLUSTER` blocking that I think has the potential to majorly speed up the implementation time. Maybe it would be helpful to circle up live this week? I'll follow up on Slack.
Introduce new SQL to define a delay duration for alter cluster.

`ALTER CLUSTER c1 SET (SIZE 'small') WITH ( CLEANUP TIMEOUT 5 minutes );`

This will alter the cluster and create new cluster replicas, but will not
remove existing replicas. Instead, it will set the schedule for the cluster
to `ClusterSchedule::Reconfigure` and set the deadline to `now + the provided
duration`. When CLEANUP TIMEOUT is not provided the cluster will behave as it
previously did, immediately tearing down and then creating the new replicas.

Initially, all reconfigurations will wait for the full timeout; however, when we
move to a hydration/parity cleanup detection mechanism, the reconfiguration may
finish before the timeout. The same SQL can be used in both cases.
I think it's worth looking ahead a smidge and seeing how we'll extend this syntax when we support a better policy for doing the cutover. Seeing it written out, I'm worried that `CLEANUP TIMEOUT` is too specific.
Here's an alternative using an option named `WAIT`:

`WAIT = FOR <interval> | UNTIL HYDRATED | UNTIL CAUGHT UP`

The `FOR` variant is the v1: just wait a fixed duration to do the cutover.
The `UNTIL HYDRATED` variant is the v1.5: wait until the new replicas are hydrated (but not necessarily caught up to the old replicas). We may want to skip this based on @teskje's comments here about the pitfalls and difficulties of assessing whether a replica is hydrated.
The `UNTIL CAUGHT UP` variant is the v2: wait until the new replicas are "caught up" to the old replicas, for some as-yet-undetermined notion of caught-upedness.
The other big benefit of this phrasing is that it avoids using the term "timeout." On reflection, I think we should reserve "timeout" for things going wrong—i.e., aborting the entire operation because it took too long. Asking `ALTER CLUSTER` to wait 5m for the new replicas to catch up is less of a "time out" after 5m, and more of an intentional delay/wait.
I really like this. Initially I had
`WITH [<condition name>] CLEANUP TIMEOUT = <interval>`
but `UNTIL HYDRATED WAIT FOR 10 minutes` sounds way better.
@maddyblue had questioned whether we need to expose the `HYDRATION`/`CAUGHT UP` mechanism at all. I think this is a super valid point.
- Would a user care about the mechanism being used to trigger cleanup?
- Would a user ever want to wait for the full duration when we have a mechanism to trigger the cleanup early?

If we don't want to provide these dials, we could consider something like
`WITH ( CREATE BEFORE DELETE, MAX OVERLAP = 10m )`
This is generated from the part of my brain that works in terraform/pulumi, but.. those are also tools that are likely to be directly interacting with creating/resizing clusters.
Agree with Nikhil that `CLEANUP TIMEOUT` is too specific. Snowflake uses a syntax similar to his proposal for warehouse resizing: `WAIT_FOR_COMPLETION`.
> Snowflake uses a syntax similar to his proposal for warehouse resizing: `WAIT_FOR_COMPLETION`.

Omg, I went looking for prior art, and somehow forgot to look at `ALTER WAREHOUSE`. Psyched that my proposal here lines up!
> but `UNTIL HYDRATED WAIT FOR 10 minutes` sounds way better.

Wait, you'd only use one of the options at a time though!
```sql
CREATE CLUSTER (WAIT FOR '10m');
CREATE CLUSTER (WAIT UNTIL HYDRATED);
CREATE CLUSTER (WAIT UNTIL CAUGHT UP);
```

> Would a user ever want to wait for the full duration when we have a mechanism to trigger the cleanup early?

No, I don't think so—I think that's why you want `FOR ...` and `UNTIL HYDRATED` to be mutually exclusive options.

> Would a user care about the mechanism being used to trigger cleanup?

I mused on this here: #26785 (comment)
Supporting `WAIT = FOR ...` is more about allowing us to ship something incrementally here, vs nailing the user experience from day one. In the fullness of time, we definitely want `WAIT UNTIL CAUGHT UP` to be the default, so most users get the good behavior by default. In that world, I think `WAIT = FOR '0m'` is still useful as a way to force a change through immediately.
> No, I don't think so—I think that's why you want FOR ... and UNTIL HYDRATED to be mutually exclusive options.

I'm not entirely convinced that we want `FOR <duration>` and `UNTIL CAUGHT UP` to be mutually exclusive. It's possible we do, but we would need to be pretty confident in our `UNTIL CAUGHT UP` mechanism if we go that route. My preference would be to provide the user the knob to set a hard limit on the overlap duration, potentially failing forward.

> Agree with Nikhil that CLEANUP TIMEOUT is too specific. Snowflake uses a syntax similar to his proposal for warehouse resizing: WAIT_FOR_COMPLETION.

WAIT_FOR_COMPLETION isn't quite what we're doing.
The usage description I see is "To block the immediate return of the ALTER WAREHOUSE command until the resize is complete, add the WAIT_FOR_COMPLETION parameter.", which is really about changing the return strategy of the query, not the underlying mechanism of the `alter`. We'd discussed having a `BACKGROUND=true|false` option as part of this, and that is directly comparable to `WAIT_FOR_COMPLETION`.
In introducing a second `ClusterSchedule` variant, we now have to account for
the interaction between these variants. To do so, `ClusterSchedule::Manual`
will be special-cased. We will allow multiple `ClusterSchedules` to be set
concurrently on a given cluster; however, when `ClusterSchedule::Reconfigure`
is one of them, it will be the only schedule that will cause a decision to be
emitted.

Benefits of this approach:
- No other `ClusterSchedule` could interact with a cluster being reconfigured,
  which may reduce the complexity of behavior between interacting schedules.
- Interactions with refresh. A resize on an active refresh to a smaller cluster
  size would complete (or time out) rather than be quiesced by the refresh
  schedule decision. This should give a better indication of whether the new
  size would work on subsequent refreshes.

Downsides of this approach:
- Replicas may be alive for the entire delay period rather than just the period
  it takes to perform the refresh; this could lead to additional/unexpected
  billing.

#### Catalog Changes
__Cluster__
Add a new `ClusterSchedule` enum variant, `Reconfigure`, which would
be applied temporarily during reconfigure.
```rust
ClusterSchedule::Reconfigure {
    // Timestamp for the user-provided timeout.
    timeout: Option<u64>,
    // V2 hydration config may end up here when we determine how it should be
    // used to decide that reconfiguration is complete. This could also become
    // a detection_mechanism enum if we want different detection mechanisms
    // that the user can specify.
    // auto_detect: bool
}
```

#### Scheduling
`cluster_scheduling`'s `check_schedule_policies` will be updated
to additionally check `check_reconfiguration_policy`, which sends
`ScheduleDecisions` with the decision `FinalizeReconfigure` when the reconfigure timeout deadline has passed.

`check_reconfiguration_policy`, for V1, will trigger a `FinalizeReconfiguration` `SchedulingDecision` for clusters that
have a `Reconfigure` `ClusterSchedule` whose timeout deadline is earlier than the current time. In V2, we will add logic to also check for parity/hydration. At that point, it may need to be backgrounded, as it's likely a much more costly check.
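The V1 check described above could be sketched as follows. This is a hedged, illustrative sketch: the type and function names mirror the design doc but are not the actual Materialize implementation.

```rust
// Hypothetical sketch of the V1 reconfiguration-policy check: collect the
// clusters whose reconfigure deadline has passed.

type ClusterId = u64;

struct ReconfigureSchedule {
    // Wall-clock deadline (millis since epoch) after which cleanup may fire.
    deadline_ms: u64,
}

fn check_reconfiguration_policy(
    schedules: &[(ClusterId, ReconfigureSchedule)],
    now_ms: u64,
) -> Vec<ClusterId> {
    schedules
        .iter()
        .filter(|(_, s)| s.deadline_ms <= now_ms) // the deadline has passed
        .map(|(id, _)| *id)
        .collect()
}
```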
`Message::SchedulingDecisions` will be adjusted from
`SchedulingDecisions(Vec<(&'static str, Vec<(ClusterId, bool)>)>)` to
`SchedulingDecisions(Vec<SchedulingPolicyDecision>)`
```rust
enum SchedulingPolicyDecision {
    Refresh(Vec<(ClusterId, bool)>),
    FinalizeReconfigure(Vec<ClusterId>),
}
```
The logic for `handle_schedule_decision` will be directed to the correct
function based on the decision variant: `handle_refresh_decision` or
`handle_finalize_reconfigure_decision`.
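The dispatch could look roughly like the sketch below. The handler bodies are stubs for illustration only; the names follow the doc but are assumptions, not the real coordinator code.

```rust
// Hypothetical sketch of routing SchedulingPolicyDecision variants to their
// handlers. Handlers return a count of processed clusters purely so the
// sketch has an observable result.

type ClusterId = u64;

enum SchedulingPolicyDecision {
    Refresh(Vec<(ClusterId, bool)>),
    FinalizeReconfigure(Vec<ClusterId>),
}

fn handle_refresh_decision(decisions: &[(ClusterId, bool)]) -> usize {
    decisions.len() // stub: would turn clusters On/Off
}

fn handle_finalize_reconfigure_decision(clusters: &[ClusterId]) -> usize {
    clusters.len() // stub: would promote reconfigure replicas to active
}

fn handle_schedule_decision(decision: &SchedulingPolicyDecision) -> usize {
    match decision {
        SchedulingPolicyDecision::Refresh(d) => handle_refresh_decision(d),
        SchedulingPolicyDecision::FinalizeReconfigure(c) => {
            handle_finalize_reconfigure_decision(c)
        }
    }
}
```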
For each cluster, `handle_finalize_reconfigure_decisions` will check the current
catalog to avoid conflicts with catalog updates made since the `FinalizeDecision`
was sent. If there's a collision that should prevent the finalization, such as a
new size adjustment, it will skip the cluster. Otherwise, it will remove the active replicas, move
the reconfiguration replicas to active, and finally remove the `Reconfigure`
schedule from the cluster.
#### Cluster Sequencer
The following roughly defines the new logic for the cluster sequencer when
performing an alter cluster on a managed cluster.

There are two states to look at:
1. the newly provided config
2. the reconfigure config (what is in the catalog)

First scenario, the cluster being altered has no `Reconfigure` `cluster_schedule`:
If `cleanup_timeout` is None:
- use the old mechanism (drop and replace)
If the new alter statement provides a `cleanup_timeout`:
- check for sources/sinks
- update the catalog with the changes and add a `Reconfigure` cluster schedule
- deploy new reconfigure replicas
Second scenario, the cluster being altered is undergoing a reconfigure:
If `cleanup_timeout` is None:
- use the old mechanism: drop and replace, including all reconfigure replicas, and remove the schedule
If the new alter statement provides a `cleanup_timeout`:
- If the new config matches the current config:
  - bump the timeout
- If the new config doesn't match the current config:
  - check for sources/sinks
  - update the cluster catalog with the new schedule and config
  - drop reconfigure replicas
  - launch new reconfigure replicas
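The two scenarios above reduce to a small decision table, which could be sketched as follows. This is an assumption-laden sketch: the `Action` enum and `sequence_alter` function are hypothetical stand-ins for the catalog and replica operations the sequencer would actually perform.

```rust
// Hypothetical sketch of the sequencer branching described above; the
// catalog/replica operations are stubbed as returned actions.

#[derive(Debug, PartialEq)]
enum Action {
    DropAndReplace,     // old behavior: tear down and recreate replicas
    StartReconfigure,   // add Reconfigure schedule, deploy reconfigure replicas
    BumpTimeout,        // same config re-issued: push out the deadline
    RestartReconfigure, // new config: drop and relaunch reconfigure replicas
}

fn sequence_alter(
    reconfigure_in_progress: bool,
    cleanup_timeout: Option<u64>,
    config_matches_pending: bool,
) -> Action {
    match (reconfigure_in_progress, cleanup_timeout) {
        // No ongoing reconfigure, no timeout: old drop/replace mechanism.
        (false, None) => Action::DropAndReplace,
        // No ongoing reconfigure, timeout given: start a graceful reconfigure.
        (false, Some(_)) => Action::StartReconfigure,
        // Ongoing reconfigure, no timeout: drop everything, old mechanism.
        (true, None) => Action::DropAndReplace,
        // Ongoing reconfigure, timeout given: bump or restart.
        (true, Some(_)) if config_matches_pending => Action::BumpTimeout,
        (true, Some(_)) => Action::RestartReconfigure,
    }
}
```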
Note:
An `ALTER CLUSTER` statement that is run twice (double shot) either needs to
push out the timeout of an ongoing reconfiguration or needs to entirely replace
the existing reconfigure schedule, dropping and recreating any reconfigure
replicas. I believe the former is the more useful behavior.
I'd like to propose an alternative v1: just block the `ALTER CLUSTER` command until the cutover is complete. I mentioned this briefly on the issue:

> Should the ALTER CLUSTER statement block until the reconfiguration is complete? This is the "correct" behavior from the PostgreSQL perspective. For example, CREATE INDEX and ALTER TABLE both block until the potentially very slow operation is complete. This has the appealing property of allowing us to report failures to complete the operation back to the user. It also seems a good bit simpler to build, because we don't need to build any job scheduling infrastructure.

I think it'd be muuuuch simpler to start like this, since you basically wouldn't need any durable state to record that the alteration was in progress. Here's how it'd work:
1. The `ALTER CLUSTER` command is planned normally, resulting in a call to `sequence_alter_cluster` on the coordinator main loop.
2. If the `ALTER CLUSTER` command involves a reconfiguration, add the new replicas with the `-pending` suffix (and `pending = true` in the `mz_cluster_replicas` table) in a catalog transaction. In the same catalog transaction, ensure there are no existing pending replicas—if there are, send back an error, as another `ALTER CLUSTER` operation on the same cluster is already in progress. (In the future, we might prefer to have subsequent `ALTER CLUSTER` operations just block until the earlier operation completes, but that's more work to rig up.)
3. Off of the coordinator main loop, wait out the `WAIT` policy.
4. Once the `WAIT` policy completes, issue another catalog transaction (on the coordinator main thread) that drops the old replicas and removes the `pending` bit from the new replicas.
5. Report success back to the user.

If the client goes away while the `ALTER` is running, we can have the coordinator abort the operation: just drop all the outstanding pending replicas.
I don't think you'd need to interact with the cluster scheduler at all, other than to make sure that the schedule policy wasn't `ON REFRESH`. The interaction with `SCHEDULE = ON REFRESH` seems real tricky, and since `REFRESH EVERY` views are generally for cold/warm data rather than hot data, the lack of graceful cluster reconfiguration seems quite tolerable.
There's only one subtle bit: what if `envd` crashes in the middle of handling an `ALTER CLUSTER` command? E.g., say the `ALTER CLUSTER` is being executed during the maintenance window. I think the answer is pretty clean: just have `envd` drop all pending replicas when it boots. We know that if we have pending replicas, the old `envd` 1) was in the middle of processing an `ALTER CLUSTER` command and 2) returned a "connection closed" error or somesuch that did not indicate success, and so we can simply drop the pending replicas and pretend the alteration never happened.
And, if blocking `ALTER CLUSTER` statements does turn out to be a pain point, we can roll out support for `ALTER CLUSTER ... (BACKGROUND = TRUE)` in some future release of Materialize.
I have some concerns about implementing non-blocking as a follow up. As someone who regularly deals with cloud infra, I find it much harder to work with things that must be handled over a single connection. Networks are flaky, laptops go to sleep, and connection timeouts threaten longer-lived calls.
This might work fine over PGWIRE, where it's not abnormal to have long running queries (10s of minutes), but HTTP calls via the console may not be as resilient here. The connections themselves may be fine if they don't hit timeouts, but we may find ourselves fighting with user interactions with long-lived connections in the browser.
configuration matches the `prior_config`. This seems somewhat convoluted and
I don't know that allowing for cancellation is necessary as long as we allow for
adjustment of the current reconfigure timeout, and we allow for overwriting the
current reconfiguration entirely.
So, what would be the exact workflow if the new size is not big enough? E.g., let's say the user is trying to size down, but it turns out that they can't size down, because the new size is not able to hydrate. Therefore, the user would like to go back to the original size. In this case, should the user give an ALTER CLUSTER to the original size? And then I would expect that the reconfiguration replicas simply disappear, and the active replicas stay active. Is this what will happen?
I still need to update the doc based on the discussion earlier today. Here's where I think we're leaning: if the user believes the new smaller size will not work, they would cancel the existing query being run, assuming they are running a blocking query. If they are running the query in the background, they would issue a cancel command like `cancel alter cluster <x>`. In either case the existing reconfiguration replicas disappear and the actives stay active.
That's my understanding too!
#### Reconfiguration Task

The reconfiguration task will perform the following actions:
What is the task exactly? A tokio task? A coordinator message that gets sequenced and can possibly send other coord messages?
Addressed in the new update, and discussed in the huddle, but I'll do a quick outline here.
Initially the thought was a tokio task; however, it seems like it'll be cleaner to do this through a cluster pending reconfiguration message handler that performs the check, finalizes the reconfiguration, and returns to the user.
We'll have to adjust the sequencer to know it shouldn't retire the connection if it's a graceful reconfiguration.
we will treat any disruption in connection as a failed DDL which will need to
be aborted.

For V2, for each cluster, we will want to inspect whether an active
This is less a v2 thing and more a `BACKGROUND = true` thing, yeah? Can it talk about background instead of v2 then? I kinda think, since background isn't fully defined here, we should just omit this altogether, maybe with a note somewhere saying the background or v2 stuff will be in a later design doc.
I'm guessing the rollout of v1, v2 is actually:
1. support for `WAIT FOR` foreground
2. support for `WAIT UNTIL` foreground
3. support for background
Thanks for writing this design doc! I want to capture one thought I had, but don't feel obliged in any way to incorporate it in your design! In the past we had the idea of replica sets, which group replicas in a cluster based on properties, and allow reasoning about each independently of the cluster. In the case of graceful reconfiguration, this could argue about replica sets instead of specific replicas, which might make the reasoning simpler. For example, it avoids having to think about how to name pending replicas, because each replica set provides a namespace. For context, we didn't implement replica sets last year because it was a large change that complicated a user-facing feature even further, but maybe it's time to reconsider this eventually? (#21351)
@antiguru very cool! Conceptually I like the idea of using replica sets to manage groups of replicas for graceful reconfiguration. I suspect we'd have to do something like swap out the default/implicit replica set for the reconfiguration replica set once we've met our check/finalization conditions, then tear down the pre-alter replica set. It seems like, for graceful reconfiguration, we would only want to act on the implicit replica set, and we would not support any graceful reconfiguration features for replica sets themselves. For the sake of this design doc, and the roadmap of the feature, I think it's likely a good idea to hold off on incorporating this, at least for v1. I also suspect incorporating replica sets wouldn't broadly change the design, but would clean up some of the individual replica management.
prior version of the configuration. If a reconfiguration is not ongoing they
should be the only replicas.
- Reconfiguration Replica: These replicas only exist during a reconfiguration,
their names will be suffixed with `reconfig-`. They will move to active
(See comment below about using a `-pending` suffix instead: https://github.com/MaterializeInc/materialize/pull/26785/files#r1606149023)
For v2, we will introduce backgrounded reconfiguration, which will need some
new semantics in order to cancel. Something like the following should suffice.
```sql
CANCEL ALTER CLUSTER <cluster>
```
Alternative strawman syntax for discussion: `ALTER CLUSTER ... CANCEL`.
The main benefit I see is that it'd be easier to search for, and it keeps the cancel statement scoped to `ALTER CLUSTER`, but it also seems like it'd be odd to have the action at the end of the DDL statement.
The argument for `CANCEL ALTER CLUSTER`: if we do introduce more cancelable statements, i.e. jobs, then the syntax there would likely be `CREATE JOB` and `CANCEL JOB`, and a leading `CANCEL` could align more with that. It also begins the DDL with the action being performed, which I like.

This syntax is for v2 and we can probably refine it there, but I'm also fine with `ALTER CLUSTER <cluster> CANCEL`; essentially you could either provide `SET` or `CANCEL` keywords after the cluster name, which seems easy.
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = FOR <interval>,
    BACKGROUND = {true|false},
)
```
- `UNTIL CAUGHT UP`, which will wait for a "catch up" mechanism to return true
  or for a timeout to be met. It will roll back or forward depending on the
  value provided to `ON TIMEOUT`.
```sql
ALTER CLUSTER c1 SET (SIZE 'small') WITH (
    WAIT = UNTIL CAUGHT UP (
        TIMEOUT = <interval>,
        ON TIMEOUT = {CONTINUE | ABORT}
    ),
    BACKGROUND = {true|false},
)
```
I really like how this shook out!
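For discussion, the `WAIT` semantics quoted above could be driven by a loop along these lines. This is a hypothetical Python sketch — the function and names are illustrative, not the coordinator's actual code:

```python
import time
from enum import Enum


class OnTimeout(Enum):
    CONTINUE = "continue"  # roll forward: promote the new replicas anyway
    ABORT = "abort"        # roll back: keep the old replicas


def run_alter(wait_secs, caught_up=None, on_timeout=OnTimeout.CONTINUE,
              poll_secs=0.01):
    """Drive an ALTER ... WITH (WAIT ...) to completion.

    wait_secs:  the WAIT = FOR <interval> delay when caught_up is None,
                otherwise the TIMEOUT of WAIT = UNTIL CAUGHT UP.
    caught_up:  optional zero-argument "catch up" check (e.g. frontier
                parity between old and new replicas).
    Returns "finalize" (swap in the new replicas) or "rollback".
    """
    deadline = time.monotonic() + wait_secs
    while time.monotonic() < deadline:
        if caught_up is not None and caught_up():
            return "finalize"
        time.sleep(poll_secs)
    if caught_up is None:
        return "finalize"  # WAIT FOR: the delay itself is the condition
    # WAIT UNTIL CAUGHT UP timed out; ON TIMEOUT decides the outcome.
    return "finalize" if on_timeout is OnTimeout.CONTINUE else "rollback"


print(run_alter(0.05))                                 # prints "finalize"
print(run_alter(0.05, caught_up=lambda: False,
                on_timeout=OnTimeout.ABORT))           # prints "rollback"
```

Note how `WAIT = FOR` falls out as the degenerate case with no catch-up check, which is part of why the unified syntax reads well.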
replicas.

For v2, we will need to store the AlterClusterPlan durably for all backgrounded reconfigurations.
We will then need the coordinator to attempt to continue an ongoing reconfiguration on restart.
There's no need to continue anything here! On restart the catalog will remove the non-pending replicas, promote the pending ones to non-pending, then proceed with compute controller initialization like normal, which will cause the new desired state to take effect.
Is that the behavior we want? I imagine if environmentd were restarted, leading to a small amount of downtime, we wouldn't want to remove all non-pending, fully hydrated replicas and promote replicas that have an unknown hydration state. I would expect the behavior we want to strive for here is to continue with the alter and only perform the finalization once the alter strategy condition is met.
My understanding is that on envd restart we always get a whole new set of replicas. The new envd does not hook into the in-progress hydrating replicas, so they give us no benefit. Recommend verifying this behavior with the cluster team.
If envd is restarted but some clusterds are not restarted, we aim to not disturb these clusterds and to reuse them as much as possible. That is, the coordinator tries to bootstrap timestamps in a way that existing dataflows on the running clusterds can be reused, and the clusterds try to recognize when compute commands that they receive from a new controller reflect already-running things. See also `platform/reconciliation.md`.
If the v2 mechanism is based not on rehydration but on parity of the frontiers between old and new replicas, then we should be able to handle both the case where existing dataflows are kept alive and the case where full rehydration is required. In the former we'd wait for the new replicas to catch up, and would only finalize the alter once that had occurred. In the latter, since both sets of replicas had their dataflows restarted, we'd expect to have parity immediately.
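A frontier-parity condition like the one described above could be sketched as follows. This is a hypothetical Python illustration; the collection names, timestamp representation, and `max_lag` parameter are assumptions for discussion, not the actual controller API:

```python
def caught_up(active_frontiers, pending_frontiers, max_lag=0):
    """Pending replicas are caught up when, for every collection installed
    on the cluster, the pending replicas' write frontier is within max_lag
    of the active replicas'. If a restart forced full rehydration of both
    sets, their frontiers start out equal and the check passes immediately.

    Frontiers are modeled here as {collection_name: timestamp} dicts.
    """
    for collection, active_ts in active_frontiers.items():
        pending_ts = pending_frontiers.get(collection)
        # A collection the pending set hasn't installed yet, or whose
        # frontier lags by more than max_lag, blocks finalization.
        if pending_ts is None or pending_ts + max_lag < active_ts:
            return False
    return True


active = {"mv1": 100, "idx1": 120}
print(caught_up(active, {"mv1": 100, "idx1": 119}, max_lag=1))  # prints True
print(caught_up(active, {"mv1": 100}))                          # prints False
```

The check is deliberately one-directional: a pending replica running ahead of the active set would never block finalization.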
Motivation
Design doc for graceful reconfiguration of managed clusters.
Currently limited to timeout-based reconfiguration delay (v1).
https://github.com/MaterializeInc/database-issues/issues/5976
Tips for reviewer
Checklist
$T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.