Method to ask CockroachDB if it is "safe" to decommission a node #70486
Comments
@mwang1026 this seems to be in-between kv and server. Wanna have an item on both spreadsheets?
Do we have clear scoping of the work? I don't like two spreadsheets, because if it bubbles to the top of one but not the other, it creates conflict. If we have a clear scope of work we can coordinate priority.
This is an epic.
This didn't make it into the planning doc for kv. @piyush-singh, if you want to make a push for this from the OX side of things, we can talk dependencies.
This looks like a dupe of #55768; I am going to close it in favor of that issue.
Hi @lunevalex, please amend #55768 with all the details (or similar) captured above if you wish to close this issue.
Adding a +1 to the importance of this for CockroachDB Dedicated. A chief use case is a multiregion cluster with ranges pinned to particular regions. If we're asked to remove a region from the cluster, it would be useful to know whether that region can actually be removed.
This item was discussed as part of our post-mortem discussion on Jan 6, 2023 for a customer outage.
@Schtick I've edited your comment to remove the reference to a specific customer.
@AlexTalks could you provide a link to the change that addresses this issue?
Is it #91893?
Is your feature request related to a problem? Please describe.
Our operators have automated the provisioning of CockroachDB clusters on-premises. We would like to be able to ask CockroachDB whether it is safe to remove a node.
The main concern is data redundancy, i.e., how do we know whether we will have enough replicas in each zone or region?
We don't want to inspect zone constraints ourselves; we simply want to ask CockroachDB: if we remove a node, will we be able to avoid an outage? We want to guarantee that we can maintain the correct replication factor (RF) for all databases on the cluster.
For more context, imagine that end users have access to a web UI portal where they can remove nodes. At that scale we can't manually verify every node removal across hundreds of clusters.
For example:
If we have 9 nodes across 3 regions, can we safely remove 4 nodes and maintain quorum for the databases with 5 RF?
If we have 6 nodes in 1 region, can we safely remove 1 node?
Do we have under-replicated ranges that are about to be up-replicated onto node X? (One way to check this today is sketched below.)
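As a rough illustration of that last question, the sketch below assumes the replication reports tables (such as system.replication_stats, populated periodically by CockroachDB since v19.2) are available on the cluster; it lists zones that currently have under-replicated or unavailable ranges, which an operator would want to see at zero before removing a node.

```sql
-- Sketch only; assumes the replication reports tables exist and are
-- reasonably fresh (they refresh on an interval, so results can lag).
SELECT zone_id,
       subzone_id,
       under_replicated_ranges,
       unavailable_ranges
  FROM system.replication_stats
 WHERE under_replicated_ranges > 0
    OR unavailable_ranges > 0;
```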
Describe the solution you'd like
A way to ask this question from the SQL layer would be easy for operators to use.
Alternatively:
cockroach node decommission --dry_run
Describe alternatives you've considered
SQL statements retrieving the replication factor for all zones and then comparing it to node counts.
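A hedged sketch of that workaround follows, assuming the crdb_internal table and column names below match the cluster's version (they are not a stable API):

```sql
-- Step 1: read the configured replication factor (num_replicas) per zone.
SHOW ZONE CONFIGURATIONS;

-- Step 2: count live nodes, grouped by locality, since zone constraints
-- may pin replicas to particular regions.
SELECT locality, count(*) AS live_nodes
  FROM crdb_internal.gossip_nodes
 WHERE is_live
 GROUP BY locality;

-- The operator tooling then has to verify, per zone, that the nodes
-- remaining after the removal still satisfy num_replicas within every
-- locality the zone's constraints allow; that is exactly the check we
-- would prefer CockroachDB to expose directly.
```

In practice this only approximates the answer, since it ignores details such as decommissioning nodes, disk capacity, and per-table subzone configurations, which is part of why a built-in check would be preferable.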
Additional context
We have seen that there is "cockroach node decommission"; however, it does not appear to finish gracefully in situations like those described above.
gz#9825
gz#10113
gz#10216
Jira issue: CRDB-10098
Epic CRDB-20924