
Method to ask cockroachdb if it is "safe" to decommission a node #70486

Closed
data-matt opened this issue Sep 21, 2021 · 12 comments
Assignees
Labels
A-cli-admin CLI commands that pertain to controlling and configuring nodes A-kv-decom-rolling-restart Decommission and Rolling Restarts C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item. T-kv KV Team

Comments

@data-matt

data-matt commented Sep 21, 2021

Is your feature request related to a problem? Please describe.
Our operators have automated the provisioning of CockroachDB clusters on-premises. We would like to be able to ask CockroachDB whether it is safe to remove a node.

The main concern is around data redundancy, i.e., how do we know whether we will have enough replicas in a zone or region?
We don't want to inspect zone constraints; we want to simply ask CockroachDB whether, if we remove a node, we can avoid an outage. We want to guarantee that we can maintain the correct replication factor (RF) for all databases on the clusters.

For more context, imagine that end users have access to a web UI portal where they can remove nodes. At scale, we can't manually verify every node removal across hundreds of clusters.

For example:
If we have 9 nodes across 3 regions, can we safely remove 4 nodes and maintain quorum for the databases with 5 RF?
If we have 6 nodes in 1 region, can we safely remove 1 node?
Do we have under-replicated ranges that are about to be up-replicated to node X?

Describe the solution you'd like
A solution that lets operators ask this question from the SQL layer would be easy to use.

Alternatively:
cockroach node decommission --dry-run

Describe alternatives you've considered
SQL statements retrieving the replication factor for all zones and then comparing it to node counts.
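For illustration, the workaround might look something like the following. This is only a rough sketch: the statements exist in current CockroachDB, but the exact output shape varies by version, and parsing num_replicas out of zone configurations is left as an exercise.

```sql
-- Illustrative workaround sketch: compare each zone's configured
-- replication factor against the number of nodes the cluster knows about.
SHOW ZONE CONFIGURATIONS;                         -- read num_replicas per zone

SELECT count(*) FROM crdb_internal.gossip_nodes;  -- nodes currently known
```

Note that this still ignores zone constraints and localities entirely, which is exactly why a built-in safety check would be preferable.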

Additional context
We have seen that there is "cockroach node decommission"; however, it does not appear to finish gracefully in situations like those described above.

gz#9825

gz#10113

gz#10216

Jira issue: CRDB-10098

Epic CRDB-20924

@data-matt data-matt added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 21, 2021
@thtruo thtruo added the A-cli-admin CLI commands that pertain to controlling and configuring nodes label Sep 21, 2021
@blathers-crl blathers-crl bot added T-server-and-security DB Server & Security T-kv KV Team labels Sep 23, 2021
@knz
Contributor

knz commented Sep 23, 2021

@mwang1026 this seems to be in-between kv and server. Wanna have an item on both spreadsheets?

@mwang1026

Do we have clear scoping of the work? I don't like two spreadsheets, because if it bubbles to the top of one but not the other, it creates conflict. If we have a clear scope of work, we can coordinate prioritization.

@knz
Contributor

knz commented Sep 23, 2021

This is an epic.
There are two tasks under it:

@mwang1026

This didn't make it into the planning doc for kv. @piyush-singh if you want to make a push for this from the OX side of things we can talk dependencies

@lunevalex
Collaborator

This looks like a dupe of #55768, I am going to close it in favor of that issue.

@data-matt
Author

data-matt commented Oct 28, 2021

Hi @lunevalex,
I do not believe #55768 matches this issue for the following reasons:

@data-matt data-matt reopened this Oct 28, 2021
@data-matt
Author

@lunevalex, please amend #55768 with all the details (or similar) captured above if you wish to close this issue.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Nov 18, 2021
@knz knz added A-kv-decom-rolling-restart Decommission and Rolling Restarts T-kv KV Team and removed T-server-and-security DB Server & Security labels Jun 16, 2022
@AlexTalks AlexTalks self-assigned this Oct 27, 2022
@DuskEagle
Member

Adding a +1 to the importance of this for CockroachDB Dedicated. A chief use case is a multi-region cluster with ranges pinned to particular regions: if we're asked to remove a region from the cluster, it would be useful to know in advance whether the region can actually be removed.

If we just cockroach node decommission nodes one at a time, like we do today, the decommissioning process stalls when removing the last $REPLICATION_FACTOR nodes of the region.

Since cockroach node decommission can take a list of node IDs, ideally we could run cockroach node decommission --dry-run with a list of all the node IDs for that region, and this preflight check would verify that removing the region can succeed before the process begins.
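To make the proposal concrete, an invocation might look like this. Everything here is hypothetical: --dry-run is the flag proposed in this thread (not a shipped option at the time), and the node IDs, host, and certs directory are placeholders.

```shell
# Hypothetical preflight check: pass every node in the region at once and
# ask whether decommissioning could complete without stalling.
# (--dry-run is the proposed flag, not shipped behavior.)
cockroach node decommission 4 5 6 \
  --dry-run \
  --host=any-live-node:26257 \
  --certs-dir=certs
```

The desired behavior would be a non-zero exit if the remaining nodes could not absorb the replicas without violating replication constraints, so automation can abort before touching the cluster.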

@Schtick Schtick added the O-postmortem Originated from a Postmortem action item. label Jan 19, 2023
@Schtick
Collaborator

Schtick commented Jan 19, 2023

This item was discussed as part of our post-mortem discussion on Jan 6 2023 for a customer outage:
Remove Region Failure Leads To Full Cluster Down Outage

@rafiss
Collaborator

rafiss commented Jan 19, 2023

@Schtick I've edited your comment to remove the reference to a specific customer.

@rafiss
Collaborator

rafiss commented Mar 7, 2023

@AlexTalks could you provide a link to the change that addresses this issue?

@data-matt
Author

Is it #91893 ?
