
Method to ask cockroachdb if it is "safe" to decommission a node #70486

Closed
data-matt opened this issue Sep 21, 2021 · 12 comments
Assignees
Labels
A-cli-admin CLI commands that pertain to controlling and configuring nodes A-kv-decom-rolling-restart Decommission and Rolling Restarts C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-postmortem Originated from a Postmortem action item. T-kv KV Team

Comments

@data-matt

data-matt commented Sep 21, 2021

Is your feature request related to a problem? Please describe.
Our operators have automated the provisioning of CockroachDB clusters on-premises. We would like to be able to ask CockroachDB whether it is safe to remove a node.

The main concern is around data redundancy, i.e., how do we know whether we will have enough replicas in a zone or region?
We don't want to inspect zone constraints; we want to simply ask CockroachDB whether, if we remove a node, we can avoid an outage. We want to guarantee that we can maintain the correct replication factor (RF) for all databases on the clusters.

For more context, imagine that end users have access to a web UI portal where they can remove nodes. At scale, we can't manually verify every node removal across hundreds of clusters.

For example:
If we have 9 nodes across 3 regions, can we safely remove 4 nodes and maintain quorum for the databases with 5 RF?
If we have 6 nodes in 1 region, can we safely remove 1 node?
Do we have under-replicated ranges that are about to be up-replicated to node X?

Describe the solution you'd like
A solution that lets operators ask this question from the SQL layer would be easy to use.

Alternatively:
cockroach node decommission --dry-run

Describe alternatives you've considered
SQL statements retrieving the replication factor for all zones and then comparing it to node counts.
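For illustration, the workaround might look something like the following. This is only a rough sketch: the statements exist in current CockroachDB, but the exact output shape varies by version, and parsing num_replicas out of zone configurations is left as an exercise.

```sql
-- Illustrative workaround sketch: compare each zone's configured
-- replication factor against the number of nodes the cluster knows about.
SHOW ZONE CONFIGURATIONS;                         -- read num_replicas per zone

SELECT count(*) FROM crdb_internal.gossip_nodes;  -- nodes currently known
```

Note that this still ignores zone constraints and localities entirely, which is exactly why a built-in safety check would be preferable.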

Additional context
We have seen that there is "cockroach node decommission"; however, it does not appear to finish gracefully in situations like those described above.

gz#9825

gz#10113

gz#10216

Jira issue: CRDB-10098

Epic CRDB-20924

@data-matt data-matt added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Sep 21, 2021
@thtruo thtruo added the A-cli-admin CLI commands that pertain to controlling and configuring nodes label Sep 21, 2021
@blathers-crl blathers-crl bot added T-server-and-security DB Server & Security T-kv KV Team labels Sep 23, 2021
@knz
Contributor

knz commented Sep 23, 2021

@mwang1026 this seems to be in-between kv and server. Wanna have an item on both spreadsheets?

@mwang1026

Do we have clear scoping of the work? I don't like two spreadsheets, because if it bubbles to the top of one but not the other, it creates conflict. If we have a clear scope of work, we can coordinate prioritization.

@knz
Contributor

knz commented Sep 23, 2021

This is an epic.
There are two tasks under it:

@mwang1026

This didn't make it into the planning doc for kv. @piyush-singh if you want to make a push for this from the OX side of things we can talk dependencies

@lunevalex
Collaborator

This looks like a dupe of #55768, I am going to close it in favor of that issue.

@data-matt
Author

data-matt commented Oct 28, 2021

Hi @lunevalex,
I do not believe #55768 matches this issue for the following reasons:

@data-matt data-matt reopened this Oct 28, 2021
@data-matt
Author

@lunevalex, please amend #55768 with all the details (or similar) captured above if you wish to close this issue.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Nov 18, 2021
@knz knz added A-kv-decom-rolling-restart Decommission and Rolling Restarts T-kv KV Team and removed T-server-and-security DB Server & Security labels Jun 16, 2022
@AlexTalks AlexTalks self-assigned this Oct 27, 2022
@DuskEagle
Member

Adding a +1 to the importance of this for CockroachDB Dedicated. A chief use case is a multi-region cluster with ranges pinned to particular regions: if we're asked to remove a region from the cluster, it would be useful to know in advance whether the region can actually be removed.

If we just cockroach node decommission nodes one at a time, like we do today, the decommissioning process stalls when removing the last $REPLICATION_FACTOR nodes of the region.

Since cockroach node decommission can take a list of node IDs, ideally we could run cockroach node decommission --dry-run with a list of all the node IDs for that region, and this preflight check would verify that removing the region can succeed before the process begins.
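To make the proposal concrete, an invocation might look like this. Everything here is hypothetical: --dry-run is the flag proposed in this thread (not a shipped option at the time), and the node IDs, host, and certs directory are placeholders.

```shell
# Hypothetical preflight check: pass every node in the region at once and
# ask whether decommissioning could complete without stalling.
# (--dry-run is the proposed flag, not shipped behavior.)
cockroach node decommission 4 5 6 \
  --dry-run \
  --host=any-live-node:26257 \
  --certs-dir=certs
```

The desired behavior would be a non-zero exit if the remaining nodes could not absorb the replicas without violating replication constraints, so automation can abort before touching the cluster.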

@Schtick Schtick added the O-postmortem Originated from a Postmortem action item. label Jan 19, 2023
@Schtick
Collaborator

Schtick commented Jan 19, 2023

This item was discussed as part of our post-mortem discussion on Jan 6 2023 for a customer outage:
Remove Region Failure Leads To Full Cluster Down Outage

@rafiss
Collaborator

rafiss commented Jan 19, 2023

@Schtick I've edited your comment to remove the reference to a specific customer.

@rafiss
Collaborator

rafiss commented Mar 7, 2023

@AlexTalks could you provide a link to the change that addresses this issue?

@data-matt
Author

Is it #91893 ?
