storage: allow removing leaseholder in ChangeReplicas #40333

Closed
tbg opened this issue Aug 29, 2019 · 3 comments
Labels
C-bug (code not up to spec/doc; solution expected to change code/behavior), no-issue-activity, T-kv (KV Team), X-stale

Comments

tbg commented Aug 29, 2019

The replicate queue code is littered with decisions to transfer a lease away simply because the leaseholder should be removed. When the replication factor is low or there are constraints on the range it can happen that we can't transfer this lease away until we've added another replica. Adding another replica before removing the leaseholder is precisely what we are trying to avoid in #12768.

TestInitialPartitioning definitely hits this problem (it will fail if rebalancing is always carried out atomically), so it's a good starting point for investigation.

The lease transfer code that the queue uses (centered around allocator.TransferLeaseTarget) is also very hard to reason about, so I'm not even sure whether it sometimes fails to find a target when one exists.
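For concreteness, a minimal Go sketch of the kind of decision involved; the names and signature below are hypothetical placeholders, not allocator.TransferLeaseTarget's actual interface:

```go
package main

import "fmt"

type storeID int

type replica struct {
	store storeID
	dc    string // locality, e.g. "dc1"
}

// transferLeaseTarget is a hypothetical stand-in for the job described above:
// pick a replica other than the current leaseholder that satisfies the range's
// constraints. With a replication factor of one, or with constraints that rule
// everyone else out, there is no target, and today the queue must add a replica
// before it can remove the leaseholder (the add-then-remove pattern #12768 avoids).
func transferLeaseTarget(leaseholder storeID, replicas []replica, requiredDC string) (storeID, bool) {
	for _, r := range replicas {
		if r.store == leaseholder {
			continue
		}
		if requiredDC != "" && r.dc != requiredDC {
			continue
		}
		return r.store, true
	}
	return 0, false
}

func main() {
	rs := []replica{{store: 1, dc: "dc1"}}
	_, ok := transferLeaseTarget(1, rs, "dc1")
	fmt.Println(ok) // false: no target, so the leaseholder cannot simply be removed
}
```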

With all that in mind, it'd be nice if we could just issue any replication change, including one that removes the leaseholder. The reason we don't allow this today is that it would either wedge the range (the lease remains active, but the removed leaseholder no longer serves the range and nobody else can take the lease) or cause potential anomalies (if we allow another node to get the lease without properly invalidating the old one).

The right way to make this work, I think, is to use the fact that the replica change removing the leaseholder is necessarily evaluated and proposed on the leaseholder. If we intercept that request accordingly and make sure that it causes the leaseholder to behave as if a TransferLease had been issued (setting its minProposedTS, etc.), then we could treat a lease whose leaseholder store isn't in the descriptor as invalid (even if the epoch is still active), meaning that any member of the range could obtain the lease immediately after the replica change has gone through.
We'll need to allow leases to be obtained by VOTER_INCOMING replicas, but this is fine.
We could also add a leaseholder hint to the replica change to fix the lease transfer target, which is something the replicate queue (or allocator 2.0) would want to do.
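A minimal sketch of the proposed validity rule, using simplified stand-in types rather than CockroachDB's actual lease and descriptor structures:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the lease and the range descriptor;
// this is not CockroachDB's real lease or descriptor API.
type storeID int

type lease struct {
	holder      storeID
	epochActive bool // node liveness epoch still active
}

type rangeDescriptor struct {
	stores []storeID // stores with a replica, including VOTER_INCOMING
}

func (d rangeDescriptor) contains(s storeID) bool {
	for _, cur := range d.stores {
		if cur == s {
			return true
		}
	}
	return false
}

// leaseValid sketches the proposed rule: besides the epoch being active, the
// leaseholder's store must still appear in the descriptor. Once the leaseholder
// has removed itself via ChangeReplicas (behaving as if a TransferLease had been
// issued, e.g. bumping minProposedTS), the lease counts as invalid and any
// remaining member, including a VOTER_INCOMING, may acquire a new one
// immediately instead of waiting for the old epoch to expire.
func leaseValid(l lease, desc rangeDescriptor) bool {
	return l.epochActive && desc.contains(l.holder)
}

func main() {
	l := lease{holder: 1, epochActive: true}
	before := rangeDescriptor{stores: []storeID{1, 2, 3}}
	after := rangeDescriptor{stores: []storeID{2, 3}} // store 1 removed itself

	fmt.Println(leaseValid(l, before)) // true
	fmt.Println(leaseValid(l, after))  // false
}
```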

Jira issue: CRDB-5549

tbg added the C-bug label Aug 29, 2019
tbg added a commit to tbg/cockroach that referenced this issue Aug 30, 2019
As of cockroachdb#40284, the replicate queue was issuing swaps (atomic add+remove)
during rebalancing.
TestInitialPartitioning helpfully points out (once you flip atomic
rebalancing on) that when the replication factor is one, there is
no way to perform such an atomic swap because it will necessarily
have to remove the leaseholder.

To work around this restriction (which, by the way, we dislike - see
cockroachdb#40333), fall back to just adding a replica in this case without also
removing one. In the next scanner cycle (which should happen immediately
since we requeue the range) the range will be over-replicated and
hopefully the lease will be transferred over and then the original
leaseholder removed. I'm very doubtful that this all works,
but it is how things worked until cockroachdb#40284, so this PR really just
falls back to the previous behavior in cases where we can't do
better.

Release note: None
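As a hedged illustration of the fallback described above (placeholder names, not the replicate queue's real API), the decision boils down to something like:

```go
package main

import "fmt"

type storeID int

// planRebalance sketches the fallback: when the planned atomic swap would
// remove the current leaseholder (unavoidable at replication factor one),
// drop the removal, only add a replica, and requeue the range so a later
// scanner pass can transfer the lease and remove the old leaseholder.
// All names here are illustrative placeholders.
func planRebalance(leaseholder, removeTarget, addTarget storeID) (adds, removes []storeID, requeue bool) {
	if removeTarget == leaseholder {
		// Pre-cockroachdb#40284 behavior: add only, fix up over-replication later.
		return []storeID{addTarget}, nil, true
	}
	return []storeID{addTarget}, []storeID{removeTarget}, false
}

func main() {
	adds, removes, requeue := planRebalance(1, 1, 2)
	fmt.Println(adds, removes, requeue) // [2] [] true: add now, swap on the next pass
}
```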
tbg added two further commits to tbg/cockroach that referenced this issue Sep 3, 2019 (same commit message as above)
craig bot pushed a commit that referenced this issue Sep 3, 2019
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg

(same commit message as above)

Co-authored-by: Tobias Schottdorf <tobias.schottdorf@gmail.com>
ajwerner added a commit to ajwerner/cockroach that referenced this issue Sep 20, 2019
Prior to this commit, the partitioning tests worked by creating a 3-node cluster
and expressing constraints over the three nodes. They then validated that
the cluster conformed to the constraints by querying data and examining the
trace to determine which node held the data.

This is problematic, for one, because it is susceptible to cockroachdb#40333. In rare
cases we'll down-replicate to the wrong single node (e.g. if the right one
is not live) and we won't ever fix it.

It also doesn't exercise leaseholder preferences.

This PR adds functionality to configure clusters with larger numbers of nodes,
where each expectation in the config can now refer to a leaseholder_preference
rather than a constraint, and we'll allocate the additional nodes to 3
datacenters.

This larger test creates dramatically more data movement and has been useful
when testing cockroachdb#40892.

Release justification: Only touches testing and is useful for testing a
release blocker.

Release note: None
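A rough sketch of what such a per-expectation configuration could look like; the type, field names, and generated statements below are illustrative assumptions, not the actual test harness code:

```go
package main

import "fmt"

// expectation sketches one entry in the extended partitioning test config:
// it can target either a replication constraint or a leaseholder preference,
// with the extra nodes spread across three datacenters.
type expectation struct {
	partition             string
	constraint            string // e.g. "+dc=dc1"
	leaseholderPreference string // e.g. "+dc=dc2"; takes precedence if set
}

// zoneConfigSQL renders an expectation as an (illustrative) zone config
// statement of the kind CockroachDB uses to pin replicas or leaseholders.
func zoneConfigSQL(table string, e expectation) string {
	if e.leaseholderPreference != "" {
		return fmt.Sprintf(
			"ALTER PARTITION %s OF TABLE %s CONFIGURE ZONE USING lease_preferences = '[[%s]]'",
			e.partition, table, e.leaseholderPreference)
	}
	return fmt.Sprintf(
		"ALTER PARTITION %s OF TABLE %s CONFIGURE ZONE USING constraints = '[%s]'",
		e.partition, table, e.constraint)
}

func main() {
	fmt.Println(zoneConfigSQL("t", expectation{partition: "p0", leaseholderPreference: "+dc=dc2"}))
	fmt.Println(zoneConfigSQL("t", expectation{partition: "p1", constraint: "+dc=dc1"}))
}
```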
ajwerner added a commit to ajwerner/cockroach that referenced this issue Oct 1, 2019 (same commit message as above, adding: "The PR also adds a flag to control how many of these subtests to run.")
ajwerner added a commit to ajwerner/cockroach that referenced this issue Nov 25, 2019 (same commit message as above)

github-actions bot commented Jun 4, 2021

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
5 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

github-actions bot commented

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!

github-actions bot closed this as not planned on Oct 2, 2023
nvanbenschoten (Member) commented

This was addressed by #74077.
