
kvserver: lease counts diverge when a new node is added to a cluster #67740

Closed
aayushshah15 opened this issue Jul 19, 2021 · 17 comments · Fixed by #74077
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@aayushshah15
Contributor

aayushshah15 commented Jul 19, 2021

Describe the problem

In a cluster with at least 3 localities, adding a new node to an existing locality reliably triggers a lease count divergence. The lease counts continue to diverge until this newly added node is fully hydrated with the ~mean number of replicas.

This looks something like the following:
[screenshot: per-node lease counts diverging after the new node is added]

Cause

To understand how this happens, consider a 9-node cluster with 3 racks: rack=0, rack=1, and rack=2. Let's walk through adding a new node (n10) to rack=0. Because of the diversity heuristic, only the existing nodes in rack=0 are allowed to shed replicas to n10. For ranges whose leases are also in rack=0, this means those leaseholders must first transfer their leases to nodes in racks 1 or 2 and rely on those nodes to execute the rebalance. This continues until n10 has received roughly the mean number of replicas relative to the rest of the cluster.

However, in a cluster with enough data, fully hydrating the new node will take a while, sometimes on the order of hours (note that n10 can only receive new replicas at the snapshot rate dictated by kv.snapshot_rebalance.max_rate). Until this happens, nodes in rack=0 will continue shedding leases away to nodes in racks 1 and 2 until they basically have zero leases.
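To make the constraint concrete, here is a minimal illustrative Go sketch of the decision described above. It is not the actual allocator code; the store struct and the names canShedReplicaTo and mustTransferLeaseFirst are hypothetical stand-ins.

package main

import "fmt"

type store struct {
    node  int
    rack  int
    lease bool // does this store hold the range's lease?
}

// canShedReplicaTo models the diversity heuristic: only a store in the same
// rack as the new, under-full node may move a replica onto it, since moving a
// replica in from another rack would reduce locality diversity.
func canShedReplicaTo(s store, newNodeRack int) bool {
    return s.rack == newNodeRack
}

// mustTransferLeaseFirst models the current restriction: a leaseholder cannot
// remove itself, so before it can rebalance its replica to the new node it has
// to hand the lease to a replica in another rack.
func mustTransferLeaseFirst(s store, newNodeRack int) bool {
    return canShedReplicaTo(s, newNodeRack) && s.lease
}

func main() {
    leaseholder := store{node: 1, rack: 0, lease: true}
    // true: the lease has to leave rack=0 before n10 can receive this replica.
    fmt.Println(mustTransferLeaseFirst(leaseholder, 0))
}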

To Reproduce

I can reproduce this by following the steps outlined above on both 20.2 and 21.1.

Additional details

Until n10 is fully hydrated, nodes in rack=0 will continue hitting the considerRebalance path (for the ranges for which they are the leaseholders) in the allocator:

return rq.considerRebalance(ctx, repl, voterReplicas, nonVoterReplicas, canTransferLeaseFrom, dryRun)

Since there is indeed a valid rebalance candidate, RebalanceVoter will return ok==true here:

addTarget, removeTarget, details, ok := rq.allocator.RebalanceVoter(
    ctx,
    zone,
    repl.RaftStatus(),
    existingVoters,
    existingNonVoters,
    rangeUsageInfo,
    storeFilterThrottled,
)

This will then lead to the call to maybeTransferLeaseAway here:

} else if done, err := rq.maybeTransferLeaseAway(
    ctx, repl, removeTarget.StoreID, dryRun, canTransferLeaseFrom,

This will transfer the lease away to another replica.
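Paraphrased as a minimal, self-contained Go sketch: the functions below are hypothetical stand-ins for the Allocator.RebalanceVoter and replicateQueue.maybeTransferLeaseAway calls quoted above, not the real code path.

package main

import "fmt"

type target struct{ storeID int }

// rebalanceVoter stands in for Allocator.RebalanceVoter: while n10 is below
// the mean replica count, it keeps returning a valid add/remove pair with
// ok == true.
func rebalanceVoter() (add, remove target, ok bool) {
    return target{storeID: 10}, target{storeID: 1}, true
}

// maybeTransferLeaseAway stands in for replicateQueue.maybeTransferLeaseAway:
// if the local store is the removal target and holds the lease, the lease is
// shed to another replica (in rack 1 or 2) before the rebalance can run.
func maybeTransferLeaseAway(localStoreID int, remove target) bool {
    return localStoreID == remove.storeID
}

func main() {
    const localStoreID = 1 // a leaseholder in rack=0
    _, remove, ok := rebalanceVoter()
    if ok && maybeTransferLeaseAway(localStoreID, remove) {
        fmt.Println("lease transferred away; the rebalance is left to the new leaseholder")
    }
}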

/cc. @cockroachdb/kv

gz#5876

Epic CRDB-10569

gz#9817

@aayushshah15 aayushshah15 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. labels Jul 19, 2021
@aayushshah15
Contributor Author

I wanted to post this more generally because I'm not quite sure what behaviour we'd rather have here. I also struggled to come up with any concrete ideas that would cleanly improve the situation.

@irfansharif
Contributor

Is it possible to have a leaseholder execute a membership change removing itself from the raft group? (Either by letting the lease expire, or in a more coordinated fashion by transferring the lease to the incoming replica.) That would reduce the number of lease transfers needed for a given range from 2 to 1, and the lease would stay in the same rack/locality.

If that's not possible, can we have an outgoing leaseholder hold onto its lease over the duration of the rebalance, transferring it away to another locality only when the learner replica has been fully caught up? That would reduce how long a cluster observes this diverged state.

@nvanbenschoten
Member

Is it possible to have a leaseholder execute a membership change removing itself away from the raft group?

This is something that @tbg has discussed in the past. It's certainly not possible today, because we don't have a way to coalesce a lease transfer with a membership change into the same Raft log entry. If we did, then I think we could simplify this and a few other edge cases with 1x replication.

can we have an outgoing leaseholder hold onto its lease over the duration of the rebalance, transferring it away to another locality only when the learner replica has been fully caught up? That would reduce how long a cluster observes this diverged state.

This is a good idea, though I don't think it would work today because we eagerly throw away LEARNER replicas from previous ChangeReplicas attempts whenever we see them (see maybeLeaveAtomicChangeReplicas). Eagerly discarding them isn't strictly necessary, but keeping them around would require us to break the contract that a LEARNER replica is always transient and never outlives its AdminChangeReplicasRequest. That contract prevents us from coordinating the lifetime of a LEARNER across multiple ChangeReplicas attempts.

An interesting variation of this for when we know we want a longer-lived learner would be to create a NON_VOTER instead of a LEARNER before transferring away the lease and then allow the new leaseholder to perform the VOTER<->NON_VOTER swap before removing the old VOTER.

@lunevalex
Collaborator

Relates to #51867

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 21, 2021
@lunevalex
Collaborator

@aayushshah15 there was a previous idea described here #40333 for how to fix this.

@nvanbenschoten
Member

Here's an easy local reproduction of this issue that may come in handy when trying to fix it:

# if not already created and staged
> roachprod create local -n 4
> roachprod put local cockroach

> roachprod start local:1-3 --racks=3

> cockroach workload init kv --splits=10000

> roachprod start local:4 --racks=3

[screenshot: per-store lease counts diverging after starting the fourth node]


I looked briefly into this intermediate NON_VOTER idea and it doesn't look trivial. After a refresh of the code, I believe the "then allow the new leaseholder to perform the VOTER<->NON_VOTER swap before removing the old VOTER" step would require a good amount of specialized code. Without that, the new leaseholder's allocator would first issue an AllocatorRemoveNonVoter action before attempting an AllocatorConsiderRebalance action. So it would throw away the temporary non-voter before it noticed that it could use it to perform the snapshot-less swap.

We could address this by maintaining more state about the purpose of the temporary non-voter in the range descriptor. But this wouldn't be a small change.


This makes me question whether transferring the lease away for a short duration is even good enough to consider this issue resolved. In setups like multi-region deployments with one replica per region and leaseholder preferences (note: REGION survivability uses 5x replication, so it doesn't hit this), violating these preferences even for a short period of time will be disruptive.

So why are we doing so in the first place? The straightforward answer, spelled out in #40333, is that we can't perform a ChangeReplicas Raft operation atomically with a lease transfer. But the less clear-cut follow-up question is whether we actually need that to accomplish what we want here. Atomic replication changes, despite the name, are not atomic. During the joint consensus stage of an atomic voter rebalance, there is a window during which we have two voters in the same locality tier. It seems possible that we could delay the lease transfer until the destination replica is a VOTER_INCOMING and the source replica is a VOTER_DEMOTING_LEARNER.

For this to be safe, we'd need to change our leaseholder eligibility rules in two ways. First, we'd need to allow VOTER_INCOMING replicas to hold the lease, and, second, we'd need to allow VOTER_DEMOTING_LEARNER replicas to hold the lease. The first of these seems perfectly safe. A VOTER_INCOMING can only ever transition to a VOTER_FULL, so there's no reason it shouldn't be able to hold the lease. We've already noted this here. Letting a VOTER_DEMOTING_LEARNER remain the leaseholder is a little more involved, but it should also be possible to make safe. To do so, we would need to ensure that we never finalize the atomic replication change and move the VOTER_DEMOTING_LEARNER to a LEARNER before the leaseholder has been moved off this replica. This last part would be involved, but not terribly so.

I think that if we made this change, we would not only fix this issue sufficiently but could actually guarantee that the lease would never leave a region during an intra-region rebalancing operation, all while never violating the invariant that a voting replica always holds the lease.
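As a rough illustration of the relaxed eligibility rule being proposed here (hypothetical types and function names, not CRDB's actual lease-eligibility code):

package main

import "fmt"

type replicaType int

const (
    voterFull replicaType = iota
    voterIncoming
    voterDemotingLearner
    learner
    nonVoter
)

// eligibleForLease captures the proposed rule: VOTER_FULL and VOTER_INCOMING
// replicas may always hold the lease; a VOTER_DEMOTING_LEARNER may keep it
// only while the range is still in the joint configuration, i.e. before the
// change is finalized and the replica becomes a plain LEARNER.
func eligibleForLease(t replicaType, inJointConfig bool) bool {
    switch t {
    case voterFull, voterIncoming:
        return true
    case voterDemotingLearner:
        return inJointConfig
    default:
        return false
    }
}

func main() {
    fmt.Println(eligibleForLease(voterIncoming, true))        // true
    fmt.Println(eligibleForLease(voterDemotingLearner, true)) // true: the lease must move before exiting the joint config
    fmt.Println(eligibleForLease(voterDemotingLearner, false)) // false
}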

@tbg
Member

tbg commented Aug 17, 2021

It seems possible that we could delay the lease transfer until the destination replica is a VOTER_INCOMING and the source replica is a VOTER_DEMOTING_LEARNER.

That sounds like a good idea. Something else that bugs me about the original issue is that we have a replica locally in the same rack, yet we choose to move the lease possibly far away and then send the snapshot from there. That would also be avoided. Of course, exploring the idea of requesting snapshots from suitable replicas (regardless of leadership status) would give us even more flexibility, but that's orthogonal.

@aayushshah15
Contributor Author

but we chose to send the leaseholder to possibly far away and then send a snapshot from there

This actually came up in a customer call last week as something that substantially adds to CRDB's operating costs, because network traffic within the same AZ is ~free whereas cross-AZ traffic is not. Our current inability to ensure that snapshots are sent from within the same locality also makes operations like node decommissioning quite expensive (which has prompted customers to try some unsupported workflows).

@irfansharif
Contributor

Related: #42491

@nvanbenschoten
Member

@shralex here is a collection of code pointers and areas for investigation.

The first is about raft leadership and the flexibility of etcd/raft. If we want to transfer a lease from a VOTER_OUTGOING to a VOTER_INCOMING, we'd like to ensure that raft leadership can follow along. This means raft leadership would need to be transferred from a replica that is a voter only in the outgoing (left) half of a joint quorum to a replica that is a voter only in the incoming (right) half. In the notation we used for joint quorum configurations, and with the leader on replica 3, we would need to (1) be able to enter the joint quorum configuration (1, 2, 3)&&(1, 2, 4), and then (2) be able to transfer leadership from replica 3 to replica 4. I believe both of these operations are possible, but we should confirm. @tbg do you know off the top of your head? If this is not possible, we'll need to determine whether the limitation is artificial or fundamental, and lift it if it can be lifted. Regardless of whether this is possible today, we should add a test for this in https://github.com/etcd-io/etcd/tree/main/raft/testdata.
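For reference, a hedged sketch of what step (1) looks like with etcd/raft's ConfChangeV2 API. The import path assumes etcd's raft/v3 module, and CRDB actually demotes the outgoing voter to a learner rather than removing it outright, but the joint configuration shape is the same.

package main

import (
    "fmt"

    "go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
    // Step (1): enter the joint configuration (1, 2, 3) && (1, 2, 4) by adding
    // replica 4 as a voter and removing replica 3, using an explicit joint
    // transition so the group stays in the joint config until told to leave it.
    cc := raftpb.ConfChangeV2{
        Transition: raftpb.ConfChangeTransitionJointExplicit,
        Changes: []raftpb.ConfChangeSingle{
            {Type: raftpb.ConfChangeAddNode, NodeID: 4},
            {Type: raftpb.ConfChangeRemoveNode, NodeID: 3},
        },
    }
    fmt.Println(cc.String())

    // Step (2) would then be a node.TransferLeadership(ctx, 3, 4) call while
    // the group is still in the joint configuration; whether etcd/raft allows
    // leadership to move to the incoming-only voter is the open question above.
}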

Next, we can start looking at how CRDB performs rebalance operations. Let's start at the replicateQueue. The replicateQueue is one of a handful of "replica queues" that periodically process each replica on a given store and perform actions on them. Alongside the replicateQueue sit the consistencyQueue, the gcQueue, the mergeQueue, the raftLogQueue, the raftSnapshotQueue, the replicaGCQueue, the splitQueue, and the timeSeriesMaintenanceQueue. The replicateQueue is in charge of periodically picking up each replica and checking for replication config changes, executing them if necessary. The replicateQueue makes decisions about what operation to perform, if any, using the Allocator. Lateral replica movement is suggested by the Allocator when no other operation is necessary:

action = AllocatorConsiderRebalance

In such cases, the replicateQueue checks for rebalance targets in its shouldQueue method here:

rangeUsageInfo := rangeUsageInfoForRepl(repl)

If a target for a lateral rebalance is established, the queue proceeds to its process method to execute the rebalance here:

case AllocatorConsiderRebalance:
return rq.considerRebalance(ctx, repl, voterReplicas, nonVoterReplicas, canTransferLeaseFrom, dryRun)

A replica can only initiate a replication change if it is the current leaseholder. So if a replica reaches this part of the code, it knows (with enough certainty) that it is the current leaseholder. When deciding which replica to remove in a lateral rebalance, it therefore takes special notice of whether it itself is the removal target. In that case, it immediately transfers the lease away:

} else if done, err := rq.maybeTransferLeaseAway(
    ctx, repl, removeTarget.StoreID, dryRun, canTransferLeaseFrom,

This is where we'll need to start making changes. As discussed above and in person, we'd like to defer this lease transfer until after the incoming replica has received its initial snapshot. So let's see where that happens.

Back to the code: the replica then (if it didn't just transfer the lease away) calls into replicateQueue.changeReplicas, which calls into Replica.changeReplicasImpl, which coordinates the replication change across several steps. There's a lot in that function that is worth a read. One key point is that each stage of the replication change is performed using a distributed transaction (see execChangeReplicasTxn) with a special "commit trigger". These commit triggers are a bit of extra code that runs atomically with the commit on the transaction's base range. They cause a few extra pieces of state to be attached to the EndTxn request's Raft proposal, which are passed to etcd/raft when the proposal is made and finalized when it commits and applies.

I'll leave this by noting that the rebalance operation enters and leaves its joint voter configuration in Replica.execReplicationChangesForVoters:

// execReplicationChangesForVoters carries out the atomic membership change that
// finalizes the addition and/or removal of voting replicas. Any voters in the
// process of being added (as reflected by the replication changes) must have
// been added as learners already and caught up. Voter removals (from the
// replication changes) will be carried out by first demoting to a learner
// instead of outright removal (this avoids a [raft-bug] that can lead to
// unavailability). All of this occurs in one atomic raft membership change
// which is carried out across two phases. On error, it is possible that the
// range is in the intermediate ("joint") configuration in which a quorum of
// both the old and new sets of voters is required. If a range is encountered in
// this state, maybeLeaveAtomicReplicationChange can fix this, but it is the
// caller's job to do this when necessary.
//
// The atomic membership change is carried out chiefly via the construction of a
// suitable ChangeReplicasTrigger, see prepareChangeReplicasTrigger for details.
//
// When adding/removing only a single voter, joint consensus is not used.
// Notably, demotions must always use joint consensus, even if only a single
// voter is being demoted, due to a (liftable) limitation in etcd/raft.
//
// [raft-bug]: https://github.com/etcd-io/etcd/issues/11284
func (r *Replica) execReplicationChangesForVoters(

So this is where we could move the lease transfer from the VOTER_OUTGOING to the VOTER_INCOMING. But that's likely not where we'll actually want to put it. The reason for this is that we need to be able to recover (i.e. leave the joint quorum configuration) from a partially successful rebalance and roll back or roll forward. In the cases where we need to roll forward, we still want to perform the lease transfer, because we can't let the VOTER_OUTGOING be removed while it still holds the lease. Luckily, we have a single place where we perform this roll forward, in maybeLeaveAtomicChangeReplicas, which is called by execReplicationChangesForVoters. So I think this is where we'll want to put the lease transfer.
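Sketched very roughly (hypothetical types and helpers, not the actual maybeLeaveAtomicChangeReplicas code), the roll-forward could look like this:

package main

import "fmt"

type replicaType int

const (
    voterFull replicaType = iota
    voterIncoming
    voterDemotingLearner
)

type replica struct {
    id  int
    typ replicaType
}

type rangeState struct {
    replicas    []replica
    leaseholder int
    joint       bool
}

// transferLease and exitJointConfig are stand-ins for the real lease-transfer
// and ChangeReplicas machinery.
func transferLease(r *rangeState, to int) { r.leaseholder = to }
func exitJointConfig(r *rangeState)       { r.joint = false }

// maybeLeaveJointConfig sketches the proposed roll-forward: before leaving the
// joint configuration, move the lease off a VOTER_DEMOTING_LEARNER that still
// holds it, preferably onto the VOTER_INCOMING in the same locality, so that
// we never demote a leaseholder.
func maybeLeaveJointConfig(r *rangeState) {
    if !r.joint {
        return
    }
    for _, rep := range r.replicas {
        if rep.id == r.leaseholder && rep.typ == voterDemotingLearner {
            for _, cand := range r.replicas {
                if cand.typ == voterIncoming {
                    transferLease(r, cand.id)
                    break
                }
            }
        }
    }
    exitJointConfig(r)
}

func main() {
    r := &rangeState{
        replicas:    []replica{{1, voterDemotingLearner}, {2, voterFull}, {3, voterFull}, {4, voterIncoming}},
        leaseholder: 1,
        joint:       true,
    }
    maybeLeaveJointConfig(r)
    fmt.Println(r.leaseholder, r.joint) // 4 false
}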

That's a high-level lay of the land, but there will certainly be things that come up when we begin touching the code. One thing that I'm aware of from spending a short amount of time trying to prototype this a few months ago is that this code won't let a Raft proposal remove its proposer. But it's using the wrong definition of "remove", so it won't let the proposer move to VOTER_OUTGOING. That's one of the many things we'll need to resolve.

@tbg
Member

tbg commented Oct 13, 2021

In the notation we used for joint quorum configurations and with a leader on replica 3, we would need to (1) be able to enter the joint quorum configuration (1, 2, 3)&&(1, 2, 4), and then (2) be able to transfer the leadership from replica 3 to replica 4. I believe both of these operations are possible, but we should confirm. @tbg do you know off the top of your head?

Based on my reading of the code below, this should "just work", but we still need to verify it.

https://github.com/etcd-io/etcd/blob/519f62b269cbc5f0438587cdcd9e3d4653c6515b/raft/raft.go#L1339-L1370

@shralex
Contributor

shralex commented Oct 20, 2021

@tbg
Member

tbg commented Oct 21, 2021

That unit test shows what happens when you remove the raft leader (tl;dr: nothing good, actually pretty bad in terms of loss of availability, though I think it would be temporary in our case, since once the old leader heartbeats anyone it will be told to delete itself, and after the election timeout someone else would campaign). We would try to avoid this case by transferring leadership away while the leader is a VOTER_DEMOTING. But yes, raft definitely has sharp edges here that we may be better off addressing.

We don't control raft elections, so even if we transfer the lease and raft leadership away, it's possible for the demoting replica to slip back in as leader before we actually demote it. And you don't want to run the risk of having a learner that thinks it's the raft leader, or the unavailability mentioned above. However, we should be able to fix this. If we are in a joint config of type (..., VOTER) && (..., LEARNER) today (i.e. the VOTER is in the middle of demoting to a LEARNER), we can "deny" leadership to that replica without sacrificing availability (since on the right side of this quorum it doesn't get to vote anyway). So we could change raft so that outgoing voters don't campaign, which means that once we have moved the lease and raft leadership elsewhere, we can transition into the final configuration without accidentally removing/demoting the raft leader.
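A purely hypothetical sketch of the check being described; the real change would live inside etcd/raft's campaign logic, not in a free-standing function like this.

package main

import "fmt"

// shouldCampaign is a hypothetical stand-in for the check etcd/raft would need:
// a replica that is a voter only in the outgoing half of a joint configuration
// declines to campaign, so it cannot win leadership back while it is being
// demoted.
func shouldCampaign(id uint64, incoming, outgoing map[uint64]bool) bool {
    if outgoing[id] && !incoming[id] {
        return false // demoting voter: it may still vote, but it should not campaign
    }
    return incoming[id]
}

func main() {
    incoming := map[uint64]bool{1: true, 2: true, 4: true} // (1, 2, 4)
    outgoing := map[uint64]bool{1: true, 2: true, 3: true} // (1, 2, 3)
    fmt.Println(shouldCampaign(3, incoming, outgoing))     // false: outgoing-only voter
    fmt.Println(shouldCampaign(4, incoming, outgoing))     // true
}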

@daniel-crlabs
Contributor

As part of my job, I have to ask for updates on tickets that involve our customers. This GitHub issue impacts gh 1228, which was closed in favor of this one.
Could you please provide an update on the status of this issue so I can update our customer? Thank you.

@nvanbenschoten
Member

@daniel-crlabs we are hoping to address this issue in the v22.1 release. Given the complexity involved, the fix here will likely not be backported to earlier release branches.

@daniel-crlabs
Contributor

Sounds great, thank you for the update, I'll let the customer know.

@nvanbenschoten
Member

@shralex to make some of this concrete, here's the order of operations that we would expect before and after this change during a replica rebalance. Imagine we are starting with replicas {1 (leaseholder), 2, 3}, adding replica 4, and removing replica 1.

Current replica rebalance steps:

  1. transfer lease from 1 -> 2 or 3
  2. add 4 as LEARNER
  3. send 4 a snapshot from leaseholder (2 or 3)
  4. enter joint configuration: 1 becomes VOTER_DEMOTING_LEARNER, 4 becomes VOTER_INCOMING
  5. exit joint configuration: 1 becomes LEARNER, 4 becomes VOTER
  6. transfer lease from 2/3 -> 4
  7. (sometime later) learner 1 is removed from Raft group entirely

New replica rebalance steps:

  1. add 4 as LEARNER
  2. send 4 a snapshot from leaseholder (1)
  3. enter joint configuration: 1 becomes VOTER_DEMOTING_LEARNER, 4 becomes VOTER_INCOMING
  4. transfer lease from 1 -> 4
  5. exit joint configuration: 1 becomes LEARNER, 4 becomes VOTER
  6. (sometime later) learner 1 is removed from Raft group entirely

Writing this all out also makes me realize that this change will reduce cross-zone/region network traffic. With the new order of operations, we can send the learner snapshot directly from 1 to 4, instead of sending it from a temporary leaseholder across zones/regions. This gives us part of the benefit promised by #42491 for free.
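For clarity, the new ordering above condensed into a small sketch; the helper here is hypothetical and simply narrates the steps, which in reality span the replicateQueue and ChangeReplicas machinery.

package main

import "fmt"

func step(n int, s string) { fmt.Printf("%d. %s\n", n, s) }

// rebalanceWithDeferredLeaseTransfer narrates the proposed ordering: the lease
// transfer happens inside the joint-configuration window, after the incoming
// replica has its snapshot and before the outgoing voter is demoted.
func rebalanceWithDeferredLeaseTransfer() {
    step(1, "add 4 as LEARNER")
    step(2, "send 4 an initial snapshot from the leaseholder, 1 (same locality)")
    step(3, "enter joint config: 1 -> VOTER_DEMOTING_LEARNER, 4 -> VOTER_INCOMING")
    step(4, "transfer lease from 1 -> 4 while still in the joint config")
    step(5, "exit joint config: 1 -> LEARNER, 4 -> VOTER")
    step(6, "later: learner 1 is removed from the Raft group entirely")
}

func main() { rebalanceWithDeferredLeaseTransfer() }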

NOTE: we no longer use VOTER_OUTGOING, so please pretend my previous references to VOTER_OUTGOING said VOTER_DEMOTING_LEARNER. This is most relevant to the earlier comment:

this code won't let a Raft proposal remove its proposer. But it's using the wrong definition of "remove", so it won't let the proposer move to VOTER_OUTGOING. That's one of the many things we'll need to resolve.

@shralex shralex linked a pull request Dec 20, 2021 that will close this issue
@craig craig bot closed this as completed in #74077 Feb 19, 2022