
kvserver: lease counts diverge when a new node is added to a cluster #67740

Closed
aayushshah15 opened this issue Jul 19, 2021 · 17 comments · Fixed by #74077
Labels
A-kv-distribution Relating to rebalancing and leasing. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@aayushshah15
Contributor

aayushshah15 commented Jul 19, 2021

Describe the problem

In a cluster with at least 3 localities, adding a new node to an existing locality reliably triggers a lease count divergence. The lease counts continue to diverge until this newly added node is fully hydrated with the ~mean number of replicas.

This looks something like the following:
[screenshot: per-node lease counts diverging after the new node is added]

Cause

To understand how this happens, consider a 9-node cluster with 3 racks: rack=0, rack=1, and rack=2. Let's walk through adding a new node (n10) to rack=0. Because of the diversity heuristic, only the existing nodes in rack=0 are allowed to shed replicas to n10. For ranges whose leases are also in rack=0, this means those leaseholders must first transfer their leases to nodes in racks 1 or 2 and rely on those nodes to execute the rebalance. This continues until n10 has received roughly the mean number of replicas relative to the rest of the cluster.

However, in a cluster with enough data, fully hydrating the new node will take a while, sometimes on the order of hours (note that n10 can only receive new replicas at the snapshot rate dictated by kv.snapshot_rebalance.max_rate). Until this happens, nodes in rack=0 will continue shedding leases away to nodes in racks 1 and 2 until they basically have zero leases.
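To make the constraint concrete, here is a minimal illustrative Go sketch of the decision described above. It is not the actual allocator code; the store struct and the names canShedReplicaTo and mustTransferLeaseFirst are hypothetical stand-ins.

package main

import "fmt"

type store struct {
    node  int
    rack  int
    lease bool // does this store hold the range's lease?
}

// canShedReplicaTo models the diversity heuristic: only a store in the same
// rack as the new, under-full node may move a replica onto it, since moving a
// replica in from another rack would reduce locality diversity.
func canShedReplicaTo(s store, newNodeRack int) bool {
    return s.rack == newNodeRack
}

// mustTransferLeaseFirst models the current restriction: a leaseholder cannot
// remove itself, so before it can rebalance its replica to the new node it has
// to hand the lease to a replica in another rack.
func mustTransferLeaseFirst(s store, newNodeRack int) bool {
    return canShedReplicaTo(s, newNodeRack) && s.lease
}

func main() {
    leaseholder := store{node: 1, rack: 0, lease: true}
    // true: the lease has to leave rack=0 before n10 can receive this replica.
    fmt.Println(mustTransferLeaseFirst(leaseholder, 0))
}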

To Reproduce

I can reproduce this by following the steps outlined above on both 20.2 and 21.1.

Additional details

Until n10 is fully hydrated, nodes in rack=0 will continue hitting the considerRebalance path (for the ranges for which they are the leaseholders) in the allocator:

return rq.considerRebalance(ctx, repl, voterReplicas, nonVoterReplicas, canTransferLeaseFrom, dryRun)

Since there is indeed a valid rebalance candidate, RebalanceVoter will return ok==true here:

addTarget, removeTarget, details, ok := rq.allocator.RebalanceVoter(
    ctx,
    zone,
    repl.RaftStatus(),
    existingVoters,
    existingNonVoters,
    rangeUsageInfo,
    storeFilterThrottled,
)

This will then lead to the call to maybeTransferLeaseAway here:

} else if done, err := rq.maybeTransferLeaseAway(
    ctx, repl, removeTarget.StoreID, dryRun, canTransferLeaseFrom,

This will transfer the lease away to another replica.
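Paraphrased as a minimal, self-contained Go sketch: the functions below are hypothetical stand-ins for the Allocator.RebalanceVoter and replicateQueue.maybeTransferLeaseAway calls quoted above, not the real code path.

package main

import "fmt"

type target struct{ storeID int }

// rebalanceVoter stands in for Allocator.RebalanceVoter: while n10 is below
// the mean replica count, it keeps returning a valid add/remove pair with
// ok == true.
func rebalanceVoter() (add, remove target, ok bool) {
    return target{storeID: 10}, target{storeID: 1}, true
}

// maybeTransferLeaseAway stands in for replicateQueue.maybeTransferLeaseAway:
// if the local store is the removal target and holds the lease, the lease is
// shed to another replica (in rack 1 or 2) before the rebalance can run.
func maybeTransferLeaseAway(localStoreID int, remove target) bool {
    return localStoreID == remove.storeID
}

func main() {
    const localStoreID = 1 // a leaseholder in rack=0
    _, remove, ok := rebalanceVoter()
    if ok && maybeTransferLeaseAway(localStoreID, remove) {
        fmt.Println("lease transferred away; the rebalance is left to the new leaseholder")
    }
}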

/cc. @cockroachdb/kv

gz#5876

Epic CRDB-10569

gz#9817

@aayushshah15 aayushshah15 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-distribution Relating to rebalancing and leasing. labels Jul 19, 2021
@aayushshah15
Contributor Author

I wanted to post this more generally because I'm not quite sure what behaviour we'd rather have here. I also struggled to come up with any concrete ideas that would cleanly improve the situation.

@irfansharif
Contributor

Is it possible to have a leaseholder execute a membership change removing itself from the raft group? (Either by letting the lease expire, or in a more coordinated fashion by transferring the lease to the incoming replica.) That would reduce the number of lease transfers needed for a given range from 2 to 1, and the lease would stay in the same rack/locality.

If that's not possible, can we have an outgoing leaseholder hold onto its lease over the duration of the rebalance, transferring it away to another locality only when the learner replica has been fully caught up? That would reduce how long a cluster observes this diverged state.

@nvanbenschoten
Member

Is it possible to have a leaseholder execute a membership change removing itself away from the raft group?

This is something that @tbg has discussed in the past. It's certainly not possible today, because we don't have a way to coalesce a lease transfer with a membership change into the same Raft log entry. If we did, then I think we could simplify this and a few other edge cases with 1x replication.

can we have an outgoing leaseholder hold onto its lease over the duration of the rebalance, transferring it away to another locality only when the learner replica has been fully caught up? That would reduce how long a cluster observes this diverged state.

This is a good idea, though I don't think it would work today because we eagerly throw away LEARNER replicas from previous ChangeReplicas attempts whenever we see them (see maybeLeaveAtomicChangeReplicas). Eagerly discarding them isn't strictly necessary, but keeping them around would require us to break the contract that a LEARNER replica is always transient and never outlives its AdminChangeReplicasRequest. That contract prevents us from coordinating the lifetime of a LEARNER across multiple ChangeReplicas attempts.

An interesting variation of this for when we know we want a longer-lived learner would be to create a NON_VOTER instead of a LEARNER before transferring away the lease and then allow the new leaseholder to perform the VOTER<->NON_VOTER swap before removing the old VOTER.

@lunevalex
Collaborator

Relates to #51867

@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 21, 2021
@lunevalex
Collaborator

@aayushshah15 there was a previous idea described here #40333 for how to fix this.

@nvanbenschoten
Member

Here's an easy local reproduction of this issue that may come in handy when trying to fix it:

# if not already created and staged
> roachprod create local -n 4
> roachprod put local cockroach

> roachprod start local:1-3 --racks=3

> cockroach workload init kv --splits=10000

> roachprod start local:4 --racks=3

[screenshot: per-store lease counts diverging after starting the fourth node]


I looked briefly into this intermediate NON_VOTER idea and it doesn't look trivial. After a refresh of the code, I believe the "then allow the new leaseholder to perform the VOTER<->NON_VOTER swap before removing the old VOTER" step would require a good amount of specialized code. Without that, the new leaseholder's allocator would first issue an AllocatorRemoveNonVoter action before attempting an AllocatorConsiderRebalance action. So it would throw away the temporary non-voter before it noticed that it could use it to perform the snapshot-less swap.

We could address this by maintaining more state about the purpose of the temporary non-voter in the range descriptor. But this wouldn't be a small change.


This makes me question whether transferring the lease away for a short duration is even good enough to consider this issue resolved. In setups like multi-region deployments with one replica per region and leaseholder preferences (note: REGION survivability uses 5x replication, so it doesn't hit this), violating these preferences even for a short period of time will be disruptive.

So why are we doing so in the first place? The straightforward answer, spelled out in #40333, is that we can't perform a ChangeReplicas Raft operation atomically with a lease transfer. But the less clear-cut follow-up question is whether we actually need that to accomplish what we want here. Atomic replication changes, despite the name, are not atomic. During the joint consensus stage of an atomic voter rebalance, there is a window during which we have two voters in the same locality tier. It seems possible that we could delay the lease transfer until the destination replica is a VOTER_INCOMING and the source replica is a VOTER_DEMOTING_LEARNER.

For this to be safe, we'd need to change our leaseholder eligibility rules in two ways. First, we'd need to allow VOTER_INCOMING replicas to hold the lease, and, second, we'd need to allow VOTER_DEMOTING_LEARNER replicas to hold the lease. The first of these seems perfectly safe. A VOTER_INCOMING can only ever transition to a VOTER_FULL, so there's no reason it shouldn't be able to hold the lease. We've already noted this here. Letting a VOTER_DEMOTING_LEARNER remain the leaseholder is a little more involved, but it should also be possible to make safe. To do so, we would need to ensure that we never finalize the atomic replication change and move the VOTER_DEMOTING_LEARNER to a LEARNER before the leaseholder has been moved off this replica. This last part would be involved, but not terribly so.

I think that if we made this change, we would not only fix this issue sufficiently but could actually guarantee that the lease would never leave a region during an intra-region rebalancing operation, all while never violating the invariant that a voting replica always holds the lease.
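As a rough illustration of the relaxed eligibility rule being proposed here (hypothetical types and function names, not CRDB's actual lease-eligibility code):

package main

import "fmt"

type replicaType int

const (
    voterFull replicaType = iota
    voterIncoming
    voterDemotingLearner
    learner
    nonVoter
)

// eligibleForLease captures the proposed rule: VOTER_FULL and VOTER_INCOMING
// replicas may always hold the lease; a VOTER_DEMOTING_LEARNER may keep it
// only while the range is still in the joint configuration, i.e. before the
// change is finalized and the replica becomes a plain LEARNER.
func eligibleForLease(t replicaType, inJointConfig bool) bool {
    switch t {
    case voterFull, voterIncoming:
        return true
    case voterDemotingLearner:
        return inJointConfig
    default:
        return false
    }
}

func main() {
    fmt.Println(eligibleForLease(voterIncoming, true))        // true
    fmt.Println(eligibleForLease(voterDemotingLearner, true)) // true: the lease must move before exiting the joint config
    fmt.Println(eligibleForLease(voterDemotingLearner, false)) // false
}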

@tbg
Member

tbg commented Aug 17, 2021

It seems possible that we could delay the lease transfer until the destination replica is a VOTER_INCOMING and the source replica is a VOTER_DEMOTING_LEARNER.

That sounds like a good idea. Something else that bugs me about the original issue is that we have a replica locally in the same rack, yet we choose to move the lease possibly far away and then send the snapshot from there. That would also be avoided. Of course, exploring the idea of requesting snapshots from suitable replicas (regardless of leadership status) would give us even more flexibility, but that's orthogonal.

@aayushshah15
Contributor Author

but we chose to send the leaseholder to possibly far away and then send a snapshot from there

This actually came up in a customer call last week as something that substantially adds to CRDB's operating costs, because network traffic within the same AZ is ~free whereas cross-AZ traffic is not. Our current inability to ensure that snapshots are sent from within the same locality also makes operations like node decommissioning quite expensive (which has prompted customers to try some unsupported workflows).

@irfansharif
Contributor

Related: #42491

@nvanbenschoten
Member

@shralex here is a collection of code pointers and areas for investigation.

The first is about raft leadership and the flexibility of etcd/raft. If we want to transfer a lease from a VOTER_OUTGOING to a VOTER_INCOMING, we'd like to ensure that raft leadership can follow along. This means raft leadership would need to be transferred from a replica that is a voter only in the outgoing (left) half of a joint quorum to a replica that is a voter only in the incoming (right) half. In the notation we used for joint quorum configurations, and with the leader on replica 3, we would need to (1) be able to enter the joint quorum configuration (1, 2, 3)&&(1, 2, 4), and then (2) be able to transfer leadership from replica 3 to replica 4. I believe both of these operations are possible, but we should confirm. @tbg do you know off the top of your head? If this is not possible, we'll need to determine whether the limitation is artificial or fundamental, and lift it if it can be lifted. Regardless of whether this is possible today, we should add a test for this in https://github.com/etcd-io/etcd/tree/main/raft/testdata.
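For reference, a hedged sketch of what step (1) looks like with etcd/raft's ConfChangeV2 API. The import path assumes etcd's raft/v3 module, and CRDB actually demotes the outgoing voter to a learner rather than removing it outright, but the joint configuration shape is the same.

package main

import (
    "fmt"

    "go.etcd.io/etcd/raft/v3/raftpb"
)

func main() {
    // Step (1): enter the joint configuration (1, 2, 3) && (1, 2, 4) by adding
    // replica 4 as a voter and removing replica 3, using an explicit joint
    // transition so the group stays in the joint config until told to leave it.
    cc := raftpb.ConfChangeV2{
        Transition: raftpb.ConfChangeTransitionJointExplicit,
        Changes: []raftpb.ConfChangeSingle{
            {Type: raftpb.ConfChangeAddNode, NodeID: 4},
            {Type: raftpb.ConfChangeRemoveNode, NodeID: 3},
        },
    }
    fmt.Println(cc.String())

    // Step (2) would then be a node.TransferLeadership(ctx, 3, 4) call while
    // the group is still in the joint configuration; whether etcd/raft allows
    // leadership to move to the incoming-only voter is the open question above.
}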

Next, we can start looking at how CRDB performs rebalance operations. Let's start at the replicateQueue. The replicateQueue is one of a handful of "replica queues" that periodically process each replica on a given store and perform actions on them. Alongside the replicateQueue sit the consistencyQueue, the gcQueue, the mergeQueue, the raftLogQueue, the raftSnapshotQueue, the replicaGCQueue, the splitQueue, and the timeSeriesMaintenanceQueue. The replicateQueue is in charge of periodically picking up each replica and checking for replication config changes, executing them if necessary. The replicateQueue makes decisions about what operation to perform, if any, using the Allocator. Lateral replica movement is suggested by the Allocator when no other operation is necessary:

action = AllocatorConsiderRebalance

In such cases, the replicateQueue checks for rebalance targets in its shouldQueue method here:

rangeUsageInfo := rangeUsageInfoForRepl(repl)

If a target for a lateral rebalance is established, the queue proceeds to its process method to execute the rebalance here:

case AllocatorConsiderRebalance:
return rq.considerRebalance(ctx, repl, voterReplicas, nonVoterReplicas, canTransferLeaseFrom, dryRun)

A replica can only initiate a replication change if it is the current leaseholder. So if a replica reaches this part of the code, it knows (with enough certainty) that it is the current leaseholder. When deciding which replica to remove in a lateral rebalance, it therefore takes special notice of whether it itself is the removal target. In that case, it immediately transfers the lease away:

} else if done, err := rq.maybeTransferLeaseAway(
    ctx, repl, removeTarget.StoreID, dryRun, canTransferLeaseFrom,

This is where we'll need to start making changes. As discussed above and in person, we'd like to defer this lease transfer until after the incoming replica has received its initial snapshot. So let's see where that happens.

Back to the code: the replica then (if it didn't just transfer the lease away) calls into replicateQueue.changeReplicas, which calls into Replica.changeReplicasImpl, which coordinates the replication change across several steps. There's a lot in that function that is worth a read. One key point is that each stage of the replication change is performed using a distributed transaction (see execChangeReplicasTxn) with a special "commit trigger". These commit triggers are a bit of extra code that runs atomically with the commit on the transaction's base range. They cause a few extra pieces of state to be attached to the EndTxn request's Raft proposal, which are passed to etcd/raft when the proposal is made and finalized when it commits and applies.

I'll leave this by noting that the rebalance operation enters and leaves its joint voter configuration in Replica.execReplicationChangesForVoters:

// execReplicationChangesForVoters carries out the atomic membership change that
// finalizes the addition and/or removal of voting replicas. Any voters in the
// process of being added (as reflected by the replication changes) must have
// been added as learners already and caught up. Voter removals (from the
// replication changes) will be carried out by first demoting to a learner
// instead of outright removal (this avoids a [raft-bug] that can lead to
// unavailability). All of this occurs in one atomic raft membership change
// which is carried out across two phases. On error, it is possible that the
// range is in the intermediate ("joint") configuration in which a quorum of
// both the old and new sets of voters is required. If a range is encountered in
// this state, maybeLeaveAtomicReplicationChange can fix this, but it is the
// caller's job to do this when necessary.
//
// The atomic membership change is carried out chiefly via the construction of a
// suitable ChangeReplicasTrigger, see prepareChangeReplicasTrigger for details.
//
// When adding/removing only a single voter, joint consensus is not used.
// Notably, demotions must always use joint consensus, even if only a single
// voter is being demoted, due to a (liftable) limitation in etcd/raft.
//
// [raft-bug]: https://github.com/etcd-io/etcd/issues/11284
func (r *Replica) execReplicationChangesForVoters(

So this is where we could move the lease transfer from the VOTER_OUTGOING to the VOTER_INCOMING. But that's likely not where we'll actually want to put it. The reason for this is that we need to be able to recover (i.e. leave the joint quorum configuration) from a partially successful rebalance and roll back or roll forward. In the cases where we need to roll forward, we still want to perform the lease transfer, because we can't let the VOTER_OUTGOING be removed while it still holds the lease. Luckily, we have a single place where we perform this roll forward, in maybeLeaveAtomicChangeReplicas, which is called by execReplicationChangesForVoters. So I think this is where we'll want to put the lease transfer.
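Sketched very roughly (hypothetical types and helpers, not the actual maybeLeaveAtomicChangeReplicas code), the roll-forward could look like this:

package main

import "fmt"

type replicaType int

const (
    voterFull replicaType = iota
    voterIncoming
    voterDemotingLearner
)

type replica struct {
    id  int
    typ replicaType
}

type rangeState struct {
    replicas    []replica
    leaseholder int
    joint       bool
}

// transferLease and exitJointConfig are stand-ins for the real lease-transfer
// and ChangeReplicas machinery.
func transferLease(r *rangeState, to int) { r.leaseholder = to }
func exitJointConfig(r *rangeState)       { r.joint = false }

// maybeLeaveJointConfig sketches the proposed roll-forward: before leaving the
// joint configuration, move the lease off a VOTER_DEMOTING_LEARNER that still
// holds it, preferably onto the VOTER_INCOMING in the same locality, so that
// we never demote a leaseholder.
func maybeLeaveJointConfig(r *rangeState) {
    if !r.joint {
        return
    }
    for _, rep := range r.replicas {
        if rep.id == r.leaseholder && rep.typ == voterDemotingLearner {
            for _, cand := range r.replicas {
                if cand.typ == voterIncoming {
                    transferLease(r, cand.id)
                    break
                }
            }
        }
    }
    exitJointConfig(r)
}

func main() {
    r := &rangeState{
        replicas:    []replica{{1, voterDemotingLearner}, {2, voterFull}, {3, voterFull}, {4, voterIncoming}},
        leaseholder: 1,
        joint:       true,
    }
    maybeLeaveJointConfig(r)
    fmt.Println(r.leaseholder, r.joint) // 4 false
}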

That's a high-level lay of the land, but there will certainly be things that come up when we begin touching the code. One thing that I'm aware of from spending a short amount of time trying to prototype this a few months ago is that this code won't let a Raft proposal remove its proposer. But it's using the wrong definition of "remove", so it won't let the proposer move to VOTER_OUTGOING. That's one of the many things we'll need to resolve.

@tbg
Member

tbg commented Oct 13, 2021

In the notation we used for joint quorum configurations and with a leader on replica 3, we would need to (1) be able to enter the joint quorum configuration (1, 2, 3)&&(1, 2, 4), and then (2) be able to transfer the leadership from replica 3 to replica 4. I believe both of these operations are possible, but we should confirm. @tbg do you know off the top of your head?

Based on my reading of the code below, this should "just work", but we still need to verify it.

https://github.com/etcd-io/etcd/blob/519f62b269cbc5f0438587cdcd9e3d4653c6515b/raft/raft.go#L1339-L1370

@shralex
Contributor

shralex commented Oct 20, 2021

@tbg
Member

tbg commented Oct 21, 2021

That unit test shows what happens when you remove the raft leader (tl;dr: nothing good, actually pretty bad in terms of loss of availability, though I think it would be temporary in our case, since once the old leader heartbeats anyone it will be told to delete itself, and after the election timeout someone else would campaign). We would try to avoid this case by transferring leadership away while the leader is a VOTER_DEMOTING. But yes, raft definitely has sharp edges here that we may be better off addressing.

We don't control raft elections, so even if we transfer the lease and raft leadership away, it's possible for the demoting replica to slip back in as leader before we actually demote it. And you don't want to run the risk of having a learner that thinks it's the raft leader, or the unavailability mentioned above. However, we should be able to fix this. If we are in a joint config of type (..., VOTER) && (..., LEARNER) today (i.e. the VOTER is in the middle of demoting to a LEARNER), we can "deny" leadership to that replica without sacrificing availability (since on the right side of this quorum it doesn't get to vote anyway). So we could change raft so that outgoing voters don't campaign, which means that once we have moved the lease and raft leadership elsewhere, we can transition into the final configuration without accidentally removing/demoting the raft leader.
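A purely hypothetical sketch of the check being described; the real change would live inside etcd/raft's campaign logic, not in a free-standing function like this.

package main

import "fmt"

// shouldCampaign is a hypothetical stand-in for the check etcd/raft would need:
// a replica that is a voter only in the outgoing half of a joint configuration
// declines to campaign, so it cannot win leadership back while it is being
// demoted.
func shouldCampaign(id uint64, incoming, outgoing map[uint64]bool) bool {
    if outgoing[id] && !incoming[id] {
        return false // demoting voter: it may still vote, but it should not campaign
    }
    return incoming[id]
}

func main() {
    incoming := map[uint64]bool{1: true, 2: true, 4: true} // (1, 2, 4)
    outgoing := map[uint64]bool{1: true, 2: true, 3: true} // (1, 2, 3)
    fmt.Println(shouldCampaign(3, incoming, outgoing))     // false: outgoing-only voter
    fmt.Println(shouldCampaign(4, incoming, outgoing))     // true
}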

@daniel-crlabs
Contributor

As part of my job, I have to ask for updates on tickets that involve our customers. This GitHub issue impacts gh 1228, which was closed in favor of this one.
Could you please provide an update on the status of this issue so I can update our customer? Thank you.

@nvanbenschoten
Member

@daniel-crlabs we are hoping to address this issue in the v22.1 release. Given the complexity involved, the fix here will likely not be backported to earlier release branches.

@daniel-crlabs
Contributor

Sounds great, thank you for the update, I'll let the customer know.

@nvanbenschoten
Member

@shralex to make some of this concrete, here's the order of operations that we would expect before and after this change during a replica rebalance. Imagine we are starting with replicas {1 (leaseholder), 2, 3}, adding replica 4, and removing replica 1.

Current replica rebalance steps:

  1. transfer lease from 1 -> 2 or 3
  2. add 4 as LEARNER
  3. send 4 a snapshot from leaseholder (2 or 3)
  4. enter joint configuration: 1 becomes VOTER_DEMOTING_LEARNER, 4 becomes VOTER_INCOMING
  5. exit joint configuration: 1 becomes LEARNER, 4 becomes VOTER
  6. transfer lease from 2/3 -> 4
  7. (sometime later) learner 1 is removed from Raft group entirely

New replica rebalance steps:

  1. add 4 as LEARNER
  2. send 4 a snapshot from leaseholder (1)
  3. enter joint configuration: 1 becomes VOTER_DEMOTING_LEARNER, 4 becomes VOTER_INCOMING
  4. transfer lease from 1 -> 4
  5. exit joint configuration: 1 becomes LEARNER, 4 becomes VOTER
  6. (sometime later) learner 1 is removed from Raft group entirely

Writing this all out also makes me realize that this change will reduce cross-zone/region network traffic. With the new order of operations, we can send the learner snapshot directly from 1 to 4, instead of sending it from a temporary leaseholder across zones/regions. This gives us part of the benefit promised by #42491 for free.
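For clarity, the new ordering above condensed into a small sketch; the helper here is hypothetical and simply narrates the steps, which in reality span the replicateQueue and ChangeReplicas machinery.

package main

import "fmt"

func step(n int, s string) { fmt.Printf("%d. %s\n", n, s) }

// rebalanceWithDeferredLeaseTransfer narrates the proposed ordering: the lease
// transfer happens inside the joint-configuration window, after the incoming
// replica has its snapshot and before the outgoing voter is demoted.
func rebalanceWithDeferredLeaseTransfer() {
    step(1, "add 4 as LEARNER")
    step(2, "send 4 an initial snapshot from the leaseholder, 1 (same locality)")
    step(3, "enter joint config: 1 -> VOTER_DEMOTING_LEARNER, 4 -> VOTER_INCOMING")
    step(4, "transfer lease from 1 -> 4 while still in the joint config")
    step(5, "exit joint config: 1 -> LEARNER, 4 -> VOTER")
    step(6, "later: learner 1 is removed from the Raft group entirely")
}

func main() { rebalanceWithDeferredLeaseTransfer() }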

NOTE: we no longer use VOTER_OUTGOING, so please pretend my previous references to VOTER_OUTGOING said VOTER_DEMOTING_LEARNER. This is most relevant to the earlier comment:

this code won't let a Raft proposal remove its proposer. But it's using the wrong definition of "remove", so it won't let the proposer move to VOTER_OUTGOING. That's one of the many things we'll need to resolve.

@shralex shralex linked a pull request Dec 20, 2021 that will close this issue
@craig craig bot closed this as completed in #74077 Feb 19, 2022