kvserver: lease transfer in JOINT configuration #74077
Conversation
It's exciting to see this! When you get a chance, let's try out the repro steps from #67740 (comment) and see how things behave with this change.
@tbg should probably also give this a pass when it gets closer, as he was the author of a lot of this joint configuration code. I also left a comment for him in replica_raft.go.
Reviewed 15 of 17 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @shralex)
-- commits, line 14 at r1:
nit: s/a/A/
-- commits, line 29 at r1:
Consider adding a release note to this commit. It's a valuable change that users will care about!
pkg/kv/db.go, line 629 at r1 (raw file):
err := db.Run(ctx, b)
if err == nil {
  log.Infof(ctx, "Transferring lease to StoreID %s succeeded", target.String())
Was this just added for debugging purposes, or did you intend for us to keep it?
pkg/kv/kvserver/replica_command.go, line 985 at r1 (raw file):
details string, chgs roachpb.ReplicationChanges, repl *Replica,
Isn't this already the method receiver?
pkg/kv/kvserver/replica_command.go, line 1172 at r1 (raw file):
// NON_VOTER, and VOTER_FULL only.
func maybeLeaveAtomicChangeReplicas(
  ctx context.Context, s *Store, desc *roachpb.RangeDescriptor, r *Replica,
Instead of adding this as a parameter, what do you think about making maybeLeaveAtomicChangeReplicas and maybeLeaveAtomicChangeReplicasAndRemoveLearners methods on *Replica? That way, we can actually get rid of all arguments except for the context.
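Roughly, the suggested shape could look like this (a sketch; the helper it delegates to is hypothetical):

```
// Sketch of the suggested refactor: the store and descriptor come from
// the receiver, so only the context remains as a parameter.
func (r *Replica) maybeLeaveAtomicChangeReplicas(
  ctx context.Context,
) (*roachpb.RangeDescriptor, error) {
  desc := r.Desc()
  if !desc.Replicas().InAtomicReplicationChange() {
    // Not in a joint configuration; nothing to do.
    return desc, nil
  }
  // The existing joint-config exit logic would move here, reading the
  // store and replica from the receiver instead of from parameters.
  return r.leaveJointConfig(ctx, desc) // hypothetical helper
}
```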
pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):
// with leaving a joint state when it's in one, so make sure we don't call
// it if we're not.
if !desc.Replicas().InAtomicReplicationChange(){
It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.
pkg/kv/kvserver/replica_command.go, line 1184 at r1 (raw file):
voters := desc.Replicas().VoterDescriptors()
if r != nil {
The fact that we can ignore this check seems like a source of bugs. The need for this is coming from the call to maybeLeaveAtomicChangeReplicasAndRemoveLearners on the RHS range in mergeQueue.process, right? Did you run into issues with the proposed idea to remove that call entirely and enter the AdminRelocateRange branch when the RHS is in a joint config instead?
pkg/kv/kvserver/replica_command.go, line 1185 at r1 (raw file):
voters := desc.Replicas().VoterDescriptors()
if r != nil {
  // Current leaseholder is being removed
Let's add more to this comment. We can discuss why we're looking in the voters slice, what it means for the current replica (which we know to be the leaseholder) to be absent from it, and what we will do in that case (transfer the lease).
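For instance, something along these lines (the wording is only a sketch):

```
// We inspect the proposed voter set rather than the change list: the
// current replica must be the leaseholder (it is proposing the change),
// so if it no longer appears among the voters, this change removes it
// and the lease has to be transferred to a remaining voter before the
// change can complete.
```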
pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):
transferLeaseOptions{
  goal: followTheWorkload,
  checkTransferLeaseSource: true,
Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.
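E.g., a sketch of such a comment:

```
// checkTransferLeaseSource is safe to leave enabled here because the
// current leaseholder was already filtered out of the candidate list
// above, so the allocator can never hand the lease back to the source.
```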
pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):
checkTransferLeaseSource: true,
checkCandidateFullness: false,
disableLeaseCountConvergenceChecks: true,
Could you add a comment about why we set this? I don't understand the full flow in Allocator.TransferLeaseTarget, but it's interesting to me that we have to set this here, but not elsewhere. For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway. Should we be doing so? @aayushshah15 do you know what this is about?
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
if target == (roachpb.ReplicaDescriptor{}) {
  log.Infof(ctx, "could not find a better lease transfer target for r%d",
    desc.RangeID)
Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.
Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.
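A sketch of the error-returning variant (message wording assumed):

```
if target == (roachpb.ReplicaDescriptor{}) {
  // The leaseholder is being removed by this replication change, so
  // the lease must move; failing to find a target means the change
  // cannot proceed and the range could get stuck in a joint config.
  return nil, errors.Errorf(
    "could not find a lease transfer target for r%d, whose leaseholder "+
      "is being removed by a replication change", desc.RangeID)
}
```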
pkg/kv/kvserver/replica_command.go, line 1213 at r1 (raw file):
  desc.RangeID)
} else {
  log.Infof(ctx, "current leaseholder %v is being removed. Transferring lease to %v",
minor nit: s/. T/, t/
Also, s/is being removed/is being removed through an atomic replication change/
pkg/kv/kvserver/replica_command.go, line 2787 at r1 (raw file):
}
ops, err := s.relocateOne(ctx, &rangeDesc, voterTargets, nonVoterTargets)
I'm surprised we didn't need to revert this part of the change because of single-step config changes. What will happen if I have a config that looks like {1 [L], 2, 3} and I issue an AdminRelocateRange that drops the config down to {2 [L], 3}?
pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):
prefix = false
// The following deals with removing a leaseholder. A voter can be removed in two ways.
Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.
Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?
EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?
pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):
// to disregard the existing lease counts on candidates.
checkCandidateFullness bool
// when true, ignores lease count convergence considerations
nit: missing punctuation.
pkg/roachpb/metadata.go, line 488 at r1 (raw file):
}

func (r ReplicaDescriptor) IsAnyVoter() bool {
This will need a comment. It will probably look like the two above.
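Presumably something like its neighbors, e.g. (the exact set of voter types covered here is an assumption about the intended semantics):

```
// IsAnyVoter returns true if the replica is a voter in the old and/or
// the new configuration, i.e. VOTER_FULL plus the transitional voter
// types used while in a joint configuration.
func (r ReplicaDescriptor) IsAnyVoter() bool {
  switch r.GetType() {
  case VOTER_FULL, VOTER_INCOMING, VOTER_OUTGOING,
    VOTER_DEMOTING_LEARNER, VOTER_DEMOTING_NON_VOTER:
    return true
  default:
    return false
  }
}
```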
pkg/roachpb/metadata_replicas.go, line 527 at r1 (raw file):
if !ok {
  return errReplicaNotFound
} else if !repDesc.IsVoterNewConfig() {
This should probably also go in its own commit/PR, with some testing, for the same reasons as listed above.
pkg/testutils/testcluster/testcluster.go, line 886 at r1 (raw file):
) {
  if err := tc.TransferRangeLease(rangeDesc, dest); err != nil {
    t.Fatalf(`couldn not transfer lease for range %s error is %+v`, rangeDesc, err)
"could not" or "couldn't"
pkg/testutils/testcluster/testcluster.go, line 897 at r1 (raw file):
) {
  testutils.SucceedsSoon(t, func() error {
    if _, err := tc.RemoveVoters(rangeDesc.StartKey.AsRawKey(), src); err != nil {
I'm surprised this works. Are we entering a joint config here?
pkg/kv/kvserver/client_lease_test.go, line 640 at r1 (raw file):
tc.TransferRangeLeaseOrFatal(t, rhsDesc, tc.Target(0))
// Check that the lease is on 1
nit: punctuation needed at the end of a lot of these comments.
pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):
require.Equal(t, tc.Target(2), leaseHolder)
gossipLiveness(t, tc)
Why is this needed?
pkg/kv/kvserver/client_lease_test.go, line 657 at r1 (raw file):
// Relocate range 3 -> 4.
_, err = db.Exec("ALTER RANGE " + rhsDesc.RangeID.String() + " RELOCATE FROM 3 TO 4")
It's not very common to see SQL used in these tests to perform this kind of change. It's more common to see kv.DB.AdminChangeReplicas used, or even better, one of the TestCluster helper methods. Maybe we need a new helper method?
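For example, a hypothetical helper along these lines (the name and the exact AdminRelocateRange signature are assumptions):

```
// RelocateRangeOrFatal is a hypothetical TestCluster helper wrapping
// kv.DB.AdminRelocateRange, failing the test on error.
func (tc *TestCluster) RelocateRangeOrFatal(
  t testing.TB, startKey roachpb.Key, voterTargets []roachpb.ReplicationTarget,
) {
  t.Helper()
  if err := tc.Server(0).DB().AdminRelocateRange(
    context.Background(), startKey, voterTargets, nil /* nonVoterTargets */,
  ); err != nil {
    t.Fatalf("could not relocate range at %s: %+v", startKey, err)
  }
}
```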
pkg/kv/kvserver/client_lease_test.go, line 665 at r1 (raw file):
require.Equal(t, tc.Target(3), leaseHolder)
// Double check that lease moved directly 3 -> 4
This is neat!
pkg/kv/kvserver/client_raft_test.go, line 3803 at r1 (raw file):
}
nit: stray line. That may go away with a gofmt.
pkg/kv/kvserver/client_relocate_range_test.go, line 239 at r1 (raw file):
})
}
tiny nit: based on the rest of this test, it seems like we wanted this new line for consistency.
I remembered another comment I meant to leave here. We'll need to think about mixed-version clusters when making this change. We shouldn't begin a joint configuration that would require the removal of the leaseholder until after all nodes in the cluster are running v22.1.
Gave a shallow pass, generally looks good. Sad that all the messy lease stuff is now intertwining with the replication changes but I suppose that's what we signed up for. Still, if there's a way to clean that up I would be in favor of that. Maybe all of the lease handling doesn't have to be inline but can be pulled out into something compact that we can call in the right place.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @shralex)
pkg/kv/db.go, line 629 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Was this just added for debugging purposes, or did you intend for us to keep it?
These should go away; we shouldn't be logging in *kv.DB. If anything, these logs would be expected at the server that executes the actual transfer.
Also, general nits:
- log messages start lowercase
- don't need to call target.String(), just target should do fine (the %s verb invokes String())
- we usually refer to stores as s%d in logging
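Putting the nits together, the message would look something like this (assuming target carries a store ID):

```
log.Infof(ctx, "transferring lease to s%d succeeded", target.StoreID)
```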
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
fallthrough
case decideWithoutStats:
  if !opts.disableLeaseCountConvergenceChecks &&
was this just missing & you noticed while writing this PR?
pkg/kv/kvserver/replica_command.go, line 1218 at r1 (raw file):
if err != nil {
  return nil, err
}
Are we sure the lease is now where it should be? There's obviously the case where for some reason the lease gets transferred somewhere else by a third party. Could there be races with lease preferences where the new leaseholder immediately sends the lease back, and we never manage to do the replication change below? Similarly, are there persistent error cases where the lease transfer (or one of the steps above) fail? I assume there are no new concerns here since we were never previously able to remove the leaseholder and so nothing has really changed. But I am vaguely worried that what we do now isn't 100% equivalent and that we need a fail-safe (which the migration forces us to have anyway).
pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.
Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?
EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?
I don't think there's direct testing of this, unfortunately. Perhaps a reasonable way to get what we want is by inserting testing knobs between the steps that we can use to check on the replication change at various stages.
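For example, a knob of roughly this shape (hypothetical, not existing API):

```
type StoreTestingKnobs struct {
  // ... existing knobs ...

  // OnEnterJointConfig, if set, is invoked after a replication change
  // enters a joint configuration and before the lease transfer and
  // joint-config exit, so tests can inspect the intermediate state.
  // (Hypothetical knob, sketched for illustration.)
  OnEnterJointConfig func(desc roachpb.RangeDescriptor)
}
```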
pkg/kv/kvserver/replica_raft.go, line 323 at r1 (raw file):
// and then transfer lease directly to v4, but this would change the number of replicas to 4,
// and if region1 goes down, we loose a quorum. Instead, we move to a joint config where v1
// (VOTER_DEMOTED_LEARNER) transfer the lease to v4 (VOTER_INCOMING) directly.
DEMOTING
pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: missing punctuation.
Capitalization too, and usually (though redundantly) the comment would say // disableLeaseCountConvergenceChecks, when true, ...
pkg/kv/kvserver/client_raft_test.go, line 3826 at r1 (raw file):
require.NoError(t, err)
// Check that lease moved to server 2
.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)
pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):
For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway
I don't fully follow why we need this new option. disableLeaseCountConvergenceChecks as it's currently written makes it such that we're forced to use followTheWorkload or return an empty result.
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.
Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.
+1. I'm not sure we need to call into TransferLeaseTarget at all here. Zone configuration lease preferences must be compatible with the constraints placed on the range, so we should never be trying to replace the current leaseholder with another replica that is in violation of those lease preferences. Given all this, I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder.
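A minimal sketch of that direct transfer, assuming we are executing on the leaseholder replica r:

```
// Hand the lease straight to the VOTER_INCOMING that replaces the
// outgoing leaseholder, without consulting the allocator.
for _, rd := range desc.Replicas().Descriptors() {
  if rd.GetType() == roachpb.VOTER_INCOMING {
    return r.store.DB().AdminTransferLease(
      ctx, desc.StartKey.AsRawKey(), rd.StoreID)
  }
}
```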
Another thing to note is that part of AdminRelocateRange's contract is that it must transfer the lease to the first voting replica in the input slice. This is done here at the end:
cockroach/pkg/kv/kvserver/replica_command.go, lines 2791 to 2802 in 527263a:

if len(ops) == 0 {
  if len(voterTargets) > 0 {
    // NB: we may need to transfer even if there are no ops, to make
    // sure the attempt is made to make the first target the final
    // leaseholder. If there are ops, lease transfer will happen during
    // joint config exit, if leader is removed.
    if err := transferLease(voterTargets[0]); err != nil {
      return rangeDesc, err
    }
  }
  return rangeDesc, ctx.Err()
}
This makes the case against going through TransferLeaseTarget for me.
pkg/kv/kvserver/replica_command.go, line 2795 at r1 (raw file):
// NB: we may need to transfer even if there are no ops, to make
// sure the attempt is made to make the first target the final
// leaseholder. If there are ops, lease transfer will happen during joint config exit,
nit: this addition is a bit out of place here, and wrapping.
pkg/kv/kvserver/replica_command.go, line 2796 at r1 (raw file):
// sure the attempt is made to make the first target the final
// leaseholder. If there are ops, lease transfer will happen during joint config exit,
// if leader is removed.
s/leader/leaseholder
pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):
prefix = false
// The following deals with removing a leaseholder. A voter can be removed in two ways.
note that comments should be wrapped at 80 chars (see: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/181371303/Go+Golang+coding+guidelines#Line-Length) here and elsewhere.
To add to this, we will also still need the removed allocator logic under mixed-version states.
pkg/kv/kvserver/replicate_queue.go, line 543 at r1 (raw file):
This call should also probably go away.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider adding a release note to this commit. It's a valuable change that users will care about!
Sure, I'll add before this is merged, once we converge.
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
was this just missing & you noticed while writing this PR?
I reverted this change, as the same can be achieved by setting checkTransferLeaseSource to false.
pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.
thanks! I set it up previously but it somehow disappeared...
pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.
Actually, I think it should be false because of the lease count checks in the TransferLeaseTarget method.
pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway
I don't fully follow why we need this new option. disableLeaseCountConvergenceChecks as it's currently written makes it such that we're forced to use followTheWorkload or return an empty result.
What I was observing is that when the lease counts are roughly similar, this method would return an empty result. I now removed this new option, since checkTransferLeaseSource=false would achieve what I needed here.
pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Updated this to return an error.
I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder
It might be better to make this code a bit more general, such that it works even if there are multiple incoming nodes, or none. Also, there are cases where the joining node is not the best choice. Running the selection logic addresses both issues.
pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this needed?
Without this, the liveness status of candidate nodes is unknown, and they are excluded from consideration when we're trying to decide where to transfer the lease.
74479: backupccl: replace memory accumulator with bound account r=dt a=adityamaru

This change replaces the memory accumulator with a raw bound account. The memory accumulator provides a threadsafe wrapper around the bound account, and some redundant resource pooling semantics. Both of these are currently not needed by the sstSink that is being memory monitored. This change deletes the memory accumulator.

Release note: None

74546: kvserver: allow VOTER_INCOMING to receive the lease r=shralex a=shralex

Previously, VOTER_INCOMING replicas joining the cluster weren't allowed to receive the lease, even though there is no actual problem with doing so. This change removes the restriction, as a step towards #74077.

Release note: None

74549: kvserver: determine if leaseholder is removed using proposed replica … r=shralex a=shralex

…descriptor

Previously, the determination of whether a leaseholder is being removed was made by looking at the proposed changes. As a step towards #74077 we'd like to instead look at the replica descriptor that the reconfiguration change would result in. This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR. This PR should not cause any change in behavior.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: shralex <shralex@gmail.com>
77841: kvserver: transfer lease in JOINT config directly to a VOTER_INCOMING… r=shralex a=shralex

… replica

In #74077 we allowed entering a JOINT config when the leaseholder is being demoted, in which case we attempted to transfer the lease away before leaving the JOINT config. Looking for a lease transfer target at that stage created flakiness, since in some situations this target isn't found and we need to retry.

This PR takes the approach that when the leaseholder is being removed, we should enter the JOINT config only if there is a VOTER_INCOMING replica. In that case, the lease is transferred directly to that replica, without further checks. Otherwise, the lease must be transferred before attempting to remove the leaseholder.

Release justification: fixes flakiness caused by #74077
Release note: None

Experiments to demonstrate that the benefits of #74077 remain with this PR (refer to #74077 for details):
1. No leaseholder preference: [screenshot: https://user-images.githubusercontent.com/6037719/159631649-36a296a8-7619-4dab-8db1-a8b603b6d66e.png]
2. Leaseholder preference is rack 0: [screenshot: https://user-images.githubusercontent.com/6037719/159631504-1cd3350a-89a5-4d77-b9a1-37dfdd2896ae.png]

78101: colexecjoin: optimize building output on the left in merge joiner r=yuzefovich a=yuzefovich

This commit updates the way we're building output in the merge joiner from the left input when building directly from the left batch (i.e. not from the buffered group). There, we need to repeat a single tuple `toAppend` times, so we do it in a loop. This commit adds the optimization of using `Bytes.Copy` for the bytes-like types as well as BCE for sliceable types.

```
MergeJoiner/rows=32-24                     29.3MB/s ± 3%   29.5MB/s ± 3%      ~     (p=0.684 n=10+10)
MergeJoiner/rows=512-24                    79.4MB/s ± 2%   77.8MB/s ± 3%   -1.91%  (p=0.043 n=10+10)
MergeJoiner/rows=4096-24                    192MB/s ± 2%    189MB/s ± 1%   -1.36%  (p=0.029 n=10+10)
MergeJoiner/rows=32768-24                   278MB/s ± 1%    275MB/s ± 0%   -1.30%  (p=0.000 n=10+10)
MergeJoiner/oneSideRepeat-rows=32-24       37.3MB/s ± 3%   38.0MB/s ± 2%   +1.78%  (p=0.029 n=10+10)
MergeJoiner/oneSideRepeat-rows=512-24       212MB/s ± 1%    215MB/s ± 2%   +1.42%  (p=0.003 n=9+10)
MergeJoiner/oneSideRepeat-rows=4096-24      765MB/s ± 4%    770MB/s ± 3%      ~     (p=0.436 n=10+10)
MergeJoiner/oneSideRepeat-rows=32768-24    1.22GB/s ± 2%   1.23GB/s ± 2%      ~     (p=0.393 n=10+10)
MergeJoiner/bothSidesRepeat-rows=32-24     22.7MB/s ± 2%   22.9MB/s ± 2%      ~     (p=0.203 n=9+10)
MergeJoiner/bothSidesRepeat-rows=512-24     102MB/s ± 4%    104MB/s ± 2%   +2.38%  (p=0.011 n=10+10)
MergeJoiner/bothSidesRepeat-rows=4096-24    117MB/s ± 1%    127MB/s ± 1%   +9.11%  (p=0.000 n=10+9)
MergeJoiner/bothSidesRepeat-rows=32768-24  59.2MB/s ± 1%   67.1MB/s ± 1%  +13.48%  (p=0.000 n=10+10)
```

Release note: None

Co-authored-by: shralex <shralex@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
124284: kvserver: rebalance ranges with one voter using joint configurations r=nvanbenschoten a=kvoli

The allocator would add a voter, instead of both adding and removing the existing voter, when rebalancing ranges with one replica. Removing the leaseholder replica was not possible prior to #74077, so the addition-only behavior was necessary. This restriction is no longer necessary.

Allow rebalancing a one-voter range between stores using joint configurations, where the lease will be transferred to the incoming voter store from the outgoing demoting voter. Scattering ranges with one voter will now leave the range with exactly one voter, where previously both the leaseholder voter evaluating the scatter and the new voter would be left.

Before this patch, scattering 1000 ranges with RF=1 on a 5 store cluster:

```
store_id | replica_count | replica_distribution | lease_count | lease_distribution
---------+---------------+----------------------+-------------+--------------------
       1 |          1001 | ##########           |         500 | ##########
       5 |           291 | ###                  |         147 | ###
       4 |           275 | ###                  |         137 | ###
       3 |           229 | ###                  |         118 | ###
       2 |           206 | ###                  |          99 | ##
```

After:

```
store_id | replica_count | replica_distribution | lease_count | lease_distribution
---------+---------------+----------------------+-------------+--------------------
       2 |           242 | ##########           |         241 | ##########
       4 |           227 | ##########           |         227 | ##########
       5 |           217 | #########            |         216 | #########
       3 |           209 | #########            |         208 | #########
       1 |           106 | #####                |         109 | #####
```

Fixes: #108420
Fixes: #124171

Release note (bug fix): Scattering a range with replication factor=1 no longer erroneously up-replicates the range to two replicas. Leases will also no longer thrash between nodes when perturbed with replication factor=1.

124444: packer: move `add-apt-repository` call to after `sleep` r=jlinder a=rickystewart

Updates are still going on on the VM while this is happening.

Epic: none
Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Previously:
(1) the leaseholder could not be removed by a replication change, and
(2) a VOTER_INCOMING replica could not accept the lease.
Because of (1), users needed to transfer the lease before removing
the leaseholder. Because of (2), when relocating a range from the
leaseholder A to a new node B, there was no possibility to transfer
the lease to B before it was fully added as VOTER. Adding it as a
voter, however, could degrade fault tolerance. For example, if A
and B are in region R1, C in region R2 and D in R3, and we had
(A, C, D), and now adding B to the cluster to replace A results in
the intermediate configuration (A, B, C, D), where the failure of R1 would
make the cluster unavailable since no quorum can be established.
Since B can't be added before A is removed, the system would
transfer the lease out to C, remove A and add B, and then transfer
the lease again to B. This resulted in a temporary migration of leases
out of their preferred region, imbalance of lease count and degraded
performance.
The PR fixes this by (1) allowing the leaseholder to be removed,
transferring the lease right before we exit the JOINT config, and (2)
allowing a VOTER_INCOMING to accept the lease.
Release note (performance improvement): Fixes a limitation which meant
that, upon adding a new node to the cluster, lease counts among existing
nodes could diverge until the new node was fully upreplicated.
Here are a few experiments that demonstrate the benefit of the feature.
1. No leaseholder preference:

Without the change (master): [graph omitted]

With the change: [graph omitted]

We can see that without the patch the number of leases on server 0 (black line) goes all the way to 0 before it goes back up, and that the number of leases in other racks goes up; both are undesirable. With the patch, neither happens.

2. Leaseholder preference is rack 0:

ALTER RANGE default CONFIGURE ZONE USING lease_preferences='[[+rack=0]]';

Without the change (master): [graph omitted]

With the change: [graph omitted]

We can see that without the change the number of leaseholders in racks 1 and 2 together (i.e., not in the preferred region) grows from 300 to 1000 before falling back to 40. With the fix it doesn't grow at all.