
kvserver: lease transfer in JOINT configuration #74077

Merged: 1 commit, Feb 19, 2022

Conversation


@shralex shralex commented Dec 20, 2021

Previously:

  1. Removing a leaseholder was not allowed.
  2. A VOTER_INCOMING node wasn't able to accept the lease.

Because of (1), users needed to transfer the lease away before removing
the leaseholder. Because of (2), when relocating a range from the
leaseholder A to a new node B, there was no way to transfer the lease to
B before it was fully added as a VOTER. Adding it as a voter first,
however, could degrade fault tolerance. For example, suppose A and B are
in region R1, C is in R2 and D is in R3, and the range lives on
(A, C, D). Adding B to replace A yields the intermediate configuration
(A, B, C, D), in which a failure of R1 makes the range unavailable
because no quorum can be established. Since B couldn't safely be added
before A was removed, the system would transfer the lease out to C,
remove A and add B, and then transfer the lease again to B. This
resulted in a temporary migration of leases out of their preferred
region, an imbalance in lease counts, and degraded performance.

This PR fixes both issues: (1) it allows removing the leaseholder,
transferring the lease right before we exit the JOINT config, and (2) it
allows a VOTER_INCOMING replica to accept the lease.

Release note (performance improvement): Fixes a limitation which meant
that, upon adding a new node to the cluster, lease counts among existing
nodes could diverge until the new node was fully upreplicated.
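
At a high level, the new flow looks roughly like the sketch below. Every helper name in it (enterJointConfig, leaseholderBeingRemoved, pickLeaseTransferTarget, transferLease, leaveJointConfig) is an illustrative placeholder rather than an actual CockroachDB function:

```
// Conceptual sketch only; all identifiers below are placeholders.
func relocateAwayFromLeaseholder(ctx context.Context) error {
	// Enter the joint configuration: the old leaseholder becomes a
	// demoting voter and the new replica joins as VOTER_INCOMING.
	if err := enterJointConfig(ctx); err != nil {
		return err
	}
	// While still in the joint config, move the lease off the replica
	// that is about to be removed. With this PR a VOTER_INCOMING replica
	// is a valid target, so the lease can go directly to the new node.
	if leaseholderBeingRemoved() {
		target, err := pickLeaseTransferTarget(ctx)
		if err != nil {
			return err
		}
		if err := transferLease(ctx, target); err != nil {
			return err
		}
	}
	// Only then leave the joint config, which drops the old leaseholder.
	return leaveJointConfig(ctx)
}
```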

Here are a few experiments that demonstrate the benefit of the feature.
1.

roachprod create local -n 4 // if not already created and staged
roachprod put local cockroach
roachprod start local:1-3 --racks=3 // add 3 servers in 3 different racks
cockroach workload init kv --splits=10000
roachprod start local:4 --racks=3 // add a 4th server in one of the racks

Without the change (master):
Screen Shot 2022-02-09 at 8 35 35 AM

With the change:
Screen Shot 2022-02-08 at 8 46 41 PM

We can see that without the patch the number of leases on server 0 (black line) drops all the way to 0 before climbing back up, and the number of leases in the other racks goes up; both are undesirable. With the patch, neither happens.

  2. Same as experiment 1, but with a leaseholder preference for rack 0:

ALTER RANGE default CONFIGURE ZONE USING lease_preferences='[[+rack=0]]';

Without the change (master):
Screen Shot 2022-02-09 at 10 45 27 PM

With the change:
leaseholder preferences - with change

We can see that without the change the number of leaseholders in racks 1 and 2 combined (i.e. outside the preferred region) grows from 300 to 1000, then goes back down to 40. With the fix it doesn't grow at all.

@shralex shralex requested a review from a team as a code owner December 20, 2021 06:32


@nvanbenschoten nvanbenschoten left a comment


It's exciting to see this! When you get a chance, let's try out the repro steps from #67740 (comment) and see how things behave with this change.

@tbg should probably also give this a pass when it gets closer, as he was the author of a lot of this joint configuration code. I also left a comment for him in replica_raft.go.

Reviewed 15 of 17 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15 and @shralex)


-- commits, line 14 at r1:
nit: s/a/A/


-- commits, line 29 at r1:
Consider adding a release note to this commit. It's a valuable change that users will care about!


pkg/kv/db.go, line 629 at r1 (raw file):

	err := db.Run(ctx, b)
	if err == nil {
		log.Infof(ctx, "Transferring lease to StoreID %s succeeded", target.String())

Was this just added for debugging purposes, or did you intend for us to keep it?


pkg/kv/kvserver/replica_command.go, line 985 at r1 (raw file):

	details string,
	chgs roachpb.ReplicationChanges,
	repl *Replica,

Isn't this already the method receiver?


pkg/kv/kvserver/replica_command.go, line 1172 at r1 (raw file):

// NON_VOTER, and VOTER_FULL only.
func maybeLeaveAtomicChangeReplicas(
	ctx context.Context, s *Store, desc *roachpb.RangeDescriptor, r *Replica,

Instead of adding this as a parameter, what do you think about making maybeLeaveAtomicChangeReplicas and maybeLeaveAtomicChangeReplicasAndRemoveLearners methods on *Replica? That way, we can actually get rid of all arguments except for the context.
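
For illustration, the suggested reshaping would look roughly like this; only the parameter list is taken from the snippet above, and the return types are an assumption:

```
// Current shape: a free function that needs the store, descriptor and
// replica passed in explicitly (return types assumed for illustration).
func maybeLeaveAtomicChangeReplicas(
	ctx context.Context, s *Store, desc *roachpb.RangeDescriptor, r *Replica,
) (*roachpb.RangeDescriptor, error)

// Suggested shape: a method on *Replica, which already knows its store
// and can look up its own descriptor, leaving only the context.
func (r *Replica) maybeLeaveAtomicChangeReplicas(
	ctx context.Context,
) (*roachpb.RangeDescriptor, error)
```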


pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):

	// with leaving a joint state when it's in one, so make sure we don't call
	// it if we're not.
	if !desc.Replicas().InAtomicReplicationChange(){

It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.


pkg/kv/kvserver/replica_command.go, line 1184 at r1 (raw file):

	voters := desc.Replicas().VoterDescriptors()
	if r != nil {

The fact that we can ignore this check seems like a source of bugs. The need for this is coming from the call to maybeLeaveAtomicChangeReplicasAndRemoveLearners on the RHS range in mergeQueue.process, right? Did you run into issues with the proposed idea to remove that call entirely and enter the AdminRelocateRange branch when the RHS is in a joint config instead?


pkg/kv/kvserver/replica_command.go, line 1185 at r1 (raw file):

	voters := desc.Replicas().VoterDescriptors()
	if r != nil {
		// Current leaseholder is being removed

Let's add more to this comment. We can discuss why we're looking in the voters slice, what it means for the current replica (which we know to be the leaseholder) to be absent from it, and what we will do in that case (transfer the lease).


pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):

				transferLeaseOptions{
					goal:                               followTheWorkload,
					checkTransferLeaseSource:           true,

Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.


pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):

					checkTransferLeaseSource:           true,
					checkCandidateFullness:             false,
					disableLeaseCountConvergenceChecks: true,

Could you add a comment about why we set this? I don't understand the full flow in Allocator.TransferLeaseTarget, but it's interesting to me that we have to set this here, but not elsewhere. For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway. Should we be doing so? @aayushshah15 do you know what this is about?


pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

			if target == (roachpb.ReplicaDescriptor{}) {
				log.Infof(ctx, "could not find a better lease transfer target for r%d",
					desc.RangeID)

Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.

Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.


pkg/kv/kvserver/replica_command.go, line 1213 at r1 (raw file):

					desc.RangeID)
			} else {
				log.Infof(ctx, "current leaseholder %v is being removed. Transferring lease to %v",

minor nit: s/. T/, t/

Also, s/is being removed/is being removed through an atomic replication change/


pkg/kv/kvserver/replica_command.go, line 2787 at r1 (raw file):

			}

			ops, err := s.relocateOne(ctx, &rangeDesc, voterTargets, nonVoterTargets)

I'm surprised we didn't need to revert this part of the change because of single-step config changes. What will happen if I have a config that looks like {1 [L], 2, 3} and I issue an AdminRelocateRange that drops the config down to {2 [L], 3}?
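
For concreteness, that scenario corresponds to a call roughly like the following sketch (the AdminRelocateRange signature shown here is an assumption):

```
// Relocate from stores {1 (leaseholder), 2, 3} down to {2, 3}: the current
// leaseholder is dropped and the lease must end up on store 2, the first
// voter target.
err := db.AdminRelocateRange(
	ctx, rangeStartKey,
	[]roachpb.ReplicationTarget{
		{NodeID: 2, StoreID: 2},
		{NodeID: 3, StoreID: 3},
	},
	nil, /* nonVoterTargets */
)
```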


pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):

		prefix = false

		// The following deals with removing a leaseholder. A voter can be removed in two ways.

Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.

Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?

EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?


pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):

	// to disregard the existing lease counts on candidates.
	checkCandidateFullness bool
	// when true, ignores lease count convergence considerations

nit: missing punctuation.


pkg/roachpb/metadata.go, line 488 at r1 (raw file):

}

func (r ReplicaDescriptor) IsAnyVoter() bool {

This will need a comment. It will probably look like the two above.
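
For illustration, the doc comment (and the method itself) might end up looking roughly like this; the exact set of VOTER_* types should be checked against the enum on this branch:

```
// IsAnyVoter returns true if the replica is a voter in the previous
// config (pre-reconfiguration) or the next config (post-reconfiguration),
// i.e. it has any of the VOTER_* replica types.
func (r ReplicaDescriptor) IsAnyVoter() bool {
	switch r.GetType() {
	case VOTER_FULL, VOTER_INCOMING, VOTER_OUTGOING,
		VOTER_DEMOTING_LEARNER, VOTER_DEMOTING_NON_VOTER:
		return true
	default:
		return false
	}
}
```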


pkg/roachpb/metadata_replicas.go, line 527 at r1 (raw file):

	if !ok {
		return errReplicaNotFound
	} else if !repDesc.IsVoterNewConfig() {

This should probably also go in its own commit/PR, with some testing, for the same reasons as listed above.


pkg/testutils/testcluster/testcluster.go, line 886 at r1 (raw file):

) {
	if err := tc.TransferRangeLease(rangeDesc, dest); err != nil {
		t.Fatalf(`couldn not transfer lease for range %s error is %+v`, rangeDesc, err)

"could not" or "couldn't"


pkg/testutils/testcluster/testcluster.go, line 897 at r1 (raw file):

) {
	testutils.SucceedsSoon(t, func() error {
		if _, err := tc.RemoveVoters(rangeDesc.StartKey.AsRawKey(), src); err != nil {

I'm surprised this works. Are we entering a joint config here?


pkg/kv/kvserver/client_lease_test.go, line 640 at r1 (raw file):

	tc.TransferRangeLeaseOrFatal(t, rhsDesc, tc.Target(0))

	// Check that the lease is on 1

nit: punctuation needed at the end of a lot of these comments.


pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):

	require.Equal(t, tc.Target(2), leaseHolder)

	gossipLiveness(t, tc)

Why is this needed?


pkg/kv/kvserver/client_lease_test.go, line 657 at r1 (raw file):

	// Relocate range 3 -> 4.
	_, err = db.Exec("ALTER RANGE " + rhsDesc.RangeID.String() + " RELOCATE FROM 3 TO 4")

It's not very common to see SQL used in these tests to perform this kind of change. It's more common to see kv.DB.AdminChangeReplicas used, or even better, one of the TestCluster helper methods. Maybe we need a new helper method?
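
Such a helper might look roughly like the sketch below; the helper name is made up and the exact AdminRelocateRange signature is an assumption:

```
// RelocateRangeOrFatal is a hypothetical TestCluster helper that relocates
// a range's voters via the KV API rather than through SQL.
func (tc *TestCluster) RelocateRangeOrFatal(
	t testing.TB, startKey roachpb.Key, voterTargets []roachpb.ReplicationTarget,
) {
	db := tc.Server(0).DB()
	if err := db.AdminRelocateRange(
		context.Background(), startKey, voterTargets, nil, /* nonVoterTargets */
	); err != nil {
		t.Fatalf("could not relocate range starting at %s: %+v", startKey, err)
	}
}
```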


pkg/kv/kvserver/client_lease_test.go, line 665 at r1 (raw file):

	require.Equal(t, tc.Target(3), leaseHolder)

	// Double check that lease moved directly 3 -> 4

This is neat!


pkg/kv/kvserver/client_raft_test.go, line 3803 at r1 (raw file):

}

nit: stray line. That may go away with a gofmt.


pkg/kv/kvserver/client_relocate_range_test.go, line 239 at r1 (raw file):

		})
	}

tiny nit: based on the rest of this test, it seems like we wanted this new line for consistency.

@nvanbenschoten

I remembered another comment I meant to leave here. We'll need to think about mixed version clusters when making this change. We shouldn't begin a joint configuration that would require the removal of the leaseholder until after all nodes in the cluster are running v22.1.
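
In code this would typically be a cluster-version gate, roughly along the lines of the sketch below (both the version key and the fallback helper are hypothetical names):

```
// Sketch: only propose a change that removes the current leaseholder once
// the whole cluster runs a version that understands it.
// clusterversion.EnableLeaseHolderRemoval is a hypothetical key here.
if !st.Version.IsActive(ctx, clusterversion.EnableLeaseHolderRemoval) {
	// Old behavior: transfer the lease away first, then remove the replica.
	return r.maybeTransferLeaseAwayBeforeRemoval(ctx)
}
```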


@tbg tbg left a comment


Gave a shallow pass, generally looks good. Sad that all the messy lease stuff is now intertwining with the replication changes but I suppose that's what we signed up for. Still, if there's a way to clean that up I would be in favor of that. Maybe all of the lease handling doesn't have to be inline but can be pulled out into something compact that we can call in the right place.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, and @shralex)


pkg/kv/db.go, line 629 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Was this just added for debugging purposes, or did you intend for us to keep it?

These should go away; we shouldn't be logging in *kv.DB. If anything, these logs would be expected at the server that executes the actual transfer.

Also, general nits:

  • log messages start lowercase
  • don't need to call target.String, just target should do fine (%s verb invokes String())
  • we usually refer to stores as s%d in logging

pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):

				fallthrough
			case decideWithoutStats:
				if !opts.disableLeaseCountConvergenceChecks &&

Was this just missing, and you noticed it while writing this PR?


pkg/kv/kvserver/replica_command.go, line 1218 at r1 (raw file):

				if err != nil {
					return nil, err
				}

Are we sure the lease is now where it should be? There's obviously the case where for some reason the lease gets transferred somewhere else by a third party. Could there be races with lease preferences where the new leaseholder immediately sends the lease back, and we never manage to do the replication change below? Similarly, are there persistent error cases where the lease transfer (or one of the steps above) fail? I assume there are no new concerns here since we were never previously able to remove the leaseholder and so nothing has really changed. But I am vaguely worried that what we do now isn't 100% equivalent and that we need a fail-safe (which the migration forces us to have anyway).


pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider pulling this into a separate commit/PR, separate from the rest of the behavior changes. This shouldn't have any visible effect, but it would be nice to validate that. A separate commit also has the benefit of being "bisectable" in case there's any rare fallout that we need to track back to this change.

Ideally, we could even test this by manually issuing each step of the joint config process (enter joint, transfer lease, exit joint), but it doesn't look like we have any testing that invokes execChangeReplicasTxn directly. @tbg are you aware of any tests at this level?

EDIT: maybe we can do something around TestLeaseHolderRemoveSelf to test this?

I don't think there's direct testing of this, unfortunately. Perhaps a reasonable way to get what we want is by inserting testing knobs between the steps that we can use to check on the replication change at various stages.


pkg/kv/kvserver/replica_raft.go, line 323 at r1 (raw file):

		// and then transfer lease directly to v4, but this would change the number of replicas to 4,
		// and if region1 goes down, we loose a quorum. Instead, we move to a joint config where v1
		// (VOTER_DEMOTED_LEARNER) transfer the lease to v4 (VOTER_INCOMING) directly.

DEMOTING


pkg/kv/kvserver/replicate_queue.go, line 1288 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: missing punctuation.

capitalization too and usually (though redundantly) the comment would say

// disableLeaseCountConvergenceChecks, when true, ...


pkg/kv/kvserver/client_raft_test.go, line 3826 at r1 (raw file):

	require.NoError(t, err)

	// Check that lease moved to server 2

.


@aayushshah15 aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)


pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):

For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway

I don't fully follow why we need this new option. disableLeaseCountConvergenceChecks, as it's currently written, means that we're forced to use followTheWorkload or return an empty result.


pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we return an error in this case? If the lease can't be transferred away from the current leaseholder, then there's no use trying to perform the config change which we know is going to fail.

Also, regardless of whether we log (log.Warningf) or return an error, let's add more information about why we're trying to transfer the lease and what the fact that we can't find a target means. It's not a good situation, and we will want to be alerted of it if a customer cluster ever gets stuck in this state.

+1 I'm not sure we need to call into TransferLeaseTarget at all here. Zone configuration lease preferences must be compatible with the constraints placed on the range. So we should never be trying to replace the current leaseholder with another replica that is in violation of those lease preferences. Given all this, I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder.

Another thing to note is that part of AdminRelocateRange's contract is that it must transfer the lease to the first voting replica in the input slice. This is done here at the end:

```
if len(ops) == 0 {
	if len(voterTargets) > 0 {
		// NB: we may need to transfer even if there are no ops, to make
		// sure the attempt is made to make the first target the final
		// leaseholder. If there are ops, lease transfer will happen during
		// joint config exit, if leader is removed.
		if err := transferLease(voterTargets[0]); err != nil {
			return rangeDesc, err
		}
	}
	return rangeDesc, ctx.Err()
}
```
This further confuses this call into TransferLeaseTarget for me.
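
A minimal sketch of the direct-transfer idea described above, under the assumption that the joint config contains at most one VOTER_INCOMING replacing the leaseholder (the helper name is illustrative):

```
// pickIncomingVoterAsLeaseTarget returns the VOTER_INCOMING replica that is
// replacing the current leaseholder, if one exists in the descriptor.
func pickIncomingVoterAsLeaseTarget(
	desc *roachpb.RangeDescriptor,
) (roachpb.ReplicaDescriptor, bool) {
	for _, rd := range desc.Replicas().Descriptors() {
		if rd.GetType() == roachpb.VOTER_INCOMING {
			return rd, true
		}
	}
	return roachpb.ReplicaDescriptor{}, false
}
```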


pkg/kv/kvserver/replica_command.go, line 2795 at r1 (raw file):

					// NB: we may need to transfer even if there are no ops, to make
					// sure the attempt is made to make the first target the final
					// leaseholder. If there are ops, lease transfer will happen during joint config exit,

nit: this addition is a bit out of place here, and the line wrapping is off.


pkg/kv/kvserver/replica_command.go, line 2796 at r1 (raw file):

					// sure the attempt is made to make the first target the final
					// leaseholder. If there are ops, lease transfer will happen during joint config exit,
					// if leader is removed.

s/leader/leaseholder


pkg/kv/kvserver/replica_raft.go, line 302 at r1 (raw file):

		prefix = false

		// The following deals with removing a leaseholder. A voter can be removed in two ways.

note that comments should be wrapped at 80 chars (see: https://cockroachlabs.atlassian.net/wiki/spaces/CRDB/pages/181371303/Go+Golang+coding+guidelines#Line-Length) here and elsewhere.

@aayushshah15

We'll need to think about mixed version clusters when making this change. We shouldn't begin a joint configuration that would require the removal of the leaseholder until after all nodes in the cluster are running v22.1.

To add to this, we will also still need the removed allocator logic under mixed-version states.

@aayushshah15


pkg/kv/kvserver/replicate_queue.go, line 543 at r1 (raw file):

		// See about transferring the lease away if we're about to remove the
		// leaseholder.
		done, err := rq.maybeTransferLeaseAway(

This call should also probably go away.


@shralex shralex left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @nvanbenschoten, @shralex, and @tbg)


-- commits, line 29 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider adding a release note to this commit. It's a valuable change that users will care about!

Sure, I'll add before this is merged, once we converge.


pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

was this just missing & you noticed while writing this PR?

I reverted this change, as the same can be achieved by setting checkTransferLeaseSource to false.


pkg/kv/kvserver/replica_command.go, line 1177 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It looks like your gofmt is having issues. You should double-check that you have https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher set up correctly.

thanks! I set it up previously but it somehow disappeared...


pkg/kv/kvserver/replica_command.go, line 1203 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Consider adding a comment about why we set this to true. Something about having already filtered out the current leaseholder.

Actually, I think it should be false because of the lease count checks in the TransferLeaseTarget method.


pkg/kv/kvserver/replica_command.go, line 1205 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

For instance, we don't disable the lease count convergence checks in replicateQueue.maybeTransferLeaseAway

I dont fully follow why we need this new option. disableLeaseCountConvergenceChecks as its currently written makes it such that we're forced to use followTheWorkload or return an empty result.

What I was observing is that when the lease counts are roughly similar, this method would return an empty result. I now removed this new option, since checkTransferLeaseSource=false would achieve what I needed here.


pkg/kv/kvserver/replica_command.go, line 1211 at r1 (raw file):
Updated this to return an error.

I think we should be able to directly transfer the lease to the VOTER_INCOMING that is replacing the current leaseholder

It might be better to make this code a bit more general, such that it works even if there are multiple incoming nodes, or none. Also, there are cases where the joining node is not the best choice. Running the selection logic addresses both issues.


pkg/kv/kvserver/client_lease_test.go, line 654 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why is this needed?

Without this, the liveness status of candidate nodes is unknown, and they are excluded from consideration when we're trying to decide where to transfer the lease.

shralex added a commit that referenced this pull request Jan 5, 2022
…descriptor

Previously, the determination of whether a leaseholder is being removed was made
by looking at the proposed changes. As a step towards #74077
we'd like to instead look at the replica descriptor that the reconfiguration change would result in.
This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR.
This PR should not cause any change in behavior.

Release note: None
shralex added a commit that referenced this pull request Jan 5, 2022
Previously, VOTER_INCOMING replicas joining the cluster weren't allowed
to receive the lease, even though there is no actual problem with doing so.
This change removes the restriction, as a step towards #74077

Release note: None
craig bot pushed a commit that referenced this pull request Jan 6, 2022
74479: backupccl: replace memory accumulator with bound account r=dt a=adityamaru

This change replaces the memory accumulator with a raw bound
account. The memory accumulator provides a threadsafe wrapper
around the bound account, and some redundant resource pooling
semantics. Both of these are currently not needed by the sstSink
that is being memory monitored. This change deletes the memory
accumulator.

Release note: None

74546: kvserver: allow VOTER_INCOMING to receive the lease r=shralex a=shralex

Previously, VOTER_INCOMING replicas joining the cluster weren't allowed
to receive the lease, even though there is no actual problem with doing so.
This change removes the restriction, as a step towards #74077

Release note: None

74549: kvserver: determine if leaseholder is removed using proposed replica … r=shralex a=shralex

…descriptor

Previously, the determination of whether a leaseholder is being removed was made
by looking at the proposed changes. As a step towards #74077
we'd like to instead look at the replica descriptor that the reconfiguration change would result in.
This prepares the ground for making a distinction between incoming and outgoing replicas in the next PR.
This PR should not cause any change in behavior.

Release note: None

Co-authored-by: Aditya Maru <adityamaru@gmail.com>
Co-authored-by: shralex <shralex@gmail.com>
craig bot pushed a commit that referenced this pull request Mar 23, 2022
77841: kvserver: transfer lease in JOINT config directly to a VOTER_INCOMING… r=shralex a=shralex

… replica

In #74077 we allowed entering a JOINT
config when the leaseholder is being demoted, in which case we attempted to transfer the lease
away before leaving the JOINT config. Looking for a lease transfer target at that stage created
flakiness, since in some situations this target isn't found and we need to retry. This PR
takes the approach that when the leaseholder is being removed, we should enter the JOINT config
only if there is a VOTER_INCOMING replica. In that case, the lease is transferred directly to
that replica, without further checks. And otherwise, the lease must be transferred before attempting
to remove the leaseholder.

Release justification: fixes flakiness caused by #74077
Release note: None

Here are experiments demonstrating that the benefits of #74077
remain with this PR. Please refer to #74077 for details.
1. No leaseholder preference:
<img width="985" alt="Screen Shot 2022-03-22 at 7 24 15 PM" src="https://user-images.githubusercontent.com/6037719/159631649-36a296a8-7619-4dab-8db1-a8b603b6d66e.png">


2. Leaseholder preference is rack 0:
<img width="977" alt="Screen Shot 2022-03-22 at 10 39 50 PM" src="https://user-images.githubusercontent.com/6037719/159631504-1cd3350a-89a5-4d77-b9a1-37dfdd2896ae.png">



78101: colexecjoin: optimize building output on the left in merge joiner r=yuzefovich a=yuzefovich

This commit updates the way we're building output in the merge joiner
from the left input when building directly from the left batch (i.e. not
from the buffered group). There, we need to repeat a single tuple
`toAppend` times, so we do it in a loop. This commit adds the
optimization of using `Bytes.Copy` for the bytes-like types as well as
BCE for sliceable types.
```
MergeJoiner/rows=32-24                     29.3MB/s ± 3%  29.5MB/s ± 3%     ~     (p=0.684 n=10+10)
MergeJoiner/rows=512-24                    79.4MB/s ± 2%  77.8MB/s ± 3%   -1.91%  (p=0.043 n=10+10)
MergeJoiner/rows=4096-24                    192MB/s ± 2%   189MB/s ± 1%   -1.36%  (p=0.029 n=10+10)
MergeJoiner/rows=32768-24                   278MB/s ± 1%   275MB/s ± 0%   -1.30%  (p=0.000 n=10+10)
MergeJoiner/oneSideRepeat-rows=32-24       37.3MB/s ± 3%  38.0MB/s ± 2%   +1.78%  (p=0.029 n=10+10)
MergeJoiner/oneSideRepeat-rows=512-24       212MB/s ± 1%   215MB/s ± 2%   +1.42%  (p=0.003 n=9+10)
MergeJoiner/oneSideRepeat-rows=4096-24      765MB/s ± 4%   770MB/s ± 3%     ~     (p=0.436 n=10+10)
MergeJoiner/oneSideRepeat-rows=32768-24    1.22GB/s ± 2%  1.23GB/s ± 2%     ~     (p=0.393 n=10+10)
MergeJoiner/bothSidesRepeat-rows=32-24     22.7MB/s ± 2%  22.9MB/s ± 2%     ~     (p=0.203 n=9+10)
MergeJoiner/bothSidesRepeat-rows=512-24     102MB/s ± 4%   104MB/s ± 2%   +2.38%  (p=0.011 n=10+10)
MergeJoiner/bothSidesRepeat-rows=4096-24    117MB/s ± 1%   127MB/s ± 1%   +9.11%  (p=0.000 n=10+9)
MergeJoiner/bothSidesRepeat-rows=32768-24  59.2MB/s ± 1%  67.1MB/s ± 1%  +13.48%  (p=0.000 n=10+10)
```

Release note: None

Co-authored-by: shralex <shralex@gmail.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
kvoli added a commit to kvoli/cockroach that referenced this pull request May 16, 2024
The allocator would only add a voter, instead of both adding and removing
the existing voter, when rebalancing ranges with one replica. Removing the
leaseholder replica was not possible prior to cockroachdb#74077, so an
addition-only change was necessary.

This restriction is no longer necessary. Allow rebalancing a one-voter
range between stores using joint configurations, where the lease is
transferred from the outgoing demoting voter to the incoming voter's
store.

Scattering ranges with one voter will now leave the range with exactly
one voter, where previously both the leaseholder voter evaluating the
scatter and the new voter would be left.

Before this patch, scattering 1000 ranges with RF=1 on a 5 store
cluster:

```
  store_id | replica_count | replica_distribution | lease_count | lease_distribution
-----------+---------------+----------------------+-------------+---------------------
         1 |          1001 | ##########           |         500 | ##########
         5 |           291 | ###                  |         147 | ###
         4 |           275 | ###                  |         137 | ###
         3 |           229 | ###                  |         118 | ###
         2 |           206 | ###                  |          99 | ##
```

After:

```
  store_id | replica_count | replica_distribution | lease_count | lease_distribution
-----------+---------------+----------------------+-------------+---------------------
         2 |           242 | ##########           |         241 | ##########
         4 |           227 | ##########           |         227 | ##########
         5 |           217 | #########            |         216 | #########
         3 |           209 | #########            |         208 | #########
         1 |           106 | #####                |         109 | #####
```

Fixes: cockroachdb#108420
Fixes: cockroachdb#124171

Release note (bug fix): Scattering a range with replication factor 1 no
longer erroneously up-replicates the range to two replicas. Leases will
also no longer thrash between nodes when perturbed with replication
factor 1.
craig bot pushed a commit that referenced this pull request May 20, 2024
124284: kvserver: rebalance ranges with one voter using joint configurations  r=nvanbenschoten a=kvoli

The allocator would only add a voter, instead of both adding and removing
the existing voter, when rebalancing ranges with one replica. Removing the
leaseholder replica was not possible prior to #74077, so an addition-only
change was necessary.

This restriction is no longer necessary. Allow rebalancing a one-voter
range between stores using joint configurations, where the lease is
transferred from the outgoing demoting voter to the incoming voter's
store.

Scattering ranges with one voter will now leave the range with exactly
one voter, where previously both the leaseholder voter evaluating the
scatter and the new voter would be left.

Before this patch, scattering 1000 ranges with RF=1 on a 5 store
cluster:

```
  store_id | replica_count | replica_distribution | lease_count | lease_distribution
-----------+---------------+----------------------+-------------+---------------------
         1 |          1001 | ##########           |         500 | ##########
         5 |           291 | ###                  |         147 | ###
         4 |           275 | ###                  |         137 | ###
         3 |           229 | ###                  |         118 | ###
         2 |           206 | ###                  |          99 | ##
```

After:

```
  store_id | replica_count | replica_distribution | lease_count | lease_distribution
-----------+---------------+----------------------+-------------+---------------------
         2 |           242 | ##########           |         241 | ##########
         4 |           227 | ##########           |         227 | ##########
         5 |           217 | #########            |         216 | #########
         3 |           209 | #########            |         208 | #########
         1 |           106 | #####                |         109 | #####
```

Fixes: #108420
Fixes: #124171

Release note (bug fix): Scattering a range with replication factor 1 no
longer erroneously up-replicates the range to two replicas. Leases will
also no longer thrash between nodes when perturbed with replication
factor 1.

124444: packer: move `add-apt-repository` call to after `sleep` r=jlinder a=rickystewart

Updates are still going on on the VM while this is happening.

Epic: none
Release note: None

Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Successfully merging this pull request may close these issues.

kvserver: lease counts diverge when a new node is added to a cluster