kvserver: Allow rebalances between stores on the same nodes. #51567
Conversation
Force-pushed from afb036a to 18b278a
This is very exciting work! I'm probably the least familiar with this code out of the three reviewers, so someone else should take a look, but everything seems sane here.
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, and @tbg)
pkg/kv/kvserver/allocator.go, line 565 at r1 (raw file):
	}
}
return roachpb.ReplicaDescriptor{}, "", errors.New("could not select an appropriate replica to be removed")
When do we expect to hit this? Is that even possible now? Won't doRemoveTarget itself return an error in this case?
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
var removeDetails string
var storeFilter roachpb.StoreDescriptorFilter = roachpb.NoOpStoreDescriptorFilter{}
if isSameNodeRebalanceAttempt(existingPlusOneNew, target) {
Could you leave a comment about why we use existingPlusOneNew instead of replicaCandidates? Something about not wanting to add replicas on two stores on the same node even in the case where the existing replica is removable. And then mention what happens in that case (we return false?).
Also, can this just use existingReplicas and then avoid the StoreID check entirely?
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
nit: how do you feel about our use of newReplica and target in this block? I find it awfully confusing that they're used somewhat interchangeably and yet they mean roughly the same thing. And we have a removeReplica in scope, which I would expect to parallel newReplica, and yet it doesn't really. Maybe it's just because I'm not familiar with the code, but it seems like this could be cleaner.
I wrote this out while reading over the existing code and before seeing your cleanup with removalTarget. Leaving it here to show appreciation for your work.
pkg/kv/kvserver/allocator.go, line 692 at r1 (raw file):
var storeFilter roachpb.StoreDescriptorFilter = roachpb.NoOpStoreDescriptorFilter{}
if isSameNodeRebalanceAttempt(existingPlusOneNew, target) {
	storeFilter = roachpb.SameNodeStoreDescriptorFilter{NodeID: target.store.Node.NodeID}
Could you add a comment here? Something that just ensures that the reader understands the double-negation that's going on here. I found it tricky to navigate the logic around a filter that filters out replicas that can not, not be part of the resulting range.
A comment that said "// We're rebalancing between stores on the same node, so only consider other stores on the same node as candidates to be removed." would have helped a lot.
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
	return zero, zero, "", false
}
if target.store.StoreID != removalTarget.store.StoreID && removalTarget.less(*target) {
Why do we now need this less call? Why wasn't it needed before? Did simulateRemoveTarget previously return an err if there were no better candidates? Does it no longer?
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
	existingStoreIDs[i] = exist.StoreID
}
sl, _, _ := a.storePool.getStoreListFromIDs(existingStoreIDs, storeFilterNone)
storeFilterNone confused me. I thought it was something you added here, but I see that it's not new. It would be a shame to overload this term.
Thinking about this code a bit more, I actually find it kind of strange that we need this storeFilter at all. Why don't we filter from the candidate list right up top, where we determine which filter to use, in the same way that we use simulateFilterUnremovableReplicas?
pkg/kv/kvserver/allocator.go, line 1408 at r1 (raw file):
// isSameNodeRebalanceAttempt returns true if the target candidate is on the
//same node as an existing replica.
nit: space at beginning of comment.
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
	return 0
}
// For same node rebalances we need to manually consider the
Why is this? Isn't the balance score a function of a candidate's StoreCapacity vs. the mean StoreCapacity? If the score is not a function of any per-node state then I don't see why something like this would be needed.
Does this relate to your comment about "An additional set of changes were necessary in the allocator heuristics to better detect when the stores on a single node are balanced and stop attempting to move ranges around." in the commit message?
pkg/kv/kvserver/allocator_scorer.go, line 475 at r1 (raw file):
// ordered from least qualified for removal to most qualified. Stores that are
// marked as not valid, are in violation of a required criteria.
func removeCandidates(
Mention in the comment above that only stores matching the filter will be considered as removal candidates.
pkg/kv/kvserver/allocator_test.go, line 681 at r1 (raw file):
// We run through all the ranges once to get the cluster to balance,
// after that we should not be seeing replicas move
nit: punctuation throughout.
pkg/kv/kvserver/allocator_test.go, line 694 at r1 (raw file):
if ok {
	// Update the descriptor
	newReplicas := make([]roachpb.ReplicaDescriptor, len(ranges[i].InternalReplicas))
I don't think this is doing what we want. It's creating a slice of length N and then appending after those N zero-valued elements. I think you want newReplicas := make([]roachpb.ReplicaDescriptor, 0, len(ranges[i].InternalReplicas))
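For illustration, a standalone sketch (not the test code itself) of why the length-vs-capacity distinction matters:

```go
package main

import "fmt"

func main() {
	src := []int{1, 2, 3}

	// make with length 3: append writes *after* three zero values.
	a := make([]int, len(src))
	for _, v := range src {
		a = append(a, v)
	}
	fmt.Println(a) // [0 0 0 1 2 3]: six elements instead of three

	// make with length 0 and capacity 3: append fills the slice as intended.
	b := make([]int, 0, len(src))
	for _, v := range src {
		b = append(b, v)
	}
	fmt.Println(b) // [1 2 3]
}
```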
pkg/kv/kvserver/replica_command.go, line 1135 at r1 (raw file):
	byNodeAndStoreID[chg.Target.NodeID] = byStoreID
} else {
	// The only operation that is allowed within a node is an Add/Remove
nit: punctuation.
pkg/kv/kvserver/replica_command.go, line 1159 at r1 (raw file):
}
delete(byNodeAndStoreID, rDesc.NodeID)
if len(byStoreID) == 2 {
Consider restructuring this using switch len(byStoreID) { case 1: ... case 2: ... default: panic("unexpected") } to catch bugs where the length is not 1 or 2. (A sketch follows below.)
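A minimal runnable sketch of that shape; the map contents and case bodies are placeholders, not the actual validation logic:

```go
package main

import "fmt"

// validateNodeChanges shows the suggested switch structure: lengths other
// than 1 or 2 indicate a bug and should fail loudly instead of slipping
// through an if statement.
func validateNodeChanges(byStoreID map[int]string) {
	switch len(byStoreID) {
	case 1:
		// A single change on this node; nothing store-specific to cross-check.
	case 2:
		// An add/remove pair on the same node (a lateral rebalance); the
		// real code verifies one add and one remove here.
	default:
		panic("unexpected")
	}
}

func main() {
	validateNodeChanges(map[int]string{1: "ADD_REPLICA"})
	fmt.Println("ok")
}
```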
pkg/kv/kvserver/replica_command.go, line 1162 at r1 (raw file):
chg, k := byStoreID[rDesc.StoreID]
// We should be removing the replica from the existing store during a
// rebalance within the node
nit: punctuation here and below.
pkg/kv/kvserver/replica_command.go, line 1163 at r1 (raw file):
// We should be removing the replica from the existing store during a
// rebalance within the node
if !k || chg.ChangeType != roachpb.REMOVE_REPLICA {
When would !k?
pkg/kv/kvserver/replica_command_test.go, line 123 at r1 (raw file):
	{ChangeType: roachpb.REMOVE_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 2, StoreID: 1}},
})
require.Error(t, err)
We generally prefer require.Regexp so we can make an assertion on the error itself instead of just asserting that there was an error. The mechanism has changed over time (it used to be testutils.IsError), but the general principle has caught multiple bugs over the course of the project that would have otherwise allowed tests to continue silently passing even when broken.
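A self-contained example of the difference between the two assertions; the error message here is made up for illustration:

```go
package kvserver_test

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestErrorAssertionStyle(t *testing.T) {
	err := errors.New("node 2 may not be added twice")

	// Asserts only that some error occurred.
	require.Error(t, err)

	// Also asserts *which* error occurred: the pattern is a regexp
	// matched against the error's string form, so a changed or wrong
	// error fails the test instead of silently passing.
	require.Regexp(t, "may not be added twice", err)
}
```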
pkg/kv/kvserver/replica_command_test.go, line 196 at r1 (raw file):
require.NoError(t, err)
// Test Case 14: We are rebalancing within a node and do a remove
nit: punctuation.
pkg/roachpb/metadata.go, line 649 at r1 (raw file):
// StoreDescriptorFilter defines an interfaces that filters out
// StoreDescriptors.
Could you mention what the return value means? Filter out or passes the filter?
pkg/roachpb/metadata.go, line 666 at r1 (raw file):
}
// NoOpStoreDescriptorFilter is a dummy implementation of StoreDescriptorFilter
It's not really a dummy because we use it in non-testing code. Also, "NoOp" is pretty ambiguous. How about AllStoreDescriptorFilter or something that makes it clear that all stores pass the filter instead of no stores.
Also, missing punctuation in this comment and the next.
I'm also pretty excited about this. There are two big contributions here, the first one obviously being the feature you're unlocking, and it looks like that might be mostly done. However, there's a second one, which I think can still improve, and that is encoding your newly achieved understanding for future generations. It seems that there is something going on with the per-store means and constraints not fitting together well and leading to poor rebalancing decisions, and neither Nathan nor I could figure it out from looking at your changes (and I even had you try to explain it to me before)! This tells me that a lot will be won by doubling down on this aspect in particular.
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 692 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment here? Something that just ensures that the reader understands the double-negation that's going on here. I found it tricky to navigate the logic around a filter that filters out replicas that can not, not be part of the resulting range.
A comment that said "// We're rebalancing between stores on the same node, so only consider other stores on the same node as candidates to be removed." would have helped a lot.
+1
pkg/kv/kvserver/allocator.go, line 704 at r1 (raw file):
)
if err != nil {
	log.Warningf(ctx, "simulating RemoveTarget failed: %+v", err)
Not your code, but this is a questionable use of Warningf. If this is important, it should propagate an error up. Otherwise, don't log (or log behind V(2) or something like that). It's always hard to predict how often and when these low-level code paths are hit, so they should not do their own logging.
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do we now need this less call? Why wasn't it needed before? Did simulateRemoveTarget previously return an err if there were no better candidates? Does it no longer?
Is this just for determinism? I'm also confused here; a comment would help.
pkg/kv/kvserver/allocator.go, line 1373 at r1 (raw file):
) (*candidate, string, error) {
	if len(candidates) == 0 {
		return nil, "", errors.Errorf("must supply at least one candidate replica to allocator.RemoveTarget()")
nit: doRemoveTarget
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this? Isn't the balance score a function of a candidate's StoreCapacity vs. the mean StoreCapacity? If the score is not a function of any per-node state then I don't see why something like this would be needed.
Does this relate to your comment about "An additional set of changes were necessary in the allocator heuristics to better detect when the stores on a single node are balanced and stop attempting to move ranges around." in the commit message?
Alex tried to explain this to me in-"person" but I didn't understand it then and don't understand it now despite having stared a bit. The only thing I understand is that the situation isn't fully symmetric. If you have three nodes and the first node has two stores (and the others one), then each node has to have exactly one replica of each range, which means twin stores have half the number of replicas compared to the non-twin stores. For example, we'd get n1(s1:20 s2:20) n2(s3:40) n3(s4:40) when fully balanced. This seems to pretty fundamentally mess with the assumption that it is a good idea to reduce the divergence from the mean for all stores alike. What I don't understand is how this particular code factors into that (if it does at all), and if so, whether it's a band-aid for a particular problem or a holistic "fix".
My best stab at understanding this is that what's new here is that for the first time we're seeing stores with incomparable means in replica movement. Usually (or so I assume), because all replicas considered in a queue action satisfy the constraints, the replica counts of the involved stores are somehow "comparable" in the sense that it would be a reasonable goal to roughly even them out. But for intra-node rebalances, this isn't true, as the example above shows.
pkg/kv/kvserver/allocator_scorer.go, line 480 at r1 (raw file):
existingNodeLocalities map[roachpb.NodeID]roachpb.Locality,
options scorerOptions,
storeFilter roachpb.StoreDescriptorFilter,
Maybe "canRemoveFilter"?
pkg/kv/kvserver/allocator_test.go, line 648 at r1 (raw file):
// We start out with 40 ranges on 3 nodes and 3 stores, we then add a new store
// on Node 1 and try to rebalance all the ranges. What we want to see happen
// is an equilibrium where 20 ranges move from Stopre 1 to Store 2.
Stopre
pkg/kv/kvserver/allocator_test.go, line 717 at r1 (raw file):
}
// Verify that the stores are reasonably balanced
.
pkg/kv/kvserver/allocator_test.go, line 721 at r1 (raw file):
store1.Capacity.RangeCount-store2.Capacity.RangeCount)) <= minRangeRebalanceThreshold)
// We dont expect any range wanting to move since the system should have
// reached a stable state at this point
.
pkg/kv/kvserver/allocator_test.go, line 2897 at r1 (raw file):
a.storePool.getLocalities(existingRepls),
a.scorerOptions(),
roachpb.NoOpStoreDescriptorFilter{},
Does that thing need to live in roachpb? Isn't the only use in kvserver, without a need to export?
pkg/kv/kvserver/replica_command.go, line 1137 at r1 (raw file):
// The only operation that is allowed within a node is an Add/Remove
for _, prevChg := range byStoreID {
	if prevChg.ChangeType == chg.ChangeType {
also:
if prevChg.ChangeType != roachpb.ADD_REPLICA {
	return fmt.Errorf("can only add-remove a replica within a node, but got %+v", chgs)
}
since we never issue remove-add (and I'd want to know if we did accidentally).
pkg/kv/kvserver/replica_command.go, line 1150 at r1 (raw file):
}
// Then, check that we're not adding a second replica on nodes that already
We do sometimes add a second replica on a node now.
pkg/kv/kvserver/replica_command.go, line 1163 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
When would !k?
I'm confused by this code; it seems reasonable that !k. For example, if you have a descriptor for stores 1-5 but s5 is completely uninvolved in the change (but lives on a node that is involved), we would still loop over its rDesc and hit this branch.
Also, I don't see why we can check for != REMOVE_REPLICA here; after all, the order here is not stable (or if it is stable, not in a way that we want to rely upon), no?
I'll look at this piece of code more closely in the next round of review.
pkg/kv/kvserver/replica_command.go, line 1165 at r1 (raw file):
if !k || chg.ChangeType != roachpb.REMOVE_REPLICA {
	return errors.Errorf(
		"Expected replica to be removed from %v during a lateral rebalance within the node.", rDesc)
include the map.
pkg/kv/kvserver/replica_command_test.go, line 197 at r1 (raw file):
// Test Case 14: We are rebalancing within a node and do a remove
descRelancing := &roachpb.RangeDescriptor{
Rebalancing
Also, a test for the "uninvolved store" case above would be good, where you have an extra store on n1 that isn't involved in the lateral rebalance. I think this would tickle a bug right now.
pkg/roachpb/metadata.go, line 648 at r1 (raw file):
}
// StoreDescriptorFilter defines an interfaces that filters out
interface
Also, whatever can live closer to the allocator should move there. roachpb is already a grab bag's grab bag; we were way too unprincipled about putting things here in the early days.
By the way, you don't need an interface here, you can just
type StoreDescriptorPredicate func(StoreDescriptor) bool
This also has the advantage that you're not forced to make types just to have a receiver. For example you can just write
func allowAllPredicate(StoreDescriptor) bool { return true }
or
var nodeID roachpb.NodeID = 7
doFilter(func(storeDesc roachpb.StoreDescriptor) bool { return storeDesc.Node.NodeID == nodeID })
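A self-contained sketch of that suggestion; the stand-in types and the filterStores helper are illustrative, not from the codebase:

```go
package main

import "fmt"

// Stand-ins for the real roachpb types, just to make the sketch runnable.
type NodeID int32

type StoreDescriptor struct {
	StoreID int32
	NodeID  NodeID
}

// StoreDescriptorPredicate reports whether a store passes the filter.
type StoreDescriptorPredicate func(StoreDescriptor) bool

// allowAll passes every store, replacing a no-op filter struct.
func allowAll(StoreDescriptor) bool { return true }

// sameNode returns a predicate that passes only stores on the given node.
func sameNode(nodeID NodeID) StoreDescriptorPredicate {
	return func(s StoreDescriptor) bool { return s.NodeID == nodeID }
}

// filterStores keeps the stores for which the predicate returns true.
func filterStores(stores []StoreDescriptor, pred StoreDescriptorPredicate) []StoreDescriptor {
	out := make([]StoreDescriptor, 0, len(stores))
	for _, s := range stores {
		if pred(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	stores := []StoreDescriptor{{StoreID: 1, NodeID: 1}, {StoreID: 2, NodeID: 1}, {StoreID: 3, NodeID: 2}}
	fmt.Println(len(filterStores(stores, allowAll)))    // 3
	fmt.Println(len(filterStores(stores, sameNode(1)))) // 2
}
```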
pkg/roachpb/metadata.go, line 649 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you mention what the return value means? Filter out or passes the filter?
+1
I would rename Filter. Consider Accept or Reject, if either of those makes more sense.
pkg/roachpb/metadata.go, line 660 at r1 (raw file):
Rework this comment once the interface has been reworked. Something like
// Filter accepts the descriptors whose NodeID matches that of the argument.
could be the result.
Force-pushed from 18b278a to b13924a
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator.go, line 565 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
When do we expect to hit this? Is that even possible now? Won't doRemoveTarget itself return an error in this case?
It should never happen, since you will always return out of the for loop, but the code needs a return statement, so that is what was there in the previous version. Not sure if there is anything better to return. The reason for the for loop is that this code returns a ReplicaDescriptor instead of a *candidate.
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you leave a comment about why we use existingPlusOneNew instead of replicaCandidates? Something about not wanting to add replicas on two stores on the same node even in the case where the existing replica is removable. And then mention what happens in that case (we return false?). Also, can this just use existingReplicas and then avoid the StoreID check entirely?
Switched to existingReplicas
pkg/kv/kvserver/allocator.go, line 692 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
+1
Done, thanks for suggesting a comment.
pkg/kv/kvserver/allocator.go, line 704 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not your code, but this is a questionable use of Warningf. If this is important, it should propagate an error up. Otherwise, don't log (or log behind V(2) or something like that). It's always hard to predict how often and when these low-level code paths are hit, so they should not do their own logging.
Switched to VEventf, including the one other Warningf below.
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this just for determinism? I'm also confused here, a comment would help.
It has to do with the fact that the scores for stores on the same node are very close together, so sometimes the algorithm would pick the slightly worse one here, which would then create more movement than you wanted. This is here as an extra guardrail; it would only manifest in a same-node scenario, but since it is a generally applicable guardrail I did not precondition it on isSameNodeRebalance.
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Alex tried to explain this to me in-"person" but I didn't understand it then and don't understand it now despite having stared a bit. The only thing I understand is that the situation isn't fully symmetric. If you have three nodes and the first node has two stores (and the others one), then each node has to have exactly one replica of each range, which means twin stores have half the number of replicas compared to the non-twin stores. For example, we'd get n1(s1:20 s2:20) n2(s3:40) n3(s4:40) when fully balanced. This seems to pretty fundamentally mess with the assumption that it is a good idea to reduce the divergence from the mean for all stores alike. What I don't understand is how this particular code factors into that (if it does at all), and if so, whether it's a band-aid for a particular problem or a holistic "fix".
My best stab at understanding this is that what's new here is that for the first time we're seeing stores with incomparable means in replica movement. Usually (or so I assume), because all replicas considered in a queue action satisfy the constraints, the replica counts of the involved stores are somehow "comparable" in the sense that it would be a reasonable goal to roughly even them out. But for intra-node rebalances, this isn't true, as the example above shows.
@tbg that is exactly the reason. In the case you describe the mean is (20+20+40+40)/4 = 30; because of that, s1 and s2 always look desirable and you want to move ranges there. If you look at how we use minRangeRebalanceThreshold today, it's used in the calculation of the balanceScore, so since the balanceScore is no longer useful for these replicas we need another mechanism for equating them. The reason for a threshold is that the leaseholders are on different nodes, so each leaseholder is making its own decisions about moving things; if you just aim for range equality it's not going to work, since the information about replicas is not instantly transmitted, so you have to add this threshold factor to dampen the movement. It definitely feels a bit dirty, but I am not sure how else to prevent this problem.
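To make the dampening concrete, here is a rough sketch of the kind of banded comparison a rebalance threshold produces; the exact formula in the allocator may differ, and the 5% threshold is an assumed value:

```go
package main

import "fmt"

// classify applies a hypothetical threshold band around the mean: a store
// only counts as overfull/underfull once it diverges by more than the
// threshold, which dampens movement between near-equal stores.
func classify(rangeCount, mean, threshold float64) string {
	switch {
	case rangeCount > mean*(1+threshold):
		return "overfull"
	case rangeCount < mean*(1-threshold):
		return "underfull"
	default:
		return "balanced"
	}
}

func main() {
	// Twin-store example from this thread: counts 20, 20, 40, 40; mean 30.
	mean := (20.0 + 20.0 + 40.0 + 40.0) / 4
	for _, c := range []float64{20, 40} {
		fmt.Printf("count=%v -> %s\n", c, classify(c, mean, 0.05))
	}
	// Both 20 and 40 fall outside the band [28.5, 31.5], so a per-store
	// mean keeps suggesting movement even though the placement is already
	// optimal, which is the problem described above.
}
```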
pkg/kv/kvserver/allocator_scorer.go, line 475 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mention in the comment above that only stores matching the filter will be considered as removal candidates.
Ack
pkg/kv/kvserver/allocator_test.go, line 694 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think this is doing what we want. It's creating a slice of length N and then appending after those N zero-valued elements. I think you want newReplicas := make([]roachpb.ReplicaDescriptor, 0, len(ranges[i].InternalReplicas))
ack
pkg/kv/kvserver/allocator_test.go, line 2897 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does that thing need to live in roachpb? Isn't the only use in kvserver, without a need to export?
I can move it to the allocator, since it's the only place it's used.
pkg/kv/kvserver/replica_command.go, line 1137 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
also:
if prevChg.ChangeType != roachpb.ADD_REPLICA {
	return fmt.Errorf("can only add-remove a replica within a node, but got %+v", chgs)
}
since we never issue remove-add (and I'd want to know if we did accidentally).
ack
pkg/kv/kvserver/replica_command.go, line 1150 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We do sometimes add a second replica on a node now.
This is existing code, so when would this be possible?
pkg/kv/kvserver/replica_command.go, line 1163 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm confused by this code; it seems reasonable that !k. For example, if you have a descriptor for stores 1-5 but s5 is completely uninvolved in the change (but lives on a node that is involved), we would still loop over its rDesc and hit this branch. Also, I don't see why we can check for != REMOVE_REPLICA here; after all, the order here is not stable (or if it is stable, not in a way that we want to rely upon), no? I'll look at this piece of code more closely in the next round of review.
This happens during the second cycle of the rebalance: the descriptor is (n1,s1) (n1,s2,LEARNER) (n2,s3) (n3,s4). Once s2 is caught up, the replicate queue will remove (n1,s1), so that needs to be allowed.
pkg/kv/kvserver/replica_command_test.go, line 123 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We generally prefer require.Regexp so we can make an assertion on the error itself instead of just asserting that there was an error. The mechanism has changed over time (it used to be testutils.IsError), but the general principle has caught multiple bugs over the course of the project that would have otherwise allowed tests to continue silently passing even when broken.
Good call
pkg/roachpb/metadata.go, line 648 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
interface
Also, whatever can live closer to the allocator should move there. roachpb is already a grab bag's grab bag; we were way too unprincipled about putting things here in the early days.
By the way, you don't need an interface here, you can just
type StoreDescriptorPredicate func(StoreDescriptor) bool
This also has the advantage that you're not forced to make types just to have a receiver. For example you can just write
func allowAllPredicate(StoreDescriptor) bool { return true }
or
var nodeID roachpb.NodeID = 7
doFilter(func(storeDesc roachpb.StoreDescriptor) bool { return storeDesc.Node.NodeID == nodeID })
TIL
pkg/roachpb/metadata.go, line 660 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Rework this comment once the interface has been reworked. Something like
// Filter accepts the descriptors whose NodeID matches that of the argument.
could be the result.
Done.
@tbg @nvanbenschoten went through and addressed feedback and questions.
Force-pushed from b13924a to 4e1be18
Fix or add comments about the issue with the replication factor of 1.
require.Regexp(t, "refer to n4 twice for change ADD_REPLICA", err)

// Test Case 11: Try to remove twice to the same node.
err = validateReplicationChanges(desc, roachpb.ReplicationChanges{
A comment here that explains why we are trying to prevent this case would help. I would imagine that if somehow a node ends up with two replicas (each in its own store) and the allocator is suggesting removal of the two, then we should allow it, as it is a transition from a bad to a good state.
pkg/kv/kvserver/allocator_scorer.go (outdated):
// For same node rebalances we need to manually consider the
// minRangeRebalanceThreshold, since in lopsided clusters their under/over
// replication is not properly captured in the balanceScore.
if c.store.Node.NodeID == o.store.Node.NodeID && c.store.Node.NodeID != 0 {
I don't understand this. You shouldn't need it if we are trying to have the same number of ranges per store. If you want the same number of ranges per node, is it not possible then to fix the balanceScore instead? We currently compute (and compare to) the mean number of ranges per store. It can change to be per node: each store would then compare to the mean per node divided by the number of stores on its node.
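A runnable sketch of that idea, with illustrative names; it computes the per-store comparison point as the mean number of ranges per node divided by the number of stores on the store's own node:

```go
package main

import "fmt"

// perStoreTarget returns the suggested comparison point for a store on the
// given node: mean ranges per node, divided by that node's store count.
func perStoreTarget(rangesPerNode, storesPerNode map[int]int, nodeID int) float64 {
	total := 0
	for _, r := range rangesPerNode {
		total += r
	}
	meanPerNode := float64(total) / float64(len(rangesPerNode))
	return meanPerNode / float64(storesPerNode[nodeID])
}

func main() {
	// Thread example: n1 has two stores with 20 ranges each; n2 and n3
	// have one store with 40 ranges each.
	rangesPerNode := map[int]int{1: 40, 2: 40, 3: 40}
	storesPerNode := map[int]int{1: 2, 2: 1, 3: 1}
	fmt.Println(perStoreTarget(rangesPerNode, storesPerNode, 1)) // 20: twin stores already balanced
	fmt.Println(perStoreTarget(rangesPerNode, storesPerNode, 2)) // 40
}
```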
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator.go, line 565 at r1 (raw file):
Previously, lunevalex wrote…
It should never happen, since you will always return out of the for loop, but the code needs a return statement, so that is what was there in the previous version. Not sure if there is anything better to return. The reason for the for loop is that this code returns a ReplicaDescriptor instead of a *candidate.
If that's the case, then we should log.Fatal here so that it's clearly a code path that should never be hit.
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
Previously, lunevalex wrote…
Switched to existingReplicas
Now that you've made that change, should we generalize the helper a little bit? For instance, make it take a NodeID instead of a *candidate and call it containsNode or something like that.
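Something like this minimal sketch of the generalized helper; the name and shape follow the suggestion and are not the actual code:

```go
// containsNode reports whether any of the existing replicas lives on the
// given node.
func containsNode(replicas []roachpb.ReplicaDescriptor, nodeID roachpb.NodeID) bool {
	for _, r := range replicas {
		if r.NodeID == nodeID {
			return true
		}
	}
	return false
}
```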
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
so sometimes the algorithm would pick the slightly worse one here
That's surprising. Why is that the case? Should we fix this in the removal algorithm itself?
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
storeFilterNone confused me. I thought it was something you added here, but I see that it's not new. It would be a shame to overload this term. Thinking about this code a bit more, I actually find it kind of strange that we need this storeFilter at all. Why don't we filter from the candidate list right up top, where we determine which filter to use, in the same way that we use simulateFilterUnremovableReplicas?
I'm still wondering about the second part of this question.
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, lunevalex wrote…
@tbg that is exactly the reason. In the case you describe the mean is (20+20+40+40)/4 = 30; because of that, s1 and s2 always look desirable and you want to move ranges there. If you look at how we use minRangeRebalanceThreshold today, it's used in the calculation of the balanceScore, so since the balanceScore is no longer useful for these replicas we need another mechanism for equating them. The reason for a threshold is that the leaseholders are on different nodes, so each leaseholder is making its own decisions about moving things; if you just aim for range equality it's not going to work, since the information about replicas is not instantly transmitted, so you have to add this threshold factor to dampen the movement. It definitely feels a bit dirty, but I am not sure how else to prevent this problem.
I guess I don't see how that's different from any other imbalanced set of localities at a given locality tier. For instance, if we have 2 nodes in us-east, 1 node in us-central, and 1 node in us-west. Presumably, that configuration does still converge to a reasonable balance, right? We've seen customers run in these kinds of lopsided deployments in the past. So what makes this any different?
pkg/kv/kvserver/allocator_scorer.go, line 501 at r2 (raw file):
// If the candidate does not pass the predicate then we cannot remove it,
// as in it must be a replica that remains.
if !canRemoveStorePredicate(s) {
Should we be adding to candidates with valid: false like we do above?
@darinpp @tbg @nvanbenschoten Thank you for the reviews and feedback. Based on @darinpp's comments and an offline discussion we had, I am significantly re-working the allocator changes to better account for these situations by working the same-node store preferences into the balance and diversity score calculations. This will result in a much bigger change in the allocator, but hopefully it will make this solution more scalable. Please stay tuned; I will update the PR once I get to a working solution.
Force-pushed from 4e1be18 to b59fab8
@nvanbenschoten @darinpp @tbg thanks for the earlier reviews. Based on the feedback, I significantly re-worked this PR to use the balance_score and diversity_score to drive the correct behavior in the allocator. I made the changes, and my live tests on roachprod confirm that they produce the correct behavior on a running cluster. I tested a cluster with the following configuration and got perfect splits: (n1, s1), (n2, s2, s3), (n3, s4, s5, s6). I would like some feedback on this new direction for the PR before I invest a lot of time into testing and polish.
This feels so much better than what we had before. Simply replacing the use of nodes with stores makes the rebalancing work for multi-store nodes and simplifies quite a few things.
Force-pushed from 8d38d81 to 2332697
@tbg @nvanbenschoten I addressed the comments left by @darinpp; this should now be in a mergeable state and could use another review.
This looks good, but I don't have a good grasp of all of the details involved. It's a much smaller diff, which is good! I'm not sure you updated the commit message yet; there is probably a lucid, succinct story you can tell now about your solution.
Reviewed 3 of 11 files at r3, 10 of 10 files at r4.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I guess I don't see how that's different from any other imbalanced set of localities at a given locality tier. For instance, if we have 2 nodes in us-east, 1 node in us-central, and 1 node in us-west. Presumably, that configuration does still converge to a reasonable balance, right? We've seen customers run in these kinds of lopsided deployments in the past. So what makes this any different?
This code is now reverted to original, but I'm curious about Nathan's last question. What makes two stores on a node different from a lopsided locality/attribute constraint? And if there's no difference, then how did it ever work, plus did we fix it now?
I did update the commit message; let me massage it a bit more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This code is now reverted to original, but I'm curious about Nathan's last question. What makes two stores on a node different from a lopsided locality/attribute constraint? And if there's no difference, then how did it ever work, plus did we fix it now?
This is a good question. In theory it should not make a difference, because the 2 nodes in us-east will always look under-replicated; it should exhibit the same behavior I saw with multiple stores on the same node. I will have to run some tests to answer this question. It is unlikely my changes improved this scenario.
pkg/kv/kvserver/allocator_scorer.go, line 218 at r2 (raw file):
Previously, darinpp wrote…
I don't understand this. You shouldn't need it if we are trying to have the same number of ranges per store. If you want the same number of ranges per node, is it not possible then to fix the balanceScore instead? We currently compute (and compare to) the mean number of ranges per store. It can change to be per node: each store would then compare to the mean per node divided by the number of stores on its node.
Done
Force-pushed from 2332697 to 06ee87e
63390: opt: remove view deps user defined types used by dependent table r=rytaft,mgartner a=the-ericwang35

Fixes #63191.

Previously, when a view depended on a table that had a user defined type column, we would create a dependency between the view and the type, even if the view did not directly reference that column. This is because the table referencing the UDT has an implicit CHECK constraint on the type. For example,

```
CREATE TYPE typ AS ENUM('a', 'b', 'c');
CREATE TABLE t (i INT, j typ);
# Has check constraint (j IN (x'40':::@100053, x'80':::@100053, x'c0':::@100053))
CREATE VIEW v AS (SELECT i FROM t);
# Has a dependency on typ.
```

When we create `v`, it calls `buildScalar` on `t`'s check constraint and then `maybeTrackUserDefinedTypeDepsForViews`, which adds dependencies to the types even if the view does not directly use them. This patch addresses this by not capturing this dependency inside `maybeTrackUserDefinedTypeDepsForViews`. Instead, it will only capture dependencies explicitly used in the view. Then, in `ConstructCreateView`, we will look for column dependencies, and see if any columns are a UDT.

Release note: None

63489: kvserver: remove same node check from TransferLeaseTarget r=lunevalex a=lunevalex

In #51567 we added the ability to rebalance replicas between stores on the same node. This PR reverts the changes introduced in #12565, since we no longer need to special case multiple stores per node.

Release note: None

Co-authored-by: Eric Wang <ericw@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
65584: kvserver: fix bug obstructing range merges on multi-store clusters r=aayushshah15 a=aayushshah15

Previously, we had a safeguard inside `allocateTargetFromList()` (the allocator method used for finding the best target to allocate a new replica) that ensured that it would never return a target store on the same node as one of the existing replicas. This check was mostly well-intentioned, but it became outdated after we started allowing rebalances between stores on the same node in #51567.

The aforementioned check is no longer correct since:
1. Callers of `AdminRelocateRange()` (currently, the merge queue and the `StoreRebalancer`) must have the ability to move replicas laterally.
2. We have a more precise check inside `AdminChangeReplicas` that guards against adding 2 replicas of a range on the same node. This check precisely allows the cases where we're rebalancing laterally.

As a result of this check, clusters that use multiple stores per node were more likely to have their range merges fail, because the merge queue would fail in its attempt to collocate ranges that required lateral movement of replicas across stores. @adityamaru noticed error messages of the following nature flooding the logs on a cluster while debugging an incident:

```
none of the remaining voters [n1,s2] are legal additions to (n5,s18):1,(n9,s33):2,(n1,s1):3
```

This patch fixes this behavior by allowing `AdminRelocateRange()` to disable this check, because it may often find itself trying to execute relocations that require moving replicas laterally within a given node. Note that we do not allow _upreplication_ logic to disable this check, since we do not want to upreplicate across stores on the same node.

Release note (bug fix): A bug that made it less likely for range merges to succeed on clusters using multiple stores per node is now fixed.

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
73843: kvcoord: simplify txn already committed assertion r=nvanbenschoten a=tbg

- kvcoord: improve TestCommitSanityCheckAssertionFiresOnUndetectedAmbiguousCommit
- kvcoord: simplify txn already committed assertion

73951: kvserver: improve definition of equivalence classes r=aayushshah15 a=aayushshah15

This commit makes an effort to make `equivalenceClass`es more well-defined. Previously, we would coalesce the equivalence classes for any two existing replicas if they shared the exact same locality hierarchy (including the node ids). This logic became ineffective with #51567 since we disallow multiple replicas on a single node under all circumstances. Furthermore, coalescing the equivalence classes for existing replicas did not buy us anything and instead required a bunch of custom code for us to correctly deal with them.

This commit takes an opinionated approach and gets rid of the logic that coalesces two existing replicas' equivalence classes together.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Closes #6782
This change modifies the replica_queue to allow rebalances between multiple stores within a single node.
This is possible because we previously introduced atomic rebalances in #12768.
The first step was to remove the constraints in the allocator that prevented same-node rebalances and to update the validation in the replica_queue to accept these rebalance proposals. There is one caveat: with 1x replication an atomic rebalance is not possible, so under this condition we now support adding multiple replicas of the range to the same node.
With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired, because in a simple 3x replication scenario a single node crash could make a whole range unavailable. The allocator uses locality tags to model failure domains, but a node was not considered a locality. It is thus natural to extend the failure-domain definition to the node and model it as a locality tier. Now stores on the same node are factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.
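For illustration, a toy version of treating the node as the innermost locality tier; the tier keys and the diversity function are simplified stand-ins for the allocator's actual scoring:

```go
package main

import "fmt"

// Tier is a simplified stand-in for a locality tier (a key=value pair).
type Tier struct{ Key, Value string }

// diversityScore is a toy diversity measure: the fraction of tiers
// (outermost first) at which two localities diverge. 1 means fully
// diverse, 0 means identical.
func diversityScore(a, b []Tier) float64 {
	n := len(a)
	for i := 0; i < n && i < len(b); i++ {
		if a[i] != b[i] {
			return float64(n-i) / float64(n)
		}
	}
	return 0
}

func main() {
	// Appending a synthetic "node" tier makes two stores on the same node
	// score as minimally diverse, so they repel each other just like nodes
	// sharing a datacenter do.
	s1 := []Tier{{"region", "us-east"}, {"node", "n1"}}
	s2 := []Tier{{"region", "us-east"}, {"node", "n1"}} // same node as s1
	s3 := []Tier{{"region", "us-east"}, {"node", "n2"}} // different node

	fmt.Println(diversityScore(s1, s2)) // 0: same node, least diverse
	fmt.Println(diversityScore(s1, s3)) // 0.5: different node, same region
}
```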
Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.