kvserver: Allow rebalances between stores on the same nodes. #51567
Conversation
Force-pushed from afb036a to 18b278a
This is very exciting work! I'm probably the least familiar with this code out of the three reviewers, so someone else should take a look, but everything seems sane here.
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, and @tbg)
pkg/kv/kvserver/allocator.go, line 565 at r1 (raw file):
	}
}
return roachpb.ReplicaDescriptor{}, "", errors.New("could not select an appropriate replica to be removed")
When do we expect to hit this? Is that even possible now? Won't doRemoveTarget itself return an error in this case?
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
var removeDetails string
var storeFilter roachpb.StoreDescriptorFilter = roachpb.NoOpStoreDescriptorFilter{}
if isSameNodeRebalanceAttempt(existingPlusOneNew, target) {
Could you leave a comment about why we use existingPlusOneNew instead of replicaCandidates? Something about not wanting to add replicas on two stores on the same node even in the case where the existing replica is removable. And then mention what happens in that case (we return false?).
Also, can this just use existingReplicas and then avoid the StoreID check entirely?
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
nit: how do you feel about our use of newReplica and target in this block? I find it awfully confusing that they're used somewhat interchangeably and yet they mean roughly the same thing. And we have a removeReplica in scope, which I would expect to parallel newReplica, and yet it doesn't really. Maybe it's just because I'm not familiar with the code, but it seems like this could be cleaner.
I wrote this out while reading over the existing code and before seeing your cleanup with removalTarget. Leaving it here to show appreciation for your work.
pkg/kv/kvserver/allocator.go, line 692 at r1 (raw file):
var storeFilter roachpb.StoreDescriptorFilter = roachpb.NoOpStoreDescriptorFilter{}
if isSameNodeRebalanceAttempt(existingPlusOneNew, target) {
	storeFilter = roachpb.SameNodeStoreDescriptorFilter{NodeID: target.store.Node.NodeID}
Could you add a comment here? Something that just ensures that the reader understands the double-negation that's going on here. I found it tricky to navigate the logic around a filter that filters out replicas that can not, not be part of the resulting range.
A comment that said "// We're rebalancing between stores on the same node, so only consider other stores on the same node as candidates to be removed." would have helped a lot.
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
	return zero, zero, "", false
}
if target.store.StoreID != removalTarget.store.StoreID && removalTarget.less(*target) {
Why do we now need this less call? Why wasn't it needed before? Did simulateRemoveTarget previously return an err if there were no better candidates? Does it no longer?
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
	existingStoreIDs[i] = exist.StoreID
}
sl, _, _ := a.storePool.getStoreListFromIDs(existingStoreIDs, storeFilterNone)
storeFilterNone confused me. I thought it was something you added here, but I see that it's not new. It would be a shame to overload this term.
Thinking about this code a bit more, I actually find it kind of strange that we need this storeFilter at all. Why don't we filter from the candidate list right up top, where we determine which filter to use, in the same way that we use simulateFilterUnremovableReplicas?
pkg/kv/kvserver/allocator.go, line 1408 at r1 (raw file):
// isSameNodeRebalanceAttempt returns true if the target candidate is on the
//same node as an existing replica.
nit: space at beginning of comment.
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
	return 0
}
// For same node rebalances we need to manually consider the
Why is this? Isn't the balance score a function of a candidate's StoreCapacity vs. the mean StoreCapacity? If the score is not a function of any per-node state then I don't see why something like this would be needed.
Does this relate to your comment about "An additional set of changes were necessary in the allocator heuristics to better detect when the stores on a single node are balanced and stop attempting to move ranges around." in the commit message?
pkg/kv/kvserver/allocator_scorer.go, line 475 at r1 (raw file):
// ordered from least qualified for removal to most qualified. Stores that are
// marked as not valid, are in violation of a required criteria.
func removeCandidates(
Mention in the comment above that only stores matching the filter will be considered as removal candidates.
pkg/kv/kvserver/allocator_test.go, line 681 at r1 (raw file):
// We run through all the ranges once to get the cluster to balance,
// after that we should not be seeing replicas move
nit: punctuation throughout.
pkg/kv/kvserver/allocator_test.go, line 694 at r1 (raw file):
if ok {
	// Update the descriptor
	newReplicas := make([]roachpb.ReplicaDescriptor, len(ranges[i].InternalReplicas))
I don't think this is doing what we want. It's creating a slice of length N and then appending after those N zero-valued elements. I think you want newReplicas := make([]roachpb.ReplicaDescriptor, 0, len(ranges[i].InternalReplicas))
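For illustration, a standalone sketch (not the test code itself) of why the length-vs-capacity distinction matters:

```go
package main

import "fmt"

func main() {
	src := []int{1, 2, 3}

	// make with length 3: append writes *after* three zero values.
	a := make([]int, len(src))
	for _, v := range src {
		a = append(a, v)
	}
	fmt.Println(a) // [0 0 0 1 2 3]: six elements instead of three

	// make with length 0 and capacity 3: append fills the slice as intended.
	b := make([]int, 0, len(src))
	for _, v := range src {
		b = append(b, v)
	}
	fmt.Println(b) // [1 2 3]
}
```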
pkg/kv/kvserver/replica_command.go, line 1135 at r1 (raw file):
	byNodeAndStoreID[chg.Target.NodeID] = byStoreID
} else {
	// The only operation that is allowed within a node is an Add/Remove
nit: punctuation.
pkg/kv/kvserver/replica_command.go, line 1159 at r1 (raw file):
}
delete(byNodeAndStoreID, rDesc.NodeID)
if len(byStoreID) == 2 {
Consider restructuring this using switch len(byStoreID) { case 1: ... case 2: ... default: panic("unexpected") } to catch bugs where the length is not 1 or 2. (A sketch follows below.)
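A minimal runnable sketch of that shape; the map contents and case bodies are placeholders, not the actual validation logic:

```go
package main

import "fmt"

// validateNodeChanges shows the suggested switch structure: lengths other
// than 1 or 2 indicate a bug and should fail loudly instead of slipping
// through an if statement.
func validateNodeChanges(byStoreID map[int]string) {
	switch len(byStoreID) {
	case 1:
		// A single change on this node; nothing store-specific to cross-check.
	case 2:
		// An add/remove pair on the same node (a lateral rebalance); the
		// real code verifies one add and one remove here.
	default:
		panic("unexpected")
	}
}

func main() {
	validateNodeChanges(map[int]string{1: "ADD_REPLICA"})
	fmt.Println("ok")
}
```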
pkg/kv/kvserver/replica_command.go, line 1162 at r1 (raw file):
chg, k := byStoreID[rDesc.StoreID]
// We should be removing the replica from the existing store during a
// rebalance within the node
nit: punctuation here and below.
pkg/kv/kvserver/replica_command.go, line 1163 at r1 (raw file):
// We should be removing the replica from the existing store during a
// rebalance within the node
if !k || chg.ChangeType != roachpb.REMOVE_REPLICA {
When would !k?
pkg/kv/kvserver/replica_command_test.go, line 123 at r1 (raw file):
	{ChangeType: roachpb.REMOVE_REPLICA, Target: roachpb.ReplicationTarget{NodeID: 2, StoreID: 1}},
})
require.Error(t, err)
We generally prefer require.Regexp so we can make an assertion on the error itself instead of just asserting that there was an error. The mechanism has changed over time (it used to be testutils.IsError), but the general principle has caught multiple bugs over the course of the project that would have otherwise allowed tests to continue silently passing even when broken.
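A self-contained example of the difference between the two assertions; the error message here is made up for illustration:

```go
package kvserver_test

import (
	"errors"
	"testing"

	"github.com/stretchr/testify/require"
)

func TestErrorAssertionStyle(t *testing.T) {
	err := errors.New("node 2 may not be added twice")

	// Asserts only that some error occurred.
	require.Error(t, err)

	// Also asserts *which* error occurred: the pattern is a regexp
	// matched against the error's string form, so a changed or wrong
	// error fails the test instead of silently passing.
	require.Regexp(t, "may not be added twice", err)
}
```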
pkg/kv/kvserver/replica_command_test.go, line 196 at r1 (raw file):
require.NoError(t, err)
// Test Case 14: We are rebalancing within a node and do a remove
nit: punctuation.
pkg/roachpb/metadata.go, line 649 at r1 (raw file):
// StoreDescriptorFilter defines an interfaces that filters out
// StoreDescriptors.
Could you mention what the return value means? Filter out or passes the filter?
pkg/roachpb/metadata.go, line 666 at r1 (raw file):
}
// NoOpStoreDescriptorFilter is a dummy implementation of StoreDescriptorFilter
It's not really a dummy because we use it in non-testing code. Also, "NoOp" is pretty ambiguous. How about AllStoreDescriptorFilter or something that makes it clear that all stores pass the filter instead of no stores.
Also, missing punctuation in this comment and the next.
I'm also pretty excited about this. There are two big contributions here, the first one obviously being the feature you're unlocking, and it looks like that might be mostly done. However, there's a second one, which I think can still improve, and that is encoding your newly achieved understanding for future generations. It seems that there is something going on with the per-store means and constraints not fitting together well and leading to poor rebalancing decisions, and neither Nathan nor I could figure it out from looking at your changes (and I even had you try to explain it to me before)! This tells me that a lot will be won by doubling down on this aspect in particular.
Reviewed 6 of 6 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, and @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 692 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you add a comment here? Something that just ensures that the reader understands the double-negation that's going on here. I found it tricky to navigate the logic around a filter that filters out replicas that can not, not be part of the resulting range.
A comment that said "// We're rebalancing between stores on the same node, so only consider other stores on the same node as candidates to be removed." would have helped a lot.
+1
pkg/kv/kvserver/allocator.go, line 704 at r1 (raw file):
)
if err != nil {
	log.Warningf(ctx, "simulating RemoveTarget failed: %+v", err)
Not your code, but this is a questionable use of Warningf. If this is important, it should propagate an error up. Otherwise, don't log (or log behind V(2) or something like that). It's always hard to predict how often and when these low-level code paths are hit, so they should not do their own logging.
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why do we now need this less call? Why wasn't it needed before? Did simulateRemoveTarget previously return an err if there were no better candidates? Does it no longer?
Is this just for determinism? I'm also confused here; a comment would help.
pkg/kv/kvserver/allocator.go, line 1373 at r1 (raw file):
) (*candidate, string, error) {
	if len(candidates) == 0 {
		return nil, "", errors.Errorf("must supply at least one candidate replica to allocator.RemoveTarget()")
nit: doRemoveTarget
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why is this? Isn't the balance score a function of a candidate's StoreCapacity vs. the mean StoreCapacity? If the score is not a function of any per-node state then I don't see why something like this would be needed.
Does this relate to your comment about "An additional set of changes were necessary in the allocator heuristics to better detect when the stores on a single node are balanced and stop attempting to move ranges around." in the commit message?
Alex tried to explain this to me in-"person" but I didn't understand it then and don't understand it now despite having stared a bit. The only thing I understand is that the situation isn't fully symmetric. If you have three nodes and the first node has two stores (and the others one), then each node has to have exactly one replica of each range, which means twin stores have half the number of replicas compared to the non-twin stores. For example, we'd get n1(s1:20 s2:20) n2(s3:40) n3(s4:40) when fully balanced. This seems to pretty fundamentally mess with the assumption that it is a good idea to reduce the divergence from the mean for all stores alike. What I don't understand is how this particular code factors into that (if it does at all), and if so, whether it's a band-aid for a particular problem or a holistic "fix".
My best stab at understanding this is that what's new here is that for the first time we're seeing stores with incomparable means in replica movement. Usually (or so I assume), because all replicas considered in a queue action satisfy the constraints, the replica counts of the involved stores are somehow "comparable" in the sense that it would be a reasonable goal to roughly even them out. But for intra-node rebalances, this isn't true, as the example above shows.
pkg/kv/kvserver/allocator_scorer.go, line 480 at r1 (raw file):
existingNodeLocalities map[roachpb.NodeID]roachpb.Locality,
options scorerOptions,
storeFilter roachpb.StoreDescriptorFilter,
Maybe "canRemoveFilter"?
pkg/kv/kvserver/allocator_test.go, line 648 at r1 (raw file):
// We start out with 40 ranges on 3 nodes and 3 stores, we then add a new store
// on Node 1 and try to rebalance all the ranges. What we want to see happen
// is an equilibrium where 20 ranges move from Stopre 1 to Store 2.
Stopre
pkg/kv/kvserver/allocator_test.go, line 717 at r1 (raw file):
}
// Verify that the stores are reasonably balanced
.
pkg/kv/kvserver/allocator_test.go, line 721 at r1 (raw file):
store1.Capacity.RangeCount-store2.Capacity.RangeCount)) <= minRangeRebalanceThreshold)
// We dont expect any range wanting to move since the system should have
// reached a stable state at this point
.
pkg/kv/kvserver/allocator_test.go, line 2897 at r1 (raw file):
a.storePool.getLocalities(existingRepls),
a.scorerOptions(),
roachpb.NoOpStoreDescriptorFilter{},
Does that thing need to live in roachpb? Isn't the only use in kvserver, without a need to export?
pkg/kv/kvserver/replica_command.go, line 1137 at r1 (raw file):
// The only operation that is allowed within a node is an Add/Remove
for _, prevChg := range byStoreID {
	if prevChg.ChangeType == chg.ChangeType {
also:
if prevChg.ChangeType != roachpb.ADD_REPLICA {
	return fmt.Errorf("can only add-remove a replica within a node, but got %+v", chgs)
}
since we never issue remove-add (and I'd want to know if we did accidentally).
pkg/kv/kvserver/replica_command.go, line 1150 at r1 (raw file):
}
// Then, check that we're not adding a second replica on nodes that already
We do sometimes add a second replica on a node now.
pkg/kv/kvserver/replica_command.go, line 1163 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
When would !k?
I'm confused by this code; it seems reasonable that !k. For example, if you have a descriptor for stores 1-5 but s5 is completely uninvolved in the change (but lives on a node that is involved), we would still loop over its rDesc and hit this branch.
Also, I don't see why we can check for != REMOVE_REPLICA here; after all, the order here is not stable (or if it is stable, not in a way that we want to rely upon), no?
I'll look at this piece of code more closely in the next round of review.
pkg/kv/kvserver/replica_command.go, line 1165 at r1 (raw file):
if !k || chg.ChangeType != roachpb.REMOVE_REPLICA {
	return errors.Errorf(
		"Expected replica to be removed from %v during a lateral rebalance within the node.", rDesc)
include the map.
pkg/kv/kvserver/replica_command_test.go, line 197 at r1 (raw file):
// Test Case 14: We are rebalancing within a node and do a remove
descRelancing := &roachpb.RangeDescriptor{
Rebalancing
Also, a test for the "uninvolved store" case above would be good, where you have an extra store on n1 that isn't involved in the lateral rebalance. I think this would tickle a bug right now.
pkg/roachpb/metadata.go, line 648 at r1 (raw file):
}
// StoreDescriptorFilter defines an interfaces that filters out
interface
Also, whatever can live closer to the allocator should move there. roachpb is already a grab bag's grab bag; we were way too unprincipled about putting things here in the early days.
By the way, you don't need an interface here, you can just
type StoreDescriptorPredicate func(StoreDescriptor) bool
This also has the advantage that you're not forced to make types just to have a receiver. For example you can just write
func allowAllPredicate(StoreDescriptor) bool { return true }
or
var nodeID roachpb.NodeID = 7
doFilter(func(storeDesc roachpb.StoreDescriptor) bool { return storeDesc.Node.NodeID == nodeID })
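A self-contained sketch of that suggestion; the stand-in types and the filterStores helper are illustrative, not from the codebase:

```go
package main

import "fmt"

// Stand-ins for the real roachpb types, just to make the sketch runnable.
type NodeID int32

type StoreDescriptor struct {
	StoreID int32
	NodeID  NodeID
}

// StoreDescriptorPredicate reports whether a store passes the filter.
type StoreDescriptorPredicate func(StoreDescriptor) bool

// allowAll passes every store, replacing a no-op filter struct.
func allowAll(StoreDescriptor) bool { return true }

// sameNode returns a predicate that passes only stores on the given node.
func sameNode(nodeID NodeID) StoreDescriptorPredicate {
	return func(s StoreDescriptor) bool { return s.NodeID == nodeID }
}

// filterStores keeps the stores for which the predicate returns true.
func filterStores(stores []StoreDescriptor, pred StoreDescriptorPredicate) []StoreDescriptor {
	out := make([]StoreDescriptor, 0, len(stores))
	for _, s := range stores {
		if pred(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	stores := []StoreDescriptor{{StoreID: 1, NodeID: 1}, {StoreID: 2, NodeID: 1}, {StoreID: 3, NodeID: 2}}
	fmt.Println(len(filterStores(stores, allowAll)))    // 3
	fmt.Println(len(filterStores(stores, sameNode(1)))) // 2
}
```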
pkg/roachpb/metadata.go, line 649 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you mention what the return value means? Filter out or passes the filter?
+1
I would rename Filter. Consider Accept or Reject, if either of those makes more sense.
pkg/roachpb/metadata.go, line 660 at r1 (raw file):
Rework this comment once the interface has been reworked. Something like
// Filter accepts the descriptors whose NodeID matches that of the argument.
could be the result.
Force-pushed from 18b278a to b13924a
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator.go, line 565 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
When do we expect to hit this? Is that even possible now? Won't doRemoveTarget itself return an error in this case?
It should never happen, since you will always return out of the for loop, but the code needs a return statement, so that is what was there in the previous version. Not sure if there is anything better to return. The reason for the for loop is that this code returns a ReplicaDescriptor instead of a *candidate.
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could you leave a comment about why we use existingPlusOneNew instead of replicaCandidates? Something about not wanting to add replicas on two stores on the same node even in the case where the existing replica is removable. And then mention what happens in that case (we return false?). Also, can this just use existingReplicas and then avoid the StoreID check entirely?
Switched to existingReplicas
pkg/kv/kvserver/allocator.go, line 692 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
+1
Done, thanks for suggesting a comment.
pkg/kv/kvserver/allocator.go, line 704 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not your code, but this is a questionable use of Warningf. If this is important, it should propagate an error up. Otherwise, don't log (or log behind V(2) or something like that). It's always hard to predict how often and when these low-level code paths are hit, so they should not do their own logging.
Switched to VEventf, including the one other Warningf below.
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Is this just for determinism? I'm also confused here, a comment would help.
It has to do with the fact that the scores for stores on the same node are very close together, so sometimes the algorithm would pick the slightly worse one here, which would then create more movement than you wanted. This is here as an extra guardrail; it would only manifest in a same-node scenario, but since it is a generally applicable guardrail I did not precondition it on isSameNodeRebalance.
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Alex tried to explain this to me in-"person" but I didn't understand it then and don't understand it now despite having stared a bit. The only thing I understand is that the situation isn't fully symmetric. If you have three nodes and the first node has two stores (and the others one), then each node has to have exactly one replica of each range, which means twin stores have half the number of replicas compared to the non-twin stores. For example, we'd get n1(s1:20 s2:20) n2(s3:40) n3(s4:40) when fully balanced. This seems to pretty fundamentally mess with the assumption that it is a good idea to reduce the divergence from the mean for all stores alike. What I don't understand is how this particular code factors into that (if it does at all), and if so, whether it's a band-aid for a particular problem or a holistic "fix".
My best stab at understanding this is that what's new here is that for the first time we're seeing stores with incomparable means in replica movement. Usually (or so I assume), because all replicas considered in a queue action satisfy the constraints, the replica counts of the involved stores are somehow "comparable" in the sense that it would be a reasonable goal to roughly even them out. But for intra-node rebalances, this isn't true, as the example above shows.
@tbg that is exactly the reason. In the case you describe the mean is (20+20+40+40)/4 = 30; because of that, s1 and s2 always look desirable and you want to move ranges there. If you look at how we use minRangeRebalanceThreshold today, it's used in the calculation of the balanceScore, so since the balanceScore is no longer useful for these replicas we need another mechanism for equating them. The reason for a threshold is that the leaseholders are on different nodes, so each leaseholder is making its own decisions about moving things; if you just aim for range equality it's not going to work, since the information about replicas is not instantly transmitted, so you have to add this threshold factor to dampen the movement. It definitely feels a bit dirty, but I am not sure how else to prevent this problem.
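To make the dampening concrete, here is a rough sketch of the kind of banded comparison a rebalance threshold produces; the exact formula in the allocator may differ, and the 5% threshold is an assumed value:

```go
package main

import "fmt"

// classify applies a hypothetical threshold band around the mean: a store
// only counts as overfull/underfull once it diverges by more than the
// threshold, which dampens movement between near-equal stores.
func classify(rangeCount, mean, threshold float64) string {
	switch {
	case rangeCount > mean*(1+threshold):
		return "overfull"
	case rangeCount < mean*(1-threshold):
		return "underfull"
	default:
		return "balanced"
	}
}

func main() {
	// Twin-store example from this thread: counts 20, 20, 40, 40; mean 30.
	mean := (20.0 + 20.0 + 40.0 + 40.0) / 4
	for _, c := range []float64{20, 40} {
		fmt.Printf("count=%v -> %s\n", c, classify(c, mean, 0.05))
	}
	// Both 20 and 40 fall outside the band [28.5, 31.5], so a per-store
	// mean keeps suggesting movement even though the placement is already
	// optimal, which is the problem described above.
}
```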
pkg/kv/kvserver/allocator_scorer.go, line 475 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mention in the comment above that only stores matching the filter will be considered as removal candidates.
Ack
pkg/kv/kvserver/allocator_test.go, line 694 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I don't think this is doing what we want. It's creating a slice of length N and then appending after those N zero-valued elements. I think you want newReplicas := make([]roachpb.ReplicaDescriptor, 0, len(ranges[i].InternalReplicas))
ack
pkg/kv/kvserver/allocator_test.go, line 2897 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does that thing need to live in roachpb? Isn't the only use in kvserver, without a need to export?
I can move it to the allocator, since it's the only place it's used.
pkg/kv/kvserver/replica_command.go, line 1137 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
also:
if prevChg.ChangeType != roachpb.ADD_REPLICA {
	return fmt.Errorf("can only add-remove a replica within a node, but got %+v", chgs)
}
since we never issue remove-add (and I'd want to know if we did accidentally).
ack
pkg/kv/kvserver/replica_command.go, line 1150 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
We do sometimes add a second replica on a node now.
This is existing code, so when would this be possible?
pkg/kv/kvserver/replica_command.go, line 1163 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I'm confused by this code; it seems reasonable that !k. For example, if you have a descriptor for stores 1-5 but s5 is completely uninvolved in the change (but lives on a node that is involved), we would still loop over its rDesc and hit this branch. Also, I don't see why we can check for != REMOVE_REPLICA here; after all, the order here is not stable (or if it is stable, not in a way that we want to rely upon), no? I'll look at this piece of code more closely in the next round of review.
This happens during the second cycle of the rebalance: the descriptor is (n1,s1) (n1,s2,LEARNER) (n2,s3) (n3,s4). Once s2 is caught up, the replicate queue will remove (n1,s1), so that needs to be allowed.
pkg/kv/kvserver/replica_command_test.go, line 123 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
We generally prefer require.Regexp so we can make an assertion on the error itself instead of just asserting that there was an error. The mechanism has changed over time (it used to be testutils.IsError), but the general principle has caught multiple bugs over the course of the project that would have otherwise allowed tests to continue silently passing even when broken.
Good call
pkg/roachpb/metadata.go, line 648 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
interface
Also, whatever can live closer to the allocator should move there. roachpb is already a grab bag's grab bag; we were way too unprincipled about putting things here in the early days.
By the way, you don't need an interface here, you can just
type StoreDescriptorPredicate func(StoreDescriptor) bool
This also has the advantage that you're not forced to make types just to have a receiver. For example you can just write
func allowAllPredicate(StoreDescriptor) bool { return true }
or
var nodeID roachpb.NodeID = 7
doFilter(func(storeDesc roachpb.StoreDescriptor) bool { return storeDesc.Node.NodeID == nodeID })
TIL
pkg/roachpb/metadata.go, line 660 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Rework this comment once the interface has been reworked. Something like
// Filter accepts the descriptors whose NodeID matches that of the argument.
could be the result.
Done.
@tbg @nvanbenschoten went through and addressed feedback and questions.
Force-pushed from b13924a to 4e1be18
Fix or add comments about the issue with the replication factor of 1.
require.Regexp(t, "refer to n4 twice for change ADD_REPLICA", err)

// Test Case 11: Try to remove twice to the same node.
err = validateReplicationChanges(desc, roachpb.ReplicationChanges{
A comment here that explains why we are trying to prevent this case would help. I would imagine that if somehow a node ends up with two replicas (each in its own store) and the allocator is suggesting removal of the two, then we should allow it, as it is a transition from a bad to a good state.
pkg/kv/kvserver/allocator_scorer.go (outdated):
// For same node rebalances we need to manually consider the
// minRangeRebalanceThreshold, since in lopsided clusters their under/over
// replication is not properly captured in the balanceScore.
if c.store.Node.NodeID == o.store.Node.NodeID && c.store.Node.NodeID != 0 {
I don't understand this. You shouldn't need it if we are trying to have the same number of ranges per store. If you want the same number of ranges per node, is it not possible then to fix the balanceScore instead? We currently compute (and compare to) the mean number of ranges per store. It can change to be per node: each store would then compare to the mean per node divided by the number of stores on its node.
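A runnable sketch of that idea, with illustrative names; it computes the per-store comparison point as the mean number of ranges per node divided by the number of stores on the store's own node:

```go
package main

import "fmt"

// perStoreTarget returns the suggested comparison point for a store on the
// given node: mean ranges per node, divided by that node's store count.
func perStoreTarget(rangesPerNode, storesPerNode map[int]int, nodeID int) float64 {
	total := 0
	for _, r := range rangesPerNode {
		total += r
	}
	meanPerNode := float64(total) / float64(len(rangesPerNode))
	return meanPerNode / float64(storesPerNode[nodeID])
}

func main() {
	// Thread example: n1 has two stores with 20 ranges each; n2 and n3
	// have one store with 40 ranges each.
	rangesPerNode := map[int]int{1: 40, 2: 40, 3: 40}
	storesPerNode := map[int]int{1: 2, 2: 1, 3: 1}
	fmt.Println(perStoreTarget(rangesPerNode, storesPerNode, 1)) // 20: twin stores already balanced
	fmt.Println(perStoreTarget(rangesPerNode, storesPerNode, 2)) // 40
}
```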
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator.go, line 565 at r1 (raw file):
Previously, lunevalex wrote…
It should never happen, since you will always return out of the for loop, but the code needs a return statement, so that is what was there in the previous version. Not sure if there is anything better to return. The reason for the for loop is that this code returns a ReplicaDescriptor instead of a *candidate.
If that's the case, then we should log.Fatal here so that it's clearly a code path that should never be hit.
pkg/kv/kvserver/allocator.go, line 691 at r1 (raw file):
Previously, lunevalex wrote…
Switched to existingReplicas
Now that you've made that change, should we generalize the helper a little bit? For instance, make it take a NodeID instead of a *candidate and call it containsNode or something like that.
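Something like this minimal sketch of the generalized helper; the name and shape follow the suggestion and are not the actual code:

```go
// containsNode reports whether any of the existing replicas lives on the
// given node.
func containsNode(replicas []roachpb.ReplicaDescriptor, nodeID roachpb.NodeID) bool {
	for _, r := range replicas {
		if r.NodeID == nodeID {
			return true
		}
	}
	return false
}
```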
pkg/kv/kvserver/allocator.go, line 707 at r1 (raw file):
so sometimes the algorithm would pick the slightly worse one here
That's surprising. Why is that the case? Should we fix this in the removal algorithm itself?
pkg/kv/kvserver/allocator.go, line 1381 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
storeFilterNone confused me. I thought it was something you added here, but I see that it's not new. It would be a shame to overload this term. Thinking about this code a bit more, I actually find it kind of strange that we need this storeFilter at all. Why don't we filter from the candidate list right up top, where we determine which filter to use, in the same way that we use simulateFilterUnremovableReplicas?
I'm still wondering about the second part of this question.
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, lunevalex wrote…
@tbg that is exactly the reason. In the case you describe the mean is (20+20+40+40)/4 = 30; because of that, s1 and s2 always look desirable and you want to move ranges there. If you look at how we use minRangeRebalanceThreshold today, it's used in the calculation of the balanceScore, so since the balanceScore is no longer useful for these replicas we need another mechanism for equating them. The reason for a threshold is that the leaseholders are on different nodes, so each leaseholder is making its own decisions about moving things; if you just aim for range equality it's not going to work, since the information about replicas is not instantly transmitted, so you have to add this threshold factor to dampen the movement. It definitely feels a bit dirty, but I am not sure how else to prevent this problem.
I guess I don't see how that's different from any other imbalanced set of localities at a given locality tier. For instance, if we have 2 nodes in us-east, 1 node in us-central, and 1 node in us-west. Presumably, that configuration does still converge to a reasonable balance, right? We've seen customers run in these kinds of lopsided deployments in the past. So what makes this any different?
pkg/kv/kvserver/allocator_scorer.go, line 501 at r2 (raw file):
// If the candidate does not pass the predicate then we cannot remove it,
// as in it must be a replica that remains.
if !canRemoveStorePredicate(s) {
Should we be adding to candidates with valid: false like we do above?
@darinpp @tbg @nvanbenschoten Thank you for the reviews and feedback. Based on @darinpp's comments and an offline discussion we had, I am significantly re-working the allocator changes to better account for these situations by working the same-node store preferences into the balance and diversity score calculations. This will result in a much bigger change in the allocator, but hopefully it will make this solution more scalable. Please stay tuned; I will update the PR once I get to a working solution.
Force-pushed from 4e1be18 to b59fab8
@nvanbenschoten @darinpp @tbg thanks for the earlier reviews. Based on the feedback, I significantly re-worked this PR to use the balance_score and diversity_score to drive the correct behavior in the allocator. I made the changes, and my live tests on roachprod confirm that they produce the correct behavior on a running cluster. I tested a cluster with the following configuration and got perfect splits: (n1, s1), (n2, s2, s3), (n3, s4, s5, s6). I would like some feedback on this new direction for the PR before I invest a lot of time into testing and polish.
This feels so much better than what we had before. Simply replacing the use of nodes with stores makes the rebalancing work for multi-store nodes and simplifies quite a few things.
Force-pushed from 8d38d81 to 2332697
@tbg @nvanbenschoten I addressed the comments left by @darinpp; this should now be in a mergeable state and could use another review.
This looks good, but I don't have a good grasp of all of the details involved. It's a much smaller diff, which is good! I'm not sure you updated the commit message yet; there is probably a lucid, succinct story you can tell now about your solution.
Reviewed 3 of 11 files at r3, 10 of 10 files at r4.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I guess I don't see how that's different from any other imbalanced set of localities at a given locality tier. For instance, if we have 2 nodes in us-east, 1 node in us-central, and 1 node in us-west. Presumably, that configuration does still converge to a reasonable balance, right? We've seen customers run in these kinds of lopsided deployments in the past. So what makes this any different?
This code is now reverted to original, but I'm curious about Nathan's last question. What makes two stores on a node different from a lopsided locality/attribute constraint? And if there's no difference, then how did it ever work, plus did we fix it now?
I did update the commit message; let me massage it a bit more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp, @lunevalex, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/allocator_scorer.go, line 212 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This code is now reverted to original, but I'm curious about Nathan's last question. What makes two stores on a node different from a lopsided locality/attribute constraint? And if there's no difference, then how did it ever work, plus did we fix it now?
This is a good question. In theory it should not make a difference, because the 2 nodes in us-east will always look under-replicated; it should exhibit the same behavior I saw with multiple stores on the same node. I will have to run some tests to answer this question. It is unlikely my changes improved this scenario.
pkg/kv/kvserver/allocator_scorer.go, line 218 at r2 (raw file):
Previously, darinpp wrote…
I don't understand this. You shouldn't need it if we are trying to have the same number of ranges per store. If you want the same number of ranges per node, is it not possible then to fix the balanceScore instead? We currently compute (and compare to) the mean number of ranges per store. It can change to be per node: each store would then compare to the mean per node divided by the number of stores on its node.
Done
Force-pushed from 2332697 to 06ee87e
63390: opt: remove view deps user defined types used by dependent table r=rytaft,mgartner a=the-ericwang35

Fixes #63191.

Previously, when a view depended on a table that had a user defined type column, we would create a dependency between the view and the type, even if the view did not directly reference that column. This is because the table referencing the UDT has an implicit CHECK constraint on the type. For example,

```
CREATE TYPE typ AS ENUM('a', 'b', 'c');
CREATE TABLE t (i INT, j typ);
# Has check constraint (j IN (x'40':::@100053, x'80':::@100053, x'c0':::@100053))
CREATE VIEW v AS (SELECT i FROM t);
# Has a dependency on typ.
```

When we create `v`, it calls `buildScalar` on `t`'s check constraint and then `maybeTrackUserDefinedTypeDepsForViews`, which adds dependencies to the types even if the view does not directly use them. This patch addresses this by not capturing this dependency inside `maybeTrackUserDefinedTypeDepsForViews`. Instead, it will only capture dependencies explicitly used in the view. Then, in `ConstructCreateView`, we will look for column dependencies, and see if any columns are a UDT.

Release note: None

63489: kvserver: remove same node check from TransferLeaseTarget r=lunevalex a=lunevalex

In #51567 we added the ability to rebalance replicas between stores on the same node. This PR reverts the changes introduced in #12565, since we no longer need to special case multiple stores per node.

Release note: None

Co-authored-by: Eric Wang <ericw@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
65584: kvserver: fix bug obstructing range merges on multi-store clusters r=aayushshah15 a=aayushshah15

Previously, we had a safeguard inside `allocateTargetFromList()` (the allocator method used for finding the best target to allocate a new replica) that ensured that it would never return a target store on the same node as one of the existing replicas. This check was mostly well-intentioned, but it became outdated after we started allowing rebalances between stores on the same node in #51567.

The aforementioned check is no longer correct since:
1. Callers of `AdminRelocateRange()` (currently, the merge queue and the `StoreRebalancer`) must have the ability to move replicas laterally.
2. We have a more precise check inside `AdminChangeReplicas` that guards against adding 2 replicas of a range on the same node. This check precisely allows the cases where we're rebalancing laterally.

As a result of this check, clusters that use multiple stores per node were more likely to have their range merges fail, because the merge queue would fail in its attempt to collocate ranges that required lateral movement of replicas across stores. @adityamaru noticed error messages of the following nature flooding the logs on a cluster while debugging an incident:

```
none of the remaining voters [n1,s2] are legal additions to (n5,s18):1,(n9,s33):2,(n1,s1):3
```

This patch fixes this behavior by allowing `AdminRelocateRange()` to disable this check, because it may often find itself trying to execute relocations that require moving replicas laterally within a given node. Note that we do not allow _upreplication_ logic to disable this check, since we do not want to upreplicate across stores on the same node.

Release note (bug fix): A bug that made it less likely for range merges to succeed on clusters using multiple stores per node is now fixed.

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
73843: kvcoord: simplify txn already committed assertion r=nvanbenschoten a=tbg

- kvcoord: improve TestCommitSanityCheckAssertionFiresOnUndetectedAmbiguousCommit
- kvcoord: simplify txn already committed assertion

73951: kvserver: improve definition of equivalence classes r=aayushshah15 a=aayushshah15

This commit makes an effort to make `equivalenceClass`es more well-defined. Previously, we would coalesce the equivalence classes for any two existing replicas if they shared the exact same locality hierarchy (including the node ids). This logic became ineffective with #51567 since we disallow multiple replicas on a single node under all circumstances. Furthermore, coalescing the equivalence classes for existing replicas did not buy us anything and instead required a bunch of custom code for us to correctly deal with them.

This commit takes an opinionated approach and gets rid of the logic that coalesces two existing replicas' equivalence classes together.

Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Closes #6782
This change modifies the replica_queue to allow rebalances between multiple stores within a single node.
This is possible because we previously introduced atomic rebalances in #12768.
The first step was to remove the constraints in the allocator that prevented same-node rebalances and to update the validation in the replica_queue to accept these rebalance proposals. There is one caveat: with 1x replication an atomic rebalance is not possible, so under this condition we now support adding multiple replicas of the range to the same node.
With the constraints removed, there would be nothing in the allocator to prevent it from placing multiple replicas of a range on the same node across multiple stores. This is not desired, because in a simple 3x replication scenario a single node crash could make a whole range unavailable. The allocator uses locality tags to model failure domains, but a node was not considered a locality. It is thus natural to extend the failure-domain definition to the node and model it as a locality tier. Now stores on the same node are factored into the diversity_score and repel each other, just like nodes in the same datacenter do in a multi-region setup.
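For illustration, a toy version of treating the node as the innermost locality tier; the tier keys and the diversity function are simplified stand-ins for the allocator's actual scoring:

```go
package main

import "fmt"

// Tier is a simplified stand-in for a locality tier (a key=value pair).
type Tier struct{ Key, Value string }

// diversityScore is a toy diversity measure: the fraction of tiers
// (outermost first) at which two localities diverge. 1 means fully
// diverse, 0 means identical.
func diversityScore(a, b []Tier) float64 {
	n := len(a)
	for i := 0; i < n && i < len(b); i++ {
		if a[i] != b[i] {
			return float64(n-i) / float64(n)
		}
	}
	return 0
}

func main() {
	// Appending a synthetic "node" tier makes two stores on the same node
	// score as minimally diverse, so they repel each other just like nodes
	// sharing a datacenter do.
	s1 := []Tier{{"region", "us-east"}, {"node", "n1"}}
	s2 := []Tier{{"region", "us-east"}, {"node", "n1"}} // same node as s1
	s3 := []Tier{{"region", "us-east"}, {"node", "n2"}} // different node

	fmt.Println(diversityScore(s1, s2)) // 0: same node, least diverse
	fmt.Println(diversityScore(s1, s3)) // 0.5: different node, same region
}
```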
Release note (performance improvement): This change removes the last roadblock to running CockroachDB with multiple stores (i.e. disks) per node. The allocation algorithm now supports intra-node rebalances, which means CRDB can fully utilize the additional stores on the same node.