kvserver: sensitize AdminScatter to force replica movement #75894
Conversation
Force-pushed from a4b6fc0 to d800526.
I'm running two clusters side by side, one based on … The results are quite encouraging. These are the graphs of … cc @nvanbenschoten, @dt, @kvoli
Force-pushed from d0ed5f3 to d8b0034.
What explains the spiky vs. smooth number of AdminScatter requests between the two?
Force-pushed from d8b0034 to 11ff415.
Force-pushed from 11ff415 to ac27cda.
Force-pushed from 17a933e to 5dec7fd.
Reviewed 6 of 6 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):
perturbedStoreDescs := make([]roachpb.StoreDescriptor, 0, len(sl.stores))
for _, store := range sl.stores {
	jitter := float64(store.Capacity.RangeCount) * o.rangeRebalanceThreshold * (0.25 + allocRand.Float64())

Why the `+ 0.25`?
pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):
perturbedStoreDescs := make([]roachpb.StoreDescriptor, 0, len(sl.stores))
for _, store := range sl.stores {
	jitter := float64(store.Capacity.RangeCount) * o.rangeRebalanceThreshold * (0.25 + allocRand.Float64())
Could we add a comment here explaining why we're jittering by a function of the range rebalance threshold?
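For a rough sense of scale (hypothetical numbers, not from the PR): with `store.Capacity.RangeCount = 500` and the default `rangeRebalanceThreshold = 0.05`, the jitter term is `500 * 0.05 * (0.25 + rand)`, i.e. somewhere in roughly [6.25, 31.25), so each store's range count gets perturbed by up to a few percent of its total.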
pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):
	perturbedStoreDescs = append(perturbedStoreDescs, store)
}
o.rangeRebalanceThreshold = 0

Also, call this out in a comment. It's not entirely clear to me why this is needed if we already jittered. Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?
All else being equal, it would be best to keep the options struct immutable. This would also mean we wouldn't need the first commit, though implementing the interface on a pointer seems fine either way.
pkg/kv/kvserver/allocator_test.go, line 7285 at r2 (raw file):
	rangeRebalanceThreshold: 0.05,
},
true, /* scatter */

Should the test call this function twice with different values for `scatter` to prove that the allocator normally would not make the move?
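A minimal self-contained sketch of that call-it-twice shape (the `rebalanceTarget` helper below is a hypothetical stand-in for the allocator call under test, not the real API):

package scatter

import (
	"fmt"
	"testing"
)

// rebalanceTarget is a hypothetical stand-in for the allocator call under
// test. In the real test it would ask the allocator for a rebalance target
// with the scatter-flavored scorer options; here it just mimics the expected
// behavior so the sketch compiles on its own.
func rebalanceTarget(scatter bool) (storeID int, ok bool) {
	if scatter {
		return 2, true // pretend a target was found when scattering
	}
	return 0, false // no move when the cluster already looks balanced
}

func TestScatterForcesRebalance(t *testing.T) {
	for _, scatter := range []bool{false, true} {
		t.Run(fmt.Sprintf("scatter=%t", scatter), func(t *testing.T) {
			target, ok := rebalanceTarget(scatter)
			if scatter && !ok {
				t.Fatal("expected a rebalance target when scattering")
			}
			if !scatter && ok {
				t.Fatalf("expected no rebalance target without scatter, got s%d", target)
			}
		})
	}
}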
pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):
Just to check my understanding: If we keep the padding provided by `kv.allocator.range_rebalance_threshold`, the only candidates we'll ever push towards the "underfull" side will be the ones that already have fewer-than-mean replicas. Furthermore, the likelihood of these candidates being pushed out will be determined by how far away from the mean they already are.

I think this all makes it a bit more unlikely to scatter but quite a bit harder to explain / understand. I personally feel that we should have this thing be more on the aggressive side since:
- Internally, it's only ever called on empty ranges. So there's no real cost to the aggressiveness.
- If called through the SQL `SCATTER` command, IMO the aggressiveness is desirable and more congruent with what a caller would expect.

What do you think?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?
Just to check my understanding: If we keep the padding provided by `kv.allocator.range_rebalance_threshold`, the only candidates we'll ever push towards the "underfull" side will be the ones that already have fewer-than-mean replicas. Furthermore, the likelihood of these candidates being pushed out will be determined by how far away from the mean they already are.

I think this all makes it a bit more unlikely to scatter but quite a bit harder to explain / understand. I personally feel that we should have this thing be more on the aggressive side since:
- Internally, it's only ever called on empty ranges. So there's no real cost to the aggressiveness.
- If called through the SQL `SCATTER` command, IMO the aggressiveness is desirable and more congruent with what a caller would expect.

What do you think?
I see what you're saying here. I think it's fine if we set this threshold to 0 then, but it feels wrong to do so here. It actually feels a little strange that the allocator knows anything about the `scatter` flag. Did you consider a scheme where you teach the `rangeCountScorerOptions` about scatter and then pass that in to the allocator from `(replicateQueue).considerRebalance`? Or implement an entirely new `scatterScorerOptions`?
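For illustration, a self-contained sketch of the `scatterScorerOptions` idea; the `scorerOptions` interface shape and the method names below are assumptions made for the sketch, not the actual cockroach code:

package main

import (
	"fmt"
	"math/rand"
)

// scorerOptions is an assumed interface through which the allocator asks how
// aggressively to rebalance and how to read a candidate's range count.
type scorerOptions interface {
	rebalanceThreshold() float64
	perturbedRangeCount(rangeCount int, rng *rand.Rand) float64
}

// rangeCountScorerOptions models the default behavior: honor the configured
// threshold and use the observed range count as-is.
type rangeCountScorerOptions struct {
	rangeRebalanceThreshold float64
}

func (o rangeCountScorerOptions) rebalanceThreshold() float64 {
	return o.rangeRebalanceThreshold
}

func (o rangeCountScorerOptions) perturbedRangeCount(rc int, _ *rand.Rand) float64 {
	return float64(rc)
}

// scatterScorerOptions ignores the threshold entirely and jitters each
// candidate's range count by up to a `jitter` fraction in either direction,
// so scatter-driven rebalances are mostly randomized.
type scatterScorerOptions struct {
	rangeCountScorerOptions
	jitter float64
}

func (o scatterScorerOptions) rebalanceThreshold() float64 { return 0 }

func (o scatterScorerOptions) perturbedRangeCount(rc int, rng *rand.Rand) float64 {
	j := float64(rc) * o.jitter * rng.Float64()
	if rng.Intn(2) == 0 {
		j = -j
	}
	return float64(rc) + j
}

func main() {
	rng := rand.New(rand.NewSource(1))
	var opts scorerOptions = scatterScorerOptions{
		rangeCountScorerOptions: rangeCountScorerOptions{rangeRebalanceThreshold: 0.05},
		jitter:                  0.05,
	}
	// Identical stores now look randomly over- or under-full to the scorer.
	for i := 0; i < 3; i++ {
		fmt.Printf("store %d: range count 100 -> perturbed %.2f\n", i+1, opts.perturbedRangeCount(100, rng))
	}
}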
Force-pushed from 5dec7fd to 837ca41.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Why the `+ 0.25`?
No real reason, removed.
pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Could we add a comment here explaining why we're jittering by a function of the range rebalance threshold?
Added before the `jitter` attribute is set in `scatterScorerOptions`.
pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):
implement an entirely new scatterScorerOptions
This seems much nicer, how do you feel about the latest revision?
pkg/kv/kvserver/allocator_test.go, line 7285 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should the test call this function twice with different values for `scatter` to prove that the allocator normally would not make the move?
Done.
Force-pushed from 837ca41 to 4a459d7.
Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator.go, line 914 at r3 (raw file):
}

func (a Allocator) storeListForCandidates(candidates []roachpb.ReplicationTarget) StoreList {

It feels like we're running circles around a progression like `[]ReplicationTarget` -> `StoreList` -> `[]ReplicaDescriptor` -> `StoreList`. Is this all principled? Does the typing make sense at each level?
pkg/kv/kvserver/allocator.go, line 1291 at r3 (raw file):
}

func (a *Allocator) scorerOptionsForScatter() *scatterScorerOptions {

nit: move this below the main `scorerOptions` method.
pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
implement an entirely new scatterScorerOptions
This seems much nicer, how do you feel about the latest revision?
Yeah, this is much nicer.
pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):
}

func jittered(val float64, jitter float64, rand allocatorRand) float64 {

Do we need to lock and unlock `rand`? That feels pretty fragile, and yet it's what we do elsewhere. Maybe we should clean that up and push the locking inside of an API that mirrors (but does not directly expose) `rand.Rand`.

Or better yet (in a separate commit), let's remove `allocatorRand`, replace it with a raw `*rand.Rand`, and make it thread-safe by wrapping its `rand.Source` in a:

type lockedSource struct {
	m   sync.Mutex
	src rand.Source64
}

// TODO: implement rand.Source64.

This is what `math/rand.globalRand` does in the standard library.
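Expanding on that, a self-contained sketch of how the wrapper could implement `rand.Source64` under the mutex (an illustrative sketch of the suggestion, not code from this PR):

package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// lockedSource makes an underlying rand.Source64 safe for concurrent use by
// serializing every call through a mutex, similar to how math/rand guards
// its global source.
type lockedSource struct {
	m   sync.Mutex
	src rand.Source64
}

func (s *lockedSource) Int63() int64 {
	s.m.Lock()
	defer s.m.Unlock()
	return s.src.Int63()
}

func (s *lockedSource) Uint64() uint64 {
	s.m.Lock()
	defer s.m.Unlock()
	return s.src.Uint64()
}

func (s *lockedSource) Seed(seed int64) {
	s.m.Lock()
	defer s.m.Unlock()
	s.src.Seed(seed)
}

func main() {
	// A *rand.Rand backed by the locked source is then safe to share across goroutines.
	rng := rand.New(&lockedSource{src: rand.NewSource(42).(rand.Source64)})
	fmt.Println(rng.Intn(100))
}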
Force-pushed from 4a459d7 to 60b80c4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvserver/allocator.go, line 914 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It feels like we're running circles around a progression like `[]ReplicationTarget` -> `StoreList` -> `[]ReplicaDescriptor` -> `StoreList`. Is this all principled? Does the typing make sense at each level?

It is kind of unfortunate the way we're converting from one type to another in so many of these methods.

One low hanging fruit to improve here was the fact that the `Allocator.Remove{Non}Voter` and the `Allocator.Allocate{Non}Voter` methods were returning a `ReplicaDescriptor` instead of a `ReplicationTarget` (I believe this was just a vestige). Aside from that, I don't see any obvious wins. The issue is that some of these methods have multiple callers, not all of whom have access to the same input types.

We might now be going from `[]ReplicationTarget` -> `StoreList` -> `ReplicationTarget`, but we shouldn't be doing what you mentioned.
pkg/kv/kvserver/allocator.go, line 1291 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: move this below the main `scorerOptions` method.
Done.
pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):
That feels pretty fragile, and yet it's what we do elsewhere
oh, good catch. I'll do the proper locking with `allocatorRand` now and add a TODO to clean this up later (unless you insist we do it in this PR).
Force-pushed from 60b80c4 to 28a3949.
Reviewed 2 of 2 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
That feels pretty fragile, and yet it's what we do elsewhere
oh, good catch. I'll do the proper locking with `allocatorRand` now and add a TODO to clean this up later (unless you insist we do it in this PR).
Seems fine to do in a separate PR.
pkg/kv/kvserver/allocator_scorer.go, line 150 at r4 (raw file):
	result *= -1
}
nit: stray line
pkg/roachpb/metadata_replicas.go, line 472 at r5 (raw file):
}

// EmptyReplicationTarget returns true if `target` is an empty replication

It would be slightly more standard to make this a method called `Empty`. That would also make it easier to use. We have many instances of that.
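A minimal sketch of the suggested shape, with stand-in types mirroring `roachpb.ReplicationTarget`'s NodeID/StoreID fields (illustrative only, not the exact diff):

package main

import "fmt"

type NodeID int32
type StoreID int32

// ReplicationTarget mirrors roachpb.ReplicationTarget for illustration.
type ReplicationTarget struct {
	NodeID  NodeID
	StoreID StoreID
}

// Empty returns true if the target is the zero value, replacing a
// free-standing EmptyReplicationTarget(target) helper.
func (t ReplicationTarget) Empty() bool {
	return t == ReplicationTarget{}
}

func main() {
	var t ReplicationTarget
	fmt.Println(t.Empty())                                        // true
	fmt.Println(ReplicationTarget{NodeID: 1, StoreID: 1}.Empty()) // false
}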
Force-pushed from 28a3949 to d434cdd.
This is to enable the next commit, which adds a new method to this interface that needs to operate over a pointer receiver.

Release note: None
This patch makes it such that `AdminScatter` now triggers a ~mostly random rebalance action. This has always been the contract that most of its callers (especially the `SSTBatcher`) assumed.

Previously, calling `AdminScatter` would simply enqueue that range into the `replicateQueue`. The `replicateQueue` only cares about reconciling replica-count differences between stores in the cluster if there are stores that are more than 5% away from the mean. If all candidate stores were within 5% of the mean, then calling `AdminScatter` wouldn't do anything.

Now, `AdminScatter` still enqueues the range into the `replicateQueue` but with an option to force it to:
1. Ignore the 5% padding provided by `kv.allocator.range_rebalance_threshold`.
2. Add some jitter to the existing replica-counts of all candidate stores.

This means that `AdminScatter` now forces mostly randomized rebalances to stores that are reasonable targets (i.e. we still won't rebalance to stores that are too far above the mean in terms of replica-count, or stores that don't meet the constraints placed on the range, etc).

Release note (performance improvement): IMPORTs and index backfills should now do a better job of spreading their load out over the nodes in the cluster.
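For a concrete (hypothetical) illustration of the old behavior: with the default 5% threshold and a mean of 1,000 replicas per store, every store holding between 950 and 1,050 replicas looked balanced to the `replicateQueue`, so scattering a range whose candidate stores all fell in that band was effectively a no-op.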
Previously they would return `ReplicaDescriptor`s, which was a vestige. These return values were almost immediately getting cast into `ReplicationTarget` in every single case anyway. This makes the return types of these allocator methods a little more consistent.

Release note: None
Force-pushed from d434cdd to d792adf.
TFTR!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)
pkg/kv/kvserver/allocator_scorer.go, line 150 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: stray line
Fixed.
pkg/roachpb/metadata_replicas.go, line 472 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
It would be slightly more standard to make this a method called `Empty`. That would also make it easier to use. We have many instances of that.
Done.
Build failed (retrying...)

Build succeeded
77743: kvserver: bound the number of scatter operations per range r=aayushshah15 a=aayushshah15

After #75894, on every `(replicateQueue).processOneChange` call with the `scatter` option set, stores in the cluster are essentially randomly categorized as "overfull" or "underfull" (replicas on overfull stores are then rebalanced to the underfull ones). Since each existing replica will be on a store categorized as either underfull or overfull, it can be expected to be rebalanced with a probability of `0.5`. This means that, roughly speaking, for `N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into the `replicateQueue` until it hit an iteration where none of the range's replicas could be rebalanced. This patch improves this behavior by ensuring that ranges are only enqueued for scattering up to a pre-defined maximum number of times.

Release justification: low risk improvement to current functionality

Release note: None

Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
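To put rough numbers on that (an editorial example, not from the PR text): for a 3x-replicated range, the chance that at least one replica moves in a single pass is 1 - 0.5^3 ≈ 0.88, so the previous requeue-until-nothing-moves loop would run for roughly 1/0.5^3 = 8 passes per range in expectation.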
This patch makes it such that `AdminScatter` now triggers a ~mostly random rebalance action. This has always been the contract that the `SSTBatcher` assumed.

Previously, calling `AdminScatter` would simply enqueue that range into the `replicateQueue`. The `replicateQueue` only cares about reconciling replica-count differences between stores in the cluster if there are stores that are more than 5% away from the mean. If all candidate stores were within 5% of the mean, then calling `AdminScatter` wouldn't do anything.

Now, `AdminScatter` still enqueues the range into the `replicateQueue`, but with an option to force it to:
1. Ignore the 5% padding provided by `kv.allocator.range_rebalance_threshold`.
2. Add some jitter to the existing replica counts of all candidate stores.

This means that `AdminScatter` now forces mostly randomized rebalances to stores that are reasonable targets (i.e. we still won't rebalance to stores that are too far above the mean in terms of replica count, or stores that don't meet the constraints placed on the range, etc).
Fixes #74542
Release note (performance improvement): IMPORTs and index backfills
should now do a better job of spreading their load out over the nodes in
the cluster.