
kvserver: sensitize AdminScatter to force replica movement #75894

Merged: 3 commits into cockroachdb:master on Feb 11, 2022

Conversation

aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Feb 2, 2022

This patch makes AdminScatter trigger a mostly random rebalance action,
which has always been the contract that the SSTBatcher assumed.

Previously, calling AdminScatter would simply enqueue that range into the
replicateQueue. The replicateQueue only cares about reconciling
replica-count differences between stores in the cluster if there are stores
that are more than 5% away from the mean. If all candidate stores were within
5% of the mean, then calling AdminScatter wouldn't do anything.

Now, AdminScatter still enqueues the range into the replicateQueue but with
an option to force it to:

  1. Ignore the 5% padding provided by kv.allocator.range_rebalance_threshold.
  2. Add some jitter to the existing replica-counts of all candidate stores.

This means that AdminScatter now forces mostly randomized rebalances to
stores that are reasonable targets (i.e. we still won't rebalance to stores
that are too far above the mean in terms of replica-count, or stores that don't
meet the constraints placed on the range, etc).

Fixes #74542

Release note (performance improvement): IMPORTs and index backfills
should now do a better job of spreading their load out over the nodes in
the cluster.

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch 3 times, most recently from a4b6fc0 to d800526 on February 3, 2022 00:17
@aayushshah15
Contributor Author

aayushshah15 commented Feb 3, 2022

I'm running two clusters side by side, one based on master (with a 48MB SplitAndScatter threshold in the SSTBatcher) and one based on this PR (with the same 48MB threshold). To stress things, I am running with kv.bulk_io_write.concurrent_addsstable_requests=10. My repro steps are as follows:

export GCE_PROJECT=andrei-jepsen
USER=aayushs
SUFFIX=addsst-master-48mb
export CLUSTER=$USER-$SUFFIX
rp create $CLUSTER --nodes 15 --gce-machine-type=n1-standard-32

rm -rf ./cockroach-linux-2.6.32-gnu-amd64
./build/builder.sh mkrelease amd64-linux-gnu
rp put $CLUSTER ./cockroach-linux-2.6.32-gnu-amd64 cockroach

rp start $CLUSTER:1-13 --racks=4 --args='--vmodule=allocator=3,sst_batcher=3';
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate='2GiB'";
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='2GiB'";

# Crank up the SST ingestion concurrency to make the load-imbalance more prominent
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests=10";

rp ssh $CLUSTER:15 -- './cockroach workload fixtures import tpcc --warehouses=30000 {pgurl:1}';

The results are quite encouraging.

These are the graphs of AdminScatter requests against the number of rebalancing operations by the replicateQueue (these include the rebalances that happened due to scatter). The yellow line is the number of rebalances.

On master
[screenshot: master_48mb_scatter_vs_rebalances]

With this patch
[screenshot: with_fix_48mb_scatter_vs_rebalances]

cc @nvanbenschoten, @dt, @kvoli

@aayushshah15
Contributor Author

These are the corresponding read-amplification charts. This is also very encouraging.

master
[screenshot: master_48mb_read_amp]

With this patch
[screenshot: with_fix_48mb_read_amp]

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch 2 times, most recently from d0ed5f3 to d8b0034 on February 3, 2022 02:07
@dt
Member

dt commented Feb 3, 2022

what explains the spike vs smooth number of AdminScatter requests between the two?

@aayushshah15
Contributor Author

aayushshah15 commented Feb 3, 2022

what explains the spike vs smooth number of AdminScatter requests between the two?

I was wondering the same thing, but notice that those valleys in the rate of AdminScatter requests seem to line up very well with the spikes in read-amplification on that one node. My guess is that that node slows the import down a bit.

It looks like we have a similar pattern in the rate of AddSSTableRequests.
[screenshot: AddSSTable request rate]

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from d8b0034 to 11ff415 on February 3, 2022 02:52
@aayushshah15 marked this pull request as ready for review on February 3, 2022 02:53
@aayushshah15 requested a review from a team as a code owner on February 3, 2022 02:53
@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 11ff415 to ac27cda on February 3, 2022 03:08
@dt
Member

dt commented Feb 3, 2022

if you still have that cluster, can you plot addsstable.delay.total vs the scatter reqs?

just realized the IP was in the screenshots, so here it is:
[screenshot: addsstable.delay.total vs scatter requests]

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch 2 times, most recently from 17a933e to 5dec7fd on February 3, 2022 14:31
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 6 of 6 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

	perturbedStoreDescs := make([]roachpb.StoreDescriptor, 0, len(sl.stores))
	for _, store := range sl.stores {
		jitter := float64(store.Capacity.RangeCount) * o.rangeRebalanceThreshold * (0.25 + allocRand.Float64())

Why the + 0.25?


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

	perturbedStoreDescs := make([]roachpb.StoreDescriptor, 0, len(sl.stores))
	for _, store := range sl.stores {
		jitter := float64(store.Capacity.RangeCount) * o.rangeRebalanceThreshold * (0.25 + allocRand.Float64())

Could we add a comment here explaining why we're jittering by a function of the range rebalance threshold?


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

		perturbedStoreDescs = append(perturbedStoreDescs, store)
	}
	o.rangeRebalanceThreshold = 0

Also, call this out in a comment. It's not entirely clear to me why this is needed if we already jittered. Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?

All else being equal, it would be best to keep the options struct immutable. This would also mean we wouldn't need the first commit, though implementing the interface on a pointer seems fine either way.


pkg/kv/kvserver/allocator_test.go, line 7285 at r2 (raw file):

			rangeRebalanceThreshold: 0.05,
		},
		true, /* scatter */

Should the test call this function twice with different values for scatter to prove that the allocator normally would not make the move?

@aayushshah15
Contributor Author


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?

Just to check my understanding: If we keep the padding provided by kv.allocator.range_rebalance_threshold, the only candidates we'll ever push towards the "underfull" side will be the ones that already have fewer-than-mean replicas. Furthermore, the likelihood of these candidates being pushed out will be determined by how far away from the mean they already are.

I think this all makes it a bit more unlikely to scatter but quite a bit harder to explain / understand. I personally feel that we should have this thing be more on the aggressive side since:

  1. Internally, it's only ever called on empty ranges. So there's no real cost to the aggressiveness.
  2. If called through the SQL SCATTER command, IMO the aggressiveness is desirable and more congruent with what a caller would expect.

What do you think?

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?

Just to check my understanding: If we keep the padding provided by kv.allocator.range_rebalance_threshold, the only candidates we'll ever push towards the "underfull" side will be the ones that already have fewer-than-mean replicas. Furthermore, the likelihood of these candidates being pushed out will be determined by how far away from the mean they already are.

I think this all makes it a bit more unlikely to scatter but quite a bit harder to explain / understand. I personally feel that we should have this thing be more on the aggressive side since:

  1. Internally, it's only ever called on empty ranges. So there's no real cost to the aggressiveness.
  2. If called through the SQL SCATTER command, IMO the aggressiveness is desirable and more congruent with what a caller would expect.

What do you think?

I see what you're saying here. I think it's fine if we set this threshold to 0 then, but it feels wrong to do so here. It actually feels a little strange that the allocator knows anything about the scatter flag. Did you consider a scheme where you teach the rangeCountScorerOptions about scatter and then pass that in to the allocator from (replicateQueue).considerRebalance? Or implement an entirely new scatterScorerOptions?

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 5dec7fd to 837ca41 on February 8, 2022 02:01
Contributor Author

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why the + 0.25?

No real reason, removed.


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could we add a comment here explaining why we're jittering by a function of the range rebalance threshold?

Added before the jitter attribute is set in scatterScorerOptions.


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

implement an entirely new scatterScorerOptions

This seems much nicer; how do you feel about the latest revision?


pkg/kv/kvserver/allocator_test.go, line 7285 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should the test call this function twice with different values for scatter to prove that the allocator normally would not make the move?

Done.

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 837ca41 to 4a459d7 on February 8, 2022 02:47
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator.go, line 914 at r3 (raw file):

}

func (a Allocator) storeListForCandidates(candidates []roachpb.ReplicationTarget) StoreList {

It feels like we're running circles around a progression like []ReplicationTarget -> StoreList -> []ReplicaDescriptor -> StoreList. Is this all principled? Does the typing make sense at each level?


pkg/kv/kvserver/allocator.go, line 1291 at r3 (raw file):

}

func (a *Allocator) scorerOptionsForScatter() *scatterScorerOptions {

nit: move this below the main scorerOptions method.


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

implement an entirely new scatterScorerOptions

This seems much nicer; how do you feel about the latest revision?

Yeah, this is much nicer.


pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):

}

func jittered(val float64, jitter float64, rand allocatorRand) float64 {

Do we need to lock and unlock rand? That feels pretty fragile, and yet it's what we do elsewhere. Maybe we should clean that up and push the locking inside of an API that mirrors (but does not directly expose) rand.Rand.

Or better yet (in a separate commit), let's remove allocatorRand, replace it with a raw *rand.Rand, and make it thread-safe by wrapping its rand.Source in a:

type lockedSource struct {
	m  sync.Mutex
	src rand.Source64
}

// TODO: implement rand.Source64.

This is what math/rand.globalRand does in the standard library.

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 4a459d7 to 60b80c4 on February 9, 2022 05:35
Contributor Author

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 914 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It feels like we're running circles around a progression like []ReplicationTarget -> StoreList -> []ReplicaDescriptor -> StoreList. Is this all principled? Does the typing make sense at each level?

It is kind of unfortunate the way we're converting from one type to another in so many of these methods.

One low hanging fruit to improve here was the fact that the Allocator.Remove{Non}Voter and the Allocator.Allocate{Non}Voter methods were returning a ReplicaDescriptor instead of a ReplicationTarget (I believe this was just a vestige). Aside from that, I don't see any obvious wins. The issue is that some of these methods have multiple callers, not all of whom have access to the same input types.

We might now be going from []ReplicationTarget -> StoreList -> ReplicationTarget, but we shouldn't be doing what you mentioned.


pkg/kv/kvserver/allocator.go, line 1291 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this below the main scorerOptions method.

Done.


pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):

That feels pretty fragile, and yet it's what we do elsewhere

oh, good catch. I'll do the proper locking with allocatorRand now and add a TODO to clean this up later (unless you insist we do it in this PR).

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 60b80c4 to 28a3949 on February 9, 2022 05:50
Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 2 of 2 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

That feels pretty fragile, and yet it's what we do elsewhere

oh, good catch. I'll do the proper locking with allocatorRand now and add a TODO to clean this up later (unless you insist we do it in this PR).

Seems fine to do in a separate PR.


pkg/kv/kvserver/allocator_scorer.go, line 150 at r4 (raw file):

		result *= -1
	}

nit: stray line


pkg/roachpb/metadata_replicas.go, line 472 at r5 (raw file):

}

// EmptyReplicationTarget returns true if `target` is an empty replication

It would be slightly more standard to make this a method called Empty. That would also make it easier to use. We have many instances of that.
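
For reference, the suggested method shape might look like this. It is sketched on a simplified stand-in struct; the real `roachpb.ReplicationTarget` uses typed `NodeID`/`StoreID` fields.

```go
package main

import "fmt"

// ReplicationTarget is a simplified stand-in for roachpb.ReplicationTarget.
type ReplicationTarget struct {
	NodeID  int32
	StoreID int32
}

// Empty reports whether t is the zero value, replacing a free function like
// EmptyReplicationTarget with a method, per the suggestion above.
func (t ReplicationTarget) Empty() bool {
	return t == ReplicationTarget{}
}

func main() {
	fmt.Println(ReplicationTarget{}.Empty())                      // true
	fmt.Println(ReplicationTarget{NodeID: 1, StoreID: 2}.Empty()) // false
}
```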

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 28a3949 to d434cdd on February 11, 2022 19:45
This is to enable the next commit, which adds a new method to this
interface that needs to operate over a pointer receiver.

Release note: None
This patch makes `AdminScatter` trigger a mostly random rebalance action,
which has always been the contract that most of its callers (especially the
`SSTBatcher`) assumed.

Previously, calling `AdminScatter` would simply enqueue that range into the
`replicateQueue`. The `replicateQueue` only cares about reconciling
replica-count differences between stores in the cluster if there are stores
that are more than 5% away from the mean. If all candidate stores were within
5% of the mean, then calling `AdminScatter` wouldn't do anything.

Now, `AdminScatter` still enqueues the range into the `replicateQueue` but with
an option to force it to:
1. Ignore the 5% padding provided by `kv.allocator.range_rebalance_threshold`.
2. Add some jitter to the existing replica-counts of all candidate stores.

This means that `AdminScatter` now forces mostly randomized rebalances to
stores that are reasonable targets (i.e. we still won't rebalance to stores
that are too far above the mean in terms of replica-count, or stores that don't
meet the constraints placed on the range, etc).

Release note (performance improvement): IMPORTs and index backfills
should now do a better job of spreading their load out over the nodes in
the cluster.
Previously they would return `ReplicaDescriptor`s, which was a vestige. These
return values were almost immediately getting cast into `ReplicationTarget` in
every single case anyway. This makes the return types of these allocator
methods a little more consistent.

Release note: None
@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from d434cdd to d792adf on February 11, 2022 19:45
Contributor Author

@aayushshah15 aayushshah15 left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 150 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: stray line

Fixed.


pkg/roachpb/metadata_replicas.go, line 472 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It would be slightly more standard to make this a method called Empty. That would also make it easier to use. We have many instances of that.

Done.

@craig
Contributor

craig bot commented Feb 11, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Feb 11, 2022

Build succeeded:

@craig (bot) merged commit 08ecb04 into cockroachdb:master on Feb 11, 2022
@aayushshah15 deleted the 20220202_randomizeAdminScatter branch on February 11, 2022 23:16
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 13, 2022
After cockroachdb#75894, on every `(replicateQueue).processOneChange` call with the
`scatter` option set, stores in the cluster are essentially randomly
categorized as "overfull" or "underfull" (replicas on overfull stores are then
rebalanced to the underfull ones). Since each existing replica will be on a
store categorized as either underfull or overfull, it can be expected to be
rebalanced with a probability of `0.5`. This means that, roughly speaking, for
`N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into
the `replicateQueue` until it hit an iteration where none of the range's
replicas could be rebalanced. This patch improves this behavior by ensuring
that ranges are only enqueued for scattering up to a pre-defined maximum number
of times.

Release justification: low risk improvement to current functionality

Release note: None
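
The `(1 - (0.5)^N)` estimate above can be checked with a few lines of Go (illustrative only; the function name here is made up):

```go
package main

import (
	"fmt"
	"math"
)

// scatterSuccessProbability returns 1 - 0.5^n: the chance that at least one
// of n replicas moves, if each is independently rebalanced with probability 0.5.
func scatterSuccessProbability(n int) float64 {
	return 1 - math.Pow(0.5, float64(n))
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("N=%d: %.4f\n", n, scatterSuccessProbability(n))
	}
	// For a typical 3x-replicated range, a single scatter pass succeeds
	// with probability 0.875.
}
```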
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 14, 2022
After cockroachdb#75894, on every `(replicateQueue).processOneChange` call with the
`scatter` option set, stores in the cluster are essentially randomly
categorized as "overfull" or "underfull" (replicas on overfull stores are then
rebalanced to the underfull ones). Since each existing replica will be on a
store categorized as either underfull or overfull, it can be expected to be
rebalanced with a probability of `0.5`. This means that, roughly speaking, for
`N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into
the `replicateQueue` until it hit an iteration where none of the range's
replicas could be rebalanced. This patch improves this behavior by ensuring
that ranges are only enqueued for scattering up to a pre-defined maximum number
of times.

Release justification: low risk improvement to current functionality

Release note: None
craig bot pushed a commit that referenced this pull request Mar 14, 2022
77743: kvserver: bound the number of scatter operations per range r=aayushshah15 a=aayushshah15

After #75894, on every `(replicateQueue).processOneChange` call with the
`scatter` option set, stores in the cluster are essentially randomly
categorized as "overfull" or "underfull" (replicas on overfull stores are then
rebalanced to the underfull ones). Since each existing replica will be on a
store categorized as either underfull or overfull, it can be expected to be
rebalanced with a probability of `0.5`. This means that, roughly speaking, for
`N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into
the `replicateQueue` until it hit an iteration where none of the range's
replicas could be rebalanced. This patch improves this behavior by ensuring
that ranges are only enqueued for scattering up to a pre-defined maximum number
of times.

Release justification: low risk improvement to current functionality

Release note: None


Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Successfully merging this pull request may close these issues.

kvserver: force replica movement under AdminScatter