
kvserver: sensitize AdminScatter to force replica movement #75894

Merged: 3 commits into cockroachdb:master on Feb 11, 2022

Conversation

aayushshah15
Contributor

@aayushshah15 aayushshah15 commented Feb 2, 2022

This patch makes AdminScatter trigger a mostly random rebalance action,
which has always been the contract that the SSTBatcher assumed.

Previously, calling AdminScatter would simply enqueue that range into the
replicateQueue. The replicateQueue only cares about reconciling
replica-count differences between stores in the cluster if there are stores
that are more than 5% away from the mean. If all candidate stores were within
5% of the mean, then calling AdminScatter wouldn't do anything.

Now, AdminScatter still enqueues the range into the replicateQueue but with
an option to force it to:

  1. Ignore the 5% padding provided by kv.allocator.range_rebalance_threshold.
  2. Add some jitter to the existing replica-counts of all candidate stores.

This means that AdminScatter now forces mostly randomized rebalances to
stores that are reasonable targets (i.e. we still won't rebalance to stores
that are too far above the mean in terms of replica-count, or stores that don't
meet the constraints placed on the range, etc).

Fixes #74542

Release note (performance improvement): IMPORTs and index backfills
should now do a better job of spreading their load out over the nodes in
the cluster.

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch 3 times, most recently from a4b6fc0 to d800526 on February 3, 2022 00:17
@aayushshah15
Contributor Author

aayushshah15 commented Feb 3, 2022

I'm running two clusters side by side, one based on master (with a 48MB SplitAndScatter threshold in the SSTBatcher) and one based on this PR (with the same 48MB threshold). To stress things, I am running with kv.bulk_io_write.concurrent_addsstable_requests=10. My repro steps are as follows:

export GCE_PROJECT=andrei-jepsen
USER=aayushs
SUFFIX=addsst-master-48mb
export CLUSTER=$USER-$SUFFIX
rp create $CLUSTER --nodes 15 --gce-machine-type=n1-standard-32

rm -rf ./cockroach-linux-2.6.32-gnu-amd64
./build/builder.sh mkrelease amd64-linux-gnu
rp put $CLUSTER ./cockroach-linux-2.6.32-gnu-amd64 cockroach

rp start $CLUSTER:1-13 --racks=4 --args='--vmodule=allocator=3,sst_batcher=3';
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_recovery.max_rate='2GiB'";
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='2GiB'";

# Crank up the SST ingestion concurrency to make the load-imbalance more prominent
rp sql $CLUSTER:1 -- -e "SET CLUSTER SETTING kv.bulk_io_write.concurrent_addsstable_requests=10";

rp ssh $CLUSTER:15 -- './cockroach workload fixtures import tpcc --warehouses=30000 {pgurl:1}';

The results are quite encouraging.

These are the graphs of AdminScatter requests against the number of rebalancing operations by the replicateQueue (these include the rebalances that happened due to scatter). The yellow line is the number of rebalances.

On master
[screenshot: master_48mb_scatter_vs_rebalances]

With this patch
[screenshot: with_fix_48mb_scatter_vs_rebalances]

cc @nvanbenschoten, @dt, @kvoli

@aayushshah15
Contributor Author

These are the corresponding read-amplification charts. This is also very encouraging.

master
[screenshot: master_48mb_read_amp]

With this patch
[screenshot: with_fix_48mb_read_amp]

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch 2 times, most recently from d0ed5f3 to d8b0034 on February 3, 2022 02:07
@dt
Member

dt commented Feb 3, 2022

what explains the spike vs smooth number of AdminScatter requests between the two?

@aayushshah15
Contributor Author

aayushshah15 commented Feb 3, 2022

what explains the spike vs smooth number of AdminScatter requests between the two?

I was wondering the same thing, but notice that those valleys in the rate of AdminScatter requests seem to line up very well with the spikes in read-amplification on that one node. My guess is that that node slows the import down a bit.

It looks like we have a similar pattern in the rate of AddSSTableRequests.
[screenshot: AddSSTable request rate]

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from d8b0034 to 11ff415 on February 3, 2022 02:52
@aayushshah15 marked this pull request as ready for review on February 3, 2022 02:53
@aayushshah15 requested a review from a team as a code owner on February 3, 2022 02:53
@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 11ff415 to ac27cda on February 3, 2022 03:08
@dt
Member

dt commented Feb 3, 2022

if you still have that cluster, can you plot addsstable.delay.total vs the scatter reqs?

just realized the IP was in the screenshots, so here it is:
[screenshot: addsstable.delay.total vs scatter requests]

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch 2 times, most recently from 17a933e to 5dec7fd on February 3, 2022 14:31
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 6 of 6 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

	perturbedStoreDescs := make([]roachpb.StoreDescriptor, 0, len(sl.stores))
	for _, store := range sl.stores {
		jitter := float64(store.Capacity.RangeCount) * o.rangeRebalanceThreshold * (0.25 + allocRand.Float64())

Why the + 0.25?


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

	perturbedStoreDescs := make([]roachpb.StoreDescriptor, 0, len(sl.stores))
	for _, store := range sl.stores {
		jitter := float64(store.Capacity.RangeCount) * o.rangeRebalanceThreshold * (0.25 + allocRand.Float64())

Could we add a comment here explaining why we're jittering by a function of the range rebalance threshold?


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

		perturbedStoreDescs = append(perturbedStoreDescs, store)
	}
	o.rangeRebalanceThreshold = 0

Also, call this out in a comment. It's not entirely clear to me why this is needed if we already jittered. Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?

All else being equal, it would be best to keep the options struct immutable. This would also mean we wouldn't need the first commit, though implementing the interface on a pointer seems fine either way.


pkg/kv/kvserver/allocator_test.go, line 7285 at r2 (raw file):

			rangeRebalanceThreshold: 0.05,
		},
		true, /* scatter */

Should the test call this function twice with different values for scatter to prove that the allocator normally would not make the move?

@aayushshah15
Contributor Author


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?

Just to check my understanding: If we keep the padding provided by kv.allocator.range_rebalance_threshold, the only candidates we'll ever push towards the "underfull" side will be the ones that already have fewer-than-mean replicas. Furthermore, the likelihood of these candidates being pushed out will be determined by how far away from the mean they already are.

I think this all makes it a bit more unlikely to scatter but quite a bit harder to explain / understand. I personally feel that we should have this thing be more on the aggressive side since:

  1. Internally, it's only ever called on empty ranges. So there's no real cost to the aggressiveness.
  2. If called through the SQL SCATTER command, IMO the aggressiveness is desirable and more congruent with what a caller would expect.

What do you think?

Member

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Shouldn't the jitter be enough to push two candidates that were previously within the threshold out of it?

Just to check my understanding: If we keep the padding provided by kv.allocator.range_rebalance_threshold, the only candidates we'll ever push towards the "underfull" side will be the ones that already have fewer-than-mean replicas. Furthermore, the likelihood of these candidates being pushed out will be determined by how far away from the mean they already are.

I think this all makes it a bit more unlikely to scatter but quite a bit harder to explain / understand. I personally feel that we should have this thing be more on the aggressive side since:

  1. Internally, it's only ever called on empty ranges. So there's no real cost to the aggressiveness.
  2. If called through the SQL SCATTER command, IMO the aggressiveness is desirable and more congruent with what a caller would expect.

What do you think?

I see what you're saying here. I think it's fine if we set this threshold to 0 then, but it feels wrong to do so here. It actually feels a little strange that the allocator knows anything about the scatter flag. Did you consider a scheme where you teach the rangeCountScorerOptions about scatter and then pass that in to the allocator from (replicateQueue).considerRebalance? Or implement an entirely new scatterScorerOptions?

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 5dec7fd to 837ca41 on February 8, 2022 02:01
Contributor Author

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Why the + 0.25?

No real reason, removed.


pkg/kv/kvserver/allocator_scorer.go, line 158 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could we add a comment here explaining why we're jittering by a function of the range rebalance threshold?

Added before the jitter attribute is set in scatterScorerOptions.


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

implement an entirely new scatterScorerOptions

This seems much nicer; how do you feel about the latest revision?


pkg/kv/kvserver/allocator_test.go, line 7285 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should the test call this function twice with different values for scatter to prove that the allocator normally would not make the move?

Done.

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 837ca41 to 4a459d7 on February 8, 2022 02:47
Member

@nvanbenschoten nvanbenschoten left a comment

Reviewed 5 of 5 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator.go, line 914 at r3 (raw file):

}

func (a Allocator) storeListForCandidates(candidates []roachpb.ReplicationTarget) StoreList {

It feels like we're running circles around a progression like []ReplicationTarget -> StoreList -> []ReplicaDescriptor -> StoreList. Is this all principled? Does the typing make sense at each level?


pkg/kv/kvserver/allocator.go, line 1291 at r3 (raw file):

}

func (a *Allocator) scorerOptionsForScatter() *scatterScorerOptions {

nit: move this below the main scorerOptions method.


pkg/kv/kvserver/allocator_scorer.go, line 165 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

implement an entirely new scatterScorerOptions

This seems much nicer; how do you feel about the latest revision?

Yeah, this is much nicer.


pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):

}

func jittered(val float64, jitter float64, rand allocatorRand) float64 {

Do we need to lock and unlock rand? That feels pretty fragile, and yet it's what we do elsewhere. Maybe we should clean that up and push the locking inside of an API that mirrors (but does not directly expose) rand.Rand.

Or better yet (in a separate commit), let's remove allocatorRand, replace it with a raw *rand.Rand, and make it thread-safe by wrapping its rand.Source in a:

type lockedSource struct {
	m  sync.Mutex
	src rand.Source64
}

// TODO: implement rand.Source64.

This is what math/rand.globalRand does in the standard library.

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 4a459d7 to 60b80c4 on February 9, 2022 05:35
Contributor Author

@aayushshah15 aayushshah15 left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvserver/allocator.go, line 914 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It feels like we're running circles around a progression like []ReplicationTarget -> StoreList -> []ReplicaDescriptor -> StoreList. Is this all principled? Does the typing make sense at each level?

It is kind of unfortunate the way we're converting from one type to another in so many of these methods.

One low hanging fruit to improve here was the fact that the Allocator.Remove{Non}Voter and the Allocator.Allocate{Non}Voter methods were returning a ReplicaDescriptor instead of a ReplicationTarget (I believe this was just a vestige). Aside from that, I don't see any obvious wins. The issue is that some of these methods have multiple callers, not all of whom have access to the same input types.

We might now be going from []ReplicationTarget -> StoreList -> ReplicationTarget, but we shouldn't be doing what you mentioned.


pkg/kv/kvserver/allocator.go, line 1291 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this below the main scorerOptions method.

Done.


pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):

That feels pretty fragile, and yet it's what we do elsewhere

oh, good catch. I'll do the proper locking with allocatorRand now and add a TODO to clean this up later (unless you insist we do it in this PR).

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 60b80c4 to 28a3949 on February 9, 2022 05:50
Member

@nvanbenschoten nvanbenschoten left a comment

:lgtm:

Reviewed 2 of 2 files at r4, 5 of 5 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvserver/allocator_scorer.go, line 143 at r3 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

That feels pretty fragile, and yet it's what we do elsewhere

oh, good catch. I'll do the proper locking with allocatorRand now and add a TODO to clean this up later (unless you insist we do it in this PR).

Seems fine to do in a separate PR.


pkg/kv/kvserver/allocator_scorer.go, line 150 at r4 (raw file):

		result *= -1
	}

nit: stray line


pkg/roachpb/metadata_replicas.go, line 472 at r5 (raw file):

}

// EmptyReplicationTarget returns true if `target` is an empty replication

It would be slightly more standard to make this a method called Empty. That would also make it easier to use. We have many instances of that.
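
For reference, the suggested method shape might look like this. It is sketched on a simplified stand-in struct; the real `roachpb.ReplicationTarget` uses typed `NodeID`/`StoreID` fields.

```go
package main

import "fmt"

// ReplicationTarget is a simplified stand-in for roachpb.ReplicationTarget.
type ReplicationTarget struct {
	NodeID  int32
	StoreID int32
}

// Empty reports whether t is the zero value, replacing a free function like
// EmptyReplicationTarget with a method, per the suggestion above.
func (t ReplicationTarget) Empty() bool {
	return t == ReplicationTarget{}
}

func main() {
	fmt.Println(ReplicationTarget{}.Empty())                      // true
	fmt.Println(ReplicationTarget{NodeID: 1, StoreID: 2}.Empty()) // false
}
```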

@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from 28a3949 to d434cdd on February 11, 2022 19:45
This is to enable the next commit, which adds a new method to this
interface that needs to operate over a pointer receiver.

Release note: None
This patch makes `AdminScatter` trigger a mostly random rebalance action,
which has always been the contract that most of its callers (especially the
`SSTBatcher`) assumed.

Previously, calling `AdminScatter` would simply enqueue that range into the
`replicateQueue`. The `replicateQueue` only cares about reconciling
replica-count differences between stores in the cluster if there are stores
that are more than 5% away from the mean. If all candidate stores were within
5% of the mean, then calling `AdminScatter` wouldn't do anything.

Now, `AdminScatter` still enqueues the range into the `replicateQueue` but with
an option to force it to:
1. Ignore the 5% padding provided by `kv.allocator.range_rebalance_threshold`.
2. Add some jitter to the existing replica-counts of all candidate stores.

This means that `AdminScatter` now forces mostly randomized rebalances to
stores that are reasonable targets (i.e. we still won't rebalance to stores
that are too far above the mean in terms of replica-count, or stores that don't
meet the constraints placed on the range, etc).

Release note (performance improvement): IMPORTs and index backfills
should now do a better job of spreading their load out over the nodes in
the cluster.
Previously they would return `ReplicaDescriptor`s, which was a vestige. These
return values were almost immediately getting cast into `ReplicationTarget` in
every single case anyway. This makes the return types of these allocator
methods a little more consistent.

Release note: None
@aayushshah15 force-pushed the 20220202_randomizeAdminScatter branch from d434cdd to d792adf on February 11, 2022 19:45
Contributor Author

@aayushshah15 aayushshah15 left a comment

TFTR!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @aayushshah15 and @nvanbenschoten)


pkg/kv/kvserver/allocator_scorer.go, line 150 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: stray line

Fixed.


pkg/roachpb/metadata_replicas.go, line 472 at r5 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

It would be slightly more standard to make this a method called Empty. That would also make it easier to use. We have many instances of that.

Done.

@craig
Contributor

craig bot commented Feb 11, 2022

Build failed (retrying...):

@craig
Contributor

craig bot commented Feb 11, 2022

Build succeeded:

@craig (bot) merged commit 08ecb04 into cockroachdb:master on Feb 11, 2022
@aayushshah15 deleted the 20220202_randomizeAdminScatter branch on February 11, 2022 23:16
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 13, 2022
After cockroachdb#75894, on every `(replicateQueue).processOneChange` call with the
`scatter` option set, stores in the cluster are essentially randomly
categorized as "overfull" or "underfull" (replicas on overfull stores are then
rebalanced to the underfull ones). Since each existing replica will be on a
store categorized as either underfull or overfull, it can be expected to be
rebalanced with a probability of `0.5`. This means that, roughly speaking, for
`N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into
the `replicateQueue` until it hit an iteration where none of the range's
replicas could be rebalanced. This patch improves this behavior by ensuring
that ranges are only enqueued for scattering up to a pre-defined maximum number
of times.

Release justification: low risk improvement to current functionality

Release note: None
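
The `(1 - (0.5)^N)` estimate above can be checked with a few lines of Go (illustrative only; the function name here is made up):

```go
package main

import (
	"fmt"
	"math"
)

// scatterSuccessProbability returns 1 - 0.5^n: the chance that at least one
// of n replicas moves, if each is independently rebalanced with probability 0.5.
func scatterSuccessProbability(n int) float64 {
	return 1 - math.Pow(0.5, float64(n))
}

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("N=%d: %.4f\n", n, scatterSuccessProbability(n))
	}
	// For a typical 3x-replicated range, a single scatter pass succeeds
	// with probability 0.875.
}
```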
aayushshah15 added a commit to aayushshah15/cockroach that referenced this pull request Mar 14, 2022
After cockroachdb#75894, on every `(replicateQueue).processOneChange` call with the
`scatter` option set, stores in the cluster are essentially randomly
categorized as "overfull" or "underfull" (replicas on overfull stores are then
rebalanced to the underfull ones). Since each existing replica will be on a
store categorized as either underfull or overfull, it can be expected to be
rebalanced with a probability of `0.5`. This means that, roughly speaking, for
`N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into
the `replicateQueue` until it hit an iteration where none of the range's
replicas could be rebalanced. This patch improves this behavior by ensuring
that ranges are only enqueued for scattering up to a pre-defined maximum number
of times.

Release justification: low risk improvement to current functionality

Release note: None
craig bot pushed a commit that referenced this pull request Mar 14, 2022
77743: kvserver: bound the number of scatter operations per range r=aayushshah15 a=aayushshah15

After #75894, on every `(replicateQueue).processOneChange` call with the
`scatter` option set, stores in the cluster are essentially randomly
categorized as "overfull" or "underfull" (replicas on overfull stores are then
rebalanced to the underfull ones). Since each existing replica will be on a
store categorized as either underfull or overfull, it can be expected to be
rebalanced with a probability of `0.5`. This means that, roughly speaking, for
`N` replicas, the probability of a successful rebalance is `(1 - (0.5)^N)`.

The previous implementation of `adminScatter` would keep requeuing a range into
the `replicateQueue` until it hit an iteration where none of the range's
replicas could be rebalanced. This patch improves this behavior by ensuring
that ranges are only enqueued for scattering up to a pre-defined maximum number
of times.

Release justification: low risk improvement to current functionality

Release note: None


Co-authored-by: Aayush Shah <aayush.shah15@gmail.com>
Successfully merging this pull request may close these issues.

kvserver: force replica movement under AdminScatter