kvserver: shard Raft scheduler #98854
Conversation
Force-pushed from b13085d to 8c450f3.
Going to run benchmarks. (Before/after results omitted.)
Force-pushed from 8c450f3 to a7da484.
The results are pretty good (see PR description), as expected over in #96800. Given the upside, I'm inclined to merge this before the branch cut. The changes are fairly well-contained, so the primary risk would be worker starvation with uneven shard load, but with 12 workers per shard the risk seems fairly minor. Let's have a look at the nightly benchmarks.
Note to self: consider adjusting the tick loop and heartbeat uncoalescing to pre-shard the inputs and enqueue them directly to the shard, avoiding the O(shards) looping.
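For illustration, a minimal sketch of what that pre-sharding could look like, assuming a simple rangeID-modulo-shard mapping; `preShard` and the `RangeID` type here are stand-ins rather than the scheduler's actual API:

```go
package main

import "fmt"

type RangeID int64

// preShard buckets range IDs by shard index (assumed mapping: id % numShards),
// so the caller can enqueue each bucket into its shard in a single call rather
// than looping over all shards per range.
func preShard(ids []RangeID, numShards int) [][]RangeID {
	buckets := make([][]RangeID, numShards)
	for _, id := range ids {
		shard := int(id) % numShards
		buckets[shard] = append(buckets[shard], id)
	}
	return buckets
}

func main() {
	ticks := []RangeID{1, 2, 3, 4, 5, 6, 7}
	for shard, ids := range preShard(ticks, 3) {
		// Each bucket could now be handed to its shard's queue in one call.
		fmt.Printf("shard %d: %v\n", shard, ids)
	}
}
```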
I added a commit with …
Force-pushed from dea3245 to e930504.
Added a microbenchmark for …
Sharded scheduler without …
Pre-sharding via …
The performance improvement with higher worker counts at … I've squashed the …
Force-pushed from e930504 to 500dc1c.
LGTM. All nits are optional.
```go
if len(*b) != numShards {
	*b = make([][]roachpb.RangeID, numShards)
}
return *b
```
Is it guaranteed that `len(b[i]) == 0` here for all `i` (in the happy case when `len(*b) == numShards`)? The `Reset`/`Close` methods below seem to be used to guarantee that?
Yes, `Close()` will handle that. It's usually good form to reset before returning, in case they hold on to other memory (not relevant here).
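For context, a hedged sketch of the reuse pattern being discussed, assuming a `sync.Pool`-backed batch whose `Close` truncates each bucket before returning it to the pool; the type and function names below are illustrative, not the PR's:

```go
package main

import "sync"

type RangeID int64

type shardedBatch [][]RangeID

// batchPool recycles per-shard ID buffers between enqueue calls.
var batchPool = sync.Pool{
	New: func() interface{} {
		b := make(shardedBatch, 0)
		return &b
	},
}

// getBatch returns a batch with exactly numShards buckets. The buckets are
// empty because Close truncates them before the batch is pooled.
func getBatch(numShards int) *shardedBatch {
	b := batchPool.Get().(*shardedBatch)
	if len(*b) != numShards {
		*b = make(shardedBatch, numShards)
	}
	return b
}

// Close resets each bucket to zero length (keeping capacity) and returns the
// batch to the pool for reuse.
func (b *shardedBatch) Close() {
	for i := range *b {
		(*b)[i] = (*b)[i][:0]
	}
	batchPool.Put(b)
}

func main() {
	b := getBatch(4)
	(*b)[0] = append((*b)[0], 1, 2, 3)
	b.Close() // buckets are emptied before the next getBatch call sees them
}
```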
pkg/kv/kvserver/scheduler.go
Outdated
```go
numShards := numWorkers / workersPerShard
if numShards < 1 {
	numShards = 1
}
```
Do you want to guard against `workersPerShard <= 0`? E.g. assume an unlimited `workersPerShard` (equivalently, `workersPerShard = numWorkers`) by default.

Also, how about rounding up?

```go
numShards := (numWorkers + workersPerShard - 1) / workersPerShard
```

^^^ this is always > 0 unless `numWorkers == 0`.

UPD: I see below that you distribute `numWorkers % numShards` across other shards, instead of dedicating a remainder shard. It's more balanced this way, nice.
This is pretty straightforward, so doesn't seem worth a helper, but I added a unit test for the shard allocations (which found a bug) and tweaked the structure a bit.
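Not a quote of the PR's code, just a sketch of the allocation scheme discussed in this thread: round the shard count up, then spread the workers so shard sizes differ by at most one. The `shardWorkers` helper name is hypothetical:

```go
package main

import "fmt"

// shardWorkers splits numWorkers across ceil(numWorkers/workersPerShard)
// shards and spreads the remainder so shard sizes differ by at most one.
func shardWorkers(numWorkers, workersPerShard int) []int {
	if numWorkers <= 0 {
		return nil
	}
	if workersPerShard <= 0 || workersPerShard > numWorkers {
		workersPerShard = numWorkers // effectively a single shard
	}
	numShards := (numWorkers + workersPerShard - 1) / workersPerShard // round up
	shards := make([]int, numShards)
	base, remainder := numWorkers/numShards, numWorkers%numShards
	for i := range shards {
		shards[i] = base
		if i < remainder {
			shards[i]++ // hand out the leftover workers one per shard
		}
	}
	return shards
}

func main() {
	fmt.Println(shardWorkers(4, 3))  // [2 2]
	fmt.Println(shardWorkers(16, 3)) // [3 3 3 3 2 2]
}
```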
Force-pushed from 3809fa7 to 5545ec4.
```go
{1, 3, []int{1}},
{2, 3, []int{2}},
{3, 3, []int{3}},
{4, 3, []int{2, 2}},
```
This one is a bit counter-intuitive. I would expect at least one shard of size `shardSize`. Why can this happen? Could you add a few more tests like that?

Curious how far from `shardSize` the actual size can diverge: is it just ±1, or can it be more arbitrary?
Having shards of 2+2 workers is much better than having shards of 3+1. If we assume ranges are evenly distributed across shards then we want each range to have as many workers available as possible. Consider e.g. having 10 ranges in each shard, with 3+1 workers: the ranges in the first shard have 0.3 goroutines per range, while the ranges in the second shard have 0.1 goroutines per range.
As to why, it's because ceil(4/3) = 2 shards, and then we distribute workers evenly.
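A few worked cases under that ceil-then-spread scheme (derived from the sketch above, not from the PR's test table), addressing how far sizes can diverge:

```go
// workersPerShard = 3, assumed scheme: ceil shard count, then spread remainder.
//   4 workers: ceil(4/3) = 2 shards -> base 2, remainder 0 -> [2, 2]
//   5 workers: ceil(5/3) = 2 shards -> base 2, remainder 1 -> [3, 2]
//   7 workers: ceil(7/3) = 3 shards -> base 2, remainder 1 -> [3, 2, 2]
// Under this construction, shard sizes always stay within one worker of each other.
```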
Looks like this is regressing slightly at lower core counts. Going to do another set with shard size 16 instead of 12.
When you converge on some default shard/worker sizes, please add them to …
Nice!
Reviewed 3 of 3 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)
-- commits
line 2 at r3:
nit: could you include the benchmark results in the commit message?
pkg/kv/kvserver/scheduler.go
line 329 at r4 (raw file):
```go
}

func (s *raftScheduler) worker(ctx context.Context, shard *raftSchedulerShard) {
```

Did you consider moving this to be a method on `*raftSchedulerShard`? Does the method need access to the full scheduler?
pkg/kv/kvserver/scheduler.go
line 471 at r4 (raw file):
```go
}

func (s *raftSchedulerShard) enqueueN(addFlags raftScheduleFlags, ids ...roachpb.RangeID) int {
```

nit: do we want to use a different name for the `raftSchedulerShard` receivers? Maybe `ss`? While reading this file, it's easy to forget which scope a method is operating on.
pkg/kv/kvserver/scheduler_test.go
line 258 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Doesn't seem worth it.
+1 to not being worth it, especially if it comes with any runtime cost.
pkg/kv/kvserver/scheduler_test.go
line 397 at r3 (raw file):
```go
}
s.EnqueueRaftTicks(ids...)
s.mu.queue = rangeIDQueue{} // flush queue
```
nit: consider mentioning that we haven't started the raftScheduler, so this won't race with workers pulling off the queue.
pkg/kv/kvserver/scheduler_test.go
line 397 at r3 (raw file):
```go
}
s.EnqueueRaftTicks(ids...)
s.mu.queue = rangeIDQueue{} // flush queue
```
^ raises the question about whether this is measuring the true cost of mutex contention in the scheduler, because we're not seeing contended mutexes and we're not seeing cache misses. I don't see a good way to measure that additional cost and still have a stable benchmark.
Epic: none
Release note: None
Force-pushed from 5545ec4 to b529480.
I did another run with 16 workers per shard, and the results are slightly better at lower core counts, but it's hard to say given the variation here. The below results are compared to …
Force-pushed from 8f1615c to 351c54d.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: could you include the benchmark results in the commit message?
Yep, done.
pkg/kv/kvserver/scheduler.go
line 329 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Did you consider moving this to be a method on `*raftSchedulerShard`? Does the method need access to the full scheduler?
Yes, I considered it, but I wanted to keep the diff minimal to convince ourselves that it was reasonably correct. There doesn't seem to be much to gain from it, so I'll leave it like this for now, but worth reconsidering.
pkg/kv/kvserver/scheduler.go
line 471 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: do we want to use a different name for the `raftSchedulerShard` receivers? Maybe `ss`? While reading this file, it's easy to forget which scope a method is operating on.
Good idea, done.
pkg/kv/kvserver/scheduler_test.go
line 397 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: consider mentioning that we haven't started the raftScheduler, so this won't race with workers pulling off the queue.
Done.
pkg/kv/kvserver/scheduler_test.go
line 397 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
^ raises the question about whether this is measuring the true cost of mutex contention in the scheduler, because we're not seeing contended mutexes and we're not seeing cache misses. I don't see a good way to measure that additional cost and still have a stable benchmark.
Yeah, it isn't intended to measure mutex contention, only to measure the basic cost of enqueueing ranges in the scheduler. It was motivated by finding out whether `raftSchedulerBatch` was even worth it, or if a naïve filtering loop would do.
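To make the measurement setup concrete, here is a self-contained approximation of the pattern: an unstarted, mutex-guarded queue that is filled and then flushed each iteration, so only the enqueue cost is timed. The toy types below stand in for the real scheduler:

```go
package sched

import (
	"sync"
	"testing"
)

type rangeID int64

// toyScheduler stands in for the real raftScheduler: a mutex-guarded queue
// that is never started, so no workers drain it and the benchmark measures
// enqueue cost only.
type toyScheduler struct {
	mu    sync.Mutex
	queue []rangeID
}

func (s *toyScheduler) enqueueAll(ids ...rangeID) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.queue = append(s.queue, ids...)
}

func BenchmarkEnqueue(b *testing.B) {
	const numRanges = 1000
	ids := make([]rangeID, numRanges)
	for i := range ids {
		ids[i] = rangeID(i + 1)
	}
	s := &toyScheduler{}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		s.enqueueAll(ids...)
		s.queue = s.queue[:0] // flush, mirroring the test's s.mu.queue = rangeIDQueue{}
	}
}
```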
Force-pushed from 351c54d to 7fca33b.
96-core results are about the same with shard size 16, so let's stick with that:
I'm feeling pretty good about merging this for now, and we can check the roachperf nightlies tomorrow for any further impact. Wdyt @nvanbenschoten?
Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)
-- commits
line 15 at r7:
I wonder if we need this cap anymore. It was added back in 277cc15 to fight collapse under mutation contention. On the other hand, we no longer block on fsync under the raft scheduler, so there's an argument that the worker count is too high now. Not worth changing in either direction right now.
3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5)
We're just missing out on a statistically significant major improvement. That's too bad.
pkg/kv/kvserver/scheduler.go
line 329 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yes, I considered it, but I wanted to keep the diff minimal to convince ourselves that it was reasonably correct. There doesn't seem to be much to gain from it, so I'll leave it like this for now, but worth reconsidering.
Mind leaving a TODO to clean this up then? And also `s/shard/ss/g` in this function and elsewhere for consistent naming.
The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards.

By default, we create a new shard for every 16 workers, and distribute workers evenly. We spin up 8 workers per CPU core, capped at 96, so 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding starvation by excessive sharding. The shard size can be adjusted via `COCKROACH_SCHEDULER_SHARD_SIZE`.

This results in a substantial performance improvement on high-CPU nodes:

```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4             7.71k ± 5%   7.93k ± 4%      ~      (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8             15.6k ± 3%   14.8k ± 7%      ~      (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32            43.4k ± 2%   45.0k ± 3%    +3.73%   (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  40.5k ± 2%   61.7k ± 1%   +52.53%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  35.6k ± 4%   44.5k ± 0%   +24.99%   (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4             21.2 ± 6%    20.6 ± 3%       ~      (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8             10.5 ± 0%    11.2 ± 8%       ~      (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32            4.16 ± 1%    4.00 ± 5%       ~      (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  3.00 ± 0%    2.00 ± 0%       ~      (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  4.70 ± 0%    4.10 ± 0%    -12.77%   (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4             61.6 ± 5%    60.8 ± 3%       ~      (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8             28.3 ± 4%    30.4 ± 0%     +7.34%   (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32            7.98 ± 2%    7.60 ± 0%     -4.76%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  12.3 ± 2%    6.8 ± 0%     -44.72%   (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  10.2 ± 3%    8.9 ± 0%     -12.75%   (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4             89.8 ± 7%    88.9 ± 6%       ~      (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8             46.1 ± 0%    48.6 ± 5%     +5.47%   (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32            11.5 ± 0%    11.0 ± 0%     -4.35%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  14.0 ± 3%    12.1 ± 0%    -13.32%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  14.3 ± 3%    13.1 ± 0%     -8.55%   (p=0.000 n=4+5)
```

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.
The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to `raftSchedulerBatch` pre-sharding the enqueued ranges:

```
name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24        457ns ± 2%    564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24       461ns ± 3%    563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24       459ns ± 2%    591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24       455ns ± 0%    776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24      456ns ± 2%   1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)
```

Epic: none

Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.
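As a rough illustration of how those defaults compose, here is a hedged, self-contained sketch assuming 8 workers per core capped at 96, a shard per 16 workers, and a rangeID-modulo-shard mapping; the environment-variable handling is simplified (CockroachDB normally reads such variables through its envutil helpers):

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"strconv"
)

// Defaults as described above: 8 workers per CPU core, capped at 96 workers,
// and a new shard for every 16 workers.
const (
	workersPerCPU    = 8
	maxWorkers       = 96
	defaultShardSize = 16
)

// shardSize reads COCKROACH_SCHEDULER_SHARD_SIZE, falling back to the default.
func shardSize() int {
	if v, err := strconv.Atoi(os.Getenv("COCKROACH_SCHEDULER_SHARD_SIZE")); err == nil && v > 0 {
		return v
	}
	return defaultShardSize
}

func main() {
	workers := runtime.GOMAXPROCS(0) * workersPerCPU
	if workers > maxWorkers {
		workers = maxWorkers
	}
	size := shardSize()
	numShards := (workers + size - 1) / size // round up
	// Assumed mapping: a range is pinned to shard rangeID % numShards.
	rangeID := int64(1234)
	fmt.Printf("workers=%d shards=%d range %d -> shard %d\n",
		workers, numShards, rangeID, rangeID%int64(numShards))
}
```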
Force-pushed from 7fca33b to 3478140.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I wonder if we need this cap anymore. It was added back in 277cc15 to fight collapse under mutation contention. On the other hand, we no longer block on fsync under the raft scheduler, so there's an argument that the worker count is too high now. Not worth changing in either direction right now.
Yeah, I've been wondering the same thing, but don't want to start fiddling with it now. Wrote up #99063.
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5)
We're just missing out on a statistically significant major improvement. That's too bad.
I know! I was tempted to rerun these with higher counts to get those juicy percentages, but alas, we'll have to settle for a statistically insignificant 33% improvement.
pkg/kv/kvserver/scheduler.go
line 329 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mind leaving a TODO to clean this up then? And also `s/shard/ss/g` in this function and elsewhere for consistent naming.
Did this now, on second thought. It avoids accidental misuse of other shards, and there were only a couple of dependencies to inject. Left the code in the same spot though, for history and ease of review.
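For readers following along, a hedged sketch of the shape of that refactor, not the PR's code: the worker loop becomes a method on the shard, so it can only touch its own queue, and the remaining scheduler-level dependency is injected as an argument:

```go
// Package kvsched: illustrative sketch only. The worker loop lives on the
// shard, and the per-range Raft work is injected via a small interface rather
// than reached through a scheduler pointer.
package kvsched

import (
	"context"
	"sync"
)

type RangeID int64

// processor abstracts the per-range Raft work a worker performs.
type processor interface {
	process(ctx context.Context, id RangeID)
}

type schedulerShard struct {
	mu    sync.Mutex
	cond  *sync.Cond
	queue []RangeID
	stop  bool
}

func newSchedulerShard() *schedulerShard {
	ss := &schedulerShard{}
	ss.cond = sync.NewCond(&ss.mu)
	return ss
}

// worker drains this shard's queue until stopped, doing the Raft work outside
// the shard mutex. It cannot reach into other shards.
func (ss *schedulerShard) worker(ctx context.Context, p processor) {
	ss.mu.Lock()
	defer ss.mu.Unlock()
	for {
		for len(ss.queue) == 0 && !ss.stop {
			ss.cond.Wait()
		}
		if ss.stop {
			return
		}
		id := ss.queue[0]
		ss.queue = ss.queue[1:]
		ss.mu.Unlock()
		p.process(ctx, id) // Raft work happens without holding the shard mutex
		ss.mu.Lock()
	}
}
```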
bors r+
Build failed (retrying...):
Build failed (retrying...):
Build succeeded:
The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards.

By default, we create a new shard for every 16 workers, and distribute workers evenly. We spin up 8 workers per CPU core, capped at 96, so 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding starvation by excessive sharding. The shard size can be adjusted via `COCKROACH_SCHEDULER_SHARD_SIZE`.

This results in a substantial performance improvement on high-CPU nodes.

The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to `raftSchedulerBatch` pre-sharding the enqueued ranges.

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.
Resolves #98811.
Touches #96800.
Hat tip to @nvanbenschoten for the initial prototype.
Epic: none
Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.