
kvserver: shard Raft scheduler #98854

Merged (2 commits, Mar 21, 2023)

Conversation

@erikgrinaker (Contributor) commented Mar 17, 2023

The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards.

By default, we create a new shard for every 16 workers and distribute workers evenly across shards. We spin up 8 workers per CPU core, capped at 96, so a shard size of 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding worker starvation from excessive sharding. The shard size can be adjusted via COCKROACH_SCHEDULER_SHARD_SIZE.
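
For illustration, the sizing arithmetic described above works out as in the following sketch (illustrative Go only, not the actual CockroachDB code; the constants mirror the defaults in the text):

```go
package main

import "fmt"

// A rough sketch of the sizing defaults described above: 8 workers per CPU
// core capped at 96 workers total, and a new shard for every 16 workers,
// rounding up so every worker belongs to some shard.
func main() {
	const workersPerCPU, maxWorkers, shardSize = 8, 96, 16

	for _, cpus := range []int{4, 8, 32, 96} {
		workers := cpus * workersPerCPU
		if workers > maxWorkers {
			workers = maxWorkers
		}
		shards := (workers + shardSize - 1) / shardSize // ceiling division
		fmt.Printf("cpus=%d workers=%d shards=%d\n", cpus, workers, shards)
	}
	// cpus=4 workers=32 shards=2
	// cpus=8 workers=64 shards=4
	// cpus=32 workers=96 shards=6
	// cpus=96 workers=96 shards=6
}
```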

This results in a substantial performance improvement on high-CPU nodes:

name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4              7.71k ± 5%   7.93k ± 4%     ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8              15.6k ± 3%   14.8k ± 7%     ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32             43.4k ± 2%   45.0k ± 3%   +3.73%  (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   40.5k ± 2%   61.7k ± 1%  +52.53%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   35.6k ± 4%   44.5k ± 0%  +24.99%  (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4               21.2 ± 6%    20.6 ± 3%     ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8               10.5 ± 0%    11.2 ± 8%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32              4.16 ± 1%    4.00 ± 5%     ~     (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    3.00 ± 0%    2.00 ± 0%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    4.70 ± 0%    4.10 ± 0%  -12.77%  (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4               61.6 ± 5%    60.8 ± 3%     ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8               28.3 ± 4%    30.4 ± 0%   +7.34%  (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32              7.98 ± 2%    7.60 ± 0%   -4.76%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    12.3 ± 2%     6.8 ± 0%  -44.72%  (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    10.2 ± 3%     8.9 ± 0%  -12.75%  (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4               89.8 ± 7%    88.9 ± 6%     ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8               46.1 ± 0%    48.6 ± 5%   +5.47%  (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32              11.5 ± 0%    11.0 ± 0%   -4.35%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    14.0 ± 3%    12.1 ± 0%  -13.32%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    14.3 ± 3%    13.1 ± 0%   -8.55%  (p=0.000 n=4+5)

The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to raftSchedulerBatch pre-sharding the enqueued ranges:

name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.

Resolves #98811.
Touches #96800.

Hat tip to @nvanbenschoten for the initial prototype.

Epic: none
Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.

@erikgrinaker erikgrinaker requested review from tbg, nvanbenschoten and a team March 17, 2023 13:14
@erikgrinaker erikgrinaker self-assigned this Mar 17, 2023

@erikgrinaker erikgrinaker changed the title kvserver: shard Raft scheduler mutex kvserver: shard Raft scheduler Mar 17, 2023
@erikgrinaker (Contributor, Author) commented Mar 17, 2023

Going to run kv0/enc=false/nodes=3/cpu=96. Meanwhile, here are some mutex profiles from 50k unquiesced ranges on a 24-core machine:

$ go tool pprof -diff_base mutex-before.pb.gz mutex-after.pb.gz
(pprof) top -cum
Showing nodes accounting for -214.21s, 54.53% of 392.83s total
Dropped 269 nodes (cum <= 1.96s)
Showing top 10 nodes out of 23
      flat  flat%   sum%        cum   cum%
         0     0%     0%   -215.05s 54.74%  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  -214.21s 54.53% 54.53%   -214.21s 54.53%  sync.(*Mutex).Unlock
         0     0% 54.53%   -212.96s 54.21%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker
         0     0% 54.53%    173.44s 44.15%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
         0     0% 54.53%     39.20s  9.98%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processTick
         0     0% 54.53%     38.29s  9.75%  github.com/cockroachdb/cockroach/pkg/util/hlc.(*Clock).NowAsClockTimestamp
         0     0% 54.53%     38.29s  9.75%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).tick
         0     0% 54.53%    -18.72s  4.77%  sync.(*Cond).Wait

Before:

[screenshot: mutex profile, 2023-03-17 14:23:27]

After:

[screenshot: mutex profile, 2023-03-17 14:23:40]

@erikgrinaker erikgrinaker marked this pull request as ready for review March 17, 2023 16:14
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 17, 2023 16:14
@erikgrinaker (Contributor, Author) commented:

The results are pretty good (see PR description), as expected over in #96800.

Given the upside, I'm inclined to merge this before the branch cut. The changes are fairly well-contained, so the primary risk would be worker starvation with uneven shard load, but with 12 workers per shard the risk seems fairly minor. Let's have a look at the nightly benchmarks.

@erikgrinaker (Contributor, Author) commented:

Note to self: consider adjusting the tick loop and heartbeat uncoalescing to pre-shard the inputs and enqueue them directly to the shard, avoiding the O(shards) looping.

@erikgrinaker (Contributor, Author) commented:

I added a commit with raftSchedulerBatch which avoids the O(shards) factor in enqueueN scans. However, on a 24-core cluster with 50k unquiesced ranges it didn't have any measurable impact on CPU usage, as one might expect. Unclear whether it's really an improvement, wdyt?
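
For reference, the pre-sharding idea can be sketched as below. This is a toy illustration of the bucketing, not the actual raftSchedulerBatch implementation, and the modulo range-to-shard mapping is an assumption: range IDs are bucketed per shard up front, so a flush touches each shard once instead of filtering the full ID list once per shard.

```go
package main

import "fmt"

// shard and scheduler are toy stand-ins for the sharded scheduler.
type shard struct{ queue []int64 }

type scheduler struct{ shards []*shard }

// batch holds one bucket of range IDs per shard.
type batch [][]int64

func (s *scheduler) newBatch() batch { return make(batch, len(s.shards)) }

// add places the range ID into the bucket for its shard, using a plausible
// modulo mapping (the real scheduler's mapping may differ).
func (b batch) add(rangeID int64) {
	i := int(rangeID % int64(len(b)))
	b[i] = append(b[i], rangeID)
}

// flush hands each shard only its own IDs; in the real scheduler this is
// where each shard's mutex would be taken exactly once.
func (s *scheduler) flush(b batch) {
	for i, ids := range b {
		if len(ids) == 0 {
			continue
		}
		s.shards[i].queue = append(s.shards[i].queue, ids...)
	}
}

func main() {
	s := &scheduler{shards: []*shard{{}, {}, {}}}
	b := s.newBatch()
	for id := int64(1); id <= 10; id++ {
		b.add(id)
	}
	s.flush(b)
	fmt.Println(len(s.shards[0].queue), len(s.shards[1].queue), len(s.shards[2].queue)) // 3 4 3
}
```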

@erikgrinaker erikgrinaker force-pushed the shard-raft-scheduler branch 3 times, most recently from dea3245 to e930504 Compare March 18, 2023 17:17
@erikgrinaker (Contributor, Author) commented Mar 18, 2023

Added a microbenchmark for EnqueueRaftTicks(), which enqueues ranges into a stopped scheduler before clearing out the queue. It varies the range count and worker count (which determines the shard count by a factor of 1/12). The results below are for the collect=true case where we include the cost of iterating over the replica map to collect range IDs. This showed that the pre-sharding via raftSchedulerBatch saved enough to be worthwhile.
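
For context, a benchmark of roughly that shape could look like the sketch below. The toyScheduler type and its methods are hypothetical stand-ins, not the actual raftScheduler or BenchmarkSchedulerEnqueueRaftTicks code; the point is only the structure: enqueue into a scheduler whose workers never run, then clear the queues each iteration.

```go
package sched // in a _test.go file

import (
	"fmt"
	"testing"
)

// toyScheduler is a hypothetical sharded queue used only to show the
// benchmark's shape; its workers are never started, so enqueueing does not
// race with dequeueing.
type toyScheduler struct{ shards [][]int64 }

func newToyScheduler(workers, shardSize int) *toyScheduler {
	n := (workers + shardSize - 1) / shardSize
	return &toyScheduler{shards: make([][]int64, n)}
}

func (s *toyScheduler) enqueue(ids ...int64) {
	for _, id := range ids {
		i := int(id % int64(len(s.shards)))
		s.shards[i] = append(s.shards[i], id)
	}
}

func (s *toyScheduler) clear() {
	for i := range s.shards {
		s.shards[i] = s.shards[i][:0]
	}
}

func BenchmarkEnqueueSketch(b *testing.B) {
	for _, ranges := range []int{10, 100000} {
		for _, workers := range []int{8, 16, 32, 64, 128} {
			b.Run(fmt.Sprintf("ranges=%d/workers=%d", ranges, workers), func(b *testing.B) {
				s := newToyScheduler(workers, 16)
				ids := make([]int64, ranges)
				for i := range ids {
					ids[i] = int64(i + 1)
				}
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					s.enqueue(ids...)
					s.clear() // flush the queues between iterations
				}
			})
		}
	}
}
```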

Sharded scheduler without raftSchedulerBatch, compared to the unsharded scheduler:

name                                                                 old time/op  new time/op   delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%    565ns ± 3%   +23.65%  (p=0.000 n=7+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%    561ns ± 2%   +21.71%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%    679ns ± 3%   +47.91%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   1086ns ± 2%  +138.97%  (p=0.001 n=6+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%   1838ns ± 1%  +303.51%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%   8.09ms ± 1%   +13.05%  (p=0.000 n=8+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%   8.06ms ± 1%   +13.06%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  10.07ms ± 0%   +41.34%  (p=0.001 n=8+6)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  12.59ms ± 1%   +74.87%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  17.10ms ± 1%  +140.14%  (p=0.000 n=8+8)

Pre-sharding via raftSchedulerBatch, compared to the unsharded scheduler:

name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)

The performance improvement with higher worker counts at ranges=100000 is interesting though. I haven't tracked it down, but my guess is that it's related to the size of the per-shard state map (i.e. hash collisions).

I've squashed the raftSchedulerBatch commit into the main commit here.

@pav-kv (Collaborator) left a comment:

LGTM. All nits are optional.

(Resolved review threads on pkg/kv/kvserver/scheduler_test.go and pkg/kv/kvserver/scheduler.go.)
	if len(*b) != numShards {
		*b = make([][]roachpb.RangeID, numShards)
	}
	return *b
@pav-kv (Collaborator) commented:

Is it guaranteed that len(b[i]) == 0 here for all i (in the happy case when len(*b) == numShards)?

The Reset/Close methods below seem to be used to guarantee that?

@erikgrinaker (Contributor, Author) commented Mar 20, 2023:

Yes, Close() will handle that. It's usually good form to reset the buckets before returning them, in case they hold on to other memory (not relevant here).
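
For illustration, the reuse pattern under discussion looks roughly like the following sketch, assuming a sync.Pool-backed bucket slice (names are hypothetical, not the actual implementation): the put/Close path truncates every bucket to length zero, which is what lets the get path reuse a pooled value whose shard count already matches.

```go
package main

import "sync"

type rangeIDBuckets [][]int64

var bucketsPool = sync.Pool{New: func() any { return new(rangeIDBuckets) }}

// getBuckets returns a per-shard bucket slice with every bucket empty. If
// the pooled value already has the right shard count it is reused, relying
// on putBuckets having truncated each bucket to length 0.
func getBuckets(numShards int) *rangeIDBuckets {
	b := bucketsPool.Get().(*rangeIDBuckets)
	if len(*b) != numShards {
		*b = make([][]int64, numShards)
	}
	return b
}

// putBuckets truncates each bucket (keeping its capacity for reuse) before
// returning the value to the pool, establishing the invariant that
// getBuckets depends on.
func putBuckets(b *rangeIDBuckets) {
	for i := range *b {
		(*b)[i] = (*b)[i][:0]
	}
	bucketsPool.Put(b)
}

func main() {
	b := getBuckets(4)
	(*b)[1] = append((*b)[1], 42)
	putBuckets(b)
	_ = getBuckets(4) // all buckets are empty again
}
```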

Comment on lines 253 to 248
	numShards := numWorkers / workersPerShard
	if numShards < 1 {
		numShards = 1
	}
@pav-kv (Collaborator) commented Mar 20, 2023:

Do you want to guard against workersPerShard <= 0? E.g. assume an unlimited workersPerShard (equivalently, workersPerShard = numWorkers) by default.

Also, how about rounding up?

numShards := (numWorkers + workersPerShard - 1) / workersPerShard

^^^ this is always > 0 unless numWorkers == 0

UPD: I see below that you distribute numWorkers % numShards across other shards, instead of dedicating a remainder shard. It's more balanced this way, nice.

@erikgrinaker (Contributor, Author) commented Mar 20, 2023:

This is pretty straightforward, so doesn't seem worth a helper, but I added a unit test for the shard allocations (which found a bug) and tweaked the structure a bit.

(Further resolved review threads on pkg/kv/kvserver/scheduler.go and pkg/kv/kvserver/scheduler_test.go.)
@erikgrinaker erikgrinaker force-pushed the shard-raft-scheduler branch 3 times, most recently from 3809fa7 to 5545ec4 Compare March 20, 2023 15:29
{1, 3, []int{1}},
{2, 3, []int{2}},
{3, 3, []int{3}},
{4, 3, []int{2, 2}},
@pav-kv (Collaborator) commented Mar 20, 2023:

This one is a bit counter-intuitive. I would expect at least one shard of size shardSize. Why can this happen? Could you add a few more tests like that?
Curious how far from shardSize the actual sizes can diverge: is it just ±1, or can it be more arbitrary?

@erikgrinaker (Contributor, Author) commented Mar 20, 2023:

Having shards of 2+2 workers is much better than having shards of 3+1. If we assume ranges are evenly distributed across shards then we want each range to have as many workers available as possible. Consider e.g. having 10 ranges in each shard, with 3+1 workers: the ranges in the first shard have 0.3 goroutines per range, while the ranges in the second shard have 0.1 goroutines per range.

As to why, it's because ceil(4/3) = 2 shards, and then we distribute workers evenly.
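
A small sketch of that allocation arithmetic (illustrative only; shardWorkers is a hypothetical helper matching the behavior described, not the production function):

```go
package main

import "fmt"

// shardWorkers splits numWorkers across ceil(numWorkers/shardSize) shards
// as evenly as possible, so e.g. 4 workers with shard size 3 become [2 2]
// rather than [3 1].
func shardWorkers(numWorkers, shardSize int) []int {
	numShards := 1
	if shardSize > 0 && numWorkers > shardSize {
		numShards = (numWorkers + shardSize - 1) / shardSize // round up
	}
	shards := make([]int, numShards)
	for i := range shards {
		shards[i] = numWorkers / numShards // even base allocation
		if i < numWorkers%numShards {
			shards[i]++ // spread the remainder one worker at a time
		}
	}
	return shards
}

func main() {
	fmt.Println(shardWorkers(1, 3))   // [1]
	fmt.Println(shardWorkers(3, 3))   // [3]
	fmt.Println(shardWorkers(4, 3))   // [2 2]
	fmt.Println(shardWorkers(96, 16)) // [16 16 16 16 16 16]
}
```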

@erikgrinaker (Contributor, Author) commented:

Looks like this is regressing slightly at lower core counts. Going to do another set with shard size 16 instead of 12.

name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4    7.71k ± 5%   7.98k ± 2%     ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=8    15.6k ± 3%   15.5k ± 1%     ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=32   43.4k ± 2%   42.7k ± 5%     ~     (p=0.310 n=5+5)

name                          old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4     21.2 ± 6%    20.3 ± 3%     ~     (p=0.238 n=5+5)
kv0/enc=false/nodes=3/cpu=8     10.5 ± 0%    10.8 ± 3%     ~     (p=0.238 n=4+5)
kv0/enc=false/nodes=3/cpu=32    4.16 ± 1%    4.24 ± 8%     ~     (p=0.683 n=5+5)

name                          old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4     61.6 ± 5%    61.6 ± 2%     ~     (p=1.000 n=5+5)
kv0/enc=false/nodes=3/cpu=8     28.3 ± 4%    28.7 ± 2%     ~     (p=0.643 n=5+5)
kv0/enc=false/nodes=3/cpu=32    7.98 ± 2%    8.02 ± 1%     ~     (p=1.000 n=5+5)

name                          old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4     89.8 ± 7%    88.1 ± 0%     ~     (p=0.167 n=5+5)
kv0/enc=false/nodes=3/cpu=8     46.1 ± 0%    47.4 ± 3%     ~     (p=0.167 n=5+5)
kv0/enc=false/nodes=3/cpu=32    11.5 ± 0%    12.8 ± 6%  +11.30%  (p=0.008 n=5+5)

@pav-kv (Collaborator) commented Mar 20, 2023

Previously, erikgrinaker (Erik Grinaker) wrote…

Looks like this is regressing slightly at lower core counts. Going to do another set with shard size 16 instead of 12.

When you converge on the default shard/worker sizes, please add them to TestNewSchedulerShards.

@nvanbenschoten (Member) left a comment:

Nice!

Reviewed 3 of 3 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)


-- commits line 2 at r3:
nit: could you include the benchmark results in the commit message?


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

}

func (s *raftScheduler) worker(ctx context.Context, shard *raftSchedulerShard) {

Did you consider moving this to be a method on *raftSchedulerShard? Does the method need access to the full scheduler?


pkg/kv/kvserver/scheduler.go line 471 at r4 (raw file):

}

func (s *raftSchedulerShard) enqueueN(addFlags raftScheduleFlags, ids ...roachpb.RangeID) int {

nit: do we want to use a different name for the raftSchedulerShard receivers? Maybe ss? While reading this file, it's easy to forget which scope a method is operating on.


pkg/kv/kvserver/scheduler_test.go line 258 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Doesn't seem worth it.

+1 to not being worth it, especially if it comes with any runtime cost.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

		}
		s.EnqueueRaftTicks(ids...)
		s.mu.queue = rangeIDQueue{} // flush queue

nit: consider mentioning that we haven't started the raftScheduler, so this won't race with workers pulling off the queue.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

		}
		s.EnqueueRaftTicks(ids...)
		s.mu.queue = rangeIDQueue{} // flush queue

^ raises the question about whether this is measuring the true cost of mutex contention in the scheduler, because we're not seeing contended mutexes and we're not seeing cache misses. I don't see a good way to measure that additional cost and still have a stable benchmark.

@erikgrinaker (Contributor, Author) commented:

I did another run with 16 workers per shard, and the results are slightly better at lower core counts, but it's hard to say, really, given the variation here. The results below are compared to master. I've bumped the default to 16 to be conservative. I'll rerun the 96-core results as well.

name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4    7.71k ± 5%   7.93k ± 4%    ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8    15.6k ± 3%   14.8k ± 7%    ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32   43.4k ± 2%   45.0k ± 3%  +3.73%  (p=0.032 n=5+5)

name                          old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4     21.2 ± 6%    20.6 ± 3%    ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8     10.5 ± 0%    11.2 ± 8%    ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32    4.16 ± 1%    4.00 ± 5%    ~     (p=0.143 n=5+5)

name                          old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4     61.6 ± 5%    60.8 ± 3%    ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8     28.3 ± 4%    30.4 ± 0%  +7.34%  (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32    7.98 ± 2%    7.60 ± 0%  -4.76%  (p=0.008 n=5+5)

name                          old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4     89.8 ± 7%    88.9 ± 6%    ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8     46.1 ± 0%    48.6 ± 5%  +5.47%  (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32    11.5 ± 0%    11.0 ± 0%  -4.35%  (p=0.008 n=5+5)

@erikgrinaker erikgrinaker force-pushed the shard-raft-scheduler branch 2 times, most recently from 8f1615c to 351c54d Compare March 20, 2023 18:05
@erikgrinaker (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)


-- commits line 2 at r3:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: could you include the benchmark results in the commit message?

Yep, done.


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you consider moving this to be a method on *raftSchedulerShard? Does the method need access to the full scheduler?

Yes, I considered it, but I wanted to keep the diff minimal to convince ourselves that it was reasonably correct. There doesn't seem to be much to gain from it, so I'll leave it like this for now, but worth reconsidering.


pkg/kv/kvserver/scheduler.go line 471 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: do we want to use a different name for the raftSchedulerShard receivers? Maybe ss? While reading this file, it's easy to forget which scope a method is operating on.

Good idea, done.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider mentioning that we haven't started the raftScheduler, so this won't race with workers pulling off the queue.

Done.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

^ raises the question about whether this is measuring the true cost of mutex contention in the scheduler, because we're not seeing contended mutexes and we're not seeing cache misses. I don't see a good way to measure that additional cost and still have a stable benchmark.

Yeah, it isn't intended to measure mutex contention, only to measure the basic cost of enqueueing ranges in the scheduler. It was motivated by finding out whether raftSchedulerBatch was even worth it, or if a naïve filtering loop would do.

@erikgrinaker (Contributor, Author) commented Mar 20, 2023

96-core results are about the same with shard size 16, so let's stick with that:

name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws   40.5k ± 2%   61.7k ± 1%  +52.53%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   35.6k ± 4%   44.5k ± 0%  +24.99%  (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws    3.00 ± 0%    2.00 ± 0%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    4.70 ± 0%    4.10 ± 0%  -12.77%  (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws    12.3 ± 2%     6.8 ± 0%  -44.72%  (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    10.2 ± 3%     8.9 ± 0%  -12.75%  (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws    14.0 ± 3%    12.1 ± 0%  -13.32%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    14.3 ± 3%    13.1 ± 0%   -8.55%  (p=0.000 n=4+5)

I'm feeling pretty good about merging this for now, and we can have a look at the roachperf nightlies tomorrow to look for any further impact. Wdyt @nvanbenschoten?

@nvanbenschoten (Member) left a comment:

:lgtm:

Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)


-- commits line 15 at r7:
I wonder if we need this cap anymore. It was added back in 277cc15 to fight collapse under mutation contention. On the other hand, we no longer block on fsync under the raft scheduler, so there's an argument that the worker count is too high now. Not worth changing in either direction right now.


-- commits line 35 at r7:

3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5)

We're just missing out on a statistically significant major improvement. That's too bad.


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yes, I considered it, but I wanted to keep the diff minimal to convince ourselves that it was reasonably correct. There doesn't seem to be much to gain from it, so I'll leave it like this for now, but worth reconsidering.

Mind leaving a TODO to clean this up then? And also s/shard/ss/g in this function and elsewhere for consistent naming.

(The squashed commit message reproduces the PR description, benchmark results, and release note above.)
@erikgrinaker (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)


-- commits line 15 at r7:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I wonder if we need this cap anymore. It was added back in 277cc15 to fight collapse under mutation contention. On the other hand, we no longer block on fsync under the raft scheduler, so there's an argument that the worker count is too high now. Not worth changing in either direction right now.

Yeah, I've been wondering the same thing, but don't want to start fiddling with it now. Wrote up #99063.


-- commits line 35 at r7:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5)

We're just missing out on a statistically significant major improvement. That's too bad.

I know! I was tempted to rerun these with higher counts to get those juicy percentages, but alas, we'll have to settle for a statistically insignificant 33% improvement.


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mind leaving a TODO to clean this up then? And also s/shard/ss/g in this function and elsewhere for consistent naming.

Did this now, on second thought. It avoids accidental misuse of other shards, and there were only a couple of dependencies to inject. Left the code in the same spot though, for history and ease of review.

@erikgrinaker (Contributor, Author) commented:

bors r+

@craig (bot) commented Mar 21, 2023

Build failed (retrying...):

@craig (bot) commented Mar 21, 2023

Build failed (retrying...):

@craig (bot) commented Mar 21, 2023

Build succeeded:

Merging this pull request closes: kvserver: optimize lock contention in raft scheduler