
kvserver: shard Raft scheduler #98854

Merged (2 commits, Mar 21, 2023)

Conversation

@erikgrinaker (Contributor) commented Mar 17, 2023

The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards.

By default, we create a new shard for every 16 workers and distribute workers evenly across shards. We spin up 8 workers per CPU core, capped at 96, so a shard size of 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding worker starvation from excessive sharding. The shard size can be adjusted via COCKROACH_SCHEDULER_SHARD_SIZE.
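
For illustration, the sizing arithmetic described above works out as in the following sketch (illustrative Go only, not the actual CockroachDB code; the constants mirror the defaults in the text):

```go
package main

import "fmt"

// A rough sketch of the sizing defaults described above: 8 workers per CPU
// core capped at 96 workers total, and a new shard for every 16 workers,
// rounding up so every worker belongs to some shard.
func main() {
	const workersPerCPU, maxWorkers, shardSize = 8, 96, 16

	for _, cpus := range []int{4, 8, 32, 96} {
		workers := cpus * workersPerCPU
		if workers > maxWorkers {
			workers = maxWorkers
		}
		shards := (workers + shardSize - 1) / shardSize // ceiling division
		fmt.Printf("cpus=%d workers=%d shards=%d\n", cpus, workers, shards)
	}
	// cpus=4 workers=32 shards=2
	// cpus=8 workers=64 shards=4
	// cpus=32 workers=96 shards=6
	// cpus=96 workers=96 shards=6
}
```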

This results in a substantial performance improvement on high-CPU nodes:

name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4              7.71k ± 5%   7.93k ± 4%     ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8              15.6k ± 3%   14.8k ± 7%     ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32             43.4k ± 2%   45.0k ± 3%   +3.73%  (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   40.5k ± 2%   61.7k ± 1%  +52.53%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   35.6k ± 4%   44.5k ± 0%  +24.99%  (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4               21.2 ± 6%    20.6 ± 3%     ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8               10.5 ± 0%    11.2 ± 8%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32              4.16 ± 1%    4.00 ± 5%     ~     (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    3.00 ± 0%    2.00 ± 0%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    4.70 ± 0%    4.10 ± 0%  -12.77%  (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4               61.6 ± 5%    60.8 ± 3%     ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8               28.3 ± 4%    30.4 ± 0%   +7.34%  (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32              7.98 ± 2%    7.60 ± 0%   -4.76%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    12.3 ± 2%     6.8 ± 0%  -44.72%  (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    10.2 ± 3%     8.9 ± 0%  -12.75%  (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4               89.8 ± 7%    88.9 ± 6%     ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8               46.1 ± 0%    48.6 ± 5%   +5.47%  (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32              11.5 ± 0%    11.0 ± 0%   -4.35%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    14.0 ± 3%    12.1 ± 0%  -13.32%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    14.3 ± 3%    13.1 ± 0%   -8.55%  (p=0.000 n=4+5)

The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to raftSchedulerBatch pre-sharding the enqueued ranges:

name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.

Resolves #98811.
Touches #96800.

Hat tip to @nvanbenschoten for the initial prototype.

Epic: none
Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.

@erikgrinaker erikgrinaker requested review from tbg, nvanbenschoten and a team March 17, 2023 13:14
@erikgrinaker erikgrinaker self-assigned this Mar 17, 2023

@erikgrinaker erikgrinaker changed the title kvserver: shard Raft scheduler mutex kvserver: shard Raft scheduler Mar 17, 2023
@erikgrinaker (Contributor, Author) commented Mar 17, 2023

Going to run kv0/enc=false/nodes=3/cpu=96. Meanwhile, here are some mutex profiles from 50k unquiesced ranges on a 24-core machine:

$ go tool pprof -diff_base mutex-before.pb.gz mutex-after.pb.gz
(pprof) top -cum
Showing nodes accounting for -214.21s, 54.53% of 392.83s total
Dropped 269 nodes (cum <= 1.96s)
Showing top 10 nodes out of 23
      flat  flat%   sum%        cum   cum%
         0     0%     0%   -215.05s 54.74%  github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
  -214.21s 54.53% 54.53%   -214.21s 54.53%  sync.(*Mutex).Unlock
         0     0% 54.53%   -212.96s 54.21%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker
         0     0% 54.53%    173.44s 44.15%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2
         0     0% 54.53%     39.20s  9.98%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processTick
         0     0% 54.53%     38.29s  9.75%  github.com/cockroachdb/cockroach/pkg/util/hlc.(*Clock).NowAsClockTimestamp
         0     0% 54.53%     38.29s  9.75%  github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).tick
         0     0% 54.53%    -18.72s  4.77%  sync.(*Cond).Wait

Before:

[screenshot: mutex profile, 2023-03-17 14:23:27]

After:

[screenshot: mutex profile, 2023-03-17 14:23:40]

@erikgrinaker erikgrinaker marked this pull request as ready for review March 17, 2023 16:14
@erikgrinaker erikgrinaker requested a review from a team as a code owner March 17, 2023 16:14
@erikgrinaker (Contributor, Author) commented:

The results are pretty good (see PR description), as expected over in #96800.

Given the upside, I'm inclined to merge this before the branch cut. The changes are fairly well-contained, so the primary risk would be worker starvation with uneven shard load, but with 12 workers per shard the risk seems fairly minor. Let's have a look at the nightly benchmarks.

@erikgrinaker (Contributor, Author) commented:

Note to self: consider adjusting the tick loop and heartbeat uncoalescing to pre-shard the inputs and enqueue them directly to the shard, avoiding the O(shards) looping.

@erikgrinaker (Contributor, Author) commented:

I added a commit with raftSchedulerBatch which avoids the O(shards) factor in enqueueN scans. However, on a 24-core cluster with 50k unquiesced ranges it didn't have any measurable impact on CPU usage, as one might expect. Unclear whether it's really an improvement, wdyt?
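
For reference, the pre-sharding idea can be sketched as below. This is a toy illustration of the bucketing, not the actual raftSchedulerBatch implementation, and the modulo range-to-shard mapping is an assumption: range IDs are bucketed per shard up front, so a flush touches each shard once instead of filtering the full ID list once per shard.

```go
package main

import "fmt"

// shard and scheduler are toy stand-ins for the sharded scheduler.
type shard struct{ queue []int64 }

type scheduler struct{ shards []*shard }

// batch holds one bucket of range IDs per shard.
type batch [][]int64

func (s *scheduler) newBatch() batch { return make(batch, len(s.shards)) }

// add places the range ID into the bucket for its shard, using a plausible
// modulo mapping (the real scheduler's mapping may differ).
func (b batch) add(rangeID int64) {
	i := int(rangeID % int64(len(b)))
	b[i] = append(b[i], rangeID)
}

// flush hands each shard only its own IDs; in the real scheduler this is
// where each shard's mutex would be taken exactly once.
func (s *scheduler) flush(b batch) {
	for i, ids := range b {
		if len(ids) == 0 {
			continue
		}
		s.shards[i].queue = append(s.shards[i].queue, ids...)
	}
}

func main() {
	s := &scheduler{shards: []*shard{{}, {}, {}}}
	b := s.newBatch()
	for id := int64(1); id <= 10; id++ {
		b.add(id)
	}
	s.flush(b)
	fmt.Println(len(s.shards[0].queue), len(s.shards[1].queue), len(s.shards[2].queue)) // 3 4 3
}
```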

@erikgrinaker erikgrinaker force-pushed the shard-raft-scheduler branch 3 times, most recently from dea3245 to e930504 Compare March 18, 2023 17:17
@erikgrinaker (Contributor, Author) commented Mar 18, 2023

Added a microbenchmark for EnqueueRaftTicks(), which enqueues ranges into a stopped scheduler before clearing out the queue. It varies the range count and worker count (which determines the shard count by a factor of 1/12). The results below are for the collect=true case where we include the cost of iterating over the replica map to collect range IDs. This showed that the pre-sharding via raftSchedulerBatch saved enough to be worthwhile.
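
For context, a benchmark of roughly that shape could look like the sketch below. The toyScheduler type and its methods are hypothetical stand-ins, not the actual raftScheduler or BenchmarkSchedulerEnqueueRaftTicks code; the point is only the structure: enqueue into a scheduler whose workers never run, then clear the queues each iteration.

```go
package sched // in a _test.go file

import (
	"fmt"
	"testing"
)

// toyScheduler is a hypothetical sharded queue used only to show the
// benchmark's shape; its workers are never started, so enqueueing does not
// race with dequeueing.
type toyScheduler struct{ shards [][]int64 }

func newToyScheduler(workers, shardSize int) *toyScheduler {
	n := (workers + shardSize - 1) / shardSize
	return &toyScheduler{shards: make([][]int64, n)}
}

func (s *toyScheduler) enqueue(ids ...int64) {
	for _, id := range ids {
		i := int(id % int64(len(s.shards)))
		s.shards[i] = append(s.shards[i], id)
	}
}

func (s *toyScheduler) clear() {
	for i := range s.shards {
		s.shards[i] = s.shards[i][:0]
	}
}

func BenchmarkEnqueueSketch(b *testing.B) {
	for _, ranges := range []int{10, 100000} {
		for _, workers := range []int{8, 16, 32, 64, 128} {
			b.Run(fmt.Sprintf("ranges=%d/workers=%d", ranges, workers), func(b *testing.B) {
				s := newToyScheduler(workers, 16)
				ids := make([]int64, ranges)
				for i := range ids {
					ids[i] = int64(i + 1)
				}
				b.ResetTimer()
				for i := 0; i < b.N; i++ {
					s.enqueue(ids...)
					s.clear() // flush the queues between iterations
				}
			})
		}
	}
}
```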

Sharded scheduler without raftSchedulerBatch, compared to the unsharded scheduler:

name                                                                 old time/op  new time/op   delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%    565ns ± 3%   +23.65%  (p=0.000 n=7+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%    561ns ± 2%   +21.71%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%    679ns ± 3%   +47.91%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   1086ns ± 2%  +138.97%  (p=0.001 n=6+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%   1838ns ± 1%  +303.51%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%   8.09ms ± 1%   +13.05%  (p=0.000 n=8+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%   8.06ms ± 1%   +13.06%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  10.07ms ± 0%   +41.34%  (p=0.001 n=8+6)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  12.59ms ± 1%   +74.87%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  17.10ms ± 1%  +140.14%  (p=0.000 n=8+8)

Pre-sharding via raftSchedulerBatch, compared to the unsharded scheduler:

name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)

The performance improvement with higher worker counts at ranges=100000 is interesting though. I haven't tracked it down, but my guess is that it's related to the size of the per-shard state map (i.e. hash collisions).

I've squashed the raftSchedulerBatch commit into the main commit here.

@pav-kv (Collaborator) left a comment:

LGTM. All nits are optional.

(Resolved review threads on pkg/kv/kvserver/scheduler_test.go and pkg/kv/kvserver/scheduler.go.)
	if len(*b) != numShards {
		*b = make([][]roachpb.RangeID, numShards)
	}
	return *b
@pav-kv (Collaborator) commented:

Is it guaranteed that len(b[i]) == 0 here for all i (in the happy case when len(*b) == numShards)?

The Reset/Close methods below seem to be used to guarantee that?

@erikgrinaker (Contributor, Author) commented Mar 20, 2023:

Yes, Close() will handle that. It's usually good form to reset the buckets before returning them, in case they hold on to other memory (not relevant here).
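
For illustration, the reuse pattern under discussion looks roughly like the following sketch, assuming a sync.Pool-backed bucket slice (names are hypothetical, not the actual implementation): the put/Close path truncates every bucket to length zero, which is what lets the get path reuse a pooled value whose shard count already matches.

```go
package main

import "sync"

type rangeIDBuckets [][]int64

var bucketsPool = sync.Pool{New: func() any { return new(rangeIDBuckets) }}

// getBuckets returns a per-shard bucket slice with every bucket empty. If
// the pooled value already has the right shard count it is reused, relying
// on putBuckets having truncated each bucket to length 0.
func getBuckets(numShards int) *rangeIDBuckets {
	b := bucketsPool.Get().(*rangeIDBuckets)
	if len(*b) != numShards {
		*b = make([][]int64, numShards)
	}
	return b
}

// putBuckets truncates each bucket (keeping its capacity for reuse) before
// returning the value to the pool, establishing the invariant that
// getBuckets depends on.
func putBuckets(b *rangeIDBuckets) {
	for i := range *b {
		(*b)[i] = (*b)[i][:0]
	}
	bucketsPool.Put(b)
}

func main() {
	b := getBuckets(4)
	(*b)[1] = append((*b)[1], 42)
	putBuckets(b)
	_ = getBuckets(4) // all buckets are empty again
}
```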

Comment on lines 253 to 248
	numShards := numWorkers / workersPerShard
	if numShards < 1 {
		numShards = 1
	}
@pav-kv (Collaborator) commented Mar 20, 2023:

Do you want to guard against workersPerShard <= 0? E.g. assume an unlimited workersPerShard (equivalently, workersPerShard = numWorkers) by default.

Also, how about rounding up?

numShards := (numWorkers + workersPerShard - 1) / workersPerShard

^^^ this is always > 0 unless numWorkers == 0

UPD: I see below that you distribute numWorkers % numShards across other shards, instead of dedicating a remainder shard. It's more balanced this way, nice.

@erikgrinaker (Contributor, Author) commented Mar 20, 2023:

This is pretty straightforward, so doesn't seem worth a helper, but I added a unit test for the shard allocations (which found a bug) and tweaked the structure a bit.

(Further resolved review threads on pkg/kv/kvserver/scheduler.go and pkg/kv/kvserver/scheduler_test.go.)
@erikgrinaker erikgrinaker force-pushed the shard-raft-scheduler branch 3 times, most recently from 3809fa7 to 5545ec4 Compare March 20, 2023 15:29
{1, 3, []int{1}},
{2, 3, []int{2}},
{3, 3, []int{3}},
{4, 3, []int{2, 2}},
@pav-kv (Collaborator) commented Mar 20, 2023:

This one is a bit counter-intuitive. I would expect at least one shard of size shardSize. Why can this happen? Could you add a few more tests like that?
Curious how far from shardSize the actual sizes can diverge: is it just ±1, or can it be more arbitrary?

@erikgrinaker (Contributor, Author) commented Mar 20, 2023:

Having shards of 2+2 workers is much better than having shards of 3+1. If we assume ranges are evenly distributed across shards then we want each range to have as many workers available as possible. Consider e.g. having 10 ranges in each shard, with 3+1 workers: the ranges in the first shard have 0.3 goroutines per range, while the ranges in the second shard have 0.1 goroutines per range.

As to why, it's because ceil(4/3) = 2 shards, and then we distribute workers evenly.
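
A small sketch of that allocation arithmetic (illustrative only; shardWorkers is a hypothetical helper matching the behavior described, not the production function):

```go
package main

import "fmt"

// shardWorkers splits numWorkers across ceil(numWorkers/shardSize) shards
// as evenly as possible, so e.g. 4 workers with shard size 3 become [2 2]
// rather than [3 1].
func shardWorkers(numWorkers, shardSize int) []int {
	numShards := 1
	if shardSize > 0 && numWorkers > shardSize {
		numShards = (numWorkers + shardSize - 1) / shardSize // round up
	}
	shards := make([]int, numShards)
	for i := range shards {
		shards[i] = numWorkers / numShards // even base allocation
		if i < numWorkers%numShards {
			shards[i]++ // spread the remainder one worker at a time
		}
	}
	return shards
}

func main() {
	fmt.Println(shardWorkers(1, 3))   // [1]
	fmt.Println(shardWorkers(3, 3))   // [3]
	fmt.Println(shardWorkers(4, 3))   // [2 2]
	fmt.Println(shardWorkers(96, 16)) // [16 16 16 16 16 16]
}
```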

@erikgrinaker (Contributor, Author) commented:

Looks like this is regressing slightly at lower core counts. Going to do another set with shard size 16 instead of 12.

name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4    7.71k ± 5%   7.98k ± 2%     ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=8    15.6k ± 3%   15.5k ± 1%     ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=32   43.4k ± 2%   42.7k ± 5%     ~     (p=0.310 n=5+5)

name                          old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4     21.2 ± 6%    20.3 ± 3%     ~     (p=0.238 n=5+5)
kv0/enc=false/nodes=3/cpu=8     10.5 ± 0%    10.8 ± 3%     ~     (p=0.238 n=4+5)
kv0/enc=false/nodes=3/cpu=32    4.16 ± 1%    4.24 ± 8%     ~     (p=0.683 n=5+5)

name                          old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4     61.6 ± 5%    61.6 ± 2%     ~     (p=1.000 n=5+5)
kv0/enc=false/nodes=3/cpu=8     28.3 ± 4%    28.7 ± 2%     ~     (p=0.643 n=5+5)
kv0/enc=false/nodes=3/cpu=32    7.98 ± 2%    8.02 ± 1%     ~     (p=1.000 n=5+5)

name                          old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4     89.8 ± 7%    88.1 ± 0%     ~     (p=0.167 n=5+5)
kv0/enc=false/nodes=3/cpu=8     46.1 ± 0%    47.4 ± 3%     ~     (p=0.167 n=5+5)
kv0/enc=false/nodes=3/cpu=32    11.5 ± 0%    12.8 ± 6%  +11.30%  (p=0.008 n=5+5)

@pav-kv (Collaborator) commented Mar 20, 2023

Previously, erikgrinaker (Erik Grinaker) wrote…

Looks like this is regressing slightly at lower core counts. Going to do another set with shard size 16 instead of 12.

When you converge on the default shard/worker sizes, please add them to TestNewSchedulerShards.

@nvanbenschoten (Member) left a comment:

Nice!

Reviewed 3 of 3 files at r3, 4 of 4 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)


-- commits line 2 at r3:
nit: could you include the benchmark results in the commit message?


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

}

func (s *raftScheduler) worker(ctx context.Context, shard *raftSchedulerShard) {

Did you consider moving this to be a method on *raftSchedulerShard? Does the method need access to the full scheduler?


pkg/kv/kvserver/scheduler.go line 471 at r4 (raw file):

}

func (s *raftSchedulerShard) enqueueN(addFlags raftScheduleFlags, ids ...roachpb.RangeID) int {

nit: do we want to use a different name for the raftSchedulerShard receivers? Maybe ss? While reading this file, it's easy to forget which scope a method is operating on.


pkg/kv/kvserver/scheduler_test.go line 258 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Doesn't seem worth it.

+1 to not being worth it, especially if it comes with any runtime cost.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

		}
		s.EnqueueRaftTicks(ids...)
		s.mu.queue = rangeIDQueue{} // flush queue

nit: consider mentioning that we haven't started the raftScheduler, so this won't race with workers pulling off the queue.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

		}
		s.EnqueueRaftTicks(ids...)
		s.mu.queue = rangeIDQueue{} // flush queue

^ raises the question about whether this is measuring the true cost of mutex contention in the scheduler, because we're not seeing contended mutexes and we're not seeing cache misses. I don't see a good way to measure that additional cost and still have a stable benchmark.

@erikgrinaker (Contributor, Author) commented:

I did another run with 16 workers per shard, and the results are slightly better at lower core counts, but it's hard to say, really, given the variation here. The results below are compared to master. I've bumped the default to 16 to be conservative. I'll rerun the 96-core results as well.

name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4    7.71k ± 5%   7.93k ± 4%    ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8    15.6k ± 3%   14.8k ± 7%    ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32   43.4k ± 2%   45.0k ± 3%  +3.73%  (p=0.032 n=5+5)

name                          old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4     21.2 ± 6%    20.6 ± 3%    ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8     10.5 ± 0%    11.2 ± 8%    ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32    4.16 ± 1%    4.00 ± 5%    ~     (p=0.143 n=5+5)

name                          old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4     61.6 ± 5%    60.8 ± 3%    ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8     28.3 ± 4%    30.4 ± 0%  +7.34%  (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32    7.98 ± 2%    7.60 ± 0%  -4.76%  (p=0.008 n=5+5)

name                          old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4     89.8 ± 7%    88.9 ± 6%    ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8     46.1 ± 0%    48.6 ± 5%  +5.47%  (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32    11.5 ± 0%    11.0 ± 0%  -4.35%  (p=0.008 n=5+5)

@erikgrinaker erikgrinaker force-pushed the shard-raft-scheduler branch 2 times, most recently from 8f1615c to 351c54d Compare March 20, 2023 18:05
@erikgrinaker (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)


-- commits line 2 at r3:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: could you include the benchmark results in the commit message?

Yep, done.


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Did you consider moving this to be a method on *raftSchedulerShard? Does the method need access to the full scheduler?

Yes, I considered it, but I wanted to keep the diff minimal to convince ourselves that it was reasonably correct. There doesn't seem to be much to gain from it, so I'll leave it like this for now, but worth reconsidering.


pkg/kv/kvserver/scheduler.go line 471 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: do we want to use a different name for the raftSchedulerShard receivers? Maybe ss? While reading this file, it's easy to forget which scope a method is operating on.

Good idea, done.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider mentioning that we haven't started the raftScheduler, so this won't race with workers pulling off the queue.

Done.


pkg/kv/kvserver/scheduler_test.go line 397 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

^ raises the question about whether this is measuring the true cost of mutex contention in the scheduler, because we're not seeing contended mutexes and we're not seeing cache misses. I don't see a good way to measure that additional cost and still have a stable benchmark.

Yeah, it isn't intended to measure mutex contention, only to measure the basic cost of enqueueing ranges in the scheduler. It was motivated by finding out whether raftSchedulerBatch was even worth it, or if a naïve filtering loop would do.

@erikgrinaker (Contributor, Author) commented Mar 20, 2023

96-core results are about the same with shard size 16, so let's stick with that:

name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws   40.5k ± 2%   61.7k ± 1%  +52.53%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   35.6k ± 4%   44.5k ± 0%  +24.99%  (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws    3.00 ± 0%    2.00 ± 0%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    4.70 ± 0%    4.10 ± 0%  -12.77%  (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws    12.3 ± 2%     6.8 ± 0%  -44.72%  (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    10.2 ± 3%     8.9 ± 0%  -12.75%  (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=96/cloud=aws    14.0 ± 3%    12.1 ± 0%  -13.32%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    14.3 ± 3%    13.1 ± 0%   -8.55%  (p=0.000 n=4+5)

I'm feeling pretty good about merging this for now, and we can have a look at the roachperf nightlies tomorrow to look for any further impact. Wdyt @nvanbenschoten?

@nvanbenschoten (Member) left a comment:

:lgtm:

Reviewed 4 of 4 files at r6, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)


-- commits line 15 at r7:
I wonder if we need this cap anymore. It was added back in 277cc15 to fight collapse under mutation contention. On the other hand, we no longer block on fsync under the raft scheduler, so there's an argument that the worker count is too high now. Not worth changing in either direction right now.


-- commits line 35 at r7:

3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5)

We're just missing out on a statistically significant major improvement. That's too bad.


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yes, I considered it, but I wanted to keep the diff minimal to convince ourselves that it was reasonably correct. There doesn't seem to be much to gain from it, so I'll leave it like this for now, but worth reconsidering.

Mind leaving a TODO to clean this up then? And also s/shard/ss/g in this function and elsewhere for consistent naming.

(The squashed commit message reproduces the PR description, benchmark results, and release note above.)
@erikgrinaker (Contributor, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @pavelkalinnikov, and @tbg)


-- commits line 15 at r7:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I wonder if we need this cap anymore. It was added back in 277cc15 to fight collapse under mutation contention. On the other hand, we no longer block on fsync under the raft scheduler, so there's an argument that the worker count is too high now. Not worth changing in either direction right now.

Yeah, I've been wondering the same thing, but don't want to start fiddling with it now. Wrote up #99063.


-- commits line 35 at r7:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5)

We're just missing out on a statistically significant major improvement. That's too bad.

I know! I was tempted to rerun these with higher counts to get those juicy percentages, but alas, we'll have to settle for a statistically insignificant 33% improvement.


pkg/kv/kvserver/scheduler.go line 329 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mind leaving a TODO to clean this up then? And also s/shard/ss/g in this function and elsewhere for consistent naming.

Did this now, on second thought. It avoids accidental misuse of other shards, and there were only a couple of dependencies to inject. Left the code in the same spot though, for history and ease of review.

@erikgrinaker (Contributor, Author) commented:

bors r+

@craig (bot) commented Mar 21, 2023

Build failed (retrying...):

@craig (bot) commented Mar 21, 2023

Build failed (retrying...):

@craig (bot) commented Mar 21, 2023

Build succeeded:

Merging this pull request closes: kvserver: optimize lock contention in raft scheduler