
roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

Closed

nvanbenschoten opened this issue Feb 8, 2023 · 7 comments

Labels: A-kv, A-roachperf-investigation, branch-release-23.1, C-performance, GA-blocker, T-kv

Comments


nvanbenschoten commented Feb 8, 2023

kv0/enc=false/nodes=3/cpu=32 and a few related variants regressed on Feb 4th:

Screen Shot 2023-02-08 at 11 41 16 AM

This regression is only observed on AWS and not on GCE. It's also only observed on the 32 vCPU variant and not on the 8 vCPU variant.

I've determined that this was a result of #94165. When I run the benchmark and switch the kv.raft_log.non_blocking_synchronization.enabled cluster setting midway through to disable async storage writes, throughput increases.
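For anyone reproducing this, here is a minimal Go sketch of flipping that setting mid-run. The connection string, driver choice, and database name are assumptions, not taken from this issue; only the cluster setting name above comes from the discussion.

```go
// Toggle the async-storage-writes cluster setting partway through a benchmark run.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // assumed Postgres-wire driver; any pgwire driver works
)

func main() {
	// Assumed local, insecure connection string; adjust host/port/certs for a real cluster.
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable async storage writes midway through the benchmark.
	if _, err := db.Exec(
		"SET CLUSTER SETTING kv.raft_log.non_blocking_synchronization.enabled = false",
	); err != nil {
		log.Fatal(err)
	}
	log.Println("async storage writes disabled")
}
```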

Screen Shot 2023-02-08 at 11 40 13 AM

Interestingly, log commit latency also drops.

Screen Shot 2023-02-08 at 11 45 38 AM

One possibility is that fsyncs on AWS instance stores (with nobarrier) are fast enough that we are exceeding the Pebble SyncConcurrency of 512. This would cause the async log writes to become synchronous. I don't know why this would be worse than the non-async storage write configuration. One thought is that we might be paying the overhead of the asynchronous write path (two goroutine hops) without its benefit (because we're still blocking before entering it), which would explain the throughput regression.
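To make that first thought concrete, here is a toy Go sketch (not Pebble's actual implementation; only the SyncConcurrency value of 512 comes from the comment above) of how a bounded sync-concurrency limit can turn an "async" commit into a blocking one: once all slots are in flight, the caller blocks before it can hand off the fsync, so it pays the goroutine-hop overhead without the async benefit.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type asyncSyncer struct {
	slots chan struct{} // bounded in-flight fsyncs, analogous to a SyncConcurrency of 512
	wg    sync.WaitGroup
}

func newAsyncSyncer(concurrency int) *asyncSyncer {
	return &asyncSyncer{slots: make(chan struct{}, concurrency)}
}

// commitAsync hands the fsync off to another goroutine, but only after
// reserving a slot; with very fast fsyncs and high load, the reservation
// itself becomes the blocking step.
func (s *asyncSyncer) commitAsync(fsync func(), done func()) {
	s.slots <- struct{}{} // blocks when all slots are in flight
	s.wg.Add(1)
	go func() {
		defer s.wg.Done()
		fsync()
		done()
		<-s.slots
	}()
}

func main() {
	s := newAsyncSyncer(2) // tiny limit to make the blocking visible
	for i := 0; i < 6; i++ {
		i := i
		s.commitAsync(
			func() { time.Sleep(50 * time.Microsecond) }, // ~AWS instance store fsync latency
			func() { fmt.Println("synced batch", i) },
		)
	}
	s.wg.Wait()
}
```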

Another thought is that async storage writes trade off reduced interference between writes on the same range for reduced batching of writes on the same range. In a system where the p50 fsync latency is 0.10ms, is this the right trade-off?

Next steps:

  • experiment with a larger SyncConcurrency
  • experiment with nobarrier disabled
  • experiment with EBS volumes

Jira issue: CRDB-24341

@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv Anything in KV that doesn't belong in a more specific category. A-roachperf-investigation Issue opened as a result of roachperf triage labels Feb 8, 2023
@nvanbenschoten nvanbenschoten self-assigned this Feb 8, 2023
@nvanbenschoten

Omitting the nobarrier option on the instance store mount point did not change the 50μs p50 fsync latency.


smg260 commented Feb 15, 2023

Looks like this did, in fact, affect GCE performance. I arrived at the same commit when bisecting.

GCE (roachperf screenshot)

@nvanbenschoten nvanbenschoten added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 28, 2023
@nvanbenschoten nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Feb 28, 2023
@cockroachdb cockroachdb deleted a comment from blathers-crl bot Feb 28, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 1, 2023
@nvanbenschoten nvanbenschoten added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 1, 2023
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 15, 2023
Informs cockroachdb#96800.

This commit updates the kv roachtest suite to scale the workload concurrency
inversely with batch size. This ensures that we account for the cost of each
operation, recognizing that the system can handle fewer concurrent operations as
the cost of each operation grows.

We currently have three kv benchmark variants that set a non-default batch size:
- `kv0/enc=false/nodes=3/batch=16`
- `kv50/enc=false/nodes=4/cpu=96/batch=64`
- `kv95/enc=false/nodes=3/batch=16`

Without this change, these tests badly overload their clusters. This leads to
p50 latencies in the 300-400ms range, about 10x greater than the corresponding
p50 in a non-overloaded cluster. In this degraded regime, performance is
unstable and non-representative. Customers don't run clusters at this level of
overload and maximizing throughput once we hit the latency cliff is no longer a
goal.

By reducing the workload concurrency to avoid overload, we reduce throughput by
about 10% and reduce p50 and p99 latency by about 90%.

Release note: None
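For illustration, the inverse scaling described in the commit above can be sketched as follows. The baseline of 64 workers per node at batch size 1 is a made-up constant for the example, not the actual roachtest value.

```go
// Illustrative sketch of scaling workload concurrency inversely with batch size
// so that the total number of in-flight KV operations stays roughly constant as
// the per-request batch size grows.
package main

import "fmt"

// concurrencyForBatch assumes a hypothetical baseline of 64 workers per node at
// batch size 1; the real roachtest constants may differ.
func concurrencyForBatch(nodes, baselinePerNode, batchSize int) int {
	c := nodes * baselinePerNode / batchSize
	if c < 1 {
		c = 1
	}
	return c
}

func main() {
	for _, batch := range []int{1, 16, 64} {
		fmt.Printf("batch=%d -> concurrency=%d\n", batch, concurrencyForBatch(3, 64, batch))
	}
}
```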

nvanbenschoten commented Mar 15, 2023

I've been digging into these two classes of regressions. I think there are three different aspects of these tests that are combining to cause the throughput reductions. The first and second apply to the AWS regressions and the second and third apply to the batched kv regressions:

  1. AWS is using local instance stores. p50 fsync latency is 50μs, which is between one and two orders of magnitude faster than the disks CRDB typically runs on. Notice that these kv tests all push about 50% more throughput on AWS than the same test pushes on GCP — a result of the difference in disk performance. This hardware configuration skews the throughput/latency trade-off of async storage writes.

  2. The kv roachtest benchmark suite is over-splitting ranges. Each test initially performs 1000 splits. When I remove this and let load-based splitting pick the optimal number of splits, kv tests end up with around 50. The effect of so many ranges is that intra-range batching at the Raft level is non-existent. We have a TODO to remove this over-splitting. If I do that, throughput meaningfully increases.

  3. The batched kv benchmark variants are not accounting for the batch size in the workload concurrency, so they are badly overloading their cluster. This leads to p50 latencies in the 300-400ms range, about 20x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. In other words, we're ending up far to the right of this graph.

So where do we go from here? Ideally, we'd fix all three of these issues. Changing benchmarks is tricky though — we use them to detect regressions and don't want to lose continuity.


That said, I do think we should fix issue 3 immediately. These clusters are so overloaded that the results aren't usable. I've done that in #98648. This change drops p50 latency from about 400ms (far into overload) to about 40ms (approaching overload). However, it also hurts top-line throughput by about 10%. We'll need to add an annotation to roachperf to reflect this.

With that resolved and the workload no longer overloading the cluster, the question becomes: do we still see a regression from async storage writes on this workload? The answer is no. In fact, async storage writes now show up as a ~20% throughput win (note: we disable the change using the cluster setting and see a reduction in throughput):

Screenshot 2023-03-14 at 10 16 52 PM

For completeness, we also compare master (with async storage writes enabled) against v22.2 (c5a5c57) using the new workload configuration. Again, master with async storage writes comes out on top. master averages 778.7 qps while v22.2 averages 743.8 qps. When not overloaded, we now see an increase in throughput between the two versions of about 4.7%.

This matches the previous benchmarking performed in #94165. The change trades improved performance (latency and throughput) before overload for worse performance (latency and throughput) far into overload. This is of course a trade-off that we're happy to make.


The over-splitting is the other workload problem that we should do something about. Pre-splitting was added to these tests many years ago when load-based splitting was nascent. Load-based splitting is now mature and is able to find a better split count than we can manually. In each of the kv roachtests that I've looked at, removing the --splits flag increased throughput.

However, I don't think the stability period is the right time for this. It's a disruptive change to all of these benchmarks and may let a real regression slip through. I've opened #98649 for further discussion. I'd suggest that we merge it early in the next release cycle.


This leaves us with the original AWS-only regressions. I'll spend some more time tomorrow/this week looking at it to determine what, if anything, we want to do about those regressions.

@nvanbenschoten

I dug further into the regression on kv0/enc=false/nodes=3/cpu=32 when running on AWS. There are two additional aspects of the regression that deserve mention. Doing so helps explain why we see this on some tests but not others.

First, recall from the previous comment that AWS instance stores (1) and over-splitting (2) are still very much in play with this regression. The third aspect that's relevant is the high core count in this test. While most kv tests run on a host with 8 vCPUs, this test runs on a host with 32 vCPUs. The fourth aspect that's relevant is that this is kv0, so 100% of the workload is write operations that go through Raft.

What do these four factors add up to? A raft scheduler that's overloaded due to mutex contention. Mutex contention was already the bottleneck for this workload before #94165. #94165 then added two more calls into the raft scheduler for each Raft write, so mutex contention is now an even bigger bottleneck. I believe this is an increase of 66%, from 3 calls into the raft scheduler per write to 5 calls into the raft scheduler per write.

To summarize, the regression requires the confluence of four factors:

  1. AWS instance stores have roughly 20x lower latency than the disks we usually use. With the workload's fixed concurrency, this shifts the limiting factor away from disk sync latency to the raft scheduler. Notice that we don't see a regression on GCP's kv0/enc=false/nodes=3/cpu=32.
  2. Over-splitting ranges means that the test does not benefit from intra-range batching. This means that calls by a replica into the raft scheduler are often on behalf of a single write, not multiple. Notice that we don't see as much of a regression on kv0/enc=false/nodes=3/cpu=32/splt=0 (though a regression does still exist).
  3. High core count means that the cluster has more capacity to push higher throughput, but also that the Raft scheduler is doing more work and mutex contention is more of a problem. Notice that we don't see any regression on kv0/enc=false/nodes=3 (8 cpus). Meanwhile, we see an even larger regression on a 96-core variant of the test.
  4. kv0 means that all operations are writes that pass through Raft, so every operation hits the raft scheduler. Notice that we don't see a regression on kv95/enc=false/nodes=3.

So now that this is understood, what can we do about it? There are short-term wins and long-term wins.

An easy short-term win is to eliminate unnecessary uses of the Raft scheduler to relieve pressure on its mutex. We can easily eliminate one of the two calls into the raft scheduler that were added in #94165 through a change like nvanbenschoten@8949816. These calls into the raft scheduler come from MsgStorageApplyResp messages. However, these messages are processed by a goroutine that's already running on the scheduler, so there's no need to re-schedule.
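A simplified Go sketch of that idea (assumed types and names, not the actual CockroachDB code): track whether the caller is already running on a scheduler worker and elide the enqueue in that case, since its local messages will be flushed before the worker yields.

```go
package main

import "fmt"

type scheduler struct {
	queue chan int64 // range IDs pending processing
}

// process runs a worker iteration for one range. Work triggered from inside
// this call passes onScheduler=true so the enqueue can be elided.
func (s *scheduler) process(rangeID int64) {
	fmt.Println("processing range", rangeID)
	// Handling a MsgStorageApplyResp-like local message here would call
	// enqueue with onScheduler=true, which is a no-op.
	s.enqueue(rangeID, true)
}

// enqueue schedules a range for processing unless the caller is already
// running on the scheduler for that range.
func (s *scheduler) enqueue(rangeID int64, onScheduler bool) {
	if onScheduler {
		return // already on the scheduler; the update is picked up before yielding
	}
	s.queue <- rangeID
}

func main() {
	s := &scheduler{queue: make(chan int64, 8)}
	s.enqueue(7, false) // external caller: must go through the queue
	s.process(<-s.queue)
	fmt.Println("queued after processing:", len(s.queue)) // 0: re-enqueue elided
}
```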

A less trivial change would be to eliminate the other new call into the raft scheduler, which is from MsgStorageAppendResp messages. These messages are sent from a callback on raft log syncs. The messages are also meant for bookkeeping, so they're not critical, but we'd still like them to be delivered with some timeliness to permit truncation of the in-memory unstable Raft log. Without a call into the raft scheduler in response to these messages, it's difficult to guarantee that they will ever be delivered.

I've confirmed through experimentation that avoiding a call into the Raft scheduler in both of these cases eliminates the regression.

A longer-term win would be to improve the Raft scheduler so that it can be run at higher throughput without mutex contention issues. One approach we could take for this is to shard its central mutex. I've done so bluntly in nvanbenschoten@46e416e. That patch not only eliminates the performance regression, but it also allows us to hit a throughput above the baseline — around 80k qps on kv0/enc=false/nodes=3/cpu=32. Unfortunately, this kind of change is probably not appropriate for the stability period.
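For reference, a rough Go sketch of what sharding the scheduler's central mutex by range ID looks like (assumed shapes and names, not the actual patch): each shard gets its own mutex and queue, so enqueues for ranges on different shards no longer contend on a single lock.

```go
package main

import (
	"fmt"
	"sync"
)

type shard struct {
	mu    sync.Mutex
	queue []int64
}

type shardedScheduler struct {
	shards []*shard
}

func newShardedScheduler(numShards int) *shardedScheduler {
	s := &shardedScheduler{shards: make([]*shard, numShards)}
	for i := range s.shards {
		s.shards[i] = &shard{}
	}
	return s
}

// enqueue locks only the shard that owns the range, instead of one global mutex.
func (s *shardedScheduler) enqueue(rangeID int64) {
	sh := s.shards[int(rangeID)%len(s.shards)]
	sh.mu.Lock()
	sh.queue = append(sh.queue, rangeID)
	sh.mu.Unlock()
}

func main() {
	s := newShardedScheduler(6)
	for r := int64(1); r <= 12; r++ {
		s.enqueue(r)
	}
	for i, sh := range s.shards {
		fmt.Printf("shard %d: %v\n", i, sh.queue)
	}
}
```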

I'll continue testing and looking into what we can do now to close the gap.


nvanbenschoten commented Mar 16, 2023

I believe this is an increase of 66%, from 3 calls into the raft scheduler per write to 5 calls into the raft scheduler per write.

To verify this, I added some instrumentation to raft scheduler interactions and re-ran the test. Here are the results:

$ rg 'SCHED_PROF' cockroach.stdout.log | wc -l
18688214

$ rg 'SCHED_PROF' cockroach.stdout.log | sort | uniq -c | sort -nr
9264620 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:286
8212666 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1606
1118390 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:1163
  90503 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1071
    657 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:281
    656 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:280
    381 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:2118
    340 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:781
      1 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:215

Over a period of a minute or so, we saw 18688214 raft scheduler "enqueue" calls. There were 4 prominent callers:

  • HandleRaftRequest's call to EnqueueRaftRequest was responsible for 49.6% of the calls
  • sendLocalRaftMsg's call to enqueueRaftUpdateCheck was responsible for 43.9% of the calls
  • propBuf.insertIntoArray call to enqueueRaftUpdateCheck was responsible for 6.0% of the calls
  • handleRaftReadyRaftMuLocked's call to enqueueRaftUpdateCheck was responsible for 0.5% of the calls
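As a sanity check, the percentages above follow directly from the raw counts. A quick throwaway computation (not part of the original instrumentation):

```go
package main

import "fmt"

func main() {
	total := 18688214.0 // all SCHED_PROF enqueue calls in this run
	callers := []struct {
		name  string
		count float64
	}{
		{"HandleRaftRequest -> EnqueueRaftRequest", 9264620},
		{"sendLocalRaftMsg -> enqueueRaftUpdateCheck", 8212666},
		{"propBuf.insertIntoArray -> enqueueRaftUpdateCheck", 1118390},
		{"handleRaftReadyRaftMuLocked -> enqueueRaftUpdateCheck", 90503},
	}
	for _, c := range callers {
		// Prints ~49.6%, ~43.9%, ~6.0%, ~0.5%, matching the list above.
		fmt.Printf("%-55s %.1f%%\n", c.name, 100*c.count/total)
	}
}
```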

sendLocalRaftMsg was added by #94165. So that PR almost doubled the pressure on the raft scheduler, which explains why it's exacerbating mutex contention, leading to the regression under the circumstances outlined above.

I'll update the profiling to give us information about the distribution of callers of sendLocalRaftMsg, which will help us understand how impactful nvanbenschoten@8949816 will be.

@nvanbenschoten

With the different callers of sendLocalRaftMsg split out, we see:

$ rg 'SCHED_PROF' cockroach.stdout.log | wc -l
21547717

$ rg 'SCHED_PROF' cockroach.stdout.log | sort | uniq -c | sort -nr
10250938 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:286
 6029453 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1491
 3821449 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1032
 1321233 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:342
  122528 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1071
     692 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:280
     680 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:281
     385 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:2120
     357 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:780
       2 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:215

From here, we can see that of the calls into the Raft scheduler that come from sendLocalRaftMsg's call to enqueueRaftUpdateCheck (the 43.9% bucket above), 61% are for MsgStorageAppendResp and 39% are for MsgStorageApplyResp.

Put differently, nvanbenschoten@8949816 reduces the use of the raft scheduler by 17.8%. Seems worthwhile.
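For completeness, the 17.8% figure is consistent with the raw counts above, under the assumption that the replica_raft.go:1491 and replica_raft.go:1032 buckets are the two sendLocalRaftMsg call sites; the small differences from the 43.9% quoted earlier come from comparing against the first run's totals.

```go
package main

import "fmt"

func main() {
	total := 21547717.0     // all SCHED_PROF enqueue calls in this run
	appendResp := 6029453.0 // replica_raft.go:1491 bucket
	applyResp := 3821449.0  // replica_raft.go:1032 bucket
	local := appendResp + applyResp

	fmt.Printf("sendLocalRaftMsg share of all enqueues: %.1f%%\n", 100*local/total)      // ~45.7%
	fmt.Printf("MsgStorageAppendResp share of those:    %.1f%%\n", 100*appendResp/local) // ~61%
	fmt.Printf("MsgStorageApplyResp share of those:     %.1f%%\n", 100*applyResp/local)  // ~39%
	fmt.Printf("MsgStorageApplyResp share of all calls: %.1f%%\n", 100*applyResp/total)  // ~17.7%, the savings from the patch
}
```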

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Mar 17, 2023
Informs cockroachdb#96800.

This commit avoids unnecessary interactions with the Raft scheduler as a way
to relieve pressure on the scheduler's mutex. In
cockroachdb#96800 (comment),
we found that calls into the scheduler in response to synchronously handled
`MsgStorageApply`s were responsible for 17.8% of raftScheduler enqueue calls.
These enqueue calls were unnecessary because they were performed while already
running on the raft scheduler, so the corresponding local messages were already
going to be flushed and processed before the scheduler is yielded. By
recognizing when we are in these situations, we can omit the unnecessary
scheduler interactions.

```
name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32   51.5k ± 2%   57.3k ± 3%  +11.18%  (p=0.008 n=5+5)

name                          old p50(ms)  new p50(ms)  delta
kv0/enc=false/nodes=3/cpu=32    2.40 ± 0%    2.10 ± 5%  -12.50%  (p=0.000 n=4+5)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    11.0 ± 0%     9.6 ± 4%  -12.36%  (p=0.000 n=4+5)
```

Release note: None
craig bot pushed a commit that referenced this issue Mar 18, 2023
98259: roachtest: add an option to create nodes with low memory per core r=lidorcarmel a=lidorcarmel

Currently in roachtest we can create VMs with a default amount of RAM per core, or we can request highmem machines. This patch adds an option to ask for highcpu (lower RAM per core) machines.

Previously, on GCE:
- Creating a VM with <= 16 cores used standard memory per core (~4GB).
- Creating a VM with more cores used 'highcpu' nodes (~1GB per core).
- Using the `HighMem` option created a VM with high memory per core (~6.5GB).
- There was no option to create, for example, a 32 core VM with standard memory (~4GB) - only low mem or high mem.

With this patch the test writer can pick any low/standard/high memory per core ratio.

The initial need is to run restore performance benchmarks with reduced memory to verify nodes don't OOM in that environment. Running with n1-highcpu-8 is not ideal but nodes should not OOM.

We can also, with this change, run with 32 cores and 'standard' RAM (n1-standard-32).

Epic: none

Release note: None

98389: delegate: make actual number of annotations used by delegator r=msirek a=msirek

This updates the delegator, which parses textual SQL statements representing
specific DDL statements, so that the `Annotations` slice allocated in
`planner.semaCtx` matches the actual number of annotations built during
parsing (if the delegator successfully built a statement).

Fixes: #97362

Release note (bug fix): Rare internal errors in SHOW JOBS
statements which have a WITH clause are fixed.

98402: roachtest: make 'testImpl.Skip/f' behave consistently with 'TestSpec.… r=smg260 a=smg260

roachtest: make 'testImpl.Skip/f' behave consistently with 'TestSpec.Skip'

This will show tests skipped via t.Skip() as ignored in the TC UI, with the caveat that it will show the test as having run twice. See inline comment for details.

resolves: #96351
Release note: None
Epic: None

98648: roachtest: scale kv concurrency inversely with batch size r=kvoli a=nvanbenschoten

Informs #96800.

This commit updates the kv roachtest suite to scale the workload concurrency inversely with batch size. This ensures that we account for the cost of each operation, recognizing that the system can handle fewer concurrent operations as the cost of each operation grows.

We currently have three kv benchmark variants that set a non-default batch size:
- `kv0/enc=false/nodes=3/batch=16`
- `kv50/enc=false/nodes=4/cpu=96/batch=64`
- `kv95/enc=false/nodes=3/batch=16`

Without this change, these tests badly overload their clusters. This leads to p50 latencies in the 300-400ms range, about 10x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. Customers don't run clusters at this level of overload and maximizing throughput once we hit the latency cliff is no longer a goal.

By reducing the workload concurrency to avoid overload, we reduce throughput by about 10% and reduce p50 and p99 latency by about 90%.

Release note: None

98806: changefeedccl: add WITH key_column option r=[jayshrivastava] a=HonoreDB

Changefeeds running on an outbox table see a synthetic primary key that isn't useful for downstream partitioning. This PR adds an encoder option to use a different column as the key, not in internal logic, but only in message metadata. This breaks end-to-end ordering because we're only ordered with respect to the actual primary key, and the sink will only order with respect to the key we emit. We therefore require the unordered flag here.

Closes #54461.

Release note (enterprise change): Added the WITH key_column option to override the key used in message metadata. This changes the key hashed to determine Kafka partitions. It does not affect the output of key_in_value or the domain of the per-key ordering guarantee.

98824: ci: create bazel-fips docker image r=healthy-pod a=rail

Previously, we used the cockroachdb/bazel image to build and run our tests. In order to run FIPS tests, the image has to use particular FIPS-enabled packages.

This PR adds a new bazelbuilder image, that uses FIPS-compliant packages.

* Added `build/.bazelbuilderversion-fips` file.
* Added `run_bazel_fips` wrapper.
* The image builder script uses `dpkg-repack` to reproduce the same packages.

Epic: DEVINF-478
Release note: None

98856: jobspb: mark the key visualizer job as automatic r=zachlite a=zachlite

Epic: None
Release note: None

Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>
Co-authored-by: Mark Sirek <sirek@cockroachlabs.com>
Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: zachlite <zachlite@gmail.com>
craig bot pushed a commit that referenced this issue Mar 21, 2023
98833: kv: don't signal raft scheduler when already on scheduler goroutine r=erikgrinaker a=nvanbenschoten

Informs #96800.

This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In #96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for **17.8%** of raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler is yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions.

```
# benchmarking on AWS with instance stores
# see #96800 for why this has little to no effect on GCP or AWS with EBS

name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32   51.5k ± 2%   57.3k ± 3%  +11.18%  (p=0.008 n=5+5)

name                          old p50(ms)  new p50(ms)  delta
kv0/enc=false/nodes=3/cpu=32    2.40 ± 0%    2.10 ± 5%  -12.50%  (p=0.000 n=4+5)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    11.0 ± 0%     9.6 ± 4%  -12.36%  (p=0.000 n=4+5)
```

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
craig bot pushed a commit that referenced this issue Mar 21, 2023
98854: kvserver: shard Raft scheduler r=erikgrinaker a=erikgrinaker

The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards.

By default, we create a new shard for every 16 workers, and distribute workers evenly. We spin up 8 workers per CPU core, capped at 96, so 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding starvation by excessive sharding. The shard size can be adjusted via `COCKROACH_SCHEDULER_SHARD_SIZE`.
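A back-of-the-envelope sketch of that sizing (the function name and structure here are illustrative, not the actual CockroachDB code):

```go
package main

import "fmt"

// schedulerShards models the sizing described above: 8 workers per CPU capped
// at 96, with one shard per shardSize (default 16) workers.
func schedulerShards(cpus, shardSize int) (workers, shards int) {
	workers = 8 * cpus
	if workers > 96 {
		workers = 96
	}
	shards = (workers + shardSize - 1) / shardSize // ceiling division
	return workers, shards
}

func main() {
	for _, cpus := range []int{4, 8, 32, 96} {
		w, s := schedulerShards(cpus, 16)
		fmt.Printf("cpu=%d: %d workers across %d shard(s)\n", cpus, w, s)
	}
}
```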

This results in a substantial performance improvement on high-CPU nodes:

```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4              7.71k ± 5%   7.93k ± 4%     ~     (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8              15.6k ± 3%   14.8k ± 7%     ~     (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32             43.4k ± 2%   45.0k ± 3%   +3.73%  (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   40.5k ± 2%   61.7k ± 1%  +52.53%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   35.6k ± 4%   44.5k ± 0%  +24.99%  (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4               21.2 ± 6%    20.6 ± 3%     ~     (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8               10.5 ± 0%    11.2 ± 8%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32              4.16 ± 1%    4.00 ± 5%     ~     (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    3.00 ± 0%    2.00 ± 0%     ~     (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    4.70 ± 0%    4.10 ± 0%  -12.77%  (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4               61.6 ± 5%    60.8 ± 3%     ~     (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8               28.3 ± 4%    30.4 ± 0%   +7.34%  (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32              7.98 ± 2%    7.60 ± 0%   -4.76%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    12.3 ± 2%     6.8 ± 0%  -44.72%  (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    10.2 ± 3%     8.9 ± 0%  -12.75%  (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4               89.8 ± 7%    88.9 ± 6%     ~     (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8               46.1 ± 0%    48.6 ± 5%   +5.47%  (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32              11.5 ± 0%    11.0 ± 0%   -4.35%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws    14.0 ± 3%    12.1 ± 0%  -13.32%  (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce    14.3 ± 3%    13.1 ± 0%   -8.55%  (p=0.000 n=4+5)

```

The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to `raftSchedulerBatch` pre-sharding the enqueued ranges:

```
name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24         457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24        461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24        459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24        455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24       456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24    7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24   7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24   7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24   7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24  7.12ms ± 2%  7.16ms ± 3%      ~     (p=0.721 n=8+8)
```

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.

Resolves #98811.
Touches #96800.

Hat tip to `@nvanbenschoten` for the initial prototype.

Epic: none
Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
@nvanbenschoten

Now that #98833 and #98854 have landed, I'm going to close this. Those optimizations more than make up for the regression.
