roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

nvanbenschoten · 2023-02-08T16:57:28Z

kv0/enc=false/nodes=3/cpu=32 and a few related variants regressed on Feb 4th:

This regression is only observed on AWS and not on GCE. It's also only observed on the 32 vCPU variant and not on the 8 vCPU variant.

I've determined that this was a result of #94165. When I run the benchmark and switch the kv.raft_log.non_blocking_synchronization.enabled cluster setting midway through to disable async storage writes, throughput increases.

Interestingly, log commit latency also drops.

One possibility is that fsyncs on AWS instance stores (with nobarrier) are fast enough that we are exceeding the Pebble SyncConcurrency of 512. This would cause the async log writes to become synchronous. I don't know why this is worse than the non-async storage write configuration. One thought is that we might be observing the overhead of the asynchronous write path (two goroutine hops) without the benefit (because we're still blocking before entering it) and so we see the throughput regression.

Another thought is that async storage writes trade-off reduced interference between writes on the same range for reduced batching of writes on the same range. In a system where the p50 fsync latency is .10ms, is this the right trade-off?

Next steps:

experiment with a larger SyncConcurrency
experiment with nobarrier disabled
experiment with EBS volumes

Jira issue: CRDB-24341

The text was updated successfully, but these errors were encountered:

nvanbenschoten · 2023-02-08T19:00:38Z

Omitting the nobarrier option on the instance store mount point did not change the 50μs p50 fsync latency.

smg260 · 2023-02-15T20:22:05Z

Looks like this did, in fact, affect GCE performance. I arrived at the same commit when bisecting.

GCE

Informs cockroachdb#96800. This commit updates the kv roachtest suite to scale the workload concurrency inversely with batch size. This ensures that we to account for the cost of each operation, recognizing that the system can handle fewer concurrent operations as the cost of each operation grows. We currently have three kv benchmark variants that set a non-default batch size: - `kv0/enc=false/nodes=3/batch=16` - `kv50/enc=false/nodes=4/cpu=96/batch=64` - `kv95/enc=false/nodes=3/batch=16` Without this change, these tests badly overload their clusters. This leads to p50 latencies in the 300-400ms range, about 10x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. Customers don't run clusters at this level of overload and maximizing throughput once we hit the latency cliff is no longer a goal. By reducing the workload concurrency to avoid from overload, we reduce throughput by about 10% and reduce p50 and p99 latency by about 90%. Release note: None

nvanbenschoten · 2023-03-15T03:13:15Z

I've been digging into these two classes of regressions. I think there are three different aspects of these tests that are combining to cause the throughput reductions. The first and second apply to the AWS regressions and the second and third apply to the batched kv regressions:

AWS is using local instance stores. p50 fsync latency is 50μs, which is between one and two orders of magnitude faster than the disks CRDB typically runs on. Notice that these kv tests all push about 50% more throughput on AWS than the same test pushes on GCP — a result of the difference in disk performance. This hardware configuration skews the throughput/latency trade-off of async storage writes.
The kv roachtest benchmark suite is over-splitting ranges. Each test initially performs 1000 splits. When I remove this and let load-based splitting pick the optimal number of splits, kv tests end up with around 50. The effect of so many ranges is that intra-range batching at the Raft level is non-existent. We have a TODO to remove this over-splitting. If I do that, throughput meaningfully increases.
The batched kv benchmark variants are not accounting for the batch size in the workload concurrency, so they are badly overloading their cluster. This leads to p50 latencies in the 300-400ms range, about 20x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. In other words, we're ending up far to the right of this graph.

So where do we go from here? Ideally, we'd fix all three of these issues. Changing benchmarks is tricky though — we use them to detect regressions and don't want to lose continuity.

That said, I do think we should fix issue 3 immediately. These clusters are so overloaded that the results aren't usable. I've done that in #98648. This change drops p50 latency from about 400ms (far into overload) to about 40ms (approaching overload). However, it also does hurt top-line throughput by about 10%. We'll need to add an annotation to roachperf to reflect this.

With that resolved and the workload no longer overloading the cluster, the question becomes: do we still see a regression from async storage writes on the workload. The answer is no. In fact, async storage writes now shows up as a ~20% throughput win (note: we disable the change using the cluster setting and see a reduction in throughput):

For completeness, we also compare master (with async storage writes enabled) against v22.2 (c5a5c57) using the new workload configuration. Again, master with async storage writes comes out on top. master averages 778.7 qps while v22.2 averages 743.8 qps. When not overloaded, we now see an increase in throughput between the two versions of about 4.7%.

This matches the previous benchmarking performed in #94165. The change trades improved performance (latency and throughput) before overload for worse performance (latency and throughput) far into overload. This is of course a trade-off that we're happy to make.

The over-splitting is the other workload problem that we should do something about. Pre-splitting was added to these tests many years ago when load-based splitting was nascent. Load-based splitting is now mature and is able to find a better split count than we are able to manually. In each of the kv roachtests that I had looked at, removing the --splits flag has increased throughput.

However, I don't think the stability period is the right time for this. It's a disruptive change to all of these benchmarks and may let a real regression slip through. I've opened #98649 for further discussion. I'd suggest that we merge it early in the next release cycle.

This leaves us with the original AWS-only regressions. I'll spend some more time tomorrow/this week looking at it to determine what, if anything, we want to do about those regressions.

nvanbenschoten · 2023-03-16T16:42:23Z

I dug further into the regression on kv0/enc=false/nodes=3/cpu=32 when running on AWS. There are two additional aspects of the regression that deserve mention. Doing so helps explain why we see this on some tests but not others.

First, recall from the previous comment that AWS instance stores (1) and over-splitting (2) are still very much in play with this regression. The third aspect that's relevant is the high core count in this test. While most kv tests run on a host with 8 vCPUs, this test runs on a host with 32 vCPUs. The fourth aspect that's relevant is that this is kv0, so 100% of the workload is write operations that go through Raft.

What do these four factors add up to? A raft scheduler that's overloaded due to mutex contention. Mutex contention was already the bottleneck for this workload before #94165. #94165 then added two more calls into the raft scheduler for each Raft write, so mutex contention is now an even bigger bottleneck. I believe this is an increase of 66%, from 3 calls into the raft scheduler per write to 5 calls into the raft scheduler per write.

To summarize, the regression requires the confluence of four factors:

AWS instance stores are 20x lower latency than the disks we usually use. With the workload's fixed concurrency, this shifts the limiting factor away from disk sync latency to the raft scheduler. Notice that we don't see a regression on GCP's kv0/enc=false/nodes=3/cpu=32.
Over-splitting ranges means that the test does not benefit from intra-range batching. This means that calls by a replica into the raft scheduler are often on behalf of a single write, not multiple Notice that we don't see as much of a regression on kv0/enc=false/nodes=3/cpu=32/splt=0 (though a regression does still exist).
High core count means that the cluster has more capacity to push higher throughput, but also that the Raft scheduler is doing more work and mutex contention is more of a problem. Notice that we don't see any regression on kv0/enc=false/nodes=3 (8 cpus). Meanwhile, we see an even larger regression on a 96-core variant of the test.
kv0 means that all operations are writes that pass through Raft, so every operation hits the raft scheduler. Notice that we don't see a regression on kv95/enc=false/nodes=3.

So now that this is understood, what can we do about it? There are short-term wins and long-term wins.

An easy short-term win is to eliminate unnecessary uses of the Raft scheduler to relieve pressure on its mutex. We can easily eliminate one of the two calls into the raft scheduler that was added in #94165 through a change like nvanbenschoten@8949816. These calls into the raft scheduler were from MsgStorageApplyResp messages. However, these messages are processed by a goroutine that's already running on the scheduler, so there's no need to re-schedule.

A less trivial change would be to eliminate the other new call into the raft scheduler, which is from MsgStorageAppendResp messages. These messages are sent from a callback on raft log syncs. The messages are also meant for bookkeeping, so they're not critical, but we'd still like them to be delivered with some timeliness to permit truncation of the in-memory unstable Raft log. Without a call into the raft scheduler in response to these messages, it's difficult to guarantee that they will ever be delivered.

I've confirmed through experimentation that avoiding a call into the Raft scheduler in both of these cases eliminates the regression.

A longer-term win would be to improve the Raft scheduler so that it can be run at higher throughput without mutex contention issues. One approach we could take for this is to shard its central mutex. I've done so bluntly in nvanbenschoten@46e416e. That patch not only eliminates the performance regression, but it also allows us to hit a throughput above the baseline — around 80k qps on kv0/enc=false/nodes=3/cpu=32. Unfortunately, this kind of change is probably not appropriate for the stability period.

I'll continue testing and looking into what we can do now to close the gap.

nvanbenschoten · 2023-03-16T18:27:56Z

I believe this is an increase of 66%, from 3 calls into the raft scheduler per write to 5 calls into the raft scheduler per write.

To verify this, I added some instrumentation to raft scheduler interactions and re-ran the test. Here are the results:

$ rg 'SCHED_PROF' cockroach.stdout.log | wc -l
18688214

$ rg 'SCHED_PROF' cockroach.stdout.log | sort | uniq -c | sort -nr
9264620 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:286
8212666 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1606
1118390 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:1163
  90503 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1071
    657 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:281
    656 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:280
    381 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:2118
    340 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:781
      1 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:215

Over a minute or so period, we saw 18688214 raft scheduler "enqueue" calls. There we 4 prominent callers:

HandleRaftRequest's call to EnqueueRaftRequest was responsible for 49.6% of the calls
sendLocalRaftMsg's call to enqueueRaftUpdateCheck was responsible for 43.9% of the calls
propBuf.insertIntoArray call to enqueueRaftUpdateCheck was responsible for 6.0% of the calls
handleRaftReadyRaftMuLocked's call to enqueueRaftUpdateCheck was responsible for 0.5% of the calls

sendLocalRaftMsg was added by #94165. So that PR almost doubled the pressure on the raft scheduler, which explains why it's exacerbating mutex contention, leading to the regression under the circumstances outlined above.

I'll update the profiling to give us information about the distribution of callers of sendLocalRaftMsg, which will help us understand how impactful nvanbenschoten@8949816 will be.

nvanbenschoten · 2023-03-16T18:53:36Z

With the different callers of sendLocalRaftMsg split out, we see:

$ rg 'SCHED_PROF' cockroach.stdout.log | wc -l
21547717

$ rg 'SCHED_PROF' cockroach.stdout.log | sort | uniq -c | sort -nr
10250938 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:286
 6029453 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1491
 3821449 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1032
 1321233 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_proposal_buf.go:342
  122528 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:1071
     692 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:280
     680 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:281
     385 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:2120
     357 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:780
       2 SCHED_PROF: github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_split.go:215

From here, we can see that of the 43.9% of all uses of the Raft scheduler from sendLocalRaftMsg's call to enqueueRaftUpdateCheck, 61% are for MsgStorageAppendResp and 39% are for MsgStorageApplyResp.

Put differently, nvanbenschoten@8949816 reduces the use of the raft scheduler by 17.8%. Seems worthwhile.

Informs cockroachdb#96800. This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In cockroachdb#96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for 17.8% raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler is yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions. ``` name old ops/sec new ops/sec delta kv0/enc=false/nodes=3/cpu=32 51.5k ± 2% 57.3k ± 3% +11.18% (p=0.008 n=5+5) name old p50(ms) new p50(ms) delta kv0/enc=false/nodes=3/cpu=32 2.40 ± 0% 2.10 ± 5% -12.50% (p=0.000 n=4+5) name old p99(ms) new p99(ms) delta kv0/enc=false/nodes=3/cpu=32 11.0 ± 0% 9.6 ± 4% -12.36% (p=0.000 n=4+5) ``` Release note: None

98259: roachtest: add an option to create nodes with low memory per core r=lidorcarmel a=lidorcarmel Currently in roachtest we can create VMs with a default RAM per core, or we can request for highmem machines. This patch is adding an option to ask for highcpu (lower RAM per core) machines. Previously, on GCE: - Creating <= 16 core VM created VMs with standard memory per core (~4GM). - Creating a VM with more cores used 'highcpu' nodes (~1GB per core). - Using the `HighMem` option created a VM with high memory per core (~6.5GB). - There was no option to create, for example, a 32 core VM with standard memory (~4GB) - only low mem or high mem. With this patch the test writer can pick any low/standard/high memory per core ratio. The initial need is to run restore performance benchmarks with reduced memory to verify nodes don't OOM in that environment. Running with n1-highcpu-8 is not ideal but nodes should not OOM. We can also, with this change, run with 32 cores and 'standard' RAM (n1-standard-32). Epic: none Release note: None 98389: delegate: make actual number of annotations used by delegator r=msirek a=msirek This updates the delegator, which parses textual SQL statements which represent specific DDL statements, so that the `Annotations` slice allocated in `planner.semaCtx` matches the actual number of annotations built during parsing (if the delegator successfully built a statement). Fixes: #97362 Release note (bug fix): Rare internal errors in SHOW JOBS statements which have a WITH clause are fixed. 98402: roachtest: make 'testImpl.Skip/f' behave consistently with 'TestSpec.… r=smg260 a=smg260 roachtest: make 'testImpl.Skip/f' behave consistently with 'TestSpec.Skip' This will show tests skipped via t.Skip() as ignored in the TC UI, with the caveat that it will show the test as having run twice. See inline comment for details. resolves: #96351 Release note: None Epic: None 98648: roachtest: scale kv concurrency inversely with batch size r=kvoli a=nvanbenschoten Informs #96800. This commit updates the kv roachtest suite to scale the workload concurrency inversely with batch size. This ensures that we to account for the cost of each operation, recognizing that the system can handle fewer concurrent operations as the cost of each operation grows. We currently have three kv benchmark variants that set a non-default batch size: - `kv0/enc=false/nodes=3/batch=16` - `kv50/enc=false/nodes=4/cpu=96/batch=64` - `kv95/enc=false/nodes=3/batch=16` Without this change, these tests badly overload their clusters. This leads to p50 latencies in the 300-400ms range, about 10x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. Customers don't run clusters at this level of overload and maximizing throughput once we hit the latency cliff is no longer a goal. By reducing the workload concurrency to avoid from overload, we reduce throughput by about 10% and reduce p50 and p99 latency by about 90%. Release note: None 98806: changefeedccl: add WITH key_column option r=[jayshrivastava] a=HonoreDB Changefeeds running on an outbox table see a synthetic primary key that isn't useful for downstream partitioning. This PR adds an encoder option to use a different column as the key, not in internal logic, but only in message metadata. This breaks end-to-end ordering because we're only ordered with respect to the actual primary key, and the sink will only order with respect to the key we emit. We therefore require the unordered flag here. Closes #54461. Release note (enterprise change): Added the WITH key_column option to override the key used in message metadata. This changes the key hashed to determine Kafka partitions. It does not affect the output of key_in_value or the domain of the per-key ordering guarantee. 98824: ci: create bazel-fips docker image r=healthy-pod a=rail Previously, we use cockroachdb/bazel image to build and run our tests. In order to run FIPS tests, the image has to use particular FIPS-enabled packages. This PR adds a new bazelbuilder image, that uses FIPS-compliant packages. * Added `build/.bazelbuilderversion-fips` file. * Added `run_bazel_fips` wrapper. * The image builder script uses `dpkg-repack` to reproduce the same packages. Epic: DEVINF-478 Release note: None 98856: jobspb: mark the key visualizer job as automatic r=zachlite a=zachlite Epic: None Release note: None Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com> Co-authored-by: Mark Sirek <sirek@cockroachlabs.com> Co-authored-by: Miral Gadani <miral@cockroachlabs.com> Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com> Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com> Co-authored-by: Rail Aliiev <rail@iqchoice.com> Co-authored-by: zachlite <zachlite@gmail.com>

Informs cockroachdb#96800. This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In cockroachdb#96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for 17.8% raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler is yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions. ``` name old ops/sec new ops/sec delta kv0/enc=false/nodes=3/cpu=32 51.5k ± 2% 57.3k ± 3% +11.18% (p=0.008 n=5+5) name old p50(ms) new p50(ms) delta kv0/enc=false/nodes=3/cpu=32 2.40 ± 0% 2.10 ± 5% -12.50% (p=0.000 n=4+5) name old p99(ms) new p99(ms) delta kv0/enc=false/nodes=3/cpu=32 11.0 ± 0% 9.6 ± 4% -12.36% (p=0.000 n=4+5) ``` Release note: None

98833: kv: don't signal raft scheduler when already on scheduler goroutine r=erikgrinaker a=nvanbenschoten Informs #96800. This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In #96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for **17.8%** raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler is yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions. ``` # benchmarking on AWS with instance stores # see #96800 for why this has little to no effect on GCP of AWS with EBS name old ops/sec new ops/sec delta kv0/enc=false/nodes=3/cpu=32 51.5k ± 2% 57.3k ± 3% +11.18% (p=0.008 n=5+5) name old p50(ms) new p50(ms) delta kv0/enc=false/nodes=3/cpu=32 2.40 ± 0% 2.10 ± 5% -12.50% (p=0.000 n=4+5) name old p99(ms) new p99(ms) delta kv0/enc=false/nodes=3/cpu=32 11.0 ± 0% 9.6 ± 4% -12.36% (p=0.000 n=4+5) ``` Release note: None Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>

98854: kvserver: shard Raft scheduler r=erikgrinaker a=erikgrinaker The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards. By default, we create a new shard for every 16 workers, and distribute workers evenly. We spin up 8 workers per CPU core, capped at 96, so 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding starvation by excessive sharding. The shard size can be adjusted via `COCKROACH_SCHEDULER_SHARD_SIZE`. This results in a substantial performance improvement on high-CPU nodes: ``` name old ops/sec new ops/sec delta kv0/enc=false/nodes=3/cpu=4 7.71k ± 5% 7.93k ± 4% ~ (p=0.310 n=5+5) kv0/enc=false/nodes=3/cpu=8 15.6k ± 3% 14.8k ± 7% ~ (p=0.095 n=5+5) kv0/enc=false/nodes=3/cpu=32 43.4k ± 2% 45.0k ± 3% +3.73% (p=0.032 n=5+5) kv0/enc=false/nodes=3/cpu=96/cloud=aws 40.5k ± 2% 61.7k ± 1% +52.53% (p=0.008 n=5+5) kv0/enc=false/nodes=3/cpu=96/cloud=gce 35.6k ± 4% 44.5k ± 0% +24.99% (p=0.008 n=5+5) name old p50 new p50 delta kv0/enc=false/nodes=3/cpu=4 21.2 ± 6% 20.6 ± 3% ~ (p=0.397 n=5+5) kv0/enc=false/nodes=3/cpu=8 10.5 ± 0% 11.2 ± 8% ~ (p=0.079 n=4+5) kv0/enc=false/nodes=3/cpu=32 4.16 ± 1% 4.00 ± 5% ~ (p=0.143 n=5+5) kv0/enc=false/nodes=3/cpu=96/cloud=aws 3.00 ± 0% 2.00 ± 0% ~ (p=0.079 n=4+5) kv0/enc=false/nodes=3/cpu=96/cloud=gce 4.70 ± 0% 4.10 ± 0% -12.77% (p=0.000 n=5+4) name old p95 new p95 delta kv0/enc=false/nodes=3/cpu=4 61.6 ± 5% 60.8 ± 3% ~ (p=0.762 n=5+5) kv0/enc=false/nodes=3/cpu=8 28.3 ± 4% 30.4 ± 0% +7.34% (p=0.016 n=5+4) kv0/enc=false/nodes=3/cpu=32 7.98 ± 2% 7.60 ± 0% -4.76% (p=0.008 n=5+5) kv0/enc=false/nodes=3/cpu=96/cloud=aws 12.3 ± 2% 6.8 ± 0% -44.72% (p=0.000 n=5+4) kv0/enc=false/nodes=3/cpu=96/cloud=gce 10.2 ± 3% 8.9 ± 0% -12.75% (p=0.000 n=5+4) name old p99 new p99 delta kv0/enc=false/nodes=3/cpu=4 89.8 ± 7% 88.9 ± 6% ~ (p=0.921 n=5+5) kv0/enc=false/nodes=3/cpu=8 46.1 ± 0% 48.6 ± 5% +5.47% (p=0.048 n=5+5) kv0/enc=false/nodes=3/cpu=32 11.5 ± 0% 11.0 ± 0% -4.35% (p=0.008 n=5+5) kv0/enc=false/nodes=3/cpu=96/cloud=aws 14.0 ± 3% 12.1 ± 0% -13.32% (p=0.008 n=5+5) kv0/enc=false/nodes=3/cpu=96/cloud=gce 14.3 ± 3% 13.1 ± 0% -8.55% (p=0.000 n=4+5) ``` The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to `raftSchedulerBatch` pre-sharding the enqueued ranges: ``` name old time/op new time/op delta SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24 457ns ± 2% 564ns ± 2% +23.36% (p=0.001 n=7+7) SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24 461ns ± 3% 563ns ± 2% +22.14% (p=0.000 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24 459ns ± 2% 591ns ± 2% +28.63% (p=0.000 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24 455ns ± 0% 776ns ± 5% +70.60% (p=0.001 n=6+8) SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24 456ns ± 2% 1058ns ± 1% +132.13% (p=0.000 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24 7.15ms ± 1% 8.18ms ± 1% +14.33% (p=0.000 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24 7.13ms ± 1% 8.18ms ± 1% +14.77% (p=0.000 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24 7.12ms ± 2% 7.86ms ± 1% +10.30% (p=0.000 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24 7.20ms ± 1% 7.11ms ± 1% -1.27% (p=0.001 n=8+8) SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24 7.12ms ± 2% 7.16ms ± 3% ~ (p=0.721 n=8+8) ``` Furthermore, on an idle 24-core 3-node cluster with 50.000 unquiesced ranges, this reduced CPU usage from 12% to 10%. Resolves #98811. Touches #96800. Hat tip to `@nvanbenschoten` for the initial prototype. Epic: none Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts. Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>

nvanbenschoten · 2023-03-21T19:10:56Z

Now that #98833 and #98854 have landed, I'm going to close this. Those optimizations more than avoid the regression.

nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv Anything in KV that doesn't belong in a more specific category. A-roachperf-investigation Issue opened as a result of roachperf triage labels Feb 8, 2023

nvanbenschoten self-assigned this Feb 8, 2023

nvanbenschoten added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Feb 28, 2023

nicktrav mentioned this issue Feb 28, 2023

perf: kv0 regression at or around feb 1 #97790

Closed

nvanbenschoten added the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Feb 28, 2023

cockroachdb deleted a comment from blathers-crl bot Feb 28, 2023

blathers-crl bot added the T-kv KV Team label Mar 1, 2023

nvanbenschoten added GA-blocker and removed release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 1, 2023

nvanbenschoten mentioned this issue Mar 15, 2023

roachtest: scale kv concurrency inversely with batch size #98648

Merged

nvanbenschoten mentioned this issue Mar 15, 2023

roachperf: remove pre-splitting from kv benchmarks #98649

Closed

tbg mentioned this issue Mar 16, 2023

kvserver: optimize lock contention in raft scheduler #98811

Closed

nvanbenschoten mentioned this issue Mar 17, 2023

kv: don't signal raft scheduler when already on scheduler goroutine #98833

Merged

erikgrinaker mentioned this issue Mar 17, 2023

kvserver: shard Raft scheduler #98854

Merged

nvanbenschoten closed this as completed Mar 21, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

nvanbenschoten commented Feb 8, 2023 •

edited by cockroach-jira-scripts

Loading

nvanbenschoten commented Feb 8, 2023

smg260 commented Feb 15, 2023

nvanbenschoten commented Mar 15, 2023 •

edited

Loading

nvanbenschoten commented Mar 16, 2023

nvanbenschoten commented Mar 16, 2023 •

edited

Loading

nvanbenschoten commented Mar 16, 2023

nvanbenschoten commented Mar 21, 2023

roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800

Comments

nvanbenschoten commented Feb 8, 2023 • edited by cockroach-jira-scripts Loading

nvanbenschoten commented Feb 8, 2023

smg260 commented Feb 15, 2023

nvanbenschoten commented Mar 15, 2023 • edited Loading

nvanbenschoten commented Mar 16, 2023

nvanbenschoten commented Mar 16, 2023 • edited Loading

nvanbenschoten commented Mar 16, 2023

nvanbenschoten commented Mar 21, 2023

nvanbenschoten commented Feb 8, 2023 •

edited by cockroach-jira-scripts

Loading

nvanbenschoten commented Mar 15, 2023 •

edited

Loading

nvanbenschoten commented Mar 16, 2023 •

edited

Loading