roachperf: regression on kv0/enc=false/nodes=3/cpu=32, aws but not gce #96800
Informs cockroachdb#96800. This commit updates the kv roachtest suite to scale the workload concurrency inversely with batch size. This ensures that we account for the cost of each operation, recognizing that the system can handle fewer concurrent operations as the cost of each operation grows. We currently have three kv benchmark variants that set a non-default batch size:

- `kv0/enc=false/nodes=3/batch=16`
- `kv50/enc=false/nodes=4/cpu=96/batch=64`
- `kv95/enc=false/nodes=3/batch=16`

Without this change, these tests badly overload their clusters. This leads to p50 latencies in the 300-400ms range, about 10x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. Customers don't run clusters at this level of overload, and maximizing throughput once we hit the latency cliff is no longer a goal. By reducing the workload concurrency to avoid overload, we reduce throughput by about 10% and reduce p50 and p99 latency by about 90%.

Release note: None
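To make the scaling concrete, here is a minimal sketch of computing concurrency inversely with batch size. The per-vCPU baseline of 64 and the helper name are assumptions for illustration, not the actual roachtest code:

```go
package kvbench

// workloadConcurrency is an illustrative helper: it scales the workload's
// concurrency inversely with batch size so that the total per-operation cost,
// rather than the raw operation count, stays roughly constant. The baseline
// of 64 workers per vCPU is an assumed value, not the roachtest default.
func workloadConcurrency(vcpus, batchSize int) int {
	if batchSize < 1 {
		batchSize = 1
	}
	base := vcpus * 64       // concurrency we would run at batch size 1
	conc := base / batchSize // larger batches => fewer concurrent operations
	if conc < 1 {
		conc = 1
	}
	return conc
}
```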
I've been digging into these two classes of regressions. I think there are three different aspects of these tests that are combining to cause the throughput reductions. The first and second apply to the AWS regressions, and the second and third apply to the batched kv regressions:

1. Fast fsyncs on AWS instance stores (mounted with `nobarrier`), which change the cost profile of Raft log writes relative to GCE.
2. Over-splitting from the tests' manual pre-splits, which leaves the clusters with more ranges than load-based splitting would choose.
3. Workload concurrency that badly overloads the clusters, pushing them far past the latency cliff.
So where do we go from here? Ideally, we'd fix all three of these issues. Changing benchmarks is tricky, though: we use them to detect regressions and don't want to lose continuity.

That said, I do think we should fix issue 3 immediately. These clusters are so overloaded that the results aren't usable. I've done that in #98648. This change drops p50 latency from about 400ms (far into overload) to about 40ms (approaching overload). However, it also hurts top-line throughput by about 10%. We'll need to add an annotation to roachperf to reflect this.

With that resolved and the workload no longer overloading the cluster, the question becomes: do we still see a regression from async storage writes on this workload? The answer is no. In fact, async storage writes now shows up as a ~20% throughput win (note: we disable the change using the cluster setting and see a reduction in throughput):

For completeness, we also compare master (with async storage writes enabled) against v22.2 (c5a5c57) using the new workload configuration. Again, master with async storage writes comes out on top. master averages 778.7 qps while v22.2 averages 743.8 qps. When not overloaded, we now see an increase in throughput between the two versions of about 4.7%. This matches the previous benchmarking performed in #94165. The change trades improved performance (latency and throughput) before overload for worse performance (latency and throughput) far into overload. This is of course a trade-off that we're happy to make.

The over-splitting is the other workload problem that we should do something about. Pre-splitting was added to these tests many years ago when load-based splitting was nascent. Load-based splitting is now mature and is able to find a better split count than we are able to manually. In each of the kv roachtests that I had looked at, removing the pre-splits let load-based splitting settle on a better range count. However, I don't think the stability period is the right time for this. It's a disruptive change to all of these benchmarks and may let a real regression slip through. I've opened #98649 for further discussion. I'd suggest that we merge it early in the next release cycle.

This leaves us with the original AWS-only regressions. I'll spend some more time tomorrow/this week looking at them to determine what, if anything, we want to do about those regressions.
I dug further into the regression on `kv0/enc=false/nodes=3/cpu=32`.

First, recall from the previous comment that AWS instance stores (1) and over-splitting (2) are still very much in play with this regression. The third aspect that's relevant is the high core count in this test. While most kv tests run on a host with 8 vCPUs, this test runs on a host with 32 vCPUs. The fourth aspect that's relevant is that this is kv0, so 100% of the workload is write operations that go through Raft.

What do these four factors add up to? A raft scheduler that's overloaded due to mutex contention. Mutex contention was already the bottleneck for this workload before #94165. #94165 then added two more calls into the raft scheduler for each Raft write, so mutex contention is now an even bigger bottleneck. I believe this is an increase of 66%, from 3 calls into the raft scheduler per write to 5 calls into the raft scheduler per write.

To summarize, the regression requires the confluence of four factors:

1. Fast fsyncs on AWS instance stores (with `nobarrier`).
2. Over-splitting, which inflates the number of ranges.
3. A high core count (32 vCPUs), which means many scheduler workers contending on the same mutex.
4. A 100% write workload (kv0), where every operation goes through Raft and therefore through the raft scheduler.
So now that this is understood, what can we do about it? There are short-term wins and long-term wins.

An easy short-term win is to eliminate unnecessary uses of the Raft scheduler to relieve pressure on its mutex. We can easily eliminate one of the two calls into the raft scheduler that was added in #94165 through a change like nvanbenschoten@8949816. These calls into the raft scheduler were from `MsgStorageApply` messages that are handled synchronously while already running on the scheduler, so signaling the scheduler again is unnecessary. A less trivial change would be to eliminate the other new call into the raft scheduler, which is from the asynchronous log append completion path. I've confirmed through experimentation that avoiding a call into the Raft scheduler in both of these cases eliminates the regression.

A longer-term win would be to improve the Raft scheduler so that it can be run at higher throughput without mutex contention issues. One approach we could take for this is to shard its central mutex. I've done so bluntly in nvanbenschoten@46e416e. That patch not only eliminates the performance regression, but it also allows us to hit a throughput above the baseline, around 80k qps on this test.

I'll continue testing and looking into what we can do now to close the gap.
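For reference, a bare-bones sketch of what sharding the scheduler's central mutex by range ID can look like (illustrative names and structure only, not the actual prototype in nvanbenschoten@46e416e):

```go
package raftsched

import "sync"

type rangeID uint64

// schedulerShard owns a slice of the ranges and its own mutex, so contention
// is limited to the ranges that hash to this shard.
type schedulerShard struct {
	mu    sync.Mutex
	queue []rangeID
}

type shardedScheduler struct {
	shards []*schedulerShard
}

func newShardedScheduler(numShards int) *shardedScheduler {
	s := &shardedScheduler{shards: make([]*schedulerShard, numShards)}
	for i := range s.shards {
		s.shards[i] = &schedulerShard{}
	}
	return s
}

// enqueue routes a range to a fixed shard, so concurrent callers working on
// different ranges mostly lock different mutexes instead of one global one.
func (s *shardedScheduler) enqueue(id rangeID) {
	shard := s.shards[int(uint64(id)%uint64(len(s.shards)))]
	shard.mu.Lock()
	shard.queue = append(shard.queue, id)
	shard.mu.Unlock()
}
```

Workers would similarly be pinned to shards so that each worker only contends on its own shard's mutex.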
To verify this, I added some instrumentation to raft scheduler interactions and re-ran the test. Here are the results:
Over a period of about a minute, we saw 18,688,214 raft scheduler "enqueue" calls. There were 4 prominent callers:
I'll update the profiling to give us information about the distribution of callers of these enqueue calls.
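As an illustration of the kind of profiling being described, caller attribution on the enqueue path could be done with something like the following (hypothetical instrumentation, not the actual patch):

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

var (
	mu           sync.Mutex
	enqueueTotal int64
	callerCounts = map[string]int64{}
)

// recordEnqueue is called at the top of the scheduler's enqueue path and
// attributes the call to the function two frames up (the enqueue caller).
func recordEnqueue() {
	name := "unknown"
	if pc, _, _, ok := runtime.Caller(2); ok {
		if fn := runtime.FuncForPC(pc); fn != nil {
			name = fn.Name()
		}
	}
	mu.Lock()
	enqueueTotal++
	callerCounts[name]++
	mu.Unlock()
}

func enqueue(rangeID uint64) {
	recordEnqueue()
	_ = rangeID // actual scheduling work elided
}

func main() {
	enqueue(1)
	mu.Lock()
	defer mu.Unlock()
	fmt.Println("total enqueues:", enqueueTotal)
	for caller, n := range callerCounts {
		fmt.Printf("%6.1f%%  %s\n", 100*float64(n)/float64(enqueueTotal), caller)
	}
}
```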
With the different callers of these enqueue calls broken down, the distribution looks like this:
From here, we can see that of the 43.9% of all uses of the Raft scheduler from this path, a large share comes from the synchronously handled `MsgStorageApply` responses that nvanbenschoten@8949816 eliminates: 17.8% of all enqueue calls, or roughly 40% of that 43.9%. Put differently, nvanbenschoten@8949816 reduces the use of the raft scheduler by 17.8%. Seems worthwhile.
Informs cockroachdb#96800. This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In cockroachdb#96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for 17.8% of raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler is yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions.

```
name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32   51.5k ± 2%   57.3k ± 3%  +11.18%  (p=0.008 n=5+5)

name                          old p50(ms)  new p50(ms)  delta
kv0/enc=false/nodes=3/cpu=32    2.40 ± 0%    2.10 ± 5%  -12.50%  (p=0.000 n=4+5)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    11.0 ± 0%     9.6 ± 4%  -12.36%  (p=0.000 n=4+5)
```

Release note: None
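The shape of the change described in this commit can be sketched like this (hypothetical types and fields; the real logic lives in the replica Raft processing code):

```go
package raftsched

// raftProcessor is an illustrative stand-in for a replica's Raft processing
// state; it is not the CockroachDB implementation.
type raftProcessor struct {
	onSchedulerGoroutine bool     // true for the duration of a scheduler pass
	localMsgs            []string // pending local messages to flush
	signal               func()   // enqueues this range on the raft scheduler
}

// deliverLocalMsg appends a local message and only signals the scheduler if
// we are not already running on it. When we are on a scheduler goroutine, the
// pending messages will be flushed before the pass yields, so an extra
// enqueue would only add contention on the scheduler mutex.
func (p *raftProcessor) deliverLocalMsg(msg string) {
	p.localMsgs = append(p.localMsgs, msg)
	if p.onSchedulerGoroutine {
		return
	}
	p.signal()
}
```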
98259: roachtest: add an option to create nodes with low memory per core r=lidorcarmel a=lidorcarmel

Currently in roachtest we can create VMs with a default amount of RAM per core, or we can request highmem machines. This patch adds an option to ask for highcpu (lower RAM per core) machines. Previously, on GCE:
- Creating a VM with <= 16 cores used standard memory per core (~4GB).
- Creating a VM with more cores used 'highcpu' nodes (~1GB per core).
- Using the `HighMem` option created a VM with high memory per core (~6.5GB).
- There was no option to create, for example, a 32 core VM with standard memory (~4GB) - only low mem or high mem.

With this patch the test writer can pick any low/standard/high memory per core ratio. The initial need is to run restore performance benchmarks with reduced memory to verify nodes don't OOM in that environment. Running with n1-highcpu-8 is not ideal, but nodes should not OOM. We can also, with this change, run with 32 cores and 'standard' RAM (n1-standard-32).

Epic: none
Release note: None

98389: delegate: make actual number of annotations used by delegator r=msirek a=msirek

This updates the delegator, which parses textual SQL statements that represent specific DDL statements, so that the `Annotations` slice allocated in `planner.semaCtx` matches the actual number of annotations built during parsing (if the delegator successfully built a statement).

Fixes: #97362
Release note (bug fix): Rare internal errors in SHOW JOBS statements which have a WITH clause are fixed.

98402: roachtest: make 'testImpl.Skip/f' behave consistently with 'TestSpec.Skip' r=smg260 a=smg260

This will show tests skipped via t.Skip() as ignored in the TC UI, with the caveat that it will show the test as having run twice. See inline comment for details.

Resolves: #96351
Release note: None
Epic: None

98648: roachtest: scale kv concurrency inversely with batch size r=kvoli a=nvanbenschoten

Informs #96800. This commit updates the kv roachtest suite to scale the workload concurrency inversely with batch size. This ensures that we account for the cost of each operation, recognizing that the system can handle fewer concurrent operations as the cost of each operation grows. We currently have three kv benchmark variants that set a non-default batch size:
- `kv0/enc=false/nodes=3/batch=16`
- `kv50/enc=false/nodes=4/cpu=96/batch=64`
- `kv95/enc=false/nodes=3/batch=16`

Without this change, these tests badly overload their clusters. This leads to p50 latencies in the 300-400ms range, about 10x greater than the corresponding p50 in a non-overloaded cluster. In this degraded regime, performance is unstable and non-representative. Customers don't run clusters at this level of overload and maximizing throughput once we hit the latency cliff is no longer a goal. By reducing the workload concurrency to avoid overload, we reduce throughput by about 10% and reduce p50 and p99 latency by about 90%.

Release note: None

98806: changefeedccl: add WITH key_column option r=[jayshrivastava] a=HonoreDB

Changefeeds running on an outbox table see a synthetic primary key that isn't useful for downstream partitioning. This PR adds an encoder option to use a different column as the key, not in internal logic, but only in message metadata. This breaks end-to-end ordering because we're only ordered with respect to the actual primary key, and the sink will only order with respect to the key we emit. We therefore require the unordered flag here.

Closes #54461.
Release note (enterprise change): Added the WITH key_column option to override the key used in message metadata. This changes the key hashed to determine Kafka partitions. It does not affect the output of key_in_value or the domain of the per-key ordering guarantee.

98824: ci: create bazel-fips docker image r=healthy-pod a=rail

Previously, we used the cockroachdb/bazel image to build and run our tests. In order to run FIPS tests, the image has to use particular FIPS-enabled packages. This PR adds a new bazelbuilder image that uses FIPS-compliant packages.
* Added `build/.bazelbuilderversion-fips` file.
* Added `run_bazel_fips` wrapper.
* The image builder script uses `dpkg-repack` to reproduce the same packages.

Epic: DEVINF-478
Release note: None

98856: jobspb: mark the key visualizer job as automatic r=zachlite a=zachlite

Epic: None
Release note: None

Co-authored-by: Lidor Carmel <lidor@cockroachlabs.com>
Co-authored-by: Mark Sirek <sirek@cockroachlabs.com>
Co-authored-by: Miral Gadani <miral@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Aaron Zinger <zinger@cockroachlabs.com>
Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: zachlite <zachlite@gmail.com>
98833: kv: don't signal raft scheduler when already on scheduler goroutine r=erikgrinaker a=nvanbenschoten

Informs #96800. This commit avoids unnecessary interactions with the Raft scheduler as a way to relieve pressure on the scheduler's mutex. In #96800 (comment), we found that calls into the scheduler in response to synchronously handled `MsgStorageApply`s were responsible for **17.8%** of raftScheduler enqueue calls. These enqueue calls were unnecessary because they were performed while already running on the raft scheduler, so the corresponding local messages were already going to be flushed and processed before the scheduler is yielded. By recognizing when we are in these situations, we can omit the unnecessary scheduler interactions.

```
# benchmarking on AWS with instance stores
# see #96800 for why this has little to no effect on GCP or AWS with EBS
name                          old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=32   51.5k ± 2%   57.3k ± 3%  +11.18%  (p=0.008 n=5+5)

name                          old p50(ms)  new p50(ms)  delta
kv0/enc=false/nodes=3/cpu=32    2.40 ± 0%    2.10 ± 5%  -12.50%  (p=0.000 n=4+5)

name                          old p99(ms)  new p99(ms)  delta
kv0/enc=false/nodes=3/cpu=32    11.0 ± 0%     9.6 ± 4%  -12.36%  (p=0.000 n=4+5)
```

Release note: None

Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
98854: kvserver: shard Raft scheduler r=erikgrinaker a=erikgrinaker

The Raft scheduler mutex can become very contended on machines with many cores and high range counts. This patch shards the scheduler by allocating ranges and workers to individual scheduler shards. By default, we create a new shard for every 16 workers, and distribute workers evenly. We spin up 8 workers per CPU core, capped at 96, so 16 is equivalent to 2 CPUs per shard, or a maximum of 6 shards. This significantly relieves contention at high core counts, while also avoiding starvation by excessive sharding. The shard size can be adjusted via `COCKROACH_SCHEDULER_SHARD_SIZE`.

This results in a substantial performance improvement on high-CPU nodes:

```
name                                    old ops/sec  new ops/sec  delta
kv0/enc=false/nodes=3/cpu=4             7.71k ± 5%   7.93k ± 4%      ~      (p=0.310 n=5+5)
kv0/enc=false/nodes=3/cpu=8             15.6k ± 3%   14.8k ± 7%      ~      (p=0.095 n=5+5)
kv0/enc=false/nodes=3/cpu=32            43.4k ± 2%   45.0k ± 3%    +3.73%   (p=0.032 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws  40.5k ± 2%   61.7k ± 1%   +52.53%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce  35.6k ± 4%   44.5k ± 0%   +24.99%   (p=0.008 n=5+5)

name                                    old p50      new p50      delta
kv0/enc=false/nodes=3/cpu=4              21.2 ± 6%    20.6 ± 3%      ~      (p=0.397 n=5+5)
kv0/enc=false/nodes=3/cpu=8              10.5 ± 0%    11.2 ± 8%      ~      (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=32             4.16 ± 1%    4.00 ± 5%      ~      (p=0.143 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   3.00 ± 0%    2.00 ± 0%      ~      (p=0.079 n=4+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   4.70 ± 0%    4.10 ± 0%   -12.77%   (p=0.000 n=5+4)

name                                    old p95      new p95      delta
kv0/enc=false/nodes=3/cpu=4              61.6 ± 5%    60.8 ± 3%      ~      (p=0.762 n=5+5)
kv0/enc=false/nodes=3/cpu=8              28.3 ± 4%    30.4 ± 0%    +7.34%   (p=0.016 n=5+4)
kv0/enc=false/nodes=3/cpu=32             7.98 ± 2%    7.60 ± 0%    -4.76%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   12.3 ± 2%     6.8 ± 0%   -44.72%   (p=0.000 n=5+4)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   10.2 ± 3%     8.9 ± 0%   -12.75%   (p=0.000 n=5+4)

name                                    old p99      new p99      delta
kv0/enc=false/nodes=3/cpu=4              89.8 ± 7%    88.9 ± 6%      ~      (p=0.921 n=5+5)
kv0/enc=false/nodes=3/cpu=8              46.1 ± 0%    48.6 ± 5%    +5.47%   (p=0.048 n=5+5)
kv0/enc=false/nodes=3/cpu=32             11.5 ± 0%    11.0 ± 0%    -4.35%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=aws   14.0 ± 3%    12.1 ± 0%   -13.32%   (p=0.008 n=5+5)
kv0/enc=false/nodes=3/cpu=96/cloud=gce   14.3 ± 3%    13.1 ± 0%    -8.55%   (p=0.000 n=4+5)
```

The basic cost of enqueueing ranges in the scheduler (without workers or contention) only increases slightly in absolute terms, thanks to `raftSchedulerBatch` pre-sharding the enqueued ranges:

```
name                                                                 old time/op  new time/op  delta
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=8-24          457ns ± 2%   564ns ± 2%   +23.36%  (p=0.001 n=7+7)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=16-24         461ns ± 3%   563ns ± 2%   +22.14%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=32-24         459ns ± 2%   591ns ± 2%   +28.63%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=64-24         455ns ± 0%   776ns ± 5%   +70.60%  (p=0.001 n=6+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=10/workers=128-24        456ns ± 2%  1058ns ± 1%  +132.13%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=8-24     7.15ms ± 1%  8.18ms ± 1%   +14.33%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=16-24    7.13ms ± 1%  8.18ms ± 1%   +14.77%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=32-24    7.12ms ± 2%  7.86ms ± 1%   +10.30%  (p=0.000 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=64-24    7.20ms ± 1%  7.11ms ± 1%    -1.27%  (p=0.001 n=8+8)
SchedulerEnqueueRaftTicks/collect=true/ranges=100000/workers=128-24   7.12ms ± 2%  7.16ms ± 3%       ~    (p=0.721 n=8+8)
```

Furthermore, on an idle 24-core 3-node cluster with 50,000 unquiesced ranges, this reduced CPU usage from 12% to 10%.

Resolves #98811.
Touches #96800.

Hat tip to `@nvanbenschoten` for the initial prototype.

Epic: none
Release note (performance improvement): The Raft scheduler is now sharded to relieve contention during range Raft processing, which can significantly improve performance at high CPU core counts.

Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
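For clarity, the shard-count arithmetic described above works out as follows (constants taken from the PR text; the helper itself is illustrative, not the actual code):

```go
package raftsched

// numShards reproduces the arithmetic in the PR description: 8 workers per
// vCPU capped at 96, with one shard per shardSize workers (16 by default,
// adjustable via COCKROACH_SCHEDULER_SHARD_SIZE).
func numShards(vcpus, shardSize int) int {
	workers := 8 * vcpus
	if workers > 96 {
		workers = 96
	}
	shards := (workers + shardSize - 1) / shardSize // round up
	if shards < 1 {
		shards = 1
	}
	return shards
}

// numShards(32, 16) == 6: a 32 vCPU node caps out at 96 workers, giving the
// "maximum of 6 shards" mentioned above.
```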
`kv0/enc=false/nodes=3/cpu=32` and a few related variants regressed on Feb 4th. This regression is only observed on AWS and not on GCE. It's also only observed on the 32 vCPU variant and not on the 8 vCPU variant.

I've determined that this was a result of #94165. When I run the benchmark and switch the `kv.raft_log.non_blocking_synchronization.enabled` cluster setting midway through to disable async storage writes, throughput increases. Interestingly, log commit latency also drops.
One possibility is that fsyncs on AWS instance stores (with `nobarrier`) are fast enough that we are exceeding the Pebble `SyncConcurrency` of 512. This would cause the async log writes to become synchronous. I don't know why this is worse than the non-async storage write configuration. One thought is that we might be observing the overhead of the asynchronous write path (two goroutine hops) without the benefit (because we're still blocking before entering it), and so we see the throughput regression. Another thought is that async storage writes trade reduced interference between writes on the same range for reduced batching of writes on the same range. In a system where the p50 fsync latency is 0.10ms, is this the right trade-off?
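To illustrate how exceeding a sync-concurrency limit turns a nominally asynchronous write into a blocking one, here is a generic bounded-semaphore sketch (the 512 figure comes from the discussion above; the rest is illustrative and not Pebble's implementation):

```go
package raftsched

// syncLimiter bounds the number of in-flight log syncs. Once the limit is
// reached, callers of acquire block, so an "async" write ends up waiting
// just as a synchronous one would.
type syncLimiter struct {
	slots chan struct{}
}

func newSyncLimiter(limit int) *syncLimiter {
	return &syncLimiter{slots: make(chan struct{}, limit)}
}

func (l *syncLimiter) acquire() { l.slots <- struct{}{} } // blocks when full
func (l *syncLimiter) release() { <-l.slots }

// Example: with newSyncLimiter(512), the 513th concurrent log sync blocks in
// acquire until an earlier sync completes and calls release.
```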
Next steps:

- Experiment with the Pebble `SyncConcurrency` limit to see whether it explains the regression.
- Re-run the benchmark with `nobarrier` disabled.

Jira issue: CRDB-24341