kvserver: benchmark different cpu load based split thresholds #96869

Closed
Tracked by #90582
kvoli opened this issue Feb 9, 2023 · 2 comments · Fixed by #97113

kvoli commented Feb 9, 2023

#96128 Adds support for splitting a range once its leaseholder replica uses more CPU than kv.range_split.load_cpu_threshold. The default value of kv.range_split.load_cpu_threshold is 250ms of CPU use per second, or 1/4 of a CPU core.
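As a rough illustration of what the setting controls (a minimal sketch only; the constant and function names below are mine, not the actual kvserver code), the comparison being tuned is simply the leaseholder's per-second CPU use against the threshold:

```go
package main

import (
	"fmt"
	"time"
)

// loadCPUThreshold stands in for kv.range_split.load_cpu_threshold: the CPU
// per second a range's leaseholder may use before the range becomes a
// candidate for a load-based split. 250ms is the default discussed here.
const loadCPUThreshold = 250 * time.Millisecond

// shouldSplitOnCPU sketches the decision this issue benchmarks; it is not the
// kvserver implementation, just the comparison the setting controls.
func shouldSplitOnCPU(leaseholderCPUPerSec time.Duration) bool {
	return leaseholderCPUPerSec > loadCPUThreshold
}

func main() {
	for _, cpu := range []time.Duration{100 * time.Millisecond, 400 * time.Millisecond} {
		fmt.Printf("cpu/s=%v split candidate=%v\n", cpu, shouldSplitOnCPU(cpu))
	}
}
```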

This issue is to benchmark performance with different kv.range_split.load_cpu_threshold values set. The results should then inform a default value.

More specifically, benchmark ycsb, kv0, kv95 on three nodes and bisect a value that achieves the highest throughput.

The current value was selected by observing the performance of the cluster from a rebalancing perspective. The specific criterion was to constrain the occurrences of a store being overfull relative to the mean while not having any actions available to resolve being overfull. When running TPCE (50k), CPU splitting with a 250ms threshold performed 1 load-based split, whilst QPS splitting (2500) performed 12.5.

When running the allocbench/*/kv roachtest suite, CPU splitting (250ms) tended to make 33-100% more load-based splits than QPS splitting (2500) on workloads involving reads (usually large scans), whilst on the write-heavy workloads the number of load-based splits was equally low.

Here's a comparison of splits running TPCE between master (qps splits) and this branch with 250ms:

[image: TPCE load-based split comparison]

The same for allocbench (5 runs of each type; order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew):

[image: allocbench load-based split comparison]

Jira issue: CRDB-24382

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. labels Feb 9, 2023
@kvoli kvoli added this to the 23.1 milestone Feb 9, 2023
@kvoli kvoli self-assigned this Feb 9, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 9, 2023

kvoli commented Feb 13, 2023

I ran kv(0|95)/(seq|splt=0) to see how different splitting thresholds performed w.r.t. max throughput. The results are shown below; I included the maximum and the minimum value over the last 2 weeks from roachperf for comparison.

| ops/s | 100ms | 250ms | 500ms | 750ms | 1000ms | 1500ms | roachperf (qps=2.5k) min | roachperf (qps=2.5k) max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nodes=3/cpu=3/kv0/splt=0 | 42526.8 | 44038.4 | 46860.4 | 44694.6 | 48016.4 | 47812.8 | 40299.5 | 43131.8 |
| nodes=3/cpu=3/kv95/splt=0 | 89307.2 | 100888.4 | 105332.3 | 107000.2 | 99648.4 | 103244.4 | 95400.1 | 106257 |
| nodes=3/cpu=3/kv0/seq | 26099.3 | 24744.1 | 25412.2 | 25784 | 25543.9 | 24983.5 | 25123 | 22758 |
| nodes=3/cpu=3/kv95/seq | 102744.5 | 101782.5 | 108302 | 111727.7 | 108290.2 | 108854.5 | 100061 | 110822.7 |

image

Most values fall within the range seen on roachperf within the last two weeks. The current default is 250ms.

| Load Splits | 100ms | 250ms | 500ms | 750ms | 1000ms | 1500ms |
| --- | --- | --- | --- | --- | --- | --- |
| nodes=3/cpu=3/kv0/splt=0 | 107 | 38 | 22 | 15 | 12 | 7 |
| nodes=3/cpu=3/kv95/splt=0 | 153 | 63 | 32 | 25 | 17 | 12 |
| nodes=3/cpu=3/kv0/seq | 64 | 54 | 50 | 46 | 49 | 65 |
| nodes=3/cpu=3/kv95/seq | 465 | 173 | 107 | 83 | 64 | 44 |

image

Notes:

  • With the split threshold set to 100ms, the kv95/splt=0 workload saw persistent thrashing throughout the benchmark run (image). This suggests that a threshold of 100ms is too low, as it destabilizes the CPU time recorded on replicas while persistent splitting occurs.
  • On the kv95/splt=0 workload, the 750ms and 1000ms split threshold clusters didn't run out of rebalance options, whilst the other clusters did. All eventually reached a steady state except 100ms (image).
  • The kv0/seq workload is throttled by the workload runner being sequential. No resources were saturated, so overall this is a less interesting workload than kv95/seq.
  • Similar to the kv95/splt=0 workload, kv95/seq saw thrashing throughout with the 100ms and 250ms split thresholds. The other clusters did not thrash, however they also did not balance CPU usage (except 750ms) (image).

Follow up questions and actions:

  • Run allocbench with 250ms, 500ms and 750ms. allocbench uses 8 CPUs rather than 32, so this would validate that different thresholds are also okay for rebalancing with small CPU counts.
  • In the kv95/seq workload, the replica CPU usage appeared balanced but the actual CPU utilization was imbalanced for every threshold other than 100ms and 250ms (which thrashed, so are not necessarily better). Why? (image)


kvoli commented Feb 14, 2023

I ran allocbench on master and with cpu balancing enabled, using different load split thresholds: 250ms, 500ms, 750ms and 1000ms. The results are shown below.

| CPU (Max-Min) | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 18.37 | 21.92 | 22.8 | 7.7 | 44.22 |
| 250ms | 15.6 | 17.73 | 14.77 | 5.12 | 8.32 |
| 500ms | 15.29 | 12.36 | 15.59 | 8.47 | 12.62 |
| 750ms | 13.39 | 14.32 | 13.65 | 11.93 | 11.38 |
| 1000ms | 14.75 | 12.77 | 12.45 | 10.84 | 10.65 |

image

| Write (Max-Min) | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 45.35 | 43.3 | 47.15 | 1.49 | 1.1 |
| 250ms | 41.66 | 51.28 | 41.63 | 1.78 | 1.56 |
| 500ms | 38.48 | 42.92 | 41.58 | 2.38 | 1.64 |
| 750ms | 38.37 | 52.89 | 51.59 | 2.23 | 1.16 |
| 1000ms | 36.68 | 37.23 | 46.68 | 1.83 | 1.79 |

image

Standard Deviation

| Write stddev | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 2.25 | 4.55 | 4.73 | 0.19 | 0.44 |
| 250ms | 5.7 | 8.66 | 4.55 | 0.27 | 0.12 |
| 500ms | 2.5 | 4.43 | 9.31 | 0.2 | 0.25 |
| 750ms | 1.3 | 9.22 | 4.92 | 0.28 | 0.38 |
| 1000ms | 6.38 | 2.86 | 8.03 | 0.4 | 0.26 |

| CPU stddev | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 1.75 | 3.01 | 3.6 | 2.16 | 5.4 |
| 250ms | 1.75 | 3.9 | 1.83 | 0.26 | 1.97 |
| 500ms | 1.15 | 2.22 | 2.88 | 1.33 | 2.56 |
| 750ms | 0.64 | 2.54 | 1.79 | 1.59 | 4.28 |
| 1000ms | 2.38 | 2.49 | 3.94 | 4.4 | 2.15 |

image

Notes

  • allocbench pre-splits the workload, so there is not an extreme difference in results between clusters where CPU balancing is enabled (image).
  • A higher split threshold resulted in a higher frequency of exhausted rebalancing options. This is where the store identifies that it is overfull w.r.t. the mean but is unable to shed any load because there are no actions available. It tends to occur when the only ranges available to act on are too "large" and would push the recipient to be overfull (image). Despite this, the higher split thresholds performed well enough (<15% cpu imbalance) (image).
  • Thresholds of 750ms and 1000ms are likely too high, despite the similar performance. There were more signs of thrashing (a higher number of lease transfers) with a higher threshold in the r=50/ops=skew benchmark (1000ms image vs 250ms image).
  • The rebalancing snapshot cost (bytes) did not show a noteworthy difference between the clusters.

Longer term, increasing the threshold or removing it entirely in favor of the allocator splitting when necessary seems like the best strategy. For now, the current threshold of 250ms should be increased to 500ms: the benchmarks showed a moderate positive impact on throughput in kv0/95 and no significant downside on allocbench.

craig bot pushed a commit that referenced this issue Feb 21, 2023
96107: build,ci,release: add Linux x86_64 FIPS build r=rickystewart a=rail

* Added a script to build [golang-fips](https://github.com/golang-fips/go). The resulting Go version has a `fips` suffix, so we can easily select it in Bazel.
* Added a new Bazel toolchain and build config used by the FIPS build only.
* The FIPS docker image has FIPS-compliant `openssl` installed.
* Added a script to run FIPS build in CI.
* Tweaked one of the regexes to be less greedy.
* Added the platform to the release process.

Epic: DEVINF-478
Fixes: DEVINF-634
Fixes: DEVINF-633

Release note (build change): As a part of this change we start
publishing a FIPS compliant tarball and docker image for the Linux
x86_64 platform. The build uses OpenSSL libraries for crypto operations
by `dlopen`ing the corresponding dynamic libraries.

97113: kvserver: increase cpu lb split threshold to 500ms r=nvanbenschoten a=kvoli

This commit increases the default value of
`kv.range_split.load_cpu_threshold` to `500ms` from the previous default value of `250ms`. Experiments detailed in #96869 showed moderate performance gains between `500ms` and `250ms`, without a significant downside on rebalancing.

Resolves: #96869

Release note (ops change): The default value of
`kv.range_split.load_cpu_threshold` is increased from `250ms` to `500ms`. This threshold declares the CPU per-second value above which a range will be split. The value was selected based on performance experiments.

97359: upgrades: fix race condition inside TestUpgradeSchemaChangerElements r=fqazi a=fqazi

Previously, this test non-atomically updated the job ID, which could lead to race conditions if jobs queries ran for any reason other than the expected one. This was inadequate because the race detector would detect issues with this value being modified. To address this, this patch only ever stores/loads the job ID atomically.
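As a minimal sketch of the store/load-atomically pattern described above (illustrative only, not the actual test code), sharing the job ID through an atomic value removes the data race between the writer and any concurrent reader:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var jobID atomic.Int64 // stored/loaded atomically instead of a plain int64

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer: the hook updating the expected job ID
		defer wg.Done()
		jobID.Store(42)
	}()
	go func() { // reader: any other jobs query observing the value
		defer wg.Done()
		_ = jobID.Load()
	}()
	wg.Wait()
	fmt.Println("final job ID:", jobID.Load())
}
```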

Fixes: #97284

Release note: None

97398: storage: add guardrails to rocksdb.min_wal_sync_interval r=nvanbenschoten a=nvanbenschoten

This commit prevents `rocksdb.min_wal_sync_interval` from being set to a negative value or from being set to a value above 1s. This prevents the cluster setting from being used to break node liveness and put a cluster into an unrecoverable state.

Release note: None
Epic: None

97405: ui: update cluster-ui to 23.1.0-prerelease.4 r=ericharmeling a=ericharmeling

This commit updates the cluster-ui version to 23.1.0-prerelease.4.

Release note: None

Epic: None

Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
@craig craig bot closed this as completed in 9bb2043 Feb 21, 2023
craig bot pushed a commit that referenced this issue Mar 9, 2023
97717: multitenant: add AdminUnsplitRequest capability r=knz a=ecwall

Fixes #97716

1) Add a new `tenantcapabilitiespb.CanAdminUnsplit` capability.
2) Check capability in `Authorizer.HasCapabilityForBatch`.
3) Remove `ExecutorConfig.RequireSystemTenant` call from
   `execFactory.ConstructAlterTableUnsplit`,
   `execFactory.ConstructAlterTableUnsplitAll`.

Release note: None


98250: kvserver: add minimum cpu lb split threshold r=andrewbaptist a=kvoli

Previously, `kv.range_split.load_cpu_threshold` had no minimum setting value. It is undesirable to allow users to set this setting too low, as excessive splitting may occur.

`kv.range_split.load_cpu_threshold` now has a minimum setting value of `10ms`.
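A minimal sketch of what such a floor looks like (hypothetical names; not the actual settings-validation code in cockroachdb/cockroach):

```go
package main

import (
	"fmt"
	"time"
)

// minLoadCPUThreshold mirrors the 10ms floor described above: values below it
// are rejected so that a misconfiguration cannot trigger excessive splitting.
const minLoadCPUThreshold = 10 * time.Millisecond

// validateLoadCPUThreshold is a hypothetical stand-in for the setting's
// validation hook.
func validateLoadCPUThreshold(v time.Duration) error {
	if v < minLoadCPUThreshold {
		return fmt.Errorf("kv.range_split.load_cpu_threshold must be at least %s, got %s",
			minLoadCPUThreshold, v)
	}
	return nil
}

func main() {
	fmt.Println(validateLoadCPUThreshold(5 * time.Millisecond))   // rejected
	fmt.Println(validateLoadCPUThreshold(500 * time.Millisecond)) // ok: <nil>
}
```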

See #96869 for additional context on the threshold.

Resolves: #98107

Release note (ops change): `kv.range_split.load_cpu_threshold` now has a minimum setting value of `10ms`.

98270: dashboards: add replica cpu to repl dashboard r=xinhaoz a=kvoli

In #96127 we added the option to load balance replica CPU instead of QPS across
stores in a cluster. It is desirable to view the signal being controlled for
rebalancing in the replication dashboard, similar to QPS.

This PR adds the `rebalancing.cpunanospersecond` metric to the replication
metrics dashboard.

The avg QPS graph on the replication dashboard previously described the metric as
"Exponentially weighted average", however this is not true.

This PR updates the description to just be "moving average", which is accurate.
Note that follow-the-workload does use an exponentially weighted value, however
the metric in the dashboard is not the same.

This PR also updates the graph header to include Replica in the title: "Average
Replica Queries per Node". QPS is specific to replicas. This is already
mentioned in the description.

Resolves: #98109


98289: storage: mvccExportToWriter should always return buffered range keys r=adityamaru a=stevendanna

In #96691, we changed the return type of mvccExportToWriter such that it now indicates when a CPU limit has been reached. As part of that change, when the CPU limit was reached, we would immediately `return` rather than `break` out of the loop. This introduced a bug, since the code after the loop that the `break` was taking us to is important. Namely, we may have previously buffered range keys that we still need to write into our response. By replacing the break with a return, these range keys were lost.

The end-user impact of this is that a BACKUP that _ought_ to have included range keys (such as a backup of a table with a rolled back IMPORT) would not include those range keys and thus would end up resurrecting deleted keys upon restore.

This PR brings back the `break` and adds a test that covers exporting with range keys under CPU exhaustion.
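To make the control flow concrete, here is a toy sketch (not the real mvccExportToWriter) of why a `return` in place of `break` drops the buffered range keys: only `break` reaches the post-loop flush.

```go
package main

import "fmt"

// exportKeys iterates point keys while buffering range keys, flushing the
// buffer only after the loop, mirroring the structure described above.
func exportKeys(keys []string, cpuLimitAt int) []string {
	var out []string
	var bufferedRangeKeys []string
	for i, k := range keys {
		if i == cpuLimitAt {
			// A `return out` here would skip the flush below and silently
			// lose bufferedRangeKeys; `break` preserves it.
			break
		}
		bufferedRangeKeys = append(bufferedRangeKeys, k+"@rangekey")
		out = append(out, k)
	}
	// Post-loop flush that the early `return` bypassed.
	out = append(out, bufferedRangeKeys...)
	return out
}

func main() {
	fmt.Println(exportKeys([]string{"a", "b", "c"}, 2))
}
```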

This bug only ever existed on master.

Informs #97694

Epic: none

Release note: None

98329: sql: fix iteration conditions in crdb_internal.scan r=ajwerner a=stevendanna

Rather than using the Next() key of the last key in the response when iterating, we should use the resume span. The previous code could result in a failure in the rare case that the end key of our scan exactly matched the successor key of the very last key in the iteration.
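As a rough sketch of the distinction (hypothetical helper names; not the crdb_internal.scan code), the client should continue each page from the server-provided resume key rather than computing a successor of the last returned key:

```go
package main

import "fmt"

// page is a hypothetical paginated-scan result.
type page struct {
	keys      []string
	resumeKey string // empty when the scan is complete
}

// scanPage is a hypothetical KV call returning at most `limit` keys in [start, end)
// from sorted data, reporting where to resume if it stopped early.
func scanPage(data []string, start, end string, limit int) page {
	var p page
	for _, k := range data {
		if k < start || k >= end {
			continue
		}
		if len(p.keys) == limit {
			p.resumeKey = k // the server says exactly where to pick up
			return p
		}
		p.keys = append(p.keys, k)
	}
	return p
}

func scanAll(data []string, start, end string) []string {
	var out []string
	for {
		p := scanPage(data, start, end, 2)
		out = append(out, p.keys...)
		if p.resumeKey == "" {
			return out
		}
		start = p.resumeKey // use the resume span, not the Next() of the last key
	}
}

func main() {
	fmt.Println(scanAll([]string{"a", "b", "c", "d"}, "a", "e"))
}
```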

Epic: none

Release note: None

Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>