kvserver: benchmark different cpu load based split thresholds #96869

Closed
Tracked by #90582
kvoli opened this issue Feb 9, 2023 · 2 comments · Fixed by #97113

kvoli commented Feb 9, 2023

#96128 Adds support for splitting a range once its leaseholder replica uses more CPU than kv.range_split.load_cpu_threshold. The default value of kv.range_split.load_cpu_threshold is 250ms of CPU use per second, or 1/4 of a CPU core.
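As a rough illustration of what the setting controls (a minimal sketch only; the constant and function names below are mine, not the actual kvserver code), the comparison being tuned is simply the leaseholder's per-second CPU use against the threshold:

```go
package main

import (
	"fmt"
	"time"
)

// loadCPUThreshold stands in for kv.range_split.load_cpu_threshold: the CPU
// per second a range's leaseholder may use before the range becomes a
// candidate for a load-based split. 250ms is the default discussed here.
const loadCPUThreshold = 250 * time.Millisecond

// shouldSplitOnCPU sketches the decision this issue benchmarks; it is not the
// kvserver implementation, just the comparison the setting controls.
func shouldSplitOnCPU(leaseholderCPUPerSec time.Duration) bool {
	return leaseholderCPUPerSec > loadCPUThreshold
}

func main() {
	for _, cpu := range []time.Duration{100 * time.Millisecond, 400 * time.Millisecond} {
		fmt.Printf("cpu/s=%v split candidate=%v\n", cpu, shouldSplitOnCPU(cpu))
	}
}
```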

This issue is to benchmark performance with different kv.range_split.load_cpu_threshold values set. The results should then inform a default value.

More specifically, benchmark ycsb, kv0, kv95 on three nodes and bisect a value that achieves the highest throughput.

The current value was selected by observing the performance of the cluster from a rebalancing perspective. The specific criterion was to constrain the occurrences of a store being overfull relative to the mean while not having any actions available to resolve being overfull. When running TPCE (50k), CPU splitting with a 250ms threshold performed 1 load-based split, whilst QPS splitting (2500) performed 12.5.

When running the allocbench/*/kv roachtest suite, CPU splitting (250ms) tended to make 33-100% more load-based splits than QPS splitting (2500) on workloads involving reads (usually large scans), whilst on the write-heavy workloads the number of load-based splits was equally low.

Here's a comparison of splits running TPCE between master (qps splits) and this branch with 250ms:

[image: TPCE load-based split comparison]

The same for allocbench (5 runs of each type; order is r=0/access=skew, r=0/ops=skew, r=50/ops=skew, r=95/access=skew, r=95/ops=skew):

[image: allocbench load-based split comparison]

Jira issue: CRDB-24382

@kvoli kvoli added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-distribution Relating to rebalancing and leasing. labels Feb 9, 2023
@kvoli kvoli added this to the 23.1 milestone Feb 9, 2023
@kvoli kvoli self-assigned this Feb 9, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Feb 9, 2023

kvoli commented Feb 13, 2023

I ran kv(0|95)/(seq|splt=0) to see how different splitting thresholds performed w.r.t. max throughput. The results are shown below; I included the maximum and the minimum value over the last 2 weeks from roachperf for comparison.

| ops/s | 100ms | 250ms | 500ms | 750ms | 1000ms | 1500ms | roachperf (qps=2.5k) min | roachperf (qps=2.5k) max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nodes=3/cpu=3/kv0/splt=0 | 42526.8 | 44038.4 | 46860.4 | 44694.6 | 48016.4 | 47812.8 | 40299.5 | 43131.8 |
| nodes=3/cpu=3/kv95/splt=0 | 89307.2 | 100888.4 | 105332.3 | 107000.2 | 99648.4 | 103244.4 | 95400.1 | 106257 |
| nodes=3/cpu=3/kv0/seq | 26099.3 | 24744.1 | 25412.2 | 25784 | 25543.9 | 24983.5 | 25123 | 22758 |
| nodes=3/cpu=3/kv95/seq | 102744.5 | 101782.5 | 108302 | 111727.7 | 108290.2 | 108854.5 | 100061 | 110822.7 |

image

Most values fall within the range seen on roachperf within the last two weeks. The current default is 250ms.

| Load Splits | 100ms | 250ms | 500ms | 750ms | 1000ms | 1500ms |
| --- | --- | --- | --- | --- | --- | --- |
| nodes=3/cpu=3/kv0/splt=0 | 107 | 38 | 22 | 15 | 12 | 7 |
| nodes=3/cpu=3/kv95/splt=0 | 153 | 63 | 32 | 25 | 17 | 12 |
| nodes=3/cpu=3/kv0/seq | 64 | 54 | 50 | 46 | 49 | 65 |
| nodes=3/cpu=3/kv95/seq | 465 | 173 | 107 | 83 | 64 | 44 |

image

Notes:

  • With the split threshold set to 100ms, the kv95/splt=0 workload saw persistent thrashing throughout the benchmark run (image). This suggests that a threshold of 100ms is too low, as it destabilizes the CPU time recorded on replicas while persistent splitting occurs.
  • On the kv95/splt=0 workload, the 750ms and 1000ms split threshold clusters didn't run out of rebalance options, whilst the other clusters did. All eventually reached a steady state except 100ms (image).
  • The kv0/seq workload is throttled by the workload runner being sequential. No resources were saturated, so overall this is a less interesting workload than kv95/seq.
  • Similar to the kv95/splt=0 workload, kv95/seq saw thrashing throughout with the 100ms and 250ms split thresholds. The other clusters did not thrash, however they also did not balance CPU usage (except 750ms) (image).

Follow up questions and actions:

  • Run allocbench with 250ms, 500ms and 750ms. allocbench uses 8 CPUs rather than 32, so this would validate that different thresholds are also okay for rebalancing with small CPU counts.
  • In the kv95/seq workload, the replica CPU usage appeared balanced but the actual CPU utilization was imbalanced for every threshold other than 100ms and 250ms (which thrashed, so are not necessarily better). Why? (image)


kvoli commented Feb 14, 2023

I ran allocbench on master and with cpu balancing enabled, using different load split thresholds: 250ms, 500ms, 750ms and 1000ms. The results are shown below.

| CPU (Max-Min) | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 18.37 | 21.92 | 22.8 | 7.7 | 44.22 |
| 250ms | 15.6 | 17.73 | 14.77 | 5.12 | 8.32 |
| 500ms | 15.29 | 12.36 | 15.59 | 8.47 | 12.62 |
| 750ms | 13.39 | 14.32 | 13.65 | 11.93 | 11.38 |
| 1000ms | 14.75 | 12.77 | 12.45 | 10.84 | 10.65 |

image

| Write (Max-Min) | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 45.35 | 43.3 | 47.15 | 1.49 | 1.1 |
| 250ms | 41.66 | 51.28 | 41.63 | 1.78 | 1.56 |
| 500ms | 38.48 | 42.92 | 41.58 | 2.38 | 1.64 |
| 750ms | 38.37 | 52.89 | 51.59 | 2.23 | 1.16 |
| 1000ms | 36.68 | 37.23 | 46.68 | 1.83 | 1.79 |

image

Standard Deviation

| Write stddev | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 2.25 | 4.55 | 4.73 | 0.19 | 0.44 |
| 250ms | 5.7 | 8.66 | 4.55 | 0.27 | 0.12 |
| 500ms | 2.5 | 4.43 | 9.31 | 0.2 | 0.25 |
| 750ms | 1.3 | 9.22 | 4.92 | 0.28 | 0.38 |
| 1000ms | 6.38 | 2.86 | 8.03 | 0.4 | 0.26 |

| CPU stddev | r=0/access=skew | r=0/ops=skew | r=50/ops=skew | r=95/access=skew | r=95/ops=skew |
| --- | --- | --- | --- | --- | --- |
| master | 1.75 | 3.01 | 3.6 | 2.16 | 5.4 |
| 250ms | 1.75 | 3.9 | 1.83 | 0.26 | 1.97 |
| 500ms | 1.15 | 2.22 | 2.88 | 1.33 | 2.56 |
| 750ms | 0.64 | 2.54 | 1.79 | 1.59 | 4.28 |
| 1000ms | 2.38 | 2.49 | 3.94 | 4.4 | 2.15 |

image

Notes

  • allocbench pre-splits the workload, so there is not an extreme difference in results between clusters where CPU balancing is enabled (image).
  • A higher split threshold resulted in a higher frequency of exhausted rebalancing options. This is where the store identifies that it is overfull w.r.t. the mean but is unable to shed any load because there are no actions available. It tends to occur when the only ranges available to act on are too "large" and would push the recipient to be overfull (image). Despite this, the higher split thresholds performed well enough (<15% cpu imbalance) (image).
  • Thresholds of 750ms and 1000ms are likely too high, despite the similar performance. There were more signs of thrashing (a higher number of lease transfers) with a higher threshold in the r=50/ops=skew benchmark (1000ms image vs 250ms image).
  • The rebalancing snapshot cost (bytes) did not show a noteworthy difference between the clusters.

Longer term, increasing the threshold or removing it entirely in favor of the allocator splitting when necessary seems like the best strategy. For now, the current threshold of 250ms should be increased to 500ms: the benchmarks showed a moderate positive impact on throughput in kv0/95 and no significant downside on allocbench.

craig bot pushed a commit that referenced this issue Feb 21, 2023
96107: build,ci,release: add Linux x86_64 FIPS build r=rickystewart a=rail

* Added a script to build [golang-fips](https://github.com/golang-fips/go). The resulting Go version has a `fips` suffix, so we can easily select it in Bazel.
* Added a new Bazel toolchain and build config used by the FIPS build only.
* The FIPS docker image has FIPS-compliant `openssl` installed.
* Added a script to run FIPS build in CI.
* Tweaked one of the regexes to be less greedy.
* Added the platform to the release process.

Epic: DEVINF-478
Fixes: DEVINF-634
Fixes: DEVINF-633

Release note (build change): As a part of this change we start
publishing a FIPS compliant tarball and docker image for the Linux
x86_64 platform. The build uses OpenSSL libraries for crypto operations
by `dlopen`ing the corresponding dynamic libraries.

97113: kvserver: increase cpu lb split threshold to 500ms r=nvanbenschoten a=kvoli

This commit increases the default value of
`kv.range_split.load_cpu_threshold` to `500ms` from the previous default value of `250ms`. Experiments detailed in #96869 showed moderate performance gains between `500ms` and `250ms`, without a significant downside on rebalancing.

Resolves: #96869

Release note (ops change): The default value of
`kv.range_split.load_cpu_threshold` is increased from `250ms` to `500ms`. This threshold declares the CPU per-second value above which a range will be split. The value was selected based on performance experiments.

97359: upgrades: fix race condition inside TestUpgradeSchemaChangerElements r=fqazi a=fqazi

Previously, this test non-atomically updated the job ID, which could lead to race conditions if jobs queries ran for any reason other than the expected one. This was inadequate because the race detector would detect issues with this value being modified. To address this, this patch only ever stores/loads the job ID atomically.
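As a minimal sketch of the store/load-atomically pattern described above (illustrative only, not the actual test code), sharing the job ID through an atomic value removes the data race between the writer and any concurrent reader:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

func main() {
	var jobID atomic.Int64 // stored/loaded atomically instead of a plain int64

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // writer: the hook updating the expected job ID
		defer wg.Done()
		jobID.Store(42)
	}()
	go func() { // reader: any other jobs query observing the value
		defer wg.Done()
		_ = jobID.Load()
	}()
	wg.Wait()
	fmt.Println("final job ID:", jobID.Load())
}
```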

Fixes: #97284

Release note: None

97398: storage: add guardrails to rocksdb.min_wal_sync_interval r=nvanbenschoten a=nvanbenschoten

This commit prevents `rocksdb.min_wal_sync_interval` from being set to a negative value or from being set to a value above 1s. This prevents the cluster setting from being used to break node liveness and put a cluster into an unrecoverable state.

Release note: None
Epic: None

97405: ui: update cluster-ui to 23.1.0-prerelease.4 r=ericharmeling a=ericharmeling

This commit updates the cluster-ui version to 23.1.0-prerelease.4.

Release note: None

Epic: None

Co-authored-by: Rail Aliiev <rail@iqchoice.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Faizan Qazi <faizan@cockroachlabs.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Co-authored-by: Eric Harmeling <eric.harmeling@cockroachlabs.com>
@craig craig bot closed this as completed in 9bb2043 Feb 21, 2023
craig bot pushed a commit that referenced this issue Mar 9, 2023
97717: multitenant: add AdminUnsplitRequest capability r=knz a=ecwall

Fixes #97716

1) Add a new `tenantcapabilitiespb.CanAdminUnsplit` capability.
2) Check capability in `Authorizer.HasCapabilityForBatch`.
3) Remove `ExecutorConfig.RequireSystemTenant` call from
   `execFactory.ConstructAlterTableUnsplit`,
   `execFactory.ConstructAlterTableUnsplitAll`.

Release note: None


98250: kvserver: add minimum cpu lb split threshold r=andrewbaptist a=kvoli

Previously, `kv.range_split.load_cpu_threshold` had no minimum setting value. It is undesirable to allow users to set this setting too low, as excessive splitting may occur.

`kv.range_split.load_cpu_threshold` now has a minimum setting value of `10ms`.
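A minimal sketch of what such a floor looks like (hypothetical names; not the actual settings-validation code in cockroachdb/cockroach):

```go
package main

import (
	"fmt"
	"time"
)

// minLoadCPUThreshold mirrors the 10ms floor described above: values below it
// are rejected so that a misconfiguration cannot trigger excessive splitting.
const minLoadCPUThreshold = 10 * time.Millisecond

// validateLoadCPUThreshold is a hypothetical stand-in for the setting's
// validation hook.
func validateLoadCPUThreshold(v time.Duration) error {
	if v < minLoadCPUThreshold {
		return fmt.Errorf("kv.range_split.load_cpu_threshold must be at least %s, got %s",
			minLoadCPUThreshold, v)
	}
	return nil
}

func main() {
	fmt.Println(validateLoadCPUThreshold(5 * time.Millisecond))   // rejected
	fmt.Println(validateLoadCPUThreshold(500 * time.Millisecond)) // ok: <nil>
}
```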

See #96869 for additional context on the threshold.

Resolves: #98107

Release note (ops change): `kv.range_split.load_cpu_threshold` now has a minimum setting value of `10ms`.

98270: dashboards: add replica cpu to repl dashboard r=xinhaoz a=kvoli

In #96127 we added the option to load balance replica CPU instead of QPS across
stores in a cluster. It is desirable to view the signal being controlled for
rebalancing in the replication dashboard, similar to QPS.

This PR adds the `rebalancing.cpunanospersecond` metric to the replication
metrics dashboard.

The avg QPS graph on the replication dashboard previously described the metric as
"Exponentially weighted average", however this is not true.

This PR updates the description to just be "moving average", which is accurate.
Note that follow-the-workload does use an exponentially weighted value, however
the metric in the dashboard is not the same.

This PR also updates the graph header to include Replica in the title: "Average
Replica Queries per Node". QPS is specific to replicas. This is already
mentioned in the description.

Resolves: #98109


98289: storage: mvccExportToWriter should always return buffered range keys r=adityamaru a=stevendanna

In #96691, we changed the return type of mvccExportToWriter such that it now indicates when a CPU limit has been reached. As part of that change, when the CPU limit was reached, we would immediately `return` rather than `break` out of the loop. This introduced a bug, since the code after the loop that the `break` was taking us to is important. Namely, we may have previously buffered range keys that we still need to write into our response. By replacing the break with a return, these range keys were lost.

The end-user impact of this is that a BACKUP that _ought_ to have included range keys (such as a backup of a table with a rolled back IMPORT) would not include those range keys and thus would end up resurrecting deleted keys upon restore.

This PR brings back the `break` and adds a test that covers exporting with range keys under CPU exhaustion.
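To make the control flow concrete, here is a toy sketch (not the real mvccExportToWriter) of why a `return` in place of `break` drops the buffered range keys: only `break` reaches the post-loop flush.

```go
package main

import "fmt"

// exportKeys iterates point keys while buffering range keys, flushing the
// buffer only after the loop, mirroring the structure described above.
func exportKeys(keys []string, cpuLimitAt int) []string {
	var out []string
	var bufferedRangeKeys []string
	for i, k := range keys {
		if i == cpuLimitAt {
			// A `return out` here would skip the flush below and silently
			// lose bufferedRangeKeys; `break` preserves it.
			break
		}
		bufferedRangeKeys = append(bufferedRangeKeys, k+"@rangekey")
		out = append(out, k)
	}
	// Post-loop flush that the early `return` bypassed.
	out = append(out, bufferedRangeKeys...)
	return out
}

func main() {
	fmt.Println(exportKeys([]string{"a", "b", "c"}, 2))
}
```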

This bug only ever existed on master.

Informs #97694

Epic: none

Release note: None

98329: sql: fix iteration conditions in crdb_internal.scan r=ajwerner a=stevendanna

Rather than using the Next() key of the last key in the response when iterating, we should use the resume span. The previous code could result in a failure in the rare case that the end key of our scan exactly matched the successor key of the very last key in the iteration.
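As a rough sketch of the distinction (hypothetical helper names; not the crdb_internal.scan code), the client should continue each page from the server-provided resume key rather than computing a successor of the last returned key:

```go
package main

import "fmt"

// page is a hypothetical paginated-scan result.
type page struct {
	keys      []string
	resumeKey string // empty when the scan is complete
}

// scanPage is a hypothetical KV call returning at most `limit` keys in [start, end)
// from sorted data, reporting where to resume if it stopped early.
func scanPage(data []string, start, end string, limit int) page {
	var p page
	for _, k := range data {
		if k < start || k >= end {
			continue
		}
		if len(p.keys) == limit {
			p.resumeKey = k // the server says exactly where to pick up
			return p
		}
		p.keys = append(p.keys, k)
	}
	return p
}

func scanAll(data []string, start, end string) []string {
	var out []string
	for {
		p := scanPage(data, start, end, 2)
		out = append(out, p.keys...)
		if p.resumeKey == "" {
			return out
		}
		start = p.resumeKey // use the resume span, not the Next() of the last key
	}
}

func main() {
	fmt.Println(scanAll([]string{"a", "b", "c", "d"}, "a", "e"))
}
```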

Epic: none

Release note: None

Co-authored-by: Evan Wall <wall@cockroachlabs.com>
Co-authored-by: Austen McClernon <austen@cockroachlabs.com>
Co-authored-by: Steven Danna <danna@cockroachlabs.com>