kv95/enc=false/nodes=3/cpu=96 regression on July 17, 2023 #109443

Closed · rafiss opened this issue Aug 24, 2023 · 12 comments

Labels: A-roachperf-investigation, branch-master, C-investigation, release-blocker, T-testeng

Comments

rafiss (Collaborator) commented Aug 24, 2023

Describe the problem

kv95/enc=false/nodes=3/cpu=96 had a regression in July. QPS dropped from ~185k down to ~150k.

To Reproduce

See https://roachperf.crdb.dev/?filter=&view=kv95%2Fenc%3Dfalse%2Fnodes%3D3%2Fcpu%3D96&tab=gce

Jira issue: CRDB-30926

rafiss added the C-investigation, A-roachperf-investigation, and T-testeng labels on Aug 24, 2023
blathers-crl bot commented Aug 24, 2023

cc @cockroachdb/test-eng

ajstorm added the release-blocker label on Aug 24, 2023
blathers-crl bot commented Aug 24, 2023

Hi @ajstorm, please add branch-* labels to identify which branch(es) this release-blocker affects.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

rafiss added the branch-master label on Aug 24, 2023
smg260 (Contributor) commented Aug 24, 2023

Bisection results:

2023-08-24T19:39:22Z   Bisecting regression in [kv95/enc=false/nodes=3/cpu=96] using commit range [c89d5868978 (known good),06b5ba78883 (known bad)]
2023-08-24T19:39:22Z   Thresholds [good >= 179410, bad <= 138725]
2023-08-24T20:00:12Z   [9a4f139c000] avg ops/s: [180089]. Auto marked as good.
2023-08-24T20:20:16Z   [8aebb2993e9] avg ops/s: [180375]. Auto marked as good.
2023-08-24T20:36:58Z   [6f420390c1f] avg ops/s: [137051]. Auto marked as bad.
2023-08-24T20:52:30Z   [e0b83fdd096] avg ops/s: [181482]. Auto marked as good.
2023-08-24T21:07:19Z   [25855acc268] avg ops/s: [182082]. Auto marked as good.
2023-08-24T21:25:45Z   [5cc456bb4c4] avg ops/s: [182696]. Auto marked as good.
2023-08-24T21:41:51Z   [dc6b2dc923a] avg ops/s: [180852]. Auto marked as good.
2023-08-24T21:59:40Z   [60e8bcec612] avg ops/s: [137160]. Auto marked as bad.
2023-08-24T21:59:40Z   Bisection complete. Suspect commit:
commit 60e8bcec61269c6648dc40c8094fbf303b8a02aa
Author: Ricky Stewart <ricky@cockroachlabs.com>
Date:   Mon Jul 17 15:09:52 2023 -0500

    bazel,dev: in `dev`, handle different `--compilation_mode`s correctly...

    ... and switch our default compilation mode to `dbg`.

    Under `fastbuild`, built binaries are stripped. This is not the case
    with `dbg`. Have `dev doctor` recommend using `dbg`, and either way
    make `dev` resilient to whatever kind of `--compilation_mode` you're
    using.

    In principle `dbg` is slower than `fastbuild`, but the Go compiler
    generates debuggable binaries by default and you have to opt in to
    stripping, so it should make no real difference.

    Now we think of the `--compilation_mode` simply as follows: `dbg` is
    the default everyone is opted into unless they explicitly set `-c opt`
    (we use this for releases). For now I don't see a reason anyone would
    need `fastbuild`.

    Epic: CRDB-17171
    Release note: None
    Closes: #106820

smg260 (Contributor) commented Aug 24, 2023

I also ran this test against latest master with the above commit (#106944) and its subsequent "partial revert" commit (#107207) both reverted.

Result: 179597.8 ops/s - in line with the higher pre-regression throughput.

smg260 (Contributor) commented Aug 25, 2023

Not so fast... Same code:

via gce-worker: [throughput graph]

via CI: [throughput graph]

smg260 (Contributor) commented Aug 26, 2023

The regression start date led the bisect to identify the above commit, which did cause a regression, but that commit was subsequently fixed, which means something else was introduced later.

So we bisect again, with a new known-good commit (the fix for the above: f52fe97ad8) and a known-bad commit (c6f4c0ed6e3).

With --first-parent specified, the bisect only follows the first parent of merge commits, yielding the offending merge commit below.

2023-08-25T20:07:04Z   Bisecting regression in [kv95/enc=false/nodes=3/cpu=96] using commit range [f52fe97ad8b (known good),c6f4c0ed6e3 (known bad)]
2023-08-25T20:07:04Z   Thresholds [good >= 179696, bad <= 144555]
2023-08-25T20:24:11Z   [e23993ff117] avg ops/s: [143712]. Auto marked as bad.
2023-08-25T20:52:30Z   [b005e97c4ba] avg ops/s: [178173]. User marked as good. Threshold updated.
2023-08-25T20:52:30Z   Bisection complete. Suspect commit:

commit e23993ff117521c5bf72390c46f589b275af9747
Merge: b005e97c4ba a792cd30a86 466c4e9b709 8e4578b0e20 db4be30fa87 fb97251da55 a05d100dfae f52fe97ad8b
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Wed Jul 19 22:12:33 2023 +0000

    Merge #106860 #106912 #107099 #107170 #107189 #107197 #107207

<... truncated >

This is a relatively large batch, so we run the roachtest against each hash in it until we find the one with decreased throughput. In this case, that is a792cd30a86 at ~150k ops/s.

As a sanity check, we also ran the other hashes in the batch. The first "hump" below is a792cd30a86, clearly lower than the others.

[throughput graphs for each hash in the batch]

commit a792cd30a864cae879f8f91040b7394304ee3572
Author: j82w <jwilley@cockroachlabs.com>
Date:   Fri Jul 14 15:03:47 2023 -0400

    ssmemstorage: improve lock contention on RecordStatement

    1. The stats object lock was being held even during the insights event
    creation and insert, which has its own lock. The insights logic is now
    moved above the stats lock.
    2. Moved latency calculation outside the stats lock.
    3. Moved error code calculation outside the stats lock.
    4. ss_mem_storage converted to use a RWMutex. Each object in the map
    has its own lock. The read lock allows multiple threads to read from the
    map at the same time, which is the common case. Writes are only done the
    first time a statement is seen.

    Fixes: #106789, Fixes: #55285

    Release note (performance improvement): Reduced lock contention on
    `ssmemstorage.RecordStatement`. This is useful for workloads that
    execute the same statement concurrently on the same SQL instance.

cc @cockroachdb/sql-foundations
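For readers skimming this thread, here's a minimal sketch of the locking pattern that commit describes, with hypothetical names (container, statEntry, recordStatement) rather than the actual ssmemstorage code: the map is guarded by an RWMutex so the common read path can proceed concurrently, the write lock is only taken the first time a statement key is seen, each entry carries its own lock, and expensive work such as latency calculation happens outside the shared lock.

```go
package ssmemsketch

import "sync"

// statEntry is a hypothetical per-statement stats object with its own lock.
type statEntry struct {
	mu    sync.Mutex
	count int64
	sum   float64 // accumulated latency in seconds
}

// container guards the map with an RWMutex: reads (the common case) can run
// concurrently; the write lock is needed only when a key is first inserted.
type container struct {
	mu    sync.RWMutex
	stats map[string]*statEntry
}

func newContainer() *container {
	return &container{stats: make(map[string]*statEntry)}
}

// getOrCreate takes the read lock on the fast path and falls back to the
// write lock only when the key has never been seen before.
func (c *container) getOrCreate(key string) *statEntry {
	c.mu.RLock()
	e, ok := c.stats[key]
	c.mu.RUnlock()
	if ok {
		return e
	}
	// Slow path: re-check under the write lock, since another goroutine may
	// have inserted the key in the meantime.
	c.mu.Lock()
	defer c.mu.Unlock()
	if e, ok := c.stats[key]; ok {
		return e
	}
	e = &statEntry{}
	c.stats[key] = e
	return e
}

// recordStatement computes nothing while holding the shared lock; only the
// per-entry lock is held for the update itself.
func (c *container) recordStatement(key string, latencySec float64) {
	e := c.getOrCreate(key)
	e.mu.Lock()
	e.count++
	e.sum += latencySec
	e.mu.Unlock()
}
```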

rafiss (Collaborator, Author) commented Aug 26, 2023

Very nice investigation! a792cd3 is work from the @cockroachdb/cluster-observability team.

maryliag (Contributor) commented:

That's curious, especially because the benchmark done on that commit shows performance improvements:
#106860

cc @j82w

srosenberg added a commit to srosenberg/cockroach that referenced this issue Aug 27, 2023
An experiment using drwmutex [1] to speed up read lock contention
on 96 vCPUs, as observed in [2]. The final run of
`kv95/enc=false/nodes=3/cpu=96` exhibited an average
throughput of 173413 ops/sec. That's worse than the implementation
without RWMutex. It appears that the read lock, as implemented by
Go's runtime, scales poorly to a high number of vCPUs [3].
On the other hand, the write lock under drwmutex requires
acquiring 96 locks in this case, which appears to be the only
bottleneck; the sharded read lock is efficient enough that it
doesn't show up on the cpu profile. The only slowdown
appears to be the write lock inside getStatsForStmtWithKeySlow,
which is unavoidable. Although inconclusive, it appears that
drwmutex doesn't scale well above a certain number of vCPUs
when the write mutex is on a critical path.

[1] https://github.com/jonhoo/drwmutex
[2] cockroachdb#109443
[3] golang/go#17973

Epic: none

Release note: None
srosenberg (Member) commented Aug 27, 2023

That's curious, especially because the benchmark done on that commit shows performance improvements

Unfortunately, this is a great example of where microbenchmarks are insufficient. They are great for establishing a relative change (wrt the baseline), but we can't rely on them to tell us about absolute performance. Below is a summary of my investigation.

Investigation

Below, we refer to a792cd3 as the bad run; the good run is f52fe97a. (I've simply cherry-picked the bad run on top of the good run.)

CPU Profile

First, the CPU profiles didn't yield any obvious insights (e.g., that the system is overloaded). However, we can clearly see that the bad run is burning ~10% more CPU. Given the context of the change in that commit, it's a hint that the additional cpu cycles are spent on acquiring mutexes.

Good run (grafana link): [CPU utilization graph]

Bad run (grafana link): [CPU utilization graph]

Mutex Profile

Indeed, the mutex profiles are very telling.

Good run: [mutex profile flame graph]

Bad run: [mutex profile flame graph]
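As an aside, mutex profiles like these can be captured from any Go binary once mutex profiling is enabled; a minimal, generic sketch (not CockroachDB-specific, and the sampling rate of 100 is an arbitrary choice) looks like this:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/ handlers
	"runtime"
)

func main() {
	// Sample roughly 1 in 100 mutex contention events; 0 disables the profile.
	runtime.SetMutexProfileFraction(100)

	// Fetch profiles with, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/mutex
	//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```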

RWMutex Scales Poorly with CPU Count

The fact that RWMutex scales poorly is a known issue [1]. It also explains why only kv95/enc=false/nodes=3/cpu=96 and kv95/enc=false/nodes=3/cpu=32 regressed, and not kv95/enc=false/nodes=3. (The latter uses 8 vCPUs.)
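The effect is easy to reproduce with a plain Go benchmark that takes an uncontended-looking read path through each lock type; this is a generic sketch, not code from the CockroachDB repo:

```go
package lockbench

import (
	"sync"
	"testing"
)

var (
	mu  sync.Mutex
	rw  sync.RWMutex
	val int64
)

// BenchmarkMutex takes a plain mutex around a trivial read.
func BenchmarkMutex(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			mu.Lock()
			_ = val
			mu.Unlock()
		}
	})
}

// BenchmarkRWMutexRead takes only the read lock. Every RLock/RUnlock still
// updates the shared reader count, so the same cache line bounces between
// cores and throughput stops scaling at high GOMAXPROCS (golang/go#17973).
func BenchmarkRWMutexRead(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			rw.RLock()
			_ = val
			rw.RUnlock()
		}
	})
}
```

Running it with something like `go test -bench . -cpu 8,32,96` should show the RWMutex read path flattening out (or getting worse) as parallelism grows.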

DRWMutex

I've experimented with drwmutex [2], which allegedly scales better with cpu count. Roughly speaking, it shards read locks across all available vCPUs. It does so by using x86 assembly which invokes CPUID to obtain the vCPU index where the current goroutine is running. Unfortunately, that doesn't come free in terms of overhead. I ended up replacing it with an atomic counter which effectively round-robins and only invokes CPUID 1/num_cpus of the time. While that reduces the overhead of the read lock, the write lock is also expensive: it must acquire a mutex for each vCPU. I was able to get a better average throughput (173413 ops/sec) than the bad run, but it's still worse than the good run. You can examine the changes in [3].
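To make the mechanics concrete, here is a rough, simplified sketch of the sharded-lock idea (it is not the code in [3]); it skips CPUID entirely and picks shards purely round-robin, which is enough to show why the read path scales but the write path gets more expensive as the shard count grows:

```go
package drwsketch

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// DRWMutex shards the read lock across one RWMutex per logical CPU so that
// concurrent RLock calls mostly touch different cache lines. The cost is
// that the write lock must acquire every shard.
type DRWMutex struct {
	shards []paddedRWMutex
	next   atomic.Uint64 // round-robin shard picker (stands in for CPUID)
}

// paddedRWMutex keeps each shard on its own cache line to limit false sharing.
type paddedRWMutex struct {
	sync.RWMutex
	_ [64]byte
}

func New() *DRWMutex {
	return &DRWMutex{shards: make([]paddedRWMutex, runtime.GOMAXPROCS(0))}
}

// RLock read-locks one shard chosen round-robin and returns it so the caller
// can unlock the same shard.
func (d *DRWMutex) RLock() *sync.RWMutex {
	i := d.next.Add(1) % uint64(len(d.shards))
	m := &d.shards[i].RWMutex
	m.RLock()
	return m
}

// Lock takes the write lock on every shard; with 96 vCPUs that is 96 mutex
// acquisitions, which is exactly the write-path cost described above.
func (d *DRWMutex) Lock() {
	for i := range d.shards {
		d.shards[i].Lock()
	}
}

// Unlock releases the write lock on every shard.
func (d *DRWMutex) Unlock() {
	for i := range d.shards {
		d.shards[i].Unlock()
	}
}
```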

It appears we've hit the wall with the optimization to split the mutex into read/write locks in cases where the vCPU count is > 16. (tpccbench/nodes=3/cpu=16 didn't regress either, so I am assuming we're good with 16 or fewer cores.) The mutex profile with drwmutex shows the overhead is down to a single acquire in getStatsForStmtWithKeySlow, which is unavoidable.

[mutex profile flame graph with drwmutex]

(Note, I've renamed getStatsForStmtWithKeyLocked to getStatsForStmtWithKeySlow in [3].)

Next Steps

I think we need to determine how much performance is gained by the read/write locks for <= 16 vCPUs. If it's significant, an adaptive approach could work, i.e., fall back to sync.Mutex when the vCPU count is > 16 (see the sketch below). Ultimately, we need a better atomic map implementation [4], because locks are not without overhead, especially as core density increases.
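A minimal sketch of what that adaptive fallback could look like; the helper names and the 16-vCPU threshold are assumptions taken from this thread, not a tuned implementation:

```go
package adaptivelock

import (
	"runtime"
	"sync"
)

// rwLocker is the subset of the sync.RWMutex API the stats container needs.
type rwLocker interface {
	Lock()
	Unlock()
	RLock()
	RUnlock()
}

// mutexAdapter makes a plain sync.Mutex satisfy rwLocker by treating read
// locks as exclusive locks.
type mutexAdapter struct{ sync.Mutex }

func (m *mutexAdapter) RLock()   { m.Lock() }
func (m *mutexAdapter) RUnlock() { m.Unlock() }

// newLock falls back to a plain Mutex on machines where RWMutex read locks
// are assumed to scale poorly (here: more than 16 vCPUs).
func newLock() rwLocker {
	if runtime.NumCPU() > 16 {
		return &mutexAdapter{}
	}
	return &sync.RWMutex{}
}
```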

[1] golang/go#17973
[2] https://github.com/jonhoo/drwmutex
[3] https://github.com/srosenberg/cockroach/tree/sr/drwmutex_109443
[4] golang/go#54720

j82w (Contributor) commented Aug 28, 2023

@srosenberg should I revert the PR?

srosenberg (Member) commented:

@srosenberg should I revert the PR?

Yep, let's revert and add an annotation to roachperf. Could you also open a follow-up issue? That way we can close this one out.

srosenberg (Member) commented:

I've created a new issue ^^^ as a follow-up.
