kv,sql: expensive select queries cause OOM #123000
Labels
C-bug, O-support, P-3, T-sql-queries
Describe the problem
Running multiple expensive select queries in parallel can cause nodes to OOM.
To Reproduce
# Start a steady kv write workload from node 3, tolerating errors/timeouts:
roachprod ssh $CLUSTER:3 "./cockroach workload run kv $(roachprod pgurl $CLUSTER:3) --timeout 5s --tolerate-errors"
# Fan out 100 concurrent full-table scans (the filter on v is not indexed) against each of nodes 1-3:
for n in {1..3}; do for x in {1..100}; do echo "select * from kv.kv where v='abc';" | roachprod sql $CLUSTER:$n & done; done
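While the scans run, per-node memory pressure shows up in the runtime stats lines of the health log. A quick way to sample them (a sketch; logs/cockroach-health.log is roachprod's default log location and may differ in other setups):

for n in {1..3}; do
  # Print the most recent runtime stats line from each node's health log.
  roachprod ssh $CLUSTER:$n "grep 'runtime stats' logs/cockroach-health.log | tail -1"
done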
Expected behavior
The nodes should not crash. On a multi-tenant system, a single tenant could bring down the entire cluster.
Additional data / screenshots
Depending on the number of concurrent select * queries, we see different behaviors (a parameterized repro loop follows the list):
10 queries/node - p50 latency on writes jumps from 2ms -> 900ms, QPS goes from 4000 -> 10
20 queries/node - p50 latency goes to 2s, QPS goes to 1-2
40 queries/node - p50 latency goes to 10s, QPS goes to 0 (timeouts). Causes liveness failures.
100 queries/node - all nodes crash with OOM
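To dial in each regime above, the fan-out loop from the repro steps can be parameterized (a sketch; QUERIES_PER_NODE is a hypothetical variable, and the cluster and kv.kv table are the same as in the repro):

# Set QUERIES_PER_NODE to 10/20/40/100 to hit the regimes listed above.
QUERIES_PER_NODE=${QUERIES_PER_NODE:-40}
for n in {1..3}; do
  for x in $(seq 1 "$QUERIES_PER_NODE"); do
    echo "select * from kv.kv where v='abc';" | roachprod sql $CLUSTER:$n &
  done
done
wait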
CPU profile: profile.pb.gz
Heap profile (at concurrency 30): profile.pb.gz
Note that the heap profile doesn't account for all the memory. Here is a line from the cockroach-health log at about the same time as the heap profile:
I240424 19:20:08.898205 324 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n2] 734 runtime stats: 6.2 GiB RSS, 982 goroutines (stacks: 17 MiB), 2.7 GiB/4.1 GiB Go alloc/total (heap fragmentation: 27 MiB, heap reserved: 1.3 GiB, heap released: 1.5 GiB), 2.1 GiB/2.4 GiB CGO alloc/total (9.0 CGO/sec), 394.5/3.1 %(u/s)time, 0.0 %gc (188x), 569 KiB/622 KiB (r/w)net
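Back-of-the-envelope from that log line (my reading of the fields): Go total (4.1 GiB) + CGO total (2.4 GiB) ≈ 6.5 GiB, which roughly matches the 6.2 GiB RSS. The heap profile, however, only samples the 2.7 GiB of live Go allocations, so the 2.4 GiB of CGO memory and the ~1.4 GiB of reserved/fragmented Go heap never appear in it.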
Environment:
This likely occurs on all releases; it was tested on 24.1/master.
Additional context
We have seen customer cases where heavy queries cause either liveness failures or OOMs.
Jira issue: CRDB-38160