
kv,sql: expensive select queries cause OOM #123000

Open
andrewbaptist opened this issue Apr 24, 2024 · 2 comments
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-3 Issues/test failures with no fix SLA T-sql-queries SQL Queries Team

Comments

@andrewbaptist
Collaborator

andrewbaptist commented Apr 24, 2024

Describe the problem

Running multiple queries in parallel can cause nodes to OOM.

To Reproduce

  1. Create a 3-node cluster initialized with a KV DB (a minimal setup sketch is included after the crash log below).
  2. Begin a background workload to write many KV keys
    roachprod ssh $CLUSTER:3 "./cockroach workload run kv $(roachprod pgurl $CLUSTER:3) --timeout 5s --tolerate-errors"
  3. Wait a little while to allow there to be enough keys written to the system.
  4. Start a number of select queries in the background:
    for n in {1..3}; do for x in {1..100}; do echo "select * from kv.kv where v='abc';" | roachprod sql $CLUSTER:$n & done; done
  5. Notice that all the cockroach nodes crash:
[17561.221476] cockroach invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[17629.744740] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=init.scope,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-431.scope,task=cockroach,pid=32634,uid=1000
[17629.744773] Out of memory: Killed process 32634 (cockroach) total-vm:1735716kB, anon-rss:206688kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:964kB oom_score_adj:0
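
For reference, step 1 might look roughly like the following (assuming roachprod is available; exact subcommands and flags can differ between roachprod versions, and $CLUSTER is a placeholder):

    roachprod create $CLUSTER -n 3
    roachprod stage $CLUSTER cockroach
    roachprod start $CLUSTER
    roachprod ssh $CLUSTER:1 "./cockroach workload init kv $(roachprod pgurl $CLUSTER:1)"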

Expected behavior
The nodes should not crash. On a multi-tenant system, a single tenant could bring down the entire cluster.

Additional data / screenshots
Depending on the number of concurrent select * queries per node, we get different behaviors:
10 queries/node - p50 latency on writes jumps from 2ms -> 900ms, QPS goes from 4000 -> 10
20 queries/node - p50 latency goes to 2s, QPS goes to 1-2
40 queries/node - p50 latency goes to 10s, QPS goes to 0 (timeouts). Causes liveness failures.
100 queries/node - all nodes crash with OOM

CPU Profile: profile.pb.gz
Heap profile (at 30 concurrency): profile.pb.gz
Note that the heap profile doesn't account for all the memory. Here is a line from the cockroach-health log at about the same time as the heap profile:
I240424 19:20:08.898205 324 2@server/status/runtime_log.go:47 ⋮ [T1,Vsystem,n2] 734 runtime stats: 6.2 GiB RSS, 982 goroutines (stacks: 17 MiB), 2.7 GiB/4.1 GiB Go alloc/total (heap fragmentation: 27 MiB, heap reserved: 1.3 GiB, heap released: 1.5 GiB), 2.1 GiB/2.4 GiB CGO alloc/total (9.0 CGO/sec), 394.5/3.1 %(u/s)time, 0.0 %gc (188x), 569 KiB/622 KiB (r/w)net

Environment:
This likely occurs on all releases. It was tested on 24.1/master.

Additional context
We have seen customer cases where heavy queries can cause either liveness failures or OOMs.

Jira issue: CRDB-38160

@andrewbaptist andrewbaptist added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs labels Apr 24, 2024
@sumeerbhola sumeerbhola added the T-sql-queries SQL Queries Team label Apr 25, 2024
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Apr 25, 2024
@sumeerbhola sumeerbhola changed the title admission: failure to protect nodes against expensive select queries sql: expensive select queries cause OOM Apr 25, 2024
@sumeerbhola sumeerbhola changed the title sql: expensive select queries cause OOM kv,sql: expensive select queries cause OOM Apr 25, 2024
@sumeerbhola sumeerbhola added the T-kv KV Team label Apr 25, 2024
@sumeerbhola
Collaborator

Thanks for the reproduction!
I have tagged this with KV and SQL for initial investigation, and removed admission from the title, since AC does no memory tracking or memory-based admission.

Glancing at the heap profile, which only accounts for 658 MB, all the memory usage is in MVCCScanToBytes in the pebbleResults, which should be covered by the KV memory accounting. It would be good to have a profile that captures the high memory usage.

@yuzefovich yuzefovich moved this from Triage to 24.2 Release in SQL Queries May 7, 2024
@yuzefovich
Member

This query performs a full scan of the kv table, and each KV BatchRequest is issued with a TargetBytes limit of 10MiB. When the first BatchRequest is sent, we only pre-reserve 1KiB in the memory accounting system, but when the response comes back, we reconcile the accounting with the actual memory footprint, and for the 2nd and all subsequent batches we keep a reasonable reservation. If many queries start around the same time on the same gateway, we can issue a huge number of such BatchRequests while only pre-reserving a fraction of the needed memory, thus overloading the KV server. Ideally, the KV server would protect itself.
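
To make that failure mode concrete, here is a self-contained Go sketch (not CockroachDB's actual memory-accounting code; the budget type, limit, and sizes are invented for illustration) of how reconciling only after the response arrives lets concurrent scans overshoot the budget:

    package main

    import (
        "errors"
        "fmt"
    )

    // budget is a toy memory account with a hard limit (a stand-in for a real
    // memory monitor; not CockroachDB's API).
    type budget struct {
        limit, used int64
    }

    func (b *budget) reserve(n int64) error {
        if b.used+n > b.limit {
            return errors.New("memory budget exceeded")
        }
        b.used += n
        return nil
    }

    // resize replaces an earlier reservation with the actual footprint.
    func (b *budget) resize(old, actual int64) error {
        b.used -= old
        return b.reserve(actual)
    }

    func main() {
        const (
            preReservation  = 1 << 10  // 1 KiB reserved before the first batch
            responseBytes   = 10 << 20 // ~TargetBytes (10 MiB) actually returned
            concurrentScans = 100
        )
        acct := &budget{limit: 256 << 20} // pretend 256 MiB budget on the gateway

        // Every scan pre-reserves only 1 KiB, so all of them are admitted.
        for i := 0; i < concurrentScans; i++ {
            if err := acct.reserve(preReservation); err != nil {
                fmt.Println("scan", i, "rejected up front:", err)
                return
            }
        }
        fmt.Printf("after pre-reservation: %d B used of %d B\n", acct.used, acct.limit)

        // Reconciling to the real response size exceeds the limit only after the
        // responses (and their memory) have already arrived at the gateway.
        for i := 0; i < concurrentScans; i++ {
            if err := acct.resize(preReservation, responseBytes); err != nil {
                fmt.Println("scan", i, "over budget only after the fact:", err)
                return
            }
        }
    }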

In the kvstreamer client we address this limitation by estimating the expected response size and reserving that amount up front, then reconciling the accounting after the response has been received. We could do something similar in the txnKVFetcher, but I'd rather address #82164, which tracks using the Streamer to power scans (in addition to lookup and index joins). We have a prototype for that in #103659 - it might be an interesting experiment to see how the system behaves on a binary with that prototype.
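
For contrast, a hypothetical sketch of the estimate-then-reconcile pattern described above (again, not the real kvstreamer API; the account type, estimate, and limit are made up): by reserving the estimated response size before sending, an over-budget request is rejected before it can load the server:

    package main

    import (
        "errors"
        "fmt"
    )

    // account is a toy memory budget (placeholder, not CockroachDB's API).
    type account struct{ limit, used int64 }

    func (a *account) grow(n int64) error {
        if a.used+n > a.limit {
            return errors.New("memory budget exceeded")
        }
        a.used += n
        return nil
    }

    func (a *account) shrink(n int64) { a.used -= n }

    // issueScan reserves the estimated footprint up front; if the budget cannot
    // cover it, the request is never sent, so the server is not overloaded.
    func issueScan(acct *account, estimate int64, send func() int64) error {
        if err := acct.grow(estimate); err != nil {
            return err // back off before sending the BatchRequest
        }
        actual := send()
        // Reconcile the up-front estimate with the actual response size.
        acct.shrink(estimate)
        return acct.grow(actual)
    }

    func main() {
        acct := &account{limit: 64 << 20} // pretend 64 MiB budget
        estimate := int64(10 << 20)       // assume ~TargetBytes per response
        for i := 0; i < 10; i++ {
            err := issueScan(acct, estimate, func() int64 { return 10 << 20 })
            fmt.Printf("scan %d: used=%d err=%v\n", i, acct.used, err)
        }
    }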

@yuzefovich yuzefovich moved this from 24.2 Release to Backlog in SQL Queries May 7, 2024
@andrewbaptist andrewbaptist added the P-3 Issues/test failures with no fix SLA label May 13, 2024
@arulajmani arulajmani removed the T-kv KV Team label May 13, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024