
very high CPU usage on one of 5 nodes in cluster #17075

Closed
gpaul opened this issue Jul 18, 2017 · 12 comments

@gpaul

gpaul commented Jul 18, 2017

Is this a question, feature request, or bug report?

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster. On most unix-based systems
    running with defaults, this boils down to the output of

    grep -F '[config]' cockroach-data/logs/cockroach.log

    When log files are not available, supply the output of cockroach version
    and all flags/environment variables passed to cockroach start instead.

This is a CockroachDB cluster running a fairly recent master commit. Not ideal, but I'm doing testing that relies on functionality that will only be released in 1.0.4.

$ cockroach version
Build Tag:    a26544c-dirty
Build Time:   2017/07/16 14:48:16
Distribution: CCL
Platform:     linux amd64
Go Version:   go1.8.3
C Compiler:   gcc 4.9.3
Build SHA-1:  a26544ce14386cc07bd21420cfbc63208757f5b0
Build Type:   development

I attached the entire debug bundle (link at the bottom of the description). This is a test cluster, and I'm keeping it running in this state for the moment in the hope that there is some additional output I can provide.

  2. Please describe the issue you observed:
  • What did you do?

Start a 5-node cluster.

Issue queries to each of the nodes at a rate of ~1 per second.

Perform thousands of inserts and deletes over a period of a day or two.

As these are benchmarks, most inserts are performed serially in a 'preparation' phase and are executed one statement at a time against Node 2, followed by a few hundred insert/delete transactions distributed evenly against all 5 nodes, followed once more by serial delete statements against Node 2 to clean up after the benchmark. Throughout all these actions there is a slow background query rate of ~1/sec performed by other processes in the cluster.

I've attached a complete dump of the database and as you can see there is really very little data at the moment.
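
To make the shape of the workload concrete, here is a rough sketch of one benchmark cycle. The table and column names come from the query below; the literal values and id handling are invented for illustration, and the real schema is in the attached dump:

-- preparation phase: serial single-statement inserts against Node 2
INSERT INTO users (id, uid) VALUES (1, 'bbenchuser-00001');
INSERT INTO resources (id, rid, description) VALUES (1, 'bench-resource-00001', 'benchmark resource');
INSERT INTO aces (id, user_id, resource_id, actions) VALUES (1, 1, 1, 'full');

-- benchmark phase: a few hundred insert/delete transactions spread evenly over all 5 nodes
BEGIN;
INSERT INTO aces (id, user_id, resource_id, actions) VALUES (2, 1, 1, 'read');
DELETE FROM aces WHERE id = 2;
COMMIT;

-- cleanup phase: serial deletes against Node 2
DELETE FROM aces WHERE user_id = 1;
DELETE FROM users WHERE uid = 'bbenchuser-00001';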

  • What did you expect to see?

Inserts and queries taking milliseconds.

Specifically, queries of the following form not taking several seconds:

SELECT resources.id AS resources_id, resources.rid AS resources_rid, resources.description AS resources_description, aces.actions AS aces_actions, aces.id AS aces_id, aces.user_id AS aces_user_id, aces.group_id AS aces_group_id, aces.resource_id AS aces_resource_id 
FROM aces JOIN users ON users.id = aces.user_id JOIN resources ON resources.id = aces.resource_id 
WHERE users.uid = %(uid_1)s

CockroachDB on all nodes using some fraction of a core.

  • What did you see instead?

Queries (admittedly the aforementioned 3-way joins) taking several seconds.

CockroachDB on nodes 1, 2, 4, and 5 hovering around 10% CPU, while the process on Node 3 is pegged at ~400% CPU.

top - 13:25:46 up 1 day, 22:28,  2 users,  load average: 6.70, 6.65, 6.58
Tasks: 180 total,   1 running, 179 sleeping,   0 stopped,   0 zombie
%Cpu(s): 90.8 us,  7.3 sy,  0.0 ni,  1.8 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16004820 total,  6722184 free,  4530252 used,  4752384 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11036620 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                   
28794 dcos_co+  20   0 3395548 1.553g  24296 S 379.1 10.2   8760:22 cockroach                                                                                                                 

I've attached the debug bundle and complete database dump.

I've also attached CPU profile SVG and goroutine stacktraces.

cockroach-debug.zip
dump.zip
cpu-profile.zip

@gpaul
Author

gpaul commented Jul 18, 2017

I don't want to check whether restarting cockroachdb helps until we're sure we've extracted all the debug info.

@gpaul
Author

gpaul commented Jul 18, 2017

FWIW I started a new 5-node cluster (same version), loaded the attached data into the cluster, then executed

SELECT resources.id AS resources_id, resources.rid AS resources_rid, resources.description AS resources_description, aces.actions AS aces_actions, aces.id AS aces_id, aces.user_id AS aces_user_id, aces.group_id AS aces_group_id, aces.resource_id AS aces_resource_id 
FROM aces JOIN users ON users.id = aces.user_id JOIN resources ON resources.id = aces.resource_id 
WHERE users.uid = 'bbenchuser-00001';

The results appeared instantly as one would expect.

@gpaul
Author

gpaul commented Jul 18, 2017

The network seems under control:

[centos@ip-10-10-0-125 ~]$ ping 10.10.0.250
PING 10.10.0.250 (10.10.0.250) 56(84) bytes of data.
64 bytes from 10.10.0.250: icmp_seq=1 ttl=64 time=0.307 ms
64 bytes from 10.10.0.250: icmp_seq=2 ttl=64 time=0.230 ms
64 bytes from 10.10.0.250: icmp_seq=3 ttl=64 time=0.288 ms
64 bytes from 10.10.0.250: icmp_seq=4 ttl=64 time=0.251 ms

@gpaul
Author

gpaul commented Jul 19, 2017

This reproduced on a new cluster running the same version. I'm going to try to reproduce it on a cluster built from the release-1.0 branch.

@gpaul
Author

gpaul commented Jul 19, 2017

Interestingly, most graphs in the Admin UI look sane, except for the Distributed dashboard, where the distribution of work between the nodes seems lopsided.

Here are some screenshots showing this dashboard for the various nodes. The sudden increase in traffic at ~09:00 coincides with the launching of my benchmarks. Note the extreme difference between node 5 and, for example, nodes 2 and 4.

CPU usage on the nodes is:
Node 1: 5%
Node 2: 93%
Node 3: 60%
Node 4: 300%
Node 5: 5%

Here are the screenshots which appear to show significant non-uniformity:

batches-node-1
batches-node-2
batches-node-3
batches-node-4
batches-node-5

@gpaul
Author

gpaul commented Jul 24, 2017

The root cause was ... insufficient indexes. The symptom of the cluster not recovering even after hours of extremely low load is still worrying though.

@gpaul gpaul closed this as completed Jul 24, 2017
@tbg
Member

tbg commented Jul 24, 2017

@gpaul thanks for your report and sorry this issue has been idle for so long! I agree that it's worrisome that the cluster did not recover when you stopped the workload. We will take a look.

@tbg tbg reopened this Jul 24, 2017
@petermattis petermattis added this to the 1.1 milestone Jul 24, 2017
@gpaul
Author

gpaul commented Jul 24, 2017

Not at all. Combining "it is slow" with "I'm running master", I wasn't expecting a massive turnout :-)

@a-robinson
Contributor

a-robinson commented Jul 24, 2017

Hey @gpaul, like @tschottdorf I'm sorry I didn't get to this last week. Thanks for all the details. I have a few follow-up questions for you:

  1. You mentioned running insert/delete statements as part of your workload. Were these statements running at the same time as the slow SELECT statement?
  2. If so, were they modifying the aces and resources tables that the SELECT query was doing full scans over?
  3. Roughly how many concurrent queries were you sending?
  4. What specifically do you mean when you say it took time to recover? I assume you mean that the server kept doing work well after you stopped sending requests, but I want to make sure I'm not assuming wrong.
  5. If so, did the clients close their connections to the server when they stopped sending requests?

While removing the full table scan should solve all your problems judging by the profile that you attached, it's clear that cockroach's behavior didn't degrade gracefully here under load. Node 3 presumably took the brunt of it because it was the leaseholder for the hot range(s).
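
For reference, indexes along these lines are probably what's needed to avoid the scans, judging only by the join and filter columns in the query from your report (a sketch, not a definitive fix; adjust to your actual schema):

CREATE INDEX ON users (uid);
CREATE INDEX ON aces (user_id);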

The NotLeaseholder errors you were seeing also look really bad, but they were recently independently discovered (#17177 (comment)) and fixed (#17196) so at least that part is taken care of.

@gpaul
Author

gpaul commented Jul 26, 2017

You mentioned running insert/delete statements as part of your workload. Were these statements running at the same time as the slow SELECT statement?

Yes.

The test I was running inserts records, deletes them, inserts more records, deletes those, and so on. All the while, SELECT requests that trigger table scans are executing at an average rate of one every few seconds, with enough variance that occasional queries arrive back-to-back.
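
(As a side note, the scans are easy to confirm against the attached dump with EXPLAIN; I'm omitting the plan output here since its format varies by version.)

EXPLAIN SELECT resources.id, aces.id, users.uid
FROM aces JOIN users ON users.id = aces.user_id JOIN resources ON resources.id = aces.resource_id
WHERE users.uid = 'bbenchuser-00001';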

If so, were they modifying the aces and resources tables that the SELECT query was doing full scans over?

Yes.

The test adds users, adds groups, adds users to groups, then adds a bunch of resources and adds users and groups to those resources through the aces table.

Roughly how many concurrent queries were you sending?

Very few. The test itself runs serially against a single node, with SELECTs being performed by background services on all nodes at the aforementioned rate.

What specifically do you mean when you say it took time to recover? I assume you mean that the server kept doing work well after you stopped sending requests, but want to make sure I'm not assuming wrong.

I suspected that load was causing the performance degradation, so I stopped the test and left the cluster running overnight. The next morning the CockroachDB cluster still showed very high CPU usage on node 3. In retrospect this is probably simply due to large table scans.

Given the relatively small amount of live data (attached previously), that was still surprising. If you load the database I attached and execute some queries, you'll find that it is pretty snappy. This leads me to suspect that deleted records incur a penalty when table scanning. Does that seem sensible?

If so, did the clients close their connections to the server when they stopped sending requests?

That's a good question. I'm using the Python SQLAlchemy library's Session abstraction, which is likely pooling connections. Should I be using per-transaction connections instead?

@gpaul
Author

gpaul commented Jul 26, 2017

Also relevant to my last point: there is only one client throughout, but that client uses connection pooling.

@a-robinson
Contributor

This leads me to suspect that deleted records incur a penalty when table scanning. Does that seem sensible?

Your suspicion is correct. That's most likely what you ran into here. We keep the deleted records in-line with the live records until we GC them, which by default won't happen for 24 hours. This isn't a big deal if you're doing point lookups of records (as would be the case after you added indexes), but will affect table scans.
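
If waiting out the 24 hours isn't practical for a workload like this, the GC TTL is also a per-table zone configuration setting (gc.ttlseconds) that can be lowered. As a rough sketch only, since the mechanism depends on the CockroachDB version (1.0-era clusters set it through the cockroach zone set CLI, while later releases expose it in SQL roughly like this):

ALTER TABLE aces CONFIGURE ZONE USING gc.ttlseconds = 600;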

This is an issue that we've recently started discussing more, since as you've learned it can cause pretty surprising performance degradation in workloads with a lot of updates and deletes. I'm going to close this as a known issue (with #17229 serving as a less specific tracking issue for it), but thanks again for reporting the problem you were having!

cc @andreimatei @tschottdorf @bdarnell as interested parties from the forum discussion on TTLs.
