
very high CPU usage on one of 5 nodes in cluster #17075

Closed
gpaul opened this issue Jul 18, 2017 · 12 comments

@gpaul

gpaul commented Jul 18, 2017

Is this a question, feature request, or bug report?

BUG REPORT

  1. Please supply the header (i.e. the first few lines) of your most recent
    log file for each node in your cluster. On most unix-based systems
    running with defaults, this boils down to the output of

    grep -F '[config]' cockroach-data/logs/cockroach.log

    When log files are not available, supply the output of cockroach version
    and all flags/environment variables passed to cockroach start instead.

This is a CockroachDB cluster running a fairly recent master commit. Not ideal, but I'm doing testing that relies on functionality that will only be released in 1.0.4.

$ cockroach version
Build Tag:    a26544c-dirty
Build Time:   2017/07/16 14:48:16
Distribution: CCL
Platform:     linux amd64
Go Version:   go1.8.3
C Compiler:   gcc 4.9.3
Build SHA-1:  a26544ce14386cc07bd21420cfbc63208757f5b0
Build Type:   development

I attached the entire debug bundle (link at the bottom of the description). This is a test cluster, and I'm keeping it running in this state for the moment in the hope that there is some additional output I can provide.

  2. Please describe the issue you observed:
  • What did you do?

Start a 5-node cluster.

Issue queries to each of the nodes at a rate of ~1 per second.

Perform thousands of inserts and deletes over a period of a day or two.

As these are benchmarks, most inserts are performed serially in a 'preparation' phase and are executed one statement at a time against Node 2, followed by a few hundred insert/delete transactions distributed evenly against all 5 nodes, followed once more by serial delete statements against Node 2 to clean up after the benchmark. Throughout all these actions there is a slow background query rate of ~1/sec performed by other processes in the cluster.

I've attached a complete dump of the database and as you can see there is really very little data at the moment.
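
To make the shape of the workload concrete, here is a rough sketch of one benchmark cycle. The table and column names come from the query below; the literal values and id handling are invented for illustration, and the real schema is in the attached dump:

-- preparation phase: serial single-statement inserts against Node 2
INSERT INTO users (id, uid) VALUES (1, 'bbenchuser-00001');
INSERT INTO resources (id, rid, description) VALUES (1, 'bench-resource-00001', 'benchmark resource');
INSERT INTO aces (id, user_id, resource_id, actions) VALUES (1, 1, 1, 'full');

-- benchmark phase: a few hundred insert/delete transactions spread evenly over all 5 nodes
BEGIN;
INSERT INTO aces (id, user_id, resource_id, actions) VALUES (2, 1, 1, 'read');
DELETE FROM aces WHERE id = 2;
COMMIT;

-- cleanup phase: serial deletes against Node 2
DELETE FROM aces WHERE user_id = 1;
DELETE FROM users WHERE uid = 'bbenchuser-00001';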

  • What did you expect to see?

Inserts and queries taking milliseconds.

Specifically, queries of the following form not taking several seconds:

SELECT resources.id AS resources_id, resources.rid AS resources_rid, resources.description AS resources_description, aces.actions AS aces_actions, aces.id AS aces_id, aces.user_id AS aces_user_id, aces.group_id AS aces_group_id, aces.resource_id AS aces_resource_id 
FROM aces JOIN users ON users.id = aces.user_id JOIN resources ON resources.id = aces.resource_id 
WHERE users.uid = %(uid_1)s

CockroachDB on all nodes using some fraction of a core.

  • What did you see instead?

Queries (admittedly the aforementioned 3-way joins) taking several seconds.

CockroachDB on nodes 1, 2, 4, and 5 hovering around 10% CPU, while the process on Node 3 is pegged at ~400% CPU.

top - 13:25:46 up 1 day, 22:28,  2 users,  load average: 6.70, 6.65, 6.58
Tasks: 180 total,   1 running, 179 sleeping,   0 stopped,   0 zombie
%Cpu(s): 90.8 us,  7.3 sy,  0.0 ni,  1.8 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 16004820 total,  6722184 free,  4530252 used,  4752384 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11036620 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                   
28794 dcos_co+  20   0 3395548 1.553g  24296 S 379.1 10.2   8760:22 cockroach                                                                                                                 

I've attached the debug bundle and complete database dump.

I've also attached CPU profile SVG and goroutine stacktraces.

cockroach-debug.zip
dump.zip
cpu-profile.zip

@gpaul
Author

gpaul commented Jul 18, 2017

I don't want to check whether restarting cockroachdb helps until we're sure we've extracted all the debug info.

@gpaul
Author

gpaul commented Jul 18, 2017

FWIW I started a new 5-node cluster (same version), loaded the attached data into the cluster, then executed

SELECT resources.id AS resources_id, resources.rid AS resources_rid, resources.description AS resources_description, aces.actions AS aces_actions, aces.id AS aces_id, aces.user_id AS aces_user_id, aces.group_id AS aces_group_id, aces.resource_id AS aces_resource_id 
FROM aces JOIN users ON users.id = aces.user_id JOIN resources ON resources.id = aces.resource_id 
WHERE users.uid = 'bbenchuser-00001';

The results appeared instantly as one would expect.

@gpaul
Author

gpaul commented Jul 18, 2017

The network seems under control:

[centos@ip-10-10-0-125 ~]$ ping 10.10.0.250
PING 10.10.0.250 (10.10.0.250) 56(84) bytes of data.
64 bytes from 10.10.0.250: icmp_seq=1 ttl=64 time=0.307 ms
64 bytes from 10.10.0.250: icmp_seq=2 ttl=64 time=0.230 ms
64 bytes from 10.10.0.250: icmp_seq=3 ttl=64 time=0.288 ms
64 bytes from 10.10.0.250: icmp_seq=4 ttl=64 time=0.251 ms

@gpaul
Author

gpaul commented Jul 19, 2017

This reproduced on a new cluster running the same version. I'm going to try to reproduce it on a cluster built from the release-1.0 branch.

@gpaul
Author

gpaul commented Jul 19, 2017

Interestingly, most graphs in the Admin UI look sane, except for the Distributed dashboard, where the distribution of work between the nodes seems lopsided.

Here are some screenshots showing this dashboard for the various nodes. The sudden increase in traffic at ~09:00 coincides with the launching of my benchmarks. Note the extreme difference between node 5 and, for example, nodes 2 and 4.

CPU usage on the nodes is:
Node 1: 5%
Node 2: 93%
Node 3: 60%
Node 4: 300%
Node 5: 5%

Here are the screenshots which appear to show significant non-uniformity:

batches-node-1
batches-node-2
batches-node-3
batches-node-4
batches-node-5

@gpaul
Author

gpaul commented Jul 24, 2017

The root cause was ... insufficient indexes. The symptom of the cluster not recovering even after hours of extremely low load is still worrying though.

@gpaul gpaul closed this as completed Jul 24, 2017
@tbg
Member

tbg commented Jul 24, 2017

@gpaul thanks for your report and sorry this issue has been idle for so long! I agree that it's worrisome that the cluster did not recover when you stopped the workload. We will take a look.

@tbg tbg reopened this Jul 24, 2017
@petermattis petermattis added this to the 1.1 milestone Jul 24, 2017
@gpaul
Author

gpaul commented Jul 24, 2017

Not at all. Combining "it is slow" with "I'm running master", I wasn't expecting a massive turnout :-)

@a-robinson
Contributor

a-robinson commented Jul 24, 2017

Hey @gpaul, like @tschottdorf I'm sorry I didn't get to this last week. Thanks for all the details. I have a few follow-up questions for you:

  1. You mentioned running insert/delete statements as part of your workload. Were these statements running at the same time as the slow SELECT statement?
  2. If so, were they modifying the aces and resources tables that the SELECT query was doing full scans over?
  3. Roughly how many concurrent queries were you sending?
  4. What specifically do you mean when you say it took time to recover? I assume you mean that the server kept doing work well after you stopped sending requests, but I want to make sure I'm not assuming wrong.
  5. If so, did the clients close their connections to the server when they stopped sending requests?

While removing the full table scan should solve all your problems judging by the profile that you attached, it's clear that cockroach's behavior didn't degrade gracefully here under load. Node 3 presumably took the brunt of it because it was the leaseholder for the hot range(s).
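
For reference, indexes along these lines are probably what's needed to avoid the scans, judging only by the join and filter columns in the query from your report (a sketch, not a definitive fix; adjust to your actual schema):

CREATE INDEX ON users (uid);
CREATE INDEX ON aces (user_id);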

The NotLeaseholder errors you were seeing also look really bad, but they were recently independently discovered (#17177 (comment)) and fixed (#17196) so at least that part is taken care of.

@gpaul
Author

gpaul commented Jul 26, 2017

You mentioned running insert/delete statements as part of your workload. Were these statements running at the same time as the slow SELECT statement?

Yes.

The test I was running inserts records, deletes them, inserts more records, deletes those, and so on. All the while, SELECT requests that trigger table scans are executing at an average rate of one every few seconds, with enough variance that occasional queries arrive back-to-back.
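
(As a side note, the scans are easy to confirm against the attached dump with EXPLAIN; I'm omitting the plan output here since its format varies by version.)

EXPLAIN SELECT resources.id, aces.id, users.uid
FROM aces JOIN users ON users.id = aces.user_id JOIN resources ON resources.id = aces.resource_id
WHERE users.uid = 'bbenchuser-00001';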

If so, were they modifying the aces and resources tables that the SELECT query was doing full scans over?

Yes.

The test adds users, adds groups, adds users to groups, then adds a bunch of resources and adds users and groups to those resources through the aces table.

Roughly how many concurrent queries were you sending?

Very few. The test itself runs serially against a single node, with SELECTs being performed by background services on all nodes at the aforementioned rate.

What specifically do you mean when you say it took time to recover? I assume you mean that the server kept doing work well after you stopped sending requests, but want to make sure I'm not assuming wrong.

I suspected that load was causing the performance degradation, so I stopped the test and left the cluster running overnight. The next morning the CockroachDB cluster still showed very high CPU usage on node 3. In retrospect this is probably simply due to large table scans.

Given the relatively small amount of live data (attached previously), that was still surprising. If you load the database I attached and execute some queries, you'll find that it is pretty snappy. This leads me to suspect that deleted records incur a penalty when table scanning. Does that seem sensible?

If so, did the clients close their connections to the server when they stopped sending requests?

That's a good question. I'm using the Python SQLAlchemy library's Session abstraction, which is likely pooling connections. Should I be using per-transaction connections instead?

@gpaul
Author

gpaul commented Jul 26, 2017

Also relevant to my last point: there is only one client throughout, but that client uses connection pooling.

@a-robinson
Contributor

This leads me to suspect that deleted records incur a penalty when table scanning. Does that seem sensible?

Your suspicion is correct. That's most likely what you ran into here. We keep the deleted records in-line with the live records until we GC them, which by default won't happen for 24 hours. This isn't a big deal if you're doing point lookups of records (as would be the case after you added indexes), but will affect table scans.
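
If waiting out the 24 hours isn't practical for a workload like this, the GC TTL is also a per-table zone configuration setting (gc.ttlseconds) that can be lowered. As a rough sketch only, since the mechanism depends on the CockroachDB version (1.0-era clusters set it through the cockroach zone set CLI, while later releases expose it in SQL roughly like this):

ALTER TABLE aces CONFIGURE ZONE USING gc.ttlseconds = 600;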

This is an issue that we've recently started discussing more, since as you've learned it can cause pretty surprising performance degradation in workloads with a lot of updates and deletes. I'm going to close this as a known issue (with #17229 serving as a less specific tracking issue for it), but thanks again for reporting the problem you were having!

cc @andreimatei @tschottdorf @bdarnell as interested parties from the forum discussion on TTLs.
