very high CPU usage on one of 5 nodes in cluster #17075
I don't want to see whether restarting cockroachdb helps before we're sure we've extracted all the debug info.
FWIW I started a new 5-node cluster (same version), loaded the attached data into the cluster, then executed
The results appeared instantly, as one would expect.
The network seems under control:
This reproduced on a new cluster, running the same versions. I'm going to try and reproduce on a cluster built from the
Interestingly, most graphs in the Admin UI look sane, except for the Distributed dashboard, where the distribution between the nodes seems lopsided. Here are some screenshots showing this dashboard for the various nodes. The sudden increase in traffic at ~09:00 coincides with the launch of my benchmarks. Note the extreme difference between node 5 and, for example, nodes 2 and 4. CPU usage on the nodes is: Here are the screenshots, which appear to show significant non-uniformity:
The root cause was ... insufficient indexes. The symptom of the cluster not recovering even after hours of extremely low load is still worrying though.
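For anyone else landing here with the same symptom, a minimal sketch of the kind of fix involved: adding a secondary index so the hot query becomes an index lookup rather than a full table scan, then checking the plan with EXPLAIN. The table, column, index name, and connection URL below are hypothetical, not the schema from this issue.

```python
# Sketch: add a secondary index so the hot query becomes an index lookup
# instead of a full table scan, then inspect the plan with EXPLAIN.
# Table, column, index name, and connection URL are hypothetical.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://root@localhost:26257/bench")

with engine.begin() as conn:
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS users_group_id_idx ON users (group_id)"
    ))

with engine.connect() as conn:
    for row in conn.execute(text("EXPLAIN SELECT * FROM users WHERE group_id = 42")):
        print(row)  # the plan should now show an index lookup rather than a full scan
```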
@gpaul thanks for your report and sorry this issue has been idle for so long! I agree that it's worrisome that the cluster did not recover when you stopped the workload. We will take a look.
Not at all; combining "it is slow" with "I'm running master", I wasn't expecting a massive turnout :-)
Hey @gpaul, like @tschottdorf I'm sorry I didn't get to this last week. Thanks for all the details. I have a few follow-up questions for you:
While removing the full table scan should solve all your problems judging by the profile that you attached, it's clear that cockroach's behavior didn't degrade gracefully here under the load. Node 3 presumably took the brunt of it because it was the leaseholder for the hot range(s). The
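As a rough way to check the leaseholder theory, here is a sketch of inspecting per-range lease information; the SHOW RANGES statement is from newer CockroachDB versions than the one discussed in this issue, and the table name and URL are illustrative.

```python
# Sketch: list a table's ranges and their lease/replica info, to see whether
# one node holds the lease for a hot range. SHOW RANGES postdates the release
# discussed here; table name and connection URL are illustrative.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://root@localhost:26257/bench")

with engine.connect() as conn:
    for row in conn.execute(text("SHOW RANGES FROM TABLE resources")):
        print(row)  # inspect the per-range leaseholder/replica information
```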
Yes. The test I was running inserts records, deletes them, inserts more records, deletes those, and so on. All the while, SELECT requests that trigger table scans execute at an average rate of one every few seconds, with enough variance that occasional queries arrive back-to-back.
Yes. The test adds users, adds groups, adds users to groups, then adds a bunch of resources and adds users and groups to those resources through the
Very few. The test itself runs serially against a single node, with SELECTs being performed by background services on all nodes at the aforementioned rate.
I suspected that the load was causing the performance degradation, so I stopped the test and left the cluster running overnight. The next morning the CockroachDB cluster still showed very high CPU usage on node 3. In retrospect this is probably simply due to large table scans, but given the relatively small amount of live data (attached previously) it was still surprising. If you load the database I attached and execute some queries, you'll find that it is pretty snappy. This leads me to suspect that deleted records incur a penalty when table scanning. Does that seem sensible?
That's a good question. I'm using the Python SQLAlchemy library's
Also relevant to my last point: there is only one client throughout, but that client uses connection pooling.
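For context on the client side, here is a minimal sketch of a pooled SQLAlchemy engine of the kind described; the connection URL, pool sizes, and query are assumptions, not the benchmark's actual code.

```python
# Minimal sketch of a pooled SQLAlchemy engine; the URL, pool sizes, and
# query below are illustrative assumptions, not the benchmark's actual code.
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql://root@node2.example.com:26257/bench",  # hypothetical node/database
    pool_size=5,         # connections kept open in the pool
    max_overflow=10,     # extra connections allowed under bursts
    pool_pre_ping=True,  # check connections are alive before reuse
)

# Every checkout below reuses a pooled connection from the single client process.
with engine.connect() as conn:
    print(conn.execute(text("SELECT count(*) FROM users")).scalar())
```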
Your suspicion is correct. That's most likely what you ran into here. We keep the deleted records in-line with the live records until we GC them, which by default won't happen for 24 hours. This isn't a big deal if you're doing point lookups of records (as would be the case after you added indexes), but will affect table scans. This is an issue that we've recently started discussing more, since as you've learned it can cause pretty surprising performance degradation in workloads with a lot of updates and deletes. I'm going to close this as a known issue (with #17229 serving as a less specific tracking issue for it), but thanks again for reporting the problem you were having! cc @andreimatei @tschottdorf @bdarnell as interested parties from the forum discussion on TTLs.
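For readers hitting the same penalty from scanning over deleted rows, here is a sketch of lowering the GC TTL so deletes are cleaned up sooner than the 24-hour default. The CONFIGURE ZONE syntax shown is from newer CockroachDB releases than the one discussed in this issue (which configured zones via cockroach zone set instead), and the table name, TTL value, and URL are illustrative.

```python
# Sketch: lower the GC TTL on a table so deleted rows are garbage-collected
# sooner than the 24-hour default. The CONFIGURE ZONE syntax is from newer
# CockroachDB releases than the one in this issue; the table name, TTL value,
# and connection URL are illustrative only.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://root@localhost:26257/bench")

with engine.connect().execution_options(isolation_level="AUTOCOMMIT") as conn:
    conn.execute(text("ALTER TABLE resources CONFIGURE ZONE USING gc.ttlseconds = 600"))
```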
Is this a question, feature request, or bug report?
BUG REPORT
Please supply the header (i.e. the first few lines) of your most recent log file for each node in your cluster. On most unix-based systems running with defaults, this boils down to the output of
grep -F '[config]' cockroach-data/logs/cockroach.log
When log files are not available, supply the output of
cockroach version
and all flags/environment variables passed to
cockroach start
instead.
This is a CockroachDB cluster running a fairly recent master commit. Not ideal, but I'm doing testing that relies on functionality that will only be released in 1.0.4.
I attached the entire debug bundle (link at the bottom of the description). This is a test cluster that I'm keeping running in this state. I am keeping it running for the moment in the hope that there is some additional output I can provide.
Start a 5-node cluster.
Issue queries to each of the nodes at a rate of ~1 per second.
Perform thousands of inserts and deletes over a period of a day or two.
As these are benchmarks, most inserts are performed serially in a 'preparation' phase and are executed one statement at a time against Node 2, followed by a few hundred insert/delete transactions distributed evenly against all 5 nodes, followed once more by serial delete statements against Node 2 to clean up after the benchmark. Throughout all these actions there is a slow background query rate of ~1/sec performed by other processes in the cluster.
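For reference, here is a minimal sketch of the workload shape described above: serial inserts against node 2, a few hundred insert/delete transactions spread across the nodes, serial deletes to clean up, and a slow background SELECT throughout. The schema, node URLs, row counts, and rates are illustrative assumptions, not the actual benchmark code.

```python
# Sketch of the workload shape described above; schema, node URLs, row counts,
# and rates are illustrative assumptions, not the actual benchmark.
import threading
import time

from sqlalchemy import create_engine, text

# One engine per node; URLs are hypothetical.
nodes = [
    create_engine(f"postgresql://root@node{i}.example.com:26257/bench")
    for i in range(1, 6)
]

def background_selects(stop):
    # Roughly one SELECT per second, rotating across nodes, like the
    # background services described above.
    n = 0
    while not stop.is_set():
        with nodes[n % 5].connect() as conn:
            conn.execute(text("SELECT count(*) FROM resources")).scalar()
        n += 1
        time.sleep(1)

stop = threading.Event()
threading.Thread(target=background_selects, args=(stop,), daemon=True).start()

# Preparation phase: serial inserts against node 2.
with nodes[1].begin() as conn:
    for i in range(1000):
        conn.execute(text("INSERT INTO resources (id) VALUES (:id)"), {"id": i})

# Benchmark phase: a few hundred insert/delete transactions spread over all nodes.
for i in range(1000, 1300):
    with nodes[i % 5].begin() as conn:
        conn.execute(text("INSERT INTO resources (id) VALUES (:id)"), {"id": i})
        conn.execute(text("DELETE FROM resources WHERE id = :id"), {"id": i})

# Cleanup phase: serial deletes against node 2.
with nodes[1].begin() as conn:
    conn.execute(text("DELETE FROM resources WHERE id < 1000"))

stop.set()
```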
I've attached a complete dump of the database and as you can see there is really very little data at the moment.
Inserts and queries taking milliseconds.
Specifically, queries of the following form not taking several seconds:
CockroachDB on all nodes using some fraction of a core.
Queries (admittedly the aforementioned 3-way joins) taking several seconds.
CockroachDB on nodes 1,2,4,5 hovering around 10% CPU while the process on Node 3 is pegged at ~400% CPU.
I've attached the debug bundle and complete database dump.
I've also attached CPU profile SVG and goroutine stacktraces.
cockroach-debug.zip
dump.zip
cpu-profile.zip