K3s regularly spikes in cpu usage and crashes in a loop #7786
Replies: 12 comments
-
That usually indicates that the storage isn't sufficient to keep up with the IO requirements of the datastore. By "HA ready configuration", I assume you mean a single-node etcd cluster? Etcd has much higher IO requirements than sqlite, and really won't work well on anything except an SSD. What sort of disk are you running k3s on? Are you using the same physical media for the datastore, your workload, and the containerd image store?
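For reference, a quick way to sanity-check whether a disk can sustain etcd's fsync pattern is the fio benchmark the etcd docs recommend. A minimal sketch, assuming fio is installed and that `/var/lib/rancher/k3s` sits on the disk in question (the scratch directory and job name are placeholders):

```
# Create a scratch directory on the same filesystem as the k3s datastore
mkdir -p /var/lib/rancher/k3s/fio-test

# Write 22 MiB in 2300-byte chunks with an fdatasync after every write,
# which roughly mimics etcd's WAL pattern. The reported fsync latency
# percentiles (p99) should ideally stay below ~10 ms.
fio --name=etcd-disk-check \
    --directory=/var/lib/rancher/k3s/fio-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300

# Clean up afterwards
rm -rf /var/lib/rancher/k3s/fio-test
```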
-
The thing is that etcd provides native snapshotting, and this is a server we might want to expand with worker nodes in the future. The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, workload, and containerd images, as this is a VM that only has a single volume provisioned. What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed. Is there any way to increase the resilience to these effects?
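As an aside on the snapshotting point: with the embedded etcd backend, k3s can take snapshots on demand as well as on a schedule. A rough sketch, with the subcommand and flag names as I remember them and an arbitrary snapshot name:

```
# Take an on-demand snapshot of the embedded etcd datastore
k3s etcd-snapshot save --name manual-snapshot

# List the snapshots known to this node
k3s etcd-snapshot ls
```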
-
Have you profiled the disk for latency and throughput? Keep in mind that etcd makes a blocking fsync to disk on every write, so write latency matters more than raw throughput.
Is this perhaps occurring during some background operation, such as a scheduled trim?
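One way to check for that kind of scheduled background work (assuming a systemd-based distro like AlmaLinux) and to correlate it with the incidents:

```
# Is a periodic fstrim timer enabled, and when did it last/next fire?
systemctl list-timers fstrim.timer

# List all timers, to correlate firing times with the incident window
systemctl list-timers --all

# Pull k3s/etcd warnings from around the time of the last spike
journalctl -u k3s --since "2 hours ago" | grep -iE "slow|took too long|leader"
```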
-
Yes. Throughput and latency should be more than sufficient.
No, it happens seemingly at random. Manually running trim operations doesn't trigger this, and writing a lot of data doesn't trigger it either. I honestly don't know what triggers it. iostat while it happens shows the storage not under any particular pressure, while CPU usage is maxed out, all taken by a multitude of k3s processes. Is there any sort of process I might follow to help me debug the cause of this?
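Not an official procedure, just the first things worth capturing while the spike is happening, using standard Linux tools (I believe k3s also has an `--enable-pprof` server flag that exposes Go profiling endpoints if you want to go deeper, but the commands below make no k3s-specific assumptions):

```
# Find the k3s server process (there should be exactly one)
pid=$(pgrep -xo k3s)

# Per-thread CPU usage every 5 seconds; shows whether the load is spread
# across many worker threads or concentrated in a few
pidstat -t -u -p "$pid" 5

# Live view of which functions the CPU time is actually being spent in
perf top -p "$pid"
```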
-
Can you clarify what you mean by that? There should normally only be 1 k3s process (either server or agent) per node.
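For what it's worth, a quick way to tell processes apart from threads (plain procps, nothing k3s-specific):

```
# Processes whose name is exactly "k3s" (normally a single server or agent)
pgrep -ax k3s

# Number of threads inside the main k3s process
ps -o nlwp= -p "$(pgrep -xo k3s)"
```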
-
Multiple OS threads, I'd say. I've attached a view of htop and iostat while the problem is occurring. As I said, iostat is rather unproblematic: iowait is low, but for some reason k3s uses a ton of CPU and times out etcd requests.
-
If you're using iostat, can you look at the f_await and aqu-sz columns? https://manpages.debian.org/testing/sysstat/iostat.1.en.html#f_await
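For reference, those columns only show up in iostat's extended output; a sketch of the invocation (the device name below is just a placeholder for whatever disk backs /var/lib/rancher/k3s):

```
# Extended device statistics every 5 seconds; includes aqu-sz, and on
# sysstat builds that support it, f_await (flush request latency)
iostat -dx 5

# Or limit it to a single device
iostat -dx 5 /dev/vda
```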
-
Will do the next time this problem shows up. Since this is a production server, restoring service had priority over waiting around to debug this.
-
So aqu-sz is really low, and f_await isn't tracked, which might be because the underlying disk is a virtual disk. What I've also noticed is a spike once the problem hits; just to give you an idea of the thread sprawl that k3s seems to produce in this situation, the machine basically becomes unusable at this point. IO latency also spikes (if I run a benchmark like bonnie++ at the same time), however it's hard to tell whether that is simply because of the high load on the machine. Shutting down k3s returns it to normal, after which k3s can be restarted and normal operation resumes.
I won't rule out that this is an issue that crops up due to the underlying machine. What I find weird is that it takes quite some time to happen the first time, even with the full load of cluster startup applied directly at the beginning (which I assume is the point of highest load for most use cases). There seems to be some sort of trigger that puts k3s into this state, from which recovery takes a full stop of k3s before it can be started again, which then leads to a standard startup that works fine. I wonder why the CPU load spikes this way, however; it seems there is work being done, since the processes aren't IO blocked. I wonder what work is being done.
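To answer the "what work is being done" question, the next time it happens it might be worth recording on-CPU stacks for offline analysis. A sketch with perf, assuming perf is installed and the kernel permits profiling:

```
# Record 30 seconds of call stacks from the k3s process while the spike is active
perf record -g -p "$(pgrep -xo k3s)" -- sleep 30

# Summarise where the CPU time went
perf report --stdio | head -n 60
```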
-
I haven't seen any updates on this in a bit; were you able to identify a root cause of the CPU load?
-
@brandond Unfortunately not. The issue persists and crops up every once in a while. I can't rule out that it has something to do with disk IO being too slow; however, we've switched VM hosts etc. It might be related to other VMs running on the same host. Since we don't control the host directly, it's hard to tell whether the problems coincide with high IO load on the host system.
However, given that this crops up semi-regularly, that the only way to make it stop is to reboot the machine (after which the problem goes away), and that a simple restart of k3s without a reboot does nothing to alleviate the issue, it might be related to some sort of failure in the IO system that gets cleared by the reboot. It's still extremely weird, and I'd love to go a second round on this if you have any further ideas on what to test, especially since the very high CPU load is a bit odd. It basically drives the whole system into the ground, because k3s takes up so much of the resources that the existing containers start to stall (followed by k3s itself stalling, timing out, restarting, and repeating the cycle). I get the feeling that, since k3s is supposed to run on even low-end systems, this is not something that should be happening on this type of host, even using the etcd backend.
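Since it only happens every few days on a production box, one option is to leave a small watchdog running that captures diagnostics automatically when the spike starts. This is a hypothetical sketch, not part of any k3s tooling; the threshold, output path, and intervals are arbitrary:

```
#!/usr/bin/env bash
# Hypothetical watchdog sketch: poll the k3s process and dump diagnostics
# whenever its CPU usage crosses a threshold, so the next occurrence is
# captured even if nobody is watching at the time.
set -u
OUT=/var/log/k3s-cpu-incidents   # arbitrary output location
THRESHOLD=400                    # percent CPU; the reported spikes reach 700-800%
mkdir -p "$OUT"

while true; do
  pid=$(pgrep -xo k3s) || { sleep 30; continue; }

  # Take two top samples so the second one reflects recent usage rather than
  # the lifetime average; field 9 is %CPU in top's batch output.
  cpu=$(top -b -n 2 -d 1 -p "$pid" | awk -v p="$pid" '$1 == p { c = int($9) } END { print c + 0 }')

  if [ "$cpu" -ge "$THRESHOLD" ]; then
    ts=$(date +%Y%m%dT%H%M%S)
    top -H -b -n 1 -p "$pid"                   > "$OUT/threads-$ts.txt"
    iostat -dx 1 5                             > "$OUT/iostat-$ts.txt"
    journalctl -u k3s --since "10 minutes ago" > "$OUT/journal-$ts.txt"
    sleep 300   # back off so an ongoing incident doesn't flood the disk
  fi
  sleep 30
done
```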
-
Converting this to a discussion, as at this point it doesn't seem to be directly related to a k3s bug.
-
Environmental Info:
K3s Version:
k3s version v1.24.3+k3s1 (990ba0e)
go version go1.18.1
Node(s) CPU architecture, OS, and Version:
AlmaLinux 8.6, x86_64, 8 cores
Cluster Configuration:
1 Server in HA ready configuration
Describe the bug:
k3s regularly spikes to 700-800% CPU usage until it crashes because leader election is lost (to itself). It then restarts and repeats this loop.
This happens about every 2-3 days.
Additional context / logs:
Manually stopping k3s with `systemctl stop k3s` and restarting it fixes it temporarily.
According to the logs etcd apply takes up to 6 seconds sometimes.
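For anyone hitting the same symptom: the slow-apply warnings come from the embedded etcd and end up in the k3s journal, so they can be pulled out directly. Exact message strings vary a bit between etcd versions, so treat the patterns below as approximations:

```
# Slow apply warnings (the "6 seconds" entries mentioned above)
journalctl -u k3s | grep -i "apply request took too long"

# Heartbeat / disk-latency warnings that usually accompany them
journalctl -u k3s | grep -iE "failed to send out heartbeat|slow fdatasync"
```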