K3s regularly spikes in cpu usage and crashes in a loop #7786
Replies: 12 comments
-
That usually indicates that the storage isn't sufficient to keep up with the IO requirements of the datastore. By "HA ready configuration", I assume you mean a single-node etcd cluster? Etcd has much higher IO requirements than sqlite, and really won't work well on anything except an SSD. What sort of disk are you running k3s on? Are you using the same physical media for the datastore, your workload, and the containerd image store?
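For reference, a quick way to sanity-check whether a disk can sustain etcd's fsync pattern is the fio benchmark the etcd docs recommend. A minimal sketch, assuming fio is installed and that `/var/lib/rancher/k3s` sits on the disk in question (the scratch directory and job name are placeholders):

```
# Create a scratch directory on the same filesystem as the k3s datastore
mkdir -p /var/lib/rancher/k3s/fio-test

# Write 22 MiB in 2300-byte chunks with an fdatasync after every write,
# which roughly mimics etcd's WAL pattern. The reported fsync latency
# percentiles (p99) should ideally stay below ~10 ms.
fio --name=etcd-disk-check \
    --directory=/var/lib/rancher/k3s/fio-test \
    --rw=write --ioengine=sync --fdatasync=1 \
    --size=22m --bs=2300

# Clean up afterwards
rm -rf /var/lib/rancher/k3s/fio-test
```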
-
The thing is that etcd provides native snapshotting, and this is a server we might want to expand with worker nodes in the future. The underlying storage is NVMe SSDs that should have sufficient headroom for fast access. The same physical medium is used for the datastore, workload, and containerd images, as this is a VM that only has a single volume provisioned. What's surprising to me is that during initial startup this isn't a problem at all; it only seems to become a problem once some time has passed. Is there any way to increase the resilience to these effects?
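As an aside on the snapshotting point: with the embedded etcd backend, k3s can take snapshots on demand as well as on a schedule. A rough sketch, with the subcommand and flag names as I remember them and an arbitrary snapshot name:

```
# Take an on-demand snapshot of the embedded etcd datastore
k3s etcd-snapshot save --name manual-snapshot

# List the snapshots known to this node
k3s etcd-snapshot ls
```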
-
Have you profiled the disk for latency and throughput? Keep in mind that etcd makes a blocking fsync to disk on every write, so write latency matters more than raw throughput.
Is this perhaps occurring during some background operation, such as a scheduled trim?
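One way to check for that kind of scheduled background work (assuming a systemd-based distro like AlmaLinux) and to correlate it with the incidents:

```
# Is a periodic fstrim timer enabled, and when did it last/next fire?
systemctl list-timers fstrim.timer

# List all timers, to correlate firing times with the incident window
systemctl list-timers --all

# Pull k3s/etcd warnings from around the time of the last spike
journalctl -u k3s --since "2 hours ago" | grep -iE "slow|took too long|leader"
```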
-
Yes. Throughput and latency should be more than sufficient.
No, it happens seemingly at random. Manually running trim operations doesn't trigger this, and writing a lot of data doesn't trigger it either. I honestly don't know what triggers it. iostat while it happens shows the storage not under any particular pressure, while CPU usage is maxed out, all taken by a multitude of k3s processes. Is there any sort of process I might follow to help me debug the cause of this?
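Not an official procedure, just the first things worth capturing while the spike is happening, using standard Linux tools (I believe k3s also has an `--enable-pprof` server flag that exposes Go profiling endpoints if you want to go deeper, but the commands below make no k3s-specific assumptions):

```
# Find the k3s server process (there should be exactly one)
pid=$(pgrep -xo k3s)

# Per-thread CPU usage every 5 seconds; shows whether the load is spread
# across many worker threads or concentrated in a few
pidstat -t -u -p "$pid" 5

# Live view of which functions the CPU time is actually being spent in
perf top -p "$pid"
```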
-
Can you clarify what you mean by that? There should normally only be 1 k3s process (either server or agent) per node.
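For what it's worth, a quick way to tell processes apart from threads (plain procps, nothing k3s-specific):

```
# Processes whose name is exactly "k3s" (normally a single server or agent)
pgrep -ax k3s

# Number of threads inside the main k3s process
ps -o nlwp= -p "$(pgrep -xo k3s)"
```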
-
Multiple OS threads, I'd say. I've attached a view of htop and iostat while the problem is occurring. As I said, iostat is rather unproblematic: iowait is low, but for some reason k3s uses a ton of CPU and times out etcd requests.
-
If you're using iostat, can you look at the f_await and aqu-sz columns? https://manpages.debian.org/testing/sysstat/iostat.1.en.html#f_await
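For reference, those columns only show up in iostat's extended output; a sketch of the invocation (the device name below is just a placeholder for whatever disk backs /var/lib/rancher/k3s):

```
# Extended device statistics every 5 seconds; includes aqu-sz, and on
# sysstat builds that support it, f_await (flush request latency)
iostat -dx 5

# Or limit it to a single device
iostat -dx 5 /dev/vda
```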
-
Will do the next time this problem shows up. Since this is a production server, restoring service had priority over waiting around to debug this.
-
So aqu-sz is really low, and f_await isn't tracked, which might be because the underlying disk is a virtual disk. What I've also noticed is a spike once the problem hits; just to give you an idea of the thread sprawl that k3s seems to produce in this situation, the machine basically becomes unusable at this point. IO latency also spikes (if I run a benchmark like bonnie++ at the same time), however it's hard to tell whether that is simply because of the high load on the machine. Shutting down k3s returns it to normal, after which k3s can be restarted and normal operation resumes.
I won't rule out that this is an issue that crops up due to the underlying machine. What I find weird is that it takes quite some time to happen the first time, even with the full load of cluster startup applied directly at the beginning (which I assume is the point of highest load for most use cases). There seems to be some sort of trigger that puts k3s into this state, from which recovery takes a full stop of k3s before it can be started again, which then leads to a standard startup that works fine. I wonder why the CPU load spikes this way, however; it seems there is work being done, since the processes aren't IO blocked. I wonder what work is being done.
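To answer the "what work is being done" question, the next time it happens it might be worth recording on-CPU stacks for offline analysis. A sketch with perf, assuming perf is installed and the kernel permits profiling:

```
# Record 30 seconds of call stacks from the k3s process while the spike is active
perf record -g -p "$(pgrep -xo k3s)" -- sleep 30

# Summarise where the CPU time went
perf report --stdio | head -n 60
```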
-
I haven't seen any updates on this in a bit; were you able to identify a root cause of the CPU load?
-
@brandond Unfortunately not. The issue persists and crops up every once in a while. I can't rule out that it has something to do with disk IO being too slow; however, we've switched VM hosts etc. It might be related to other VMs running on the same host. Since we don't control the host directly, it's hard to tell whether the problems coincide with high IO load on the host system.
However, given that this crops up semi-regularly, that the only way to make it stop is to reboot the machine (after which the problem goes away), and that a simple restart of k3s without a reboot does nothing to alleviate the issue, it might be related to some sort of failure in the IO system that gets cleared by the reboot. It's still extremely weird, and I'd love to go a second round on this if you have any further ideas on what to test, especially since the very high CPU load is a bit odd. It basically drives the whole system into the ground, because k3s takes up so much of the resources that the existing containers start to stall (followed by k3s itself stalling, timing out, restarting, and repeating the cycle). I get the feeling that, since k3s is supposed to run on even low-end systems, this is not something that should be happening on this type of host, even using the etcd backend.
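Since it only happens every few days on a production box, one option is to leave a small watchdog running that captures diagnostics automatically when the spike starts. This is a hypothetical sketch, not part of any k3s tooling; the threshold, output path, and intervals are arbitrary:

```
#!/usr/bin/env bash
# Hypothetical watchdog sketch: poll the k3s process and dump diagnostics
# whenever its CPU usage crosses a threshold, so the next occurrence is
# captured even if nobody is watching at the time.
set -u
OUT=/var/log/k3s-cpu-incidents   # arbitrary output location
THRESHOLD=400                    # percent CPU; the reported spikes reach 700-800%
mkdir -p "$OUT"

while true; do
  pid=$(pgrep -xo k3s) || { sleep 30; continue; }

  # Take two top samples so the second one reflects recent usage rather than
  # the lifetime average; field 9 is %CPU in top's batch output.
  cpu=$(top -b -n 2 -d 1 -p "$pid" | awk -v p="$pid" '$1 == p { c = int($9) } END { print c + 0 }')

  if [ "$cpu" -ge "$THRESHOLD" ]; then
    ts=$(date +%Y%m%dT%H%M%S)
    top -H -b -n 1 -p "$pid"                   > "$OUT/threads-$ts.txt"
    iostat -dx 1 5                             > "$OUT/iostat-$ts.txt"
    journalctl -u k3s --since "10 minutes ago" > "$OUT/journal-$ts.txt"
    sleep 300   # back off so an ongoing incident doesn't flood the disk
  fi
  sleep 30
done
```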
-
Converting this to a discussion, as at this point it doesn't seem to be directly related to a k3s bug.
-
Environmental Info:
K3s Version:
k3s version v1.24.3+k3s1 (990ba0e)
go version go1.18.1
Node(s) CPU architecture, OS, and Version:
AlmaLinux 8.6, x86_64, 8 cores
Cluster Configuration:
1 Server in HA ready configuration
Describe the bug:
k3s regularly spikes to 700-800% CPU usage until it crashes because leader election is lost (to itself). It then restarts and repeats this loop.
This happens about every 2-3 days.
Additional context / logs:
Manually stopping k3s with `systemctl stop k3s` and restarting it fixes it temporarily.
According to the logs etcd apply takes up to 6 seconds sometimes.
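For anyone hitting the same symptom: the slow-apply warnings come from the embedded etcd and end up in the k3s journal, so they can be pulled out directly. Exact message strings vary a bit between etcd versions, so treat the patterns below as approximations:

```
# Slow apply warnings (the "6 seconds" entries mentioned above)
journalctl -u k3s | grep -i "apply request took too long"

# Heartbeat / disk-latency warnings that usually accompany them
journalctl -u k3s | grep -iE "failed to send out heartbeat|slow fdatasync"
```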