etcd performance issues resolved via drop_caches? #12842
We run a large number (60+) of Kubernetes clusters across both bare metal and virtualized environments, each one with either 1 or 2 etcd quorums co-located with the rest of the Kubernetes control plane (in the ones with 2 etcd quorums, Kubernetes data is stored in one and Calico data is stored in the other). Nearly all systems are running Ubuntu 18.04 with a kernel of the 4.15.0 lineage.
As time progressed, we began noticing curious performance issues with etcd, both 3.3.x and 3.4.x, that seemed to be correlated to the uptime of the control plane nodes. etcd would routinely complain about slow read-only range requests even on trivial fetches for single keys, and as the uptime of the machines increased, the slowness would get worse. At its worst, on machines with close to a year of uptime, trivial fetches could sometimes take 5+ seconds to complete. As you might expect, this made both etcd and Kubernetes really unhappy, leading to constant leader switches and Kubernetes control plane components crashing and restarting frequently.
I went through a lot of debugging steps while researching the problem: disk performance on the nodes, added load from our own controllers being scheduled onto the control plane nodes, the etcd version, restarting etcd itself, restarting the Kubernetes control plane components, network delays, buddy allocator memory fragmentation, poking at it with various eBPF scripts, and so on. Ultimately I came up with nothing other than the aforementioned correlation between uptime and the performance issues. We did determine that rebooting nodes would resolve the issue for a while, but it would always come back.
Finally, as a Hail Mary pass, I ran a little script on one node:
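A minimal sketch of such a script, using the standard /proc/sys/vm interfaces (the sync step and the exact ordering here are assumptions, not the verbatim original):

sync                                   # flush dirty pages to disk first
echo 1 > /proc/sys/vm/compact_memory   # ask the kernel to compact free memory
echo 3 > /proc/sys/vm/drop_caches      # drop the page cache plus dentries and inodes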
...and the performance issues disappeared within 1-2 minutes after the last command returned. I reran the individual commands on other affected nodes and determined that the compact_memory change did not have an effect, but drop_caches definitely did. I tried the same script out on other Kubernetes clusters and confirmed that it resolved the performance issues there as well.
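For anyone wanting to reproduce the observation, the effect on the page cache can be watched directly in /proc/meminfo before and after the drop (which fields to watch is just a suggestion):

grep -E '^(MemFree|Buffers|Cached)' /proc/meminfo   # note the Cached figure
echo 3 > /proc/sys/vm/drop_caches                   # drop page cache, dentries, inodes
grep -E '^(MemFree|Buffers|Cached)' /proc/meminfo   # Cached should fall, MemFree should rise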
Right now we are applying a mitigation that performs a weekly drop_caches on our control plane nodes during low-activity periods to keep etcd happy. This is fine as a tactical fix, but I'm opening this issue as a discussion point to see if anyone has insight into why this actually fixes the problem, because it's a pretty big hammer.
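For reference, a mitigation like this can be as simple as a root cron entry (cron and the Sunday 03:00 schedule here are illustrative, not necessarily what we deploy):

# /etc/cron.d/weekly-drop-caches (hypothetical): flush dirty pages, then drop caches
0 3 * * 0  root  sync && echo 3 > /proc/sys/vm/drop_caches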