etcd performance issues resolved via drop_caches? #12842

Closed
erhudy opened this issue Apr 8, 2021 · 4 comments

erhudy commented Apr 8, 2021

We run a large number (60+) of Kubernetes clusters across both bare metal and virtualized environments, each one with either 1 or 2 etcd quorums co-located with the rest of the Kubernetes control plane (in the ones with 2 etcd quorums, Kubernetes data is stored in one and Calico data is stored in the other). Nearly all systems are running Ubuntu 18.04 with a kernel of the 4.15.0 lineage.

As time progressed, we began noticing curious performance issues with etcd, both 3.3.x and 3.4.x, that seemed to be correlated to the uptime of the control plane nodes. etcd would routinely complain about slow read-only range requests even on trivial fetches for single keys, and as the uptime of the machines increased, the slowness would get worse. At its worst, on machines with close to a year of uptime, trivial fetches could sometimes take 5+ seconds to complete. As you might expect, this made both etcd and Kubernetes really unhappy, leading to constant leader switches and Kubernetes control plane components crashing and restarting frequently.

I went through a whole lot of debugging steps while researching the problem: disk performance on the nodes, added load from our own controllers being scheduled onto the control plane nodes, the etcd version, restarting etcd itself, restarting the Kubernetes control plane components, network delays, buddy allocator memory fragmentation, poking at it with various eBPF scripts, etc., and ultimately came up with nothing other than the aforementioned correlation between uptime and the performance issues. We did determine that rebooting nodes would resolve the issue for a while, but it would always come back.
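For reference, these were roughly the kinds of spot checks involved (a sketch only; the paths and sizes here are illustrative, not our exact commands):

# fsync latency in the etcd data directory (path is illustrative)
fio --name=etcd-fsync --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=22m --directory=/var/lib/etcd
# buddy allocator state; very few high-order free blocks would suggest fragmentation
cat /proc/buddyinfo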

Finally, as a Hail Mary pass, I ran a little script on one node:

sync                                  # flush dirty pages to disk first
echo 1 > /proc/sys/vm/compact_memory  # trigger memory compaction
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache plus dentries and inodes

...and the performance issues disappeared within 1-2 minutes after the last command returned. I reran the individual commands on other affected nodes and determined that the compact_memory change did not have an effect, but drop_caches definitely did. I tried the same script out on other Kubernetes clusters and confirmed that it resolved the performance issues there as well.

Right now we are applying a mitigation that does a weekly drop_caches on our control plane nodes during low-activity periods to keep etcd happy. This is fine as a tactical fix, but I'm opening this issue as a discussion point to see if anyone has insight into why this actually fixes the problem, because it's a pretty big hammer.
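The mitigation is essentially a scheduled job along these lines (a sketch only; the exact schedule and tooling we use differ):

# /etc/cron.d/drop-caches -- illustrative, run during a low-activity window
# m  h  dom mon dow  user  command
30   3  *   *   0    root  /bin/sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'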

ptabor commented Apr 8, 2021

  1. Do you see a difference in the number of major page faults etcd experiences before vs. after dropping caches? (One way to check is sketched below.)

  2. Is it possible that another process running on the node is creating memory pressure, leaving etcd unable to keep the whole mmapped bbolt data file in RAM? --experimental-memory-mlock support #12750 might be a potential mitigation for this issue.
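For (1), something along these lines could be used to compare the counters before and after (a sketch; it assumes a single process named etcd on the node):

# cumulative minor/major fault counters for the etcd process
ps -o pid,min_flt,maj_flt,rss,comm -C etcd
# or sample the fault rate over time (pidstat from sysstat)
pidstat -r -p "$(pidof etcd)" 5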

erhudy commented Apr 20, 2021

  1. I ran the script on a node that was showing moderate symptoms of the issue, and the page fault graph looks mostly the same before and after (the spike is when the script was actually running).

[screenshot: page fault graph for the node, before and after running the script]

  2. It's possible, but I generally discounted it because the symptoms are equally visible on both VMs and metal, and our standard metal SKUs have 512GB of memory.

erhudy commented Apr 20, 2021

For completeness' sake, here are the last 2 weeks of page fault stats (node_vmstat_pgfault from the Prometheus node exporter) for the same metal node as above. The spikes early on are probably when I ran the script there initially.
[graph: node_vmstat_pgfault over the last 2 weeks for the same node]
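Side note for anyone looking at these graphs later: node_vmstat_pgfault counts all page faults, minor plus major; major faults have their own counter, which can be checked directly on a node (a minimal sketch):

# cumulative counters since boot; sample twice and diff to get a rate
grep -E '^pg(maj)?fault ' /proc/vmstat
# the same counter in Prometheus, if the exporter exposes it: node_vmstat_pgmajfault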

stale bot commented Jul 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 19, 2021
stale bot closed this as completed Aug 10, 2021