etcd performance issues resolved via drop_caches? #12842

Closed
erhudy opened this issue Apr 8, 2021 · 4 comments

erhudy commented Apr 8, 2021

We run a large number (60+) of Kubernetes clusters across both bare metal and virtualized environments, each one with either 1 or 2 etcd quorums co-located with the rest of the Kubernetes control plane (in the ones with 2 etcd quorums, Kubernetes data is stored in one and Calico data is stored in the other). Nearly all systems are running Ubuntu 18.04 with a kernel of the 4.15.0 lineage.

As time progressed, we began noticing curious performance issues with etcd, both 3.3.x and 3.4.x, that seemed to be correlated to the uptime of the control plane nodes. etcd would routinely complain about slow read-only range requests even on trivial fetches for single keys, and as the uptime of the machines increased, the slowness would get worse. At its worst, on machines with close to a year of uptime, trivial fetches could sometimes take 5+ seconds to complete. As you might expect, this made both etcd and Kubernetes really unhappy, leading to constant leader switches and Kubernetes control plane components crashing and restarting frequently.

I went through a whole lot of debugging steps while researching the problem: disk performance on the nodes, added load from our own controllers being scheduled onto the control plane nodes, the etcd version, restarting etcd itself, restarting the Kubernetes control plane components, network delays, buddy allocator memory fragmentation, poking at it with various eBPF scripts, etc., and ultimately came up with nothing other than the aforementioned correlation between uptime and the performance issues. We did determine that rebooting nodes would resolve the issue for a while, but it would always come back.
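For reference, these were roughly the kinds of spot checks involved (a sketch only; the paths and sizes here are illustrative, not our exact commands):

# fsync latency in the etcd data directory (path is illustrative)
fio --name=etcd-fsync --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=22m --directory=/var/lib/etcd
# buddy allocator state; very few high-order free blocks would suggest fragmentation
cat /proc/buddyinfo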

Finally, as a Hail Mary pass, I ran a little script on one node:

sync                                  # flush dirty pages to disk first
echo 1 > /proc/sys/vm/compact_memory  # trigger memory compaction
echo 3 > /proc/sys/vm/drop_caches     # drop the page cache plus dentries and inodes

...and the performance issues disappeared within 1-2 minutes after the last command returned. I reran the individual commands on other affected nodes and determined that the compact_memory change did not have an effect, but drop_caches definitely did. I tried the same script out on other Kubernetes clusters and confirmed that it resolved the performance issues there as well.

Right now we are applying a mitigation that does a weekly drop_caches on our control plane nodes during low-activity periods to keep etcd happy. This is fine as a tactical fix, but I'm opening this issue as a discussion point to see if anyone has insight into why this actually fixes the problem, because it's a pretty big hammer.
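The mitigation is essentially a scheduled job along these lines (a sketch only; the exact schedule and tooling we use differ):

# /etc/cron.d/drop-caches -- illustrative, run during a low-activity window
# m  h  dom mon dow  user  command
30   3  *   *   0    root  /bin/sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'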

ptabor commented Apr 8, 2021

  1. Do you see a difference in the number of major page faults etcd experiences before vs. after dropping caches? (One way to check is sketched below.)

  2. Is it possible that another process running on the node is creating memory pressure, leaving etcd unable to keep the whole mmapped bbolt data file in RAM? --experimental-memory-mlock support #12750 might be a potential mitigation for this issue.
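For (1), something along these lines could be used to compare the counters before and after (a sketch; it assumes a single process named etcd on the node):

# cumulative minor/major fault counters for the etcd process
ps -o pid,min_flt,maj_flt,rss,comm -C etcd
# or sample the fault rate over time (pidstat from sysstat)
pidstat -r -p "$(pidof etcd)" 5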

erhudy commented Apr 20, 2021

  1. I ran the script on a node that was showing moderate symptoms of the issue, and the page fault graph looks mostly the same before and after (the spike is when the script was actually running).

[screenshot: page fault graph for the node, before and after running the script]

  2. It's possible, but I generally discounted it because the symptoms are equally visible on both VMs and metal, and our standard metal SKUs have 512GB of memory.

erhudy commented Apr 20, 2021

For completeness' sake, here are the last 2 weeks of page fault stats (node_vmstat_pgfault from the Prometheus node exporter) for the same metal node as above. The spikes early on are probably when I ran the script there initially.
[graph: node_vmstat_pgfault over the last 2 weeks for the same node]
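Side note for anyone looking at these graphs later: node_vmstat_pgfault counts all page faults, minor plus major; major faults have their own counter, which can be checked directly on a node (a minimal sketch):

# cumulative counters since boot; sample twice and diff to get a rate
grep -E '^pg(maj)?fault ' /proc/vmstat
# the same counter in Prometheus, if the exporter exposes it: node_vmstat_pgmajfault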

stale bot commented Jul 19, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 19, 2021
stale bot closed this as completed Aug 10, 2021