etcd 3.2.1 Troubling CPU Usage Pattern #8491

Closed
iherbmatt opened this issue Sep 3, 2017 · 8 comments

Comments

@iherbmatt

Hello,

We recently upgraded to kube-aws 0.9.8 and are running etcd 3.2.1; we have also tested etcd 3.2.6. Both versions were installed on a 3-node etcd cluster whose nodes each have 2 cores and 8GB of RAM.

What's troubling is that we are only running a single application on the cluster, yet etcd is using more and more CPU as time goes by. Here is a sample showing the last week, from the date of cluster start-up to now:

[image: graph of CPU usage over the last week]

As you can see, CPU usage has not fluctuated; it has only increased steadily over the last few days. This is troubling because we have older clusters running etcd 3.1.3 whose CPU usage is increasing even faster. We tested a cluster on etcd 3.2.1 to see if that would fix the problem, but it didn't. It just postponed the inevitable: an unstable cluster.

In order to fix the problem we need to terminate the nodes and let them rebuild and resync with the other members, or reboot them.

We created the K8s cluster with the following etcd configs:
3 etcd nodes
m4.large (2 Cores, 8GB RAM)
50GB root volume [general ssd (gp2)]
200GB data volume [general ssd (gp2)]
auto-recovery: true
auto-snapshot: true
encrypted volumes: true

Please somebody help us with this.

Thank you,

Matt

@redbaron

redbaron commented Sep 4, 2017

@iherbmatt, there is a report about an odd memory usage pattern in #8472. Do you see similar behavior on your setup?

@iherbmatt
Author

iherbmatt commented Sep 4, 2017 via email

@heyitsanthony
Contributor

@iherbmatt what is the RPC rate over time for 3.2? What is the memory utilization over time? Are there any errors or warnings in the etcd server logs?
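
One way to answer the RPC-rate and memory questions is to scrape etcd's Prometheus-format `/metrics` endpoint twice and difference the counters. The following is only a minimal sketch: it assumes the metric names `grpc_server_handled_total` and `process_resident_memory_bytes` appear in your build's output (check your own `/metrics` text), the helper names `parse_metrics` and `per_second_rate` are hypothetical, and the two scrapes are made-up sample text rather than data from this cluster.

```python
from collections import defaultdict


def parse_metrics(text):
    """Sum sample values per metric name from Prometheus text-format output.

    Assumes each sample line is "<series> <value>" with no trailing timestamp.
    """
    totals = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the label set, keep the metric name
        totals[name] += float(value)
    return dict(totals)


def per_second_rate(before, after, name, interval_seconds):
    """Average per-second increase of a counter between two scrapes."""
    return (after[name] - before[name]) / interval_seconds


# Made-up samples standing in for two scrapes taken 60 seconds apart.
SCRAPE_1 = """\
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_method="Range",grpc_code="OK"} 1200
grpc_server_handled_total{grpc_method="Put",grpc_code="OK"} 300
process_resident_memory_bytes 2.5e+08
"""
SCRAPE_2 = """\
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_method="Range",grpc_code="OK"} 1800
grpc_server_handled_total{grpc_method="Put",grpc_code="OK"} 420
process_resident_memory_bytes 2.6e+08
"""

first = parse_metrics(SCRAPE_1)
second = parse_metrics(SCRAPE_2)

rpc_rate = per_second_rate(first, second, "grpc_server_handled_total", 60)
memory_growth = (
    second["process_resident_memory_bytes"] - first["process_resident_memory_bytes"]
)

print(f"RPCs handled per second: {rpc_rate}")
print(f"Resident memory growth in bytes: {memory_growth}")
```

Repeating the scrape on a schedule and plotting both numbers next to CPU usage would show whether load or memory is growing along with the CPU.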

Also, please use >=3.1.5 for 3.1; earlier 3.1 releases have a memory leak on linearizable reads.

@iherbmatt
Author

iherbmatt commented Sep 5, 2017 via email

@heyitsanthony
Contributor

It looks like kube-aws is taking snapshots every minute on every member according to https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-etcd#L231

This is about 90x more frequent than the etcd-operator default policy and might account for the increased CPU load. It could be triggering #8009, where the etcd backend needs to be defragmented when there are frequent snapshots.

@iherbmatt
Author

iherbmatt commented Sep 6, 2017 via email

@flah00

flah00 commented Sep 15, 2017

Based on input from @heyitsanthony, I updated userdata/cloud-config-etcd, so backups run every 5m instead of 1m. CPU has stabilized.
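
To put the interval change in perspective, here is a rough back-of-the-envelope sketch of how many snapshots the cluster takes per day at each setting. It assumes every one of the 3 members takes exactly one snapshot per interval, as described earlier in the thread; the function name `snapshots_per_day` is hypothetical and the numbers are plain arithmetic, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MEMBERS = 3  # cluster size from the original report


def snapshots_per_day(interval_seconds, members=MEMBERS):
    """Snapshots per day across the cluster if each member snapshots once per interval."""
    return members * (SECONDS_PER_DAY // interval_seconds)


every_minute = snapshots_per_day(60)         # the 1m interval
every_five_minutes = snapshots_per_day(300)  # the 5m interval

print(f"1m interval: {every_minute} snapshots per day")
print(f"5m interval: {every_five_minutes} snapshots per day")
print(f"Reduction factor: {every_minute / every_five_minutes}")
```

Under these assumptions the change cuts snapshot work to one fifth, which is consistent with the CPU stabilizing, though it does not by itself prove that snapshots were the only cause.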

@xiang90
Contributor

xiang90 commented Sep 15, 2017

thanks for the update. Closing this issue.
