etcd 3.2.1 Troubling CPU Usage Pattern #8491

iherbmatt · 2017-09-03T22:01:19Z

Hello,

We recently upgraded to kube-aws 0.9.8 and are running etcd 3.2.1; we have also tested etcd 3.2.6. Both versions were installed as a 3-node etcd cluster, with each node having 2 cores and 8GB of RAM.

What's troubling is that we are only running a single application on the cluster and it's using more and more CPU as time goes by. Here is a sample showing the last week from the date of cluster start-up to now:

As you can see, the CPU usage has not fluctuated; it has only increased steadily over the last few days. This is troubling because we have older clusters running etcd 3.1.3 whose CPU usage is increasing even faster. We figured we would test with a cluster using etcd 3.2.1 to see if that would fix the problem, but it doesn't. It just postpones the inevitable: an unstable cluster.

In order to fix the problem we need to terminate the nodes and let them rebuild and resync with the other members, or reboot them.

We created the K8s cluster with the following etcd configs:
3 etcd nodes
m4.large (2 Cores, 8GB RAM)
50GB root volume [general ssd (gp2)]
200GB data volume [general ssd (gp2)]
auto-recovery: true
auto-snapshot: true
encrypted volumes: true

Please somebody help us with this.

Thank you,

Matt

redbaron · 2017-09-04T19:08:14Z

@iherbmatt, there is a report about an odd memory usage pattern in #8472. Do you see similar behavior on your setup?

iherbmatt · 2017-09-04T21:06:54Z

From startup to a few days later, most of the RAM is in use. Some of it is moved to cached memory, while the rest is taken by other processes, mostly Docker and etcd.

heyitsanthony · 2017-09-05T07:21:47Z

@iherbmatt what is the RPC rate over time for 3.2? What is the memory utilization over time? Are there any errors or warnings in the etcd server logs?

Also, please use >=3.1.5 for 3.1; there's a memory leak on linearizable reads
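One way to collect these numbers, as a minimal sketch: etcd serves Prometheus-format metrics at `/metrics` on its client URL. The script below sums three counters and gauges from that text. The URL is an assumption (adjust host, port, and TLS for your cluster), and the helper names are hypothetical. `grpc_server_handled_total`, `process_resident_memory_bytes`, and `process_cpu_seconds_total` are the standard gRPC and process metrics; check that your etcd version actually exposes them.

```python
import re
import urllib.request

# Assumed endpoint; change to match your cluster's client URL and TLS setup.
METRICS_URL = "http://127.0.0.1:2379/metrics"

# One Prometheus text-format sample: name, optional {labels}, then the value.
_SAMPLE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)")


def sum_metric(text, name):
    """Sum every sample of one metric across all of its label sets."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        m = _SAMPLE.match(line)
        if m and m.group(1) == name:
            total += float(m.group(3))
    return total


def snapshot(text):
    """Extract the three values asked about from one metrics scrape."""
    return {
        "rpcs_handled": sum_metric(text, "grpc_server_handled_total"),
        "rss_bytes": sum_metric(text, "process_resident_memory_bytes"),
        "cpu_seconds": sum_metric(text, "process_cpu_seconds_total"),
    }


def fetch(url=METRICS_URL):
    """Download the raw metrics text from one etcd member."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def rate(before, after, seconds):
    """Per-second rate between two counter readings."""
    return (after - before) / seconds
```

Calling `snapshot(fetch())` twice, some seconds apart, and passing the two `rpcs_handled` values to `rate` gives an approximate RPC rate; repeating this over hours shows whether memory and CPU grow while the RPC rate stays flat.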

iherbmatt · 2017-09-05T07:32:39Z

Hi Anthony,

I'm not extremely familiar with etcd. How can I get this information for you?

Also, I'm wondering if it's the automatic snapshots that are causing the issue. I'm testing another cluster with automatic snapshots and automatic recovery disabled. For about 2 hours, the CPU usage on each of the 3 etcd nodes has been hovering around 1%; previously it was about 5% (v3.2.1) and ~20% (v3.1.3).

Thanks,
Matt

heyitsanthony · 2017-09-05T08:58:08Z

It looks like kube-aws is taking snapshots every minute on every member according to https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cloud-config-etcd#L231

This is about 90x more frequent than the etcd-operator default policy and might account for the increased CPU load. It could be triggering #8009, where the etcd backend needs to be defragmented when there are frequent snapshots.
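To check whether the backend is growing and to defragment it, a sketch along these lines may help. It wraps the etcdctl v3 subcommands `endpoint status` and `defrag`. The helper names are hypothetical, and the JSON shape parsed by `db_sizes` is an assumption about what `etcdctl endpoint status --write-out=json` prints, so verify it against your etcdctl version first.

```python
import json
import os
import subprocess


def etcdctl_cmd(endpoints, *args):
    """Build an etcdctl argument list for the given endpoints."""
    return ["etcdctl", "--endpoints=" + ",".join(endpoints), *args]


def db_sizes(status_json):
    """Map endpoint -> backend size in bytes.

    Assumes a list of {"Endpoint": ..., "Status": {"dbSize": ...}} objects.
    """
    return {e["Endpoint"]: e["Status"]["dbSize"] for e in json.loads(status_json)}


def run(cmd):
    """Run etcdctl with the v3 API selected and return its stdout."""
    env = dict(os.environ, ETCDCTL_API="3")
    result = subprocess.run(cmd, env=env, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `db_sizes(run(etcdctl_cmd(endpoints, "endpoint", "status", "--write-out=json")))` reports the size per member, and `run(etcdctl_cmd([endpoint], "defrag"))` defragments a single member. Defragmenting one member at a time is the safer choice, since a member blocks requests while it defragments.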

iherbmatt · 2017-09-06T18:26:07Z

When I disable automatic snapshots and disaster recovery, the CPU remains around 1-1.5%. It's obvious there's a bug in that logic somewhere.

Thank you!

flah00 · 2017-09-15T20:10:01Z

Based on input from @heyitsanthony, I updated userdata/cloud-config-etcd, so backups run every 5m instead of 1m. CPU has stabilized.
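For a rough sense of scale, the arithmetic can be sketched as follows. It assumes a 3-member cluster where every member runs the backup job independently, as described earlier in this thread; the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def snapshots_per_day(interval_minutes, members=3):
    """Total snapshots taken per day across all members."""
    return members * (MINUTES_PER_DAY // interval_minutes)


every_minute = snapshots_per_day(1)        # 3 * 1440 = 4320
every_five_minutes = snapshots_per_day(5)  # 3 * 288 = 864
print(every_minute, every_five_minutes)    # prints: 4320 864
```

Moving from 1m to 5m therefore cuts the number of snapshots by a factor of five, at the cost of up to five minutes of data between backups.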

xiang90 · 2017-09-15T20:22:25Z

Thanks for the update. Closing this issue.

xiang90 closed this as completed Sep 15, 2017

flah00 mentioned this issue Sep 16, 2017

etcd degradation kubernetes-retired/kube-aws#795

Closed
