Upgrading etcd cluster version from v3.2.24 to v3.3.15 made the k8s cluster apparently frozen #12225

Closed
Leulz opened this issue Aug 15, 2020 · 3 comments

Leulz commented Aug 15, 2020

etcd version: 3.3.15
k8s version: 1.16

I am using etcd-wrapper to run etcd on dedicated machines.

When I upgraded the first machine, I noticed an absurd increase in CPU usage. The average CPU usage with v3.2.24 was at around 40% of an EC2 m3.medium. It jumped to more than 90% after the upgrade, and now, even using m3.large instances, it's still at around 50~60%.

Alas, I decided to keep upgrading the cluster instead of rolling back, and now the k8s cluster is seemingly immutable. Thankfully it's in a staging environment.

The cluster reports itself as healthy:

$ etcdctl endpoint health
https://dns-1:2379 is healthy: successfully committed proposal: took = 6.031176ms
https://dns-2:2379 is healthy: successfully committed proposal: took = 2.313741ms
https://dns-3:2379 is healthy: successfully committed proposal: took = 5.478692ms

But I noticed that the Raft Index is sometimes significantly different across all instances:

$ etcdctl endpoint status
dns-1:2379, b01d7560e848897, 3.3.15, 544 MB, false, 174, 245011277
dns-2:2379, 345a61dc8892f7ca, 3.3.15, 530 MB, true, 174, 245011291
dns-3:2379, e99321f6d32addcb, 3.3.15, 545 MB, false, 174, 245011346
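
(For context, the columns above are endpoint, member ID, version, DB size, leader flag, raft term and raft index. A small raft-index gap can just be replication lag, so a more direct consistency check is to compare each member's KV hash and revision; a minimal sketch, assuming etcdctl 3.3 with ETCDCTL_API=3 and placeholder TLS paths:)

# Ask every member for its key-value store hash and current revision.
# Differing hashes at the same (compacted) revision indicate real divergence,
# not just replication lag. Certificate paths below are placeholders.
$ export ETCDCTL_API=3
$ etcdctl --endpoints=https://dns-1:2379,https://dns-2:2379,https://dns-3:2379 \
    --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
    endpoint hashkv
$ etcdctl --endpoints=https://dns-1:2379,https://dns-2:2379,https://dns-3:2379 \
    --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
    endpoint status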

There are also lots (as in, dozens per second) of log lines like this:

Aug 15 19:57:10 internal-dns etcd-wrapper[2981]: 2020-08-15 19:57:10.849851 I | auth: deleting token <token> for user root

Other logs that look weird are:

auth: invalid user name etcd-2 for permission checking
pkg/fileutil: purged file /var/lib/etcd/member/snap/00000000000000ae-000000000e945a73.snap successfully

and lots of:

etcdserver: read-only range request "key:\"/registry/pods/\" range_end:\"/registry/pods0\" " with result "range_response_count:808 size:13316569" took too long (127.988459ms) to execute
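
(To quantify how noisy the token-deletion messages are, counting them over a window works; a rough sketch, where the systemd unit name is only a guess for an etcd-wrapper setup:)

# Count "auth: deleting token" lines emitted in the last 10 minutes.
# etcd-member.service is an assumed unit name; use whatever runs etcd-wrapper here.
$ journalctl -u etcd-member.service --since "10 minutes ago" --no-pager | grep -c "auth: deleting token"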

The k8s cluster using this etcd cluster is, as mentioned, apparently frozen. I tried editing a deployment we have in the cluster, and the result was:

Pods before the patch:

pod-1                3/3     Running            2          30h
pod-2                3/3     Running            0          31h

Pods after editing a deployment to force a cycle:

pod-3                0/3     Terminating         0          32h
pod-4                0/3     Terminating         0          31h
pod-1                0/3     ContainerCreating   0          30h
pod-2                0/3     Pending             0          31h

Pods after some time:

pod-1                3/3     Running            2          31h
pod-2                3/3     Running            0          31h
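
(For reference, a cycle like the one above can also be forced with a rolling restart instead of hand-editing the deployment; a minimal sketch with a hypothetical deployment name and namespace, assuming kubectl 1.15+:)

# Trigger a fresh rollout of the deployment and wait for it to settle.
# "my-deployment" and "staging" are placeholders.
$ kubectl -n staging rollout restart deployment/my-deployment
$ kubectl -n staging rollout status deployment/my-deployment --timeout=5m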

Is this a known issue? Any insight into what is happening here is much appreciated.

tangcong (Contributor) commented

But I noticed that the Raft Index is sometimes significantly different across all instances:

$ etcdctl endpoint status
dns-1:2379, b01d7560e848897, 3.3.15, 544 MB, false, 174, 245011277
dns-2:2379, 345a61dc8892f7ca, 3.3.15, 530 MB, true, 174, 245011291
dns-3:2379, e99321f6d32addcb, 3.3.15, 545 MB, false, 174, 245011346

Do you have auth enabled? Please see #11689.
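
One way to check (a sketch; as far as I know etcdctl 3.3 has no auth status subcommand, so this just probes whether unauthenticated requests are rejected, and the TLS paths are placeholders):

# With auth enabled, an unauthenticated read should be rejected with an
# authentication/permission error instead of returning keys.
$ ETCDCTL_API=3 etcdctl --endpoints=https://dns-1:2379 \
    --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
    get /registry --prefix --keys-only --limit=1

# The same read with root credentials (prompts for the password) should succeed.
$ ETCDCTL_API=3 etcdctl --endpoints=https://dns-1:2379 \
    --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
    --user root get /registry --prefix --keys-only --limit=1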

Leulz commented Aug 16, 2020

Do you have auth enabled? Please see #11689.

Thanks a lot for pointing me to that issue, @tangcong! I believe that is indeed what is happening in my cluster. I noticed that there are some logs like the following in my cluster's leader logs:

etcd-wrapper[2629]: 2020-08-16 15:18:57.148544 W | etcdserver: request "header:<ID:17855211145036495243 username:\"root\" auth_revision:10 > lease_revoke:<id:5dcb73f416b8f899>" with result "size:31" took too long (118.715626ms) to execute

I guess this indicates that auth is indeed enabled and that lease_revoke requests are being issued.

I noticed that the only solution you proposed is to first upgrade to the latest 3.2 version and then upgrade to 3.3. Since my entire cluster is already on v3.3, does that mean it is in an unrecoverable state and I should just obliterate it?

Also, I couldn't find the note you added related to this issue here. Shouldn't it be there?

tangcong (Contributor) commented

If your cluster is already inconsistent, you can only remove the follower nodes one by one and then add them back to the cluster to make it consistent again. Note that there is no guarantee that your data is complete.
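
A rough sketch of that procedure, one follower at a time (the member ID and endpoints come from the endpoint status output above; the member name, peer port and systemd unit name are assumptions, and TLS flags are omitted for brevity):

# Take a snapshot first, since there is no guarantee the data is complete.
$ ETCDCTL_API=3 etcdctl --endpoints=https://dns-2:2379 snapshot save /var/backups/etcd-before-rebuild.db

# Remove one follower from the cluster (dns-1 / b01d7560e848897 here).
$ ETCDCTL_API=3 etcdctl --endpoints=https://dns-2:2379 member remove b01d7560e848897

# On that machine: stop etcd and wipe its data so it will resync from the leader.
$ sudo systemctl stop etcd-member.service
$ sudo rm -rf /var/lib/etcd/member

# Add it back, update its config with the printed initial-cluster values and
# --initial-cluster-state=existing, then start it and wait until endpoint
# status/hashkv agree before moving on to the next follower.
$ ETCDCTL_API=3 etcdctl --endpoints=https://dns-2:2379 member add etcd-1 --peer-urls=https://dns-1:2380
$ sudo systemctl start etcd-member.service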

@gyuho, when you release the new 3.4/3.3 versions that include the fix for the data inconsistency bug, could you also release a new 3.2 version? Thanks.

@Leulz The etcd website docs have not been updated for a long time; I will see how to get them updated, thank you. The latest note is here.

Leulz closed this as completed Aug 16, 2020