Increase robustness for kubeadm join / add etcd #2094

Closed
fabriziopandini opened this issue Mar 30, 2020 · 8 comments
Labels: kind/bug, kind/feature, priority/important-longterm

Comments

@fabriziopandini
Member

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: v1.17.*

What happened?

While running the Cluster API tests, kubeadm join occasionally failed while waiting for the new etcd member to report a healthy state.

xref kubernetes-sigs/cluster-api#2769

What you expected to happen?

Adding a new etcd member should be more resilient, for example by increasing the timeout or the number of retries for this operation.

How to reproduce it (as minimally and precisely as possible)?

This error happens only intermittently, most probably because a slow network or slow I/O delays etcd coming online or, in some cases, because of a change of the etcd leader.

Anything else we need to know?

Important: if possible, the change should be kept as small as possible and backported.

neolit123 added this to the v1.19 milestone Mar 30, 2020
neolit123 added the kind/bug, kind/feature, and priority/important-soon labels Mar 30, 2020
@neolit123
Member

@fabriziopandini note there is already a ticket for learner mode here:
#1793

@fabriziopandini
Member Author

@neolit123 thanks! Nevertheless, I will keep this one open as well, to make the current implementation more robust.

@neolit123
Member

/assign

@neolit123
Member

neolit123 commented Apr 14, 2020

Adding a new etcd member should be more resilient, for example by increasing the timeout or the number of retries for this operation.

@fabriziopandini what timeouts are we talking about?

current backoff for AddMember is:

steps: 11
duration: 50
factor: 2
jitter: 0.1
step: 0, value: 0
step: 1, value: 50.35
step: 2, value: 155.55
step: 3, value: 363.46
step: 4, value: 779.66
step: 5, value: 1611.29
step: 6, value: 3275.65
step: 7, value: 6602.45
step: 8, value: 13257.2
step: 9, value: 26565.96
step: 10, value: 53182.36

~53 sec in total (the values above are cumulative, in milliseconds)

https://github.com/kubernetes/kubernetes/blob/8a4bf398840f1298edaca40b76bee2ace2733d12/cmd/kubeadm/app/util/etcd/etcd.go#L370-L383
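The cumulative values above can be reproduced approximately with a short sketch. It is written in Python rather than the Go used by kubeadm, and it assumes the listed parameters mean 11 attempts, an initial delay of 50 ms that doubles after each attempt, and up to 10% random jitter added to each delay. The function names are hypothetical; the linked etcd.go remains the authoritative source for the actual behaviour.

```python
import random


def backoff_schedule(steps=11, duration_ms=50.0, factor=2.0, jitter=0.1, rng=None):
    """Return the cumulative wait (in ms) before each of `steps` attempts.

    Assumption: each delay is base * (1 + U[0, 1) * jitter), and the base
    is multiplied by `factor` after every sleep.
    """
    rng = rng or random.Random()
    elapsed = 0.0
    delay = duration_ms
    schedule = [0.0]  # the first attempt happens immediately
    for _ in range(steps - 1):
        elapsed += delay * (1.0 + rng.random() * jitter)
        schedule.append(elapsed)
        delay *= factor
    return schedule


def total_wait_bounds(steps=11, duration_ms=50.0, factor=2.0, jitter=0.1):
    """Return (min, max) total wait in ms: no jitter vs. maximum jitter."""
    sleeps = steps - 1
    if factor == 1.0:
        base = duration_ms * sleeps
    else:
        # Geometric series: d + d*f + ... + d*f^(sleeps-1)
        base = duration_ms * (factor ** sleeps - 1.0) / (factor - 1.0)
    return base, base * (1.0 + jitter)


if __name__ == "__main__":
    for step, value in enumerate(backoff_schedule()):
        print(f"step: {step}, value: {value:.2f}")
    print("bounds (ms):", total_wait_bounds())
```

Under these assumptions the ten sleeps add up to 51150 ms without jitter and to at most 56265 ms with full jitter, so the ~53 sec figure above falls inside that range.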

Also, are you sure this is an AddMember issue and not an issue with the client dial?
https://github.com/kubernetes/kubernetes/blob/8a4bf398840f1298edaca40b76bee2ace2733d12/cmd/kubeadm/app/util/etcd/etcd.go#L356
https://github.com/kubernetes/kubernetes/blob/8a4bf398840f1298edaca40b76bee2ace2733d12/cmd/kubeadm/app/util/etcd/etcd.go#L215

@fabriziopandini
Member Author

@neolit123 I was thinking that we can raise this timeout up to 2 minutes (or even more)
Also, as you are pointing out, for better robustness, we can move the client dial step into the retry loop
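A minimal sketch of what "moving the client dial step into the retry loop" could look like, again in Python rather than kubeadm's Go. The `dial` and `operation` callables and the `retry_with_backoff` helper are hypothetical stand-ins, not the etcd client API; jitter is left out to keep the example short.

```python
import time


def retry_with_backoff(dial, operation, steps=11, duration_s=0.05,
                       factor=2.0, sleep=time.sleep):
    """Dial and run `operation` up to `steps` times with exponential backoff.

    Because dial() is called inside the loop, a failed connection attempt
    is retried just like a failed operation.
    """
    delay = duration_s
    last_err = None
    for attempt in range(steps):
        try:
            client = dial()            # hypothetical: create a fresh client
            return operation(client)   # hypothetical: e.g. add the member
        except Exception as err:       # a real implementation would be narrower
            last_err = err
        if attempt < steps - 1:
            sleep(delay)
            delay *= factor
    raise TimeoutError(f"gave up after {steps} attempts") from last_err
```

On the timeout itself: assuming the same 50 ms base and factor of 2, with one sleep between consecutive attempts, 12 steps give about 102 seconds of total wait before jitter and 13 steps give about 205 seconds, so reaching 2 minutes would take 13 steps or a different base or factor.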

@neolit123
Member

All cherry picks are about to merge, but we should keep this open to refactor the etcd client management in a similar way in master. For example, kubernetes/kubernetes#90645 made it so that MemberAdd behaves differently than MemberRemove.

@neolit123
Member

cherry picks merged. lowering priority as the remaining refactor is not mandatory for 1.19.

neolit123 added the priority/important-longterm label and removed the priority/important-soon label Jun 5, 2020
neolit123 modified the milestones: v1.19, v1.20 Jun 11, 2020
@neolit123
Member

actually let me log a new ticket
