Increase robustness for kubeadm join / add etcd #2094

fabriziopandini · 2020-03-30T14:35:56Z

Is this a BUG REPORT or FEATURE REQUEST?

BUG REPORT

Versions

kubeadm version: v1.17.*

What happened?

While running Cluster API tests, we observed that kubeadm join sometimes fails while waiting for the new etcd member to report a healthy state.

xref kubernetes-sigs/cluster-api#2769

What you expected to happen?

Adding a new etcd member should be more resilient, for example by increasing the timeout or the number of retries for this operation.

How to reproduce it (as minimally and precisely as possible)?

This error happens only occasionally, most probably because a slow network or slow I/O delays etcd coming online or, in some cases, because the etcd leader changes.

Anything else we need to know?

Important: if possible, the change should be kept as small as possible and backported.

neolit123 · 2020-04-01T14:15:58Z

@fabriziopandini note there is already a ticket for learner mode here:
#1793

fabriziopandini · 2020-04-02T15:13:34Z

@neolit123 thanks! Nevertheless, I will also keep this one open for making the current implementation more robust.

neolit123 · 2020-04-14T12:28:51Z

/assign

neolit123 · 2020-04-14T14:48:23Z

"To add new etcd member more resilient by increasing the timeout/the number of retries for this operation"

@fabriziopandini what timeouts are we talking about?

current backoff for AddMember is:

steps: 11
duration: 50
factor: 2
jitter: 0.1
step: 0, value: 0
step: 1, value: 50.35
step: 2, value: 155.55
step: 3, value: 363.46
step: 4, value: 779.66
step: 5, value: 1611.29
step: 6, value: 3275.65
step: 7, value: 6602.45
step: 8, value: 13257.2
step: 9, value: 26565.96
step: 10, value: 53182.36

~53 sec

https://github.com/kubernetes/kubernetes/blob/8a4bf398840f1298edaca40b76bee2ace2733d12/cmd/kubeadm/app/util/etcd/etcd.go#L370-L383

also are you sure this is an AddMember issue and not an issue with the client dial?
https://github.com/kubernetes/kubernetes/blob/8a4bf398840f1298edaca40b76bee2ace2733d12/cmd/kubeadm/app/util/etcd/etcd.go#L356
https://github.com/kubernetes/kubernetes/blob/8a4bf398840f1298edaca40b76bee2ace2733d12/cmd/kubeadm/app/util/etcd/etcd.go#L215

fabriziopandini · 2020-04-15T12:06:39Z

@neolit123 I was thinking that we can raise this timeout up to 2 minutes (or even more)
Also, as you are pointing out, for better robustness, we can move the client dial step into the retry loop

neolit123 · 2020-05-19T15:22:07Z

all cherry picks are about to merge, but we should keep this open to refactor the etcd client management in a similar way in master. e.g. kubernetes/kubernetes#90645 made it so that MemberAdd behaves differently than MemberRemove

neolit123 · 2020-06-05T13:23:04Z

cherry picks merged. lowering priority as the remaining refactor is not mandatory for 1.19.

neolit123 · 2020-06-11T20:15:21Z

actually let me log a new ticket

neolit123 added this to the v1.19 milestone Mar 30, 2020

neolit123 added the kind/bug, kind/feature, and priority/important-soon labels Mar 30, 2020

neolit123 mentioned this issue Mar 30, 2020

improve kubeadm's preflight and cluster health assurance #2096

Closed

neolit123 mentioned this issue Apr 6, 2020

kubeadm join does not explicitly wait for etcd to have grown when joining secondary control plane #1353

Closed

k8s-ci-robot assigned neolit123 Apr 14, 2020

neolit123 mentioned this issue Apr 30, 2020

kubeadm: fix flakes when performing etcd MemberAdd on slower setups kubernetes/kubernetes#90645

Merged

neolit123 added the priority/important-longterm label and removed the priority/important-soon label Jun 5, 2020

neolit123 modified the milestones: v1.19, v1.20 Jun 11, 2020

neolit123 closed this as completed Jun 11, 2020

neolit123 mentioned this issue Jun 16, 2020

Insulate users from kubeadm API version changes kubernetes-sigs/cluster-api#2769

Closed

killianmuldoon mentioned this issue Jul 15, 2022

Deprecate experimentalRetryJoin in CABPK kubernetes-sigs/cluster-api#5597

Closed
