v1.16 - HA master join failure - etcdserver: leader changed #1843

Closed
rrichardson opened this issue Oct 18, 2019 · 3 comments
Labels
area/etcd, kind/bug, kind/support, triage/needs-information

Comments

@rrichardson

BUG REPORT

Versions

kubeadm version:

kubeadm version: &version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:15:39Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}

Environment:

  • Kubernetes version:
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:18:23Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.2", GitCommit:"c97fe5036ef3df2967d086711e6c0c405941e14b", GitTreeState:"clean", BuildDate:"2019-10-15T19:09:08Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
    In a VirtualBox VM network: 6 VMs (3 masters, 3 workers).

  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04

  • Kernel (e.g. uname -a):
    Linux 192-168-123-102 4.15.0-65-generic #74~16.04.1-Ubuntu SMP Wed Sep 18 09:51:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

  • Others:

We use an automated script which uses kubeadm to spin up the first master. It then captures the relevant details to spin up 2 additional masters "simultaneously".

What happened?

Upon attempting to bring up the 3rd of 3 HA masters using kubeadm, the kubeadm join command fails with the error below. The error seems fairly self-explanatory: kubeadm doesn't handle an etcd leader change well, and I'm guessing that the leader changes when the 2nd node joins the cluster.

We can consistently reproduce this, even if we wait a while between spinning up master #2 and master #3.

This has never occurred, to my knowledge, in version 1.14. We have spun up hundreds of clusters in 1.14.

Oct 18 16:03:17 192-168-123-102 kubeadm[6588]: [download-certs] Downloading the certificates in Secret "kubeadm-certs" in the "kube-system" Namespace
Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: error execution phase control-plane-prepare/download-certs: error downloading certs: error downloading the secret: rpc error: code = Unavailable desc = etcdserver: leader changed
Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: To see the stack trace of this error execute with --v=5 or higher
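For anyone scripting around this, here is a minimal sketch of how the failing phase and the gRPC error could be pulled out of the log line above. The function name `parse_kubeadm_failure` and both regular expressions are illustrative assumptions based only on the line format shown in this report; they are not part of kubeadm.

```python
import re

# Assumed line shapes, taken from the log excerpt in this report.
PHASE_ERROR = re.compile(r"error execution phase (?P<phase>\S+): (?P<message>.*)$")
RPC_ERROR = re.compile(r"rpc error: code = (?P<code>\w+) desc = (?P<desc>.*)$")


def parse_kubeadm_failure(line):
    """Return (phase, grpc_code, description), or None if the line is not a phase error."""
    phase_match = PHASE_ERROR.search(line)
    if phase_match is None:
        return None
    message = phase_match.group("message")
    rpc_match = RPC_ERROR.search(message)
    if rpc_match is None:
        # A phase error without a gRPC status: return the raw message instead.
        return phase_match.group("phase"), None, message
    return phase_match.group("phase"), rpc_match.group("code"), rpc_match.group("desc")


line = (
    "Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: error execution phase "
    "control-plane-prepare/download-certs: error downloading certs: error "
    "downloading the secret: rpc error: code = Unavailable desc = "
    "etcdserver: leader changed"
)
print(parse_kubeadm_failure(line))
# → ('control-plane-prepare/download-certs', 'Unavailable', 'etcdserver: leader changed')
```

The point of splitting out the phase is that the failure here happens while downloading the `kubeadm-certs` Secret, before the node attempts to add itself as an etcd member.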

What you expected to happen?

I expected kubeadm join to succeed and the current node to join the HA master quorum.

How to reproduce it (as minimally and precisely as possible)?

Create a master node, collect the relevant details (token, CA cert hash, etc.), then use them to start 2 additional masters, as close to simultaneously as possible.
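The reproduction steps can be sketched roughly as follows. This is not our actual script: the host names, the `run_over_ssh` helper, and the use of ssh are assumptions for illustration. The `kubeadm join` flags shown (`--token`, `--discovery-token-ca-cert-hash`, `--control-plane`, `--certificate-key`) are the standard ones for joining a control-plane node with certificate download.

```python
from concurrent.futures import ThreadPoolExecutor
import subprocess


def build_join_command(endpoint, token, ca_cert_hash, certificate_key):
    """Build the argv for joining one additional control-plane node."""
    return [
        "kubeadm", "join", endpoint,
        "--token", token,
        "--discovery-token-ca-cert-hash", ca_cert_hash,
        "--control-plane",
        "--certificate-key", certificate_key,
    ]


def run_over_ssh(host, argv):
    """Hypothetical runner: execute argv on `host` via ssh, return (exit code, output)."""
    result = subprocess.run(["ssh", host, *argv], capture_output=True, text=True)
    return result.returncode, result.stdout + result.stderr


def join_in_parallel(hosts, argv, runner=run_over_ssh):
    """Start the join on every host at (nearly) the same time and collect the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        futures = {host: pool.submit(runner, host, argv) for host in hosts}
        return {host: future.result() for host, future in futures.items()}
```

Calling `join_in_parallel(["master-2", "master-3"], argv)` with values captured from the first master would start both joins together, which is the condition under which we see the failure. The `runner` parameter is injectable so the flow can be exercised without real hosts.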

Anything else we need to know?

You people rock. I love kubeadm.

@neolit123
Member

> We use an automated script which uses kubeadm to spin up the first master. It then captures the relevant details to spin up 2 additional masters "simultaneously".

hi, we are seeing flakes when trying to join parallel CP nodes to the cluster.
this is problematic and until etcd releases a new version we won't be able to solve it correctly.

> This has never occurred, to my knowledge, in version 1.14. We have spun up hundreds of clusters in 1.14.

we did claim that kubeadm has this working properly in 1.15/16, but unfortunately it does not work as expected.

i don't see how it would have worked in 1.14, as the etcd member join logic had no retries.

> Oct 18 16:03:25 192-168-123-102 kubeadm[6588]: error execution phase control-plane-prepare/download-certs: error downloading certs: error downloading the secret: rpc error: code = Unavailable desc = etcdserver: leader changed

i actually haven't seen this particular error.
i'm guessing it will not happen if you add the 2 CPs serially?
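a rough sketch of what serial joins with a retry could look like in a wrapper script. this is only an illustration, not something kubeadm provides and not a confirmed fix: `join_serially`, `join_once`, and the retry parameters are hypothetical names, and a failed join may leave state on the node that needs `kubeadm reset` before a retry, which is not shown here.

```python
import time

# Error text treated as transient; taken from the log in this report.
TRANSIENT_MARKERS = ("etcdserver: leader changed",)


def join_serially(hosts, join_once, attempts=3, delay_seconds=10.0, sleep=time.sleep):
    """Join hosts one at a time, retrying a host only on a known transient error.

    `join_once(host)` is a hypothetical callable returning (exit_code, output).
    Returns a dict mapping each host to the attempt number that succeeded.
    """
    results = {}
    for host in hosts:
        for attempt in range(1, attempts + 1):
            code, output = join_once(host)
            if code == 0:
                results[host] = attempt
                break
            transient = any(marker in output for marker in TRANSIENT_MARKERS)
            if not transient or attempt == attempts:
                # Unknown error, or retries exhausted: stop before touching the next host.
                raise RuntimeError(f"join failed on {host}: {output.strip()}")
            sleep(delay_seconds)
    return results
```

the next host is only attempted after the previous one has joined, so at most one etcd membership change is in flight at a time.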

> You people rock. I love kubeadm.

thanks! :)

i wanted to fold this issue into:
#1793

but let's keep it open for visibility.

/kind bug
/triage support

@k8s-ci-robot added the kind/bug and kind/support labels Oct 18, 2019
@neolit123 added this to the v1.17 milestone Oct 18, 2019
@fabriziopandini added the triage/needs-information label Oct 23, 2019
@neolit123
Member

/close
folding into #1793
which should hopefully solve the concurrent join problems.

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

