
Parallel control-plane joining HA cluster fails due to etcd leader conflict #17870

Closed
fg78nc opened this issue Nov 29, 2019 · 3 comments

@fg78nc

fg78nc commented Nov 29, 2019

Statement from https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/

Note: Since kubeadm version 1.15 you can join multiple control-plane nodes in parallel. Prior to this version, you must join new control plane nodes sequentially, only after the first node has finished initializing.

I believe joining multiple control-plane nodes in parallel may sometimes fail due to an etcd leader change.
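
For context, the sequential approach the docs describe for pre-1.15 kubeadm still works as a mitigation: join control-plane nodes one at a time, so only one etcd member is being added at any moment. A rough sketch (the load-balancer endpoint, token, hash, and certificate key below are placeholders, not values from this cluster):

# Run on the second control-plane node; wait for it to finish before starting the third.
kubeadm join LOAD_BALANCER_DNS:6443 --control-plane \
    --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --certificate-key <certificate-key>
# Confirm the node is Ready before joining the next one:
kubectl get nodes -w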

Please see stack trace below:

{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rpc error: code = Unavailable desc = etcdserver: leader changed","code":500}
rpc error: code = Unavailable desc = etcdserver: leader changed

I1129 03:27:54.198933   11926 round_trippers.go:443] GET https://34.74.158.255:6443/api/v1/namespaces/kube-system/secrets/kubeadm-certs 500 Internal Server Error in 2083 milliseconds
I1129 03:27:54.198978   11926 round_trippers.go:449] Response Headers:
I1129 03:27:54.198986   11926 round_trippers.go:452]     Date: Fri, 29 Nov 2019 03:27:54 GMT
I1129 03:27:54.198992   11926 round_trippers.go:452]     Cache-Control: no-cache, private
I1129 03:27:54.198999   11926 round_trippers.go:452]     Content-Type: application/json
I1129 03:27:54.199005   11926 round_trippers.go:452]     Content-Length: 156
I1129 03:27:54.199045   11926 request.go:968] Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"rpc error: code = Unavailable desc = etcdserver: leader changed","code":500}
rpc error: code = Unavailable desc = etcdserver: leader changed
error downloading the secret
k8s.io/kubernetes/cmd/kubeadm/app/phases/copycerts.DownloadCerts
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/phases/copycerts/copycerts.go:227
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runControlPlanePrepareDownloadCertsPhaseLocal
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplaneprepare.go:225
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:236
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:424
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:209
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdJoin.func1
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:169
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:830
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
        _output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        /usr/local/go/src/runtime/proc.go:200
runtime.goexit
        /usr/local/go/src/runtime/asm_amd64.s:1337
error downloading certs
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runControlPlanePrepareDownloadCertsPhaseLocal
        /workspace/anago-v1.16.3-beta.0.56+b3cbbae08ec52a/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplaneprepare.go:226
@neolit123
Member

/close
hi, this is a known problem and we need changes in etcd and kubeadm.
our existing workaround to make the members wait for each other when joining is not working very well.

please track this issue:
kubernetes/kubeadm#1793
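
Until that lands, a blunt client-side mitigation (an editorial sketch, not kubeadm's internal workaround) is to retry a failed control-plane join with a backoff, resetting the partial join first; all variables are placeholders:

# Illustrative retry loop only; not what kubeadm does internally.
for attempt in 1 2 3 4 5; do
  kubeadm join "${LB_ENDPOINT}:6443" --control-plane \
      --token "${TOKEN}" \
      --discovery-token-ca-cert-hash "sha256:${CA_HASH}" \
      --certificate-key "${CERT_KEY}" && break
  echo "join attempt ${attempt} failed; cleaning up and retrying"
  kubeadm reset -f
  sleep $((attempt * 30))
done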

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

/close
hi, this is a known problem and we need changes in etcd and kubeadm.
our existing workaround to make the members wait for each other when joining is not working very well.

please track this issue:
kubernetes/kubeadm#1793

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@manosnoam

hi, this is a known problem and we need changes in etcd and kubeadm.
our existing workaround to make the members wait for each other when joining is not working very well.

@neolit123, I'm seeing "Error from server: etcdserver: leader changed" on Kubernetes v1.17.1
when describing a new secret (docker-registry) that was just created.

Can you provide the workaround you used?

These are the events logged during the LeaderElection change:

  openshift-operator-lifecycle-manager                    135m        Normal    ScalingReplicaSet                            deployment/packageserver                                                      (combined from similar events): Scaled down replica set packageserver-78db79dd96 to 0
  default                                                 135m        Normal    Reboot                                       node/default-cl1-hk655-master-2                                               Node will reboot into config rendered-master-7054e394074dd9e93e7af7b86ab89a6e
  default                                                 135m        Normal    PendingConfig                                node/default-cl1-hk655-master-2                                               Written pending config rendered-master-7054e394074dd9e93e7af7b86ab89a6e
  openshift-apiserver-operator                            135m        Normal    OperatorStatusChanged                        deployment/openshift-apiserver-operator                                       Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to "OpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/sa.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
  openshift-operator-lifecycle-manager                    135m        Normal    Killing                                      pod/packageserver-78db79dd96-qwpcq                                            Stopping container packageserver
  openshift-cloud-credential-operator                     135m        Normal    LeaderElection                               configmap/cloud-credential-operator-leader                                    cloud-credential-operator-58d66f676f-ddcpc_860303fd-6c39-11eb-b840-0a580afc0016 became leader
  openshift-apiserver-operator                            135m        Normal    OperatorStatusChanged                        deployment/openshift-apiserver-operator                                       Status for clusteroperator/openshift-apiserver changed: Degraded message changed from "OpenshiftAPIServerStaticResourcesDegraded: \"v3.11.0/openshift-apiserver/sa.yaml\" (string): etcdserver: leader changed\nOpenshiftAPIServerStaticResourcesDegraded: \nAPIServerDeploymentDegraded: 1 of 3 requested instances are unavailable" to "APIServerDeploymentDegraded: 1 of 3 requested instances are unavailable"
  openshift-apiserver-operator                            135m        Normal    OperatorStatusChanged                        deployment/openshift-apiserver-operator                                       Status for clusteroperator/openshift-apiserver changed: Progressing changed from True to False ("")
  openshift-kube-controller-manager                       135m        Normal    LeaderElection                               configmap/cert-recovery-controller-lock                                       0ba2a6fa-4088-4457-8aac-6e2ff97c56d6 became leader
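
A crude client-side mitigation while the election settles is simply to retry the read with a short delay; a minimal sketch (the secret name and namespace are placeholders):

# Retry until the API server can serve the read again.
until kubectl describe secret my-registry-secret -n my-namespace; do
  echo "transient etcd error; retrying in 5s"
  sleep 5
done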
