
Race condition with etcd when joining control planes #2001

Closed
chuckha opened this issue Jan 10, 2020 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@chuckha

chuckha commented Jan 10, 2020

Is this a BUG REPORT or FEATURE REQUEST?

I'm running cluster-api on docker for mac. I've not had this problem until recently when simultaneously running kubeadm join. However, this might be my own problem as I am running --ignore-preflight-errors=all. I really thought I could join nodes concurrently. Am I wrong in that assumption? Should I revert to joining one node at a time?

Choose one: BUG REPORT

/kind bug

Versions

1.15.3 and 1.17.0 (I expect it also exists in the 1.16 branch).

What happened?

I init'd a control plane and then simultaneously joined two more control planes to the cluster. One of them errors out with:

E0110 14:40:29.286492       1 machine.go:145] controllers/DockerMachine/DockerMachine-controller "msg"="capd@docker$ /bin/sh -c kubeadm join --config /tmp/kubeadm-controlplane-join-config.yaml --ignore-preflight-errors=all
W0110 14:39:58.992287     349 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
W0110 14:39:58.992827     349 common.go:77] your configuration file uses a deprecated API spec: \"kubeadm.k8s.io/v1beta1\". Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.
[preflight] Running pre-flight checks
	[WARNING FileContent--proc-sys-net-bridge-bridge-nf-call-iptables]: /proc/sys/net/bridge/bridge-nf-call-iptables does not exist
	[WARNING Swap]: running with swap on is not supported. Please disable swap
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -oyaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder \"/etc/kubernetes/pki\"
[certs] Generating \"apiserver-etcd-client\" certificate and key
[certs] Generating \"etcd/peer\" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [test-1-controlplane-1 localhost] and IPs [172.17.0.6 127.0.0.1 ::1]
[certs] Generating \"etcd/healthcheck-client\" certificate and key
[certs] Generating \"etcd/server\" certificate and key
[certs] etcd/server serving cert is signed for DNS names [test-1-controlplane-1 localhost] and IPs [172.17.0.6 127.0.0.1 ::1]
[certs] Generating \"apiserver\" certificate and key
[certs] apiserver serving cert is signed for DNS names [test-1-controlplane-1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 172.17.0.6 172.17.0.3 127.0.0.1]
[certs] Generating \"apiserver-kubelet-client\" certificate and key
[certs] Generating \"front-proxy-client\" certificate and key
[certs] Valid certificates and keys now exist in \"/etc/kubernetes/pki\"
[certs] Using the existing \"sa\" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder \"/etc/kubernetes\"
[kubeconfig] Writing \"admin.conf\" kubeconfig file
[kubeconfig] Writing \"controller-manager.conf\" kubeconfig file
[kubeconfig] Writing \"scheduler.conf\" kubeconfig file
[control-plane] Using manifest folder \"/etc/kubernetes/manifests\"
[control-plane] Creating static Pod manifest for \"kube-apiserver\"
W0110 14:40:01.215484     349 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"
W0110 14:40:01.224932     349 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"
[control-plane] Creating static Pod manifest for \"kube-controller-manager\"
W0110 14:40:01.225930     349 manifests.go:214] the default kube-apiserver authorization-mode is \"Node,RBAC\"; using \"Node,RBAC\"
[control-plane] Creating static Pod manifest for \"kube-scheduler\"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Downloading configuration for the kubelet from the \"kubelet-config-1.17\" ConfigMap in the kube-system namespace
[kubelet-start] Writing kubelet configuration to file \"/var/lib/kubelet/config.yaml\"
[kubelet-start] Writing kubelet environment file with flags to file \"/var/lib/kubelet/kubeadm-flags.env\"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:15.403Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:15.470Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:15.784Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:16.011Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:16.434Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:17.274Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:18.964Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:22.187Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
{\"level\":\"warn\",\"ts\":\"2020-01-10T14:40:28.689Z\",\"caller\":\"clientv3/retry_interceptor.go:61\",\"msg\":\"retrying of unary invoker failed\",\"target\":\"endpoint://client-873d5fb6-fc5a-4a59-a435-dc11e61a6d43/172.17.0.4:2379\",\"attempt\":0,\"error\":\"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members\"}
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: re-configuration failed due to not enough started members
To see the stack trace of this error execute with --v=5 or higher
ERROR! exit status 1" "error"="error running {Cmd:/bin/sh Args:[-c kubeadm join --config /tmp/kubeadm-controlplane-join-config.yaml --ignore-preflight-errors=all]}: exit status 1" "cluster"="test-1" "docker-cluster"="test-1" "docker-machine"={"Namespace":"default","Name":"controlplane-1"} "machine"="controlplane-1" 

What you expected to happen?

I expected to be able to run the control plane join simultaneously and have a bootstrapped cluster.

How to reproduce it (as minimally and precisely as possible)?

This happens when running the docker end-to-end tests in cluster-api. I'm working off a fork, so it's not super easy to reproduce at the moment, and this might not be a real bug; it might be my fault for bad assumptions or for ignoring preflight checks.
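The failure mode can be sketched with a toy model (an assumption-laden illustration only, not etcd's actual reconfiguration logic): etcd rejects adding a new member while a previously added member has not yet started, so when two control planes race to add their etcd members concurrently, one of them can lose with exactly the error shown above.

```python
import threading
import time


class ToyEtcdCluster:
    """Toy model (NOT real etcd): adding a member is rejected while a
    previously added member has not yet started, which is roughly the
    check behind 'not enough started members'."""

    def __init__(self):
        self._lock = threading.Lock()
        self.started = 1   # the init'd control plane's etcd member
        self.pending = 0   # members added to the cluster but not started yet

    def add_member(self):
        with self._lock:
            if self.pending >= 1:
                raise RuntimeError(
                    "etcdserver: re-configuration failed due to "
                    "not enough started members")
            self.pending += 1
        time.sleep(0.1)    # simulate the new etcd member starting up
        with self._lock:
            self.pending -= 1
            self.started += 1


def join_concurrently(cluster, n=2):
    """Mimic n control planes racing to add their etcd members."""
    errors = []
    barrier = threading.Barrier(n)

    def worker():
        barrier.wait()     # make the joins genuinely simultaneous
        try:
            cluster.add_member()
        except RuntimeError as exc:
            errors.append(str(exc))

    threads = [threading.Thread(target=worker) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors


def join_serially(cluster, n=2):
    """Join one control plane at a time: no pending-member overlap."""
    errors = []
    for _ in range(n):
        try:
            cluster.add_member()
        except RuntimeError as exc:
            errors.append(str(exc))
    return errors
```

In this toy model the concurrent path typically leaves one join failing while the serial path always succeeds, mirroring the workaround of joining control planes one at a time.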

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 10, 2020
@neolit123
Member

I'm running cluster-api on docker for mac. I've not had this problem until recently when simultaneously running kubeadm join. However, this might be my own problem as I am running --ignore-preflight-errors=all. I really thought I could join nodes concurrently. Am I wrong in that assumption? Should I revert to joining one node at a time?

it works for some users, not for others.
from my experience and last time i checked, it was flaking 1/5 times with etcd errors.

right now our etcd clients race to join, but if something goes wrong kubeadm does nothing.
this leads me to say the concurrent CP join is flaky for the time being.

our long term fix is tracked here:
#1793

/close

@k8s-ci-robot
Contributor

@neolit123: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

cavcrosby added a commit to cavcrosby/homelab-cm that referenced this issue Jul 24, 2022
I noticed when doing my Kubernetes cluster setup, that it would be
common to see at least once that either controller 2 or 3 would fail to
join the cluster as an additional control plane. The error present would
say something like, "etcdserver: re-configuration failed due to not
enough started members".

It appears that joining additional control planes concurrently is somewhat
flaky. For now, my solution is to join the additional control planes
serially instead of concurrently.
For reference on a related GitHub issue with the same error seen:
kubernetes/kubeadm#2001
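The serial approach the commit describes might look like the following Ansible sketch (an assumption-laden illustration: the host group names and file paths are hypothetical placeholders, not taken from homelab-cm; `serial: 1` is the Ansible play keyword that runs a play on one host at a time):

```yaml
# Sketch: join additional control planes one host at a time.
# Group names below are hypothetical placeholders.
- name: Join additional control planes serially
  hosts: k8s_controllers:!k8s_first_controller
  serial: 1                      # one host per pass through the play
  become: true
  tasks:
    - name: Run kubeadm join for an additional control plane
      ansible.builtin.command:
        cmd: kubeadm join --config /tmp/kubeadm-controlplane-join-config.yaml
        creates: /etc/kubernetes/kubelet.conf   # skip hosts already joined
```

With `serial: 1`, each host's etcd member is fully started before the next host begins its join, which avoids the race described in this issue.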