
kind create cluster fails to remove control plane taint #2867

Closed
fasmat opened this issue Aug 8, 2022 · 20 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@fasmat

fasmat commented Aug 8, 2022

What happened:

I tried to create a cluster with kind create cluster and received the error "failed to remove control plane taint"

What you expected to happen:

Successfully creating a cluster

How to reproduce it (as minimally and precisely as possible):

Install kind version v0.14.0-arm64 and call kind create cluster

Anything else we need to know?:

I'm running kind inside a container with the host's (macOS, M1 Max) Docker socket mounted, and I'm able to run other containers with docker run.

Logs:

$ kind create cluster --loglevel=debug
WARNING: --loglevel is deprecated, please switch to -v and -q!
Creating cluster "kind" ...
DEBUG: docker/images.go:58] Image: kindest/node:v1.24.0@sha256:0866296e693efe1fed79d5e6c7af8df71fc73ae45e3679af05342239cdc5bc8e present locally
 ✓ Ensuring node image (kindest/node:v1.24.0) 🖼
 ✓ Preparing nodes 📦  
DEBUG: config/config.go:96] Using the following kubeadm config for node kind-control-plane:
apiServer:
  certSANs:
  - localhost
  - 127.0.0.1
  extraArgs:
    runtime-config: ""
apiVersion: kubeadm.k8s.io/v1beta3
clusterName: kind
controlPlaneEndpoint: kind-control-plane:6443
controllerManager:
  extraArgs:
    enable-hostpath-provisioner: "true"
kind: ClusterConfiguration
kubernetesVersion: v1.24.0
networking:
  podSubnet: 10.244.0.0/16
  serviceSubnet: 10.96.0.0/16
scheduler:
  extraArgs: null
---
apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- token: abcdef.0123456789abcdef
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 172.19.0.2
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
  kubeletExtraArgs:
    node-ip: 172.19.0.2
    node-labels: ""
    provider-id: kind://docker/kind/kind-control-plane
---
apiVersion: kubeadm.k8s.io/v1beta3
controlPlane:
  localAPIEndpoint:
    advertiseAddress: 172.19.0.2
    bindPort: 6443
discovery:
  bootstrapToken:
    apiServerEndpoint: kind-control-plane:6443
    token: abcdef.0123456789abcdef
    unsafeSkipCAVerification: true
kind: JoinConfiguration
nodeRegistration:
  criSocket: unix:///run/containerd/containerd.sock
  kubeletExtraArgs:
    node-ip: 172.19.0.2
    node-labels: ""
    provider-id: kind://docker/kind/kind-control-plane
---
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd
cgroupRoot: /kubelet
evictionHard:
  imagefs.available: 0%
  nodefs.available: 0%
  nodefs.inodesFree: 0%
failSwapOn: false
imageGCHighThresholdPercent: 100
kind: KubeletConfiguration
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
conntrack:
  maxPerCore: 0
iptables:
  minSyncPeriod: 1s
kind: KubeProxyConfiguration
mode: iptables
 ✓ Writing configuration 📜 
DEBUG: kubeadminit/init.go:82] I0808 18:27:28.895581     126 initconfiguration.go:255] loading configuration from "/kind/kubeadm.conf"
W0808 18:27:28.896451     126 initconfiguration.go:332] [config] WARNING: Ignored YAML document with GroupVersionKind kubeadm.k8s.io/v1beta3, Kind=JoinConfiguration
[init] Using Kubernetes version: v1.24.0
[certs] Using certificateDir folder "/etc/kubernetes/pki"
I0808 18:27:28.900057     126 certs.go:112] creating a new certificate authority for ca
[certs] Generating "ca" certificate and key
I0808 18:27:29.115670     126 certs.go:522] validating certificate period for ca certificate
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [kind-control-plane kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.96.0.1 172.19.0.2 127.0.0.1]
[certs] Generating "apiserver-kubelet-client" certificate and key
I0808 18:27:29.338086     126 certs.go:112] creating a new certificate authority for front-proxy-ca
[certs] Generating "front-proxy-ca" certificate and key
I0808 18:27:29.421219     126 certs.go:522] validating certificate period for front-proxy-ca certificate
[certs] Generating "front-proxy-client" certificate and key
I0808 18:27:29.554232     126 certs.go:112] creating a new certificate authority for etcd-ca
[certs] Generating "etcd/ca" certificate and key
I0808 18:27:29.615892     126 certs.go:522] validating certificate period for etcd/ca certificate
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [kind-control-plane localhost] and IPs [172.19.0.2 127.0.0.1 ::1]
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
I0808 18:27:30.083897     126 certs.go:78] creating new public/private key files for signing service account users
[certs] Generating "sa" key and public key
I0808 18:27:30.124183     126 kubeconfig.go:103] creating kubeconfig file for admin.conf
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
I0808 18:27:30.254718     126 kubeconfig.go:103] creating kubeconfig file for kubelet.conf
[kubeconfig] Writing "kubelet.conf" kubeconfig file
I0808 18:27:30.362542     126 kubeconfig.go:103] creating kubeconfig file for controller-manager.conf
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
I0808 18:27:30.463815     126 kubeconfig.go:103] creating kubeconfig file for scheduler.conf
[kubeconfig] Writing "scheduler.conf" kubeconfig file
I0808 18:27:30.698207     126 kubelet.go:65] Stopping the kubelet
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Starting the kubelet
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
I0808 18:27:30.784998     126 manifests.go:99] [control-plane] getting StaticPodSpecs
I0808 18:27:30.785329     126 certs.go:522] validating certificate period for CA certificate
I0808 18:27:30.785397     126 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-apiserver"
I0808 18:27:30.785414     126 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-apiserver"
I0808 18:27:30.785417     126 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-apiserver"
I0808 18:27:30.785420     126 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-apiserver"
I0808 18:27:30.785424     126 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-apiserver"
I0808 18:27:30.786696     126 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-apiserver" to "/etc/kubernetes/manifests/kube-apiserver.yaml"
I0808 18:27:30.786710     126 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-controller-manager"
I0808 18:27:30.786809     126 manifests.go:125] [control-plane] adding volume "ca-certs" for component "kube-controller-manager"
I0808 18:27:30.786818     126 manifests.go:125] [control-plane] adding volume "etc-ca-certificates" for component "kube-controller-manager"
I0808 18:27:30.786821     126 manifests.go:125] [control-plane] adding volume "flexvolume-dir" for component "kube-controller-manager"
I0808 18:27:30.786823     126 manifests.go:125] [control-plane] adding volume "k8s-certs" for component "kube-controller-manager"
I0808 18:27:30.786826     126 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-controller-manager"
I0808 18:27:30.786828     126 manifests.go:125] [control-plane] adding volume "usr-local-share-ca-certificates" for component "kube-controller-manager"
I0808 18:27:30.786830     126 manifests.go:125] [control-plane] adding volume "usr-share-ca-certificates" for component "kube-controller-manager"
I0808 18:27:30.787252     126 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-controller-manager" to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
I0808 18:27:30.787273     126 manifests.go:99] [control-plane] getting StaticPodSpecs
[control-plane] Creating static Pod manifest for "kube-scheduler"
I0808 18:27:30.787392     126 manifests.go:125] [control-plane] adding volume "kubeconfig" for component "kube-scheduler"
I0808 18:27:30.787617     126 manifests.go:154] [control-plane] wrote static Pod manifest for component "kube-scheduler" to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0808 18:27:30.787952     126 local.go:65] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0808 18:27:30.787989     126 waitcontrolplane.go:83] [wait-control-plane] Waiting for the API server to be healthy
I0808 18:27:30.788334     126 loader.go:372] Config loaded from file:  /etc/kubernetes/admin.conf
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
I0808 18:27:30.790085     126 round_trippers.go:553] GET https://kind-control-plane:6443/healthz?timeout=10s  in 0 milliseconds
 ✗ Starting control-plane 🕹️ 
ERROR: failed to create cluster: failed to remove control plane taint: command "docker exec --privileged kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes --all node-role.kubernetes.io/control-plane- node-role.kubernetes.io/master-" failed with error: exit status 1
Command Output: The connection to the server kind-control-plane:6443 was refused - did you specify the right host or port?
Stack Trace: 
sigs.k8s.io/kind/pkg/errors.WithStack
        sigs.k8s.io/kind/pkg/errors/errors.go:59
sigs.k8s.io/kind/pkg/exec.(*LocalCmd).Run
        sigs.k8s.io/kind/pkg/exec/local.go:124
sigs.k8s.io/kind/pkg/cluster/internal/providers/docker.(*nodeCmd).Run
        sigs.k8s.io/kind/pkg/cluster/internal/providers/docker/node.go:146
sigs.k8s.io/kind/pkg/cluster/internal/create/actions/kubeadminit.(*action).Execute
        sigs.k8s.io/kind/pkg/cluster/internal/create/actions/kubeadminit/init.go:140
sigs.k8s.io/kind/pkg/cluster/internal/create.Cluster
        sigs.k8s.io/kind/pkg/cluster/internal/create/create.go:135
sigs.k8s.io/kind/pkg/cluster.(*Provider).Create
        sigs.k8s.io/kind/pkg/cluster/provider.go:182
sigs.k8s.io/kind/pkg/cmd/kind/create/cluster.runE
        sigs.k8s.io/kind/pkg/cmd/kind/create/cluster/createcluster.go:80
sigs.k8s.io/kind/pkg/cmd/kind/create/cluster.NewCommand.func1
        sigs.k8s.io/kind/pkg/cmd/kind/create/cluster/createcluster.go:55
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.4.0/command.go:856
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.4.0/command.go:974
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.4.0/command.go:902
sigs.k8s.io/kind/cmd/kind/app.Run
        sigs.k8s.io/kind/cmd/kind/app/main.go:53
sigs.k8s.io/kind/cmd/kind/app.Main
        sigs.k8s.io/kind/cmd/kind/app/main.go:35
main.main
        sigs.k8s.io/kind/main.go:25
runtime.main
        runtime/proc.go:250
runtime.goexit
        runtime/asm_arm64.s:1263

Environment:

  • kind version: (use kind version):
    kind v0.14.0 go1.18.2 linux/arm64
  • Kubernetes version: (use kubectl version):
    Client Version: v1.24.3
    Kustomize Version: v4.5.4
  • Docker version: (use docker info):
Client:
  Context:    default
  Debug Mode: false
  Plugins:
    buildx: Docker Buildx (Docker Inc., 0.8.2+azure-1)
    compose: Docker Compose (Docker Inc., 2.9.0+azure-1)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1
 runc version: v1.1.2-0-ga916309
 init version: de40ad0
 Security Options:
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.10.104-linuxkit
 Operating System: Docker Desktop
 OSType: linux
 Architecture: aarch64
 CPUs: 5
 Total Memory: 14.62GiB
 Name: docker-desktop
 ID: DWAP:AOR6:N5DU:HCAK:GC35:RRZ6:4YMP:4JVL:UJ66:GKCY:N6RR:VAAL
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 HTTP Proxy: http.docker.internal:3128
 HTTPS Proxy: http.docker.internal:3128
 No Proxy: hubproxy.docker.internal
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  hubproxy.docker.internal:5000
  127.0.0.0/8
 Live Restore Enabled: false
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
fasmat added the kind/bug label Aug 8, 2022
@BenTheElder
Member

BenTheElder commented Aug 8, 2022

Can you share the logs folder from kind create cluster --retain; kind export logs; kind delete cluster?

Just to confirm: Everything is running in arm64 mode, no amd64 images / binaries? (See: #2718 which has been a somewhat common issue for M1 users)

@fasmat
Author

fasmat commented Aug 8, 2022

Maybe to clarify: running kind create cluster on the Mac directly works without issues.

I'm trying to run the command from inside a container that has access to the host's Docker socket. This container is also built for arm64 and can run other containers without issues.

Alternatively if I create the cluster on the host, how can I access it from inside a container? Is there some way to access kind-control-plane with kubectl from inside my container, e.g. by having both of them inside the same docker network?

@fasmat
Author

fasmat commented Aug 8, 2022

I switched from docker-from-docker to a docker-in-docker setup.

This is probably less performant, but it works for now. Closing the issue, since I probably tried something that is not officially supported.

fasmat closed this as completed Aug 8, 2022
@BenTheElder
Member

Alternatively if I create the cluster on the host, how can I access it from inside a container? Is there some way to access kind-control-plane with kubectl from inside my container, e.g. by having both of them inside the same docker network?

You can put the other container on the "kind" docker network, either with the --net flag or with "docker network connect".

Then you can use "kind export kubeconfig --internal".
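
For reference, a minimal sketch of that approach (assuming the other container is named dev-container; adjust the name to your setup, and run the commands wherever kind and kubectl actually live):

# attach the existing client container to the "kind" docker network
docker network connect kind dev-container

# write a kubeconfig that uses the in-network endpoint (https://kind-control-plane:6443)
kind export kubeconfig --internal

# kubectl inside dev-container can now reach the cluster by its container name
kubectl get nodes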

@fasmat
Author

fasmat commented Aug 8, 2022

You can put the other container on the "kind" docker network, either with the --net flag or with "docker network connect".

Then you can use "kind export kubeconfig --internal".

Thanks! I'll try that and compare performance.

@BenTheElder
Member

It’s weird that this is working from the host but not via docker socket in a container. I’m not sure why we’d see that. Maybe proxy config differences?

@fasmat
Author

fasmat commented Aug 8, 2022

I'm not sure. I need to test more to find out. I had issues with volumes/mounts before when using the host's docker socket: I had to give the path as it is on the host instead of how it would be inside the container.

@luctowers

I am also having this issue on my M1 MacBook Air when using a docker-from-docker setup.

@luctowers

luctowers commented Aug 25, 2022

Interestingly, docker network connect kind $HOSTNAME does allow me to curl the control plane via https://kind-control-plane:6443.

Despite this, I still get the following error when starting the control plane via kind create cluster.

ERROR: failed to create cluster: failed to remove control plane taint: command "docker exec --privileged kind-control-plane kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes --all node-role.kubernetes.io/control-plane- node-role.kubernetes.io/master-" failed with error: exit status 1
Command Output: The connection to the server kind-control-plane:6443 was refused - did you specify the right host or port?

Never mind, it's a problem with name resolution from within the control-plane container itself, not from the caller of kind create cluster.

@luctowers

After some investigating: the problem is that name resolution inside the control-plane container doesn't work immediately after the container is started. Adding a sleep before running the remove-taint command works as a hacky fix.

diff --git a/pkg/cluster/internal/create/actions/kubeadminit/init.go b/pkg/cluster/internal/create/actions/kubeadminit/init.go
index cc587940..e9778ce9 100644
--- a/pkg/cluster/internal/create/actions/kubeadminit/init.go
+++ b/pkg/cluster/internal/create/actions/kubeadminit/init.go
@@ -19,6 +19,7 @@ package kubeadminit
 
 import (
        "strings"
+       "time"
 
        "sigs.k8s.io/kind/pkg/errors"
        "sigs.k8s.io/kind/pkg/exec"
@@ -135,6 +136,7 @@ func (a *action) Execute(ctx *actions.ActionContext) error {
                taintArgs := []string{"--kubeconfig=/etc/kubernetes/admin.conf", "taint", "nodes", "--all"}
                taintArgs = append(taintArgs, taints...)
 
+               time.Sleep(5 * time.Second)
                if err := node.Command(
                        "kubectl", taintArgs...,
                ).Run(); err != nil {

IDK why this is the case with our environments. We are both running docker-from-docker on Apple M1, not sure how much is coincidence.

Thoughts @BenTheElder ?

@luctowers

I was also able to reproduce this issue with a docker-from-docker setup on x86_64 Ubuntu.

@BenTheElder
Member

The problem is, the name resolution in the control-plane doesn't work immediately when the container is started.

Uh, that shouldn't be the case. Sounds like a docker bug?

Can you explain what you mean by "docker-from-docker", exactly? Is that like docker-in-docker (docker running inside of a docker container) or docker with the socket mounted to a container?

@luctowers

Can you explain what you mean by "docker-from-docker", exactly? Is that like docker-in-docker (docker running inside of a docker container) or docker with the socket mounted to a container?

Just a minimal setup mounting /var/run/docker.sock from the host into the container.
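
i.e. roughly the following, where dev-image is just a placeholder for an image that has the docker CLI and kind installed:

docker run --rm -it \
  -v /var/run/docker.sock:/var/run/docker.sock \
  dev-image \
  kind create cluster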

@luctowers

luctowers commented Aug 26, 2022

Uh, that shouldn't be the case. Sounds like a docker bug?

Maybe... I tested with older versions of kind, and they have the same issue. Seems odd that we are only just finding it now, unless something got broken in docker.

@BenTheElder
Member

Sorry, too many things going on 😅

Does this happen if you use the docker client from the host, running on the host, instead?

It sounds like this environment is broken, and I'd rather not add a sleep to hack around it (it could just take longer elsewhere as well).

We could consider, for example, using loopback instead. But I'd like to know more about why this is failing; something is off with the network in this setup.
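
For illustration only, "using loopback" for this particular step could look roughly like the following: the apiserver serving cert already includes 127.0.0.1 (see the certs output in the log above), so the node-local kubectl call wouldn't depend on resolving the container name at all. Just a sketch, not a committed change:

docker exec --privileged kind-control-plane \
  kubectl --kubeconfig=/etc/kubernetes/admin.conf \
  --server=https://127.0.0.1:6443 \
  taint nodes --all node-role.kubernetes.io/control-plane- node-role.kubernetes.io/master-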

@luctowers

luctowers commented Sep 13, 2022

Does this happen if you use the docker client from the host, running on the host, instead?

No, it does not. It may have something to do with the added latency or overhead of going through the mounted socket somehow? I agree the sleep is a bad, hacky solution; I think using the loopback address is likely best.

@phroggyy

Facing the same issue here; it seems highly error-prone. I've observed that it seems to start working at random after several retries.

I also don't see how the socket should be an issue here, as the error is DNS resolution from within the container 🤔

Note that in my case I've been using the kind Go library rather than the kind binary to perform these operations. I'm not sure whether that is the case for others in this issue, or whether the command in the kind binary has some built-in retry mechanism? It certainly seems like an obscure problem that's hard to debug.

An additional note on my setup is that I've been calling docker over TCP rather than a unix socket (running socat on local to proxy to the socket).

@BenTheElder
Member

I also don't see how the socket should be an issue here, as the error is DNS resolution from within the container 🤔

So the DNS resolution in the container comes from a (different) socket docker embeds.

See the note about custom networks in:

https://docs.docker.com/config/containers/container-networking/#dns-services

If you're using kind via a container that mounts the docker socket, it's possible docker behaves differently here (?)
Unfortunately I haven't had time to dig into this myself yet.
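
If someone hitting this can capture what the node container actually sees around the failure, that would help. A diagnostic sketch (assuming getent is available in the node image):

# docker's embedded DNS on user-defined networks normally listens on 127.0.0.11
docker exec kind-control-plane cat /etc/resolv.conf

# check whether the node can resolve its own container name
docker exec kind-control-plane getent hosts kind-control-plane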

This is not really a use case we've been focused on; kind is a statically linked Go binary meant to be run on the host.

Rather than the kind binary to perform these operations. I'm not sure if that is the case for others in this issue, and if so, that the command in the kind binary has some built-in retry mechanism?

No, the CLI code is a very small wrapper over the public APIs; there's no special retry logic.

An additional note on my setup is that I've been calling docker over TCP rather than a unix socket (running socat on local to proxy to the socket).

FYI, this is unfortunately also known to have issues: off the top of my head, there's no way for kind to reliably reserve a random TCP port for the API server. We do permit setting an explicit 0 port and letting docker pick instead, but then on restart docker will assign another port. (I don't have the link handy, but there's past discussion in the issue tracker.)

phroggyy added a commit to phroggyy/cluster-api-provider-kind that referenced this issue Sep 24, 2022
By only triggering reconciliation for spec updates, we reduce the
risk of our cluster never starting due to network errors. We only
need it to work once rather than twice.

See kubernetes-sigs/kind#2867
@phroggyy

phroggyy commented Sep 25, 2022

@BenTheElder to clarify, I run socat on my local machine to expose the docker socket specifically on :2375 (I wanted to minimise messing with the Docker Desktop configuration). So I don't think the TCP part should cause further issues.
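
Concretely, the proxy side of the setup is more or less this (a sketch; the exact socat invocation may differ from what I actually run):

socat TCP-LISTEN:2375,reuseaddr,fork UNIX-CONNECT:/var/run/docker.sock &
export DOCKER_HOST=tcp://127.0.0.1:2375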

Do you think it would make sense to add some retry mechanism to this taint call? What's happening now is that the whole creation gets rolled back after the cluster is already created, effectively due to networking. I'm wondering if it might make sense to add an exponential backoff retry (with a fairly low max) to make this more reliable. As you said, kind isn't really built for this use case, so I'd also understand if you don't want to add maintenance burden for an edge case. Thoughts?
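
Something along these lines is what I have in mind; a rough sketch only (the helper name and limits are illustrative, not kind's actual API), wrapping the same node.Command("kubectl", taintArgs...).Run() call from init.go shown in the diff above:

// runWithBackoff retries fn with exponential backoff.
// Sketch only; would need "time" imported in init.go.
func runWithBackoff(fn func() error, attempts int, initial time.Duration) error {
	var err error
	delay := initial
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		time.Sleep(delay)
		delay *= 2 // e.g. 1s, 2s, 4s, 8s
	}
	return err
}

// usage, replacing the plain Run() call in Execute():
// if err := runWithBackoff(func() error {
//	return node.Command("kubectl", taintArgs...).Run()
// }, 4, time.Second); err != nil {
//	// existing error handling
// }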

Also, just to clarify what's happening: the error here isn't coming from a call to the docker daemon. Rather, the error we're getting comes from the kubectl call running inside the new cluster (on the node). This should, at least from my limited understanding, work the same regardless of whether the docker call is made locally or through a mounted socket (or over the network), since our error happens inside the created container. Put differently, it's not docker exec --privileged that's failing, but rather kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes ... within the container (and that is executed within the container regardless of how you called docker, I'd assume).

@BenTheElder
Member

Do you think it would make sense to add some retry mechanism to this taint call?

Tentatively, that's just patching over one particular symptom of the networking being broken.

What's happening now is that the whole creation gets rolled back after the cluster is already created, effectively due to networking.

Well yes, we can't very well run a functional cluster with broken networking.

I'm wondering if it might make sense to add an exponential backoff retry (with a fairly low max) to make this more reliable.

I don't think that's reasonable after the API server is up; this is a local API call executed on one of the control plane nodes itself. We already have an exponential retry in kubeadm waiting for the API server to be ready. Perhaps one retry, but again, this should not flake: it should be a very cheap local call, and if it's failing, it's a symptom of the cluster being in a bad state of some sort.

I suggested a possible solution above, but I'd like to understand what / why this is actually broken before I jump on making any changes.

There's no reason resolving the container names should fail; docker is responsible for this.

Or put differently, it's not docker exec --privileged that's failing, but rather kubectl --kubeconfig=/etc/kubernetes/admin.conf taint nodes ... within the container (and it's executed within that container regardless of how you called docker, I'd assume).

Yes, but it only seems to be failing when kind is used with the docker socket mounted, and it seems to be related to DNS issues. That makes me think mounting the docker socket when creating the cluster is leading to somewhat broken DNS in the cluster, which doesn't make sense given my understanding of how docker implements DNS, but then none of this makes sense ... The DNS response for the node name should come locally from docker and should be quick and reliable ™️

So far we've had no reports of this with a standard local docker socket, without containerizing kind itself or using docker over TCP, though I can't fathom why those would be relevant.

Unfortunately, without a way to replicate this, I'm reliant on you all to identify why docker containers are not reliably able to resolve themselves or what else is making this call fail.
