Cluster fails to start if multiple control plane nodes are added. #3680

Open
terryjix opened this issue Jul 11, 2024 · 6 comments
Labels
kind/support Categorizes issue or PR as a support question.

Comments

@terryjix

What happened:
The cluster fails to create if I add multiple control plane nodes to it:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker

Error logs

{"level":"warn","ts":"2024-07-11T08:50:49.476205Z","logger":"etcd-client","caller":"v3@v3.5.10/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00062ee00/172.18.0.5:2379","attempt":0,"error":"rpc error: code = FailedPrecondition desc = etcdserver: can only promote a learner member which is in sync with leader"}
I0711 08:50:49.476269     249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join.runEtcdPhase
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/join/controlplanejoin.go:156
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:259
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.7.0/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.7.0/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:1695
error execution phase control-plane-join/etcd
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:260
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:446
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:232
k8s.io/kubernetes/cmd/kubeadm/app/cmd.newCmdJoin.func1
        k8s.io/kubernetes/cmd/kubeadm/app/cmd/join.go:183
github.com/spf13/cobra.(*Command).execute
        github.com/spf13/cobra@v1.7.0/command.go:940
github.com/spf13/cobra.(*Command).ExecuteC
        github.com/spf13/cobra@v1.7.0/command.go:1068
github.com/spf13/cobra.(*Command).Execute
        github.com/spf13/cobra@v1.7.0/command.go:992
k8s.io/kubernetes/cmd/kubeadm/app.Run
        k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:52
main.main
        k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
        runtime/proc.go:271
runtime.goexit
        runtime/asm_amd64.s:169

What you expected to happen:
kind supports creating a Kubernetes cluster with multiple control plane nodes.

How to reproduce it (as minimally and precisely as possible):
Use the following configuration to launch a cluster:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
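
For example, assuming the config above is saved as kind-config.yaml (the file name is only illustrative), the cluster can be created with:

kind create cluster --config kind-config.yaml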

Anything else we need to know?:

Environment:

  • kind version:
    kind version 0.23.0
  • Runtime info: (use docker info, podman info or nerdctl info):
    Client:
    Version: 25.0.3
    Context: default
    Debug Mode: false
    Plugins:
    buildx: Docker Buildx (Docker Inc.)
    Version: v0.0.0+unknown
    Path: /usr/libexec/docker/cli-plugins/docker-buildx

Server:
Containers: 21
Running: 0
Paused: 0
Stopped: 21
Images: 78
Server Version: 25.0.3
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Using metacopy: false
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: systemd
Cgroup Version: 2
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 64b8a811b07ba6288238eefc14d898ee0b5b99ba
runc version: 4bccb38cc9cf198d52bebf2b3a90cd14e7af8c06
init version: de40ad0
Security Options:
seccomp
Profile: builtin
cgroupns
Kernel Version: 6.1.94-99.176.amzn2023.x86_64
Operating System: Amazon Linux 2023.5.20240701
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.629GiB
Name: ip-172-31-18-230.eu-west-1.compute.internal
ID: c3b0373c-7367-45d1-8e7b-12a0ff695616
Docker Root Dir: /var/lib/docker
Debug Mode: false
Experimental: false
Insecure Registries:
binglj.people.aws.dev:443
127.0.0.0/8
Live Restore Enabled: false

  • OS (e.g. from /etc/os-release):
    Amazon Linux 2023
  • Kubernetes version: (use kubectl version):
    1.30.0
  • Any proxies or other special environment settings?:
@terryjix terryjix added the kind/bug Categorizes issue or PR as related to a bug. label Jul 11, 2024
@neolit123
Member

I0711 08:50:49.476269 249 etcd.go:550] [etcd] Promoting the learner 86e5aab36dbb6fb7 failed: etcdserver: can only promote a learner member which is in sync with leader
etcdserver: can only promote a learner member which is in sync with leader
error creating local etcd static pod manifest file

@pacoxu didn't we wait for sync to happen before promote?

@terryjix
Author

I added the following arguments to the configuration file:

  kubeadmConfigPatches:
  - |
    kind: ClusterConfiguration
    featureGates:
      EtcdLearnerMode: false
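
For reference, a sketch of how that patch might sit in a complete config, combined with the reproducer above (the file name is illustrative; the kubeadmConfigPatches field and the EtcdLearnerMode gate are taken verbatim from the snippet):

# write the config (name is illustrative), then create the cluster
cat <<EOF > kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
kubeadmConfigPatches:
- |
  kind: ClusterConfiguration
  featureGates:
    EtcdLearnerMode: false
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
- role: worker
- role: worker
EOF
kind create cluster --config kind-config.yaml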

The kubelet then fails to start with another error message:

Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281034     379 factory.go:221] Registration of the systemd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.281297     379 factory.go:219] Registration of the crio container factory failed: Get "http://%2Fvar%2Frun%2Fcrio%2Fcrio.sock/info": dial unix /var/run/crio/crio.sock: connect: no such file or directory
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: I0711 09:46:25.290080     379 factory.go:221] Registration of the containerd container factory successfully
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290422     379 manager.go:294] Registration of the raw container factory failed: inotify_init: too many open files
Jul 11 09:46:25 k8s-playground-worker kubelet[379]: E0711 09:46:25.290542     379 kubelet.go:1530] "Failed to start cAdvisor" err="inotify_init: too many open files"

@neolit123
Member

Failed to start cAdvisor" err="inotify_init: too many open files"

maybe an ulimit problem:
#2744 (comment)

@terryjix
Author

terryjix commented Jul 11, 2024

There is no ulimit issue if I only add one control-plane node to the cluster.
I'm trying to find a way to update the sysctl configuration.
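
As a starting point, the kind known-issues page on "too many open files" suggests raising the inotify limits roughly like this (the values below are the ones from that doc and may need tuning for this host):

# raise the inotify limits for the current boot
sudo sysctl fs.inotify.max_user_watches=524288
sudo sysctl fs.inotify.max_user_instances=512

To make the change persistent across reboots, the same two settings can be added to /etc/sysctl.conf (or a file under /etc/sysctl.d/).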


@BenTheElder BenTheElder added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 11, 2024
@BenTheElder
Member

BenTheElder commented Jul 11, 2024

This is a lot of nodes. Do you need them, and for what purpose?

Most development should prefer single-node clusters. Each node consumes resources from the host, and unlike a "real" cluster, adding more nodes does not actually add more resources (it only appears to). You are almost certainly hitting resource limits on the host (see the known-issues doc re: inotify above, though this may not be the only limit you're hitting).
