
Provision 3 control plane nodes using kubeadm got etcd errors #8250

Closed
yongxiu opened this issue Mar 7, 2023 · 14 comments
Labels
kind/support Categorizes issue or PR as a support question. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@yongxiu
Contributor

yongxiu commented Mar 7, 2023

What steps did you take and what happened:

I use Cluster API with my own Cluster API provider to provision the cluster; my provider uses kubeadm to initialize the control plane nodes.

What did you expect to happen:

It should provision the cluster successfully

Anything else you would like to add:

Provisioning is flaky: it sometimes fails, and when it does, the 2nd control plane node reports the errors below:

	[WARNING ImagePull]: failed to pull image registry.io/pause:3.8: output: E0303 09:21:22.675900   27743 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \\"registry.io/pause:3.8\\": failed to resolve reference \\"registry.io/pause:3.8\\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized" image="registry.io/pause:3.8"
time="2023-03-03T09:21:22Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \\"registry.io/pause:3.8\\": failed to resolve reference \\"registry.io/pause:3.8\\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized"
, error: exit status 1
{"level":"warn","ts":"2023-03-03T09:21:24.488Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764000/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:24.616Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000391500/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:24.783Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0fc0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:25.044Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0c40/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:25.404Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0e00/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:25.941Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764540/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:26.758Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d1500/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:28.005Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764700/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:29.767Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0c40/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:32.445Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0e00/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:36.428Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001968c0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:42.483Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00090c000/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:51.399Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764000/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:22:04.771Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007641c0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:22:25.573Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000390a80/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:22:55.437Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00090c540/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:23:40.135Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d16c0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:24:48.211Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000390a80/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: re-configuration failed due to not enough started members
To see the stack trace of this error execute with --v=5 or higher

The same kubeadm join run also logged these preflight warnings (before the ImagePull error above):

W0303 09:21:18.161170   27653 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
	[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
	[WARNING FileExisting-conntrack]: conntrack not found in system path
	[WARNING FileExisting-socat]: socat not found in system path
	[WARNING Port-10250]: Port 10250 is in use
W0303 09:21:18.893885   27653 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" is forbidden: User "system:bootstrap:t124ac" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
	[WARNING Port-6444]: Port 6444 is in use
	[WARNING Port-10259]: Port 10259 is in use
	[WARNING Port-10257]: Port 10257 is in use
	[WARNING FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
	[WARNING FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
	[WARNING FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists

The stdout of the same run:

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[patches] Reading patches from path "/etc/kubernetes/patches"
[patches] Found the following patch files: [etcd2rootless+strategic.yaml kube-apiserver2rootless+strategic.yaml kube-controller-manager2rootless+strategic.yaml kube-scheduler2rootless+strategic.yaml]
[patches] Applied patch of type "application/strategic-merge-patch+json" to target "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[patches] Applied patch of type "application/strategic-merge-patch+json" to target "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[patches] Applied patch of type "application/strategic-merge-patch+json" to target "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.

According to kubernetes/kubeadm#1846 (comment), the control plane nodes need to be provisioned serially because of etcd issues. Can anyone help confirm that this is the right resolution? If so, we would need to change the lock logic.

Environment:

  • Cluster-api version: 0.3.6
  • minikube/kind version: v0.17.0
  • Kubernetes version: (use kubectl version): v1.26.0
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 7, 2023
@sbueringer
Member

I think you should look into this:

[WARNING ImagePull]: failed to pull image registry.io/pause:3.8: output: E0303 09:21:22.675900 27743 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.io/pause:3.8\": failed to resolve reference \"registry.io/pause:3.8\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized" image="registry.io/pause:3.8"
time="2023-03-03T09:21:22Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \"registry.io/pause:3.8\": failed to resolve reference \"registry.io/pause:3.8\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized"
, error: exit status 1

Just for my understanding: you are using your own control plane and bootstrap provider?

As far as I'm aware, you have to create the first control plane node, and after that you can create further control plane nodes in parallel (KCP should do it roughly like that).

@yongxiu
Contributor Author

yongxiu commented Mar 8, 2023

That's a red herring; whenever kubeadm join fails, it produces this error. I do have the same config.toml and credential setup as on the first control plane node.

Yes, I'm using my own control plane and bootstrap provider.

I know cluster-api guarantees it will create the first control plane node and then create the other control plane nodes in parallel, but I wonder whether this is the root cause, and whether we should always provision the control plane nodes serially. Based on kubernetes/kubeadm#1846 (comment), the control plane nodes need to be provisioned serially due to etcd issues, and that error matches exactly what I saw.

@sbueringer
Member

sbueringer commented Mar 8, 2023

I'm not sure; I'm not aware of any reports about that issue for KCP in the last few years.

So I'm very hesitant to change the KCP implementation.

Probably I was wrong and KCP actually joins nodes sequentially; someone would have to take a closer look at the KCP code.

@sbueringer
Member

I think it works by always running preflight checks before adding a machine:

func (r *KubeadmControlPlaneReconciler) scaleUpControlPlane(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane, controlPlane *internal.ControlPlane) (ctrl.Result, error) {
	logger := ctrl.LoggerFrom(ctx)

	// Run preflight checks to ensure that the control plane is stable before proceeding with a scale up/scale down operation; if not, wait.
	if result, err := r.preflightChecks(ctx, controlPlane); err != nil || !result.IsZero() {
		return result, err
	}

	// Create the bootstrap configuration
	bootstrapSpec := controlPlane.JoinControlPlaneConfig()
	fd := controlPlane.NextFailureDomainForScaleUp()
	if err := r.cloneConfigsAndGenerateMachine(ctx, cluster, kcp, bootstrapSpec, fd); err != nil {
		logger.Error(err, "Failed to create additional control plane Machine")
		r.recorder.Eventf(kcp, corev1.EventTypeWarning, "FailedScaleUp", "Failed to create additional control plane Machine for cluster %s/%s control plane: %v", cluster.Namespace, cluster.Name, err)
		return ctrl.Result{}, err
	}

	// Requeue the control plane, in case there are other operations to perform
	return ctrl.Result{Requeue: true}, nil
}

Essentially (see the sketch after this list):

  • Create initial machine
  • Wait until cluster is stable
  • Create 2nd machine
  • Wait until cluster is stable
  • Create 3rd machine
  • ...
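To make that loop concrete, here is a rough, runnable sketch (not the actual KCP reconciler; clusterIsStable and createControlPlaneMachine are hypothetical placeholders standing in for the preflight checks and the Machine creation):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// clusterIsStable stands in for "preflight/etcd health checks pass"; a real
// provider would query the workload cluster here.
func clusterIsStable(ctx context.Context) bool { return true }

// createControlPlaneMachine stands in for creating one control plane Machine.
func createControlPlaneMachine(ctx context.Context, n int) error {
	fmt.Printf("creating control plane machine %d\n", n)
	return nil
}

// scaleControlPlane creates control plane machines one at a time, waiting for
// the cluster to become stable between creations, mirroring the list above.
func scaleControlPlane(ctx context.Context, desired int) error {
	for n := 1; n <= desired; n++ {
		if n > 1 {
			// Block the next join until the previous member is healthy.
			for !clusterIsStable(ctx) {
				select {
				case <-ctx.Done():
					return errors.New("timed out waiting for a stable control plane")
				case <-time.After(5 * time.Second):
				}
			}
		}
		if err := createControlPlaneMachine(ctx, n); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := scaleControlPlane(ctx, 3); err != nil {
		fmt.Println("error:", err)
	}
}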

@yongxiu
Contributor Author

yongxiu commented Mar 8, 2023

Only the very first control plane machine has this lock:

if !r.KubeadmInitLock.Lock(ctx, scope.Cluster, machine) {

Once this machine is ready, then

conditions.MarkTrue(cluster, clusterv1.ControlPlaneInitializedCondition)

will mark the cluster with ControlPlaneInitializedCondition, and then all the other machines will provision at the same time.

The preflight checks also run at the same time; they have no knowledge of each other, so essentially they all pass the preflight check without knowing each other's provisioning status.
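A simplified, hypothetical sketch of the behaviour being described here (this is not the CABPK source; the lock and the condition are modelled with plain fields just to show why the joins can run in parallel once the condition is true):

package main

import (
	"fmt"
	"sync"
)

// fakeCluster is a toy stand-in for the Cluster object; controlPlaneInitialized
// mirrors the ControlPlaneInitializedCondition mentioned above.
type fakeCluster struct {
	mu                      sync.Mutex
	initLockHeld            bool
	controlPlaneInitialized bool
}

// tryAcquireInitLock mimics KubeadmInitLock.Lock: only one machine gets to run
// `kubeadm init`.
func (c *fakeCluster) tryAcquireInitLock() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.initLockHeld || c.controlPlaneInitialized {
		return false
	}
	c.initLockHeld = true
	return true
}

// reconcileMachine sketches the decision: the first machine runs `kubeadm init`
// and marks the control plane initialized; every later machine takes the join
// path, and nothing in this sketch serializes those joins.
func reconcileMachine(c *fakeCluster, name string) {
	c.mu.Lock()
	initialized := c.controlPlaneInitialized
	c.mu.Unlock()

	if !initialized {
		if c.tryAcquireInitLock() {
			fmt.Println(name, "-> kubeadm init")
			c.mu.Lock()
			c.controlPlaneInitialized = true // MarkTrue(ControlPlaneInitializedCondition)
			c.mu.Unlock()
			return
		}
		fmt.Println(name, "-> requeue, waiting for the first control plane machine")
		return
	}
	fmt.Println(name, "-> kubeadm join (can run in parallel with other joins)")
}

func main() {
	c := &fakeCluster{}
	for _, name := range []string{"cp-1", "cp-2", "cp-3"} {
		reconcileMachine(c, name)
	}
}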

@sbueringer
Member

@fabriziopandini WDYT?

@fabriziopandini
Member

/remove-kind bug
/kind support

In KCP we prefer stability over speed when dealing with the control plane machines, and thus we are joining control plane machines sequentially, one at a time, and waiting for the entire control plane to be stable before creating the next machine.
Those safeguards are implemented in the KCP scale-up code, and this takes precedence over the code highlighted above.

If you are building your own provider using kubeadm internally, you can follow KCP as a reference or engage the kubeadm community to discuss making kubeadm join work consistently (according to https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#steps-for-the-rest-of-the-control-plane-nodes, parallel join of control plane machines is supported, but if I look at kubernetes/kubeadm#1846 (comment) or #2050 (comment), it doesn't work consistently).

Note: I'm closing this issue because, from my understanding, it is due to a kubeadm issue, not a CAPI/KCP/CABPK problem, but feel free to continue the discussion here if you need more information about how KCP or CABPK works.
/close

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 20, 2023
@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/remove-kind bug
/kind support

/close


@yongxiu
Contributor Author

yongxiu commented Mar 20, 2023

and thus we are joining control plane machines sequentially, one at a time, and waiting for the entire control plane to be stable before creating the next machine.

Are you saying the current kubeadm controller is creating kubeadm secrets sequentially? I'm using the official kubeadm controller, not my own. Based on my observation, it will use the lock to create the first kubeadm secret, but the remaining control plane nodes' secrets are created in parallel, not sequentially:

Once conditions.IsTrue(cluster, clusterv1.ControlPlaneInitializedCondition) returns true, all the remaining control plane nodes will execute

if config.Spec.JoinConfiguration == nil {

@sbueringer
Member

sbueringer commented Mar 21, 2023

/reopen

I would really like to clarify whether KCP behaves correctly today. @fabriziopandini The question is whether KCP actually creates Machines sequentially (and how this is enforced).

As far as I can tell the following could happen:

  • Create 1 Machine
  • Create Machine 2 & 3 in parallel

See also

will mark the cluster with ControlPlaneInitializedCondition, and then all the other machines will provision at the same time.

The preflight checks also run at the same time; they have no knowledge of each other, so essentially they all pass the preflight check without knowing each other's provisioning status.
#8250 (comment)

@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen



@k8s-ci-robot k8s-ci-robot reopened this Mar 21, 2023
@sbueringer
Member

sbueringer commented Mar 21, 2023

@yongxiu I talked to Fabrizio and KCP actually creates Machines sequentially.

The lock in CABPK is only for the very first control plane machine.

KCP is then creating additional Machines sequentially:

  1. Run preflight checks
  2. Create Machine 2
  3. Run preflight checks
  4. Create Machine 3

The preflight checks in step 3 will fail until the Machine from step 2 is successfully up.

There is no way that (1+2) and (3+4) can run concurrently, as the KCP controller only has one worker goroutine per KCP object. So basically controller-runtime ensures that a specific KCP object is only ever reconciled by one worker at a time.
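For illustration, a minimal sketch of that gating effect, under the assumption that the preflight checks require every existing control plane machine to have a healthy node and etcd member (the real preflightChecks in KCP is more involved):

package main

import "fmt"

// machineStatus is a toy stand-in for what the preflight checks inspect:
// whether each existing control plane machine's node and etcd member are healthy.
type machineStatus struct {
	name        string
	nodeHealthy bool
	etcdHealthy bool
}

// preflightChecks blocks scale-up while any existing control plane machine is
// still coming up, which is what forces Machine 3 to wait for Machine 2.
func preflightChecks(machines []machineStatus) error {
	for _, m := range machines {
		if !m.nodeHealthy || !m.etcdHealthy {
			return fmt.Errorf("machine %s is not healthy yet, requeue scale-up", m.name)
		}
	}
	return nil
}

func main() {
	// Machine 2 was just created and is not ready, so creating Machine 3 is deferred.
	machines := []machineStatus{
		{name: "cp-1", nodeHealthy: true, etcdHealthy: true},
		{name: "cp-2", nodeHealthy: false, etcdHealthy: false},
	}
	if err := preflightChecks(machines); err != nil {
		fmt.Println("scale-up blocked:", err)
		return
	}
	fmt.Println("preflight passed, create the next Machine")
}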

Can you please clarify which providers you are using? I'm a bit confused, as the following statements seem to contradict each other:

Yes, I'm using own control plane and bootstrap provider.

I'm using official kubeadm controller not my own

@sbueringer sbueringer added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 21, 2023
@yongxiu
Contributor Author

yongxiu commented Mar 22, 2023

I see, thanks. This is new logic added after 0.3.6; I was using an old version. Good to know cluster-api already provisions control plane nodes sequentially.

@yongxiu yongxiu closed this as completed Mar 22, 2023