
Provision 3 control plane nodes using kubeadm got etcd errors #8250

Closed
yongxiu opened this issue Mar 7, 2023 · 14 comments
Labels
kind/support Categorizes issue or PR as a support question. triage/needs-information Indicates an issue needs more information in order to work on it.

Comments

@yongxiu
Contributor

yongxiu commented Mar 7, 2023

What steps did you take and what happened:

I use Cluster API with my own Cluster API provider to provision the cluster; my provider uses kubeadm to initialize the control plane nodes.

What did you expect to happen:

It should provision the cluster successfully

Anything else you would like to add:

Provisioning is flaky: it sometimes fails, and when it does, the 2nd control plane node reports the errors below:

	[WARNING ImagePull]: failed to pull image registry.io/pause:3.8: output: E0303 09:21:22.675900   27743 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \\"registry.io/pause:3.8\\": failed to resolve reference \\"registry.io/pause:3.8\\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized" image="registry.io/pause:3.8"
time="2023-03-03T09:21:22Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \\"registry.io/pause:3.8\\": failed to resolve reference \\"registry.io/pause:3.8\\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized"
, error: exit status 1
{"level":"warn","ts":"2023-03-03T09:21:24.488Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764000/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:24.616Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000391500/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:24.783Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0fc0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:25.044Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0c40/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:25.404Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0e00/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:25.941Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764540/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:26.758Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d1500/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:28.005Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764700/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:29.767Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0c40/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:32.445Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d0e00/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:36.428Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0001968c0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:42.483Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00090c000/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:21:51.399Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000764000/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:22:04.771Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007641c0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:22:25.573Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000390a80/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:22:55.437Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00090c540/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:23:40.135Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0005d16c0/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
{"level":"warn","ts":"2023-03-03T09:24:48.211Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000390a80/10.200.1.63:2379","attempt":0,"error":"rpc error: code = Unknown desc = etcdserver: re-configuration failed due to not enough started members"}
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: etcdserver: re-configuration failed due to not enough started members
To see the stack trace of this error execute with --v=5 or higher

The same kubeadm join run also logged these preflight warnings (before the ImagePull error above):

W0303 09:21:18.161170   27653 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
	[WARNING FileAvailable--etc-kubernetes-kubelet.conf]: /etc/kubernetes/kubelet.conf already exists
	[WARNING FileExisting-conntrack]: conntrack not found in system path
	[WARNING FileExisting-socat]: socat not found in system path
	[WARNING Port-10250]: Port 10250 is in use
W0303 09:21:18.893885   27653 configset.go:78] Warning: No kubeproxy.config.k8s.io/v1alpha1 config is loaded. Continuing without it: configmaps "kube-proxy" is forbidden: User "system:bootstrap:t124ac" cannot get resource "configmaps" in API group "" in the namespace "kube-system"
	[WARNING Port-6444]: Port 6444 is in use
	[WARNING Port-10259]: Port 10259 is in use
	[WARNING Port-10257]: Port 10257 is in use
	[WARNING FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml]: /etc/kubernetes/manifests/kube-apiserver.yaml already exists
	[WARNING FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml]: /etc/kubernetes/manifests/kube-controller-manager.yaml already exists
	[WARNING FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml]: /etc/kubernetes/manifests/kube-scheduler.yaml already exists

The stdout of the same run:

[preflight] Running pre-flight checks
[preflight] Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[patches] Reading patches from path "/etc/kubernetes/patches"
[patches] Found the following patch files: [etcd2rootless+strategic.yaml kube-apiserver2rootless+strategic.yaml kube-controller-manager2rootless+strategic.yaml kube-scheduler2rootless+strategic.yaml]
[patches] Applied patch of type "application/strategic-merge-patch+json" to target "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[patches] Applied patch of type "application/strategic-merge-patch+json" to target "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[patches] Applied patch of type "application/strategic-merge-patch+json" to target "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy
[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
[kubelet-start] Starting the kubelet
[kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
[kubelet-check] Initial timeout of 40s passed.

According to kubernetes/kubeadm#1846 (comment), the control plane nodes need to be provisioned serially because of etcd issues. Can anyone help confirm that this is the right resolution? If so, we would need to change the lock logic.

Environment:

  • Cluster-api version: 0.3.6
  • minikube/kind version: v0.17.0
  • Kubernetes version: (use kubectl version): v1.26.0
  • OS (e.g. from /etc/os-release): Ubuntu 20.04

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 7, 2023
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Mar 7, 2023
@sbueringer
Member

I think you should look into this:

[WARNING ImagePull]: failed to pull image registry.io/pause:3.8: output: E0303 09:21:22.675900 27743 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Unknown desc = failed to pull and unpack image \"registry.io/pause:3.8\": failed to resolve reference \"registry.io/pause:3.8\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized" image="registry.io/pause:3.8"
time="2023-03-03T09:21:22Z" level=fatal msg="pulling image: rpc error: code = Unknown desc = failed to pull and unpack image \"registry.io/pause:3.8\": failed to resolve reference \"registry.io/pause:3.8\": pulling from host 10.200.8.2:10443 failed with status code [manifests 3.8]: 401 Unauthorized"
, error: exit status 1

Just for my understanding: you are using your own control plane and bootstrap provider?

As far as I'm aware, you have to create the first control plane node, and after that you can create further control plane nodes in parallel (KCP should do it roughly like that).

@yongxiu
Contributor Author

yongxiu commented Mar 8, 2023

That's a red herring; whenever kubeadm join fails, it produces this error. I do have the same config.toml and credential setup as on the first control plane node.

Yes, I'm using my own control plane and bootstrap provider.

I know cluster-api guarantees it will create the first control plane node and then create the other control plane nodes in parallel, but I wonder whether this is the root cause, and whether we should always provision the control plane nodes serially. Based on kubernetes/kubeadm#1846 (comment), the control plane nodes need to be provisioned serially due to etcd issues, and that error matches exactly what I saw.

@sbueringer
Member

sbueringer commented Mar 8, 2023

I'm not sure; I'm not aware of any reports about that issue for KCP in the last few years.

So I'm very hesitant to change the KCP implementation.

Probably I was wrong and KCP actually joins nodes sequentially; someone would have to take a closer look at the KCP code.

@sbueringer
Member

I think it works by always running preflight checks before adding a machine:

func (r *KubeadmControlPlaneReconciler) scaleUpControlPlane(ctx context.Context, cluster *clusterv1.Cluster, kcp *controlplanev1.KubeadmControlPlane, controlPlane *internal.ControlPlane) (ctrl.Result, error) {
	logger := ctrl.LoggerFrom(ctx)

	// Run preflight checks to ensure that the control plane is stable before proceeding with a scale up/scale down operation; if not, wait.
	if result, err := r.preflightChecks(ctx, controlPlane); err != nil || !result.IsZero() {
		return result, err
	}

	// Create the bootstrap configuration
	bootstrapSpec := controlPlane.JoinControlPlaneConfig()
	fd := controlPlane.NextFailureDomainForScaleUp()
	if err := r.cloneConfigsAndGenerateMachine(ctx, cluster, kcp, bootstrapSpec, fd); err != nil {
		logger.Error(err, "Failed to create additional control plane Machine")
		r.recorder.Eventf(kcp, corev1.EventTypeWarning, "FailedScaleUp", "Failed to create additional control plane Machine for cluster %s/%s control plane: %v", cluster.Namespace, cluster.Name, err)
		return ctrl.Result{}, err
	}

	// Requeue the control plane, in case there are other operations to perform
	return ctrl.Result{Requeue: true}, nil
}

Essentially (see the sketch after this list):

  • Create initial machine
  • Wait until cluster is stable
  • Create 2nd machine
  • Wait until cluster is stable
  • Create 3rd machine
  • ...
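To make that loop concrete, here is a rough, runnable sketch (not the actual KCP reconciler; clusterIsStable and createControlPlaneMachine are hypothetical placeholders standing in for the preflight checks and the Machine creation):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// clusterIsStable stands in for "preflight/etcd health checks pass"; a real
// provider would query the workload cluster here.
func clusterIsStable(ctx context.Context) bool { return true }

// createControlPlaneMachine stands in for creating one control plane Machine.
func createControlPlaneMachine(ctx context.Context, n int) error {
	fmt.Printf("creating control plane machine %d\n", n)
	return nil
}

// scaleControlPlane creates control plane machines one at a time, waiting for
// the cluster to become stable between creations, mirroring the list above.
func scaleControlPlane(ctx context.Context, desired int) error {
	for n := 1; n <= desired; n++ {
		if n > 1 {
			// Block the next join until the previous member is healthy.
			for !clusterIsStable(ctx) {
				select {
				case <-ctx.Done():
					return errors.New("timed out waiting for a stable control plane")
				case <-time.After(5 * time.Second):
				}
			}
		}
		if err := createControlPlaneMachine(ctx, n); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	if err := scaleControlPlane(ctx, 3); err != nil {
		fmt.Println("error:", err)
	}
}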

@yongxiu
Contributor Author

yongxiu commented Mar 8, 2023

Only the very first control plane machine has this lock:

if !r.KubeadmInitLock.Lock(ctx, scope.Cluster, machine) {

Once this machine is ready, then

conditions.MarkTrue(cluster, clusterv1.ControlPlaneInitializedCondition)

will mark the cluster with ControlPlaneInitializedCondition, and then all the other machines will provision at the same time.

The preflight checks also run at the same time; they have no knowledge of each other, so essentially they all pass the preflight check without knowing each other's provisioning status.
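A simplified, hypothetical sketch of the behaviour being described here (this is not the CABPK source; the lock and the condition are modelled with plain fields just to show why the joins can run in parallel once the condition is true):

package main

import (
	"fmt"
	"sync"
)

// fakeCluster is a toy stand-in for the Cluster object; controlPlaneInitialized
// mirrors the ControlPlaneInitializedCondition mentioned above.
type fakeCluster struct {
	mu                      sync.Mutex
	initLockHeld            bool
	controlPlaneInitialized bool
}

// tryAcquireInitLock mimics KubeadmInitLock.Lock: only one machine gets to run
// `kubeadm init`.
func (c *fakeCluster) tryAcquireInitLock() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.initLockHeld || c.controlPlaneInitialized {
		return false
	}
	c.initLockHeld = true
	return true
}

// reconcileMachine sketches the decision: the first machine runs `kubeadm init`
// and marks the control plane initialized; every later machine takes the join
// path, and nothing in this sketch serializes those joins.
func reconcileMachine(c *fakeCluster, name string) {
	c.mu.Lock()
	initialized := c.controlPlaneInitialized
	c.mu.Unlock()

	if !initialized {
		if c.tryAcquireInitLock() {
			fmt.Println(name, "-> kubeadm init")
			c.mu.Lock()
			c.controlPlaneInitialized = true // MarkTrue(ControlPlaneInitializedCondition)
			c.mu.Unlock()
			return
		}
		fmt.Println(name, "-> requeue, waiting for the first control plane machine")
		return
	}
	fmt.Println(name, "-> kubeadm join (can run in parallel with other joins)")
}

func main() {
	c := &fakeCluster{}
	for _, name := range []string{"cp-1", "cp-2", "cp-3"} {
		reconcileMachine(c, name)
	}
}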

@sbueringer
Member

@fabriziopandini WDYT?

@fabriziopandini
Member

/remove-kind bug
/kind support

In KCP we prefer stability over speed when dealing with the control plane machines, and thus we are joining control plane machines sequentially, one at a time, and waiting for the entire control plane to be stable before creating the next machine.
Those safeguards are implemented in the KCP scale-up code, and this takes precedence over the code highlighted above.

If you are building your own provider using kubeadm internally, you can follow KCP as a reference or engage the kubeadm community to discuss making kubeadm join work consistently (according to https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/high-availability/#steps-for-the-rest-of-the-control-plane-nodes, parallel join of control plane machines is supported, but if I look at kubernetes/kubeadm#1846 (comment) or #2050 (comment), it doesn't work consistently).

Note: I'm closing this issue because, from my understanding, it is due to a kubeadm issue, not a CAPI/KCP/CABPK problem, but feel free to continue the discussion here if you need more information about how KCP or CABPK works.
/close

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed kind/bug Categorizes issue or PR as related to a bug. labels Mar 20, 2023
@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/remove-kind bug
/kind support

/close


@yongxiu
Contributor Author

yongxiu commented Mar 20, 2023

and thus we are joining control plane machines sequentially, one at a time, and waiting for the entire control plane to be stable before creating the next machine.

Are you saying the current kubeadm controller is creating kubeadm secrets sequentially? I'm using the official kubeadm controller, not my own. Based on my observation, it will use the lock to create the first kubeadm secret, but the remaining control plane nodes' secrets are created in parallel, not sequentially:

Once conditions.IsTrue(cluster, clusterv1.ControlPlaneInitializedCondition) returns true, all the remaining control plane nodes will execute

if config.Spec.JoinConfiguration == nil {

@sbueringer
Member

sbueringer commented Mar 21, 2023

/reopen

I would really like to clarify whether KCP behaves correctly today. @fabriziopandini The question is whether KCP actually creates Machines sequentially (and how this is enforced).

As far as I can tell the following could happen:

  • Create 1 Machine
  • Create Machine 2 & 3 in parallel

See also

will mark the cluster with ControlPlaneInitializedCondition, and then all the other machines will provision at the same time.

The preflight checks also run at the same time; they have no knowledge of each other, so essentially they all pass the preflight check without knowing each other's provisioning status.
#8250 (comment)

@k8s-ci-robot
Contributor

@sbueringer: Reopened this issue.

In response to this:

/reopen



@k8s-ci-robot k8s-ci-robot reopened this Mar 21, 2023
@sbueringer
Member

sbueringer commented Mar 21, 2023

@yongxiu I talked to Fabrizio and KCP actually creates Machines sequentially.

The lock in CABPK is only for the very first control plane machine.

KCP is then creating additional Machines sequentially:

  1. Run preflight checks
  2. Create Machine 2
  3. Run preflight checks
  4. Create Machine 3

The preflight checks in step 3 will fail until the Machine from step 2 is successfully up.

There is no way that (1+2) and (3+4) can run concurrently, as the KCP controller only has one worker goroutine per KCP object. So basically controller-runtime ensures that a specific KCP object is only ever reconciled by one worker at a time.
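For illustration, a minimal sketch of that gating effect, under the assumption that the preflight checks require every existing control plane machine to have a healthy node and etcd member (the real preflightChecks in KCP is more involved):

package main

import "fmt"

// machineStatus is a toy stand-in for what the preflight checks inspect:
// whether each existing control plane machine's node and etcd member are healthy.
type machineStatus struct {
	name        string
	nodeHealthy bool
	etcdHealthy bool
}

// preflightChecks blocks scale-up while any existing control plane machine is
// still coming up, which is what forces Machine 3 to wait for Machine 2.
func preflightChecks(machines []machineStatus) error {
	for _, m := range machines {
		if !m.nodeHealthy || !m.etcdHealthy {
			return fmt.Errorf("machine %s is not healthy yet, requeue scale-up", m.name)
		}
	}
	return nil
}

func main() {
	// Machine 2 was just created and is not ready, so creating Machine 3 is deferred.
	machines := []machineStatus{
		{name: "cp-1", nodeHealthy: true, etcdHealthy: true},
		{name: "cp-2", nodeHealthy: false, etcdHealthy: false},
	}
	if err := preflightChecks(machines); err != nil {
		fmt.Println("scale-up blocked:", err)
		return
	}
	fmt.Println("preflight passed, create the next Machine")
}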

Can you please clarify which providers you are using? I'm a bit confused, as the following statements seem to contradict each other:

Yes, I'm using own control plane and bootstrap provider.

I'm using official kubeadm controller not my own

@sbueringer sbueringer added triage/needs-information Indicates an issue needs more information in order to work on it. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 21, 2023
@yongxiu
Contributor Author

yongxiu commented Mar 22, 2023

I see, thanks. This is new logic added after 0.3.6; I was using an old version. Good to know cluster-api already provisions control plane nodes sequentially.

@yongxiu yongxiu closed this as completed Mar 22, 2023