
Clusters with v1.27.2 seem to have issues #8764

Closed
chrischdi opened this issue May 30, 2023 · 15 comments
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@chrischdi
Member

Which jobs are failing?

Which tests are failing?

capi-e2e: [It] When upgrading a workload cluster using ClusterClass and testing K8S conformance [Conformance] [K8s-Upgrade] [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest

Since when has it been failing?

26th May

Testgrid link

The capi-e2e-main-1-26-1-27 dashboard (https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-26-1-27) has been red since the 26th.

Reason for failure (if possible)

OCI runtime exec failed: exec failed: unable to start container process: error adding pid 11410 to cgroups: failed to write 11410: openat2 /sys/fs/cgroup/unified/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf50282df_4a85_455c_be12_52dcf5261287.slice/cri-containerd-6d64260e1fe52c46cceb265b6a5367e25039bfa6e39dd908fa612e993a2da01f.scope/cgroup.procs: no such file or directory: unknown
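
As a quick sanity check (a sketch only; the node-container name is a placeholder), one can look at which cgroup layout the node container actually sees:

docker exec <node-container> stat -fc %T /sys/fs/cgroup    # 'cgroup2fs' means cgroup v2, 'tmpfs' means cgroup v1
docker exec <node-container> ls /sys/fs/cgroup/unified     # the path from the error above; only exists on a v1 hybrid layout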

This seems to be related to this change in the v1.27.2 kind image:

Anything else we need to know?

No response

Label(s) to be applied

/kind failing-test

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 30, 2023
@chrischdi
Member Author

Note: @kubernetes-sigs/cluster-api-release-team I think this should be release blocking!

@killianmuldoon
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 30, 2023
@sbueringer
Member

sbueringer commented May 30, 2023

Not sure if this should be release blocking, assuming that the already released CAPI versions have the same issue with 1.27.2.

Or did we do something in Cluster API that introduced this issue?

But to assess that, I guess we have to find out more details.

@chrischdi
Member Author

chrischdi commented May 30, 2023

We can reproduce this locally. What I did (a rough command sketch follows the list):

  • Create an Ubuntu 22.04 VM on AWS
  • Install docker, go, kubectl, tilt, ...
  • Clone the CAPI repo
  • Run tilt up
  • Create a cluster using the quickstart templates
  • Try a kubectl exec to a control-plane pod.
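
Roughly, the commands look like this (a sketch only; the flavor, version, and kubeconfig path are assumptions rather than values copied from the report):

# on the Ubuntu 22.04 VM, inside the cluster-api checkout
tilt up

# quickstart-style workload cluster (flavor/version assumed)
clusterctl generate cluster development-3774 \
  --flavor development \
  --kubernetes-version v1.27.2 | kubectl apply -f -

# fetch the workload cluster kubeconfig, then inspect the cluster
clusterctl get kubeconfig development-3774 > /tmp/development-3774.kubeconfig
export KUBECONFIG=/tmp/development-3774.kubeconfig
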
ubuntu@ip-172-31-10-126:~/go/src/sigs.k8s.io/cluster-api$ kubectl get po -n kube-system
NAME                                                   READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-pdhwl                               0/1     Pending   0          20m
coredns-5d78c9869d-tgft5                               0/1     Pending   0          20m
etcd-development-3774-rbk6v-4hltm                      1/1     Running   0          20m
kube-apiserver-development-3774-rbk6v-4hltm            1/1     Running   0          20m
kube-controller-manager-development-3774-rbk6v-4hltm   1/1     Running   0          20m
kube-proxy-c9zzl                                       1/1     Running   0          19m
kube-proxy-mkqw4                                       1/1     Running   0          20m
kube-scheduler-development-3774-rbk6v-4hltm            1/1     Running   0          20m

ubuntu@ip-172-31-10-126:~/go/src/sigs.k8s.io/cluster-api$ kubectl exec -ti -n kube-system etcd-development-3774-rbk6v-4hltm -- ls
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "6912386191fc9dc2ecb591286d2ff1af65c96b48af34697f640e585641d0d54c": OCI runtime exec failed: exec failed: unable to start container process: exec: "ls": executable file not found in $PATH: unknown

Edit: the above error is ok

@chrischdi
Member Author

Note: creating the cluster works, but things like kubectl exec don't.

Note to self: check if kubectl exec works with kind + v1.27.2.

@killianmuldoon
Contributor

Is the exec not just an issue with the binary not being available in that pod?

@killianmuldoon
Contributor

e.g. something like this works for me (on a kind cluster)

kubectl exec -n kube-system kube-apiserver-capi-test-control-plane kube-apiserver

@chrischdi
Member Author

Is the exec not just an issue with the binary not being available in that pod?

Uh oh, yes 🤦 I thought it was the other line we got from the tests...

@killianmuldoon
Contributor

kubernetes/test-infra#29654 is designed to help debug this. It makes the following changes, which will have to be rolled back once we have a solution.

  1. The 1.26-1.27 upgrade job for branch release-1.4 is now pinned to upgrade to 1.27.1. This should unblock the release on that branch (a sketch of what pinning the target looks like follows the list).
  2. Added two experimental jobs, one running the test-infra kind image and a second running the kubekins image. This is to help debug the underlying issues with the test and get to the root cause of the conformance failures.
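
For a local run, pinning the upgrade target works roughly like this (a hypothetical sketch; the CAPI e2e config reads these variables from the environment, and the exact versions and focus string are assumptions):

# pin the upgrade target to v1.27.1 instead of v1.27.2
export KUBERNETES_VERSION_UPGRADE_FROM=v1.26.4
export KUBERNETES_VERSION_UPGRADE_TO=v1.27.1
make test-e2e GINKGO_FOCUS="K8s-Upgrade"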

@killianmuldoon
Contributor

/assign

@killianmuldoon
Contributor

killianmuldoon commented Jun 2, 2023

I've managed to find a solution to this in #8774.

Basically CAPD needs to copy the config used by kind in both kubernetes-sigs/kind#3241 and kubernetes-sigs/kind#3255. I've also included changes from kubernetes-sigs/kind#3240 in the current version of this PR, but they may not be necessary for the fix.

The issue we have now, and will increasingly have in the future, is that new and old images require different configs. In this PR I've got a clumsy string check for v1.27.2 to apply the private cgroupns-mode config, but that's obviously not a long-term solution.
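
For illustration, the difference boils down to which cgroup namespace mode the node container is started with. A stripped-down sketch (not the actual kind/CAPD invocation, which also mounts volumes and sets several other flags; the container name is a placeholder):

# new-style images such as the rebuilt v1.27.2 expect a private cgroup namespace
docker run -d --name capd-test-node --privileged --cgroupns=private kindest/node:v1.27.2

# check which mode an existing node container was started with
docker inspect -f '{{.HostConfig.CgroupnsMode}}' capd-test-node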

The options we have are to either:

  1. Find a way to introspect a given kind image to decide what configuration to pass to it,
    or
  2. Move to the new config once kind v0.20.0 is installed and ignore images like the current v1.27.2 for now, e.g. by enforcing a build in the upgrade tests. Once we move to kind v0.20.0 and its new build process, CAPD will no longer work with older images.

I'll try to see if there's a way to check whether an image should have private cgroupns set or not, to make both older and newer images compatible with CAPD.
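
One hypothetical direction for that: inspect the image metadata (creation date or labels) and use it to pick the config; whether kindest/node images actually carry a usable marker for this is exactly the open question here.

# look at when the image was built and what labels it carries (illustrative only)
docker image inspect -f '{{.Created}} {{json .Config.Labels}}' kindest/node:v1.27.2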

@sbueringer
Member

I think the second option is fine. It's a dev provider.

@killianmuldoon
Contributor

I think we can close this issue with kubernetes/test-infra#29682 and follow up with the rest of the work on #8788

Feel free to reopen if there's still something to be done specific to this issue.

@killianmuldoon
Contributor

/close

@k8s-ci-robot
Contributor

@killianmuldoon: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
