
Clusters with v1.27.2 seem to have issues #8764

Closed
chrischdi opened this issue May 30, 2023 · 15 comments
Assignees
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@chrischdi
Member

Which jobs are failing?

Which tests are failing?

capi-e2e: [It] When upgrading a workload cluster using ClusterClass and testing K8S conformance [Conformance] [K8s-Upgrade] [ClusterClass] Should create and upgrade a workload cluster and eventually run kubetest

Since when has it been failing?

26th May

Testgrid link

The capi-e2e-main-1-26-1-27 dashboard (https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-26-1-27) has been red since the 26th.

Reason for failure (if possible)

OCI runtime exec failed: exec failed: unable to start container process: error adding pid 11410 to cgroups: failed to write 11410: openat2 /sys/fs/cgroup/unified/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podf50282df_4a85_455c_be12_52dcf5261287.slice/cri-containerd-6d64260e1fe52c46cceb265b6a5367e25039bfa6e39dd908fa612e993a2da01f.scope/cgroup.procs: no such file or directory: unknown
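
As a quick sanity check (a sketch only; the node-container name is a placeholder), one can look at which cgroup layout the node container actually sees:

docker exec <node-container> stat -fc %T /sys/fs/cgroup    # 'cgroup2fs' means cgroup v2, 'tmpfs' means cgroup v1
docker exec <node-container> ls /sys/fs/cgroup/unified     # the path from the error above; only exists on a v1 hybrid layout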

This seems to be related to this change in the v1.27.2 kind image:

Anything else we need to know?

No response

Label(s) to be applied

/kind failing-test

@k8s-ci-robot k8s-ci-robot added kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 30, 2023
@chrischdi
Member Author

Note: @kubernetes-sigs/cluster-api-release-team I think this should be release blocking!

@killianmuldoon
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 30, 2023
@sbueringer
Member

sbueringer commented May 30, 2023

Not sure if this should be release blocking, assuming that the already released CAPI versions have the same issue with 1.27.2.

Or did we do something in Cluster API that introduced this issue?

But to assess that, I guess we have to find out more details.

@chrischdi
Member Author

chrischdi commented May 30, 2023

We can reproduce this locally. What I did (a rough command sketch follows the list):

  • Create an Ubuntu 22.04 VM on AWS
  • Install docker, go, kubectl, tilt, ...
  • Clone the CAPI repo
  • Run tilt up
  • Create a cluster using the quickstart templates
  • Try a kubectl exec to a control-plane pod.
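
Roughly, the commands look like this (a sketch only; the flavor, version, and kubeconfig path are assumptions rather than values copied from the report):

# on the Ubuntu 22.04 VM, inside the cluster-api checkout
tilt up

# quickstart-style workload cluster (flavor/version assumed)
clusterctl generate cluster development-3774 \
  --flavor development \
  --kubernetes-version v1.27.2 | kubectl apply -f -

# fetch the workload cluster kubeconfig, then inspect the cluster
clusterctl get kubeconfig development-3774 > /tmp/development-3774.kubeconfig
export KUBECONFIG=/tmp/development-3774.kubeconfig
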
ubuntu@ip-172-31-10-126:~/go/src/sigs.k8s.io/cluster-api$ kubectl get po -n kube-system
NAME                                                   READY   STATUS    RESTARTS   AGE
coredns-5d78c9869d-pdhwl                               0/1     Pending   0          20m
coredns-5d78c9869d-tgft5                               0/1     Pending   0          20m
etcd-development-3774-rbk6v-4hltm                      1/1     Running   0          20m
kube-apiserver-development-3774-rbk6v-4hltm            1/1     Running   0          20m
kube-controller-manager-development-3774-rbk6v-4hltm   1/1     Running   0          20m
kube-proxy-c9zzl                                       1/1     Running   0          19m
kube-proxy-mkqw4                                       1/1     Running   0          20m
kube-scheduler-development-3774-rbk6v-4hltm            1/1     Running   0          20m

ubuntu@ip-172-31-10-126:~/go/src/sigs.k8s.io/cluster-api$ kubectl exec -ti -n kube-system etcd-development-3774-rbk6v-4hltm -- ls
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "6912386191fc9dc2ecb591286d2ff1af65c96b48af34697f640e585641d0d54c": OCI runtime exec failed: exec failed: unable to start container process: exec: "ls": executable file not found in $PATH: unknown

Edit: the above error is ok

@chrischdi
Member Author

Note: creating the cluster works, but things like kubectl exec don't.

Note to self: check if kubectl exec works with kind + v1.27.2.

@killianmuldoon
Contributor

Is the exec not just an issue with the binary not being available in that pod?

@killianmuldoon
Contributor

e.g. something like this works for me (on a kind cluster)

kubectl exec -n kube-system kube-apiserver-capi-test-control-plane kube-apiserver

@chrischdi
Member Author

Is the exec not just an issue with the binary not being available in that pod?

Uh oh, yes 🤦 I thought it was the other line we got from the tests...

@killianmuldoon
Contributor

kubernetes/test-infra#29654 is designed to help debug this. It makes the following changes, which will have to be rolled back once we have a solution.

  1. The 1.26-1.27 upgrade job for branch release-1.4 is now pinned to upgrade to 1.27.1. This should unblock the release on that branch (a sketch of what pinning the target looks like follows the list).
  2. Added two experimental jobs, one running the test-infra kind image and a second running the kubekins image. This is to help debug the underlying issues with the test and get to the root cause of the conformance failures.
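
For a local run, pinning the upgrade target works roughly like this (a hypothetical sketch; the CAPI e2e config reads these variables from the environment, and the exact versions and focus string are assumptions):

# pin the upgrade target to v1.27.1 instead of v1.27.2
export KUBERNETES_VERSION_UPGRADE_FROM=v1.26.4
export KUBERNETES_VERSION_UPGRADE_TO=v1.27.1
make test-e2e GINKGO_FOCUS="K8s-Upgrade"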

@killianmuldoon
Contributor

/assign

@killianmuldoon
Contributor

killianmuldoon commented Jun 2, 2023

I've managed to find a solution to this in #8774.

Basically CAPD needs to copy the config used by kind in both kubernetes-sigs/kind#3241 and kubernetes-sigs/kind#3255. I've also included changes from kubernetes-sigs/kind#3240 in the current version of this PR, but they may not be necessary for the fix.

The issue we have now, and will increasingly have in the future, is that new and old images require different configs. In this PR I've got a clumsy string check for v1.27.2 to apply the private cgroupns-mode config, but that's obviously not a long-term solution.
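
For illustration, the difference boils down to which cgroup namespace mode the node container is started with. A stripped-down sketch (not the actual kind/CAPD invocation, which also mounts volumes and sets several other flags; the container name is a placeholder):

# new-style images such as the rebuilt v1.27.2 expect a private cgroup namespace
docker run -d --name capd-test-node --privileged --cgroupns=private kindest/node:v1.27.2

# check which mode an existing node container was started with
docker inspect -f '{{.HostConfig.CgroupnsMode}}' capd-test-node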

The options we have are to either:

  1. Find a way to introspect a given kind image to decide what configuration to pass to it,
    or
  2. Move to the new config once kind v0.20.0 is installed and ignore images like the current v1.27.2 for now, e.g. by enforcing a build in the upgrade tests. Once we move to kind v0.20.0 and its new build process, CAPD will no longer work with older images.

I'll try to see if there's a way to check whether an image should have private cgroupns set or not, to make both older and newer images compatible with CAPD.
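
One hypothetical direction for that: inspect the image metadata (creation date or labels) and use it to pick the config; whether kindest/node images actually carry a usable marker for this is exactly the open question here.

# look at when the image was built and what labels it carries (illustrative only)
docker image inspect -f '{{.Created}} {{json .Config.Labels}}' kindest/node:v1.27.2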

@sbueringer
Member

I think the second option is fine. It's a dev provider.

@killianmuldoon
Contributor

I think we can close this issue with kubernetes/test-infra#29682 and follow up with the rest of the work on #8788

Feel free to reopen if there's still something to be done specific to this issue.

@killianmuldoon
Contributor

/close

@k8s-ci-robot
Contributor

@killianmuldoon: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
