
Test capi-e2e-main-1-28-latest is failing #9633

Closed
killianmuldoon opened this issue Oct 30, 2023 · 7 comments · Fixed by #9682
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

killianmuldoon (Contributor) commented Oct 30, 2023

Which jobs are failing?

capi-e2e-main-1-28-latest: FAILING

Which tests are failing?

capi-e2e-main-1-28-latest: FAILING

Since when has it been failing?

Since October 27th

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-28-latest

Reason for failure (if possible)

No response

Anything else we need to know?

Upstream tests are also failing, so this seems directly related to those failures.

Ref: https://testgrid.k8s.io/sig-release-master-informing#periodic-conformance-main-k8s-main

This is the issue tracking this test failure upstream: kubernetes/kubernetes#121617

Label(s) to be applied

/kind failing-test

k8s-ci-robot added the kind/failing-test and needs-triage labels on Oct 30, 2023.
killianmuldoon (Contributor, Author) commented:

Note: This failing test shouldn't necessarily be release-blocking, as the v1.6.0 release probably won't support Kubernetes v1.29.

/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Oct 30, 2023.
neolit123 (Member) commented Oct 30, 2023

@killianmuldoon what are the kubeadm failure logs exactly? I tried finding them in the artifacts.

EDIT:

https://testgrid.k8s.io/sig-release-master-informing#periodic-conformance-main-k8s-main

Here is the issue that tracks the mentioned job failures:
kubernetes/kubernetes#121617

This is the issue tracking this test failure upstream: kubernetes/kubernetes#121587

Is it the same problem?

killianmuldoon (Contributor, Author) commented:

The kubeadm failure logs are unfortunately buried in CAPD. This is the actual output of a kubeadm join failure:

```
stdout:
Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [k8s-upgrade-and-conformance-f4fr2s-f66bt-c7qp2 localhost] and IPs [172.18.0.9 127.0.0.1 ::1]
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [k8s-upgrade-and-conformance-f4fr2s-f66bt-c7qp2 localhost] and IPs [172.18.0.9 127.0.0.1 ::1]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [host.docker.internal k8s-upgrade-and-conformance-f4fr2s-f66bt-c7qp2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.128.0.1 172.18.0.9 172.18.0.3 :: ::1 127.0.0.1 0.0.0.0]
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy

stderr:
	[WARNING FileExisting-socat]: socat not found in system path
	[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.15.0-1036-gke", err: exit status 1
W1030 09:14:02.128714 436 checks.go:835] detected that the sandbox image "registry.k8s.io/pause:3.7" of the container runtime is inconsistent with that used by kubeadm. It is recommended that using "registry.k8s.io/pause:3.9" as the CRI sandbox image.
error execution phase check-etcd: could not retrieve the list of etcd endpoints: pods is forbidden: User "kubernetes-admin" cannot list resource "pods" in API group "" in the namespace "kube-system"
To see the stack trace of this error execute with --v=5 or higher
```

I was mainly looking at the upstream issue because of the coincidence in timing; the 1.28 -> main conformance job has not been flaky or failing recently. It's definitely possible that there's a different cause for this failure.
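
As a hedged triage sketch (these commands are not from the thread), the permission failure can be checked directly on the affected control plane node using the kubeconfig kubeadm generated; the path is the kubeadm default, not a value taken from the job logs:

```sh
# Hedged triage sketch; assumes the default kubeadm kubeconfig location.
export KUBECONFIG=/etc/kubernetes/admin.conf

# Does the admin.conf user have the permission the check-etcd phase needs?
kubectl auth can-i list pods -n kube-system

# Is the ClusterRoleBinding that kubeadm v1.29 relies on (discussed further
# down in the thread) present on this cluster?
kubectl get clusterrolebinding kubeadm:cluster-admins
```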

killianmuldoon (Contributor, Author) commented:

The most recent test ran on v1.29.0-alpha.2.787+f5a5d83d7c027a, which includes the fix for the kubeadm issue I linked above. The test failed, so it looks like there may be a different underlying issue, possibly in Cluster API itself.

Will need to take a closer look into this.

neolit123 (Member) commented Oct 31, 2023

```
error execution phase check-etcd: could not retrieve the list of etcd endpoints: pods is forbidden: User "kubernetes-admin" cannot list resource "pods" in API group "" in the namespace "kube-system"
To see the stack trace of this error execute with --v=5 or higher
```

This definitely seems related to the change in kubernetes/kubernetes#121305,
but that was a different failure (strictly inside kubeadm init) and it was resolved:
kubernetes/kubernetes#121587

Indirectly, this CAPI test job's log says that the ClusterRoleBinding called "kubeadm:cluster-admins" was not created during kubeadm init.

  • Is there anything specific about this CAPI job capi-e2e-main-1-28-latest - e.g. does it skip phases of kubeadm init?
  • Does it by any chance try to join a kubeadm 1.29.pre managed control plane node to a cluster created with kubeadm init 1.28.x? If yes, the CRB will indeed be missing (a reproduction sketch follows below).

EDIT: the kubeadm e2e test for this new feature is WIP, but note that we don't have tests that join kubeadm X to kubeadm Y, where X != Y, because that is not a supported skew:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#kubeadm-s-skew-against-kubeadm
We do have tests that skew the control plane and kubelet versions, but the kubeadm version remains the same.
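
Outside the thread, here is a minimal reproduction sketch of the skew described in the second bullet above, assuming a plain two-node kubeadm setup; the versions, endpoint, and placeholders are illustrative and not taken from the failing job:

```sh
# Hedged reproduction sketch; versions and placeholders are assumptions.
# 1) Create the cluster with a kubeadm v1.28 binary. This does NOT create the
#    "kubeadm:cluster-admins" ClusterRoleBinding.
kubeadm init --kubernetes-version v1.28.3 --upload-certs

# 2) Join a second control plane node using a kubeadm v1.29 pre-release binary.
#    The admin.conf it generates belongs to the "kubeadm:cluster-admins" group,
#    which has no binding on this cluster, so the check-etcd phase fails with
#    the "pods is forbidden" error captured in the CAPD logs above.
kubeadm join <control-plane-endpoint>:6443 --control-plane \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --certificate-key <key>
```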

sbueringer (Member) commented Oct 31, 2023

Yeah, this test creates a cluster with Kubernetes 1.28 and then upgrades it to Kubernetes 1.29 by joining new 1.29 nodes.

Sounds like we have to implement this sub-task to fix the failure:


#9578
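
For reference, a hedged sketch of the RBAC object a kubeadm v1.29 join expects but a v1.28-initialised cluster lacks; this is only an illustration of the missing step, not necessarily the exact change implemented in #9682:

```sh
# Hedged sketch of the missing object, assuming the kubeadm v1.29 behaviour
# introduced via kubernetes/kubernetes#121305: admin.conf is issued for the
# "kubeadm:cluster-admins" group instead of "system:masters", and kubeadm init
# v1.29 creates a ClusterRoleBinding for that group. On a cluster created with
# kubeadm v1.28 that binding is missing; it is equivalent to:
kubectl create clusterrolebinding kubeadm:cluster-admins \
  --clusterrole=cluster-admin \
  --group=kubeadm:cluster-admins
```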

neolit123 (Member) commented:

> Sounds like we have to implement this sub-task to fix the failure:

Yep, looks like the missing upgrade step is on the CAPI side.
