
Test capi-e2e-main-1-28-latest is failing #9633

Closed
killianmuldoon opened this issue Oct 30, 2023 · 7 comments · Fixed by #9682
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
triage/accepted: Indicates an issue or PR is ready to be actively worked on.

Comments

killianmuldoon (Contributor) commented Oct 30, 2023

Which jobs are failing?

capi-e2e-main-1-28-latest: FAILING

Which tests are failing?

capi-e2e-main-1-28-latest: FAILING

Since when has it been failing?

Since October 27th

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-28-latest

Reason for failure (if possible)

No response

Anything else we need to know?

Upstream tests are also failing, so this seems directly related to those failures.

Ref: https://testgrid.k8s.io/sig-release-master-informing#periodic-conformance-main-k8s-main

This is the issue tracking this test failure upstream: kubernetes/kubernetes#121617

Label(s) to be applied

/kind failing-test

k8s-ci-robot added the kind/failing-test and needs-triage labels on Oct 30, 2023.
killianmuldoon (Contributor, Author) commented:

Note: This failing test shouldn't necessarily be release-blocking, as the v1.6.0 release probably won't support Kubernetes v1.29.

/triage accepted

k8s-ci-robot added the triage/accepted label and removed the needs-triage label on Oct 30, 2023.
neolit123 (Member) commented Oct 30, 2023

@killianmuldoon what are the kubeadm failure logs exactly? I tried finding them in the artifacts.

EDIT:

https://testgrid.k8s.io/sig-release-master-informing#periodic-conformance-main-k8s-main

Here is the issue that tracks the mentioned job failures:
kubernetes/kubernetes#121617

This is the issue tracking this test failure upstream: kubernetes/kubernetes#121587

Is it the same problem?

killianmuldoon (Contributor, Author) commented:

The kubeadm failure logs are unfortunately buried in CAPD. This is the actual output of a kubeadm join failure:

```
stdout:
Reading configuration from the cluster...
[preflight] FYI: You can look at this config file with 'kubectl -n kube-system get cm kubeadm-config -o yaml'
[preflight] Running pre-flight checks before initializing the new control plane instance
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
[certs] Using certificateDir folder "/etc/kubernetes/pki"
[certs] Generating "etcd/healthcheck-client" certificate and key
[certs] Generating "apiserver-etcd-client" certificate and key
[certs] Generating "etcd/peer" certificate and key
[certs] etcd/peer serving cert is signed for DNS names [k8s-upgrade-and-conformance-f4fr2s-f66bt-c7qp2 localhost] and IPs [172.18.0.9 127.0.0.1 ::1]
[certs] Generating "etcd/server" certificate and key
[certs] etcd/server serving cert is signed for DNS names [k8s-upgrade-and-conformance-f4fr2s-f66bt-c7qp2 localhost] and IPs [172.18.0.9 127.0.0.1 ::1]
[certs] Generating "apiserver-kubelet-client" certificate and key
[certs] Generating "apiserver" certificate and key
[certs] apiserver serving cert is signed for DNS names [host.docker.internal k8s-upgrade-and-conformance-f4fr2s-f66bt-c7qp2 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local localhost] and IPs [10.128.0.1 172.18.0.9 172.18.0.3 :: ::1 127.0.0.1 0.0.0.0]
[certs] Generating "front-proxy-client" certificate and key
[certs] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[certs] Using the existing "sa" key
[kubeconfig] Generating kubeconfig files
[kubeconfig] Using kubeconfig folder "/etc/kubernetes"
[kubeconfig] Writing "admin.conf" kubeconfig file
[kubeconfig] Writing "controller-manager.conf" kubeconfig file
[kubeconfig] Writing "scheduler.conf" kubeconfig file
[control-plane] Using manifest folder "/etc/kubernetes/manifests"
[control-plane] Creating static Pod manifest for "kube-apiserver"
[control-plane] Creating static Pod manifest for "kube-controller-manager"
[control-plane] Creating static Pod manifest for "kube-scheduler"
[check-etcd] Checking that the etcd cluster is healthy

stderr:
	[WARNING FileExisting-socat]: socat not found in system path
	[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.15.0-1036-gke", err: exit status 1
W1030 09:14:02.128714 436 checks.go:835] detected that the sandbox image "registry.k8s.io/pause:3.7" of the container runtime is inconsistent with that used by kubeadm. It is recommended that using "registry.k8s.io/pause:3.9" as the CRI sandbox image.
error execution phase check-etcd: could not retrieve the list of etcd endpoints: pods is forbidden: User "kubernetes-admin" cannot list resource "pods" in API group "" in the namespace "kube-system"
To see the stack trace of this error execute with --v=5 or higher
```

I was mainly looking at the upstream issue because of the coincidence in timing; the 1.28 -> main conformance job has not been flaky or failing recently. It's definitely possible that there's a different cause for this failure.
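
As a hedged triage sketch (these commands are not from the thread), the permission failure can be checked directly on the affected control plane node using the kubeconfig kubeadm generated; the path is the kubeadm default, not a value taken from the job logs:

```sh
# Hedged triage sketch; assumes the default kubeadm kubeconfig location.
export KUBECONFIG=/etc/kubernetes/admin.conf

# Does the admin.conf user have the permission the check-etcd phase needs?
kubectl auth can-i list pods -n kube-system

# Is the ClusterRoleBinding that kubeadm v1.29 relies on (discussed further
# down in the thread) present on this cluster?
kubectl get clusterrolebinding kubeadm:cluster-admins
```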

killianmuldoon (Contributor, Author) commented:

The most recent test ran on v1.29.0-alpha.2.787+f5a5d83d7c027a, which includes the fix for the kubeadm issue I linked above. The test failed, so it looks like there may be a different underlying issue, possibly in Cluster API itself.

Will need to take a closer look into this.

neolit123 (Member) commented Oct 31, 2023

```
error execution phase check-etcd: could not retrieve the list of etcd endpoints: pods is forbidden: User "kubernetes-admin" cannot list resource "pods" in API group "" in the namespace "kube-system"
To see the stack trace of this error execute with --v=5 or higher
```

This definitely seems related to the change in kubernetes/kubernetes#121305,
but that was a different failure (strictly inside kubeadm init) and it was resolved:
kubernetes/kubernetes#121587

Indirectly, this CAPI test job's log says that the ClusterRoleBinding called "kubeadm:cluster-admins" was not created during kubeadm init.

  • Is there anything specific about this CAPI job capi-e2e-main-1-28-latest - e.g. does it skip phases of kubeadm init?
  • Does it by any chance try to join a kubeadm 1.29.pre managed control plane node to a cluster created with kubeadm init 1.28.x? If yes, the CRB will indeed be missing (a reproduction sketch follows below).

EDIT: the kubeadm e2e test for this new feature is WIP, but note that we don't have tests that join kubeadm X to kubeadm Y, where X != Y, because that is not a supported skew:
https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/#kubeadm-s-skew-against-kubeadm
We do have tests that skew the control plane and kubelet versions, but the kubeadm version remains the same.
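
Outside the thread, here is a minimal reproduction sketch of the skew described in the second bullet above, assuming a plain two-node kubeadm setup; the versions, endpoint, and placeholders are illustrative and not taken from the failing job:

```sh
# Hedged reproduction sketch; versions and placeholders are assumptions.
# 1) Create the cluster with a kubeadm v1.28 binary. This does NOT create the
#    "kubeadm:cluster-admins" ClusterRoleBinding.
kubeadm init --kubernetes-version v1.28.3 --upload-certs

# 2) Join a second control plane node using a kubeadm v1.29 pre-release binary.
#    The admin.conf it generates belongs to the "kubeadm:cluster-admins" group,
#    which has no binding on this cluster, so the check-etcd phase fails with
#    the "pods is forbidden" error captured in the CAPD logs above.
kubeadm join <control-plane-endpoint>:6443 --control-plane \
  --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --certificate-key <key>
```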

sbueringer (Member) commented Oct 31, 2023

Yeah, this test creates a cluster with Kubernetes 1.28 and then upgrades it to Kubernetes 1.29 by joining new 1.29 nodes.

Sounds like we have to implement this sub-task to fix the failure:


#9578
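
For reference, a hedged sketch of the RBAC object a kubeadm v1.29 join expects but a v1.28-initialised cluster lacks; this is only an illustration of the missing step, not necessarily the exact change implemented in #9682:

```sh
# Hedged sketch of the missing object, assuming the kubeadm v1.29 behaviour
# introduced via kubernetes/kubernetes#121305: admin.conf is issued for the
# "kubeadm:cluster-admins" group instead of "system:masters", and kubeadm init
# v1.29 creates a ClusterRoleBinding for that group. On a cluster created with
# kubeadm v1.28 that binding is missing; it is equivalent to:
kubectl create clusterrolebinding kubeadm:cluster-admins \
  --clusterrole=cluster-admin \
  --group=kubeadm:cluster-admins
```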

neolit123 (Member) commented:

> Sounds like we have to implement this sub-task to fix the failure:

Yep, looks like the missing upgrade step is on the CAPI side.
