CAPI capi-e2e-main-1-24-latest failing consistently since June 1st #6596

Closed
fabriziopandini opened this issue Jun 6, 2022 · 18 comments · Fixed by #6917
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@fabriziopandini
Member

fabriziopandini commented Jun 6, 2022

See https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-24-latest
This is probably related to one of the recent changes in the test framework.

fabriziopandini added the kind/failing-test label Jun 6, 2022
@fabriziopandini
Member Author

It seems there is a similar problem in capi-e2e-main-1-22-1-23.

@sbueringer
Member

sbueringer commented Jun 6, 2022

I think it's something with the new registry, which is used for >= v1.25
1.22=>1.23 is probably a flake plus the general Prow issue ("unauthorized")

Reposting from Slack:

I don't think this has been caused by our recent changes. This job succeeded the first time after those changes afaik.
The timing perfectly aligns with when CAPO started failing because of the registry change discussed here: #6590 (comment) and mentioned here: https://kubernetes.slack.com/archives/C13J86Z63/p1654179481039939?thread_ts=1654113710.985029&cid=C13J86Z63

Testgrid CAPO/CAPI: (screenshots not reproduced here)

Note: I think the root cause is the same for the CAPI/CAPO failures but the fixes are probably different.

@sbueringer
Member

sbueringer commented Jun 8, 2022

I tested it locally with the following results:

tl;dr the cause is that since 1st June the new (kubeadm) default registry is registry.k8s.io.

Usually this is not a problem as:

  • new v1.24 cluster => uses k8s.gcr.io
  • new v1.25-x => uses registry.k8s.io
  • v1.24 cluster (k8s.gcr.io) upgraded to v1.25-x => should be able to still work with the old registry

In our case CAPD CI fails because we are relying on the correct images being baked into the kind image. When the v1.25-x machine comes up we get errors like the following:

Jun 07 11:25:02.835407 k8s-upgrade-and-conformance-1vnw8y-vnm46-9xxcf containerd[220]: time="2022-06-07T11:25:02.835205540Z" level=error msg="PullImage \"k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222\" failed" error="rpc error: code = NotFound desc = failed to pull and unpack image \"k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222\": failed to resolve reference \"k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222\": k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222: not found"
(while registry.k8s.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222 is included in the kind image)

In theory we can work around this by choosing the KCP imageRepository based on the Kubernetes version in the ClusterClass. I tried manually to use registry.k8s.io for the v1.25-x machines (by updating the field in KCP) and it worked.

But I think we should consider a more general solution. As @neolit123 mentioned in Slack:

note that CAPI would have to mutate the clusterconfiguration.imageRepository too, since CAPI doesn't use "kubeadm upgrade apply".
https://kubernetes.slack.com/archives/C13J86Z63/p1654179481039939?thread_ts=1654113710.985029&cid=C13J86Z63

When the imageRepository is not set in KCP, the clusterConfiguration.imageRepository in the kubeadm-config ConfigMap gets the kubeadm default on kubeadm init.
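
For anyone who wants to verify what a cluster currently has there, here is a minimal client-go sketch (hypothetical, not CAPI code) that just prints the stored ClusterConfiguration; it assumes the default kubeconfig location:

```go
// Hypothetical helper: print the ClusterConfiguration that kubeadm init wrote
// into the kubeadm-config ConfigMap (it contains the effective imageRepository).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// kubeadm stores the ClusterConfiguration under this key in kube-system/kubeadm-config.
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "kubeadm-config", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(cm.Data["ClusterConfiguration"])
}
```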

I would suggest the following change to the upgrade behavior of KCP:

  • if the imageRepository is set in KCP => do nothing
  • if the imageRepository is not set in KCP => upgrade the imageRepository in the ConfigMap from k8s.gcr.io to registry.k8s.io (when the current KCP version is >= v1.25)

If I got it right, this would align our KCP upgrade implementation with kubeadm upgrade (kubernetes/kubernetes#110343).
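
To illustrate the idea, a minimal sketch of the decision (hypothetical helper, not the actual KCP code), assuming the current repository has already been read from the kubeadm-config ConfigMap and that blang/semver is used for the version comparison:

```go
// Hypothetical sketch of the proposed KCP upgrade behavior for imageRepository.
package main

import (
	"fmt"

	"github.com/blang/semver"
)

const (
	oldDefaultImageRepository = "k8s.gcr.io"
	newDefaultImageRepository = "registry.k8s.io"
)

// resolveImageRepository decides which imageRepository to write back into the
// kubeadm-config ConfigMap when KCP upgrades the cluster to targetVersion.
func resolveImageRepository(kcpImageRepository, currentRepository string, targetVersion semver.Version) string {
	// If the imageRepository is set in KCP => do nothing (respect the user's value).
	if kcpImageRepository != "" {
		return kcpImageRepository
	}
	// If the ConfigMap still holds the old default and we are upgrading to
	// >= v1.25, switch to the new default. "1.25.0-0" is used as the cutoff so
	// that pre-releases (alpha/beta/rc) of v1.25 are included as well.
	if currentRepository == oldDefaultImageRepository && targetVersion.GTE(semver.MustParse("1.25.0-0")) {
		return newDefaultImageRepository
	}
	return currentRepository
}

func main() {
	fmt.Println(resolveImageRepository("", "k8s.gcr.io", semver.MustParse("1.25.0-alpha.1"))) // registry.k8s.io
	fmt.Println(resolveImageRepository("", "k8s.gcr.io", semver.MustParse("1.24.2")))         // k8s.gcr.io
}
```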

WDYT @fabriziopandini ?

@neolit123
Member

I think the CAPD node images should bake registry.k8s.io component images if the kubeadm version is at least 1.25.0-0. The issue here is that kubeadm 1.25.0-0 would need registry.k8s.io images by default, and if another repo is passed via imageRepository it will be considered custom (e.g. the coredns path is different).

We had a similar problem in kinder / kubeadm e2e. So what we do is check the kubeadm version and bake the component images from tars differently.
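
As a side note on the 1.25.0-0 cutoff: in semver ordering, a purely numeric pre-release identifier sorts below alphanumeric ones, so alpha/beta/rc builds of v1.25.0 all compare as >= 1.25.0-0. A quick check, again assuming the blang/semver library:

```go
package main

import (
	"fmt"

	"github.com/blang/semver"
)

func main() {
	cutoff := semver.MustParse("1.25.0-0")
	for _, s := range []string{"1.24.2", "1.25.0-alpha.0", "1.25.0-beta.1", "1.25.0"} {
		// Only the 1.24.x version falls below the cutoff; pre-releases of 1.25.0 pass it.
		fmt.Println(s, ">= 1.25.0-0:", semver.MustParse(s).GTE(cutoff))
	}
}
```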

@sbueringer
Member

sbueringer commented Jun 8, 2022

@neolit123 We already have the registry.k8s.io images baked in. Our problem is that we don't have the kubeadm upgrade logic. So an upgrade test which inits v1.24 and then upgrades to v1.25 is still trying to use the old registry.

I think we have to implement the "kubeadm upgrade" change in KCP anyway and this will then also fix our test.

@neolit123
Member

ok, understood.

I think we have to implement the "kubeadm upgrade" change in KCP anyway and this will then also fix our test.

yes, KCP has to do it.

@sbueringer
Member

For completeness. Upstream kubeadm issues/PRs:

I added it as a release-specific task to our v1.25 tracking issue #6661.

@sbueringer
Member

Talked with Fabrizio about this issue, we would do the following:

  • if the imageRepository is set in KCP => do nothing
  • if the imageRepository is not set in KCP => upgrade the imageRepository in the ConfigMap from k8s.gcr.io to registry.k8s.io (when the current KCP version is >= v1.25)

@chrischdi
Member

I'd like to tackle this

/assign

@chrischdi
Member

Talked with Fabrizio about this issue, we would do the following:

  • if the imageRepository is set in KCP => do nothing
  • if the imageRepository is not set in KCP => upgrade the imageRepository in the ConfigMap from k8s.gcr.io to registry.k8s.io (when the current KCP version is >= v1.25)

To note: this won't actually fix the issue our tests have yet, because we always set an imageRepository via the ClusterClass.

@sbueringer
Member

Yup. We should drop that patch.

@chrischdi
Member

chrischdi commented Jul 14, 2022

I think we should just not apply the patch if the value is empty instead :-) keeping it customizable?! (And to do so, change the default value to an empty string.)

@sbueringer
Member

sbueringer commented Jul 14, 2022

I don't think there is any need for that patch to be honest. It was just a random patch we added for testing.

To be clear. I'm only referring to the ClusterClass we are using in e2e tests. I don't think we have to keep it customizable. (and keep a patch that is not covered by our tests)

But probably the same applies to our CAPD ClusterClass as well. It's just a test/dev provider, explicitly not recommended for production. So I don't think we have to provide a variable to allow customization of the imageRepository

@chrischdi
Member

I don't think there is any need for that patch to be honest. It was just a random patch we added for testing.

To be clear. I'm only referring to the ClusterClass we are using in e2e tests. I don't think we have to keep it customizable. (and keep a patch that is not covered by our tests)

But probably the same applies to our CAPD ClusterClass as well. It's just a test/dev provider, explicitly not recommended for production. So I don't think we have to provide a variable to allow customization of the imageRepository

@sbueringer: but doesn't the "When following the Cluster API quick-start [PR-Blocking]" test require us to keep the ClusterClass in sync?

xref:

var _ = Describe("When following the Cluster API quick-start [PR-Blocking]", func() {

@sbueringer
Member

sbueringer commented Jul 14, 2022

Given that they are not in sync today, it's not a hard requirement :).

But I think we can drop the variable+patch from both. I think it would be a good practice to keep them in sync, and it would be nice to roughly use the patches we have in our e2e tests so we have some confidence that they work / didn't just break at some point.

@sbueringer
Member

It would be great if someone could keep an eye on Testgrid over the next few days.

@chrischdi
Member

I will do so 😀 I'd like to see green test grids

@chrischdi
Member

For the record: the job has been green twice in a row since the fix was merged.
