CAPI capi-e2e-main-1-24-latest failing consistently since June 1st #6596

Closed
fabriziopandini opened this issue Jun 6, 2022 · 18 comments · Fixed by #6917
Labels
kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test.

Comments

@fabriziopandini
Member

fabriziopandini commented Jun 6, 2022

See https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main-1-24-latest
This is probably related to one of the recent changes in the test framework.

fabriziopandini added the kind/failing-test label Jun 6, 2022
@fabriziopandini
Member Author

It seems there is a similar problem in capi-e2e-main-1-22-1-23.

@sbueringer
Member

sbueringer commented Jun 6, 2022

I think it's something with the new registry, which is used for >= v1.25
1.22=>1.23 is probably a flake plus the general Prow issue ("unauthorized")

Reposting from Slack:

I don't think this has been caused by our recent changes. This job succeeded the first time after those changes afaik.
The timing perfectly aligns with when CAPO started failing because of the registry change discussed here: #6590 (comment) and mentioned here: https://kubernetes.slack.com/archives/C13J86Z63/p1654179481039939?thread_ts=1654113710.985029&cid=C13J86Z63

Testgrid CAPO/CAPI: (screenshots not reproduced here)

Note: I think the root cause is the same for the CAPI/CAPO failures but the fixes are probably different.

@sbueringer
Member

sbueringer commented Jun 8, 2022

I tested it locally with the following results:

tl;dr the cause is that since 1st June the new (kubeadm) default registry is registry.k8s.io.

Usually this is not a problem as:

  • new v1.24 cluster => uses k8s.gcr.io
  • new v1.25-x => uses registry.k8s.io
  • v1.24 cluster (k8s.gcr.io) upgraded to v1.25-x => should be able to still work with the old registry

In our case CAPD CI fails because we are relying on the correct images being baked into the kind image. When the v1.25-x machine comes up we get errors like the following:

Jun 07 11:25:02.835407 k8s-upgrade-and-conformance-1vnw8y-vnm46-9xxcf containerd[220]: time="2022-06-07T11:25:02.835205540Z" level=error msg="PullImage \"k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222\" failed" error="rpc error: code = NotFound desc = failed to pull and unpack image \"k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222\": failed to resolve reference \"k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222\": k8s.gcr.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222: not found"
(while registry.k8s.io/kube-apiserver:v1.25.0-alpha.0.839_79cef122765222 is included in the kind image)

In theory we can work around this by choosing the KCP imageRepository based on the Kubernetes version in the ClusterClass. I tried manually to use registry.k8s.io for the v1.25-x machines (by updating the field in KCP) and it worked.

But I think we should consider a more general solution. As @neolit123 mentioned in Slack:

note that CAPI would have to mutate the clusterconfiguration.imageRepository too, since CAPI doesn't use "kubeadm upgrade apply".
https://kubernetes.slack.com/archives/C13J86Z63/p1654179481039939?thread_ts=1654113710.985029&cid=C13J86Z63

When the imageRepository is not set in KCP, the clusterConfiguration.imageRepository in the kubeadm-config ConfigMap gets the kubeadm default on kubeadm init.
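
For anyone who wants to verify what a cluster currently has there, here is a minimal client-go sketch (hypothetical, not CAPI code) that just prints the stored ClusterConfiguration; it assumes the default kubeconfig location:

```go
// Hypothetical helper: print the ClusterConfiguration that kubeadm init wrote
// into the kubeadm-config ConfigMap (it contains the effective imageRepository).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// kubeadm stores the ClusterConfiguration under this key in kube-system/kubeadm-config.
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "kubeadm-config", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println(cm.Data["ClusterConfiguration"])
}
```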

I would suggest the following change to the upgrade behavior of KCP:

  • if the imageRepository is set in KCP => do nothing
  • if the imageRepository is not set in KCP => upgrade the imageRepository in the ConfigMap from k8s.gcr.io to registry.k8s.io (when the current KCP version is >= v1.25)

If I got it right, this would align our KCP upgrade implementation with kubeadm upgrade (kubernetes/kubernetes#110343).
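
To illustrate the idea, a minimal sketch of the decision (hypothetical helper, not the actual KCP code), assuming the current repository has already been read from the kubeadm-config ConfigMap and that blang/semver is used for the version comparison:

```go
// Hypothetical sketch of the proposed KCP upgrade behavior for imageRepository.
package main

import (
	"fmt"

	"github.com/blang/semver"
)

const (
	oldDefaultImageRepository = "k8s.gcr.io"
	newDefaultImageRepository = "registry.k8s.io"
)

// resolveImageRepository decides which imageRepository to write back into the
// kubeadm-config ConfigMap when KCP upgrades the cluster to targetVersion.
func resolveImageRepository(kcpImageRepository, currentRepository string, targetVersion semver.Version) string {
	// If the imageRepository is set in KCP => do nothing (respect the user's value).
	if kcpImageRepository != "" {
		return kcpImageRepository
	}
	// If the ConfigMap still holds the old default and we are upgrading to
	// >= v1.25, switch to the new default. "1.25.0-0" is used as the cutoff so
	// that pre-releases (alpha/beta/rc) of v1.25 are included as well.
	if currentRepository == oldDefaultImageRepository && targetVersion.GTE(semver.MustParse("1.25.0-0")) {
		return newDefaultImageRepository
	}
	return currentRepository
}

func main() {
	fmt.Println(resolveImageRepository("", "k8s.gcr.io", semver.MustParse("1.25.0-alpha.1"))) // registry.k8s.io
	fmt.Println(resolveImageRepository("", "k8s.gcr.io", semver.MustParse("1.24.2")))         // k8s.gcr.io
}
```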

WDYT @fabriziopandini ?

@neolit123
Member

I think the CAPD node images should bake registry.k8s.io component images if the kubeadm version is at least 1.25.0-0. The issue here is that kubeadm 1.25.0-0 would need registry.k8s.io images by default, and if another repo is passed via imageRepository it will be considered custom (e.g. the coredns path is different).

We had a similar problem in kinder / kubeadm e2e. So what we do is check the kubeadm version and bake the component images from tars differently.
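
As a side note on the 1.25.0-0 cutoff: in semver ordering, a purely numeric pre-release identifier sorts below alphanumeric ones, so alpha/beta/rc builds of v1.25.0 all compare as >= 1.25.0-0. A quick check, again assuming the blang/semver library:

```go
package main

import (
	"fmt"

	"github.com/blang/semver"
)

func main() {
	cutoff := semver.MustParse("1.25.0-0")
	for _, s := range []string{"1.24.2", "1.25.0-alpha.0", "1.25.0-beta.1", "1.25.0"} {
		// Only the 1.24.x version falls below the cutoff; pre-releases of 1.25.0 pass it.
		fmt.Println(s, ">= 1.25.0-0:", semver.MustParse(s).GTE(cutoff))
	}
}
```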

@sbueringer
Member

sbueringer commented Jun 8, 2022

@neolit123 We already have the registry.k8s.io images baked in. Our problem is that we don't have the kubeadm upgrade logic. So an upgrade test which inits v1.24 and then upgrades to v1.25 is still trying to use the old registry.

I think we have to implement the "kubeadm upgrade" change in KCP anyway and this will then also fix our test.

@neolit123
Member

ok, understood.

I think we have to implement the "kubeadm upgrade" change in KCP anyway and this will then also fix our test.

yes, KCP has to do it.

@sbueringer
Member

For completeness. Upstream kubeadm issues/PRs:

I added it as a release-specific task to our v1.25 tracking issue #6661.

@sbueringer
Member

Talked with Fabrizio about this issue, we would do the following:

  • if the imageRepository is set in KCP => do nothing
  • if the imageRepository is not set in KCP => upgrade the imageRepository in the ConfigMap from k8s.gcr.io to registry.k8s.io (when the current KCP version is >= v1.25)

@chrischdi
Member

I'd like to tackle this

/assign

@chrischdi
Member

Talked with Fabrizio about this issue, we would do the following:

  • if the imageRepository is set in KCP => do nothing
  • if the imageRepository is not set in KCP => upgrade the imageRepository in the ConfigMap from k8s.gcr.io to registry.k8s.io (when the current KCP version is >= v1.25)

To note: this won't actually fix the issue our tests have yet, because we always set an imageRepository via the ClusterClass.

@sbueringer
Member

Yup. We should drop that patch.

@chrischdi
Member

chrischdi commented Jul 14, 2022

I think we should just not apply the patch if the value is empty instead :-) keeping it customizable?! (And to do so, change the default value to an empty string.)

@sbueringer
Member

sbueringer commented Jul 14, 2022

I don't think there is any need for that patch to be honest. It was just a random patch we added for testing.

To be clear. I'm only referring to the ClusterClass we are using in e2e tests. I don't think we have to keep it customizable. (and keep a patch that is not covered by our tests)

But probably the same applies to our CAPD ClusterClass as well. It's just a test/dev provider, explicitly not recommended for production. So I don't think we have to provide a variable to allow customization of the imageRepository

@chrischdi
Member

I don't think there is any need for that patch to be honest. It was just a random patch we added for testing.

To be clear. I'm only referring to the ClusterClass we are using in e2e tests. I don't think we have to keep it customizable. (and keep a patch that is not covered by our tests)

But probably the same applies to our CAPD ClusterClass as well. It's just a test/dev provider, explicitly not recommended for production. So I don't think we have to provide a variable to allow customization of the imageRepository

@sbueringer: but doesn't the "When following the Cluster API quick-start [PR-Blocking]" test require us to keep the ClusterClass in sync?

xref:

var _ = Describe("When following the Cluster API quick-start [PR-Blocking]", func() {

@sbueringer
Member

sbueringer commented Jul 14, 2022

Given that they are not in sync today, it's not a hard requirement :).

But I think we can drop the variable+patch from both. I think it would be a good practice to keep them in sync, and it would be nice to roughly use the patches we have in our e2e tests so we have some confidence that they work / didn't just break at some point.

@sbueringer
Member

It would be great if someone could keep an eye on Testgrid over the next few days.

@chrischdi
Member

I will do so 😀 I'd like to see green test grids

@chrischdi
Member

For the record: the job has been green twice in a row since the fix was merged.
