CI failure: capi-e2e-release-1-1-1-23-1-24 is failing consistently #7768

Closed
furkatgofurov7 opened this issue Dec 15, 2022 · 40 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@furkatgofurov7
Member

furkatgofurov7 commented Dec 15, 2022

The release-1.1 branch job https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-1.1#capi-e2e-release-1-1-1-23-1-24 has been failing consistently since December 9th.

Logs from the corresponding prow job:

capi-e2e: When upgrading a workload cluster using ClusterClass and testing K8S conformance [Conformance] [K8s-Upgrade] Should create and upgrade a workload cluster and run kubetest (20m54s)

{ Failure /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade.go:115 Failed to run Kubernetes conformance Unexpected error:
    <*errors.withStack | 0xc001dd0a80>: {
        error: <*errors.withMessage | 0xc0010a6240>{
            cause: <*errors.errorString | 0xc00086ce40>{
                s: "error container run failed with exit code 1",
            },
            msg: "Unable to run conformance tests",
        },
        stack: [0x1ad532a, 0x1b1a468, 0x73c37a, 0x73bd45, 0x73b43b, 0x7411c9, 0x740ba7, 0x7621c5, 0x761ee5, 0x761725, 0x7639d2, 0x76fb65, 0x76f97e, 0x1b34e11, 0x515662, 0x46b321],
    }
    Unable to run conformance tests: error container run failed with exit code 1 occurred /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade.go:232}

Dec 15 12:41:10.938: INFO: At 2022-12-15 12:31:23 +0000 UTC - event for coredns-74f7f66b6f-s6m5s: {kubelet k8s-upgrade-and-conformance-0v458z-md-0-lcjxn-69fd749f7f-5dlnz} Failed: Failed to pull image "k8s.gcr.io/coredns:v1.8.6": rpc error: code = NotFound desc = failed to pull and unpack image "k8s.gcr.io/coredns:v1.8.6": failed to resolve reference "k8s.gcr.io/coredns:v1.8.6": k8s.gcr.io/coredns:v1.8.6: not found

/kind bug

To sum up:
This is fixed by #7787, but I will keep this issue open for a few days to watch the CI signal and close it afterwards. In the CAPI release-1.1 branch e2e tests the registry was pinned to k8s.gcr.io by default; removing the registry pinning and leaving it empty solved the CI failure reported in this issue.
Also, please check the comments/suggestions from @sbueringer here and here to further understand the root cause of the issue and the possible ways to fix it going forward in case you are seeing it.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 15, 2022
@furkatgofurov7
Member Author

A similar image pull failure was reported in kubeadm: kubernetes/kubeadm#2761

Looks like the suggested solution there was reverting the imageRegistry from registry.k8s.io/coredns:v1.8.6 back to k8s.gcr.io. Maybe we need to cherry-pick #7410 to the release-1.1 branch to fix it?

Any thoughts @fabriziopandini @killianmuldoon @ykakarap @CecileRobertMichon @oscr

@furkatgofurov7
Member Author

Looks like the suggested solution there was reverting the imageRegistry from registry.k8s.io/coredns:v1.8.6 back to k8s.gcr.io. Maybe we need to cherry-pick #7410 to the release-1.1 branch to fix it?

Actually, #7410 changed the registry from k8s.gcr.io to registry.k8s.io, which is the opposite of what was suggested there.

@oscr
Contributor

oscr commented Dec 15, 2022

Is that image available in the k8s.gcr.io registry?

Running locally:

$ docker pull k8s.gcr.io/coredns:v1.8.6
Error response from daemon: manifest for k8s.gcr.io/coredns:v1.8.6 not found: manifest unknown: Failed to fetch "v1.8.6" from request "/v2/coredns/manifests/v1.8.6".

$ docker pull registry.k8s.io/coredns:v1.8.6
Error response from daemon: manifest for registry.k8s.io/coredns:v1.8.6 not found: manifest unknown: Failed to fetch "v1.8.6"

However it is available here:

$ docker pull registry.k8s.io/coredns/coredns:v1.8.6
v1.8.6: Pulling from coredns/coredns
Digest: sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e
Status: Image is up to date for registry.k8s.io/coredns/coredns:v1.8.6
registry.k8s.io/coredns/coredns:v1.8.6

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 15, 2022

Is that image available in the k8s.gcr.io registry?

It is not, but https://github.com/kubernetes-sigs/cluster-api/blob/release-1.1/controlplane/kubeadm/internal/workload_cluster_coredns.go#L220-L224 should handle the transformation automatically, AFAIU.

and that image is available:

docker pull  k8s.gcr.io/coredns/coredns:v1.8.6
v1.8.6: Pulling from coredns/coredns
d92bdee79785: Pull complete 
6e1b7c06e42d: Pull complete 
Digest: sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e
Status: Downloaded newer image for k8s.gcr.io/coredns/coredns:v1.8.6
k8s.gcr.io/coredns/coredns:v1.8.6

@CecileRobertMichon
Contributor

Interesting, we are seeing the same error in CAPZ but with CAPI 1.3... https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capi-periodic-upgrade-main-1-23-1-24

@killianmuldoon
Contributor

This is confusing: there shouldn't be a CoreDNS upgrade from 1.23 to 1.24. If I've read it right, 1.23 uses CoreDNS v1.8.6 by default and the directive is to upgrade to v1.8.6, so it should be a no-op. I'm not able to replicate this locally, but I'm trying to figure out what's going on.

@Ankitasw
Member

In CAPA CSIMigration tests, we are upgrading from v1.22 to v1.23.

@killianmuldoon
Contributor

In CAPA CSIMigration tests, we are upgrading from v1.22 to v1.23.

Can you point me to the test spec for this? The issue may be different from the one this issue covers.

@Ankitasw
Member

Ankitasw commented Dec 16, 2022

If the issue seems different, I can raise it separately.
The only difference I saw is that we are getting an error pulling the image from registry.k8s.io, while this issue is about an image pull error from k8s.gcr.io, hence I reported it here.

@killianmuldoon
Contributor

My best guess, after trying to replicate this, is that there is a clash between the CAPI code that fixes up the CoreDNS image name for older versions and the kubeadm update that uses the new registry for CoreDNS.

The root cause is that the Kubernetes versions used in the e2e tests have been updated to kubeadm versions with the registry fix, and there aren't kindest/node images available for those versions yet.

I think this should be fixable by specifying the image registry in the clusterConfiguration field of the KubeadmControlPlane (rough sketch below). I've got a version which does this for the failing 1.1 branch, but I'd like to find out whether it can also fix the issues on CAPZ or CAPA. Starting next week I'll build a kindest/node image with the new 1.23 and 1.24 versions so I can properly test the fix with CAPD.
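For reference, a minimal sketch of where that setting lives on a KubeadmControlPlane (field names are from the v1beta1 API; the registry value and <kcp-name> are placeholders, not necessarily what the actual fix uses):

# Sketch only: pin the image registry in the KubeadmControlPlane's clusterConfiguration,
# either in the e2e template or by patching an existing control plane.
cat > kcp-registry-patch.yaml <<'EOF'
spec:
  kubeadmConfigSpec:
    clusterConfiguration:
      imageRepository: registry.k8s.io   # placeholder value; normally left unset
EOF
kubectl patch kubeadmcontrolplane <kcp-name> --type merge --patch-file kcp-registry-patch.yaml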

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 19, 2022

My best guess, after trying to replicate this, is that there is a clash between the CAPI code that fixes up the CoreDNS image name for older versions and the kubeadm update that uses the new registry for CoreDNS.

I am not sure why #6917 was not backported to the release-1.1 branch while it was backported to release-1.2, and how CI was fine up until now. Also, on that patch we only overwrite imageRepository with the new one if a value was explicitly set in /spec/template/spec/kubeadmConfigSpec/clusterConfiguration/imageRepository, which is not the case here IMO.

@furkatgofurov7
Member Author

/cc @chrischdi

@chrischdi
Member

Hi @furkatgofurov7,

It was not backported to release-1.1 because that branch was already out of support: #6917 (comment).

@chrischdi
Member

Also: if no imageRepository is set, there is no need to update/fix the imageRepository, because then we defer to kubeadm's defaults.
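For reference, a quick way to check which registry a given kubeadm binary defaults to (sketch; the output depends on the binary version):

# Binaries with the registry change (e.g. v1.24.9+) print registry.k8s.io; older ones print k8s.gcr.io
kubeadm config print init-defaults | grep imageRepository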

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 19, 2022

Also: if no imageRepository is set, there is no need to update/fix the imageRepository, because then we defer to kubeadm's defaults.

@chrischdi hi! The default in kubeadm points to "registry.k8s.io", doesn't it?
Also, this has recently been reported by CAPI users in the registry-change thread on the k8s dev mailing list (https://groups.google.com/a/kubernetes.io/g/dev/c/DYZYNQ_A6_c).

@chrischdi
Member

The kubeadm default depends heavily on the version of the kubeadm binary in use.

In this case, the test seems to have been failing since roughly the release of Kubernetes v1.24.9. The changelog also notes an interesting kubeadm change:

Kubeadm: use the image registry registry.k8s.io instead of k8s.gcr.io for new clusters. During upgrade, migrate users to registry.k8s.io if they were using the default of k8s.gcr.io. (kubernetes/kubernetes#113395, @neolit123) [SIG Cloud Provider and Cluster Lifecycle]
[src]

The above test uses ClusterClass and makes use of the default value for the imageRepository of the ClusterClass, which is:

    - name: imageRepository
      value: k8s.gcr.io

[src rendered] / [src clusterclass default]

It looks like this setting, together with kubeadm v1.24.9, breaks this case (a sketch reproducing the combination is below).
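A hypothetical way to reproduce that combination with a kubeadm v1.24.9 binary; the expected output is an assumption based on the changelog entry above:

cat > clusterconfiguration.yaml <<'EOF'
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
kubernetesVersion: v1.24.9
imageRepository: k8s.gcr.io   # the value the ClusterClass variable pins
EOF
kubeadm config images list --config clusterconfiguration.yaml
# expected to include (assumption): k8s.gcr.io/coredns:v1.8.6
# i.e. the flat path without the extra coredns/, which does not exist on that registry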

@neolit123
Member

Outside of CAPI (kubeadm only) I saw a number of failures where users pinned to registry.k8s.io before it became the default in a given kubeadm version. That is not right; users should leave any registry fields blank unless they are pinning to a non-default, custom registry (not the old or new default).

Also, again, I greatly regret agreeing to make the CoreDNS paths inconsistent depending on default / non-default registry. Ironically, it was CAPI maintainers that pushed for this.

@furkatgofurov7
Member Author

/kind failing-test

@k8s-ci-robot k8s-ci-robot added the kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. label Dec 19, 2022
@sbueringer
Member

sbueringer commented Dec 19, 2022

I think the reason the test started failing now is that kubeadm previously auto-migrated k8s.gcr.io/coredns => k8s.gcr.io/coredns/coredns, and it no longer does that for the old registry.

I think the solution for our release-1.1 tests might just be to stop pinning the registry in the e2e test. If I see correctly, that's also what we are doing on the newer branches.

I think for now it would be fine to make this change on release-1.1 even though it's out of support (it's only an e2e test change). I will bring up the point about dropping tests of out-of-support CAPI versions independently.

I think the main point for users here is (as written by Lubomir):

That is not right; users should leave any registry fields blank unless they are pinning to a non-default, custom registry (not the old or new default).

Essentially this is what kubeadm supports, and as KCP is tightly coupled to kubeadm we should support the same. KCP is slightly more flexible, as it auto-migrates coredns => coredns/coredns for both registries (in v1.3 and v1.2.8+), but apart from that I think we are just aligned with the kubeadm behavior.

If I understood it correctly, there's nothing to do for CAPI main / 1.3.x / 1.2.x.

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 19, 2022

I think the reason the test started failing now is that kubeadm previously auto-migrated k8s.gcr.io/coredns => k8s.gcr.io/coredns/coredns, and it no longer does that for the old registry.

Interesting, I was not aware of this.

I think the solution for our release-1.1 tests might just be to stop pinning the registry in the e2e test. If I see correctly, that's also what we are doing on the newer branches.

Correct.

Essentially this is what kubeadm supports, and as KCP is tightly coupled to kubeadm we should support the same. KCP is slightly more flexible, as it auto-migrates coredns => coredns/coredns for both registries (in v1.3 and v1.2.8+), but apart from that I think we are just aligned with the kubeadm behavior.

If I understood it correctly, there's nothing to do for CAPI main / 1.3.x / 1.2.x.

I thought it is wider than that; have you checked #7768 (comment)? CAPZ is failing while using CAPI v1.3, and also CAPA #7768 (comment) (that one I have not checked myself, so I am not sure it is the same issue).
It is not reported here, but in CAPM3 we started seeing the same issue from December 14th in our e2e KCP upgrade tests, and we are using the release-1.2.x series in that test.

Some logs from the CAPM3 provider; basically, cloud-init is failing on the upgraded node:

[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR ImagePull]: failed to pull image registry.k8s.io/coredns:v1.8.6: output: E1219 04:06:02.045321    4093 remote_image.go:238] "PullImage from image service failed" err="rpc error: code = Unknown desc = reading manifest v1.8.6 in registry.k8s.io/coredns: manifest unknown: Failed to fetch \"v1.8.6\"" image="registry.k8s.io/coredns:v1.8.6"
time="2022-12-19T04:06:02-05:00" level=fatal msg="pulling image: rpc error: code = Unknown desc = reading manifest v1.8.6 in registry.k8s.io/coredns: manifest unknown: Failed to fetch \"v1.8.6\""
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
cp: cannot stat '/etc/kubernetes/admin.conf': No such file or directory
chown: cannot access '/home/metal3/.kube/config': No such file or directory

@sbueringer
Member

@Ankitasw @CecileRobertMichon Can you please confirm with which versions of CAPI you are seeing this issue?

As far as I can see CAPA is on v1.2.7 and for CAPZ there seem to be contradictory statements

Interesting, we are seeing the same error in CAPZ but with CAPI 1.3... https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api-provider-azure#capi-periodic-upgrade-main-1-23-1-24
#7768 (comment)

vs.

for some reason it's no longer happening since updating to 1.3
https://kubernetes.slack.com/archives/CEX9HENG7/p1670976928873619?thread_ts=1670895435.002349&cid=CEX9HENG7

@sbueringer
Member

sbueringer commented Dec 19, 2022

It is not reported here, but in CAPM3 we started seeing the same issue from December 14th in our e2e KCP upgrade tests, and we are using the release-1.2.x series in that test.

Which Cluster API and Kubernetes versions exactly are used in this test?

I would suspect it's a Kubernetes version which doesn't have the registry default change (not the latest patch version).

Are you pinning the registry to registry.k8s.io?

@Ankitasw
Member

Ankitasw commented Dec 20, 2022

@Ankitasw Can you please confirm with which versions of CAPI you are seeing this issue?
As far as I can see CAPA is on v1.2.7

We have a CAPI bump PR open in CAPA, and we are facing this issue while running tests in that PR. We first bumped it to 1.3.0, and already there we observed CCM migration test failures with this image-not-found error.

Also, we are using registry.k8s.io as the registry in our E2E tests, and coredns is found at registry.k8s.io/coredns/coredns but not at registry.k8s.io/coredns. Ref: Slack thread

@lentzi90
Contributor

lentzi90 commented Dec 20, 2022

It is not reported here, but in CAPM3 we started seeing the same issue from December 14th in our e2e KCP upgrade tests, and we are using the release-1.2.x series in that test.

Which Cluster API and Kubernetes versions exactly are used in this test?

I would suspect it's a Kubernetes version which doesn't have the registry default change (not the latest patch version).

Are you pinning the registry to registry.k8s.io?

Here are the details:

  • Cluster API: v1.2.8
  • Starting Kubernetes version: v1.23.8 (works)
  • Upgraded Kubernetes version: v1.24.1 (broken)
  • imageRepository is not specified.

Edit: clusterctl is v1.2.8, but the go module is at v1.2.6.

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 20, 2022

Which Cluster API and Kubernetes versions exactly are used in this test?

I would suspect it's a Kubernetes version which doesn't have the registry default change (not the latest patch version).

Are you pinning the registry to registry.k8s.io?

Here are the details:

  • Cluster API: v1.2.8
  • Starting Kubernetes version: v1.23.8 (works)
  • Upgraded Kubernetes version: v1.24.1 (broken)
  • imageRepository is not specified.

Edit: clusterctl is v1.2.8, but the go module is at v1.2.6.

@sbueringer so the upgraded Kubernetes version should be the latest patch release? That would be v1.24.9 in that case (there is also v1.24.10-rc.0).

@sbueringer
Member

sbueringer commented Dec 20, 2022

The following patch versions should have the change:

  • >= v1.22.17
  • >= v1.23.15
  • >= v1.24.9
  • >= v1.25.0

kubeadm init in those versions should handle coredns => coredns/coredns for registry.k8s.io correctly, while previous versions of kubeadm handle it correctly for k8s.gcr.io.
(That's why pinning the registry to either k8s.gcr.io or registry.k8s.io is not recommended; see the sketch below.)

I can look into the different cases, but it will take a bit until I have time for that.

EDIT: The patch versions I initially listed for v1.22, v1.23 and v1.24 were one too low; updated now.
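A quick sketch of how to check this per kubeadm binary (the paths are placeholders and the expected output is an assumption):

# The default registry and the CoreDNS image path are decided by the kubeadm
# binary itself, not by the version of the cluster being upgraded.
/path/to/kubeadm-v1.24.8 config images list | grep coredns   # expect k8s.gcr.io/coredns/coredns:v1.8.6
/path/to/kubeadm-v1.24.9 config images list | grep coredns   # expect registry.k8s.io/coredns/coredns:v1.8.6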

@Ankitasw
Member

Ankitasw commented Dec 20, 2022

We don't pin the imageRepository in CAPA, and even after using the above patch versions we are getting this error in the CAPA CCM migration tests.

@CecileRobertMichon
Contributor

The following patch versions should have the change

@sbueringer that confirms what I'm seeing and why the CAPZ test mysteriously fixed itself (we automatically pick up new patches when the versions become available):

Dec 15 19:17:30.383: INFO: Resolved KUBERNETES_VERSION_UPGRADE_FROM (set to stable-1.23) to v1.23.13
Dec 15 19:17:30.383: INFO: Resolved KUBERNETES_VERSION_UPGRADE_TO (set to stable-1.24) to v1.24.7

Dec 19 19:19:09.341: INFO: Resolved KUBERNETES_VERSION_UPGRADE_FROM (set to stable-1.23) to v1.23.15
Dec 19 19:19:09.341: INFO: Resolved KUBERNETES_VERSION_UPGRADE_TO (set to stable-1.24) to v1.24.9

@sbueringer
Member

sbueringer commented Dec 22, 2022

@CecileRobertMichon This would make sense if the test sets the imageRepository to registry.k8s.io; if it is not set, I would have assumed both test runs should succeed.

Is the imageRepo (in KCP) set in this test?

EDIT: Checked the resources in https://gcsweb.k8s.io/gcs/kubernetes-jenkins/logs/periodic-cluster-api-provider-azure-e2e-workload-upgrade-1-23-1-24-main/1603466731035037696/artifacts/clusters/bootstrap/resources/: imageRepository is not set in KCP. I wonder where this test run got registry.k8s.io from, as kubeadm should still have used the old registry.

EDIT 2: Okay, I have a working theory about what happened:

  • Cluster API >= v1.2.8 and >= v1.3.0 automatically changes the registry tracked in the kubeadm-config ConfigMap to registry.k8s.io for Kubernetes >= 1.22.0 (see the sketch below for inspecting that ConfigMap)
    • It doesn't distinguish between Kubernetes versions whose kubeadm uses the old registry and those that use the new one
  • If a Kubernetes/kubeadm version with the old registry is used, kubeadm join fails, as kubeadm join doesn't handle the new registry correctly
    • [ERROR ImagePull]: failed to pull image registry.k8s.io/coredns:v1.8.6: output: time="2022-12-15T19:31:16Z" level=fatal msg="pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \"registry.k8s.io/coredns:v1.8.6\": failed to resolve reference \"registry.k8s.io/coredns:v1.8.6\": registry.k8s.io/coredns:v1.8.6: not found"

If I'm correct, this essentially means that Cluster API v1.2.8 and v1.3.0 are only compatible with Kubernetes >= v1.22.17, >= v1.23.15, >= v1.24.9, >= v1.25.0 (if the kubeadm providers are used).

P.S. The patch versions in #7768 (comment) were initially one too low; updated now.
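For reference, the ClusterConfiguration that kubeadm join/upgrade reads (including the tracked registry) can be inspected on an affected cluster like this (sketch):

# Show the imageRepository recorded in the kubeadm-config ConfigMap
kubectl -n kube-system get configmap kubeadm-config \
  -o jsonpath='{.data.ClusterConfiguration}' | grep -i imageRepository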

@sbueringer
Member

sbueringer commented Dec 22, 2022

To confirm my theory:

@lentzi90 Would it be possible to check if the upgrade test works when upgrading to v1.24.9?

@Ankitasw Would it be possible to check if the upgrade test works when upgrading to v1.23.15?

@furkatgofurov7 I don't know which minor versions you are using in the test, but can you please also retry with the latest Kubernetes patch releases?

@Ankitasw
Member

The tests are passing with the updated k8s patch version; thanks a lot @sbueringer for the analysis 🙇‍♀️. Great work 👍

@furkatgofurov7
Member Author

@furkatgofurov7 I don't know which minor versions you are using in the test, but can you please also retry with the latest Kubernetes patch releases?

I will come back with the results after testing with the new k8s patch version

@lentzi90
Contributor

@lentzi90 Would it be possible to check if the upgrade test works when upgrading to v1.24.9?

It works! 🎉 Thanks for the help!

@furkatgofurov7
Member Author

I will come back with the results after testing with the new k8s patch version

@lentzi90 Would it be possible to check if the upgrade test works when upgrading to v1.24.9?

It works! 🎉 Thanks for the help!

@sbueringer you were right, tests are passing after bumping the k8s version 👍🏼

@furkatgofurov7
Member Author

Tests have passed twice in a row since the fix (#7787) landed in the release-1.1 branch.

@fabriziopandini
Member

/triage accepted
/close

It seems that we nailed down this problem across the board. Amazing team work!

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 28, 2022
@k8s-ci-robot
Contributor

@fabriziopandini: Closing this issue.

In response to this:

/triage accepted
/close

It seems that we nailed down this problem across the board. Amazing team work!

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer
Member

sbueringer commented Jan 3, 2023

We still have an issue with kubeadm v1.22.x, v1.23.x, and v1.24.x binaries that use the old registry. I've opened a follow-up issue for those cases: #7833
