Transient Failures in CI for K8S_VERSION 1.25 #803

Closed
jonathan-innis opened this issue Nov 17, 2023 · 5 comments · Fixed by #833
Labels
kind/bug, kind/testing, operational-excellence

Comments

@jonathan-innis
Member

jonathan-innis commented Nov 17, 2023

Problem

We are seeing transient CI failures caused by a bug in the kube-apiserver: a race in the evaluation of CEL expressions makes the apiserver panic. The race was first reported in kubernetes/kubernetes#114661 and was later fixed by kubernetes/kubernetes#114857, which bumped the version of the CEL evaluation package where the bug originated (google/cel-go#620).

Screenshot 2023-11-17 at 10 55 54 AM

Kubernetes back-ported the fix to patch releases of the apiserver. The cherry-picks are linked at the bottom of the originally reported issue.

Screenshot 2023-11-17 at 10 58 32 AM

You can also see these cherry-picks by viewing the Kubernetes changelog for 1.24, 1.25, and 1.26.

setup-envtest does not currently surface every Kubernetes patch version through the mirror from which it pulls the etcd and apiserver binaries. The mirror's contents can be seen here: https://storage.googleapis.com/kubebuilder-tools. As a result, we will continue to see these transient failures on 1.25 until a newer patch version of these binaries that contains the fix is published there.

We only see this failure on Kubernetes 1.25. On versions below 1.25 we have disabled CEL (the feature was not yet in beta and wasn't enabled on EKS clusters), and on versions above 1.25 the binary that setup-envtest provides is new enough to contain the fix.
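To make the version breakdown above concrete, here is a minimal sketch that classifies a `K8S_VERSION` value the way the paragraph describes. The function names and the status strings are hypothetical and exist only for illustration; they are not part of the project's test suite.

```python
from typing import Tuple


def parse_major_minor(k8s_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '1.25', 'v1.25' or '1.25.3'."""
    major, minor = k8s_version.lstrip("v").split(".")[:2]
    return int(major), int(minor)


def cel_race_status(k8s_version: str) -> str:
    """Classify a version according to the situation described in this issue.

    Hypothetical status values:
      'cel-disabled' -> below 1.25, CEL is turned off in our tests
      'affected'     -> 1.25, the envtest apiserver binary lacks the fix
      'patched'      -> above 1.25, the envtest binary already contains the fix
    """
    version = parse_major_minor(k8s_version)
    if version < (1, 25):
        return "cel-disabled"
    if version == (1, 25):
        return "affected"
    return "patched"
```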

Solutions

  1. Upstream controller-runtime can support these newer patch versions in their mirror directly. I've opened an issue to ask for this: "Support newer patch versions of Kubernetes in setup-envtest" (controller-runtime#2583).
  2. We can download and install the binaries into the correct directories ourselves for 1.25. envtest would then just use these binaries directly.
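The second option could look roughly like the sketch below, which fetches a release binary and drops it into the directory envtest reads its binaries from (the directory that `KUBEBUILDER_ASSETS` points at). This is only a sketch under assumptions: the `dl.k8s.io` URL layout, the helper names, and the choice to replace only `kube-apiserver` are mine, and the exact 1.25 patch version that carries the fix is left as a parameter rather than guessed.

```python
import stat
import urllib.request
from pathlib import Path

# Assumption: official release binaries are served from dl.k8s.io using this layout.
BASE_URL = "https://dl.k8s.io"


def binary_url(version: str, name: str, goos: str = "linux", goarch: str = "amd64") -> str:
    """Build the download URL for one release binary (hypothetical helper)."""
    return f"{BASE_URL}/v{version}/bin/{goos}/{goarch}/{name}"


def install_binary(version, name, assets_dir, fetch=urllib.request.urlretrieve) -> Path:
    """Download `name` for `version` into `assets_dir` and mark it executable.

    `fetch` is injectable so the function can be exercised without network access.
    """
    assets_dir = Path(assets_dir)
    assets_dir.mkdir(parents=True, exist_ok=True)
    dest = assets_dir / name
    fetch(binary_url(version, name), str(dest))
    # envtest executes the binary directly, so it must carry the executable bit.
    dest.chmod(dest.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return dest


# Example (not executed here): overwrite only the apiserver in an existing
# envtest assets directory, keeping the etcd binary that setup-envtest fetched.
#   install_binary("<fixed 1.25 patch version>", "kube-apiserver",
#                  os.environ["KUBEBUILDER_ASSETS"])
```

Since the panic originates in the apiserver, replacing that one binary and reusing the etcd binary from setup-envtest is probably enough, but that is an assumption worth verifying in CI.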

For now, we can continue to retry when we see failures on 1.25 since these errors are transient and resolve after some re-runs; however, we should consider fixing this so our CI doesn't become flaky for 1.25 due to this interaction with CEL.
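As a stopgap, the retry approach can be sketched as a small wrapper that re-runs a test command a bounded number of times. The function name, the default attempt count, and the example command are illustrative assumptions, not the project's actual CI configuration.

```python
import subprocess
from typing import Sequence


def run_with_retries(cmd: Sequence[str], attempts: int = 3, run=subprocess.run) -> int:
    """Run `cmd` until it exits 0, at most `attempts` times.

    Returns the 1-based attempt number that succeeded and raises RuntimeError
    if every attempt fails. `run` is injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        result = run(list(cmd))
        if result.returncode == 0:
            return attempt
        print(f"attempt {attempt}/{attempts} failed with exit code {result.returncode}")
    raise RuntimeError(f"command failed after {attempts} attempts: {' '.join(cmd)}")


# Example (not executed here; the command is hypothetical):
#   run_with_retries(["make", "test"], attempts=3)
```

Retrying hides the panic rather than removing it, so a bounded attempt count keeps a genuine regression from being retried indefinitely.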

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@jonathan-innis added the kind/bug, kind/testing, and operational-excellence labels Nov 17, 2023
@jonathan-innis
Member Author

/assign @sadath-12

@k8s-ci-robot
Contributor

@jonathan-innis: GitHub didn't allow me to assign the following users: sadath-12.

Note that only kubernetes-sigs members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time.
For more information please see the contributor guide

In response to this:

/assign @sadath-12

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sadath-12
Contributor

/assign

@jonathan-innis
Member Author

Looks like the binaries are made available here: https://www.downloadkubernetes.com/

@jonathan-innis
Member Author

/assign @jmdeal
