Transient Failures in CI for K8S_VERSION 1.25 #803
Labels
kind/bug
Categorizes issue or PR as related to a bug.
kind/testing
Issues that involve adding test coverage
operational-excellence
Problem
We are seeing transient CI failures because of a bug in the kube-apiserver: a race in the evaluation of CEL expressions causes the apiserver to panic. This race was first reported here: kubernetes/kubernetes#114661 and was later fixed by a Kubernetes PR here: kubernetes/kubernetes#114857 that bumped the version of the CEL evaluation package where the bug originated: google/cel-go#620.
Kubernetes back-ported the fix into patch releases of the apiserver. These back-ports are linked at the bottom of the originally reported issue.
You can also see these cherry-picks by viewing the Kubernetes changelog for 1.24, 1.25, and 1.26.
setup-envtest does not currently surface every Kubernetes version through the mirror from which it pulls the etcd and apiserver binaries. The mirror can be seen here: https://storage.googleapis.com/kubebuilder-tools. As a result, we will continue to see these transient failures on 1.25 until a newer patch version of these binaries that contains the fix is published there.
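To make the gap concrete, here is a minimal sketch of how one might pick the newest patch release available for a given minor version from a list of published versions. The function names (`parseVersion`, `latestPatch`) and the version list in `main` are hypothetical and for illustration only; the list is made-up sample data, not the actual contents of the mirror.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a "major.minor.patch" string (optionally prefixed
// with "v") into its three integer components.
func parseVersion(v string) (major, minor, patch int, err error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return 0, 0, 0, fmt.Errorf("expected major.minor.patch, got %q", v)
	}
	nums := make([]int, 3)
	for i, p := range parts {
		if nums[i], err = strconv.Atoi(p); err != nil {
			return 0, 0, 0, fmt.Errorf("invalid component %q in %q: %w", p, v, err)
		}
	}
	return nums[0], nums[1], nums[2], nil
}

// latestPatch returns the published version with the highest patch number
// for the requested major.minor, and false if no such version is published.
// Entries that fail to parse are skipped.
func latestPatch(published []string, major, minor int) (string, bool) {
	best, bestPatch := "", -1
	for _, v := range published {
		maj, mnr, patch, err := parseVersion(v)
		if err != nil || maj != major || mnr != minor {
			continue
		}
		if patch > bestPatch {
			best, bestPatch = v, patch
		}
	}
	return best, bestPatch >= 0
}

func main() {
	// Hypothetical sample data, NOT the real mirror contents.
	published := []string{"1.24.2", "1.25.0", "1.25.3", "1.26.1"}

	if v, ok := latestPatch(published, 1, 25); ok {
		fmt.Println("newest 1.25 binary in the sample list:", v)
	} else {
		fmt.Println("no 1.25 binary in the sample list")
	}
}
```

Comparing the result against the first patch release that carries the back-ported fix would tell us whether a given minor version is still exposed to the race.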
We are only seeing these failures on Kubernetes 1.25. On versions below 1.25 we have disabled CEL (the feature was not yet in beta and wasn't enabled on EKS clusters), and on versions above 1.25 the bug is already fixed because setup-envtest has a new enough binary that contains the patch.
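The reasoning above boils down to two conditions that must both hold for the panic to show up. The sketch below encodes them; `canHitCELRace` and the `binaryHasFix` values in `main` are hypothetical names and inputs chosen to mirror the description in this issue, not code from the repository.

```go
package main

import "fmt"

// canHitCELRace reports whether a test run against the given Kubernetes
// minor version (1.<minor>) can hit the apiserver panic described above.
// Both conditions must hold:
//   - CEL validation is enabled in our tests (only for 1.25 and newer), and
//   - the apiserver binary in use predates the back-ported fix.
func canHitCELRace(minor int, binaryHasFix bool) bool {
	celEnabled := minor >= 25
	return celEnabled && !binaryHasFix
}

func main() {
	// binaryHasFix values follow the situation described in this issue;
	// they are inputs to the sketch, not something it detects.
	fmt.Println("1.24:", canHitCELRace(24, false)) // CEL disabled in tests
	fmt.Println("1.25:", canHitCELRace(25, false)) // CEL on, binary unpatched
	fmt.Println("1.26:", canHitCELRace(26, true))  // CEL on, binary patched
}
```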
Solutions
controller-runtime
can support these newer patch versions in their mirror directly. I've opened an issue to ask for this here: Support newer patch versions of Kubernetes in setup-envtest (controller-runtime#2583). envtest would then just use these binaries directly.
For now, we can continue to retry when we see failures on 1.25 since these errors are transient and resolve after some re-runs; however, we should consider fixing this so our CI doesn't become flaky for 1.25 due to this interaction with CEL.