Fluxd can get stuck if started "too early" after a GKE cluster creation #1855
Comments
Absolutely. We need to take a look at the code to see why flux doesn't recover from an unavailable API server.
@primeroz was this systematic? I think the problem lies in the discovery cache, introduced in 1.11.0, which isn't invalidated unless there is a change in the CRDs. My guess is that, somehow, early in the history of the cluster, a resource kind exists which is removed after bootstrapping. @primeroz if you are able to reproduce, could you see whether creating and then deleting a CRD solves the problem? If a CRD already exists in your cluster, deleting and re-creating it should work just as well.

@squaremo Regardless of whether this is the actual reason, I think we should be invalidating the cache with some regularity. I am not sure how Kubernetes bootstraps, but I don't think there is any guarantee that all resources are made available atomically. If we don't refresh periodically, transient errors can become permanent.
@2opremio Refreshing the CRD did indeed fix the issue; on the next scheduled sync the repo manifests were applied.
Thanks, that confirms it's a problem with the cached resource kinds. I will work on a PR to refresh them periodically.
The (client-go) controller already does this (https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L273). I gave it a conservative refresh period of 5 minutes, in https://github.com/weaveworks/flux/blob/master/cluster/kubernetes/cached_disco.go#L99
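For context, this is roughly the pattern that linked code describes. The sketch below is illustrative only (recent client-go and apiextensions APIs, made-up names, not flux's actual symbols): a CRD informer with a 5-minute resync period whose handlers invalidate the cached discovery client.

```go
package discoverycache

import (
	"context"
	"time"

	apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/cache"
)

// watchCRDs invalidates the cached discovery client whenever a CRD is added,
// updated, or deleted. The 5-minute resync re-delivers existing CRDs
// periodically, but that only helps if any CRDs exist.
func watchCRDs(client apiextclient.Interface, disco discovery.CachedDiscoveryInterface, stop <-chan struct{}) {
	lw := &cache.ListWatch{
		ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
			return client.ApiextensionsV1().CustomResourceDefinitions().List(context.TODO(), opts)
		},
		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			return client.ApiextensionsV1().CustomResourceDefinitions().Watch(context.TODO(), opts)
		},
	}
	invalidate := func(interface{}) { disco.Invalidate() }
	_, controller := cache.NewInformer(lw, &apiextv1.CustomResourceDefinition{},
		5*time.Minute, // conservative resync period
		cache.ResourceEventHandlerFuncs{
			AddFunc:    invalidate,
			DeleteFunc: invalidate,
			UpdateFunc: func(_, obj interface{}) { disco.Invalidate() },
		})
	controller.Run(stop)
}
```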
That's only for CRDs. AFAIU, if no CRDs exist there is nothing to resync and the discovery cache won't be invalidated even if resync is called every 5 minutes on the store (which has no events).
I see what you mean now. Yes, it's only invalidated by a CRD coming or going -- everything else is assumed to be static.
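Since the CRD informer only fires when a CRD actually comes or goes, the periodic refresh proposed above has to invalidate the cache unconditionally. A minimal sketch of that idea, assuming recent client-go (package paths, names, and the 5-minute period are illustrative, not flux's actual code):

```go
package main

import (
	"log"
	"time"

	// in older client-go releases this package lives at k8s.io/client-go/discovery/cached
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// The memory-backed cache serves discovery results until Invalidate is called.
	cachedDisco := memory.NewMemCacheClient(clientset.Discovery())

	// Invalidate on a timer, regardless of CRD events, so API groups that were
	// unavailable at startup are eventually rediscovered.
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			cachedDisco.Invalidate()
		}
	}()

	// ... hand cachedDisco to whatever needs discovery information ...
	select {}
}
```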
I am testing the bootstrap of a GKE cluster using flux; the whole configuration is managed with Terraform.
In this case fluxd seems to fail to fetch some information from the API server and gets stuck in a

`collating resources in cluster for sync: not found`

loop, even though it would succeed if it retried fetching the API server information after a while. The error I can see is:

`ERROR: logging before flag.Parse: E0322 09:47:31.690724 6 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request`
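The failing group here, metrics.k8s.io/v1beta1, is served by an aggregated API (typically metrics-server) that is often not ready right after cluster creation. A sketch of how this class of error surfaces to a caller, assuming recent client-go; the handling shown is an illustration, not what fluxd actually does:

```go
package main

import (
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	_, resources, err := clientset.Discovery().ServerGroupsAndResources()
	switch {
	case err == nil:
		log.Printf("discovered %d resource lists", len(resources))
	case discovery.IsGroupDiscoveryFailedError(err):
		// Partial failure: `resources` still holds every group that did respond.
		// The failed groups (e.g. metrics.k8s.io/v1beta1) can be retried later
		// instead of treating the whole discovery result as unusable.
		log.Printf("partial discovery (%d resource lists): %v", len(resources), err)
	default:
		log.Fatalf("discovery failed entirely: %v", err)
	}
}
```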
Deleting the pod let the new instance of flux complete its startup process properly.

Adding a sleep of 1 minute between the GKE cluster creation and the flux deployment creation fixed the problem. This feels more like a workaround than a fix, though, and makes me wonder whether other API server errors might get flux stuck as well.
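For comparison, the same workaround can be expressed as an explicit readiness check rather than a fixed sleep. The sketch below (assumed names and timings, not part of flux or the Terraform setup) polls until the API server can enumerate all of its registered groups before proceeding with the deployment or sync.

```go
package startup

import (
	"fmt"
	"time"

	"k8s.io/client-go/discovery"
)

// waitForCleanDiscovery blocks until every registered API group answers
// discovery, or the timeout expires. This is roughly the condition the
// one-minute sleep between cluster creation and flux deployment approximates.
func waitForCleanDiscovery(d discovery.DiscoveryInterface, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, _, err := d.ServerGroupsAndResources(); err == nil {
			return nil
		} else if time.Now().After(deadline) {
			return fmt.Errorf("API discovery still incomplete after %s: %w", timeout, err)
		}
		time.Sleep(5 * time.Second)
	}
}
```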