
Fluxd can get stuck if started "too early" after a GKE cluster creation #1855

Closed
primeroz opened this issue Mar 22, 2019 · 7 comments

@primeroz

I am testing the bootstrap of a cluster using flux on GKE; the whole configuration is managed with terraform.

  1. terraform creates the GKE cluster
  2. as soon as the GKE creation returns success, flux is deployed

In this case fluxd seems to fail to fetch some information from the apiserver and gets stuck in a "collating resources in cluster for sync: not found" loop, even though after a while, if it retried to fetch the apiserver info, it would succeed.

The error I can see is ERROR: logging before flag.Parse: E0322 09:47:31.690724 6 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request

ts=2019-03-22T09:47:23.281235971Z caller=main.go:165 version=1.11.0
ts=2019-03-22T09:47:23.317696094Z caller=main.go:295 component=cluster identity=/etc/fluxd/ssh/identity
ts=2019-03-22T09:47:23.317758233Z caller=main.go:296 component=cluster identity.pub="XXXX"
ts=2019-03-22T09:47:23.317785061Z caller=main.go:297 component=cluster host=https://10.0.88.1:443 version=kubernetes-v1.12.5-gke.10
ts=2019-03-22T09:47:23.317855927Z caller=main.go:309 component=cluster kubectl=/usr/local/bin/kubectl
ts=2019-03-22T09:47:23.319794947Z caller=main.go:319 component=cluster ping=true
ts=2019-03-22T09:47:23.375286258Z caller=aws.go:69 component=aws warn="no AWS region configured, or detected as cluster region" err="EC2MetadataError: failed to make EC2Metadata request\ncaused by: This metadata API is not allowed by the metadata proxy.\n"
ts=2019-03-22T09:47:23.375356788Z caller=main.go:344 warning="AWS authorization not used; pre-flight check failed"
ts=2019-03-22T09:47:23.381741151Z caller=main.go:449 url=git@gitlab.com:mintel/satoshi/experimental/gitops-rendered-manifests.git user="Weave Flux" email=support@weave.works signing-key= sync-tag=flux-sync-dev notes-ref=flux-sync-dev set-author=false
ts=2019-03-22T09:47:23.381838588Z caller=main.go:506 upstream="no upstream URL given"
ts=2019-03-22T09:47:23.383853937Z caller=main.go:535 metrics-addr=:3031
ts=2019-03-22T09:47:23.385475596Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-22T09:47:23.385516124Z caller=images.go:27 component=sync-loop msg="no automated services"
ts=2019-03-22T09:47:23.385837708Z caller=loop.go:90 component=sync-loop err="git repo not ready: git repo has not been cloned yet"
ts=2019-03-22T09:47:23.387328006Z caller=main.go:527 addr=:3030
ts=2019-03-22T09:47:24.035019823Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.11.0
ts=2019-03-22T09:47:31.577869464Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.com:mintel/satoshi/experimental/gitops-rendered-manifests.git branch=dev HEAD=4446db6c8225fd18b566219c2a41ce9a23a9d088
ERROR: logging before flag.Parse: E0322 09:47:31.690724       6 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
ts=2019-03-22T09:47:31.701396542Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: not found"
ts=2019-03-22T09:47:31.710863513Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: not found"
ts=2019-03-22T09:48:33.199756991Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: not found"
ts=2019-03-22T09:48:33.208266708Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: not found"
ts=2019-03-22T09:48:33.213800046Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.com:mintel/satoshi/experimental/gitops-rendered-manifests.git branch=dev HEAD=4446db6c8225fd18b566219c2a41ce9a23a9d088
ts=2019-03-22T09:49:34.810247471Z caller=loop.go:210 component=sync-loop err="collating resources in cluster for sync: not found"
ts=2019-03-22T09:49:34.815635198Z caller=loop.go:90 component=sync-loop err="collating resources in cluster for sync: not found"
ts=2019-03-22T09:49:34.819879233Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.com:mintel/satoshi/experimental/gitops-rendered-manifests.git branch=dev HEAD=4446db6c8225fd18b566219c2a41ce9a23a9d088

Deleting the pod let the new instance of flux complete its startup process properly:

ts=2019-03-22T09:49:53.04497357Z caller=main.go:165 version=1.11.0
ts=2019-03-22T09:49:53.085897609Z caller=main.go:295 component=cluster identity=/etc/fluxd/ssh/identity
ts=2019-03-22T09:49:53.085976658Z caller=main.go:296 component=cluster identity.pub="XXXX"
ts=2019-03-22T09:49:53.086178422Z caller=main.go:297 component=cluster host=https://10.0.88.1:443 version=kubernetes-v1.12.5-gke.10
ts=2019-03-22T09:49:53.086253208Z caller=main.go:309 component=cluster kubectl=/usr/local/bin/kubectl
ts=2019-03-22T09:49:53.087752566Z caller=main.go:319 component=cluster ping=true
ts=2019-03-22T09:49:53.090451181Z caller=aws.go:69 component=aws warn="no AWS region configured, or detected as cluster region" err="EC2MetadataError: failed to make EC2Metadata request\ncaused by: This metadata API is not allowed by the metadata proxy.\n"
ts=2019-03-22T09:49:53.090593122Z caller=main.go:344 warning="AWS authorization not used; pre-flight check failed"
ts=2019-03-22T09:49:53.096061273Z caller=main.go:449 url=git@gitlab.com:mintel/satoshi/experimental/gitops-rendered-manifests.git user="Weave Flux" email=support@weave.works signing-key= sync-tag=flux-sync-dev notes-ref=flux-sync-dev set-author=false
ts=2019-03-22T09:49:53.096125961Z caller=main.go:506 upstream="no upstream URL given"
ts=2019-03-22T09:49:53.096645904Z caller=main.go:535 metrics-addr=:3031
ts=2019-03-22T09:49:53.097625224Z caller=loop.go:90 component=sync-loop err="git repo not ready: git repo has not been cloned yet"
ts=2019-03-22T09:49:53.097683867Z caller=images.go:17 component=sync-loop msg="polling images"
ts=2019-03-22T09:49:53.098122543Z caller=images.go:27 component=sync-loop msg="no automated services"
ts=2019-03-22T09:49:53.100502392Z caller=main.go:527 addr=:3030
ts=2019-03-22T09:49:53.829431175Z caller=checkpoint.go:24 component=checkpoint msg="up to date" latest=1.11.0
ts=2019-03-22T09:50:00.97629931Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.com:mintel/satoshi/experimental/gitops-rendered-manifests.git branch=dev HEAD=4446db6c8225fd18b566219c2a41ce9a23a9d088
ts=2019-03-22T09:50:01.832938149Z caller=sync.go:417 component=cluster method=Sync cmd=apply args= count=9
ts=2019-03-22T09:50:02.745412087Z caller=sync.go:483 component=cluster method=Sync cmd="kubectl apply -f -" took=912.159534ms err=null output="namespace/cert-manager created\nnamespace/ingress created\nnamespace/vault created\ncustomresourcedefinition.apiextensions.k8s.io/certificates.certmanager.k8s.io created\ncustomresourcedefinition.apiextensions.k8s.io/challenges.certmanager.k8s.io created\ncustomresourcedefinition.apiextensions.k8s.io/clusterissuers.certmanager.k8s.io created\ncustomresourcedefinition.apiextensions.k8s.io/issuers.certmanager.k8s.io created\ncustomresourcedefinition.apiextensions.k8s.io/orders.certmanager.k8s.io created\ncustomresourcedefinition.apiextensions.k8s.io/vaults.vault.banzaicloud.com created"

Adding a sleep of 1 minute between the GKE cluster creation and the flux deployment creation fixed the problem.

This feels more like a workaround than a fix, though, and it makes me wonder whether other apiserver errors might get flux stuck as well.

@2opremio
Contributor

This feels more like a workaround than a fix, though, and it makes me wonder whether other apiserver errors might get flux stuck as well.

Absolutely. We need to take a look at the code to see why flux doesn't recover from an unavailable API server.

@2opremio 2opremio added the bug label Mar 22, 2019
@2opremio
Contributor

2opremio commented Mar 22, 2019

@primeroz was this systematic?

I think the problem lies in the discovery cache, introduced in 1.11.0, which isn't invalidated unless there is a change in the CRDs. My guess is that, somehow, early in the history of the cluster, a resource kind exists which is removed after bootstrapping.

@primeroz if you are able to reproduce, could you see if creating/deleting a CRD solves the problem?

Running the following should do it:

kubectl create -f https://raw.githubusercontent.com/weaveworks/flux/master/deploy-helm/flux-helm-release-crd.yaml

or

kubectl delete crd helmreleases.flux.weave.works

if the CRD already existed in your cluster.

@squaremo Regardless of whether this is the actual reason, I think we should be invalidating the cache with some regularity. I am not sure how kubernetes bootstraps, but I don't think there are any guarantees of all the resources being made available atomically. If we don't refresh periodically, transient errors can become permanent.
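
For illustration only, a minimal sketch of what periodic invalidation could look like using client-go's cached discovery client (this is not flux's actual code; the package paths assume a recent client-go and the 5-minute interval is arbitrary):

package main

import (
	"log"
	"time"

	"k8s.io/client-go/discovery"
	memcache "k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	// Wrap the discovery client in an in-memory cache (flux 1.11.0 introduced
	// a similar cache, as the memcache.go line in the log above suggests).
	cached := memcache.NewMemCacheClient(dc)

	// Drop the cached discovery data on a timer so that API groups which
	// appear (or become healthy) after startup are eventually re-fetched.
	go func() {
		for range time.Tick(5 * time.Minute) {
			cached.Invalidate()
		}
	}()

	// Any lookup after an Invalidate() hits the API server again instead of
	// reusing the stale result from startup.
	_, lists, err := cached.ServerGroupsAndResources()
	if err != nil {
		log.Printf("discovery error (may be partial): %v", err)
	}
	log.Printf("got %d resource lists", len(lists))
}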

@2opremio 2opremio self-assigned this Mar 22, 2019
@primeroz
Author

primeroz commented Mar 25, 2019

@2opremio
Yeah, it is systematic. I got it 100% of the time if I don't put any sleep (even a shorter one) in my deployment after the GKE cluster is reported up and running by terraform.
It might be that the terraform GKE resource just returns too quickly and the kube-proxy to apiserver iptables load balancing is not yet set up for flux to connect? Even a 10-second sleep seems to make it work.

Refreshing the CRD did indeed fix the issue; on the next scheduled sync the repo manifests were applied.

@2opremio
Contributor

Thanks, that confirms it's a problem with the cached resource kinds. I will work on a PR to refresh them periodically.

@squaremo
Member

I think we should be invalidating the cache with some regularity.

The (client-go) controller already does this (https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L273). I gave it a conservative refresh period of 5 minutes, in https://github.com/weaveworks/flux/blob/master/cluster/kubernetes/cached_disco.go#L99

@2opremio
Contributor

2opremio commented Mar 25, 2019

The (client-go) controller already does this (https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L273). I gave it a conservative refresh period of 5 minutes, in https://github.com/weaveworks/flux/blob/master/cluster/kubernetes/cached_disco.go#L99

That's only for CRDs.

AFAIU, if no CRDs exist there is nothing to resync and the discovery cache won't be invalidated even if resync is called every 5 minutes on the store (which has no events).

@squaremo
Member

AFAIU, if no CRDs exist there is nothing to resync and the discovery cache won't be invalidated

I see what you mean now. Yes, it's only invalidated by a CRD coming or going -- everything else is assumed to be static.
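
As a rough sketch of that wiring (not the actual cached_disco.go; the function name and the use of the v1 CRD API are made up for the example), this is why a cluster without any CRDs never invalidates the cache, even though the informer's resync keeps firing on an empty store:

package crdcache

import (
	"time"

	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apiextensionsinformers "k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/cache"
)

// watchCRDsAndInvalidate drops the cached discovery data whenever a CRD is
// added, updated or deleted. On a cluster with no CRDs none of these
// handlers ever fire, so whatever (possibly partial) resource list was
// cached at startup is kept forever -- the failure mode described above.
func watchCRDsAndInvalidate(client apiextensionsclient.Interface, cached discovery.CachedDiscoveryInterface, stop <-chan struct{}) {
	factory := apiextensionsinformers.NewSharedInformerFactory(client, 5*time.Minute)
	informer := factory.Apiextensions().V1().CustomResourceDefinitions().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { cached.Invalidate() },
		UpdateFunc: func(oldObj, newObj interface{}) { cached.Invalidate() },
		DeleteFunc: func(obj interface{}) { cached.Invalidate() },
	})
	informer.Run(stop)
}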
