Fluxd can get stuck if started "too early" after a GKE cluster creation #1855
Comments
Absolutely. We need to take a look at the code to see why flux doesn't recover from an unavailable API server.
@primeroz was this systematic? I think the problem lies in the discovery cache, introduced in 1.11.0, which isn't invalidated unless there is a change in the CRDs. My guess is that, somehow, early in the history of the cluster, a resource kind exists which is removed after bootstrapping. @primeroz if you are able to reproduce, could you see whether creating and then deleting a CRD solves the problem? If a CRD already exists in your cluster, deleting and re-creating it should work just as well.

@squaremo Regardless of whether this is the actual reason, I think we should be invalidating the cache with some regularity. I am not sure how Kubernetes bootstraps, but I don't think there is any guarantee that all resources are made available atomically. If we don't refresh periodically, transient errors can become permanent.
@2opremio Refreshing the CRD did indeed fix the issue; on the next scheduled sync the repo manifests were applied.
Thanks, that confirms it's a problem with the cached resource kinds. I will work on a PR to refresh them periodically.
The (client-go) controller already does this (https://github.com/kubernetes/client-go/blob/master/tools/cache/controller.go#L273). I gave it a conservative refresh period of 5 minutes, in https://github.com/weaveworks/flux/blob/master/cluster/kubernetes/cached_disco.go#L99
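For context, this is roughly the pattern that linked code describes. The sketch below is illustrative only (recent client-go and apiextensions APIs, made-up names, not flux's actual symbols): a CRD informer with a 5-minute resync period whose handlers invalidate the cached discovery client.

```go
package discoverycache

import (
	"context"
	"time"

	apiextv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/cache"
)

// watchCRDs invalidates the cached discovery client whenever a CRD is added,
// updated, or deleted. The 5-minute resync re-delivers existing CRDs
// periodically, but that only helps if any CRDs exist.
func watchCRDs(client apiextclient.Interface, disco discovery.CachedDiscoveryInterface, stop <-chan struct{}) {
	lw := &cache.ListWatch{
		ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
			return client.ApiextensionsV1().CustomResourceDefinitions().List(context.TODO(), opts)
		},
		WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
			return client.ApiextensionsV1().CustomResourceDefinitions().Watch(context.TODO(), opts)
		},
	}
	invalidate := func(interface{}) { disco.Invalidate() }
	_, controller := cache.NewInformer(lw, &apiextv1.CustomResourceDefinition{},
		5*time.Minute, // conservative resync period
		cache.ResourceEventHandlerFuncs{
			AddFunc:    invalidate,
			DeleteFunc: invalidate,
			UpdateFunc: func(_, obj interface{}) { disco.Invalidate() },
		})
	controller.Run(stop)
}
```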
That's only for CRDs. AFAIU, if no CRDs exist there is nothing to resync and the discovery cache won't be invalidated even if resync is called every 5 minutes on the store (which has no events).
I see what you mean now. Yes, it's only invalidated by a CRD coming or going -- everything else is assumed to be static.
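Since the CRD informer only fires when a CRD actually comes or goes, the periodic refresh proposed above has to invalidate the cache unconditionally. A minimal sketch of that idea, assuming recent client-go (package paths, names, and the 5-minute period are illustrative, not flux's actual code):

```go
package main

import (
	"log"
	"time"

	// in older client-go releases this package lives at k8s.io/client-go/discovery/cached
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// The memory-backed cache serves discovery results until Invalidate is called.
	cachedDisco := memory.NewMemCacheClient(clientset.Discovery())

	// Invalidate on a timer, regardless of CRD events, so API groups that were
	// unavailable at startup are eventually rediscovered.
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			cachedDisco.Invalidate()
		}
	}()

	// ... hand cachedDisco to whatever needs discovery information ...
	select {}
}
```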
I am testing the bootstrap of a GKE cluster using flux; the whole configuration is managed with Terraform.
In this case fluxd seems to fail to fetch some information from the API server and gets stuck in a

`collating resources in cluster for sync: not found`

loop, even though it would succeed if it retried fetching the API server information after a while. The error I can see is:

`ERROR: logging before flag.Parse: E0322 09:47:31.690724 6 memcache.go:147] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request`
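The failing group here, metrics.k8s.io/v1beta1, is served by an aggregated API (typically metrics-server) that is often not ready right after cluster creation. A sketch of how this class of error surfaces to a caller, assuming recent client-go; the handling shown is an illustration, not what fluxd actually does:

```go
package main

import (
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	_, resources, err := clientset.Discovery().ServerGroupsAndResources()
	switch {
	case err == nil:
		log.Printf("discovered %d resource lists", len(resources))
	case discovery.IsGroupDiscoveryFailedError(err):
		// Partial failure: `resources` still holds every group that did respond.
		// The failed groups (e.g. metrics.k8s.io/v1beta1) can be retried later
		// instead of treating the whole discovery result as unusable.
		log.Printf("partial discovery (%d resource lists): %v", len(resources), err)
	default:
		log.Fatalf("discovery failed entirely: %v", err)
	}
}
```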
Deleting the pod let the new instance of flux complete its startup process properly.

Adding a sleep of 1 minute between the GKE cluster creation and the flux deployment creation fixed the problem. This feels more like a workaround than a fix, though, and makes me wonder whether other API server errors might get flux stuck as well.
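For comparison, the same workaround can be expressed as an explicit readiness check rather than a fixed sleep. The sketch below (assumed names and timings, not part of flux or the Terraform setup) polls until the API server can enumerate all of its registered groups before proceeding with the deployment or sync.

```go
package startup

import (
	"fmt"
	"time"

	"k8s.io/client-go/discovery"
)

// waitForCleanDiscovery blocks until every registered API group answers
// discovery, or the timeout expires. This is roughly the condition the
// one-minute sleep between cluster creation and flux deployment approximates.
func waitForCleanDiscovery(d discovery.DiscoveryInterface, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if _, _, err := d.ServerGroupsAndResources(); err == nil {
			return nil
		} else if time.Now().After(deadline) {
			return fmt.Errorf("API discovery still incomplete after %s: %w", timeout, err)
		}
		time.Sleep(5 * time.Second)
	}
}
```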