
Registry cache warmer stops logging #1940

Closed
brantb opened this issue Apr 15, 2019 · 17 comments · Fixed by #1970
@brantb
Contributor

brantb commented Apr 15, 2019

We observed a scenario where the registry cache warmer stopped functioning and Flux would no longer automatically detect or apply updated tags.

We seem to be hitting Azure Container Registry's rate limit fairly hard, with a lot of rate limit errors being logged. However, at some point, the warmer component stopped logging entirely.

The relevant part of the log output is below:

ts=2019-04-15T11:36:24.486593909Z caller=warming.go:268 component=warmer info="refreshing image" image=foobar.azurecr.io/image tag_count=516 to_update=152 of_which_refresh=152 of_which_missing=0 
ts=2019-04-15T11:36:25.113396872Z caller=warming.go:309 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } warn="aborting image tag fetching due to rate limiting, will try again later" 
ts=2019-04-15T11:36:35.105444106Z caller=warming.go:313 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } err="toomanyrequests: too many requests" ref=foobar.azurecr.io/image:tag 
[~50 similar lines snipped]
ts=2019-04-15T11:36:35.201179999Z caller=warming.go:364 component=warmer updated=foobar.azurecr.io/image successful=55 attempted=152 
ts=2019-04-15T11:36:35.867880443Z caller=warming.go:268 component=warmer info="refreshing image" image=foobar.azurecr.io/image tag_count=545 to_update=208 of_which_refresh=208 of_which_missing=0 
ts=2019-04-15T11:36:36.487968292Z caller=warming.go:309 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } warn="aborting image tag fetching due to rate limiting, will try again later" 
[~30 similar lines snipped]
ts=2019-04-15T11:41:22.120128959Z caller=images.go:17 component=sync-loop msg="polling images" 
ts=2019-04-15T11:46:23.291988006Z caller=images.go:17 component=sync-loop msg="polling images" 
ts=2019-04-15T11:51:24.495635836Z caller=images.go:17 component=sync-loop msg="polling images" 
[Previous line repeated every --registry-poll-interval (5m) until flux was manually restarted]
ts=2019-04-15T16:38:32.956210183Z caller=loop.go:90 component=sync-loop err="git repo not ready: git repo has not been cloned yet" 
ts=2019-04-15T16:38:35.026483907Z caller=warming.go:268 component=warmer info="refreshing image" image=foobar.azurecr.io/image tag_count=41 to_update=41 of_which_refresh=0 of_which_missing=41 
ts=2019-04-15T16:38:45.376898488Z caller=warming.go:313 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } err="toomanyrequests: too many requests" ref=foobar.azurecr.io/image:tag 
[... more rate limit spam ...]

A discussion in #flux revealed that the memcached cache is full due to the large number of tags being scanned, which might be contributing to the issue. (Possibly related: #1939)

@hiddeco
Member

hiddeco commented Apr 15, 2019

I have a theory that ACR might (automatically) null route you after repeatedly hitting their HTTP 429 rate limit wall. If this is indeed the case, my suspicion is that a timeout occurs on the connection Flux attempts to make, in which case Flux falls into a black hole of not logging anything.

https://github.com/weaveworks/flux/blob/4feee5f4671a91151ddc0089c4c95769bd652426/registry/cache/warming.go#L301-L304

You restarting the pod probably meant Flux received a new outbound IP address with no null route attached to it.
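
For illustration only (this is not Flux's code, and `registry.example` is a placeholder host): a Go HTTP client left at its zero-value timeout will wait on a stalled connection indefinitely, which would produce exactly this kind of silence, whereas a bounded client turns the hang into an error that can be logged and retried on the next warming cycle.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// The zero value means Timeout is 0, i.e. no limit: if the registry accepts
	// the connection but never answers (or the response stalls midway), the
	// request blocks forever and nothing after it is ever logged.
	hung := &http.Client{}
	_ = hung

	// Giving the client a deadline turns a hang into an error we can log.
	bounded := &http.Client{Timeout: 30 * time.Second}
	resp, err := bounded.Get("https://registry.example/v2/some-image/tags/list")
	if err != nil {
		log.Printf("tag fetch failed: %v", err)
		return
	}
	defer resp.Body.Close()
	log.Printf("registry answered with status %s", resp.Status)
}
```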

@2opremio 2opremio added the bug label Apr 16, 2019
@2opremio
Contributor

at some point, the warmer component stopped logging entirely

Nothing, not even "refreshing image" logs?

If that's the case, I doubt it's the lack of timeout logs, because we only omit the "refreshing image" logs when we don't find images to update.

My hunch is that, somehow, imagesToFetchFunc func() registry.ImageCreds ends up silently returning an empty map.

@2opremio
Contributor

My hunch is that, somehow, imagesToFetchFunc func() registry.ImageCreds ends up silently returning an empty map.

Upon further reading, this seems highly unlikely.

@2opremio
Contributor

Maybe related? #1808

@2opremio
Contributor

@brantb have you seen this again recently?

@2opremio
Contributor

2opremio commented Apr 19, 2019

I've just seen that the image repository client we use:

  1. Completely ignores the contexts being passed. For instance, for the tag request, the parameters are not used at all.
  2. Sets an implicit timeout of zero in the client it uses.

So, I think that as soon as a request hangs, the warmer will get stuck. I believe that's what's happening both here and in #1808. I bet it only happens with ACR because other registry providers don't leave the connection open. It may very well be happening in combination with rate limiting.

There are no new library updates (master still has this problem).
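
As a point of reference, one generic pattern for bounding a call into a library that ignores its context argument is to race the call against a deadline. The sketch below is only an illustration under that assumption: `listTags` is a hypothetical stand-in for the registry client call, and this is not Flux's code nor necessarily how #1970 implements the fix.

```go
package registryutil

import (
	"context"
	"fmt"
	"time"
)

// listTags is a hypothetical stand-in for the underlying registry client call,
// which may hang indefinitely because it ignores its context and sets no timeout.
func listTags(repo string) ([]string, error) {
	// A real implementation would talk to the registry here.
	return nil, nil
}

// listTagsWithTimeout races the (possibly hanging) call against a deadline,
// so the warming loop can log an error and move on instead of getting stuck.
func listTagsWithTimeout(ctx context.Context, repo string, timeout time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	type result struct {
		tags []string
		err  error
	}
	// Buffered so the goroutine can finish and exit even after we give up waiting.
	ch := make(chan result, 1)

	go func() {
		tags, err := listTags(repo)
		ch <- result{tags, err}
	}()

	select {
	case r := <-ch:
		return r.tags, r.err
	case <-ctx.Done():
		// The goroutine may still be blocked on the hung request, but the
		// caller is no longer stuck behind it.
		return nil, fmt.Errorf("listing tags for %s: %v", repo, ctx.Err())
	}
}
```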

@2opremio
Contributor

2opremio commented Apr 19, 2019

Not that we do anything with the context, other than setting it to Background().

@alexanderbuhler

alexanderbuhler commented Apr 23, 2019

@2opremio I may have a similar issue, but not due to rate limiting. At least I can't see anything related to that in the logs.

At some point the warmer just disappears from the logs. After that, new images aren't refreshed or pulled.
Restarting the Flux pod helps with this problem, until the warmer goes dark again at some point.

Last lines of the logs before the warmer went dark. It was in the middle of the night...


| Time | log |
| -- | -- |
| April 19th 2019, 01:02:34.532 | ts=2019-04-18T23:02:34.528103926Z caller=release.go:149 component=release info="processing release nginx-ingress (as nginx-ingress)" action=CREATE options="{DryRun:false ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.964 | ts=2019-04-18T23:02:33.964707525Z caller=release.go:149 component=release info="processing release metrics-prometheus-operator (as 4e241c51-4a43-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.707 | ts=2019-04-18T23:02:33.70595772Z caller=release.go:149 component=release info="processing release external-dns-controller (as b9a751b3-509b-11e9-b3ea-0a580af420b3)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.551 | ts=2019-04-18T23:02:33.550568905Z caller=release.go:149 component=release info="processing release es-kibana (as cc3760f1-4998-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.404 | ts=2019-04-18T23:02:33.40450279Z caller=release.go:149 component=release info="processing release es-fluentd (as cbf7704c-4998-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.246 | ts=2019-04-18T23:02:33.241983667Z caller=release.go:149 component=release info="processing release es-elasticsearch (as cbbdd4e8-4998-11e9-b46a-0a580af41b0e)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:31.922 | ts=2019-04-18T23:02:31.922166361Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:02:31.576 | ts=2019-04-18T23:02:31.575803749Z caller=release.go:149 component=release info="processing release cert-manager (as fd70c5fc-4c99-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:28.272 | ts=2019-04-18T23:02:28.271622601Z caller=chartsync.go:250 component=chartsync info="Start of releasesync" |
| April 19th 2019, 01:02:16.660 | ts=2019-04-18T23:02:16.651435655Z caller=warming.go:192 component=warmer canonical_name=quay.io/weaveworks/helm-operator auth={map[]} err="requesting tags: Get https://quay.io/v2/weaveworks/helm-operator/tags/list: unknown: Namespace weaveworks has been disabled. Please contact a system administrator." |
| April 19th 2019, 01:01:54.226 | ts=2019-04-18T23:01:54.226342097Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.com:one.exp/infrastructure/kubernetes-configuration.git branch=master HEAD=0f391b41919bd77fbdacf4980bd9eb51d3bd20a9 |
| April 19th 2019, 01:01:46.181 | ts=2019-04-18T23:01:46.181189434Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:01:46.172 | ts=2019-04-18T23:01:46.171809346Z caller=warming.go:364 component=warmer updated=quay.io/calico/node successful=1 attempted=1 |
| April 19th 2019, 01:01:45.664 | ts=2019-04-18T23:01:45.663611449Z caller=warming.go:268 component=warmer info="refreshing image" image=quay.io/calico/node tag_count=2214 to_update=1 of_which_refresh=0 of_which_missing=1 |
| April 19th 2019, 01:01:11.329 | ts=2019-04-18T23:01:11.329118355Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:01:10.287 | ts=2019-04-18T23:01:10.286713882Z caller=warming.go:192 component=warmer canonical_name=quay.io/weaveworks/helm-operator auth={map[]} err="requesting tags: Get https://quay.io/v2/weaveworks/helm-operator/tags/list: unknown: Namespace weaveworks has been disabled. Please contact a system administrator." |
| April 19th 2019, 01:00:35.831 | ts=2019-04-18T23:00:35.830877071Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:00:24.521 | ts=2019-04-18T23:00:24.519729915Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:00:24.521 | ts=2019-04-18T23:00:24.518359858Z caller=warming.go:364 component=warmer updated=quay.io/calico/node successful=3 attempted=3 |
| April 19th 2019, 01:00:24.010 | ts=2019-04-18T23:00:24.009899617Z caller=warming.go:268 component=warmer info="refreshing image" image=quay.io/calico/node tag_count=2213 to_update=3 of_which_refresh=0 of_which_missing=3 |

@alexanderbuhler

alexanderbuhler commented Apr 24, 2019

Another update.
Yesterday it stopped logging after trying to refresh an image with rather a lot of tags:

ts=2019-04-23T08:22:38.023090068Z caller=warming.go:268 component=warmer info="refreshing image" image=datawire/telepresence-k8s tag_count=575 to_update=575 of_which_refresh=0 of_which_missing=575

@2opremio
Contributor

I think I may have fixed the problem in #1970

@alexanderbuhler / @brantb Would you be able to confirm by running docker.io/2opremio/flux:set-warmer-timeouts-496b1c91?

@alexanderbuhler

@2opremio great work, thanks. I applied your image and will check back after some monitoring.

@2opremio
Contributor

Perfect, thanks a lot!

@2opremio
Contributor

Reopening until we get confirmation of #1970 fixing the problem

@2opremio 2opremio reopened this Apr 24, 2019
@alexanderbuhler

@2opremio I already saw a new Flux release containing this fix, but I'm confirming the warmer still works with our setup after >24 hours. I think we can close this. Thanks again!

@2opremio
Contributor

Great! Just to be sure, did it always take less than 24 hours for the problem to happen?

@alexanderbuhler

@2opremio Not always, but mostly under 24 hours. Now, after almost two days, the warmer is still working.

@2opremio
Contributor

Great, closing. Let me know if it happens again.
