
Registry cache warmer stops logging #1940

Closed
brantb opened this issue Apr 15, 2019 · 17 comments · Fixed by #1970
@brantb
Contributor

brantb commented Apr 15, 2019

We observed a scenario where the registry cache warmer stopped functioning and Flux would no longer automatically detect or apply updated tags.

We seem to be hitting Azure Container Registry's rate limit fairly hard, with a lot of rate limit errors being logged. However, at some point, the warmer component stopped logging entirely.

The relevant part of the log output is below:

ts=2019-04-15T11:36:24.486593909Z caller=warming.go:268 component=warmer info="refreshing image" image=foobar.azurecr.io/image tag_count=516 to_update=152 of_which_refresh=152 of_which_missing=0 
ts=2019-04-15T11:36:25.113396872Z caller=warming.go:309 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } warn="aborting image tag fetching due to rate limiting, will try again later" 
ts=2019-04-15T11:36:35.105444106Z caller=warming.go:313 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } err="toomanyrequests: too many requests" ref=foobar.azurecr.io/image:tag 
[~50 similar lines snipped]
ts=2019-04-15T11:36:35.201179999Z caller=warming.go:364 component=warmer updated=foobar.azurecr.io/image successful=55 attempted=152 
ts=2019-04-15T11:36:35.867880443Z caller=warming.go:268 component=warmer info="refreshing image" image=foobar.azurecr.io/image tag_count=545 to_update=208 of_which_refresh=208 of_which_missing=0 
ts=2019-04-15T11:36:36.487968292Z caller=warming.go:309 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } warn="aborting image tag fetching due to rate limiting, will try again later" 
[~30 similar lines snipped]
ts=2019-04-15T11:41:22.120128959Z caller=images.go:17 component=sync-loop msg="polling images" 
ts=2019-04-15T11:46:23.291988006Z caller=images.go:17 component=sync-loop msg="polling images" 
ts=2019-04-15T11:51:24.495635836Z caller=images.go:17 component=sync-loop msg="polling images" 
[Previous line repeated every --registry-poll-interval (5m) until flux was manually restarted]
ts=2019-04-15T16:38:32.956210183Z caller=loop.go:90 component=sync-loop err="git repo not ready: git repo has not been cloned yet" 
ts=2019-04-15T16:38:35.026483907Z caller=warming.go:268 component=warmer info="refreshing image" image=foobar.azurecr.io/image tag_count=41 to_update=41 of_which_refresh=0 of_which_missing=41 
ts=2019-04-15T16:38:45.376898488Z caller=warming.go:313 component=warmer canonical_name=foobar.azurecr.io/image auth= { map[] } err="toomanyrequests: too many requests" ref=foobar.azurecr.io/image:tag 
[... more rate limit spam ...]

A discussion in #flux revealed that the memcached cache is full due to the large number of tags being scanned, which might be contributing to the issue. (Possibly related: #1939)

@hiddeco
Member

hiddeco commented Apr 15, 2019

I have a theory that ACR might (automatically) null route you after repeatedly hitting their HTTP 429 rate limit wall. If this is indeed the case, my suspicion is that a timeout occurs on the connection Flux attempts to make, in which case Flux falls into a black hole of not logging anything.

https://github.com/weaveworks/flux/blob/4feee5f4671a91151ddc0089c4c95769bd652426/registry/cache/warming.go#L301-L304

You restarting the pod probably meant Flux received a new outbound IP address with no null route attached to it.
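
For illustration only (this is not Flux's code, and `registry.example` is a placeholder host): a Go HTTP client left at its zero-value timeout will wait on a stalled connection indefinitely, which would produce exactly this kind of silence, whereas a bounded client turns the hang into an error that can be logged and retried on the next warming cycle.

```go
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	// The zero value means Timeout is 0, i.e. no limit: if the registry accepts
	// the connection but never answers (or the response stalls midway), the
	// request blocks forever and nothing after it is ever logged.
	hung := &http.Client{}
	_ = hung

	// Giving the client a deadline turns a hang into an error we can log.
	bounded := &http.Client{Timeout: 30 * time.Second}
	resp, err := bounded.Get("https://registry.example/v2/some-image/tags/list")
	if err != nil {
		log.Printf("tag fetch failed: %v", err)
		return
	}
	defer resp.Body.Close()
	log.Printf("registry answered with status %s", resp.Status)
}
```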

@2opremio 2opremio added the bug label Apr 16, 2019
@2opremio
Contributor

at some point, the warmer component stopped logging entirely

Nothing, not even "refreshing image" logs?

If that's the case, I doubt it's the lack of timeout logs, because we only omit the "refreshing image" logs when we don't find images to update.

My hunch is that, somehow, imagesToFetchFunc func() registry.ImageCreds ends up silently returning an empty map.

@2opremio
Contributor

My hunch is that, somehow, imagesToFetchFunc func() registry.ImageCreds ends up silently returning an empty map.

Upon further reading, this seems highly unlikely.

@2opremio
Contributor

Maybe related? #1808

@2opremio
Contributor

@brantb have you seen this again recently?

@2opremio
Contributor

2opremio commented Apr 19, 2019

I've just seen that the image repository client we use:

  1. Completely ignores the contexts being passed. For instance, for the tag request, the parameters are not used at all.
  2. Sets an implicit timeout of zero in the client it uses.

So, I think that as soon as a request hangs, the warmer will get stuck. I believe that's what's happening both here and in #1808. I bet it only happens with ACR because other registry providers don't leave the connection open. It may very well be happening in combination with rate limiting.

There are no new library updates (master still has this problem).
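
As a point of reference, one generic pattern for bounding a call into a library that ignores its context argument is to race the call against a deadline. The sketch below is only an illustration under that assumption: `listTags` is a hypothetical stand-in for the registry client call, and this is not Flux's code nor necessarily how #1970 implements the fix.

```go
package registryutil

import (
	"context"
	"fmt"
	"time"
)

// listTags is a hypothetical stand-in for the underlying registry client call,
// which may hang indefinitely because it ignores its context and sets no timeout.
func listTags(repo string) ([]string, error) {
	// A real implementation would talk to the registry here.
	return nil, nil
}

// listTagsWithTimeout races the (possibly hanging) call against a deadline,
// so the warming loop can log an error and move on instead of getting stuck.
func listTagsWithTimeout(ctx context.Context, repo string, timeout time.Duration) ([]string, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	type result struct {
		tags []string
		err  error
	}
	// Buffered so the goroutine can finish and exit even after we give up waiting.
	ch := make(chan result, 1)

	go func() {
		tags, err := listTags(repo)
		ch <- result{tags, err}
	}()

	select {
	case r := <-ch:
		return r.tags, r.err
	case <-ctx.Done():
		// The goroutine may still be blocked on the hung request, but the
		// caller is no longer stuck behind it.
		return nil, fmt.Errorf("listing tags for %s: %v", repo, ctx.Err())
	}
}
```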

@2opremio
Contributor

2opremio commented Apr 19, 2019

Not that we do anything with the context, other than setting it to Background().

@alexanderbuhler

alexanderbuhler commented Apr 23, 2019

@2opremio I may have a similar issue, but not due to rate limiting. At least I can't see anything related to that in the logs.

At some point the warmer just disappears from the logs. After that, new images aren't refreshed or pulled.
Restarting the Flux pod helps with this problem, until the warmer goes dark again at some point.

Last lines of the logs before the warmer went dark. It was in the middle of the night...


| Time | log |
| -- | -- |
| April 19th 2019, 01:02:34.532 | ts=2019-04-18T23:02:34.528103926Z caller=release.go:149 component=release info="processing release nginx-ingress (as nginx-ingress)" action=CREATE options="{DryRun:false ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.964 | ts=2019-04-18T23:02:33.964707525Z caller=release.go:149 component=release info="processing release metrics-prometheus-operator (as 4e241c51-4a43-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.707 | ts=2019-04-18T23:02:33.70595772Z caller=release.go:149 component=release info="processing release external-dns-controller (as b9a751b3-509b-11e9-b3ea-0a580af420b3)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.551 | ts=2019-04-18T23:02:33.550568905Z caller=release.go:149 component=release info="processing release es-kibana (as cc3760f1-4998-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.404 | ts=2019-04-18T23:02:33.40450279Z caller=release.go:149 component=release info="processing release es-fluentd (as cbf7704c-4998-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:33.246 | ts=2019-04-18T23:02:33.241983667Z caller=release.go:149 component=release info="processing release es-elasticsearch (as cbbdd4e8-4998-11e9-b46a-0a580af41b0e)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:31.922 | ts=2019-04-18T23:02:31.922166361Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:02:31.576 | ts=2019-04-18T23:02:31.575803749Z caller=release.go:149 component=release info="processing release cert-manager (as fd70c5fc-4c99-11e9-8899-0a580af40bd9)" action=CREATE options="{DryRun:true ReuseName:false}" timeout=300s |
| April 19th 2019, 01:02:28.272 | ts=2019-04-18T23:02:28.271622601Z caller=chartsync.go:250 component=chartsync info="Start of releasesync" |
| April 19th 2019, 01:02:16.660 | ts=2019-04-18T23:02:16.651435655Z caller=warming.go:192 component=warmer canonical_name=quay.io/weaveworks/helm-operator auth={map[]} err="requesting tags: Get https://quay.io/v2/weaveworks/helm-operator/tags/list: unknown: Namespace weaveworks has been disabled. Please contact a system administrator." |
| April 19th 2019, 01:01:54.226 | ts=2019-04-18T23:01:54.226342097Z caller=loop.go:103 component=sync-loop event=refreshed url=git@gitlab.com:one.exp/infrastructure/kubernetes-configuration.git branch=master HEAD=0f391b41919bd77fbdacf4980bd9eb51d3bd20a9 |
| April 19th 2019, 01:01:46.181 | ts=2019-04-18T23:01:46.181189434Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:01:46.172 | ts=2019-04-18T23:01:46.171809346Z caller=warming.go:364 component=warmer updated=quay.io/calico/node successful=1 attempted=1 |
| April 19th 2019, 01:01:45.664 | ts=2019-04-18T23:01:45.663611449Z caller=warming.go:268 component=warmer info="refreshing image" image=quay.io/calico/node tag_count=2214 to_update=1 of_which_refresh=0 of_which_missing=1 |
| April 19th 2019, 01:01:11.329 | ts=2019-04-18T23:01:11.329118355Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:01:10.287 | ts=2019-04-18T23:01:10.286713882Z caller=warming.go:192 component=warmer canonical_name=quay.io/weaveworks/helm-operator auth={map[]} err="requesting tags: Get https://quay.io/v2/weaveworks/helm-operator/tags/list: unknown: Namespace weaveworks has been disabled. Please contact a system administrator." |
| April 19th 2019, 01:00:35.831 | ts=2019-04-18T23:00:35.830877071Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:00:24.521 | ts=2019-04-18T23:00:24.519729915Z caller=images.go:18 component=sync-loop msg="polling images" |
| April 19th 2019, 01:00:24.521 | ts=2019-04-18T23:00:24.518359858Z caller=warming.go:364 component=warmer updated=quay.io/calico/node successful=3 attempted=3 |
| April 19th 2019, 01:00:24.010 | ts=2019-04-18T23:00:24.009899617Z caller=warming.go:268 component=warmer info="refreshing image" image=quay.io/calico/node tag_count=2213 to_update=3 of_which_refresh=0 of_which_missing=3 |

@alexanderbuhler

alexanderbuhler commented Apr 24, 2019

Another update.
Yesterday it stopped logging after trying to refresh an image with rather a lot of tags:

ts=2019-04-23T08:22:38.023090068Z caller=warming.go:268 component=warmer info="refreshing image" image=datawire/telepresence-k8s tag_count=575 to_update=575 of_which_refresh=0 of_which_missing=575

@2opremio
Contributor

I think I may have fixed the problem in #1970

@alexanderbuhler / @brantb Would you be able to confirm by running docker.io/2opremio/flux:set-warmer-timeouts-496b1c91?

@alexanderbuhler

@2opremio great work, thanks. I applied your image and will check back after some monitoring.

@2opremio
Contributor

Perfect, thanks a lot!

@2opremio
Contributor

Reopening until we get confirmation of #1970 fixing the problem

@2opremio 2opremio reopened this Apr 24, 2019
@alexanderbuhler

@2opremio I already saw a new Flux release containing this fix, but I'm confirming the warmer still works with our setup after >24 hours. I think we can close this. Thanks again!

@2opremio
Contributor

Great! Just to be sure, did it always take less than 24 hours for the problem to happen?

@alexanderbuhler

@2opremio Not always, but mostly under 24 hours. Now, after almost two days, the warmer is still working.

@2opremio
Contributor

Great, closing. Let me know if it happens again.
