
Retry on TOOMANYREQUESTS error from ECR #750

Merged: 2 commits into google:master on Aug 4, 2020

Conversation

@jchorl (Contributor) commented Aug 3, 2020

What it says on the tin

I made a test Go program here to force 429s. The output was:

...
I0803 03:21:58.695399    1477 main.go:44] body: {"errors":[{"code":"TOOMANYREQUESTS","message":"Rate exceeded"}]}
I0803 03:21:58.695412    1477 main.go:45] temporary: false
...

Quite obviously, go-containerregistry is not treating TOOMANYREQUESTS as temporary, even though it is, in fact, a temporary error. This prevents retries here: the error does implement the temporary interface (here), but Temporary() returns false for this error code, so the retry logic gives up.
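For context, here is a minimal sketch of the kind of test program described above (not the author's actual code; the repository name and random test image are made up): push repeatedly until ECR returns a 429, then check whether the returned transport error reports itself as temporary.

package main

import (
	"errors"
	"log"

	"github.com/google/go-containerregistry/pkg/authn"
	"github.com/google/go-containerregistry/pkg/name"
	"github.com/google/go-containerregistry/pkg/v1/random"
	"github.com/google/go-containerregistry/pkg/v1/remote"
	"github.com/google/go-containerregistry/pkg/v1/remote/transport"
)

func main() {
	// Hypothetical ECR repository; substitute one you can actually push to.
	ref, err := name.ParseReference("123456789012.dkr.ecr.us-east-1.amazonaws.com/test:latest")
	if err != nil {
		log.Fatal(err)
	}
	// A small random image is enough to exercise blob and manifest uploads.
	img, err := random.Image(1024, 1)
	if err != nil {
		log.Fatal(err)
	}
	for {
		err := remote.Write(ref, img, remote.WithAuthFromKeychain(authn.DefaultKeychain))
		if err == nil {
			continue // keep pushing until the registry rate-limits us
		}
		var terr *transport.Error
		if errors.As(err, &terr) {
			// Before this fix, a TOOMANYREQUESTS response reported temporary: false here.
			log.Printf("body: %v", terr)
			log.Printf("temporary: %v", terr.Temporary())
			return
		}
		log.Fatal(err)
	}
}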

The error manifested itself when kaniko was trying to push a bunch of images in parallel:

error pushing image: failed to push to destination <redacted>.dkr.ecr.<redacted>.amazonaws.com/<redacted>: POST https://<redacted>.dkr.ecr.<redacted>.amazonaws.com/v2/<redacted>/blobs/uploads/: TOOMANYREQUESTS: Rate exceeded

The error is explained in the registry spec here: https://github.com/docker/distribution/blob/master/docs/spec/api.md#errors-2

Anyway, this change causes go-containerregistry to treat a 429 as a retryable error.
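For readers skimming this, the change boils down to something like the following in pkg/v1/remote/transport/error.go (a sketch of the idea, not the exact diff; the temporaryErrorCodes name and the exact set of other codes are illustrative): declare the TOOMANYREQUESTS error code and include it in the set of codes that Temporary() considers retryable.

// ErrorCode, Diagnostic, and Error already exist in this package; this sketch
// only shows the pieces relevant to treating 429s as retryable.

// TooManyRequestsErrorCode is what registries such as ECR return when the
// client is being rate limited (HTTP 429).
const TooManyRequestsErrorCode ErrorCode = "TOOMANYREQUESTS"

// temporaryErrorCodes holds registry error codes that are worth retrying.
var temporaryErrorCodes = map[ErrorCode]struct{}{
	BlobUploadInvalidErrorCode: {},
	TooManyRequestsErrorCode:   {},
}

// Temporary reports whether every diagnostic in the response carries a
// retryable error code.
func (e *Error) Temporary() bool {
	if len(e.Errors) == 0 {
		return false
	}
	for _, d := range e.Errors {
		if _, ok := temporaryErrorCodes[d.Code]; !ok {
			return false
		}
	}
	return true
}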

@codecov-commenter commented Aug 3, 2020

Codecov Report

Merging #750 into master will not change coverage.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #750   +/-   ##
=======================================
  Coverage   79.06%   79.06%           
=======================================
  Files         102      102           
  Lines        4725     4725           
=======================================
  Hits         3736     3736           
  Misses        548      548           
  Partials      441      441           
Impacted Files Coverage Δ
pkg/v1/remote/transport/error.go 100.00% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 38ad4ec...3dbc489.

@jonjohnsonjr (Collaborator) left a comment


I think it's fine to treat 429 as just temporary errors for now, but it might make sense to have different back-off/throttling behavior for 429, since it's an explicit signal that we're sending too much traffic, not just a temporary failure. Just something to keep in mind if you're ever feeling ambitious :)
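Not something this PR does, but as an illustration of the 429-specific handling being suggested here, a backoff helper could honour the server's Retry-After hint before falling back to the usual exponential delays. A standard-library sketch; backoffFor is a made-up name, and the 1s/3s/9s fallback is assumed from the existing behaviour jchorl mentions below:

package retryhint

import (
	"math"
	"net/http"
	"strconv"
	"time"
)

// backoffFor is a sketch (not code from this PR): for a 429, honour the
// server's Retry-After header if present; otherwise fall back to exponential
// backoff (1s, 3s, 9s for attempts 0, 1, 2).
func backoffFor(resp *http.Response, attempt int) time.Duration {
	if resp != nil && resp.StatusCode == http.StatusTooManyRequests {
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil {
				return time.Duration(secs) * time.Second
			}
		}
	}
	return time.Duration(math.Pow(3, float64(attempt))) * time.Second
}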

Inline review comment on pkg/v1/remote/transport/error.go (outdated, resolved).
@jchorl (Contributor, Author) commented Aug 4, 2020

Agreed, backoff behaviour would be nice. The current retry behaviour waits 1/3/9(?) seconds, which might deal with overloading a registry. I'd also be in favour of bumping those timeouts. Having different exponential backoff behaviour depending on the error would... be ambitious ;)

@jonjohnsonjr (Collaborator)
Let's just stick with this and revisit something more sophisticated if you're still seeing issues.

@jonjohnsonjr merged commit b0d31a1 into google:master on Aug 4, 2020