Improve retry behavior for push operation #1578
Merged
Fixes #584
Fixes #1290
Description
We are seeing intermittent failures when pushing the image to the destination. As far as I can tell, the cause is network flakiness. There is a long-standing issue asking for retries for the push operation, so I investigated this.
I am making two improvements in two commits.
Update go-containerregistry to 0.4
I am updating the go-containerregistry library to 0.4, mainly to pick up "Retry registry access on some server errors" (#901). This improves the logic in the library to retry on some 5xx HTTP status codes from the registry. I ran my first tests with this change. I had a registry:2 instance running behind an Apache server. By terminating the registry, I made it return a 503 until it came back after around 15 seconds. I could see the retries happening in the Apache access log.
Anyway, this was not yet good enough for three reasons:
The third item also caused my test above to fail, because my registry needed 15 seconds to restart and the library only retried for 1+3=4 seconds.
Therefore the second extension:
Implement --push-retry argument
This introduces the `--push-retry` argument, which is handled with simple retry logic inside Kaniko. I decided against filtering the error and basically retry everything. My thinking is that Kaniko validates the registry credentials before the build (a great feature, by the way). If this succeeds, then the registry is generally functional. It does not make sense to later add special handling for (maybe non-retryable) things like DNS failures or authentication problems.
The default is `0`, which maintains the existing logic.
Retries happen with exponential delay (1s, 2s, 4s, 8s, 16s, ...).
I repeated my test with `--push-retry 5` and it was successful.
I also improved the logging in push.go to make this transparent.