
Improve retry behavior for push operation #1578

Merged (2 commits) on Feb 23, 2021
Conversation

SaschaSchwarze0
Contributor

Fixes #584
Fixes #1290

Description

We are facing intermittent issues pushing the image to the destination. The cause is, as far as I can tell, network flakiness. There is a long-standing issue asking for retries for the push operation, so I investigated this.

I am making two improvements in two commits.

Update go-containerregistry to 0.4

I am updating the go-containerregistry library to 0.4, mainly to pick up "Retry registry access on some server errors" #901. This improves the logic in the library to retry on some 5xx HTTP status codes from the registry. With this change, I made my first tests. I had a registry:2 instance running behind an Apache. By terminating the registry, I caused it to return a 503 until it came back after around 15 seconds. I was able to see the retries happening in the Apache access log.
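As a rough illustration, retrying "on some 5xx HTTP status codes" boils down to a predicate over the response status. The sketch below is hypothetical, assuming a typical set of retriable server errors; it does not reproduce go-containerregistry's actual code:

```go
package main

import (
	"fmt"
	"net/http"
)

// retriable reports whether an HTTP status code is worth retrying.
// The chosen set (500, 502, 503, 504) is an assumption for
// illustration, not the library's exact list.
func retriable(status int) bool {
	switch status {
	case http.StatusInternalServerError, // 500
		http.StatusBadGateway,         // 502
		http.StatusServiceUnavailable, // 503
		http.StatusGatewayTimeout:     // 504
		return true
	}
	return false
}

func main() {
	fmt.Println(retriable(http.StatusServiceUnavailable)) // a 503 triggers a retry
	fmt.Println(retriable(http.StatusNotFound))           // a 404 does not
}
```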

Anyway, this was not yet good enough for three reasons:

  1. go-containerregistry only retries on some HTTP status codes from the remote endpoint. I did not find any code that would retry failures on the network layer, where one for example fails to connect to the registry, fails during the upload of a layer, or does not receive a (complete) response.
  2. go-containerregistry does not retry all HTTP calls, only those that push the layer. For example, the code that checks whether a layer already exists is not retried, see "Add retry for checkExistingBlob in uploadOne" #618.
  3. go-containerregistry "only" performs three tries, with delays of one and three seconds, and without a way to configure this.

The third item also caused my above test to fail, because my registry needed 15 seconds to restart while the library retried for only 1+3=4 seconds.
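The arithmetic behind this is simple: a fixed 1s + 3s schedule only covers 4 seconds of outage in total. A tiny sketch (the schedule values come from the description above; the function name is made up):

```go
package main

import "fmt"

// totalDelay sums a fixed schedule of retry delays in seconds.
// go-containerregistry 0.4 effectively waits 1s, then 3s, between
// its three attempts, so any outage longer than 4s exhausts them.
func totalDelay(delays []int) int {
	sum := 0
	for _, d := range delays {
		sum += d
	}
	return sum
}

func main() {
	fmt.Println(totalDelay([]int{1, 3})) // prints 4, well short of a 15s outage
}
```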

Therefore the second extension:

Implement --push-retry argument

This introduces the --push-retry argument, which is handled with a simple retry loop inside Kaniko. I decided against filtering errors and instead retry on everything. My thinking is that Kaniko validates the registry credentials before the build (a great feature, btw). If that succeeds, the registry is in general functional, and it does not make sense to later add special handling for (maybe non-retryable) things like DNS failures or authentication problems.

The default is 0 which maintains the existing logic.

Retries happen with an exponential delay (1s, 2s, 4s, 8s, 16s, ...).

I repeated my test specifying --push-retry 5, and this time it was successful.

I also improved the logging in push.go to make this transparent.

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you
review them:

  • Includes unit tests.
  • Adds integration tests if needed.

Reviewer Notes

  • The code flow looks good.
  • Unit tests and/or integration tests added.

Release Notes

- Update go-containerregistry to pick up retry improvements
- Add new flag `--push-retry` to enable retries of the push operation

google-cla bot added the "cla: yes (CLA signed by all commit authors)" label on Feb 15, 2021
tejal29 merged commit 69f942f into GoogleContainerTools:master on Feb 23, 2021
no-reply pushed a commit to surfliner/surfliner-mirror that referenced this pull request on Mar 16, 2022:
Setting the value to `3` initially, which we can adjust as needed.

This feature was introduced relatively recently, in the following PR:

GoogleContainerTools/kaniko#1578

Reason for feature:

> We are facing intermittent issues to push the image to the destination. Cause is as far as I can tell network flakeness. There is a long-standing issue asking for retries for the push operation, so I investigated this.

I recently noticed this error that happened to coincide with 5
dependency MRs from Renovate all coming in at once.

This user reports a similar experience with Kaniko/GitLab:

GoogleContainerTools/kaniko#584 (comment)

The hope is that using the `--push-retry` option will make Kaniko try the
push a few more times and avoid the need for us to manually retry the
`build` job to get it to pass.
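For illustration, a CI job might invoke the executor roughly like this; the context, Dockerfile path, and destination variables are placeholders, not taken from the actual surfliner configuration:

```shell
# Hypothetical GitLab CI job step using Kaniko's executor image.
# Only --push-retry comes from this PR; everything else is standard usage.
/kaniko/executor \
  --context "$CI_PROJECT_DIR" \
  --dockerfile "$CI_PROJECT_DIR/Dockerfile" \
  --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA" \
  --push-retry 3
```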