
Add a retry for the image pull during Preflight execution #785

Closed
tkrishtop opened this issue Sep 20, 2022 · 5 comments · Fixed by #792
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@tkrishtop
Contributor

Is your feature request related to a problem? Please describe.

Currently, Preflight does not retry the image pull:

// pull the image and save to fs
log.Debug("pulling image from target registry")
img, err := crane.Pull(c.Image, options...)
if err != nil {
	return fmt.Errorf("failed to pull remote container: %v", err)
}

and that produces errors like this:

/tmp/preflight_tmp_dir.7w73yzeb/preflight/1.4.0/preflight check container gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c 

time=\"2022-09-19T21:53:21-05:00\" level=info msg=\"certification library version 1.4.0 <commit: ee9cf5bff7418620e7bc511ec99257d3865678da>\"
time=\"2022-09-19T21:53:21-05:00\" level=debug msg=\"target image: gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c\"
time=\"2022-09-19T21:53:21-05:00\" level=debug msg=\"pulling image from target registry\"
time=\"2022-09-19T21:53:21-05:00\" level=trace msg=\"entering preflight keychain Resolve\"
Error: failed to pull remote container: Get \"https://gcr.io/v2/\": dial tcp: i/o timeout
2022/09/19 21:53:50 failed to pull remote container: Get \"https://gcr.io/v2/\": dial tcp: i/o timeout

Describe the solution you'd like.

Usually, these registry errors are temporary; I'd like to wait for 5 seconds and then retry.
I'm not sure whether it's better to do this in the Preflight code or to implement a retry feature in Crane.
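
For illustration, a Preflight-side retry could be a small wrapper around crane.Pull. This is only a rough sketch (pullWithRetry, the attempt count, and the fixed wait are made up for this example, not existing Preflight code):

package main

import (
	"fmt"
	"time"

	"github.com/google/go-containerregistry/pkg/crane"
	v1 "github.com/google/go-containerregistry/pkg/v1"
	log "github.com/sirupsen/logrus"
)

// pullWithRetry retries crane.Pull up to attempts times, sleeping wait between failures.
// Hypothetical helper, for illustration only.
func pullWithRetry(image string, attempts int, wait time.Duration, options ...crane.Option) (v1.Image, error) {
	var lastErr error
	for i := 1; i <= attempts; i++ {
		img, err := crane.Pull(image, options...)
		if err == nil {
			return img, nil
		}
		lastErr = err
		log.Debugf("image pull failed (attempt %d/%d): %v", i, attempts, err)
		if i < attempts {
			time.Sleep(wait)
		}
	}
	return nil, fmt.Errorf("failed to pull remote container after %d attempts: %w", attempts, lastErr)
}

func main() {
	image := "gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c"
	// Two attempts, 5 seconds apart: the "wait 5 seconds and retry" behavior described above.
	if _, err := pullWithRetry(image, 2, 5*time.Second); err != nil {
		log.Fatal(err)
	}
	log.Println("pull finished fine")
}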

Additional context.

This error happens often enough to be a real bother.

@tkrishtop tkrishtop added the kind/feature Categorizes issue or PR as related to a new feature. label Sep 20, 2022
@acornett21
Contributor

It looks like Crane does have exponential backoff, but I'm not sure whether it's exposed in a usable way; it looks like it is. We would need to test more and see.

See Crane code here

@tkrishtop
Contributor Author

tkrishtop commented Sep 21, 2022

Hi, @acornett21 thanks for checking.

I checked the Crane code in more detail, and it seems they already have a reasonable policy of retrying three times: at 0s, 1s, and 4s.

https://github.com/google/go-containerregistry/blob/e3b94c7e7a5720c5e2985012f617f36006abff62/pkg/v1/remote/options.go#L69-L75

// Try this three times, waiting for 1s after the first failure, and 3s after the second.
var defaultRetryBackoff = Backoff{
	Duration: 1.0 * time.Second,
	Factor:   3.0,
	Jitter:   0.1,
	Steps:    3,
}

In my experience with quay.io, more than 3 short-term retries would trigger throttling, so probably the only thing we could do here is increase the waiting time between attempts by setting Factor: 5.0.

In principle, this PR allows configuring the retry attempts.
P.S. I removed the non-working code.

@acornett21
Contributor

acornett21 commented Sep 21, 2022

@tkrishtop Thanks for looking further. I'm not sure whether these defaults get picked up, since we are passing in our own options; we would need to look at the stack trace to see if they do.

The Go Playground only allows the Go core library to be used.

@tkrishtop
Contributor Author

@acornett21 thanks, I got help and managed to fix the invocation.

main.go

package main

import (
	"log"
	"time"

	"github.com/google/go-containerregistry/pkg/crane"
	"github.com/google/go-containerregistry/pkg/v1/remote"
)

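// withRemoteOption wraps a remote.Option as a crane.Option by appending it to the crane options' Remote slice.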
func withRemoteOption(ro remote.Option) crane.Option {
	return func(o *crane.Options) {
		o.Remote = append(o.Remote, ro)
	}
}

func main() {
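	// Custom backoff: wait ~1s after the first failure and ~5s after the second (Steps: 3 = up to three attempts).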
	customRetryBackoff := &remote.Backoff{
		Duration: 1.0 * time.Second,
		Factor:   5.0,
		Jitter:   0.1,
		Steps:    3,
	}

	options := []crane.Option{
		withRemoteOption(remote.WithRetryBackoff(*customRetryBackoff)),
	}

	image := "gcr.io/kubebuilder/kube-rbac-proxy@sha256:db06cc4c084dd0253134f156dddaaf53ef1c3fb3cc809e5d81711baa4029ea4c"
	_, err := crane.Pull(image, options...)

	if err != nil {
		log.Println("[error] failed to pull remote container", err)
	} else {
		log.Println("[ok] pull finished fine")
	}
}

execution:

$ go run main.go 
2022/09/21 17:42:02 [ok] pull finished fine

I don't know how to test this though; I hesitate to trigger throttling and get blocked :)

@komish
Contributor

komish commented Oct 11, 2022

@tkrishtop My initial thought is that you should rerun preflight in this case, but I would imagine the trick is detecting that this case has occurred within your automation? With your concern around throttling in mind, I'm hesitant to do a full exponential backoff. Maybe a single retry after 5-10 seconds would help? Seems like we should be able to do that using the remote.Backoff implementation, something like the sketch below. WDYT?
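
Roughly, reusing the withRemoteOption helper and the image variable from the main.go above, a single-retry policy could look like this (just a sketch, not tested; the exact values are illustrative):

// One retry after ~5s: Steps: 2 means two attempts in total, Factor: 1.0 keeps the wait constant.
singleRetry := remote.Backoff{
	Duration: 5 * time.Second,
	Factor:   1.0,
	Jitter:   0.1,
	Steps:    2,
}

options := []crane.Option{
	withRemoteOption(remote.WithRetryBackoff(singleRetry)),
}

if _, err := crane.Pull(image, options...); err != nil {
	log.Println("[error] failed to pull remote container", err)
}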
