ImagePullPolicy for private ECR repositories #1583

Closed
jonathanbeber opened this issue Feb 22, 2024 · 3 comments · Fixed by #1584

Comments

@jonathanbeber

I was debugging an issue using this project with private ECR repositories. Nodes would initially work normally and new pods would start as expected. After a while, new pods would fail to start on these nodes with the following errors:

{"error":"failed(layer:\"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\", ref:\"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\"): cannot resolve layer: failed to redirect (host \"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com\", ref:\"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\", digest:\"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\"): failed to request: GET https://AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/v2/python3/blobs/sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c giving up after 1 attempt(s): context canceled: failed to resolve: failed to refresh connection","key":"sha256:44d47c020a1d02901ff1868ed70282c3d209aa7682e4a362d64221182c6d38b0","level":"warning","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/56/fs","msg":"check failed","time":"2024-02-22T17:17:49.337153296Z"}
{"error":"cannot resolve layer: failed to redirect (host \"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com\", ref:\"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\", digest:\"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\"): failed to request: GET https://AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/v2/python3/blobs/sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c giving up after 1 attempt(s): context canceled: failed to resolve","key":"sha256:44d47c020a1d02901ff1868ed70282c3d209aa7682e4a362d64221182c6d38b0","level":"warning","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/56/fs","msg":"failed to refresh the layer \"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\" from \"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\"","time":"2024-02-22T17:17:49.335916853Z"}

These images are from a private ECR registry, and I'm using CRI-based authentication.

I was using imagePullPolicy: IfNotPresent in the pods.
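
For reference, the pods looked roughly like this (a minimal sketch; the pod and container names below are placeholders, only the image reference and imagePullPolicy reflect my setup):

apiVersion: v1
kind: Pod
metadata:
  name: python3-example   # placeholder name
spec:
  containers:
    - name: python3
      image: AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5
      # With IfNotPresent the kubelet skips the pull once the image is on the
      # node, so (per the hypothesis below) the expired ECR credentials are
      # never refreshed for the snapshotter.
      imagePullPolicy: IfNotPresent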

New nodes would correctly try to download new images and share the credentials with the snapshotter. Older nodes with pods already running started failing.

My hypothesis, since the node already has pods using that image and I see events like the following in the failing pods, is that the kubelet is no longer sharing the credentials with the snapshotter. ECR credentials expire periodically and have to be renewed.

  Normal   Pulled     4s (x3 over 18s)  kubelet            Container image "AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5" already present on machine

For now I changed the ImagePullPolicy and will keep monitoring.
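
Concretely (assuming the switch is to Always, which forces the kubelet to contact ECR on every pod start and hence refresh the credentials shared with the snapshotter), the only change to the spec above is:

      # workaround: always pull so the kubelet re-fetches ECR credentials
      imagePullPolicy: Always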

I created this issue to understand a bit better why stargz snapshotter keeps trying to reach the registry for the image even after pods have been running successfully, even with imagePullPolicy set to IfNotPresent, and, most importantly, to discuss adding a warning about this to the CRI-based authentication documentation.

@ktock
Member

ktock commented Feb 23, 2024

Thanks for reporting this issue.

My hypothesis, since the node already has pods using that image and I see events like the following in the failing pods, is that the kubelet is no longer sharing the credentials with the snapshotter. ECR credentials expire periodically and have to be renewed.

I think you're right.

For now I changed the ImagePullPolicy and will keep monitoring.

Thanks for trying the workaround.

I created this issue to understand a bit better why stargz snapshotter keeps trying to reach the registry for the image even after pods have been running successfully, even with imagePullPolicy set to IfNotPresent, and, most importantly, to discuss adding a warning about this to the CRI-based authentication documentation.

Thanks for the suggestion. I'm trying to fix this in #1584.

@jonathanbeber
Author

jonathanbeber commented Feb 23, 2024

That's great, thank you for the quick triage and even quicker solution proposal. I just wonder if the proposed solution takes into consideration pods that have imagePullPolicy set to Always, as I believe that's used as a strategy to always re-pull a tag.

Being less verbose: doesn't this PR make stargz never check for image updates?

@ktock
Member

ktock commented Feb 24, 2024

Being less verbose: doesn't this PR make stargz never check for image updates?

Checking for image updates is handled by containerd's CRI plugin. Stargz-snapshotter checks the registry connection when the same layer mount is reused by containerd. #1584 changes stargz-snapshotter to skip the connection check if the layer contents are fully cached on the node.
