ImagePullPolicy for private ECR repositories #1583

Closed
jonathanbeber opened this issue Feb 22, 2024 · 3 comments · Fixed by #1584

Comments

@jonathanbeber

I was debugging an issue using this project with private ECR repositories. Nodes would initially work normally and new pods would start as expected. After a while, new pods would fail to start on these nodes with the following errors:

{"error":"failed(layer:\"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\", ref:\"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\"): cannot resolve layer: failed to redirect (host \"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com\", ref:\"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\", digest:\"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\"): failed to request: GET https://AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/v2/python3/blobs/sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c giving up after 1 attempt(s): context canceled: failed to resolve: failed to refresh connection","key":"sha256:44d47c020a1d02901ff1868ed70282c3d209aa7682e4a362d64221182c6d38b0","level":"warning","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/56/fs","msg":"check failed","time":"2024-02-22T17:17:49.337153296Z"}
{"error":"cannot resolve layer: failed to redirect (host \"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com\", ref:\"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\", digest:\"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\"): failed to request: GET https://AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/v2/python3/blobs/sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c giving up after 1 attempt(s): context canceled: failed to resolve","key":"sha256:44d47c020a1d02901ff1868ed70282c3d209aa7682e4a362d64221182c6d38b0","level":"warning","mountpoint":"/var/lib/containerd-stargz-grpc/snapshotter/snapshots/56/fs","msg":"failed to refresh the layer \"sha256:3686ce6bf62dba60463db0c77220e9be3ee1f42911288a076f84e06a48c9b50c\" from \"AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5\"","time":"2024-02-22T17:17:49.335916853Z"}

These images are from a private ECR registry, and I'm using CRI-based authentication.

I was using imagePullPolicy: IfNotPresent in the pods.
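
For reference, the pods looked roughly like this (a minimal sketch; the pod and container names below are placeholders, only the image reference and imagePullPolicy reflect my setup):

apiVersion: v1
kind: Pod
metadata:
  name: python3-example   # placeholder name
spec:
  containers:
    - name: python3
      image: AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5
      # With IfNotPresent the kubelet skips the pull once the image is on the
      # node, so (per the hypothesis below) the expired ECR credentials are
      # never refreshed for the snapshotter.
      imagePullPolicy: IfNotPresent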

New nodes would correctly try to download new images and share the credentials with the snapshotter. Older nodes with pods already running started failing.

My hypothesis, since the node already has pods using that image and I see events like the following in the failing pods, is that the kubelet is no longer sharing the credentials with the snapshotter. ECR credentials expire periodically and have to be renewed.

  Normal   Pulled     4s (x3 over 18s)  kubelet            Container image "AWS_ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/python3:v0.0.5" already present on machine

For now I changed the ImagePullPolicy and will keep monitoring.
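
Concretely (assuming the switch is to Always, which forces the kubelet to contact ECR on every pod start and hence refresh the credentials shared with the snapshotter), the only change to the spec above is:

      # workaround: always pull so the kubelet re-fetches ECR credentials
      imagePullPolicy: Always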

I created this issue to understand a bit better why stargz snapshotter keeps trying to reach the registry for the image even after pods have been running successfully, even with imagePullPolicy set to IfNotPresent, and, most importantly, to discuss adding a warning about this to the CRI-based authentication documentation.

@ktock
Member

ktock commented Feb 23, 2024

Thanks for reporting this issue.

My hypothesis, since the node already has pods using that image and I see events like the following in the failing pods, is that the kubelet is no longer sharing the credentials with the snapshotter. ECR credentials expire periodically and have to be renewed.

I think you're right.

For now I changed the ImagePullPolicy and will keep monitoring.

Thanks for trying the workaround.

I created this issue to understand a bit better why stargz snapshotter keeps trying to reach the registry for the image even after pods have been running successfully, even with imagePullPolicy set to IfNotPresent, and, most importantly, to discuss adding a warning about this to the CRI-based authentication documentation.

Thanks for the suggestion. I'm trying to fix this in #1584.

@jonathanbeber
Author

jonathanbeber commented Feb 23, 2024

That's great, thank you for the quick triage and even quicker solution proposal. I just wonder if the proposed solution takes into consideration pods that have imagePullPolicy set to Always, as I believe that's used as a strategy to always re-pull a tag.

Being less verbose: doesn't this PR make stargz never check for image updates?

@ktock
Member

ktock commented Feb 24, 2024

Being less verbose: doesn't this PR make stargz never check for image updates?

Checking for image updates is handled by containerd's CRI plugin. Stargz-snapshotter checks the registry connection when the same layer mount is reused by containerd. #1584 changes stargz-snapshotter to skip the connection check if the layer contents are fully cached on the node.
