[Bug] ImagePullBackoff when pulling an image from private ECR with SOCI Index being present. #1084
Comments
I noticed similar issues when using Docker Hub as a registry. In this case, I did not have the SOCI Index for the image present in the repository.
|
This is also happening intermittently. I cannot reproduce this consistently. |
I am also noticing a lot of logs like the ones below.
Is this normal even after setting up CRI-based authentication? |
This is something I started noticing on 0.5.0 with the CRI-based authentication. I had soci-snapshotter 0.4.0 in the cluster for some time and had not faced this issue before. When a new pod with a newly built image from the same Docker Hub registry got scheduled onto this node again, I faced the same problem. I had no SOCI index in the repository for that image. Once I drained the problematic node and the pod got scheduled to another new node, with soci-snapshotter running there, the issue was gone. Sometimes the snapshotter can get into this bad state after running for some time. I should have restarted the snapshotter, rather than draining the node, to see if the issue was resolved, but it slipped my mind. |
@debajyoti-truefoundry Hey, thanks for letting us know. We are looking into this. Could you provide a full dump of the snapshotter and containerd logs for pulls of an image with a SOCI index and one without? |
Please find the snapshotter log attached. @turan18 I did not have containerd logs with me. I will attach it as soon as this issue starts happening again. |
This is the full dump of all the logs. I have not filtered it for a particular image. But, the pod was stuck in this node due to the same issue. |
@turan18 Adding logs from containerd and soci-snapshotter. |
Hey @debajyoti-truefoundry, so a couple of things:
Regarding the error you are seeing: unfortunately, CRI credentials are not a stable mechanism for getting credentials to the snapshotter, and we do not recommend using them in a production environment. For the time being, we recommend disabling CRI credentials and setting up a Docker credential helper on the nodes (see ecr-credential-helper for ECR), if you haven't done so already. Apologies for the trouble. |
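For reference, a credential-helper setup like the one suggested above typically means a `config.json` on the node along these lines (a sketch; the account ID and region are placeholders, and `ecr-login` refers to the `docker-credential-ecr-login` binary from amazon-ecr-credential-helper):

```json
{
  "credHelpers": {
    "<aws_account_id>.dkr.ecr.<region>.amazonaws.com": "ecr-login"
  }
}
```

The snapshotter process needs to be able to locate this file (for example via the `DOCKER_CONFIG` environment variable in its service definition), since it runs as a system service rather than in a user shell.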
Thanks, @turan18, for your detailed response.
We do not have any credential refresher in our cluster and keep the credentials as a Kubernetes Secret. I assume the CRI credential is basically what we have in the Secret; it is not a different token derived from the given credentials with a different lifetime. Please correct me if this assumption is wrong. So, if the credential had expired, then even on the new node it should have used the expired credential and given an ErrImagePull. But I am not observing this; somehow, it works on the new node. We faced the same issue with a private Docker Hub image for which we were using a long-lived credential that could not have expired. Again, we noticed the same behaviour: once I drained the node and the pod got scheduled on a new node, the issue did not happen again. |
Hey @debajyoti-truefoundry sorry for the delay.
All EKS worker nodes configure an image credential provider for the kubelet.
The reason it works on the new node is that the images are re-pulled using credentials obtained by the image credential provider binary automatically configured on the node.
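For context, the kubelet-side wiring on EKS looks roughly like the following CredentialProviderConfig (a sketch; exact API versions, binary names, and paths vary by AMI and kubelet version):

```yaml
apiVersion: kubelet.config.k8s.io/v1
kind: CredentialProviderConfig
providers:
  - name: ecr-credential-provider
    matchImages:
      - "*.dkr.ecr.*.amazonaws.com"
    defaultCacheDuration: "12h"
    apiVersion: credentialprovider.kubelet.k8s.io/v1
```

The kubelet invokes the matching provider binary whenever it pulls an image whose reference matches `matchImages`, which is why a freshly scheduled node obtains fresh credentials.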
Yeah, I initially said: "then attempt to run the same pod with the same set of images on the same node, after some time (enough for the credentials to expire; which should be 12 hours in the case of ECR), the snapshotter would still attempt to use the in-memory expired CRI credentials from the first pod run." It is actually a little more nuanced than that.

Containerd refers to layers/"snapshots" solely by the digest of their uncompressed content (it's actually a bit more complicated than that). This means that layers shared between two different images, starting from the base layer, will have the same snapshot key/"digest". When containerd attempts to unpack a layer to disk, it first checks whether the layer has already been unpacked, using the digest of the uncompressed layer as the key. If the key exists, it moves on to the next layer.

The problem is that this doesn't exactly fit the SOCI model. To the SOCI snapshotter, layers are not just content but a set of configurations needed to fetch content. This includes things like the ECR repository URL and, more importantly, the credentials for the repository/registry. What this effectively means is that shared layers, regardless of which image they are a part of, will use the same underlying SOCI configuration, meaning the same set of credentials.

This is exactly what's happening in your case. The image that resides in Docker Hub still shares layers, starting from the base layer, with the images previously pulled from ECR with SOCI. Because of this, containerd assumes there is no point in downloading/unpacking the layer, since, in its eyes, the content already exists. So the layer in the Docker Hub image ends up sharing the same SOCI mount point, and thus the same set of expired credentials, as the layer in the ECR image that was pulled with SOCI.
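The dedup behaviour described above can be illustrated with a toy model (this is not containerd's actual API, just a sketch of the keying logic): snapshots are keyed by the uncompressed-layer digest, so the second image's credentials are never registered for a shared layer.

```python
# Toy model of containerd-style snapshot dedup: the first config
# (including credentials) registered for a digest wins; a later image
# sharing that layer silently reuses it.
class SnapshotStore:
    def __init__(self):
        self._snapshots = {}  # uncompressed-layer digest -> config at first pull

    def prepare(self, digest, config):
        # If the key already exists, unpacking is skipped and the new
        # image's config (fresh credentials) is ignored.
        if digest not in self._snapshots:
            self._snapshots[digest] = config
        return self._snapshots[digest]

store = SnapshotStore()

# Image A pulled from ECR with SOCI; its credentials later expire.
cfg_a = store.prepare("sha256:base", {"registry": "ecr", "token": "expired"})

# Image B from Docker Hub shares the same base-layer digest.
cfg_b = store.prepare("sha256:base", {"registry": "docker.io", "token": "fresh"})

assert cfg_b["token"] == "expired"  # the stale SOCI config is reused
```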
Now, although there isn't an easy way to address the shared-layers problem between images, at least without making some changes in upstream containerd, we can still make some changes to the SOCI snapshotter, like avoiding making any requests to the registry after all the content for a layer has been downloaded. This still isn't a universal solution, however, as there may be cases where, for some reason, the SOCI snapshotter couldn't fetch all the content in the background and will need to do so on demand at some point after the credentials have expired. Currently, the best mechanism, and the one that we recommend for getting credentials through to the snapshotter, is setting up a credential provider on the node. This way, on credential expiration, the snapshotter can retrieve a fresh set of credentials. Apologies if any of this was confusing; I'm happy to clarify any questions that may arise. |
No worries. I appreciate the detailed response.
Right. I tested this. It seems that is what is happening. We use imagePullSecrets anyway, and I have no way to verify whether this binary or the imagePullSecret is used. 😅
Right, this seems to be a big issue and should be documented. I went through https://github.com/containerd/stargz-snapshotter/blob/main/docs/overview.md#cri-based-authentication, and this was not documented there. CRI-based auth can only work reliably if all the private images in the cluster come from a single registry, and that is ignoring the credential-expiry issue.
Do you have any documentation for this? How do I integrate different credential providers for registries from different providers (Docker Hub, ECR)? |
@turan18 Is this limitation present in other snapshotters too? https://github.com/containerd/nydus-snapshotter |
containerd/stargz-snapshotter#1583 Is this related? I also want to understand whether #1035 helps resolve this issue. |
CRI creds work fine, even with shared layers, as long as the registry credentials are not expired. The issue is with credential rotation.
Unfortunately, we do not right now. You can follow the documentation in the ecr-credential-helper repository on how to get the helper installed and set up on the node. I'm not too sure how Docker Hub handles credentials. If they are long-lived and do not expire, you can run
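To cover registries from different providers at once, a single Docker `config.json` can combine a credential helper for ECR with a static, long-lived Docker Hub entry (the kind `docker login` writes). A sketch with placeholder values:

```json
{
  "credHelpers": {
    "<aws_account_id>.dkr.ecr.<region>.amazonaws.com": "ecr-login"
  },
  "auths": {
    "https://index.docker.io/v1/": {
      "auth": "<base64 of user:access-token>"
    }
  }
}
```

Entries in `credHelpers` are matched per registry host, so each provider can use its own helper while other registries fall back to the static `auths` entries.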
Yes. Both stargz and nydus also rely on the same mechanisms to get credentials (a credential helper or CRI creds). There are some ongoing discussions upstream about making the process of obtaining credentials in snapshotters more streamlined, but there isn't anything concrete just yet (see: containerd/containerd#6251, containerd/containerd#5251).
Yes, it looks like that's the same issue, and the proposed fix is what I mentioned earlier: ensure we do not make requests once the content for the image is fully downloaded on disk. This is a viable workaround for the time being and something we will work on as well.
Not really. All that will do is ensure that images that cannot be lazy loaded at all are managed/unpacked by containerd. If the image has already been pulled with the SOCI snapshotter, as in your case, and you attempt to run another image with shared layers after the credentials have expired, containerd will still see the content as pulled and it will use the same underlying SOCI configuration. |
Right. So there are two issues.
Image pulls from ECR were failing due to 1. Image pulls from Docker Hub were failing due to 1 and 2.
In this case, a layer from the Docker Hub image shares the SOCI mount point of a layer from ECR with expired credentials. In containerd's eyes, these are logically the same layer. In this case, any extra API call to the registry will fail.
Will you be able to give a timeline for this? This will open up adoption for us.
Understood. I brought this up because I was thinking with the mental model that if there is any error from the soci-snapshotter, it can defer the process to containerd. |
Is this even solving the issue at the soci-snapshotter layer? I can see that the Authn details are cached here.
The cache is populated here.
So, the cached credentials are only updated if an image is pulled. Looking at containerd/stargz-snapshotter#1583, it seems credentials will only get updated if we always pull the image (imagePullPolicy: Always).
Kubelet is already using a credential helper.
The only fix here seems to be, "Do not connect the registry again for layer A if soci-snapshotter has already set it up". Can soci-snapshotter work with that assumption? |
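The proposed assumption can be sketched as a guard in the layer's read path (hypothetical names, not the snapshotter's actual API): once a layer is marked fully downloaded, reads are served from local content, and the remote fetcher, and therefore the credentials, are never consulted again.

```python
# Sketch of the mitigation: skip all registry traffic for a layer whose
# content is fully on disk, so expired credentials cannot cause failures.
class LazyLayer:
    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # may raise once creds expire
        self._local = {}                   # offset -> cached bytes
        self.fully_downloaded = False

    def read(self, offset):
        if offset in self._local:
            return self._local[offset]
        if self.fully_downloaded:
            # All content is local; a miss here is a real error, not a
            # reason to re-authenticate against the registry.
            raise KeyError(offset)
        data = self._fetch_remote(offset)
        self._local[offset] = data
        return data

def expired_fetch(_offset):
    raise PermissionError("401: credentials expired")

layer = LazyLayer(expired_fetch)
layer._local[0] = b"chunk"        # background fetch completed earlier
layer.fully_downloaded = True
assert layer.read(0) == b"chunk"  # served locally; no registry round-trip
```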
It should, I've opened up an issue and will try and get a PR out soon to test this. |
Thanks @sondavidb. I am worried about another scenario.
|
@sondavidb I'm having trouble reproducing since the expiry time is 12h for an ECR token, but even with the amazon-ecr-credential-helper we're seeing that after the authorization expires, it's never refreshed. These are the logs in question. It's different from what's in this issue, so happy to open a new one if needed.
(same instance, different occurrence)
Restarting
|
Sorry for the delay in response @debajyoti-truefoundry
As of now, no, I don't have a reason to believe it would work, for the same reason: we always check the registry. With the fix from stargz I think this might work, but I wonder if it poses any security threats. I have made a PR for this in the meantime: #1147 |
We don't do any form of credential rotation. Whatever credentials are pulled with the image are the ones we use for the lifecycle of the container. |
Setting aside K8s, it does seem like #945 made an attempt to. I split out this conversation to a new issue (#1148) so I don't muddle it more, apologies! |
Hey @debajyoti-truefoundry, we just released v0.6.0, wondering if you could confirm that this mitigates this particular issue? |
Hey @sondavidb, thanks for the ping. I will try it again on our cluster sometime in the next two weeks. For me, #1148 and #1093 will also be crucial for adoption. But let me try to carve out some time to get |
Thanks! I got a little sidetracked on investigation for 1148 but I'm taking a look. I will confirm I can repro the solution, though. |
Description
I have set up CRI-based authentication to make the SOCI snapshotter work with our private ECR registry. We use EKS.
I noticed that one of the containers was stuck in the ImagePullBackoff state. This was the event.
These are the logs from containerd, the SOCI snapshotter, and the kubelet.
I restarted the Pod and got the same error again.
I drained the Node where the Pod was scheduled to move it to another Node.
Surprisingly, it worked this time. The new node, where the pod got scheduled, also had the SOCI snapshotter running.
Steps to reproduce the bug
No response
Describe the results you expected
No ImagePullBackoff
Host information
Any additional context or information about the bug
No response