
Image in SpinApp CR is being incorrectly cached #40

Closed
calebschoepp opened this issue Feb 7, 2024 · 10 comments

@calebschoepp
Contributor

Changing the image of a SpinApp CR does not always take effect. For an unknown reason, the running app sometimes does not update when you change the image; even deleting and re-applying the SpinApp does not make it use the new image.

@vdice and @ThorstenHans have observed this issue. I'll leave it to them to share more details in a comment below. My understanding is that @ThorstenHans has a reproduction of this issue. Also please feel free to edit this issue to be more accurate.

@vdice
Contributor

vdice commented Feb 27, 2024

Ran into this today in testing again. This should probably be high priority, though signs currently point to upstream culprit(s) (e.g. runwasi).

Some potential leads/related issues (mentioned because an app pod stuck in 'Terminating' is reliably seen whenever the caching behavior occurs):
containerd/runwasi#418
deislabs/containerd-wasm-shims#207 (links to 418 above)

@calebschoepp
Contributor Author

calebschoepp commented Feb 27, 2024

Put it as a must-have in the Initial Release project. CC @macolso

@endocrimes
Contributor

Filed spinkube/containerd-shim-spin#22 because this is a shim issue rather than an operator one.

@macolso macolso added the bug Something isn't working label Feb 28, 2024
@KaiWalter

Same here. Building with:

  spin registry login -u $AZURE_CONTAINER_REGISTRY_NAME -p $AZURE_CONTAINER_REGISTRY_PASSWORD $AZURE_CONTAINER_REGISTRY_ENDPOINT
  spin registry push --build $AZURE_CONTAINER_REGISTRY_ENDPOINT/spin-dapr-ts:$REVISION

Deploying with the new image tag, then checking:

Name:         distributor
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  core.spinoperator.dev/v1
Kind:         SpinApp
Metadata:
  Creation Timestamp:  2024-03-03T10:56:34Z
  Generation:          2
  Resource Version:    19132
  UID:                 7f8a69ec-fceb-4527-a5e1-bf4b1fa9f533
Spec:
  Checks:
  Deployment Annotations:
    scheduler.alpha.kubernetes.io/node-selector:  agentpool=wasm
  Enable Autoscaling:                             true
  Executor:                                       containerd-shim-spin
  Image:                                          kw971987ecacr.azurecr.io/spin-dapr-ts:1709464104

This shows that the latest/desired image is referenced, but logging indicates that the changes are not active.

@ThorstenHans
Collaborator

I was also able to reproduce this behavior using the latest bits. IMO the key to reproducing the issue is the following:

  1. Deploy a Spin App (hello-spin here running app:0.0.1)
  2. Make a change to source code
  3. Push the new tag to your registry (app:0.0.2)
  4. Delete the hello-spin app from the cluster
  5. Update deployment manifest to use app:0.0.2
  6. Deploy the app

New instances spawned will respond with whatever is defined in app:0.0.1.
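
For anyone who wants to replay this, a minimal scripted version of the steps above (a sketch; the ttl.sh reference and the spinapp.yaml manifest are illustrative placeholders):

  # 1. build and push the first version, then deploy it
  spin build
  spin registry push ttl.sh/hello-spin:0.0.1
  kubectl apply -f spinapp.yaml      # spec.image: ttl.sh/hello-spin:0.0.1

  # 2./3. change the source, push a new tag
  spin build
  spin registry push ttl.sh/hello-spin:0.0.2

  # 4./5./6. delete the app, bump the image in the manifest, redeploy
  kubectl delete spinapp hello-spin
  kubectl apply -f spinapp.yaml      # spec.image: ttl.sh/hello-spin:0.0.2

  # bug: new pods still respond with the 0.0.1 content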

@rajatjindal
Member

rajatjindal commented Mar 4, 2024

my 2 cents here:

There may be two different issues in play here:

  1. The Spin app pods are stuck in the Terminating state. It's being tracked in Shim continues to serve/run outdated content on updates containerd-shim-spin#22.
  2. The terminating pods still continue to serve traffic. This behavior can be changed using the Service's spec.publishNotReadyAddresses. This will ensure traffic (via the k8s Service) is not sent to the terminating pods.

@endocrimes, what do you think about the second point above? thanks
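
For reference, the field lives on the Service spec; setting it would look roughly like this (a sketch; the service name hello-spin is an illustrative placeholder):

  # publishNotReadyAddresses=true publishes Endpoints even for pods that are
  # not Ready, so the Service keeps routing to them
  kubectl patch service hello-spin --type merge \
    -p '{"spec":{"publishNotReadyAddresses":true}}'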

@endocrimes
Contributor

PublishNotReadyAddresses=true would be a bug: we'd send traffic to pods that aren't ready to receive it (i.e. still potentially compiling the app).

@rajatjindal
Member

I was looking more into the caching issue, and it seems like it is not due to publishNotReadyAddresses (and we correctly don't configure it in spin-operator).

What I did notice is that the image in the DeploymentSpec and PodSpec refers to the new image, but the containerStatuses field in the pod status still refers to the old image.

I verified it was a new pod.

e.g.
look at the pod spec from my test cluster: https://gist.github.com/rajatjindal/52ee4e796dc37d33f9b41f27dbc7baaf

I am still trying to understand why that might be happening.
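
A quick way to see that mismatch on a live pod (the pod name is an illustrative placeholder):

  # image requested in the pod spec
  kubectl get pod hello-spin-abc123 -o jsonpath='{.spec.containers[0].image}{"\n"}'
  # image the runtime actually reports in the status
  kubectl get pod hello-spin-abc123 -o jsonpath='{.status.containerStatuses[0].image}{"\n"}'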

radu-matei added a commit to radu-matei/spin that referenced this issue Mar 5, 2024
This commit ensures that applications pushed to OCI have unique image
config fields for unique Spin application content and metadata by adding
a label in the OCI image config to the content digest (SHA256) of the
Spin locked application file.

This is to address the issue of the Containerd Spin shim serving
outdated content, because all images of Spin apps on a node would have the
same image ID (the content digest of the OCI config object, which was
identical for all Spin apps).

ref spinkube/spin-operator#40

Signed-off-by: Radu Matei <radu@fermyon.com>

Co-authored-by: Rajat Jindal <rajatjindal83@gmail.com>
Co-authored-by: Danielle Lancashire <dani@builds.terrible.systems>
Co-authored-by: Michelle Dhanani <michelle@fermyon.com>
@radu-matei
Member

radu-matei commented Mar 5, 2024

We have managed to reproduce this setup consistently — once fetched by containerd, all Spin applications would have the same image ID, even if the SHA256 of the image was different (and the applications had different content):

# these images have very different Spin applications, yet they have the same image ID:
$ crictl images list
IMAGE                                        TAG                    IMAGE ID            SIZE
docker.io/rajatjindal/http-rust-cooler       v1                     aabedf0b09494       2.34MB
docker.io/rajatjindal/spin-rust-again        willitwork             aabedf0b09494       2.34MB
docker.io/rajatjindal/spin-rust-example      f                      aabedf0b09494       2.34MB
docker.io/rajatjindal/spin-rust-example      w                      aabedf0b09494       2.34MB

@rajatjindal tracked down this behavior in containerd, specifically where containerd resolves the local image based on the image ID: https://github.com/containerd/containerd/blob/7c3aca7a610df76212171d200ca3811ff6096eb8/pkg/cri/server/helpers.go#L163-L191. That lookup would always return all Spin applications present on the node (since they all had the same image ID), and the first of those would get picked up (https://github.com/containerd/containerd/blob/7c3aca7a610df76212171d200ca3811ff6096eb8/pkg/cri/server/helpers.go#L199). This resulted in the behavior we are seeing here: pods with a mismatch between the image in the container spec and the image reported in the status field (and actually run by containerd).

The issue boils down to how spin registry push constructs the OCI reference for an application: https://github.com/fermyon/spin/blob/5b3aa0a0426d3c626bccdb3a1bfdd3c657952955/crates/oci/src/client.rs#L179-L187

Every Spin application ever pushed would generate the same OCI image config, because Spin would only set the OS and architecture — which resulted in all Spin applications having the same image ID reported by containerd — sha256:aabedf0b094944b344cecbffcfa0a9172e905a5d156bdf67c4c6cbb6d27d1d92.

This was never an issue when running spin up -f <image-reference>, because the Spin internals do not rely on the OCI image config, just on the locked application file. containerd, however, has no knowledge of that file, and since the OCI image ID is the content digest of the config object, containerd reported the same image ID for all Spin applications present on a node.
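
This is easy to observe from outside the cluster, because the image ID is simply the digest of the config blob referenced by the manifest. A minimal check, assuming crane (from go-containerregistry) and jq are installed:

  # print the config digest; this is exactly the IMAGE ID crictl shows
  crane manifest docker.io/rajatjindal/spin-rust-example:f | jq -r '.config.digest'
  # before the fix, every Spin app printed the same value:
  # sha256:aabedf0b094944b344cecbffcfa0a9172e905a5d156bdf67c4c6cbb6d27d1d92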

The fix requires us to ensure the uniqueness of the OCI image config object for unique content, by adding a reference to the digest of the locked application file in the OCI config object — fermyon/spin#2322.

With this fix in place, new Spin applications have different image IDs reported by containerd:

/ # crictl images list
IMAGE                                        TAG                    IMAGE ID            SIZE
docker.io/rajatjindal/http-rust-cooler       v1                     aabedf0b09494       2.34MB
docker.io/rajatjindal/spin-rust-again        willitwork             aabedf0b09494       2.34MB
docker.io/rajatjindal/spin-rust-example      f                      aabedf0b09494       2.34MB
docker.io/rajatjindal/spin-rust-example      w                      aabedf0b09494       2.34MB
# new images have new image IDs
ttl.sh/we-really-fixed-it                    12h                    0e91dc3974ef9       2.21MB
ttl.sh/we-really-fixed-it                    24h                    19740bab7ca70       2.21MB
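
The uniqueness can also be confirmed at the config level: the PR adds a label to the OCI image config carrying the digest of the locked application file, so the config blob (and therefore the image ID) now differs per app. A sketch of how to inspect it with crane and jq (the exact label key is whatever fermyon/spin#2322 sets):

  # dump the labels embedded in the image config
  crane config ttl.sh/we-really-fixed-it:12h | jq '.config.Labels'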

The full fix requires a patch release of Spin (based on the Spin PR linked above), and no changes in the operator or the shim.

Huge thanks to @rajatjindal, @michelleN, @endocrimes, and @jpflueger for helping track this down.

@endocrimes
Contributor

Closing this because it's fixed in fermyon/spin#2322, which has landed on Spin's main branch.

adamreese pushed a commit to fermyon/spin that referenced this issue Mar 7, 2024
This commit ensures that applications pushed to OCI have unique image
config fields for unique Spin application content and metadata by adding
a label in the OCI image config to the content digest (SHA256) of the
Spin locked application file.

This is to address the issue of the Containerd Spin shim serving
outdated content, because all images of Spin apps on a node would have the
same image ID (the content digest of the OCI config object, which was
identical for all Spin apps).

ref spinkube/spin-operator#40

Signed-off-by: Radu Matei <radu@fermyon.com>

Co-authored-by: Rajat Jindal <rajatjindal83@gmail.com>
Co-authored-by: Danielle Lancashire <dani@builds.terrible.systems>
Co-authored-by: Michelle Dhanani <michelle@fermyon.com>
(cherry picked from commit 14cdc42)