CC: newly pulled pause image by snapshotter stored in an unexpected location #5781

BbolroC · 2023-10-12T13:15:50Z

Description of problem

With the configuration IMAGE_OFFLOAD_TO_GUEST=yes and FORKED_CONTAINERD=no, pod creation under IBM Z SE sometimes gets stuck in the CreateContainerError state with the following error:

Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown

This is a known issue with upstream containerd v1.6.8 (#5775 (comment)). A quick remedy is to remove the pause image and let the snapshotter pull it again. However, the newly pulled image is then stored in an unexpected location (/run/kata-containers/shared/sandboxes/${sandbox_id}/shared was expected), as the following listings show:

# ls -lah /run/kata-containers/shared/sandboxes/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1/shared
total 16K
drwxr-x--- 3 root root 160 Oct 12 11:04 .
drwx------ 5 root root 100 Oct 12 11:04 ..
-rw-r--r-- 1 root root 103 Oct 12 11:04 a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1-e9967091f9448d8a-resolv.conf
-rw-r--r-- 1 root root  11 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-44e4e6f3b60b2926-hostname
-rw-r--r-- 1 root root 103 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-4c6bb0d5b7fc98ff-resolv.conf
-rw-rw-rw- 1 root root   0 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-83476f850307d009-termination-log
-rw-r--r-- 1 root root 205 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-844b44105b991bcd-hosts
drwxrwxrwt 3 root root 140 Oct 12 11:04 efde0bf9b12e2e127bdb007f58e4dfb893d990fc64b8063f9594c1c1753c06ce-ab6d937a4d086125-serviceaccount
# ls -lah /run/containerd/io.containerd.runtime.v2.task/k8s.io/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1/
total 28K
drwx------  3 root root  200 Oct 12 11:04 .
drwx--x--x 20 root root  400 Oct 12 11:04 ..
-rw-r--r--  1 root root   89 Oct 12 11:04 address
-rw-r--r--  1 root root 8.4K Oct 12 11:04 config.json
prwx------  1 root root    0 Oct 12 11:07 log
-rw-r--r--  1 root root  101 Oct 12 11:04 monitor_address
drwx--x--x  2 root root   40 Oct 12 11:04 rootfs
-rw-------  1 root root   32 Oct 12 11:04 shim-binary-path
-rw-r--r--  1 root root    7 Oct 12 11:04 shim.pid
lrwxrwxrwx  1 root root  121 Oct 12 11:04 work -> /var/lib/containerd/io.containerd.runtime.v2.task/k8s.io/a322d916b5dc547d1dce178d31b13091418793a9675a8aa006fcfecd49f8bbc1

This leads to a test failure for Test can pull an unencrypted image inside the guest.

tests/integration/kubernetes/confidential/agent_image.bats

Line 71 in 61806ee

[ ${#rootfs[@]} -eq 1 ]

This could be resolved by bumping containerd to v1.7, but that is not an option at the moment.

The error appears to happen only at http://jenkins.katacontainers.io/job/kata-containers-CCv0-ubuntu-20.04-s390x-SE-daily/. We could skip the test until the update is finished.


This PR is to skip a test `Test can pull an unencrypted image inside the guest` for IBM Z secure execution until the containerd is updated to v1.7.

Fixes: kata-containers#5781

Signed-off-by: Hyounggyu Choi <Hyounggyu.Choi@ibm.com>

fitzthum · 2023-10-16T15:20:57Z

Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.

BbolroC · 2023-10-17T04:37:54Z

> Btw, this issue also shows up on other platforms and has surfaced across multiple PRs. It seems likely that this would also affect users deploying our upcoming release.

If this issue is also the case for other platforms, this would affect users using a cluster (containerd 1.6.x) created without the snapshotter. What do you think? @stevenhorsman @fidencio

stevenhorsman · 2023-10-17T08:29:26Z

So I think there are potentially two separate things going on that may or may not be related:

Error: failed to create containerd container: create instance 697: object with key "697" already exists: unknown

issues which we've seen a few times on different platforms and

[ ${#rootfs[@]} -eq 1 ]

which we've only seen on the s390x system. So either they are not related, or most of the key already exists errors have happened on the AMD nodes, which don't run the same tests, so we wouldn't know. I think we should potentially separate these issues?

BbolroC · 2023-10-17T09:00:11Z

Yeah, I was thinking that while writing the comment. I would say the latter doesn't seem to be what @fitzthum wanted to bring to the table. We have to discuss whether the object with key "xxx" already exists issue will affect users in the next release.

In the kubernetes agent_image test we currently have a check:
```
echo "Check the image was not pulled in the host"
local pod_id=$(kubectl get pods -o jsonpath='{.items..metadata.name}')
retrieve_sandbox_id
rootfs=($(find /run/kata-containers/shared/sandboxes/${sandbox_id}/shared \
	-name rootfs))
[ ${#rootfs[@]} -eq 1 ]
```
to ensure that the image hasn't been pulled onto the host. The reason that the check is for a single rootfs is that we found that the pause image was always pulled on the host, presumably due to it being needed to create the pod sandbox. With the introduction of the nydus-snapshotter code we've found that on some systems (SE and TDX) it appears to be in a different location with nydus-snapshotter, so check for 1, or 0. See an issue at kata-containers#5781 to track this.

We don't have time to understand this fully now, so we just want the tests to pass and check that we don't have both the pause and test pod container image pulled, so set the check to pass if there are 1, or 0 rootfs' found in /run/kata-containers/shared/sandboxes/

Fixes: kata-containers#5790

Signed-off-by: stevenhorsman <steven@uk.ibm.com>
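To make the difference between the original and the relaxed acceptance criteria concrete, here is a minimal sketch of both checks. The function names `count_rootfs`, `rootfs_check_strict` and `rootfs_check_relaxed` are hypothetical and do not appear in agent_image.bats; the sketch only assumes that the check counts entries named rootfs under the directory it is given.

```shell
# Hypothetical helper: count entries named "rootfs" below the given directory.
# A missing directory counts as 0 because find's error output is discarded.
count_rootfs() {
	find "$1" -name rootfs 2>/dev/null | wc -l | tr -d ' '
}

# Original criterion: exactly one rootfs (the pause image) is found.
rootfs_check_strict() {
	[ "$(count_rootfs "$1")" -eq 1 ]
}

# Relaxed criterion from the commit message above: 1 or 0 rootfs are found,
# so the check still fails if both the pause and the test pod image are there.
rootfs_check_relaxed() {
	[ "$(count_rootfs "$1")" -le 1 ]
}
```

In the test itself, the argument would be the /run/kata-containers/shared/sandboxes/${sandbox_id}/shared path used by the existing check.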

ChengyuZhu6 · 2023-11-07T09:27:22Z

I found that test 4 failed due to a stale kata process on the TDX CI machine while running the operator tests:

/ ps -ef|grep kata
root      717683  716131  0 17:05 ?        00:00:00 sudo -E ./run-local.sh -r kata-qemu-tdx
root      717684  717683  0 17:05 ?        00:00:00 /bin/bash ./run-local.sh -r kata-qemu-tdx
root      721166  672128  0 17:07 pts/29   00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox kata
root     3051702       1  0 Nov01 ?        00:01:50 /opt/kata/bin/containerd-shim-kata-v2 -namespace k8s.io -address /run/containerd/containerd.sock -publish-binary /opt/confidential-containers/bin/containerd -id 70c83b7d3bf5ebb5bef7208bf816e2bccfb49962964d4559b50ab80d0112cf26

After I killed the stale kata process, all the tests (including test 4) passed.
http://10.112.240.228:8080/job/confidential-containers-operator-main-centos8stream-x86_64-containerd_kata-qemu-tdx-PR/639/console
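
The manual `ps -ef | grep kata` inspection above could be narrowed to the shim processes only. The following is a rough sketch, not part of the CI scripts: `list_kata_shims` is a hypothetical helper that reads `ps -ef` output on stdin and prints the PID, start time and sandbox id of every containerd-shim-kata-v2 process. It assumes the usual `ps -ef` column layout, where the command starts at field 8. Whether a listed shim is actually stale still has to be judged by hand, for example from a start time that predates the current test run.

```shell
# Hypothetical helper: filter `ps -ef` output (read from stdin) down to
# containerd-shim-kata-v2 processes and print "PID STIME SANDBOX_ID".
list_kata_shims() {
	awk '
		# Field 8 is the executable in `ps -ef` output.
		$8 ~ /containerd-shim-kata-v2$/ {
			id = ""
			# The sandbox id is the value that follows the -id argument.
			for (i = 9; i < NF; i++)
				if ($i == "-id") id = $(i + 1)
			print $2, $5, id
		}
	'
}
```

It would be used as `ps -ef | list_kata_shims`. For the output shown above it would report only PID 3051702, started on Nov01.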

ChengyuZhu6 · 2023-11-07T09:29:55Z

@BbolroC This could potentially be the reason for the failure of test 4 on the SE machine as well.

BbolroC · 2023-11-07T09:55:22Z

Thanks @ChengyuZhu6. I will check today whether that is also the cause on SE, after the kata AC meeting (I have another appointment before it).

BbolroC · 2023-11-07T22:05:46Z

@ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test Test can pull an unencrypted image inside the guest passed on the SE machine (with the latest commit in a CCv0 branch) when I reverted the acceptance criteria back to [ ${#rootfs[@]} -eq 1 ].

stevenhorsman · 2023-11-08T09:21:55Z

> @ChengyuZhu6 @stevenhorsman @fidencio I've confirmed that the 4th test Test can pull an unencrypted image inside the guest passed on the SE machine (with the latest commit in a CCv0 branch) when I reverted the acceptance criteria back to [ ${#rootfs[@]} -eq 1 ].

Thanks, this means that when we move this into main we can go back to -eq 1 rather than -le 1. Thanks a lot to Chengyu for discovering the root cause of this mystery!

BbolroC added the bug (Incorrect behaviour) and needs-review (Needs to be assessed by the team) labels Oct 12, 2023

BbolroC mentioned this issue Oct 12, 2023

CC: Skip test pulling image inside guest temporarily for IBM SE #5782

Merged

stevenhorsman mentioned this issue Nov 6, 2023

ci: test: k8s: agent_image rootfs check #5790

Closed

stevenhorsman mentioned this issue Nov 6, 2023

ci: test: k8s: agent_image rootfs check #5791

Merged

stevenhorsman mentioned this issue Feb 26, 2024

Merge basic guest pull image code to main kata-containers/kata-containers#8484

Merged

stevenhorsman mentioned this issue May 16, 2024

tests: pull-image: Only skip tests for TEEs kata-containers/kata-containers#9613

Merged
