
Pods stuck in RunContainerError status with error: no IP addresses available #2244

Closed
tnqn opened this issue Jun 7, 2021 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


tnqn commented Jun 7, 2021

Describe the bug
With containerd 1.4.4 as the container runtime, some nodes may fail to create new Pods, which get stuck in RunContainerError. Describing the Pod shows the following message:

Events:
  Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Warning  FailedCreatePodSandBox  71s (x622 over 136m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8c64f8839249ab0f85e1b44994335d3b3062ac77e48f295b7a5a5db21ce4034d": failed to allocate for range 0: no IP addresses available in range set: 100.96.9.1-100.96.9.254

However, there should be available IPs on the Node.
This appears to be caused by an issue in containerd or runc, because containerd had been failing to create sandbox container tasks for a while before the IPs were exhausted:

Jun 07 06:14:21 containerd[683]: time="2021-06-07T06:14:21.274380342Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cafe-80-5b5b959fcd-mxqd8,Uid:3b511311-d3b0-4f38-bcdd-1c7a6d56d3d8,Namespace:workload-ns-20,Attempt:1,} failed, error" error="failed to start sandbox container task \"a719a9447d058e721c81155213d7b94c4050bb88c53f869f7a0b996a1ae48ec8\": context canceled: unknown"

The above error came from https://github.com/containerd/containerd/blob/963625d7bcee468ced2f868a9de6dbb2c7506514/vendor/github.com/containerd/cri/pkg/server/sandbox_run.go#L285, which indicates the failure occurred in task.Start(ctx).

The "failed to allocate IP" message was caused by another containerd issue, containerd/containerd#5438: containerd does not invoke CNI for cleanup when sandbox container creation times out. As a result, it kept allocating IPs while remaining stuck creating sandbox container tasks. Once all IPs were exhausted, it started reporting the IP allocation error instead of the sandbox container creation error, because CNI is invoked before the sandbox container task is started.

The reason containerd failed to start the sandbox container task is still not clear. I suspected it was caused by opencontainers/runc#2865, since so far the issue has only been hit with containerd 1.4.4, which ships the affected runc; however, @dims clarified that using containerd directly won't hit it.

The IP leak issue is not specific to Antrea, since it is containerd that fails to invoke CNI for cleanup; containerd/containerd#5438 was reported with Weave as the CNI plugin. I created containerd/containerd#5569 to fix it on the containerd side.

Both issues appear to be in containerd/runc, so currently nothing can be done on the Antrea side. Using a different containerd version should avoid the problem.

To Reproduce

  1. Use containerd 1.4.4 as the container runtime.
  2. Keep creating Pods until they fail.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag). N/A
  • Kubernetes version (use kubectl version). If your Kubernetes components have different versions, please provide the version for all of them. N/A
  • Container runtime: which runtime are you using (e.g. containerd, cri-o, docker) and which version are you using? containerd 1.4.4
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 7, 2021

tnqn commented Jun 7, 2021

cc @luwang-vmware @dims @andrewsykim


tnqn commented Jun 11, 2021

After applying containerd/containerd#5569 to containerd 1.4.4, Pod IPs were no longer leaked, and all Pods failed with the error "failed to start sandbox container task".
We saw exactly the same error as opencontainers/runc#2865:

# strace -p 1820363 -s256 -e write
strace: Process 1820363 attached
write(2, "standard_init_linux.go:207: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argument\n", 132) = 132
+++ exited with 1 +++

So this was caused by opencontainers/runc#2865 and has been fixed by opencontainers/runc#2871. Bumping containerd to a version newer than 1.4.4 (with runc > 1.0.0-rc93) avoids it.

The IP leak issue has been fixed by containerd/containerd#5569 and is being backported to containerd 1.4 and 1.5.

@tnqn tnqn closed this as completed Jun 11, 2021