
Pods stuck in RunContainerError status with error: no IP addresses available #2244

Closed
tnqn opened this issue Jun 7, 2021 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


tnqn commented Jun 7, 2021

Describe the bug
With containerd 1.4.4 as the container runtime, some nodes may fail to create new Pods, which get stuck in RunContainerError. Describing the Pod shows the following message:

Events:
  Type     Reason                  Age                   From     Message
  ----     ------                  ----                  ----     -------
  Warning  FailedCreatePodSandBox  71s (x622 over 136m)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8c64f8839249ab0f85e1b44994335d3b3062ac77e48f295b7a5a5db21ce4034d": failed to allocate for range 0: no IP addresses available in range set: 100.96.9.1-100.96.9.254

However, there should be available IPs on the Node.
This appears to be caused by an issue in containerd or runc, because containerd had been failing to create sandbox container tasks for a while before the IPs were exhausted:

Jun 07 06:14:21 containerd[683]: time="2021-06-07T06:14:21.274380342Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cafe-80-5b5b959fcd-mxqd8,Uid:3b511311-d3b0-4f38-bcdd-1c7a6d56d3d8,Namespace:workload-ns-20,Attempt:1,} failed, error" error="failed to start sandbox container task \"a719a9447d058e721c81155213d7b94c4050bb88c53f869f7a0b996a1ae48ec8\": context canceled: unknown"

The above error came from https://github.com/containerd/containerd/blob/963625d7bcee468ced2f868a9de6dbb2c7506514/vendor/github.com/containerd/cri/pkg/server/sandbox_run.go#L285, which indicates the failure occurred in task.Start(ctx).

The "failed to allocate IP" message was caused by another containerd issue, containerd/containerd#5438: containerd does not invoke CNI for cleanup when sandbox container creation times out. As a result, it kept allocating IPs while remaining stuck creating sandbox container tasks. Once all IPs were exhausted, it started reporting the IP allocation error instead of the sandbox container creation error, because CNI is invoked before the sandbox container task is started.

The reason containerd failed to start the sandbox container task is still not clear. I suspected it was caused by opencontainers/runc#2865, since so far the issue has only been hit with containerd 1.4.4, which ships the affected runc; however, @dims clarified that using containerd directly won't hit it.

The IP leak issue is not specific to Antrea, since it is containerd that fails to invoke CNI for cleanup; containerd/containerd#5438 was reported with Weave as the CNI plugin. I created containerd/containerd#5569 to fix it on the containerd side.

Both issues appear to be in containerd/runc, so currently nothing can be done on the Antrea side. Using a different containerd version should avoid the problem.

To Reproduce

  1. Use containerd 1.4.4 as the container runtime.
  2. Keep creating Pods until they fail.

Versions:
Please provide the following information:

  • Antrea version (Docker image tag). N/A
  • Kubernetes version (use kubectl version). If your Kubernetes components have different versions, please provide the version for all of them. N/A
  • Container runtime: which runtime are you using (e.g. containerd, cri-o, docker) and which version are you using? containerd 1.4.4
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Jun 7, 2021

tnqn commented Jun 7, 2021

cc @luwang-vmware @dims @andrewsykim


tnqn commented Jun 11, 2021

After applying containerd/containerd#5569 to containerd 1.4.4, Pod IPs were no longer leaked, and all Pods failed with the error "failed to start sandbox container task".
We saw exactly the same error as opencontainers/runc#2865:

# strace -p 1820363 -s256 -e write
strace: Process 1820363 attached
write(2, "standard_init_linux.go:207: init seccomp caused: error loading seccomp filter into kernel: loading seccomp filter: invalid argument\n", 132) = 132
+++ exited with 1 +++

So this was caused by opencontainers/runc#2865 and has been fixed by opencontainers/runc#2871. Bumping containerd to a version newer than 1.4.4 (with runc > 1.0.0-rc93) avoids it.

The IP leak issue has been fixed by containerd/containerd#5569 and is being backported to containerd 1.4 and 1.5.

@tnqn tnqn closed this as completed Jun 11, 2021