Describe the bug
With containerd 1.4.4 as the container runtime, some Nodes may fail to create new Pods with RunContainerError. Describing the Pod shows the following events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 71s (x622 over 136m) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "8c64f8839249ab0f85e1b44994335d3b3062ac77e48f295b7a5a5db21ce4034d": failed to allocate for range 0: no IP addresses available in range set: 100.96.9.1-100.96.9.254
However, there should still be available IPs on the Node.
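As a quick sanity check on how fast the range can run dry (an illustrative calculation, assuming standard inclusive IPv4 range semantics): the range in the error message holds only 254 addresses, so a comparable number of leaked allocations is enough to exhaust a Node's Pod CIDR.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
)

// countRange returns how many addresses lie in the inclusive IPv4 range [start, end].
func countRange(start, end string) uint32 {
	s := binary.BigEndian.Uint32(net.ParseIP(start).To4())
	e := binary.BigEndian.Uint32(net.ParseIP(end).To4())
	return e - s + 1
}

func main() {
	// The range from the error message: 100.96.9.1-100.96.9.254.
	fmt.Println(countRange("100.96.9.1", "100.96.9.254")) // 254
}
```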
This seems to be caused by an issue in containerd or runc, because containerd had been failing to create sandbox container tasks for a while before the IPs were exhausted:
Jun 07 06:14:21 containerd[683]: time="2021-06-07T06:14:21.274380342Z" level=error msg="RunPodSandbox for &PodSandboxMetadata{Name:cafe-80-5b5b959fcd-mxqd8,Uid:3b511311-d3b0-4f38-bcdd-1c7a6d56d3d8,Namespace:workload-ns-20,Attempt:1,} failed, error" error="failed to start sandbox container task \"a719a9447d058e721c81155213d7b94c4050bb88c53f869f7a0b996a1ae48ec8\": context canceled: unknown"
The reason IP allocation failed is another containerd issue, containerd/containerd#5438: containerd did not invoke CNI for cleanup when sandbox container creation timed out. It therefore kept allocating IPs while getting stuck creating sandbox container tasks. Once all IPs were exhausted, it started reporting the IP allocation error above instead of the sandbox creation error, because CNI is invoked before the sandbox container task is started.
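The fix pattern can be sketched as follows (a minimal sketch, not the actual containerd/containerd#5569 patch; `setupNetwork`, `teardownNetwork`, and `startSandboxTask` are hypothetical stand-ins): once CNI setup succeeds, a deferred teardown guarantees that any later failure, including a context cancellation while starting the sandbox task, releases the allocated IP.

```go
package main

import (
	"errors"
	"fmt"
)

var released bool

// Hypothetical stand-ins for CNI setup/teardown and starting the sandbox task.
func setupNetwork(id string) error    { return nil }
func teardownNetwork(id string) error { released = true; return nil }
func startSandboxTask(id string) error {
	return errors.New("context canceled") // simulate the observed failure
}

// runPodSandbox sketches the ordering fix: after CNI setup succeeds,
// a deferred teardown returns the IP on any subsequent error.
func runPodSandbox(id string) (retErr error) {
	if err := setupNetwork(id); err != nil {
		return err
	}
	defer func() {
		if retErr != nil {
			// Without this cleanup, the allocated IP leaks
			// (the bug described in containerd/containerd#5438).
			teardownNetwork(id)
		}
	}()
	if err := startSandboxTask(id); err != nil {
		return fmt.Errorf("failed to start sandbox container task: %w", err)
	}
	return nil
}

func main() {
	err := runPodSandbox("example-sandbox")
	fmt.Println(err != nil, released) // true true
}
```

The key design point is that the teardown is registered immediately after setup succeeds, rather than being called only on specific error paths, so no failure mode can skip it.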
Why containerd failed to start the sandbox container task is still not clear. I suspected opencontainers/runc#2865, as the issue has so far only been hit with containerd 1.4.4, which bundles the affected runc; but @dims clarified that using containerd directly won't hit it.
The IP leak issue is not specific to Antrea, as it was caused by containerd not invoking CNI for cleanup; containerd/containerd#5438 was reported with Weave as the CNI plugin. I created containerd/containerd#5569 to fix it on the containerd side.
Both issues seem to be in containerd/runc; currently nothing can be done on the Antrea side. Using a different containerd version should avoid them.
To Reproduce
Use containerd 1.4.4 as the container runtime.
Keep creating Pods until Pod creation fails.
Versions:
Please provide the following information:
Antrea version (Docker image tag). N/A
Kubernetes version (use kubectl version). If your Kubernetes components have different versions, please provide the version for all of them. N/A
Container runtime: which runtime are you using (e.g. containerd, cri-o, docker) and which version are you using? containerd 1.4.4
After applying containerd/containerd#5569 to containerd 1.4.4, Pod IPs were no longer leaked, and all Pods failed with the error failed to start sandbox container task; we saw exactly the same error as opencontainers/runc#2865.
The above error came from https://github.com/containerd/containerd/blob/963625d7bcee468ced2f868a9de6dbb2c7506514/vendor/github.com/containerd/cri/pkg/server/sandbox_run.go#L285, which indicates the failure happened in task.Start(ctx).