HostPID pod container cgroup path remains after container restarts #4040
Comments
The code you referred to was recently changed in #3825. In fact, I think what happens is kubelet calls … The workaround is to call …
@Burning1020 If you could try runc from the main branch and check whether it fixes your issue, that would be great!
If we create a container with a shared or host PID namespace, its state becomes Stopped as soon as the init process has died, which leads to ‘3. Container will be restarted again by kubelet.’ Maybe we should transition this type of container to Stopped only after checking not just that the init process has died but also that no PIDs remain in its cgroup. Or we could check only the latter condition.
But I think kubelet should delete the stopped container first and then start a new one. Did you see any error logs in k8s? Such as:
ERRO[0000] Failed to remove paths: map[:/sys/fs/cgroup/unified/test blkio:/sys/fs/cgroup/blkio/user.slice/test cpu:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuacct:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuset:/sys/fs/cgroup/cpuset/test devices:/sys/fs/cgroup/devices/user.slice/test freezer:/sys/fs/cgroup/freezer/test hugetlb:/sys/fs/cgroup/hugetlb/test memory:/sys/fs/cgroup/memory/user.slice/user-1000.slice/session-8.scope/test misc:/sys/fs/cgroup/misc/test name=systemd:/sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-8.scope/test net_cls:/sys/fs/cgroup/net_cls,net_prio/test net_prio:/sys/fs/cgroup/net_cls,net_prio/test perf_event:/sys/fs/cgroup/perf_event/test pids:/sys/fs/cgroup/pids/user.slice/user-1000.slice/session-8.scope/test rdma:/sys/fs/cgroup/rdma/test]
I think it has detected:
Maybe the source of the problem here is that we treat a host-pidns container with a dead init as stopped, which is not quite true. So, one way to fix this is to modify runc to report the container as running as long as some processes are left in its cgroup (and use …
@kolyshkin I have tried the main branch:
# runc --version
runc version 1.1.0+dev
commit: v1.1.0-791-g90cbd11
spec: 1.1.0+dev
go: go1.18.5
libseccomp: 2.5.0
The bug is not resolved.
Sorry, there is a bug for …
For example, which container image does the pod use? Could you also share the pod's YAML manifest?
Reproduction is very simple.
I can't reproduce it in my test environment. Do you mean the OOM-killed container was deleted by kubelet, but its cgroup path still existed?
@lifubang Yes, the purpose of creating an OOM-killed container is to make the container's main process die immediately (killed by SIGKILL) while its child processes are still alive and then killed by … I think one of the key points in reproducing this is to make the container's main process continuously fork child processes.
One more thing: when I changed the number of cgroup removal retries from 5 to 7, the bug was gone.
I have reproduced this issue using runc alone.
But it's very difficult to fix, because with a shared PID namespace we have no efficient way to know whether all container processes have exited, except by reading PIDs from the cgroup path. Do you have any ideas for fixing this?
Yes, it is difficult to fix. That's why I opened this issue for discussion.
Besides this, how about adding the removal of the cgroup to …
Yes, I think so too. In the main branch the code in this area has been refactored, and there are still some bugs when killing and deleting a container with …
Description
We created a HostPID pod that shares the PID namespace with the host. The container process was killed and then restarted again and again. We found that the container cgroup path under /sys/fs/cgroup/<subsystem>/kubepods/podxxx-xxx/<containerID>/ was left behind. The reason is that runc kill or runc delete does not really wait for the container's child processes to exit: p.wait() receives ECHILD immediately, see https://github.com/opencontainers/runc/blob/v1.1.9/libcontainer/init_linux.go#L585C18-L585C18. If any child process is still running, the cgroup path cannot be removed.
Steps to reproduce the issue
Describe the results you received and expected
Expected: The container cgroup path is deleted.
Received: It still exists.
What version of runc are you using?
runc version 1.1.9
commit: v1.1.9-0-gccaecfcb
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4
Host OS information
No response
Host kernel information
No response