Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HostPID Pod Container Cgroup path was residual after container restarts #4040

Closed
Burning1020 opened this issue Sep 28, 2023 · 16 comments · Fixed by #3825 or #4102
Closed

HostPID Pod Container Cgroup path was residual after container restarts #4040

Burning1020 opened this issue Sep 28, 2023 · 16 comments · Fixed by #3825 or #4102

Comments

@Burning1020
Copy link
Contributor

Burning1020 commented Sep 28, 2023

Description

We created a HostPID pod that has shared pid namespace with host. The container process was killed and then restarted again and agian. We found that the container cgroup path under /sys/fs/cgroup/<subsystem>/kubepods/podxxx-xxx/<contaienrID>/ was left.

The reason is that runc kill or runc delete did not really wait for the exit of container children process, p.wait() will receive ECHILD immediately, see https://github.com/opencontainers/runc/blob/v1.1.9/libcontainer/init_linux.go#L585C18-L585C18. If any child process is still running, the cgroup path couldn't be removed.

Steps to reproduce the issue

  1. Create a HostPID pod, the container has many children process died and new born.
  2. Kill the main container process for many times.
  3. Container will be restart again by kubelet.

Describe the results you received and expected

Expected: The container cgroup path is deleted.
Received: Still exist.

What version of runc are you using?

runc version 1.1.9
commit: v1.1.9-0-gccaecfcb
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.4

Host OS information

No response

Host kernel information

No response

@Burning1020 Burning1020 changed the title Container Cgroup path was residual when HostPID Pod Container Cgroup path was residual after container restarts Sep 28, 2023
@kolyshkin
Copy link
Contributor

The code you referred to was recently changed in #3825.

In fact, runc kill or runc delete can't use wait(2) because those processes are not children of (this) runc invocation.

I think what happens is kubelet calls runc delete -f, and (in runc 1.1.x) it does not autodetect that this is a host pidns process and thus all the processes should be killed, not just init (this is solved in #3825 and will be available in runc 1.2).

The workaround is to call runc kill -a (for host pidns container) followed by runc delete, or wait till runc 1.2. One other option would be to backport #3825 to runc 1.1.x, but it's pretty large so I'd rather not do that.

@kolyshkin
Copy link
Contributor

@Burning1020 If you can try with runc from main branch and check if it fixes your issue, that would be great!

@lifubang
Copy link
Member

lifubang commented Oct 2, 2023

2. Kill the main container process for many times.

If we create a container with a shared or host PID namespace, after the init process has dead, the container's state becomes Stopped, so it will lead to ‘3. Container will be restart again by kubelet.’

Maybe we should trans this type container’s state to stopped by not only checking the init process has dead or not, but also checking whether there is no pid in the cgroup or not. Or just only the last condition.

@lifubang
Copy link
Member

lifubang commented Oct 2, 2023

But I think Kubelet should delete the stopped container first and then start a new one. Did you see some error logs in k8s? Such as: ERRO[0000] Failed to remove paths: map[:/sys/fs/cgroup/unified/test blkio:/sys/fs/cgroup/blkio/user.slice/test cpu:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuacct:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuset:/sys/fs/cgroup/cpuset/test devices:/sys/fs/cgroup/devices/user.slice/test freezer:/sys/fs/cgroup/freezer/test hugetlb:/sys/fs/cgroup/hugetlb/test memory:/sys/fs/cgroup/memory/user.slice/user-1000.slice/session-8.scope/test misc:/sys/fs/cgroup/misc/test name=systemd:/sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-8.scope/test net_cls:/sys/fs/cgroup/net_cls,net_prio/test net_prio:/sys/fs/cgroup/net_cls,net_prio/test perf_event:/sys/fs/cgroup/perf_event/test pids:/sys/fs/cgroup/pids/user.slice/user-1000.slice/session-8.scope/test rdma:/sys/fs/cgroup/rdma/test]

@lifubang
Copy link
Member

lifubang commented Oct 2, 2023

and (in runc 1.1.x) it does not autodetect that this is a host pidns process and thus all the processes should be killed, not just init

I think it has detected:
https://github.com/opencontainers/runc/blob/v1.1.9/libcontainer/state_linux.go#L38-L44

@kolyshkin
Copy link
Contributor

Maybe the source of the problem here is we treat the host pidns container with dead init as stopped, which is not quite true. So, a way to fix this is to modify runc to say the container is in running state once there are some processes left it its cgroup (and use runc kill -a (-a is not needed since runc 1.2) to actually kill those leftovers.

@Burning1020
Copy link
Contributor Author

Burning1020 commented Oct 8, 2023

@Burning1020 If you can try with runc from main branch and check if it fixes your issue, that would be great!

@kolyshkin I have tried the main branch,

# runc --version
runc version 1.1.0+dev
commit: v1.1.0-791-g90cbd11
spec: 1.1.0+dev
go: go1.18.5
libseccomp: 2.5.0

The bug is not resolved!

@lifubang
Copy link
Member

lifubang commented Oct 8, 2023

The bug is not resolved!

Sorry, there is a bug for shared pid namespace in the main branch. Please see #4047 .
Do you have a detailed step to reproduce your issue? If you can reproduce it with Docker, it will be more better.

@lifubang
Copy link
Member

lifubang commented Oct 8, 2023

Do you have a detailed step to reproduce your issue?

For example, what is the container's image the pod uses? The content of the pod's yaml description file.

@Burning1020
Copy link
Contributor Author

Do you have a detailed step to reproduce your issue?

For example, what is the container's image the pod uses? The content of the pod's yaml description file.

Reproduction is very simple.

  1. Run a Pod under a Kubernetes cluster with HostPID=true, the container process should continuously fork child processes, you can use Kubernetes node-problem-detector for example.
  2. Set the memory limit to a very small value, like 50Mb
  3. The npd container was killed by OOM Killer for a few seconds and restarted by Kubelet but no pod was rebuilt.
  4. Check the cgroup path

@lifubang
Copy link
Member

lifubang commented Oct 8, 2023

I can't reproduce it in my test env. You mean the OOM killed container was deleted by kubelet, but the cgroup path still existed?

@Burning1020
Copy link
Contributor Author

Burning1020 commented Oct 9, 2023

@lifubang Yes, the purpose of creating a OOM killed container is to make the container main process died immediately(killed by SIGKILL), so that its children processes is still alive and then killed by signalAllProcesses. Because they couldn't be truely waited through the p.Wait() so maybe they are still alive even after the c.cgroupManager.Destroy(). That makes the cgroup path residual.

I think one of the key point in reproduction is to make the container main process continuously fork child processes.

@Burning1020
Copy link
Contributor Author

One more thing, I change the cgroup remove retries from 5 to 7, the bug is gone.

@lifubang
Copy link
Member

lifubang commented Oct 9, 2023

I have reproduced this issue only with runc.

WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/cpuset/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/hugetlb/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/rdma/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/misc/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/pids/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/blkio/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/unified/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/devices/user.slice/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/perf_event/test: device or resource busy"
WARN[0079] Failed to remove cgroup (will retry)          error="rmdir /sys/fs/cgroup/freezer/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/cpuset/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/hugetlb/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/rdma/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/misc/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/unified/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/cpu,cpuacct/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/pids/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/blkio/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/freezer/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/devices/user.slice/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/memory/user.slice/user-1000.slice/session-39.scope/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/perf_event/test: device or resource busy"
ERRO[0079] Failed to remove cgroup                       error="rmdir /sys/fs/cgroup/net_cls,net_prio/test: device or resource busy"
ERRO[0079] Failed to remove paths: map[:/sys/fs/cgroup/unified/test blkio:/sys/fs/cgroup/blkio/user.slice/test cpu:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuacct:/sys/fs/cgroup/cpu,cpuacct/user.slice/test cpuset:/sys/fs/cgroup/cpuset/test devices:/sys/fs/cgroup/devices/user.slice/test freezer:/sys/fs/cgroup/freezer/test hugetlb:/sys/fs/cgroup/hugetlb/test memory:/sys/fs/cgroup/memory/user.slice/user-1000.slice/session-39.scope/test misc:/sys/fs/cgroup/misc/test name=systemd:/sys/fs/cgroup/systemd/user.slice/user-1000.slice/session-39.scope/test net_cls:/sys/fs/cgroup/net_cls,net_prio/test net_prio:/sys/fs/cgroup/net_cls,net_prio/test perf_event:/sys/fs/cgroup/perf_event/test pids:/sys/fs/cgroup/pids/user.slice/user-1000.slice/session-39.scope/test rdma:/sys/fs/cgroup/rdma/test]

But it's very difficulty to fix, because with shared pid namespace, we have no efficient way to know all container processes has exited or not,except reading pids from cgroup path. Do you have any ideas to fix this issue?
I think just only increasing retry times isn't the best way to fix this issue. We may need many steps:

  1. Before removing cgroup path, we should check that there is no pid in cgroup;
  2. Increase retry times;
  3. If there is an error when removing cgroup dirs, never destroy the container resources and return the error immediately.

@Burning1020
Copy link
Contributor Author

Yes, it is difficult to fix. That's why I open this issue for disscusion.

  1. Before removing cgroup path, we should check that there is no pid in cgroup;
  2. Increase retry times;
  3. If there is an error when removing cgroup dirs, never destroy the container resources and return the error immediately.
  1. We should not keep waitting if any process still alive, as it is uncontrollable.
  2. As you say, increasing retry times isn't an efficient way.
  3. Maybe returning an error is a solution, let the upper caller decided how to handle this error, we should figure out why ignore
    this error from the very begining: https://github.com/opencontainers/runc/blob/main/utils_linux.go#L86

Besides this, how about add the removel of cgroup to shim.

@lifubang
Copy link
Member

3. Maybe returning an error is a solution

Yes, I also think so.

Now in the main branch, because the code of this area has been refactored, there are still some bugs for kill and delete a container with shared pid ns, so maybe this issue can't be fixed in release-1.1.
If you know how to fix it in release-1.1, please tell us or feel free to open a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment