Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature request] *: introduce pidfd-socket flag #4045

Merged
merged 1 commit into from
Nov 22, 2023

Conversation

fuweid
Copy link
Member

@fuweid fuweid commented Oct 1, 2023

The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process. It's unsafe to call kill(2) since the pid can be recycled. It's good to provide the pidfd of init process through the pidfd-socket. It's similar to the console-socket. With the pidfd, the container manager like containerd-shim can send the signal to target process safely.

And for the standard init process, we can have polling support to get exit event instead of blocking on wait4.


Let me explain why the containerd-shim needs this feature for containerd init process.

Without pidfd, containerd-shim can't tell which process exits. It has to use reap all the zombies.
However, it requires all the fork/exec operations needs to use the reap-event-framework in containerd.
For example, the mount go-package needs to fork child process which unshares to get brand-new userns. If the child process has been killed and reaped by containerd-shim, the child process's pid can be reused. In order to know the exit event, the mount go-package needs to use reap-event-framework, which doesn't make senses.

With pidfd support, we can use polling support to know which process exits instead of calling wait4 syscall.

And one more detail is that the containerd-shim only cares the container init process and exec init processes.

Currently, containerd-shim uses PR_SET_CHILD_SUBREAPER, and watch the signal SIGCHLD to reap all the zombie processes, including container init process and exec init processes. Before v4.11 kernel-exit: fix the setns() && PR_SET_CHILD_SUBREAPER interaction, the process X double-forked by the exec init process will be reparented to containerd-shim so that containerd-shim can cleanup the zombie.

# docker exec
root@50c69c07310e:/go# uname -r
4.4.0-210-generic
root@50c69c07310e:/go# go install github.com/fuweid/cmdrun-go@v0.1.0
root@50c69c07310e:/go#
root@50c69c07310e:/go# cmdrun-go sleep 30
root@50c69c07310e:/go#
root@50c69c07310e:/go# ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 04:57 ?        00:00:00 sleep 1d
root         7     0  0 04:58 pts/0    00:00:00 bash
root       262     0  0 05:11 pts/0    00:00:00 sleep 30
root       264     7  0 05:11 pts/0    00:00:00 ps -ef
root@50c69c07310e:/go#

Right now (>= v4.11 kernel), the containerd-shim only receives the SIGCHLD from init processes because the double-forked processes will be reparent to pid-1 in the pid namespace. So, containerd-shim doesn't need to care any double-forked processes. The pidfd can help containerd-shim to focus on the correct processes.

# docker exec
root@d46705a1963c:/go# uname -r
6.4.0-4-amd64
root@d46705a1963c:/go#
root@d46705a1963c:/go# go install github.com/fuweid/cmdrun-go@v0.1.0
root@d46705a1963c:/go#
root@d46705a1963c:/go# cmdrun-go sleep 5
root@d46705a1963c:/go#
root@d46705a1963c:/go# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 05:15 ?        00:00:00 /bin/sleep 1d
root           7       0  0 05:15 pts/0    00:00:00 /bin/bash
root         103       1  0 05:16 pts/0    00:00:00 sleep 5
root         104       7  0 05:16 pts/0    00:00:00 ps -ef
root@d46705a1963c:/go#
root@d46705a1963c:/go# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 05:15 ?        00:00:00 /bin/sleep 1d
root           7       0  0 05:15 pts/0    00:00:00 /bin/bash
root         103       1  0 05:16 pts/0    00:00:00 [sleep] <defunct>
root         105       7  0 05:16 pts/0    00:00:00 ps -ef
root@d46705a1963c:/go#

REF: containerd/containerd#9175

@lifubang
Copy link
Member

lifubang commented Oct 1, 2023

The king of involution, is writing code during holidays also a form of rest? 😄

The container manager like containerd-shim can't use cgroup.kill feature or freeze all the processes in cgroup to terminate the exec init process.

Could you please explain why? And why we can’t use ‘runc kill’ or runc libcontainer API directly?

It's unsafe to call kill(2) since the pid can be recycled.

I seem to remember that ‘contained’ has used kill(2) for many many years.

@fuweid
Copy link
Member Author

fuweid commented Oct 2, 2023

Could you please explain why? And why we can’t use ‘runc kill’ or runc libcontainer API directly?

runc-kill is used to send the signal to the container init process. As far as I know, there is no runc-commandline to send signal to the exec init process.

I seem to remember that ‘contained’ has used kill(2) for many many years.

Yes because pidfd is available since v5.3 kernel. pidfd can ensure that we can send the signal to the correct process, especially the exec-probe has timeout.

@fuweid
Copy link
Member Author

fuweid commented Oct 2, 2023

ping @AkihiroSuda @thaJeztah
cc @dmcgowan

@fuweid
Copy link
Member Author

fuweid commented Oct 2, 2023

also ping @lifubang @cyphar @kolyshkin

@lifubang
Copy link
Member

lifubang commented Oct 3, 2023

I think this is a nice feature, because I have hated the big for loop in containerd to find out whether the exit signal is from the init process or not in many years ago.

Just only one question, I think maybe we can simplify the implementation, I don’t know whether my solution could work or not:
Like ‘createPidFile’, How about create a function named ‘sendPidFd’ in utils_linux.go, use process.Pid() to get PidFd, and call it in function ’run’.

runc/utils_linux.go

Lines 270 to 278 in ee45b9b

tty.ClosePostStart()
if r.pidFile != "" {
if err = createPidFile(r.pidFile, process); err != nil {
r.terminate(process)
return -1, err
}
}
status, err := handler.forward(process, tty, detach)
if err != nil {

@fuweid
Copy link
Member Author

fuweid commented Oct 3, 2023

Just only one question, I think maybe we can simplify the implementation, I don’t know whether my solution could work > or not:
Like ‘createPidFile’, How about create a function named ‘sendPidFd’ in utils_linux.go, use process.Pid() to get PidFd, and > call it in function ’run’.

Basically, yes. The runc-{create/exec/run} process is still parent of the init process before exit. We should check the status of process by WNOHANG after pidfd_open. Since the pidfd_open allows to open the own process's pid, it's safe and not race-condition. So I think it's straightforward to handle it in the init process. Sounds good to you?

@cyphar
Copy link
Member

cyphar commented Oct 4, 2023

It seems like it would've been a good idea to make --console-socket a proper communication channel so we don't need to add a new one for each file descriptor (unfortunately I don't think we can fix this -- old runc callers expect to only get the console socket)...

FWIW, I don't like adding features to runc's command-line if we can avoid it -- it makes life harder for other OCI runtimes because we are creating non-standard behaviour that everyone has to copy from us in order to work with runtimes that depend on it. I made this mistake with --console-socket (in retrospect, we probably should've implemented something like LXC's monitor process) and it'd be a shame to repeat it if it's unnecessary.

But then again, I don't see another way of solving it, other than re-architecting runc... Hmmm...

@lifubang
Copy link
Member

lifubang commented Oct 4, 2023

I don't like adding features to runc's command-line if we can avoid it

I think it isn’t a problem to add features to runc's command-line, because if there is a way that we accept pidFd solution without cmd flag, other OCI runtimes should also have to support it with the way like runc uses. What I mean is that, Can we independently solve this problem on containerd side? @fuweid For example, when containerd have fetched the init process’ pid, how about get the pidFd from containerd side?

@fuweid
Copy link
Member Author

fuweid commented Oct 4, 2023

Thanks for the comment! @cyphar

because we are creating non-standard behaviour that everyone has to copy from us in order to work with runtimes that depend on it.

Understand. Currently, no spec is to describe what the command line looks like. For example, the standard init process has two steps to setup: runc-create and runc-start. But the setns init process only has one step runc-exec. It's more about best practice.

But then again, I don't see another way of solving it, other than re-architecting runc... Hmmm...

Just wondering about what re-architecting runc looks like. If it's not related to spec or standard, I think we still have problem to align with all the runtime implementations. Any new features could introduce new flag.

It seems like it would've been a good idea to make --console-socket a proper communication channel so we don't need to add a new one for each file descriptor

Totally agrees. I was thinking about introduce --experimental-xxx to support new command-line features. The feature needs baking time before GA.


Hi @lifubang

What I mean is that, Can we independently solve this problem on containerd side? @fuweid For example, when containerd have fetched the init process’ pid, how about get the pidFd from containerd side?

It requires the sub-reaper setting. The idea comes from refactoring the containerd-shim process manager.
Without sub-reaper setting, there is race-condition to open pidfd after receiving the pid file, because of pid recycled.
With the dup fd by socket, the receiver can ensure that pidfd is correct and there is no race-condition.

I think it's useful to non-sub-reaper use case as well.

libcontainer/init_linux.go Outdated Show resolved Hide resolved
@fuweid fuweid force-pushed the support-pidfd-socket branch 2 times, most recently from 16c6989 to 1105572 Compare October 6, 2023 14:21
@fuweid
Copy link
Member Author

fuweid commented Oct 9, 2023

ping @opencontainers/runc-maintainers ~

@fuweid
Copy link
Member Author

fuweid commented Oct 19, 2023

Sorry to ping @cyphar @kolyshkin @AkihiroSuda @thaJeztah @lifubang again. Any thoughts on this pull request? Thanks

cli.StringFlag{
Name: "pidfd-socket",
Usage: "path to an AF_UNIX socket which will receive a file descriptor referencing the init process",
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a test

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added. cc @lifubang

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is green now. @lifubang @AkihiroSuda PTAL thanks

@fuweid fuweid force-pushed the support-pidfd-socket branch 3 times, most recently from 661a689 to 5fe6606 Compare October 20, 2023 07:27
@fuweid fuweid force-pushed the support-pidfd-socket branch 2 times, most recently from d77625a to 911366a Compare October 20, 2023 07:41
Makefile Show resolved Hide resolved
Copy link
Member

@lifubang lifubang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Please add a changelog entry in your PR description.

@lifubang
Copy link
Member

@cyphar @kolyshkin @thaJeztah PTAL, if there is no objection, I will merge it in next week.
There are just two things needs to discuss:

  1. Is it need to add a description: This is an experimental flag at this time.
  2. There is another way to implement this feature, using env, just like NOTIFY_SOCKET in systemd area. But I think this way is not better than command line.

exec.go Outdated Show resolved Hide resolved
The container manager like containerd-shim can't use cgroup.kill feature or
freeze all the processes in cgroup to terminate the exec init process.
It's unsafe to call kill(2) since the pid can be recycled. It's good to
provide the pidfd of init process through the pidfd-socket. It's similar to
the console-socket. With the pidfd, the container manager like containerd-shim
can send the signal to target process safely.

And for the standard init process, we can have polling support to get
exit event instead of blocking on wait4.

Signed-off-by: Wei Fu <fuweid89@gmail.com>
@lifubang lifubang merged commit 95a93c1 into opencontainers:main Nov 22, 2023
45 checks passed
@fuweid fuweid deleted the support-pidfd-socket branch December 1, 2023 07:27
@cyphar cyphar mentioned this pull request Mar 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants