runc kill: add support for cgroup.kill #3825

Merged
merged 7 commits into from
Jun 10, 2023

@kolyshkin kolyshkin commented Apr 11, 2023

The cgroup.kill API was added in Linux kernel 5.14 (see [1], [2]). This PR adds support for it to runc, and changes quite a few things around killing containers.

[1] https://lwn.net/Articles/855049/
[2] https://lwn.net/Articles/855924/

Fixes: #3135
Fixes: #3866
Fixes: #3864
Fixes: #4040
Closes: #3199

This also removes the child reaper from libcontainer. It is useless for runc itself, and any libcontainer users who start a container need to take care of reaping their own children. See the "libct: signalAllProcesses: remove child reaping" commit for more details.

Release note

### Deprecated

- `runc kill` option `-a` is now deprecated. Previously, it had to be specified
  to kill a container (with SIGKILL) which does not have its own private PID
  namespace (so that runc would send SIGKILL to all processes). Now, this is
  done automatically. (#3864, #3825)

### Fixed

- libcontainer: fix private PID namespace detection when killing the container
  (#3866, #3825)

### Changed

- libcontainer users that create and kill containers from a daemon process
  (so that the container init is a child of that process) must now implement
  a proper child reaper in case a container does not have its own private PID
  namespace, as documented in `container.Signal`. (#3825)
- libcontainer: `container.Signal` no longer has the second `all bool`
  argument; a need to kill all processes is now determined automatically.
  (#3885)

### Added

- Support for `cgroup.kill` to kill all processes inside a container. (#3135,
  #3825)

@kolyshkin kolyshkin force-pushed the cgroup.kill branch 3 times, most recently from e9cdc73 to 53a2b6a Compare April 12, 2023 03:00
@kolyshkin kolyshkin force-pushed the cgroup.kill branch 2 times, most recently from 81c8c07 to 144339b Compare April 27, 2023 02:11
@kolyshkin kolyshkin marked this pull request as ready for review April 27, 2023 02:12
@kolyshkin kolyshkin force-pushed the cgroup.kill branch 2 times, most recently from e861f49 to d3fd53c Compare April 27, 2023 02:26
@kolyshkin kolyshkin marked this pull request as draft April 27, 2023 02:27
@kolyshkin kolyshkin force-pushed the cgroup.kill branch 6 times, most recently from 133885a to f1a50f1 Compare May 12, 2023 19:44
@kolyshkin kolyshkin force-pushed the cgroup.kill branch 4 times, most recently from bd44b32 to 9cc045c Compare May 13, 2023 01:36
@kolyshkin kolyshkin marked this pull request as ready for review May 15, 2023 21:12
@kolyshkin kolyshkin requested a review from thaJeztah May 15, 2023 21:12
@kolyshkin
Contributor Author

@cyphar @AkihiroSuda PTAL

@kolyshkin kolyshkin added this to the 1.2.0 milestone May 15, 2023
@kolyshkin kolyshkin force-pushed the cgroup.kill branch 2 times, most recently from 122b9cc to e2254ea Compare May 18, 2023 00:59
@kolyshkin
Contributor Author

CI failure is unrelated (#3868)

kolyshkin added 7 commits June 8, 2023 09:23
It seems that set -x was temporarily added as a debug measure, but
slipped into the final commit.

Remove it, for the sake of test log brevity.

Fixes: 9f656db
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
This is roughly the same as TestPIDHostInitProcessWait in libct/int,
except that here we use separate processes to create and to kill a
container, so the processes inside a container are not children of "runc kill", and
also we hit different codepaths (nonChildProcess.signal rather than
initProcess.signal).

In addition, rootless is also tested.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
There are two very distinct usage scenarios for signalAllProcesses:

* when used from the runc binary ("runc kill" command), the processes
  that it kills are not the children of "runc kill", and so calling
  wait(2) on each process is totally useless, as it will return ECHILD;

* when used from a program that has created the container (such as
  libcontainer/integration test suite), that program can and should call
  wait(2), not the signalling code.

So, the child reaping code is totally useless in the first case, and
should be implemented by the program using libcontainer in the second
case. I was not able to track down how this code was added, my best
guess is it happened when this code was part of dockerd, which did not
have a proper child reaper implemented at that time.

Remove it, and add proper documentation.

Change the integration test accordingly.

PS the first attempt to disable the child reaping code in
signalAllProcesses was made in commit bb912eb, which used a
questionable heuristic to figure out whether wait(2) should be called.
This heuristic worked for a particular use case, but is not correct in
general.

While at it:
 - simplify signalAllProcesses to use unix.Kill;
 - document (container).Signal.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
When someone is using libcontainer to start and kill containers from a
long-lived process (i.e. the same process creates and removes the
container), the initProcess.wait method is used, which has a kludge to
work around killing containers that do not have their own PID namespace.

The code that checks for a private PID namespace is not entirely correct.
To be exact, it does not set the sharePidns flag when the host/caller PID
namespace is implicitly used. As a result, the above-mentioned kludge
does not work.

Fix the issue, add a test case (which fails without the fix).
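
The three cases involved can be sketched as follows. This is an illustration only, assuming OCI-style namespace semantics (listed with an empty path means a new namespace, listed with a path means joining an existing one, not listed means inheriting the caller's); the `Namespace` type and `hasPrivatePidns` are hypothetical and are not libcontainer's types or the actual fix.

```go
package main

// Namespace is a hypothetical stand-in for one entry of a container's
// namespace configuration; it is not libcontainer's type.
type Namespace struct {
	Type string // e.g. "pid", "net", "mnt"
	Path string // non-empty: join the existing namespace at this path
}

// hasPrivatePidns reports whether the container gets a new PID
// namespace of its own. Hypothetical helper for illustration.
func hasPrivatePidns(nss []Namespace) bool {
	for _, ns := range nss {
		if ns.Type == "pid" {
			// Listed with an empty path: a new namespace is created.
			// Listed with a path: an existing namespace is joined,
			// so it is shared.
			return ns.Path == ""
		}
	}
	// Not listed at all: the caller's PID namespace is used
	// implicitly, which is also a shared namespace. This is the
	// case the commit message says was missed.
	return false
}
```

With a helper like this, a "shares PID namespace" flag would simply be the negation of `hasPrivatePidns`.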

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
By default, the container has its own PID namespace, and killing (with
SIGKILL) its init process from the parent PID namespace also kills all
the other processes.

Obviously, it does not work that way when the container is sharing its
PID namespace with the host or another container, since init is no
longer special (it's not PID 1). In this case, killing the container's
init will leave a bunch of other processes running (and thus make it
impossible to remove the cgroup).

The solution to the above problem is killing all the container
processes, not just init.

The problem with the current implementation is that the killing logic is
implemented in libcontainer's initProcess.wait, and is thus only available
to libcontainer users, but not to the runc kill command (which uses
nonChildProcess.kill and does not use wait at all). So, some workarounds
exist:
 - func destroy(c *Container) calls signalAllProcesses;
 - runc kill implements -a flag.

This code became very tangled over time. Let's simplify things by moving
the logic that kills all processes from initProcess.wait to
container.Signal, and document the new behavior.

In essence, this also makes `runc kill` automatically kill all container
processes when the container does not have its own PID namespace.
Document that as well.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
As of the previous commit, this is implied in a particular scenario. In
fact, this is the one and only scenario that justifies the use of -a.

Drop the option from the documentation. For backward compatibility, do
recognize it, and retain the feature of ignoring the "container is
stopped" error when set.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
Member

@cyphar cyphar left a comment

lgtm

@kolyshkin
Contributor Author

Apparently this also fixes #4040 (added to description).

4 participants