libct/capabilities: create capabilities map based on current environment #3240

thaJeztah · 2021-10-08T11:11:27Z

Relates to:

docker container is privileged tries to assume more capabilities than available moby/moby#42906 docker container is privileged tries to assume more capabilities than available
fix: provide only available capabilities to insecure environment moby/buildkit#2394 fix: provide only available capabilities to insecure environment
fix: don't attempt to set unavailable capabilities in privileged mode moby/moby#42911 fix: don't attempt to set unavailable capabilities in privileged mode
Recent test failures in CI (apply caps: operation not permitted) moby/moby#42892 Recent test failures in CI (apply caps: operation not permitted)
capabilities: WARN, not ERROR, for unknown / unavailable capabilities #2854 capabilities: WARN, not ERROR, for unknown / unavailable capabilities
runtime should WARN / ignore capabilities that cannot be granted runtime-spec#1094 runtime should WARN / ignore capabilities that cannot be granted

Commit 5fb831a (#2854) changed the behavior of runc
to match the OCI runtime spec, which now describes that unknown or unavailable
capabilities should be ignored.

While this change addressed situations where a capability was requested that's
not supported by the current kernel ("unknown capabilities"), it did not take
into account situations where the kernel supports a capability, but that
capability is not available in the current environment.

This causes issues if, for example, runc is running in a docker-in-docker setup,
and the outer container does not have all known capabilities enabled, either
on purpose (for example, Talos version 0.13 drops two capabilities (kexec + module
loading) from all processes but PID 1), or because the outer container was created
by an older version of docker or runc, which did not yet support newer capabilities.

This patch attempts to address this problem by limiting the list of "known" capa-
bilities on the set of effective capabilties for the current process. This code
is based on the code in containerd's "caps" package, with some modifications:

the full list of capabilities uses github.com/syndtr/gocapability, instead of
a self-defined list. Containerd removed the use of github.com/syndtr/gocapability,
but this dependency is still in use in runc, so this change makes it a closer
match to the current code.
functions where un-exported, as we don't intend them to be used externally.
a sync.Once was added to the .current() function, so that /proc/self/status
is only parsed once. This assumes effective capabilities do not change during
runc's lifecycle.

There are some things left to be looked at:

current() may return an error when failing to parse /proc/self/status, but this
error is currently ignored. If an error occurs in this code, it will mean that
no capabilities are known. While this will be logged as warning when attempting
to apply capabilities, it's not a very desirable situation. We'll have to decide
what to do in that situation, which could be "panic" (runc unable to run success-
fully), or "fall back to a safe/default list".
the current code applies the same list (effective caps) to every "type" (ambient,
inheritable, bounding, ...). When applying capabilities, should each of those
types in the container's spec be limited to the corresponding type in the
current processes' capabilities?
integration test: we may want an integration test for this.
do we want to upstream this functionality to github.com/syndtr/gocapability ?

Signed-off-by: Sebastiaan van Stijn github@gone.nl

Commit 5fb831a changed the behavior of runc to match the OCI runtime spec, which now describes that unknown or unavailable capabilities should be ignored. While this change addressed situations where a capability was requested that's not supported by the current kernel ("unknown capabilities"), it did not take into account situations where the kernel *supports* a capability, but that capability is not *available* in the current environment. This causes issues if, for example, runc is running in a docker-in-docker setup, and the outer container does not have all known capabilities enabled, either on purpose (for example, Talos version 0.13 drops two capabilities (kexec + module loading) from all processes but PID 1), or because the outer container was created by an older version of docker or runc, which did not yet support newer capabilities. This patch attempts to address this problem by limiting the list of "known" capa- bilities on the set of effective capabilties for the current process. This code is based on the code in containerd's "caps" package, with some modifications: - the full list of capabilities uses github.com/syndtr/gocapability, instead of a self-defined list. Containerd removed the use of github.com/syndtr/gocapability, but this dependency is still in use in runc, so this change makes it a closer match to the current code. - functions where un-exported, as we don't intend them to be used externally. - a sync.Once was added to the .current() function, so that /proc/self/status is only parsed once. This assumes effective capabilities do not change during runc's lifecycle. There are some things left to be looked at: 1. current() may return an error when failing to parse /proc/self/status, but this error is currently ignored. If an error occurs in this code, it will mean that *no* capabilities are known. While this will be logged as warning when attempting to apply capabilities, it's not a very desirable situation. We'll have to decide what to do in that situation, which could be "panic" (runc unable to run success- fully), or "fall back to a safe/default list". 2. the current code applies the same list (effective caps) to every "type" (ambient, inheritable, bounding, ...). When applying capabilities, should each of those types in the container's spec be limited to the _corresponding_ type in the current processes' capabilities? 3. integration test: we may want an integration test for this. 4. do we want to upstream this functionality to github.com/syndtr/gocapability ? Signed-off-by: Sebastiaan van Stijn <github@gone.nl>

thaJeztah · 2021-10-08T12:47:47Z

@kolyshkin @AkihiroSuda @tianon @tonistiigi ptal; consider this a first iteration (happy to update where needed); let me know what you think. Also look at the list I put at the end of the PR description (for integration test, I could use some pointers 😅)

There are some things left to be looked at:

current() may return an error when failing to parse /proc/self/status, but this
error is currently ignored. If an error occurs in this code, it will mean that
no capabilities are known. While this will be logged as warning when attempting
to apply capabilities, it's not a very desirable situation. We'll have to decide
what to do in that situation, which could be "panic" (runc unable to run success-
fully), or "fall back to a safe/default list".

the current code applies the same list (effective caps) to every "type" (ambient,
inheritable, bounding, ...). When applying capabilities, should each of those
types in the container's spec be limited to the corresponding type in the
current processes' capabilities?

integration test: we may want an integration test for this.

do we want to upstream this functionality to github.com/syndtr/gocapability ?

smira

This is going to solve a lot of problems for our case, when base OS drops some of the capabilities

tonistiigi · 2021-10-11T18:18:05Z

I don't think this is a correct change in runc level. Eg. in @smira system where he has sys_boot disabled, when a user runs docker run --cap-add SYS_BOOT it should error, not run a container with the wrong configuration. One can make a case that for docker run --privileged the same system should not error but warn instead. But to make this logic, docker needs to put together the correct spec(by querying caps.current() if needed). By Runc ignoring values that were requested in the spec it makes modifications to the spec that may not fit the user expectation in higher-level container manager and make it impossible to have the correct behavior there.

smira · 2021-10-11T18:20:14Z

I don't think this is a correct change in runc level. Eg. in @smira system where he has sys_boot disabled, when a user runs docker run --cap-add SYS_BOOT it should error, not run a container with the wrong configuration. One can make a case that for docker run --privileged the same system should not error but warn instead. But to make this logic, docker needs to put together the correct spec(by querying caps.current() if needed). By Runc ignoring values that were requested in the spec it makes modifications to the spec that may not fit the user expectation in higher-level container manager and make it impossible to have the correct behavior there.

Does it mean that some caps should be explicit, and some granted only if available?

E.g. --privileged can request all caps, but whatever is available gets granted, while --cap-add SYS_BOOT should fail if not available (either not supported or not available)?

tonistiigi · 2021-10-11T18:26:40Z

@smira While --privileged has historically meant to add a fixed list of caps hardcoded in docker, one could make an argument that expected behavior is to add all caps that docker is able to add. (There is a counterargument that we should prioritize reproducibility between systems instead but that is a discussion for a different repo.) But with user writing --cap-add SYS_BOOT there isn't really two different ways to think about it other than SYS_BOOT is required for this container.

smira · 2021-10-11T18:29:08Z

I mean that on other hand how can docker (or any other client in general) know what are the capabilities available to dockerd ?

tonistiigi · 2021-10-11T18:34:38Z

@smira Sorry, I meant dockerd in here, not client. The Docker API already only deals with added caps, dropped caps and special modes like privileged. Not full list of caps like runc. Dockerd runs in same environment as runc so can use caps.current() to figure out what is the correct list for the runc spec for specific case.

smira · 2021-10-11T18:44:21Z

thanks @tonistiigi for the clarification, yep, it makes perfect sense

thaJeztah · 2021-10-12T11:26:44Z

Dockerd runs in same environment as runc so can use caps.current() to figure out what is the correct list for the runc spec for specific case.

I think this is that assumption may differ; runc (or <insert OCI runtime>)

may have a different set of capabilities that are supported (this was one situation we ran into where dockerd/containerd/<higher runtime> knew about recent capabilities added to the kernel, but runc did not yet)
runc may not always run in the same environment. For example, for LCOW, containerd and runc / CRI run in different environments (runc will be inside a VM)
other cases may be in the same situation (Kata containers come to mind here)

E.g. --privileged can request all caps, but whatever is available gets granted, while --cap-add SYS_BOOT should fail if not available (either not supported or not available)?

Yes, the --cap-add case may be a bit more complicated. My concern is that each-and-every higher-level runtime will have to reimplement this logic, and (hopefully) do it in a consistent way.

For the --privileged case, I suggested opencontainers/runtime-spec#1071, which would (IMO) better reflect the intent (perhaps a different name); basically; run "unconfined" / no restrictions applied to the environment in which the container is started.

tonistiigi · 2021-10-12T21:52:56Z

this was one situation we ran into where dockerd/containerd/ knew about recent capabilities added to the kernel, but runc did not yet

Iiuc this is a different case that was already covered in #2854 . While that solution isn't ideal as well (I think I would have preferred "all-caps" solution instead) it is a much more limited case and I don't request it to be reverted. A container manager can always make sure it depends on runc that is new enough for the features that it needs so it is only an issue when someone manually overwrites runc with an unsupported older version.

higher-level runtime will have to reimplement this logic, and (hopefully) do it in a consistent way.

We shouldn't force high-level tools to all have identical logic. If someone wants to compete with docker they should be free to innovate on their own and don't need to copy-paste decisions that made sense for docker(in this case we are actually breaking Docker features as described earlier but ignoring all caps might make sense for some other tool). Runc API should be versatile so it allows different tools and use-cases to be built on top of it, it should be very explicit and not have magical inputs that mean different things depending on what host configuration is.

kolyshkin · 2021-11-09T02:25:23Z

So, are we implementing this in higher levels instead?

kolyshkin · 2021-11-19T23:08:07Z

So, are we implementing this in higher levels instead?

@thaJeztah ^^^

thaJeztah · 2021-12-20T16:17:28Z

So, we implemented this in Moby through moby/moby#42933, but I (personally) still think it would be nice to have a way to indicate "give me whatever you have"; perhaps that would be more something like opencontainers/runtime-spec#1071 (adding ALL_CAPS to the spec)? Although perhaps opencontainers/runtime-spec#1130 could be considered for that instead.

thaJeztah force-pushed the check_for_current_caps branch from f30ff03 to 1523467 Compare October 8, 2021 11:26

thaJeztah force-pushed the check_for_current_caps branch from 1523467 to b21568f Compare October 8, 2021 11:30

thaJeztah marked this pull request as ready for review October 8, 2021 12:17

This was referenced Oct 8, 2021

fix: provide only available capabilities to insecure environment moby/buildkit#2394

Merged

fix: don't attempt to set unavailable capabilities in privileged mode moby/moby#42911

Closed

smira approved these changes Oct 11, 2021

View reviewed changes

thaJeztah mentioned this pull request Oct 13, 2021

oci/caps: limit available capabilities to current environment moby/moby#42933

Merged

thaJeztah mentioned this pull request Dec 20, 2021

[master] kube-proxy doesn't start up due to "apply caps: operation not permitted" error kubernetes-sigs/kind#2058

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libct/capabilities: create capabilities map based on current environment #3240

libct/capabilities: create capabilities map based on current environment #3240

thaJeztah commented Oct 8, 2021

thaJeztah commented Oct 8, 2021

smira left a comment

tonistiigi commented Oct 11, 2021

smira commented Oct 11, 2021

tonistiigi commented Oct 11, 2021

smira commented Oct 11, 2021

tonistiigi commented Oct 11, 2021

smira commented Oct 11, 2021

thaJeztah commented Oct 12, 2021

tonistiigi commented Oct 12, 2021

kolyshkin commented Nov 9, 2021

kolyshkin commented Nov 19, 2021

thaJeztah commented Dec 20, 2021

libct/capabilities: create capabilities map based on current environment #3240

Are you sure you want to change the base?

libct/capabilities: create capabilities map based on current environment #3240

Conversation

thaJeztah commented Oct 8, 2021

thaJeztah commented Oct 8, 2021

smira left a comment

Choose a reason for hiding this comment

tonistiigi commented Oct 11, 2021

smira commented Oct 11, 2021

tonistiigi commented Oct 11, 2021

smira commented Oct 11, 2021

tonistiigi commented Oct 11, 2021

smira commented Oct 11, 2021

thaJeztah commented Oct 12, 2021

tonistiigi commented Oct 12, 2021

kolyshkin commented Nov 9, 2021

kolyshkin commented Nov 19, 2021

thaJeztah commented Dec 20, 2021