Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libct/capabilities: create capabilities map based on current environment #3240

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

thaJeztah
Copy link
Member

Relates to:

Commit 5fb831a (#2854) changed the behavior of runc
to match the OCI runtime spec, which now describes that unknown or unavailable
capabilities should be ignored.

While this change addressed situations where a capability was requested that's
not supported by the current kernel ("unknown capabilities"), it did not take
into account situations where the kernel supports a capability, but that
capability is not available in the current environment.

This causes issues if, for example, runc is running in a docker-in-docker setup,
and the outer container does not have all known capabilities enabled, either
on purpose (for example, Talos version 0.13 drops two capabilities (kexec + module
loading) from all processes but PID 1), or because the outer container was created
by an older version of docker or runc, which did not yet support newer capabilities.

This patch attempts to address this problem by limiting the list of "known" capa-
bilities on the set of effective capabilties for the current process. This code
is based on the code in containerd's "caps" package, with some modifications:

  • the full list of capabilities uses github.com/syndtr/gocapability, instead of
    a self-defined list. Containerd removed the use of github.com/syndtr/gocapability,
    but this dependency is still in use in runc, so this change makes it a closer
    match to the current code.
  • functions where un-exported, as we don't intend them to be used externally.
  • a sync.Once was added to the .current() function, so that /proc/self/status
    is only parsed once. This assumes effective capabilities do not change during
    runc's lifecycle.

There are some things left to be looked at:

  1. current() may return an error when failing to parse /proc/self/status, but this
    error is currently ignored. If an error occurs in this code, it will mean that
    no capabilities are known. While this will be logged as warning when attempting
    to apply capabilities, it's not a very desirable situation. We'll have to decide
    what to do in that situation, which could be "panic" (runc unable to run success-
    fully), or "fall back to a safe/default list".
  2. the current code applies the same list (effective caps) to every "type" (ambient,
    inheritable, bounding, ...). When applying capabilities, should each of those
    types in the container's spec be limited to the corresponding type in the
    current processes' capabilities?
  3. integration test: we may want an integration test for this.
  4. do we want to upstream this functionality to github.com/syndtr/gocapability ?

Signed-off-by: Sebastiaan van Stijn github@gone.nl

@thaJeztah thaJeztah force-pushed the check_for_current_caps branch from f30ff03 to 1523467 Compare October 8, 2021 11:26
Commit 5fb831a changed the behavior of runc
to match the OCI runtime spec, which now describes that unknown or unavailable
capabilities should be ignored.

While this change addressed situations where a capability was requested that's
not supported by the current kernel ("unknown capabilities"), it did not take
into account situations where the kernel *supports* a capability, but that
capability is not *available* in the current environment.

This causes issues if, for example, runc is running in a docker-in-docker setup,
and the outer container does not have all known capabilities enabled, either
on purpose (for example, Talos version 0.13 drops two capabilities (kexec + module
loading) from all processes but PID 1), or because the outer container was created
by an older version of docker or runc, which did not yet support newer capabilities.

This patch attempts to address this problem by limiting the list of "known" capa-
bilities on the set of effective capabilties for the current process. This code
is based on the code in containerd's "caps" package, with some modifications:

- the full list of capabilities uses github.com/syndtr/gocapability, instead of
  a self-defined list. Containerd removed the use of github.com/syndtr/gocapability,
  but this dependency is still in use in runc, so this change makes it a closer
  match to the current code.
- functions where un-exported, as we don't intend them to be used externally.
- a sync.Once was added to the .current() function, so that /proc/self/status
  is only parsed once. This assumes effective capabilities do not change during
  runc's lifecycle.

There are some things left to be looked at:

1. current() may return an error when failing to parse /proc/self/status, but this
   error is currently ignored. If an error occurs in this code, it will mean that
   *no* capabilities are known. While this will be logged as warning when attempting
   to apply capabilities, it's not a very desirable situation. We'll have to decide
   what to do in that situation, which could be "panic" (runc unable to run success-
   fully), or "fall back to a safe/default list".
2. the current code applies the same list (effective caps) to every "type" (ambient,
   inheritable, bounding, ...). When applying capabilities, should each of those
   types in the container's spec be limited to the _corresponding_ type in the
   current processes' capabilities?
3. integration test: we may want an integration test for this.
4. do we want to upstream this functionality to github.com/syndtr/gocapability ?

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
@thaJeztah
Copy link
Member Author

@kolyshkin @AkihiroSuda @tianon @tonistiigi ptal; consider this a first iteration (happy to update where needed); let me know what you think. Also look at the list I put at the end of the PR description (for integration test, I could use some pointers 😅)

There are some things left to be looked at:

  1. current() may return an error when failing to parse /proc/self/status, but this
    error is currently ignored. If an error occurs in this code, it will mean that
    no capabilities are known. While this will be logged as warning when attempting
    to apply capabilities, it's not a very desirable situation. We'll have to decide
    what to do in that situation, which could be "panic" (runc unable to run success-
    fully), or "fall back to a safe/default list".
  2. the current code applies the same list (effective caps) to every "type" (ambient,
    inheritable, bounding, ...). When applying capabilities, should each of those
    types in the container's spec be limited to the corresponding type in the
    current processes' capabilities?
  3. integration test: we may want an integration test for this.
  4. do we want to upstream this functionality to github.com/syndtr/gocapability ?

Copy link

@smira smira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to solve a lot of problems for our case, when base OS drops some of the capabilities

@tonistiigi
Copy link
Contributor

I don't think this is a correct change in runc level. Eg. in @smira system where he has sys_boot disabled, when a user runs docker run --cap-add SYS_BOOT it should error, not run a container with the wrong configuration. One can make a case that for docker run --privileged the same system should not error but warn instead. But to make this logic, docker needs to put together the correct spec(by querying caps.current() if needed). By Runc ignoring values that were requested in the spec it makes modifications to the spec that may not fit the user expectation in higher-level container manager and make it impossible to have the correct behavior there.

@smira
Copy link

smira commented Oct 11, 2021

I don't think this is a correct change in runc level. Eg. in @smira system where he has sys_boot disabled, when a user runs docker run --cap-add SYS_BOOT it should error, not run a container with the wrong configuration. One can make a case that for docker run --privileged the same system should not error but warn instead. But to make this logic, docker needs to put together the correct spec(by querying caps.current() if needed). By Runc ignoring values that were requested in the spec it makes modifications to the spec that may not fit the user expectation in higher-level container manager and make it impossible to have the correct behavior there.

Does it mean that some caps should be explicit, and some granted only if available?

E.g. --privileged can request all caps, but whatever is available gets granted, while --cap-add SYS_BOOT should fail if not available (either not supported or not available)?

@tonistiigi
Copy link
Contributor

@smira While --privileged has historically meant to add a fixed list of caps hardcoded in docker, one could make an argument that expected behavior is to add all caps that docker is able to add. (There is a counterargument that we should prioritize reproducibility between systems instead but that is a discussion for a different repo.) But with user writing --cap-add SYS_BOOT there isn't really two different ways to think about it other than SYS_BOOT is required for this container.

@smira
Copy link

smira commented Oct 11, 2021

I mean that on other hand how can docker (or any other client in general) know what are the capabilities available to dockerd ?

@tonistiigi
Copy link
Contributor

@smira Sorry, I meant dockerd in here, not client. The Docker API already only deals with added caps, dropped caps and special modes like privileged. Not full list of caps like runc. Dockerd runs in same environment as runc so can use caps.current() to figure out what is the correct list for the runc spec for specific case.

@smira
Copy link

smira commented Oct 11, 2021

thanks @tonistiigi for the clarification, yep, it makes perfect sense

@thaJeztah
Copy link
Member Author

Dockerd runs in same environment as runc so can use caps.current() to figure out what is the correct list for the runc spec for specific case.

I think this is that assumption may differ; runc (or <insert OCI runtime>)

  • may have a different set of capabilities that are supported (this was one situation we ran into where dockerd/containerd/<higher runtime> knew about recent capabilities added to the kernel, but runc did not yet)
  • runc may not always run in the same environment. For example, for LCOW, containerd and runc / CRI run in different environments (runc will be inside a VM)
  • other cases may be in the same situation (Kata containers come to mind here)

E.g. --privileged can request all caps, but whatever is available gets granted, while --cap-add SYS_BOOT should fail if not available (either not supported or not available)?

Yes, the --cap-add case may be a bit more complicated. My concern is that each-and-every higher-level runtime will have to reimplement this logic, and (hopefully) do it in a consistent way.

For the --privileged case, I suggested opencontainers/runtime-spec#1071, which would (IMO) better reflect the intent (perhaps a different name); basically; run "unconfined" / no restrictions applied to the environment in which the container is started.

@tonistiigi
Copy link
Contributor

this was one situation we ran into where dockerd/containerd/ knew about recent capabilities added to the kernel, but runc did not yet

Iiuc this is a different case that was already covered in #2854 . While that solution isn't ideal as well (I think I would have preferred "all-caps" solution instead) it is a much more limited case and I don't request it to be reverted. A container manager can always make sure it depends on runc that is new enough for the features that it needs so it is only an issue when someone manually overwrites runc with an unsupported older version.

higher-level runtime will have to reimplement this logic, and (hopefully) do it in a consistent way.

We shouldn't force high-level tools to all have identical logic. If someone wants to compete with docker they should be free to innovate on their own and don't need to copy-paste decisions that made sense for docker(in this case we are actually breaking Docker features as described earlier but ignoring all caps might make sense for some other tool). Runc API should be versatile so it allows different tools and use-cases to be built on top of it, it should be very explicit and not have magical inputs that mean different things depending on what host configuration is.

@kolyshkin
Copy link
Contributor

So, are we implementing this in higher levels instead?

@kolyshkin
Copy link
Contributor

So, are we implementing this in higher levels instead?

@thaJeztah ^^^

@thaJeztah
Copy link
Member Author

So, we implemented this in Moby through moby/moby#42933, but I (personally) still think it would be nice to have a way to indicate "give me whatever you have"; perhaps that would be more something like opencontainers/runtime-spec#1071 (adding ALL_CAPS to the spec)? Although perhaps opencontainers/runtime-spec#1130 could be considered for that instead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants