Proposal: define an "ALL_CAPS" pseudo-capability to grant all capabilities #1071

Open
thaJeztah opened this issue Oct 20, 2020 · 6 comments

@thaJeztah
Member

While the list of capabilities in the kernel has been relatively stable,
new capabilities were recently added (CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE).

This proved to be a challenge: docker, for example, was updated to be aware
of these new capabilities (and detects whether the kernel on which it's running supports them),
but the current runc release (and possibly other runtimes) does not yet recognize them.

The specification currently defines that, in order to grant capabilities to a container process,
the container configuration has to specify those capabilities:

capabilities (object, OPTIONAL) is an object containing arrays that
specifies the sets of capabilities for the process.
Valid values are defined in the [capabilities(7)][capabilities.7] man page,
such as CAP_CHOWN. Any value which cannot be mapped to a relevant kernel
interface MUST cause an error.

In most situations, this is not a problem. For example, if I'm running on a 5.8+ kernel
and want to grant my container the CAP_BPF capability, I start the container with --cap-add CAP_BPF.
Attempting to do the same on an older kernel version will produce an error (generated
either by dockerd or by runc).

However, when granting a container all capabilities (for example, when using
--cap-add=ALL, or when running a container with --privileged), things become
problematic.

In this situation, dockerd generates a list of all capabilities supported by the
host's kernel, and sets those capabilities in the container configuration. On a
5.8+ kernel, this will include the new capabilities (CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE).
Docker has no way to detect which capabilities the runtime supports, and
runc (or another runtime), for its part, processes the list of capabilities and
produces an error for any "unknown" capability.
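
To make the mismatch concrete, here is a minimal sketch of it. It is an illustration only, not runc's actual code: `RUNTIME_KNOWN_CAPS` and `validate_capabilities` are hypothetical names, and the table of known names is deliberately truncated.

```python
# Hypothetical, truncated table of capability names compiled into an older runtime.
RUNTIME_KNOWN_CAPS = {"CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_SYS_ADMIN"}


def validate_capabilities(requested, known=RUNTIME_KNOWN_CAPS):
    """Reject the whole list if any name is unknown to the runtime."""
    unknown = [cap for cap in requested if cap not in known]
    if unknown:
        raise ValueError("unknown capabilities: " + ", ".join(unknown))
    return list(requested)


# A higher-level daemon on a newer kernel emits names the older runtime lacks.
kernel_caps = ["CAP_CHOWN", "CAP_SYS_ADMIN", "CAP_BPF", "CAP_PERFMON"]
try:
    validate_capabilities(kernel_caps)
except ValueError as err:
    print(err)
```

The daemon did nothing wrong from the kernel's point of view; the failure comes purely from the two components having different tables of names.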

While docker could account for the runtime not supporting certain capabilities
(which is what's currently done as a temporary solution in moby/moby#41563),
doing so is undesirable, as it would tightly couple docker to the runtime (and would complicate
using alternative runtimes, such as crun, gVisor (runsc), or others).

Proposal

My proposal is to delegate generation of the "all capabilities" list to the runtime,
and to include a special ALL_CAPS (just a suggestion, I'm not attached to the name)
value in the specification.

  • runtimes that do not support the ALL_CAPS special value consider it an
    "unknown capability" and produce an error (as defined by the specification).
  • runtimes that do support the ALL_CAPS special value will materialize the list
    of capabilities, and add all capabilities that the runtime (and active kernel)
    supports.
  • when combining ALL_CAPS with other capabilities (e.g. ALL_CAPS and CAP_CHOWN),
    ALL_CAPS must take precedence. Alternatively, this situation could be considered
    ambiguous, and an error could be produced (we should consider which is more future-proof
    in case additional "special" values are added in the future).
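
A minimal sketch of how a runtime might resolve one capability set under this proposal. All names here (`expand_capabilities`, the `strict` flag, the two input sets) are hypothetical and only illustrate the rules in the list above; `strict` switches between the two alternatives for mixed lists.

```python
ALL_CAPS = "ALL_CAPS"  # placeholder name from the proposal


def expand_capabilities(requested, runtime_caps, kernel_caps, strict=False):
    """Resolve one capability set, materializing ALL_CAPS if present.

    runtime_caps: names the runtime knows about.
    kernel_caps: names the running kernel supports.
    strict=False: ALL_CAPS takes precedence over any other entries.
    strict=True: combining ALL_CAPS with other entries is an error.
    """
    supported = set(runtime_caps) & set(kernel_caps)
    if ALL_CAPS in requested:
        if strict and set(requested) != {ALL_CAPS}:
            raise ValueError("ALL_CAPS cannot be combined with other capabilities")
        return sorted(supported)
    unknown = [cap for cap in requested if cap not in supported]
    if unknown:
        raise ValueError("unknown capabilities: " + ", ".join(unknown))
    return list(requested)
```

A runtime without ALL_CAPS support needs no extra code: the name simply falls into the "unknown" branch and produces an error.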

Compatibility and downsides

Ideally, docker would be able to detect what version of the runtime-spec is supported
by a runtime, but this is likely a separate discussion to have.

As described above, runtimes that do not support the ALL_CAPS special value
will produce an error. This could be considered a breaking change. On the other
hand, the current situation already fails to handle new capabilities being added
to the list.

Having an ALL_CAPS capability makes the container configuration "non-declarative";
the meaning of "all" capabilities will depend on the runtime, and the kernel on
which it's running. I don't think that's worse than the current situation, in
which the same applies, only at a higher level (dockerd or containerd supporting
the new capabilities).

@thaJeztah
Member Author

@justincormack
Contributor

I think a standard way to ask the runtime about what it supports might be better. The runtime could return a JSON doc with everything it supports in, and the runtime should always use a subset.
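
A rough sketch of what the caller side could look like. The JSON shape (a top-level `capabilities` array) and the function name `usable_capabilities` are assumptions for illustration, not an existing runtime interface.

```python
import json


def usable_capabilities(runtime_report_json, kernel_caps):
    """Return the kernel capabilities that the runtime also reports as known."""
    report = json.loads(runtime_report_json)
    runtime_caps = set(report.get("capabilities", []))  # hypothetical field name
    # Preserve the kernel's ordering while dropping names the runtime lacks.
    return [cap for cap in kernel_caps if cap in runtime_caps]


# Hypothetical document returned by the runtime.
report = '{"capabilities": ["CAP_CHOWN", "CAP_KILL", "CAP_SYS_ADMIN"]}'
print(usable_capabilities(report, ["CAP_CHOWN", "CAP_SYS_ADMIN", "CAP_BPF"]))
```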

@giuseppe
Member

alternative idea: what do you think about supporting the capability value in addition to its name?

e.g.

        "capabilities": {
            "bounding": [
                "CAP_CHOWN",
                "1",
                "CAP_DAC_READ_SEARCH",
                ...

the higher level runtimes could read the maximum value from /proc/sys/kernel/cap_last_cap and use it to fill the OCI configuration. An advantage is that it could be used on newer kernels without requiring changes in the OCI runtime.

cap_from_name() seems to already support it
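
For illustration, the higher-level side of this idea could be sketched roughly as below. The function name `numeric_capabilities` is hypothetical, and accepting numeric strings in the OCI configuration is the suggestion being discussed here, not something the current specification defines.

```python
def numeric_capabilities(cap_last_cap_path="/proc/sys/kernel/cap_last_cap"):
    """Return every capability number the running kernel supports, as strings."""
    with open(cap_last_cap_path) as f:
        last_cap = int(f.read().strip())
    # Capability numbers run contiguously from 0 up to and including cap_last_cap.
    return [str(n) for n in range(last_cap + 1)]
```

On a host where cap_last_cap contains 40, this returns the strings "0" through "40", with no table of names involved on either side.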

@thaJeztah
Member Author

Yes, I think numeric values would work (at a cost of not being very human-readable, but perhaps that's not the biggest concern 🤔)

@cpuguy83

Having runc report it isn't bad, but I think in practice it is not very usable for this case.

runc's update cycle is very different from higher level runtimes, so we can:

  • cache (forever... until restart) the caps and get out of sync on update
  • have an expiring cache and query periodically and still have some potential for being out of sync
  • query at each run and incur significant container startup overhead.
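
The three options above differ only in how long a result is trusted, which a small sketch can show. `CapabilityCache` and its parameters are hypothetical; `query` stands in for whatever mechanism would ask the runtime.

```python
import time


class CapabilityCache:
    """Cache the runtime's capability list for ttl seconds (hypothetical helper)."""

    def __init__(self, query, ttl, clock=time.monotonic):
        self._query = query  # callable that asks the runtime for its list
        self._ttl = ttl      # None: cache until restart; 0: query on every call
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or (
            self._ttl is not None and now - self._fetched_at >= self._ttl
        )
        if expired:
            self._value = self._query()
            self._fetched_at = now
        return self._value
```

With `ttl=None` the list goes stale as soon as the runtime is upgraded, any positive `ttl` only narrows that window, and `ttl=0` pays the query cost on every container start.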

It would be nice not to depend on a library for an up-to-date listing of names, if for no other reason than this discrepancy in update cycles.

@thaJeztah
Member Author

I think a standard way to ask the runtime about what it supports might be better. The runtime could return a JSON doc with everything it supports in, and the runtime should always use a subset.

Related: opencontainers/runc#3296, which implemented a runc features subcommand to get that information from runc, and a related proposal in this repo: #1130
