-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
libct/capabilities: create capabilities map based on current environment #3240
base: main
Are you sure you want to change the base?
Conversation
f30ff03
to
1523467
Compare
Commit 5fb831a changed the behavior of runc to match the OCI runtime spec, which now describes that unknown or unavailable capabilities should be ignored. While this change addressed situations where a capability was requested that's not supported by the current kernel ("unknown capabilities"), it did not take into account situations where the kernel *supports* a capability, but that capability is not *available* in the current environment. This causes issues if, for example, runc is running in a docker-in-docker setup, and the outer container does not have all known capabilities enabled, either on purpose (for example, Talos version 0.13 drops two capabilities (kexec + module loading) from all processes but PID 1), or because the outer container was created by an older version of docker or runc, which did not yet support newer capabilities. This patch attempts to address this problem by limiting the list of "known" capa- bilities on the set of effective capabilties for the current process. This code is based on the code in containerd's "caps" package, with some modifications: - the full list of capabilities uses github.com/syndtr/gocapability, instead of a self-defined list. Containerd removed the use of github.com/syndtr/gocapability, but this dependency is still in use in runc, so this change makes it a closer match to the current code. - functions where un-exported, as we don't intend them to be used externally. - a sync.Once was added to the .current() function, so that /proc/self/status is only parsed once. This assumes effective capabilities do not change during runc's lifecycle. There are some things left to be looked at: 1. current() may return an error when failing to parse /proc/self/status, but this error is currently ignored. If an error occurs in this code, it will mean that *no* capabilities are known. While this will be logged as warning when attempting to apply capabilities, it's not a very desirable situation. We'll have to decide what to do in that situation, which could be "panic" (runc unable to run success- fully), or "fall back to a safe/default list". 2. the current code applies the same list (effective caps) to every "type" (ambient, inheritable, bounding, ...). When applying capabilities, should each of those types in the container's spec be limited to the _corresponding_ type in the current processes' capabilities? 3. integration test: we may want an integration test for this. 4. do we want to upstream this functionality to github.com/syndtr/gocapability ? Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
1523467
to
b21568f
Compare
@kolyshkin @AkihiroSuda @tianon @tonistiigi ptal; consider this a first iteration (happy to update where needed); let me know what you think. Also look at the list I put at the end of the PR description (for integration test, I could use some pointers 😅)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to solve a lot of problems for our case, when base OS drops some of the capabilities
I don't think this is a correct change in runc level. Eg. in @smira system where he has |
Does it mean that some caps should be explicit, and some granted only if available? E.g. |
@smira While |
I mean that on other hand how can |
@smira Sorry, I meant |
thanks @tonistiigi for the clarification, yep, it makes perfect sense |
I think this is that assumption may differ; runc (or
Yes, the For the |
Iiuc this is a different case that was already covered in #2854 . While that solution isn't ideal as well (I think I would have preferred "all-caps" solution instead) it is a much more limited case and I don't request it to be reverted. A container manager can always make sure it depends on runc that is new enough for the features that it needs so it is only an issue when someone manually overwrites runc with an unsupported older version.
We shouldn't force high-level tools to all have identical logic. If someone wants to compete with docker they should be free to innovate on their own and don't need to copy-paste decisions that made sense for docker(in this case we are actually breaking Docker features as described earlier but ignoring all caps might make sense for some other tool). Runc API should be versatile so it allows different tools and use-cases to be built on top of it, it should be very explicit and not have magical inputs that mean different things depending on what host configuration is. |
So, are we implementing this in higher levels instead? |
@thaJeztah ^^^ |
So, we implemented this in Moby through moby/moby#42933, but I (personally) still think it would be nice to have a way to indicate "give me whatever you have"; perhaps that would be more something like opencontainers/runtime-spec#1071 (adding |
Relates to:
Commit 5fb831a (#2854) changed the behavior of runc
to match the OCI runtime spec, which now describes that unknown or unavailable
capabilities should be ignored.
While this change addressed situations where a capability was requested that's
not supported by the current kernel ("unknown capabilities"), it did not take
into account situations where the kernel supports a capability, but that
capability is not available in the current environment.
This causes issues if, for example, runc is running in a docker-in-docker setup,
and the outer container does not have all known capabilities enabled, either
on purpose (for example, Talos version 0.13 drops two capabilities (kexec + module
loading) from all processes but PID 1), or because the outer container was created
by an older version of docker or runc, which did not yet support newer capabilities.
This patch attempts to address this problem by limiting the list of "known" capa-
bilities on the set of effective capabilties for the current process. This code
is based on the code in containerd's "caps" package, with some modifications:
a self-defined list. Containerd removed the use of github.com/syndtr/gocapability,
but this dependency is still in use in runc, so this change makes it a closer
match to the current code.
is only parsed once. This assumes effective capabilities do not change during
runc's lifecycle.
There are some things left to be looked at:
error is currently ignored. If an error occurs in this code, it will mean that
no capabilities are known. While this will be logged as warning when attempting
to apply capabilities, it's not a very desirable situation. We'll have to decide
what to do in that situation, which could be "panic" (runc unable to run success-
fully), or "fall back to a safe/default list".
inheritable, bounding, ...). When applying capabilities, should each of those
types in the container's spec be limited to the corresponding type in the
current processes' capabilities?
Signed-off-by: Sebastiaan van Stijn github@gone.nl