Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add features.md to formalize the runc features JSON #1130

Merged
merged 1 commit into from
Mar 27, 2023

Conversation

AkihiroSuda
Copy link
Member

@AkihiroSuda AkihiroSuda commented Dec 9, 2021

Add features.md and features-linux.md, to formalize the runc features JSON that was introduced in runc v1.1.0.

A runtime caller MAY use this JSON to detect the features implemented by the runtime.

The spec corresponds to https://github.com/opencontainers/runc/blob/v1.1.0/types/features/features.go
(opencontainers/runc#3296, opencontainers/runc#3310)

Differences from runc v1.1.0

@@ -0,0 +1,183 @@
# <a name="linuxFeatures" />Linux Features Document
Copy link
Contributor

@flouthoc flouthoc Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a missing feature flag for IntelRDT , some runtimes supports restricting bandwidth using that. Also runtime-spec has a definition for it.

Not sure maybe something like

"intelrdt": {
  "supported": true
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added .linux.intelRdt.enabled field


## <a name="linuxFeaturesCgroup" />Cgroup

**`cgroup`** (object, OPTIONAL) represents the runtime's implementation status of cgroup managers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be any field to list down which controllers are implemented by runtime for instance its not necessary that a given runtime must or must not implement support for a specific controller. Example RDMA is not implemented by all runtimes so entry for RDMA becomes no-op.

Again not sure about just adding it here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's hard to define the set of the controller names.

e.g., "freezer" is a real controller in v1 but not a real controller in v2. And the name of the block I/O controller differs across V1 ("blkio") and V2 ("io").

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AkihiroSuda hmm I agree and fair point but just a small doubt if there is a feature info, container manager should be able to delegate task to the right runtime.

A practical use case would be.

  • Container Manger supports delegating container generation to multiple runtimes but if spec requests setting a particular controller example RDMA it queries all supported runtime and delegate spec to the runtime which implements RDMA for every other spec it keeps using the default runtime.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using an array to specify which controllers are supported by a cgroup manager instead? This would solve that the controller names are different. Also not all controllers might be supported by all cgroup manager implementations.

"cgroup": {
  "v1": ["cpu", "cpuset", "blkio"],
  "v2": ["io", "memory", "devices"]
  "systemd": ["cpu", "memory", "cpuset"],
  "systemdUser":  ["cpu", "memory", "cpuset"]
}

In youki we already have the concept of a pseudo controller for devices, unified and freezer, so in my opinion a name in that array must not always map to a real controller in all implementations.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated PR to add rdma property.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using an array to specify which controllers are supported by a cgroup manager instead?

  • v1 is dying and probably won't have new controllers, so I don't think we need to track v1 controller list (except rdma)

  • v2 may have new controllers, but probably we will support such new controllers via the unified struct without defining a new dedicated struct, so we don't need to track v2 controller list, either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we also add a "disabled": "true|false" to indicate whether the runtime supports disabling cgroups? crun does it with crun --cgroup-manager=disabled .... That is useful when running as a systemd service to reuse the existing cgroup.


```json
"apparmor": {
"enabled": true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure but should this be supported instead of enabled.

Suggested change
"enabled": true
"supported": true

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON is already implemented in runc (experimentally, though), so I'm reluctant to change the field names.

IIRC, I avoided "supported" to avoid confusion with enterprise commercial "support".
An enterprise distributor may provide commercial "support" to subset of the "enabled" features.

Copy link
Contributor

@flouthoc flouthoc Dec 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh i see. This is a really small nit and i don't have a strong opinion on this. 😃

To me it just feels that apparmor is only enabled for a particular spec if spec defines apparmorProfile but by default its only supported by a runtime and not enabled for every spec.

Again i am thinking this too much and its not a big thing. 😅

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think enabled is okay, though I think that we shouldn't care what runc has developed at the moment -- runc features was only merged a day or two ago, so literally nobody is using at the moment (and we can change it until runc 1.2).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, I avoided "supported" to avoid confusion with enterprise commercial "support". An enterprise distributor may provide commercial "support" to subset of the "enabled" features.

Perhaps "available" ? (IMO "enabled" feels like a configuration toggle where most/many of these will really be capabilities that are either usable or they aren't)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess https://github.com/opencontainers/runtime-spec/pull/1130/files#diff-445e6d0dc97b93a0f04d914ad7816114ddc175dc462c7eec7961921af17ff0d0R6-R7 is the context I was missing:

The features document is irrelevant to the actual availability of the features in the host operating system. Hence, the content of the feature document SHOULD be determined on the compilation time of the runtime, not on the execution time.

I redact my suggestion ("enabled" seems fine for what this describes), but then also wonder which specific problem this functionality is solving. 🙈

Copy link
Member Author

@AkihiroSuda AkihiroSuda Dec 16, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wonder which specific problem this functionality is solving.

Examples of specific problems solved in this PR:

  • A high-level engine can't be aware whether the low-level runtime supports seccomp, so a high-level engine can't avoid "seccomp not supported" error unless the user supplies a custom configuration.
  • A high-level engine can't be aware whether the low-level runtime supports rro mounts, so a high-level engine may accidentally have non-RRO mounts against the user's will. This may result in a severe vulnerability.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW we can have availability property that corresponds to actual availability of the feature on the host OS/hardware, in addition to the enabled property proposed in this PR, but that will be another PR.

As a maintainer of containerd/Moby and its relevant projects, I don't see necessity of having availability detection in the low-level runtime, though, because high-level engine implementations are already capable of detecting such feature availability on the host OS/hardware.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW we can have availability property that corresponds to actual availability of the feature on the host OS/hardware, in addition to the enabled property proposed in this PR, but that will be another PR.

As a maintainer of containerd/Moby and its relevant projects, I don't see necessity of having availability detection in the low-level runtime, though, because high-level engine implementations are already capable of detecting such feature availability on the host OS/hardware.

My original doubt was to replace enabled with supported or availability I don't think adding extra key just for availability would be good idea. So I agree with @AkihiroSuda we should not need extra key for this.

+1 for keeping enabled: true

@AkihiroSuda
Copy link
Member Author

Updated PR.

Differences from runc v1.1.0-rc.1

  • Add .linux.intelRdt.enabled field
  • Add .linux.cgroup.rdma field

A [runtime](glossary.md#runtime) MAY provide a JSON document about its implemented features to [runtime callers](glossary.md#runtime-caller).
This JSON document is called ["features document"](glossary.md#features-document).

The features document is irrelevant to the actual availability of the features in the host operating system.
Copy link
Member

@utam0k utam0k Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the role of the caller of the oci container runtime to ensure that it is actually available on the actual host? I felt it would be nice to check if the feature is even available on the host. What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the role of the caller of the oci container runtime to ensure that it is actually available on the actual host?

Yes

I felt it would be nice to check if the feature is even available on the host. What do you think?

Out of the scope of this PR (and I don't see necessity)
#1130 (comment)

@AkihiroSuda
Copy link
Member Author

@giuseppe Could you take a look from crun's perspective?

@giuseppe
Copy link
Member

@giuseppe Could you take a look from crun's perspective?

from crun's perspective, it looks good.

What could also be useful for us, would have a way to know what log-driver is supported by a runtime (the --log-format option used by runc and crun), but that is not part of the OCI specs, so I am not sure if it could be added here?

@AkihiroSuda
Copy link
Member Author

What could also be useful for us, would have a way to know what log-driver is supported by a runtime (the --log-format option used by runc and crun), but that is not part of the OCI specs, so I am not sure if it could be added here?

That should be probably a custom Annotations field like "org.opencontainers.runc.log-formats":"[\"text\",\"json\"]".

@giuseppe
Copy link
Member

That should be probably a custom Annotations field like "org.opencontainers.runc.log-formats":"[\"text\",\"json\"]".

the problem with custom annotations is that it won't really help to understand if a generic OCI runtime $FOO has support for --log-driver and what options are available because you need to know the field in advance.

@giuseppe
Copy link
Member

the problem with custom annotations is that it won't really help to understand if a generic OCI runtime $FOO has support for --log-driver and what options are available because you need to know the field in advance.

or are you suggesting that other OCI runtimes would use the same "org.opencontainers.runc.log-formats" key?

@AkihiroSuda
Copy link
Member Author

@giuseppe

The runc CLI flags like --log-format are currently out of the scope of the OCI Specs, so they have to be defined as annotations under org.opencontainers.runc.* namespace.

Only runtimes with runc-compatible CLI should use these org.opencontainers.runc.* annotations.

We can add proper fields for log formats when the CLI behaviors get defined in the OCI Specs.

@kolyshkin
Copy link
Contributor

This can potentially eliminate podman's containers.conf(5), parameters runtime_supports_*. Currently they have

# List of the OCI runtimes that support --format=json.  When json is supported
# engine will use it for reporting nicer errors.
#
#runtime_supports_json = ["crun", "runc", "kata", "runsc", "krun"]

# List of the OCI runtimes that supports running containers with KVM Separation.
#
#runtime_supports_kvm = ["kata", "krun"]

# List of the OCI runtimes that supports running containers without cgroups.
#
#runtime_supports_nocgroups = ["crun", "krun"]

As pointed out by @giuseppe above, runc (and crun) can add an annotation telling that --log-format is supported.

Something similar could be done with other properties from the above config.

@AkihiroSuda
Copy link
Member Author

@opencontainers/runtime-spec-maintainers

Could you take a look?
Is there any remaining task?

@AkihiroSuda
Copy link
Member Author

Rebased.

@AkihiroSuda AkihiroSuda added this to the v1.1.0 milestone Feb 1, 2023
@tianon
Copy link
Member

tianon commented Feb 1, 2023

Do you have an example of how you'd expect to see this used by a higher-level runtime like containerd or moby? Should they invoke this on every container creation? Once at their daemon startup and cache it for the lifetime? Periodically refresh their cache?

@AkihiroSuda
Copy link
Member Author

Do you have an example of how you'd expect to see this used by a higher-level runtime like containerd or moby? Should they invoke this on every container creation? Once at their daemon startup and cache it for the lifetime? Periodically refresh their cache?

I'd suggest invoking this on every container creation, but the implementation can opt-in to cache it too.

@tianon
Copy link
Member

tianon commented Feb 1, 2023

Isn't that a bit heavy just for a slightly better/earlier error message? (#1130 (comment))

Wouldn't it be better to update the requirements of something like Moby to a higher version of runc that does support rro and then just implement them, especially since that's a potential security issue? I'm sorry, but I still feel like I'm missing something here. 🙈

@AkihiroSuda
Copy link
Member Author

AkihiroSuda commented Feb 1, 2023

Isn't that a bit heavy just for a slightly better/earlier error message? (#1130 (comment))

"Check the error message and disable the unsupported feature" would work fine for just testing a single feature, but doesn't scale for checking multiple features.

Also, how do we propagate the machine-readable error message to the caller? Not sure we can just jsonify the stderr.

Wouldn't it be better to update the requirements of something like Moby to a higher version of runc that does support rro and then just implement them, especially since that's a potential security issue? I'm sorry, but I still feel like I'm missing something here. 🙈

Checking the version number is fine if the runtime caller only supports runc, but doesn't work well for supporting crun, youki, etc.

@tianon
Copy link
Member

tianon commented Feb 1, 2023

Wouldn't most of these things be things that the user asked for explicitly, so erroring all the way up is the appropriate response? Do you have an example of a workflow where something like Moby or containerd would want to enable a feature, check this API, and then dynamically no longer want to enable that feature? Doesn't that create a bit of an unexpected/inconsistent user experience?

@AkihiroSuda
Copy link
Member Author

Wouldn't most of these things be things that the user asked for explicitly, so erroring all the way up is the appropriate response? Do you have an example of a workflow where something like Moby or containerd would want to enable a feature, check this API, and then dynamically no longer want to enable that feature? Doesn't that create a bit of an unexpected/inconsistent user experience?

Moby, containerd, etc. already have this sort of inconsistency; e.g., they conditionally enable Apparmor depending on its availability on the host.
Same is highly likely to happen for other LSMs in future, such as Landlocks, when they are supported by the kernel and the OCI runtime.

Also, these high-level container engines may potentially have a feature to automatically translate ro mounts to rro to harden the "readonly" mounts, when it is supported by the kernel and the OCI runtime.

@AkihiroSuda
Copy link
Member Author

The features should be also useful for calling VM-based OCI runtimes. https://github.com/opencontainers/runtime-spec/blob/v1.1.0-rc.1/config-vm.md
The high-level container engine may conditionally choose the VM disk image file, depending on the disk formats (raw, qcow2, etc.) and the firmwares (BIOS, UEFI) supported by the OCI runtime.

(The current PR does not cover the VM-based runtimes though. Can be covered in a separate PR).

@AkihiroSuda
Copy link
Member Author

This KEP may also depend on the runc features command:

@AkihiroSuda
Copy link
Member Author

@opencontainers/runtime-spec-maintainers

We have 3 LGTMs. Can we merge this?
Also, let's release the next RC right after deciding to merge this in v1.1 or not.

@tianon
Copy link
Member

tianon commented Mar 7, 2023

In case this is blocked on my comments (:heart:): I still don't feel like I actually understand fully what the use case here is (and why they're worth potentially invoking runc features for every container startup), but clearly several other maintainers do and I don't feel strongly enough to block it. 🙇

@AkihiroSuda
Copy link
Member Author

what the use case here

The runtime features are expected to be eventually propagated to Kubernetes node annotations
(in Kube's own format) so that kube-scheduler can avoid using nodes that lack the features required by the Pods.

e.g., when a Pod is launched with recursivelyReadOnly: Required (proposed in kubernetes/enhancements#3858), the scheduler should avoid using nodes that lacks support for rro mount types.

why they're worth potentially invoking runc features for every container startup

No need to invoke runc features for every container startup.
High-level engines can just cache it, but they may face an inconsistency issue when the cache is not invalidated on upgrading the runc binary.

Add `features.md` and `features-linux.md`, to formalize the `runc features` JSON that was introduced in runc v1.1.0.

A runtime caller MAY use this JSON to detect the features implemented by the runtime.

The spec corresponds to https://github.com/opencontainers/runc/blob/v1.1.0/types/features/features.go
(opencontainers/runc PR 3296, opencontainers/runc PR 3310)

Differences since runc v1.1.0:
- Add `.linux.intelRdt.enabled` field
- Add `.linux.cgroup.rdma` field
- Add `.linux.seccomp.knownFlags` and `.linux.seccomp.supportedFlags` fields (Implemented in runc PR 3588)

Signed-off-by: Akihiro Suda <akihiro.suda.cz@hco.ntt.co.jp>
Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a nit, otherwise LGTM

glossary.md Outdated Show resolved Hide resolved
Copy link
Member

@giuseppe giuseppe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@giuseppe giuseppe merged commit dbab843 into opencontainers:main Mar 27, 2023
@AkihiroSuda AkihiroSuda mentioned this pull request Mar 28, 2023
Comment on lines +141 to +158
"archs": [
"SCMP_ARCH_AARCH64",
"SCMP_ARCH_ARM",
"SCMP_ARCH_MIPS",
"SCMP_ARCH_MIPS64",
"SCMP_ARCH_MIPS64N32",
"SCMP_ARCH_MIPSEL",
"SCMP_ARCH_MIPSEL64",
"SCMP_ARCH_MIPSEL64N32",
"SCMP_ARCH_PPC",
"SCMP_ARCH_PPC64",
"SCMP_ARCH_PPC64LE",
"SCMP_ARCH_S390",
"SCMP_ARCH_S390X",
"SCMP_ARCH_X32",
"SCMP_ARCH_X86",
"SCMP_ARCH_X86_64"
],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how is the runtime supposed to query this list from libseccomp? Is it expected to hardcode the list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hardcode

Yes: https://github.com/opencontainers/runc/blob/v1.1.7/libcontainer/seccomp/config.go#L54-L71
But I guess it is not difficult to add a new function to libseccomp.

You can also keep the list nil.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants