Skip to content

Commit

Permalink
Add enhancement for supporting user namespaces
Browse files Browse the repository at this point in the history
Signed-off-by: Peter Hunt <pehunt@redhat.com>
  • Loading branch information
haircommander committed Jul 15, 2024
1 parent 0977611 commit edb1513
Showing 1 changed file with 294 additions and 0 deletions.
294 changes: 294 additions & 0 deletions enhancements/kubelet/user-namespaces-support.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,294 @@
---
title: user-namespaces-support
authors:
- haircommander
reviewers:
- rphilips
- giuseppe
approvers: # A single approver is preferred, the role of the approver is to raise important questions, help ensure the enhancement receives reviews from all applicable areas/SMEs, and determine when consensus is achieved such that the EP can move forward to implementation. Having multiple approvers makes it difficult to determine who is responsible for the actual approval.
- mrunalp
api-approvers: # In case of new or modified APIs or API extensions (CRDs, aggregated apiservers, webhooks, finalizers). If there is no API change, use "None"
- deads2k
creation-date: 2024-06-17
last-updated: 2024-06-17
tracking-link: # link to the tracking ticket (for example: Jira Feature or Epic ticket) that corresponds to this enhancement
- https://issues.redhat.com/browse/OCPNODE-2000
see-also:
- N/A
replaces:
- N/A
superseded-by:
- N/A
---

# User Namespaces Support

## Summary

In Kubernetes 1.30, support for User Namespaces went to Beta. Adding support in Openshift will allow users to gain access to additional users
in a container in a safe way, as well as open up avenues for running podman inside of an unprivileged Openshift pod.
To integrate into Openshift, we must enable the feature UserNamespacesSupport and UserNamespacesPodSecurityStandards,
integrate support into SecurityContextContraints (SCC), and add a feature that controls for Kubelet version skew.

This feature relies on work already done in the kernel to support idmapped mounts, or a mechanism to allow filesystems to be user namespaces aware.
This work was merged in RHEL 9.4.

## Motivation

Originally implemented in the linux kernel 3.8, user namespaces have long been a goal of the Kubernetes community.
[KEP-127](www.github.com/kubernetes/enhancements/issues/127) is one of the oldest still-open KEPs today.
Part of the push for user namespaces is it gets containers closer to the aspirational goal of a virtualized host: putting a process
in a user namespace means it can have "privileges" inside the container, while being unprivileged on the host. Further, this
divide between the container's namespace and the host's means an admin can allow users to gain access to privileges within the container,
while being able to trust that the kernel doesn't grant them on the host. A consequence of this is users can, for instance, run podman
within an Openshift pod without being in a privileged namespace.

### User Stories

* As an Openshift user, I would like to be able to run my container as root without needing to be trusted by the platform.
* As a user of Openshift Devspaces, I would like to run podman within the Devspace.
* As an Openshift admin, I would like to run untrusted users on a tighter security profile than the SCC restricted-v2.
* As an Openshift admin, I would like to ensure pods that request user namespaces are confined to one.

### Goals

- Enable Openshift users to request a pod be put in a user namespace
- Update SCC to take user namespaces into account when choosing the security profile of a container
- Add support for users to run podman within an Openshift pod without being privileged.

### Non-Goals

-

## Proposal

There are three pieces to this proposal:
- Extend SCC to be aware of the hostUsers field:
- Add a new field AllowHostUser to SCC, which will be Allowed by default
- Add a new SCC to the default list: restricted-v3
- This SCC will be identical to restricted-v2, but have AllowHostUsers set to Disallowed
- Add a new SCC to the default list: container-in-pod. It will have:
- SELinux context set to MustRunAs.Type: container_engine_t
- RunAsUserStrategy: RunAsAny
- RequiredDropCapabilities: None
- AllowHostUser: Disallowed
- And otherwise mirror the restricted-v2 profile
- Add the features UserNamespacesSupport and UserNamespacesPodSecurityStandards to the list that qualify a cluster as TechPreviewNoUpgrade

### Workflow Description

N/A

### API Extensions

- Add AllowHostUser to SCC. This relies on approval from the apiserver team.

### Topology Considerations

#### Hypershift / Hosted Control Planes

- Support for a user to change the node config object will have to be investigated.

#### Standalone Clusters

I don't think there are any special topology considerations for standalone.

#### Single-node Deployments or MicroShift

From my understanding, there should not be any large resource consumption changes for this feature.

### Implementation Details/Notes/Constraints

#### SecurityContextConstraint Updates

##### `AllowHostUsers` field

Add the following to the [Openshift API](https://github.com/openshift/api/blob/76a71dac36a08eab1b240c6c8d4e39c813b1b12b/security/v1/types.go):

```
// allowHostUsers determines if the policy allows host users in containers.
// Valid values are "Allowed", "Disallowed" and omitted.
// When omitted, this means no opinion and the platform is left to choose a reasonable default.
// The current default is "Allowed".
// +openshift:enable:FeatureGate=UserNamespacesSupport
// +kubebuilder:validation:Enum="Allowed";"Disallowed";""
// +kubebuilder:validation:Optional
AllowHostUsers HostUsersStrategyType `json:"allowHostUsers" protobuf:"bytes,26,opt,name=allowHostUsers"`
...
// AllowHostUsersStrategyType shows the allowable values for AllowHostUsers
// While it's an enum, it's practically a boolean, as it is mirroring the other AllowHost*
// fields, while maintaining current API conventions.
type AllowHostUsersStrategyType string
...
const (
// HostUsersStrategyAllowed allows the use of HostUsers in a pod.
HostUsersStrategyAllowed HostUsersStrategyType = "Allowed"
// HostUsersStrategyDisallowed denies the use of HostUsers in a pod.
HostUsersStrategyDisallowed HostUsersStrategyType = "Disallowed"
// HostUsersStrategyEmpty will set to the default, which is currently "Allowed".
HostUsersStrategyEmpty HostUsersStrategyType = ""
...
)
```

This value will function similarly to its peers corresponding to PID, IPC and Network namespaces, except it is a string field rather than built-in boolean.
This is to follow current API conventions, which maintain boolean fields are not extendable enough.

##### restricted-v3

**NOTE TO REVIEWER** I am not very opinionated on this being present. I think it would provide value to customers, but I don't know that folks are asking for it.
If this proposal has too much, this would be the first thing I would drop.

this SCC profile will be identical to the existing restricted-v2, except it will set the `AllowHostUser` to `Disallowed`,
thus forcing pods to be in a user namespace. This will make it a more restrictive profile, as the user on the host will not be
the same as the one inside the container.

##### container-in-pod

**NOTE TO REVIEWER** I think the naming here will be a source of contention.

The intention of this SCC is to allow a user to run `podman` or other container engine inside of an Openshift pod.
Since user namespaces allow a process to gain access to the capabilities needed in a safe way, it's a natural addition
to the proposal of adding user namespaces generally.

This SCC will largely mirror the `restricted-v2` SCC, but have a couple of changes.
- SELinux context set to MustRunAs.Type: `container_engine_t`
- This SELinux type has been developed to allow the majority of podman in pod situations, and can
continue to be adapted without affecting the normal `container_t` which should be more restrictive.
- RunAsUserStrategy: RunAsAny
- Any user should be allowed, as the user running in the container is not the same running outside.
- RequiredDropCapabilities: None
- Inside of a user namespace, the capabilities a pod requests are only present in the user namespace,
not on the host. Thus, even for a less trusted user, the capabilities should be safe to access.
- AllowHostUser set to Disallowed

#### Feature Gates and Sets

Finally, add the features UserNamespacesSupport and UserNamespacesPodSecurityStandards to the list that qualify a cluster as TechPreviewNoUpgrade,
for the 4.17 cycle. We'll address whether we can move the feature out of tech preview after that.

### Risks and Mitigations

- Allowing user namespaces does open Openshift users to theoretical kernel vulnerabilities
- While user namespaces have existed for a while, kernel concepts like idmapped mounting
have not, and the newness could be seen as risky.
- However, the tangible security advantages of allowing user namespaces outweigh the theoretical
security risks.
- Plus, for the majority of users, user namespaces will not be enabled to begin with.

### Drawbacks

## Open Questions [optional]

- Should we include the restricted-v3 profile in this enhancement?
- Is there a better name than `container-in-pod`
- Do we need to include `container-in-pod` OOTB if we can allow a cluster admin to create it?

## Test Plan

- e2e tests for a user namespaced pod, especially with different volume types to verify kernel idmapped mount support
- long-term: e2e tests for running podman in a pod, so we have an established test path for users to know what works and what doesn't.

## Graduation Criteria

### Dev Preview -> Tech Preview

- All pieces of this enhancement implemented, within the TPNU feature set
- Extensive documentation on enablement and common pitfalls
- Unit and e2e coverage
- Gather feedback from users rather than just developers (there are customers who are interested in trying this out)

### Tech Preview -> GA

- More testing (upgrade, downgrade, scale)
- Sufficient time for feedback
- Available by default
- Backhaul SLI telemetry
- Document SLOs for the component
- Conduct load testing
- User facing documentation created in [Openshift-docs](https://github.com/Openshift/Openshift-docs/)

### Removing a deprecated feature
N/A

## Upgrade / Downgrade Strategy

- To begin with, this feature will be gated by TechPreviewNoUpgrade, so it will not be able to upgrade
- A downgrade of the apiserver will cause this feature to not work as well, as the SCC changes will be lost on older versions
- A downgrade of the kubelet down to 1.28 should continue to work
- Support for idmapped mounting was added in 1.28, so pods with volumes that have the hostUsers field specified will fail.
- Since this feature is being added in 4.17, the version skew supported goes down to 4.14, meaning it must stay in TechPreview for 4.17

In the future, this feature will have the TechPreviewNoUpgrade flag removed, at which point all supported Kubelets and apiservers will
be aware of the feature gate and attempt to create a pod with a user namespace. The only special upgrade/downgrade considerations are
for the SCC changes, which users will loose access to if the cluster downgrades.

## Version Skew Strategy

In Kubernetes and Openshift, a version skew of n-3 between the kubelet and apiserver is supported. The key consideration in version skew:
if the kube-apiserver believes the cluster supports user namespaces, will every supported kubelet create a pod with a user namespace?

There is risk if this does not happen, as we intend on having the apiserver relax validation for a pod that it believes is confined by a user namespace,
when in reality it is not, thus leading to security vulnerability.

In this enhancement, we assume homogeneous feature gates, where both the kubelet and kube-apiserver have the same feature gates set.

Support for user namespaces were intitally added in 1.27 without idmapped mount support. However, it used a different feature gate
`UserNamespacesStatelessPodsSupport`. As such, 4.14 kubelet does not create an pod with a user namespace when `UserNamespacesSupport` feature
is added. Thus, the skew from 4.17->4.14 is not supported, and the feature must stay in tech preview.

As for the future possibility of GA'ing as early as 4.18, the similar conversation happens. In this case, 4.15 has support for the `UserNamespacesSupport`
feature gate, and kubelet will create a user namespace for the pod. Further, support was added in CRI-O to deny a pod that was created with ID mapped mounts,
but the kernel doesn't support IDmapped mounts in 4.15/1.28. The kernel won't support them until 4.16 (when RHCOS is released based on RHEL 9.4).

Thus, we are safe to GA as early as 4.18, as CRI-O will fail to create a container in 4.15 that doesn't have idmapped mount support.
Even pods that have no volumes do need to have mounts done, and since the RHEL 9.2 kernel doesn't support the idmapped mount options,
all user namespaced pods will fail on 4.15 and below.

## Operational Aspects of API Extensions

- Describe the possible failure modes of the API extensions.
- The only API extension here is to SCC. The field will mirror existing ones, and thus have been vetted by time

## Support Procedures

Generally, failures in this feature will result in container creation failures for newly created containers that
use `hostUsers: false`. Some of these are platform problems; the kernel, CRI-O, kubelet and apiserver need to support
idmapped mounting.

However, there will be some configuration needed by the user. For instance, to use user namespaces, the OCI runtime also
needs to support the hostUsers field, but crun is the only packaged version that does as of today. If runc 1.2.0 is released
before 4.17, then it can be packaged and included in 4.17 and work OOTB. However, some users will need to update the OCI runtime
with a container runtime config

```
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
name: enable-crun-worker
spec:
machineConfigPoolSelector:
matchLabels:
pools.operator.machineconfiguration.openshift.io/worker: ""
containerRuntimeConfig:
defaultRuntime: crun
```

In most cases, the issues can be identified in the pod status, though looking through the kubelet and CRI-O logs may also be
needed.

There should be no issues to pods without `hostUsers: false`.

## Alternatives

- Less integration with SCC is possible, as currently proposed it gives the most flexibility for admins to mandate which users can access user namespaces.
- It was previously discussed to have a mechanism to allow a kubelet to fail to run if user namespaces aren't supported.
- It was determined this was not needed, as the earliest this feature could go GA, all supported kubelets support the feature gate.
Even though it's not supported by the nodes themselves, there will not be any vulnerabilities opened from an apiserver thinking a pod has user
namespaces when corresponding kubelet/CRI-O doesn't actually create the pod with them.

## Infrastructure Needed [optional]

N/A

0 comments on commit edb1513

Please sign in to comment.