
We want to start running all containers by default with /dev/fuse. #362

Closed
rhatdan opened this issue Jun 4, 2020 · 32 comments

@rhatdan commented Jun 4, 2020

This would allow us to more easily run fuse-overlayfs, as well as other user-space file systems, if the container is running in a user namespace. We have lots of users attempting to run buildah within a container, and this would enable it.

I plan on adding "/dev/fuse" to crio.conf at installation and having the OS automatically load the fuse kernel module.

I don't see this as much of a security risk, since /dev/fuse can already be used by non-root users with no capabilities or privileges.
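
For illustration, a minimal sketch of what such a default could look like as a CRI-O drop-in, assuming the additional_devices option under [crio.runtime]; the file name and exact values are illustrative, not taken from this thread:

# Hypothetical drop-in: /etc/crio/crio.conf.d/99-fuse.conf
[crio.runtime]
# format: <device-on-host>:<device-in-container>:<permissions>
additional_devices = [
    "/dev/fuse:/dev/fuse:rwm",
]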

@cgwalters (Member)

It sounds like this change was already merged in cri-o/cri-o#3822

My feeling on this, though, is that changes to the default container "attack surface" and available APIs need a bit more of a formal process. I think it would be worth making this a full enhancement.

@cgwalters (Member)

So this is the equivalent of podman run --device /dev/fuse on by default, right? Except that, because our default SELinux policy denies dynamic kernel module loads, you are proposing to unconditionally load the module at system boot?

@rhatdan (Author) commented Jun 5, 2020

Yes.
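
For reference, loading the fuse module at boot would typically be done with a systemd modules-load.d drop-in; a minimal sketch (the file name is illustrative):

# Hypothetical file: /etc/modules-load.d/fuse.conf
# systemd-modules-load loads each listed module at boot
fuse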

@haircommander (Member)

that PR was closed, not merged, because we've been squabbling on the approach 😃

@cgwalters (Member)

> that PR was closed, not merged, because we've been squabbling on the approach 😃

Ah right, ok. So basically I think some squabbling on this is warranted, because while I'd agree it's probably mostly safe, it's still a notable change. (For example, if we turn this on by default, it becomes another thing that differs from e.g. upstream Kubernetes + Docker.)

@rhatdan (Author) commented Jun 6, 2020

We cannot be locked into Docker forever. Docker made some choices that were not correct, and we need to be able to evolve.

Google is not using what Docker did; they are using gVisor containers. Amazon is using Firecracker, and Alibaba is using Kata. We need to be able to evolve past what Docker decided 7 years ago.

The important thing is that the OCI images stored at container registries are able to run and that we are able to pass all of the Kubernetes test suites. If we are able to handle additional workloads that Kubernetes upstream cannot, that is not a bad thing.

@cgwalters (Member)

Totally agree! Again, I'm not saying we should block on anything, just that we should have at least a slightly formal process before we expose new APIs/attack surfaces to containers by default.

@rhatdan (Author) commented Jun 8, 2020

@haircommander Could you reopen your PR? We now have this issue with OpenShift.

@haircommander (Member)

It is not clear we've figured out the best route. The options are:

  • have cri-o unconditionally set it if /dev/fuse is mounted
  • add an additional device in MCO
  • add it as an additional device in the RPM

To me, the second option seems the most transparent, as well as the most idiomatic.

@rhatdan (Author) commented Jun 8, 2020

I am fine with 2 or 3.

@rhatdan (Author) commented Jun 8, 2020

@runcom WDYT?

@dustymabe (Member)

I'm a bit late to this party, but I'm wondering if we could just have containers that need /dev/fuse request it, like can be done for /dev/kvm today with device plugins. See https://kubevirt.io/2018/KVM-Using-Device-Plugins.html
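
For reference, a device exposed through a device plugin is consumed by requesting its extended resource in the pod spec; a minimal sketch based on the linked KubeVirt post (the resource name devices.kubevirt.io/kvm comes from that post, and a /dev/fuse plugin would need its own analogous resource name):

apiVersion: v1
kind: Pod
metadata:
  name: kvm-consumer        # illustrative name
spec:
  containers:
  - name: app
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ['sleep', 'infinity']
    resources:
      limits:
        devices.kubevirt.io/kvm: "1"   # extended resource advertised by the device plugin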

@dustymabe (Member)

But maybe that's not desired because it would require the user to specify that they want it, and we want it to just be there by default?

@haircommander (Member)

Yeah, a user can already ask to have /dev/fuse bind mounted into the container.

The currently proposed approach is that we're mixing a user-namespace-capable RuntimeClass with the needs of a builder pod. We'll be gating the user namespace annotations to a specific RuntimeClass (so it can be gated by SCC easily). We can give that RuntimeClass /dev/fuse unconditionally, which will allow builder pods to use it. That won't give it to every pod, but theoretically it covers most of the pods that need it.

@jskov-jyskebank-dk

> yeah a user can already ask to have /dev/fuse bind mounted into the container

Interesting, @haircommander. How does that work?

I spent a fairly long time trying to get exactly that working on OCP this summer, and had to settle for VFS in my (non-privileged) pods. The saga is here: containers/podman#6667

I would super appreciate it if you could be very specific, so I (n00b-level) can understand it :)

Cheers!

@haircommander (Member)

... it turns out I may have spoken without fully understanding. It was my impression that one could ask for /dev/fuse to be mounted into the container and it would all magically work, but I've never actually done it (thus I am on the n00b level 😄). I think my impression came from @rhatdan's comment above describing how a user can pass /dev/fuse to the additional_devices field, which is not allowed in OCP. I'm going to play with it a bit, but feel free to assume I was incorrect; I am doing so.

@dustymabe (Member)

> I spent a fairly long time trying to get exactly that working on OCP this summer, and had to settle for VFS in my (non-privileged) pods. The saga is here: containers/podman#6667

@jskovjyskebankdk - were you able to get /dev/fuse to work for privileged pods? I'm trying to mount /dev/fuse in via hostPath, and I can't get that to work even with privileged. I still get fuse: failed to open /dev/fuse: Operation not permitted.

I do think that if I get proper /dev/fuse access I might not even need privileged, if I execute an unshare first. See these two (1 2) comments for why.

@jskov-jyskebank-dk

@dustymabe I tried, but it did not work.

I think I recall @rhatdan commenting in some other issues that /dev/fuse should not be mounted into the container, but created inside it.

@rhatdan (Author) commented Sep 2, 2020

Yes. From an SELinux point of view, adding /dev/fuse via the spec file gets it labeled correctly, while volume mounting it in gives it the host's label. It also works better with user namespaces.

@jskov-jyskebank-dk

I can now use /dev/fuse in a pod, running privileged.

I am not sure if it is because we now run 4.4, or because I made a mistake the last time I tried this.

I tried running a privileged pod while dropping capabilities SYS_ADMIN and others, but it does not seem to matter when running privileged. (Maybe because that ultimately means running as the host's user id 0?)

@rhatdan can you think of a way for the admins hosting our OCP instance to tweak the (host) SELinux settings, so that pods on OCP can use a host-mounted /dev/fuse without running privileged?
In other words: can we hack our way to the feature this issue describes?

We switch to a new 4.6 cluster next week or so, if that makes a difference in any way.

@haircommander (Member)

@jskovjyskebankdk you can add /dev/fuse:/dev/fuse to the default_mounts_file (/etc/containers/mounts.conf) to pass /dev/fuse to all containers run by CRI-O. In 4.7 we're going a different and finer-grained route, but that should work for the time being if you don't mind all containers having access to /dev/fuse.
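
For clarity, a sketch of what that file would contain, based on the host:container mapping named above:

# /etc/containers/mounts.conf
# one <host-path>:<container-path> mapping per line
/dev/fuse:/dev/fuse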

@jskov-jyskebank-dk

Thank you @haircommander !

Looking forward to 4.7, and will probably try this on our current platform.

Cheers!

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label on Feb 22, 2021
@zvonkok commented Feb 22, 2021

Any update on this? I have a use case for a custom build strategy with buildah in Shipwright. @adambkaplan, could we make this (non-privileged builds) also "work" in Shipwright on OpenShift?

@zvonkok commented Feb 22, 2021

/remove-lifecycle stale

openshift-ci-robot removed the lifecycle/stale label on Feb 22, 2021
@haircommander (Member)

We have moved to enabling this when a pod specifically requests a user namespace, i.e., gated based on runtime class.
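
For illustration, a sketch of what such runtime-class gating could look like as a CRI-O drop-in; the runtime name and the io.kubernetes.cri-o.userns-mode annotation are assumptions based on CRI-O's allowed_annotations mechanism, not details given in this comment:

# Hypothetical drop-in: /etc/crio/crio.conf.d/99-userns.conf
[crio.runtime.runtimes.userns]
runtime_path = "/usr/bin/runc"
runtime_type = "oci"
# annotations that pods using this runtime class are allowed to set
allowed_annotations = [
    "io.kubernetes.cri-o.userns-mode",
    "io.kubernetes.cri-o.Devices",
]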

@haircommander (Member)

I would say this can be closed.

rhatdan closed this as completed on Feb 22, 2021
@jskov-jyskebank-dk

@haircommander sorry to reach out from a closed issue.

But the OCP enhancement you linked seems to be stale. At least it did not make it in time for OCP 4.7.

Is there a way to make per-Pod /dev/fuse mounts in OCP 4.7 (which we run now)?
Or plans (elsewhere) for that to be possible in a later release?

Thanks!

@haircommander (Member)

You'd want to use the allowed_annotations capability of CRI-O: https://github.com/cri-o/cri-o/blob/master/docs/crio.conf.5.md
Specifically, if you add the file

[crio.runtime.runtimes.fuse]
runtime_path = "/usr/bin/runc"
runtime_root = "/run/runc"
runtime_type = "oci"
allowed_annotations = [
    "io.kubernetes.cri-o.Devices",
]

to /etc/crio/crio.conf.d/,

you can specify the runtime class as fuse and add the annotation:
io.kubernetes.cri-o.Devices: "/dev/fuse"

which will instruct CRI-O to add it as a device.

@jskov-jyskebank-dk

Thanks!

Will it take care of the SELinux labelling?

@haircommander (Member)

I believe so, though I haven't personally checked.
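
One way to check would be to inspect the device's SELinux label from inside a running pod; a sketch, where the pod name is illustrative (it matches the example later in this thread):

# -Z prints the SELinux security context of /dev/fuse inside the pod
$ oc exec -it fuse-test -- ls -lZ /dev/fuse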

@jacobshivers

I am leaving this for posterity in the event someone is trying to test this and would like a reference:

  • base64 encoding for MachineConfig
$ base64 --wrap=0 <<EOF
> [crio.runtime.runtimes.fuse]
> runtime_path = "/usr/bin/runc"
> runtime_root = "/run/runc"
> runtime_type = "oci"
> allowed_annotations = [
>     "io.kubernetes.cri-o.Devices",
> ]
> EOF
W2NyaW8ucnVudGltZS5ydW50aW1lcy5mdXNlXQpydW50aW1lX3BhdGggPSAiL3Vzci9iaW4vcnVuYyIKcnVudGltZV9yb290ID0gIi9ydW4vcnVuYyIKcnVudGltZV90eXBlID0gIm9jaSIKYWxsb3dlZF9hbm5vdGF0aW9ucyA9IFsKICAgICJpby5rdWJlcm5ldGVzLmNyaS1vLkRldmljZXMiLApdCg==
  • MachineConfig that extends crio functionality
$ cat fuse-MachineConfig.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-crio-fuse
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - path: /etc/crio/crio.conf.d/99-crio-fuse.conf
        overwrite: true
        contents:
          source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS5ydW50aW1lcy5mdXNlXQpydW50aW1lX3BhdGggPSAiL3Vzci9iaW4vcnVuYyIKcnVudGltZV9yb290ID0gIi9ydW4vcnVuYyIKcnVudGltZV90eXBlID0gIm9jaSIKYWxsb3dlZF9hbm5vdGF0aW9ucyA9IFsKICAgICJpby5rdWJlcm5ldGVzLmNyaS1vLkRldmljZXMiLApdCg==
  • Create MachineConfig
$ oc create -f fuse-MachineConfig.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-crio-fuse created
  • RuntimeClass for fuse that will be used by pod
$ cat fuse-RuntimeClass.yaml 
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: fuse
handler: fuse
  • Create RuntimeClass
$ oc create -f fuse-RuntimeClass.yaml
  • Sample pod
$ cat fuse-podSpec.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: fuse-test
  annotations:
    io.kubernetes.cri-o.Devices: "/dev/fuse"
spec:
  containers:
  - name: fuse-test
    image: ubi8/ubi-minimal
    command: ['sh', '-c', 'echo "Hello from user $(id -u)" && sleep infinity']
  runtimeClassName: fuse
  • Create pod
$ oc create -f fuse-podSpec.yaml
  • Check for /dev/fuse within pod
$ oc exec -it fuse-test -- /bin/ls -l /dev/fuse
crw-rw-rw-. 1 root root 10, 229 Feb  4 15:41 /dev/fuse

If it is necessary to call mount() within the pod to access a filesystem via /dev/fuse, then the SYS_ADMIN capability must be extended to the pod and an appropriate SCC provisioned. Below is an example of such an SCC and pod specification.

$ cat fuse-SCC.yaml
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities:
- SYS_ADMIN
allowedUnsafeSysctls:
- '*'
apiVersion: security.openshift.io/v1
defaultAddCapabilities: null
fsGroup:
  type: RunAsAny
groups:
- system:cluster-admins
- system:nodes
- system:masters
kind: SecurityContextConstraints
metadata:
  annotations:
    kubernetes.io/description: Test scc for fuse mounts.
  name: fuse-scc
priority: null
readOnlyRootFilesystem: false
requiredDropCapabilities:
- KILL
- MKNOD
- SETUID
- SETGID
runAsUser:
  type: RunAsAny
seLinuxContext:
  type: MustRunAs
seccompProfiles:
- '*'
supplementalGroups:
  type: RunAsAny
users:
- system:admin
- system:serviceaccount:builder
volumes:
- configMap
- downwardAPI
- emptyDir
- persistentVolumeClaim
- projected
- secret
$ cat fuse-podSpec.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: fuse-test-fuse-scc
  annotations:
    io.kubernetes.cri-o.Devices: "/dev/fuse"
spec:
  containers:
  - name: fuse-test-fuse-scc
    image: registry.access.redhat.com/ubi8/ubi:8.5-226
    command: ['sh', '-c', 'echo "Hello from user $(id -u)" && sleep infinity']
    securityContext:
      capabilities:
        add: ["SYS_ADMIN"]
  runtimeClassName: fuse
$ oc create -f fuse-SCC.yaml
$ oc create -f fuse-podSpec.yaml
