
do we need to run buildah containers always with BUILDAH_ISOLATION = chroot #5818

Open
himmatss opened this issue Nov 6, 2024 · 16 comments


himmatss commented Nov 6, 2024

Hi,

I have a buildah container image (quay.io/buildah/stable:latest) running in Kubernetes with the default setting of BUILDAH_ISOLATION=chroot. However, I am wondering: is this really required when running buildah as a container?

Can someone please explain this:
https://github.com/containers/buildah/blob/main/docs/buildah-build.1.md
_"--isolation type

Controls what type of isolation is used for running processes as part of RUN instructions. Recognized types include oci (OCI-compatible runtime, the default), rootless (OCI-compatible runtime invoked using a modified configuration, with --no-new-keyring added to its create invocation, reusing the host's network and UTS namespaces, and creating private IPC, PID, mount, and user namespaces; the default for unprivileged users), and chroot (an internal wrapper that leans more toward chroot(1) than container technology, reusing the host's control group, network, IPC, and PID namespaces, and creating private mount and UTS namespaces, and creating user namespaces only when they're required for ID mapping).

Note: You can also override the default isolation type by setting the BUILDAH_ISOLATION environment variable. export BUILDAH_ISOLATION=oci"


nalind commented Nov 6, 2024

In many cases, a container that's run using the image will not be given enough privileges for buildah run or the handling of RUN instructions in Dockerfiles in buildah build to be able to launch a container using an actual runtime like crun or runc. The chroot-based method is severely limited in functionality compared to crun or runc, but in return it exercises fewer privileges than they might, so it works (or "works") in a number of cases where they might not. If your environment provides enough privileges to not have to use chroot, feel free to override it.
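
For reference, both override mechanisms are the documented ones quoted earlier in this issue; a minimal sketch (the image/tag name is just a placeholder):

```bash
# Per invocation, using the documented --isolation flag:
buildah build --isolation=oci -t registry.example.com/app:latest .

# Or for the whole session, overriding the image's BUILDAH_ISOLATION=chroot:
export BUILDAH_ISOLATION=oci
```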


himmatss commented Nov 6, 2024

Thanks @nalind for your reply.
The documentation says the default value for BUILDAH_ISOLATION is "oci", but in the Containerfile of the image quay.io/buildah/stable:latest it appears to have BUILDAH_ISOLATION=chroot set:
https://github.com/containers/image_build/blob/main/podman/Containerfile
https://github.com/containers/image_build/blob/main/buildah/Containerfile
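
For context, the override in those Containerfiles boils down to an environment variable baked into the image, roughly along these lines (paraphrased, not a verbatim copy of the linked files):

```dockerfile
ENV BUILDAH_ISOLATION=chroot
```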


nalind commented Nov 6, 2024

Yes, the container image has the environment variable set in it to override the compiled-in default.

@chmeliik

I have a similar need to run buildah in Kubernetes with better isolation.

If your environment provides enough privileges to not have to use chroot, feel free to override it.

What privileges are those? How can I check if the environment provides them?


nalind commented Dec 2, 2024

For handling RUN instructions, it's a combination of

  • Being able to create a user namespace with multiple IDs mapped into it, or being started as UID 0 and having CAP_SYS_ADMIN, so that it doesn't need to do those things to set up a namespace where those things are true. If you're writing the pod spec, hostUsers: false may provide some of this.
  • Being able to create bind and overlay mounts for the volumes that it provides (this generally requires CAP_SYS_ADMIN).
  • Being able to chroot into the rootfs to make changes inside of it (CAP_SYS_CHROOT).
  • Being able to configure networking for a namespace that it creates if "host" networking isn't specified. There's no reason to not use "host" networking when we're in a container, because from buildah's point of view, the container's network is the host network, but that's configurable, and the hard-coded defaults don't assume being run inside of a container.
  • Being able to successfully execute a command using runc, or crun, or a comparable runtime that can be invoked similarly. That last part introduces some requirements of its own that we don't have control over.

Some of these operations can also be denied by the seccomp filter, or by the SELinux policy (or other mandatory access control rules), and it's entirely possible that I'm still forgetting some things. For me, it tends to be a trial-and-error process.
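
To make the list above concrete, here is a rough, untested Pod-spec fragment showing where some of those knobs live in Kubernetes; the names and values are illustrative assumptions, not a recommendation, and whether your cluster admits such a pod is a separate question:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: buildah-example            # hypothetical name
spec:
  hostUsers: false                 # request a user namespace for the pod (needs the
                                   # user-namespaces feature gate discussed later in the thread)
  containers:
    - name: buildah
      image: quay.io/buildah/stable:latest
      securityContext:
        capabilities:
          add:
            - SYS_ADMIN            # bind/overlay mounts for volumes
            - SYS_CHROOT           # chroot into the build rootfs
```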


github-actions bot commented Jan 2, 2025

A friendly reminder that this issue had no activity for 30 days.


chmeliik commented Feb 5, 2025

I've had some time to play with it. I ended up with a Pod definition that seemingly makes nested containerization possible with BUILDAH_ISOLATION=oci

buildah-pod.yaml:
apiVersion: v1
kind: Pod
metadata:
  generateName: buildah-
  labels:
    buildah-isolation-test: "true"
  annotations:
    # /dev/fuse fixes:
    #
    # fuse: device not found, try 'modprobe fuse' first
    #
    # Wouldn't be needed with STORAGE_DRIVER=vfs
    io.kubernetes.cri-o.Devices: /dev/fuse
spec:
  restartPolicy: Never
  volumes:
     - name: workdir
       emptyDir: {}

  initContainers:
    - name: create-dockerfile
      image: quay.io/containers/buildah:v1.38.1
      volumeMounts:
        - name: workdir
          mountPath: /workdir
      workingDir: /workdir
      command: ["bash", "-c"]
      args:
        - |-
          cat << EOF > Dockerfile
          FROM docker.io/library/alpine:latest

          RUN echo "hello world"
          EOF

  containers:
    - name: buildah
      image: quay.io/containers/buildah:v1.38.1
      volumeMounts:
        - name: workdir
          mountPath: /workdir
      workingDir: /workdir
      env:
        - name: BUILDAH_ISOLATION
          value: oci
        - name: STORAGE_DRIVER
          value: overlay
      command: ["bash", "-c"]
      # unshare fixes:
      #
      # error running container: from /usr/bin/crun ... opening file `/sys/fs/cgroup/cgroup.subtree_control` for writing: Read-only file system
      #
      # --mount fixes:
      #
      # Error: mount /var/lib/containers/storage/overlay:/var/lib/containers/storage/overlay, flags: 0x1000: operation not permitted
      #
      # --map-root-user fixes:
      #
      # unshare: unshare failed: Operation not permitted

      # --net=host fixes:
      #
      # error running container: from /usr/bin/crun ...: open `/proc/sys/net/ipv4/ping_group_range`: Read-only file system
      #
      # --pid=host fixes:
      #
      # error running container: from /usr/bin/crun ...: mount `proc` to `proc`: Operation not permitted
      args:
        - |-
          # can also add --pid --fork to unshare
          unshare --map-root-user --mount -- buildah build --net=host --pid=host .
      securityContext:
        capabilities:
          add:
            # SETFCAP fixes:
            #
            # unshare: write failed /proc/self/uid_map: Operation not permitted
            - SETFCAP
        seLinuxOptions:
          # container_runtime_t fixes:
          #
          # error running container: from /usr/bin/crun ...: mount `devpts` to `dev/pts`: Permission denied
          type: container_runtime_t

Test with:

kubectl delete pod -l buildah-isolation-test=true
kubectl create -f buildah-pod.yaml
sleep 5
kubectl logs -l buildah-isolation-test=true --tail=-1 --follow

@nalind could you share your thoughts on the security implications of the settings I had to use:

  • --net=host --pid=host for buildah
    • You mentioned --net=host would be OK to use, does the same apply for --pid=host?
    • Could also be combined with unshare --pid --fork, which may help mitigate potential implications of --pid=host?
  • SETFCAP for the pod to enable unshare --map-root-user
  • container_runtime_t SELinux label for the pod to get around mount `devpts` to `dev/pts`: Permission denied from crun


nalind commented Feb 5, 2025

When attempting to nest a container, the "host" namespaces are those being used by the container. If it runs, great.
Aside: with kernel 5.11 or later, or on RHEL 8.5 or later, you shouldn't need to bother with fuse-overlayfs. The kernel's overlay implementation is available and should be fine as long as storage is on an emptyDir volume (or, more specifically, not on an overlay filesystem, which is what the container rootfs is on), so you can probably add an emptyDir volume for storage and drop anything that's there purely to make /dev/fuse available to the pod.
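
As an illustration of that suggestion, a fragment that could be added to the Pod from the earlier comment (an assumed emptyDir mounted over buildah's default storage path; the volume name is a placeholder):

```yaml
  volumes:
    - name: container-storage             # hypothetical volume name
      emptyDir: {}
  containers:
    - name: buildah
      image: quay.io/containers/buildah:v1.38.1
      volumeMounts:
        - name: container-storage
          mountPath: /var/lib/containers  # keeps image/layer storage off the overlay-backed rootfs
```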


cgwalters commented Feb 5, 2025

Here's what I'm using as a one-liner reference example that seems to work rootless right now (it will also work rootful, but I think it needs more security hardening in that case; see below):

$ podman run -v /var/lib/containers --security-opt=label=disable --cap-add=all --rm -ti \
  quay.io/centos-bootc/centos-bootc:stream10 \
  podman run --rm -ti --net=host --cgroups=disabled \
    busybox echo hello

Of these, having to do --security-opt=label=disable on the outer container seems like a really important thing to fix. I find it surprising offhand... it looks like this must be covered by a dontaudit rule, as I don't see a corresponding AVC denial.

As far as security goes, I'd emphasize that from my PoV, as long as the outer container is invoked with a user namespace (as Nalin mentions, hostUsers: false in the pod spec), that provides a really key layer of security. Using unshare --map-root-user inside the container is suboptimal in that it's trying to constrain subprocesses of the inner container from inside it.


chmeliik commented Feb 6, 2025

Thanks for the suggestions!

Adding hostUsers: false didn't break anything, so I added it to the Pod spec 👍 (I also started a repo for this to track the changes better: https://github.com/chmeliik/buildah-isolation-test)

I tried removing the unshare command and adding --cgroupns=host to the buildah command, but that still failed on opening file `/sys/fs/cgroup/cgroup.subtree_control` for writing: Read-only file system. So I'm keeping unshare for now.

Unfortunately, mounting /var/lib/containers doesn't seem to work for me, either locally with podman run or in Kubernetes. I still get this error despite the kernels seemingly being new enough (5.14.0-284.100.1.el9_2.x86_64 on the Kubernetes node, 6.12.10-100.fc40.x86_64 locally):

... using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
fuse: device not found, try 'modprobe fuse' first

Now of these alone, having to do --security-opt=label=disable on the outer container seems like a really important thing to fix.

I found the labeling scary as well. It seems to work with type:container_runtime_t too, but I don't actually know what that means or how much better it is.


chmeliik commented Feb 6, 2025

... using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
fuse: device not found, try 'modprobe fuse' first

Wait, maybe that's just the config in the quay.io/containers/buildah image


nalind commented Feb 6, 2025

Unfortunately, mounting /var/lib/containers doesn't seem to work for me, neither locally with podman run nor in Kubernetes. I still get this error despite the kernels seemingly being new enough (5.14.0-284.100.1.el9_2.x86_64 on the Kubernetes node, 6.12.10-100.fc40.x86_64 locally):

... using mount program /usr/bin/fuse-overlayfs: unknown argument ignored: lazytime
fuse: device not found, try 'modprobe fuse' first

The image contains an /etc/containers/storage.conf which is configured to use fuse-overlayfs. You'll need to comment out the mount_program setting and remove the "fsync=0" argument from the mountopt setting to update the configuration to not use fuse-overlayfs.
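
Roughly, the edit to /etc/containers/storage.conf would look something like the following (a sketch only; the exact values shipped in the image may differ):

```toml
[storage]
driver = "overlay"

[storage.options.overlay]
# Comment out the fuse-overlayfs helper so the kernel overlay driver is used:
# mount_program = "/usr/bin/fuse-overlayfs"

# And drop the fuse-overlayfs-only "fsync=0" argument from mountopt:
mountopt = "nodev"
```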


chmeliik commented Feb 6, 2025

That helped! 🎉

Any thoughts on having to set type:container_runtime_t? I found it also works with type:container_engine_t; which one would be more appropriate?


nalind commented Feb 6, 2025

Outside of a container, it's usually labeled container_runtime_exec_t, so I would expect container_runtime_t to be the preferred domain to be run in.

@cgwalters

Yes, I think using --security-opt=label=type:container_runtime_t helps here. One thing that surprises me when I look is that the inner process is running as spc_t; that may be triggered by the --cap-add=all? Anyway, I think that from a security point of view, by specifying just the type here we still keep the level-based separation (I think), i.e. the categories at the end of the label. The useful property of the SELinux policy here is to ensure that two distinct containers can't touch each other's state (and to provide host protection too).
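
Applied to the one-liner from the earlier comment, that would mean something like the following (same command with only the label option changed; an untested sketch):

```bash
podman run -v /var/lib/containers --security-opt=label=type:container_runtime_t --cap-add=all --rm -ti \
  quay.io/centos-bootc/centos-bootc:stream10 \
  podman run --rm -ti --net=host --cgroups=disabled \
    busybox echo hello
```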


chmeliik commented Feb 12, 2025

Adding hostUsers: false didn't break anything, added that to the Pod spec 👍

Derp, I don't think it did anything at all. The cluster I was using to test probably doesn't enable the UserNamespacesSupport feature gate (https://kubernetes.io/docs/tasks/configure-pod-container/user-namespaces/#before-you-begin). And its nodes don't meet the requirements anyway.

I'll try to get a cluster with user namespaces actually enabled.
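
In case it helps anyone checking the same thing, enabling that looks roughly like the kubelet configuration fragment below. This is an assumption based on the linked docs, not something I've verified on this cluster; the gate usually has to be enabled on the API server as well, and the nodes need a runtime and filesystem with idmapped-mount support.

```yaml
# Sketch only: kubelet config enabling pod user namespaces (Kubernetes 1.28+,
# where the gate is named UserNamespacesSupport).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  UserNamespacesSupport: true
```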

In any case, it seems it's possible to make BUILDAH_ISOLATION=oci work without user namespaces, but it requires

  • unshare usage
  • SETFCAP on the Pod to enable unshare
  • the container_runtime_t label on the Pod

That still seems preferable to using BUILDAH_ISOLATION=chroot without those things. What do you think?
