support compose with default docker/kube privileges #1329

cgwalters · 2018-04-08T11:28:20Z

Currently we require recursive containerization support for compose; the ability to run bwrap.

In theory, we should be able to fall back on just chroot(), which is a capability is exposed.

Unfortunately that isn't going to give us privileges to e.g. mount /proc. What we'd need to do is inside the target, make those symlinks to the outer container's filesystems.

The text was updated successfully, but these errors were encountered:

cgwalters · 2018-04-08T21:31:34Z

symlinks wouldn't work obviously. Can't bind mount either since that's still mount() which requires CAP_SYS_ADMIN in the container (works fine with userns). Actually the only thing I can think of is a rather horrific hack of taking the target /usr and swapping it in over our /usr 😦. Would require careful management to ensure that we've loaded all the files we need.

In the end I think it'd probably be better to sprint towards getting usable userns in kube.

smarterclayton · 2018-04-09T16:26:51Z

That's 6 months out, no matter what, which means from then until now we can't build CoreOS on OpenShift. That seems like a problem - I'm still not clear why bwrap and other containerization is necessary when you are already inside a container? I assume you're trying to prevent the content that will be running as root from escaping and gaining access to some of the contents of the container running the container? We already operate in that mode for docker builds, and I'm not sure it's a bigger security concern

cgwalters · 2018-04-09T17:37:12Z

I'm still not clear why bwrap and other containerization is necessary when you are already inside a container?

Fundamentally we have our source rootfs (container) and we're generating a new target rootfs which is distinct from that source. Changing that assumption would be...ugly. It starts to look like this.

In this source/target separation then (which is the same scenario necessary to run the same mock code used to build RPMs today), while raw chroot() gives us the capabilities to run code in the target root, it doesn't give us the ability to e.g. mount /dev or /proc. We could downgrade to just using mkdev() for /dev but not having /proc around causes various bugs.

User namespaces is the general solution to this, it enables recursive containerization.

I assume you're trying to prevent the content that will be running as root from escaping and gaining access to some of the contents of the container running the container?

Yeah, it's slightly useful for host builds but not a big deal. However there are two cases where it's a lot more useful:

When doing client-side package layering, there is no "outer containerization" today, so rpm-ostree does its own. We for example survive rm -rf /. See also this commit.
For increasing the security of general RPM builds server side; there's a lot more to do here but having an "inner container" for scripts helps a lot

At a high level I'd say we're establishing some of the requirements to do RPM-style builds in Kube/OpenShift which is going to be a very useful thing in general. Where rpm-ostree adds additional complexity is that unlike yum/dnf (but like mock) it actually uses container features itself.

The bwrap code has some "opinionated" setup around e.g. /etc and such that I'd like to centralize even more. However, the dracut case of taking the host's `/etc` is the unusual standout. Let's split out a base constructor. Prep for coreos#1329

This is also prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What FUSE is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified (which we support today) that repository gets thrown away anyways. So there's no real value to using FUSE here. Let's hence make it so that in this scenario we don't. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: coreos#1329

In prep for cleaning up how we manage mounts. If we're looking at a real existing `/var`, then it must have the directory. The only case where we don't is in pkglayering, so move the special case there. Prep for coreos#1329

User namespaces denies overmounting existing targets as it could reveal things intended to be hidden. In the recursive container case, we need to just use e.g. `--bind /dev /dev` rather than trying to create a new instances. Further, in the selftest case we can't overmount what already exists, so skip doing that entirely. Prep for rpm-ostree-in-Kube: coreos#1329

This should all work with preceeding commits, let's start testing it. Closes: coreos#1329

cgwalters · 2018-04-10T19:30:43Z

PR in #1332

The bwrap code has some "opinionated" setup around e.g. /etc and such that I'd like to centralize even more. However, the dracut case of taking the host's `/etc` is the unusual standout. Let's split out a base constructor. Prep for coreos#1329

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: coreos#1329

In prep for cleaning up how we manage mounts. If we're looking at a real existing `/var`, then it must have the directory. The only case where we don't is in pkglayering, so move the special case there. Prep for coreos#1329

The bwrap code has some "opinionated" setup around e.g. /etc and such that I'd like to centralize even more. However, the dracut case of taking the host's `/etc` is the unusual standout. Let's split out a base constructor. Prep for #1329 Closes: #1333 Approved by: jlebon

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: #1329 Closes: #1333 Approved by: jlebon

In prep for cleaning up how we manage mounts. If we're looking at a real existing `/var`, then it must have the directory. The only case where we don't is in pkglayering, so move the special case there. Prep for #1329 Closes: #1333 Approved by: jlebon

The bwrap code has some "opinionated" setup around e.g. /etc and such that I'd like to centralize even more. However, the dracut case of taking the host's `/etc` is the unusual standout. Let's split out a base constructor. Prep for #1329 Closes: #1333 Approved by: jlebon

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: #1329 Closes: #1333 Approved by: jlebon

In prep for cleaning up how we manage mounts. If we're looking at a real existing `/var`, then it must have the directory. The only case where we don't is in pkglayering, so move the special case there. Prep for #1329 Closes: #1333 Approved by: jlebon

This is also prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What FUSE is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified (which we support today) that repository gets thrown away anyways. So there's no real value to using FUSE here. Let's hence make it so that in this scenario we don't. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.

User namespaces denies overmounting existing targets as it could reveal things intended to be hidden. In the recursive container case, we need to just use e.g. `--bind /dev /dev` rather than trying to create a new instances. Further, in the selftest case we can't overmount what already exists, so skip doing that entirely. Prep for rpm-ostree-in-Kube: coreos#1329

This should all work with preceeding commits, let's start testing it. Closes: coreos#1329

Now that we've split out the "base" constructor for the initramfs special case, the "mutable" disposition tells us how to handle binding `/etc` and `/var` in addition to `/usr`. Drop the `RPMOSTREE_BWRAP_IMMUTABLE` case as it is now only used in the script case where we handle `/var`. Next, user namespaces denies overmounting existing targets as it could reveal things intended to be hidden. In the recursive container case, we need to just use e.g. `--bind /dev /dev` rather than trying to create a new instances. Further, in the selftest case we can't overmount what already exists, so skip doing that entirely. Closes: coreos#1329

smarterclayton · 2018-04-17T04:05:06Z

In pulling together what the CI process arc looks like, the concrete requirements for this should fall out. If the cost is high enough from your investigations that you think it needs to be scheduled in, we can manage that as part of the epic that tracks what the CI process looks like.

giuseppe · 2018-04-17T15:58:30Z

Unfortunately that isn't going to give us privileges to e.g. mount /proc. What we'd need to do is inside the target, make those symlinks to the outer container's filesystems.

Colin, how much is important to get a new pid namespace and a new /proc? Without the new pid ns requirement, it should be possible to run bwrap recursively (with containers/bubblewrap#256) and bind mount proc instead of mounting a new procfs. I had a look at it for running Buildah as unprivileged user, but having runc for each RUN gives less flexibility with /proc.

I gave it a quick try (bwrap under bwrap under docker), from the bwrap build directory:

$ echo '{}' > /tmp/seccomp.json
$ sudo docker run --security-opt seccomp=/tmp/seccomp.json --rm -ti -v $(pwd):/foo:Z -w /foo fedora \
./bwrap --uid 0 --cap-add ALL --unshare-user --bind / /  \
./bwrap --unshare-user --bind / / --bind /proc /proc dnf install -y nano

....

Installed:
  nano.x86_64 2.8.7-1.fc27                                                                                                                                                                    

Complete!

There is an issue for Moby to not have masked paths on /proc and thus enable mounting a procfs as unprivileged user: moby/moby#36597, I guess img is hitting the same issue. It won't probably help much though with users we don't want to trust

cgwalters · 2018-04-18T11:55:10Z

Colin, how much is important to get a new pid namespace and a new /proc?

Without a pid namespace (and without a seccomp policy in place) one can easily kill()/ptrace() etc. the parent processes. So: important.

giuseppe · 2018-04-18T14:38:08Z

Without a pid namespace (and without a seccomp policy in place) one can easily kill()/ptrace() etc. the parent processes. So: important.

Isn't relaxing seccomp required in any case when a new namespace must be created inside of a Docker container? Same for a new pid namespace + procfs mount: that is not permitted in an unprivileged Docker container as /proc has masked/readonly paths. Clarification: the pid namespace can be created but that will make /proc unusable as you cannot mount a new one.

In any case, I was just wondering if the recursive bwrap could help with anything as it can be used for launching rpm-ostree which in turn uses bwrap internally.

cgwalters · 2018-05-23T14:49:56Z

Maybe the path we should be looking at is doing a Custom builder specifically for the OpenShift casse?

smarterclayton · 2018-05-25T20:02:12Z

Custom builders are effectively dead, so not really a practical option. The problem is that building this content shouldn't require privileges, and i don't really want to encourage anyone to think that privileged container image builders are necessary or a good idea. When user namespaces and cri-o lands in openshift, and we start dropping privileges for those containers, I don't really want there to be this one special case.

Maybe I'm being unreasonable, but in a properly confined container that is running as uid 0, there's nothing special about what rpm-ostree is doing that requires isolation other than some trickery. You are already confined when you run.

smarterclayton · 2018-05-25T20:04:31Z

Mounting proc is in order to prevent what? Or allow what?

Inside of a containerized runner, we're not defending the container from the content (we will already have ways of defending that that will grow over time). We're just trying to ensure the content doesn't escape the container. Malicious content in a build container is not our threat model here, because all user code that ends up in the final image has to be considered malicious.

cgwalters · 2018-08-21T17:47:37Z

I forgot to comment here too, but the basic thing is that we're generating a root filesystem which is not the same as the source, and we want to run scripts in the target, and it really works best if those scripts are run as containers.

One of the biggest "scripts" is dracut, which generates the initramfs which is is another little filesystem generated from our target. We could in theory teach dracut how to deal with a src != target scenario but it'd be quite nontrivial, and a big risky change to an under-tested but critical component of the OS.

Mix in the major changes here with the fundamental thing that down the line, it should be quite possible to make recursive containerization can be made to work inside Kubernetes, I'd rather limp forward with the story that "use --privileged for now". See containers/bubblewrap#269 for some discussion on recursive containers.

This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.

This is prep for running inside (unprivileged) Kube containers as they exist today: #1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work. Closes: #1591 Approved by: jlebon

This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.

This is prep for running inside (unprivileged) Kube containers as they exist today: coreos#1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work.

This is prep for running inside (unprivileged) Kube containers as they exist today: #1329 Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN. I think there's some work on FUSE-in-containers but I'm not sure of the current status. What rofiles-fuse here is doing here is protecting is the hardlinked repo imports. But if `--cachedir` isn't specified, that repository gets thrown away anyways. So there's no real value to using FUSE here. Also since nothing is cached, disable the devino cache. We also make use of --force-copy-zerosized that just landed in libostree: ostreedev/ostree#1752 Down the line ideally we gain the capability to detect if either unprivileged overlayfs/FUSE are available. Then if `--cachedir` is specified we can make things work. Closes: #1591 Approved by: jlebon

jlebon · 2018-10-29T19:10:02Z

BTW, there's a PR open to implement userns in Kube: kubernetes/kubernetes#64005.

adityashah1212 · 2018-11-07T18:58:35Z

This might be a little different than the issue being discussed here, but I am trying to make a custom tree using rootless podman/buildah with --privileged flag. It seems everything goes fine until it downloads files from the repository (which I assume means bwrap is able to create a container). But after that it fails with Operation not permitted. Am I doing something wrong?

dustymabe · 2018-11-07T21:25:31Z

We just merged some code in coreos-assembler to make it work unprivileged: coreos/coreos-assembler#190. You might be able to use coreos-assembler (running in a rootless podman invocation) to do a compose (it uses an unprivileged VM).

adityashah1212 · 2018-11-07T22:04:44Z

@dustymabe Thanks for the advice. I did take a look at the coreos-assembler. From what I can see is that you guys are launching a qemu-kvm instance if the container is unprivileged right? Two thing I would like to ask.

I am fine having to run privileged containers as long as they are rootless. The problem is the installation of the downloaded rpms fails.
I cannot guarantee that /dev/kvm would be available, considering the container itself maybe running is some kind of vm. So if I were to launch a qemu instance, can I do so without kvm support, i.e. a plain emulator rather than a hypervisor?

BTW I am using centos 7.5 and the custom tree is for centos 7.5

dustymabe · 2018-11-08T01:51:27Z

@dustymabe Thanks for the advice. I did take a look at the coreos-assembler. From what I can see is that you guys are launching a qemu-kvm instance if the container is unprivileged right?

right

I am fine having to run privileged containers as long as they are rootless. The problem is the installation of the downloaded rpms fails.

yeah. this isn't something I've tried. I'm not sure if there's a technical reason why it's not working, or if it is a bug in rootless containers, which are pretty new

I cannot guarantee that /dev/kvm would be available, considering the container itself maybe running is some kind of vm

If you have control of the hypervisor you could enable nested virt, but that is a big IF.

So if I were to launch a qemu instance, can I do so without kvm support, i.e. a plain emulator rather than a hypervisor?

yeah. I don't know if I recommend this, it would probably be really slow. We are calling the qemu-kvm binary in coreos-assembler, so i'm not sure if it would fall back to non hardware assisted virt (emulation), but you could try.

adityashah1212 · 2018-11-08T21:58:39Z

@dustymabe Thanks for your help. I think I will stick to using nested virt approach. This will take some effort, but I think it will be worth the time I will save later.

cgwalters · 2018-11-29T16:12:11Z

While some initial work landed on this, I think we're unlikely to spend much more time on it. I'd prefer to focus on the coreos-assembler unpriv path: #1329 (comment)

cgwalters added the kind/enhancement label Apr 8, 2018

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018

compose: Add environment variable to skip tmpdir cleanup

eb62896

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: coreos#1329

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018

ci: Add unprivileged compose test

5c19f0b

This should all work with preceeding commits, let's start testing it. Closes: coreos#1329

cgwalters mentioned this issue Apr 10, 2018

WIP: Support running in default kube privileges #1332

Closed

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018

compose: Add environment variable to skip tmpdir cleanup

02f564a

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: coreos#1329

rh-atomic-bot pushed a commit that referenced this issue Apr 10, 2018

compose: Add environment variable to skip tmpdir cleanup

42a6c99

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: #1329 Closes: #1333 Approved by: jlebon

rh-atomic-bot pushed a commit that referenced this issue Apr 11, 2018

compose: Add environment variable to skip tmpdir cleanup

abcaf42

This was useful for me to directly invoke bwrap against the target rootfs. Part of working on: #1329 Closes: #1333 Approved by: jlebon

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 11, 2018

ci: Add unprivileged compose test

6db5585

This should all work with preceeding commits, let's start testing it. Closes: coreos#1329

cgwalters mentioned this issue Jun 6, 2018

Support running inside default docker/podman w/container-selinux containers/bubblewrap#269

Open

dustymabe mentioned this issue Sep 6, 2018

Use builder user by default, add sudo for rpm-ostree compose coreos/coreos-assembler#32

Merged

cgwalters mentioned this issue Sep 14, 2018

Create tooling to override an oscontainer with updated RPMs openshift/os#289

Closed

cgwalters mentioned this issue Sep 27, 2018

core,scripts: When no cachedir is enabled in unified-core, disable FUSE #1591

Closed

cgwalters mentioned this issue Oct 11, 2018

prune: Use sudo for repo-build coreos/coreos-assembler#165

Merged

cgwalters mentioned this issue Nov 29, 2018

Teach installer to specify OS version through oscontainer image URLs openshift/installer#281

Closed

jlebon mentioned this issue Jan 14, 2022

WIP: Directly do container installs #3330

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

support compose with default docker/kube privileges #1329

support compose with default docker/kube privileges #1329

cgwalters commented Apr 8, 2018

cgwalters commented Apr 8, 2018

smarterclayton commented Apr 9, 2018 •

edited by cgwalters

Loading

cgwalters commented Apr 9, 2018

cgwalters commented Apr 10, 2018

smarterclayton commented Apr 17, 2018

giuseppe commented Apr 17, 2018

cgwalters commented Apr 18, 2018

giuseppe commented Apr 18, 2018 •

edited

Loading

cgwalters commented May 23, 2018

smarterclayton commented May 25, 2018

smarterclayton commented May 25, 2018 •

edited

Loading

cgwalters commented Aug 21, 2018

jlebon commented Oct 29, 2018

adityashah1212 commented Nov 7, 2018 •

edited

Loading

dustymabe commented Nov 7, 2018

adityashah1212 commented Nov 7, 2018

dustymabe commented Nov 8, 2018

adityashah1212 commented Nov 8, 2018

cgwalters commented Nov 29, 2018

support compose with default docker/kube privileges #1329

support compose with default docker/kube privileges #1329

Comments

cgwalters commented Apr 8, 2018

cgwalters commented Apr 8, 2018

smarterclayton commented Apr 9, 2018 • edited by cgwalters Loading

cgwalters commented Apr 9, 2018

cgwalters commented Apr 10, 2018

smarterclayton commented Apr 17, 2018

giuseppe commented Apr 17, 2018

cgwalters commented Apr 18, 2018

giuseppe commented Apr 18, 2018 • edited Loading

cgwalters commented May 23, 2018

smarterclayton commented May 25, 2018

smarterclayton commented May 25, 2018 • edited Loading

cgwalters commented Aug 21, 2018

jlebon commented Oct 29, 2018

adityashah1212 commented Nov 7, 2018 • edited Loading

dustymabe commented Nov 7, 2018

adityashah1212 commented Nov 7, 2018

dustymabe commented Nov 8, 2018

adityashah1212 commented Nov 8, 2018

cgwalters commented Nov 29, 2018

smarterclayton commented Apr 9, 2018 •

edited by cgwalters

Loading

giuseppe commented Apr 18, 2018 •

edited

Loading

smarterclayton commented May 25, 2018 •

edited

Loading

adityashah1212 commented Nov 7, 2018 •

edited

Loading