Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support compose with default docker/kube privileges #1329

Open
cgwalters opened this issue Apr 8, 2018 · 19 comments
Open

support compose with default docker/kube privileges #1329

cgwalters opened this issue Apr 8, 2018 · 19 comments

Comments

@cgwalters
Copy link
Member

Currently we require recursive containerization support for compose; the ability to run bwrap.

In theory, we should be able to fall back on just chroot(), which is a capability is exposed.

Unfortunately that isn't going to give us privileges to e.g. mount /proc. What we'd need to do is inside the target, make those symlinks to the outer container's filesystems.

@cgwalters
Copy link
Member Author

symlinks wouldn't work obviously. Can't bind mount either since that's still mount() which requires CAP_SYS_ADMIN in the container (works fine with userns). Actually the only thing I can think of is a rather horrific hack of taking the target /usr and swapping it in over our /usr 😦. Would require careful management to ensure that we've loaded all the files we need.

In the end I think it'd probably be better to sprint towards getting usable userns in kube.

@smarterclayton
Copy link

smarterclayton commented Apr 9, 2018

That's 6 months out, no matter what, which means from then until now we can't build CoreOS on OpenShift. That seems like a problem - I'm still not clear why bwrap and other containerization is necessary when you are already inside a container? I assume you're trying to prevent the content that will be running as root from escaping and gaining access to some of the contents of the container running the container? We already operate in that mode for docker builds, and I'm not sure it's a bigger security concern

@cgwalters
Copy link
Member Author

I'm still not clear why bwrap and other containerization is necessary when you are already inside a container?

Fundamentally we have our source rootfs (container) and we're generating a new target rootfs which is distinct from that source. Changing that assumption would be...ugly. It starts to look like this.

In this source/target separation then (which is the same scenario necessary to run the same mock code used to build RPMs today), while raw chroot() gives us the capabilities to run code in the target root, it doesn't give us the ability to e.g. mount /dev or /proc. We could downgrade to just using mkdev() for /dev but not having /proc around causes various bugs.

User namespaces is the general solution to this, it enables recursive containerization.

I assume you're trying to prevent the content that will be running as root from escaping and gaining access to some of the contents of the container running the container?

Yeah, it's slightly useful for host builds but not a big deal. However there are two cases where it's a lot more useful:

  • When doing client-side package layering, there is no "outer containerization" today, so rpm-ostree does its own. We for example survive rm -rf /. See also this commit.
  • For increasing the security of general RPM builds server side; there's a lot more to do here but having an "inner container" for scripts helps a lot

At a high level I'd say we're establishing some of the requirements to do RPM-style builds in Kube/OpenShift which is going to be a very useful thing in general. Where rpm-ostree adds additional complexity is that unlike yum/dnf (but like mock) it actually uses container features itself.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
The bwrap code has some "opinionated" setup around e.g. /etc and
such that I'd like to centralize even more.  However, the dracut
case of taking the host's `/etc` is the unusual standout.  Let's
split out a base constructor.

Prep for coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
This is also prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What FUSE is protecting is the hardlinked repo imports.  But if
`--cachedir` isn't specified (which we support today) that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.  Let's hence make it so that in this scenario we don't.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
This was useful for me to directly invoke bwrap against the target
rootfs.

Part of working on: coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
In prep for cleaning up how we manage mounts.  If we're looking
at a real existing `/var`, then it must have the directory.  The
only case where we don't is in pkglayering, so move the special
case there.

Prep for coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
User namespaces denies overmounting existing targets as it
could reveal things intended to be hidden.  In the recursive
container case, we need to just use e.g. `--bind /dev /dev`
rather than trying to create a new instances.

Further, in the selftest case we can't overmount what already
exists, so skip doing that entirely.

Prep for rpm-ostree-in-Kube: coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
This should all work with preceeding commits, let's start testing
it.

Closes: coreos#1329
@cgwalters
Copy link
Member Author

PR in #1332

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
The bwrap code has some "opinionated" setup around e.g. /etc and
such that I'd like to centralize even more.  However, the dracut
case of taking the host's `/etc` is the unusual standout.  Let's
split out a base constructor.

Prep for coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
This was useful for me to directly invoke bwrap against the target
rootfs.

Part of working on: coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 10, 2018
In prep for cleaning up how we manage mounts.  If we're looking
at a real existing `/var`, then it must have the directory.  The
only case where we don't is in pkglayering, so move the special
case there.

Prep for coreos#1329
rh-atomic-bot pushed a commit that referenced this issue Apr 10, 2018
The bwrap code has some "opinionated" setup around e.g. /etc and
such that I'd like to centralize even more.  However, the dracut
case of taking the host's `/etc` is the unusual standout.  Let's
split out a base constructor.

Prep for #1329

Closes: #1333
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Apr 10, 2018
This was useful for me to directly invoke bwrap against the target
rootfs.

Part of working on: #1329

Closes: #1333
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Apr 10, 2018
In prep for cleaning up how we manage mounts.  If we're looking
at a real existing `/var`, then it must have the directory.  The
only case where we don't is in pkglayering, so move the special
case there.

Prep for #1329

Closes: #1333
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Apr 11, 2018
The bwrap code has some "opinionated" setup around e.g. /etc and
such that I'd like to centralize even more.  However, the dracut
case of taking the host's `/etc` is the unusual standout.  Let's
split out a base constructor.

Prep for #1329

Closes: #1333
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Apr 11, 2018
This was useful for me to directly invoke bwrap against the target
rootfs.

Part of working on: #1329

Closes: #1333
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Apr 11, 2018
In prep for cleaning up how we manage mounts.  If we're looking
at a real existing `/var`, then it must have the directory.  The
only case where we don't is in pkglayering, so move the special
case there.

Prep for #1329

Closes: #1333
Approved by: jlebon
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 11, 2018
This is also prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What FUSE is protecting is the hardlinked repo imports.  But if
`--cachedir` isn't specified (which we support today) that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.  Let's hence make it so that in this scenario we don't.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 11, 2018
User namespaces denies overmounting existing targets as it
could reveal things intended to be hidden.  In the recursive
container case, we need to just use e.g. `--bind /dev /dev`
rather than trying to create a new instances.

Further, in the selftest case we can't overmount what already
exists, so skip doing that entirely.

Prep for rpm-ostree-in-Kube: coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 11, 2018
This should all work with preceeding commits, let's start testing
it.

Closes: coreos#1329
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Apr 11, 2018
Now that we've split out the "base" constructor for the initramfs
special case, the "mutable" disposition tells us how to handle
binding `/etc` and `/var` in addition to `/usr`.  Drop the
`RPMOSTREE_BWRAP_IMMUTABLE` case as it is now only used in the
script case where we handle `/var`.

Next, user namespaces denies overmounting existing targets as it
could reveal things intended to be hidden.  In the recursive
container case, we need to just use e.g. `--bind /dev /dev`
rather than trying to create a new instances.

Further, in the selftest case we can't overmount what already
exists, so skip doing that entirely.

Closes: coreos#1329
@smarterclayton
Copy link

In pulling together what the CI process arc looks like, the concrete requirements for this should fall out. If the cost is high enough from your investigations that you think it needs to be scheduled in, we can manage that as part of the epic that tracks what the CI process looks like.

@giuseppe
Copy link
Contributor

Unfortunately that isn't going to give us privileges to e.g. mount /proc. What we'd need to do is inside the target, make those symlinks to the outer container's filesystems.

Colin, how much is important to get a new pid namespace and a new /proc? Without the new pid ns requirement, it should be possible to run bwrap recursively (with containers/bubblewrap#256) and bind mount proc instead of mounting a new procfs. I had a look at it for running Buildah as unprivileged user, but having runc for each RUN gives less flexibility with /proc.

I gave it a quick try (bwrap under bwrap under docker), from the bwrap build directory:

$ echo '{}' > /tmp/seccomp.json
$ sudo docker run --security-opt seccomp=/tmp/seccomp.json --rm -ti -v $(pwd):/foo:Z -w /foo fedora \
./bwrap --uid 0 --cap-add ALL --unshare-user --bind / /  \
./bwrap --unshare-user --bind / / --bind /proc /proc dnf install -y nano

....

Installed:
  nano.x86_64 2.8.7-1.fc27                                                                                                                                                                    

Complete!

There is an issue for Moby to not have masked paths on /proc and thus enable mounting a procfs as unprivileged user: moby/moby#36597, I guess img is hitting the same issue. It won't probably help much though with users we don't want to trust

@cgwalters
Copy link
Member Author

Colin, how much is important to get a new pid namespace and a new /proc?

Without a pid namespace (and without a seccomp policy in place) one can easily kill()/ptrace() etc. the parent processes. So: important.

@giuseppe
Copy link
Contributor

giuseppe commented Apr 18, 2018

Without a pid namespace (and without a seccomp policy in place) one can easily kill()/ptrace() etc. the parent processes. So: important.

Isn't relaxing seccomp required in any case when a new namespace must be created inside of a Docker container? Same for a new pid namespace + procfs mount: that is not permitted in an unprivileged Docker container as /proc has masked/readonly paths. Clarification: the pid namespace can be created but that will make /proc unusable as you cannot mount a new one.

In any case, I was just wondering if the recursive bwrap could help with anything as it can be used for launching rpm-ostree which in turn uses bwrap internally.

@cgwalters
Copy link
Member Author

Maybe the path we should be looking at is doing a Custom builder specifically for the OpenShift casse?

@smarterclayton
Copy link

Custom builders are effectively dead, so not really a practical option. The problem is that building this content shouldn't require privileges, and i don't really want to encourage anyone to think that privileged container image builders are necessary or a good idea. When user namespaces and cri-o lands in openshift, and we start dropping privileges for those containers, I don't really want there to be this one special case.

Maybe I'm being unreasonable, but in a properly confined container that is running as uid 0, there's nothing special about what rpm-ostree is doing that requires isolation other than some trickery. You are already confined when you run.

@smarterclayton
Copy link

smarterclayton commented May 25, 2018

Mounting proc is in order to prevent what? Or allow what?

Inside of a containerized runner, we're not defending the container from the content (we will already have ways of defending that that will grow over time). We're just trying to ensure the content doesn't escape the container. Malicious content in a build container is not our threat model here, because all user code that ends up in the final image has to be considered malicious.

@cgwalters
Copy link
Member Author

I forgot to comment here too, but the basic thing is that we're generating a root filesystem which is not the same as the source, and we want to run scripts in the target, and it really works best if those scripts are run as containers.

One of the biggest "scripts" is dracut, which generates the initramfs which is is another little filesystem generated from our target. We could in theory teach dracut how to deal with a src != target scenario but it'd be quite nontrivial, and a big risky change to an under-tested but critical component of the OS.

Mix in the major changes here with the fundamental thing that down the line, it should be quite possible to make recursive containerization can be made to work inside Kubernetes, I'd rather limp forward with the story that "use --privileged for now". See containers/bubblewrap#269 for some discussion on recursive containers.

cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Sep 27, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Sep 27, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Sep 27, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
rh-atomic-bot pushed a commit that referenced this issue Sep 28, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: #1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.

Closes: #1591
Approved by: jlebon
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Oct 10, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Oct 10, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Oct 11, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

We also make use of --force-copy-zerosized that just landed
in libostree: ostreedev/ostree#1752

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Oct 11, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

We also make use of --force-copy-zerosized that just landed
in libostree: ostreedev/ostree#1752

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
cgwalters added a commit to cgwalters/rpm-ostree that referenced this issue Oct 12, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: coreos#1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

We also make use of --force-copy-zerosized that just landed
in libostree: ostreedev/ostree#1752

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.
rh-atomic-bot pushed a commit that referenced this issue Oct 12, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: #1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

We also make use of --force-copy-zerosized that just landed
in libostree: ostreedev/ostree#1752

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.

Closes: #1591
Approved by: jlebon
rh-atomic-bot pushed a commit that referenced this issue Oct 12, 2018
This is prep for running inside (unprivileged) Kube containers
as they exist today: #1329

Sadly FUSE today uses a suid binary that ends up wanting CAP_SYS_ADMIN.
I think there's some work on FUSE-in-containers but I'm not sure of
the current status.

What rofiles-fuse here is doing here is protecting is the hardlinked
repo imports.  But if `--cachedir` isn't specified, that repository
gets thrown away anyways.  So there's no real value to using FUSE
here.

Also since nothing is cached, disable the devino cache.

We also make use of --force-copy-zerosized that just landed
in libostree: ostreedev/ostree#1752

Down the line ideally we gain the capability to detect if either
unprivileged overlayfs/FUSE are available.  Then if `--cachedir`
is specified we can make things work.

Closes: #1591
Approved by: jlebon
@jlebon
Copy link
Member

jlebon commented Oct 29, 2018

BTW, there's a PR open to implement userns in Kube: kubernetes/kubernetes#64005.

@adityashah1212
Copy link

adityashah1212 commented Nov 7, 2018

This might be a little different than the issue being discussed here, but I am trying to make a custom tree using rootless podman/buildah with --privileged flag. It seems everything goes fine until it downloads files from the repository (which I assume means bwrap is able to create a container). But after that it fails with Operation not permitted. Am I doing something wrong?

@dustymabe
Copy link
Member

We just merged some code in coreos-assembler to make it work unprivileged: coreos/coreos-assembler#190. You might be able to use coreos-assembler (running in a rootless podman invocation) to do a compose (it uses an unprivileged VM).

@adityashah1212
Copy link

@dustymabe Thanks for the advice. I did take a look at the coreos-assembler. From what I can see is that you guys are launching a qemu-kvm instance if the container is unprivileged right? Two thing I would like to ask.

  1. I am fine having to run privileged containers as long as they are rootless. The problem is the installation of the downloaded rpms fails.
  2. I cannot guarantee that /dev/kvm would be available, considering the container itself maybe running is some kind of vm. So if I were to launch a qemu instance, can I do so without kvm support, i.e. a plain emulator rather than a hypervisor?

BTW I am using centos 7.5 and the custom tree is for centos 7.5

@dustymabe
Copy link
Member

@dustymabe Thanks for the advice. I did take a look at the coreos-assembler. From what I can see is that you guys are launching a qemu-kvm instance if the container is unprivileged right?

right

I am fine having to run privileged containers as long as they are rootless. The problem is the installation of the downloaded rpms fails.

yeah. this isn't something I've tried. I'm not sure if there's a technical reason why it's not working, or if it is a bug in rootless containers, which are pretty new

I cannot guarantee that /dev/kvm would be available, considering the container itself maybe running is some kind of vm

If you have control of the hypervisor you could enable nested virt, but that is a big IF.

So if I were to launch a qemu instance, can I do so without kvm support, i.e. a plain emulator rather than a hypervisor?

yeah. I don't know if I recommend this, it would probably be really slow. We are calling the qemu-kvm binary in coreos-assembler, so i'm not sure if it would fall back to non hardware assisted virt (emulation), but you could try.

@adityashah1212
Copy link

@dustymabe Thanks for your help. I think I will stick to using nested virt approach. This will take some effort, but I think it will be worth the time I will save later.

@cgwalters
Copy link
Member Author

While some initial work landed on this, I think we're unlikely to spend much more time on it. I'd prefer to focus on the coreos-assembler unpriv path: #1329 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants
@giuseppe @cgwalters @smarterclayton @dustymabe @jlebon @adityashah1212 and others