Rootless containers don't work from unprivileged non-root Docker container (operation not permitted for mounting procfs) #1658
May be related: #1456 |
Hm, failing to mount proc isn't the first thing I'd think would fail. I'll look into this. As an aside, it should be noted that Docker blocks |
@cyphar I tried to disable all Docker security measures in order to get rootless containers working (while still using a "secure", i.e. non-privileged, Docker container). So yes, I disabled seccomp and AppArmor, and even tried giving the Docker container all capabilities. I still can't work around this permission denied error, and I don't understand the reason for this behavior. My current guess is that either I missed something in how this is supposed to work (a lot of non-trivial Linux kernel functionality is involved here), or there are custom Ubuntu kernel patches that prevent this from working (I saw in other issues that you, @cyphar, successfully debugged some problems caused by customized kernels, e.g. on CentOS, and I hoped that you might know some specifics of the Ubuntu patches). |
@rutsky you could try this: https://github.com/lxc/lxd/issues/2238, it worked for me. @cyphar so basically this means I have to give the CAP_SYS_ADMIN capability to mount /proc. Is there any way we can handle this in code? This is my docker run command:
and the subsequent runc command used:
and here's my config.json
|
I can reproduce this, it's super weird, trying to find the cause |
@jessfraz I would be super happy if you could fix this. This is how I got it working: #1658 (comment) |
Yeah if I even run it with a different unprivileged user in the container even adding SYS_ADMIN fails... so weird, gotta be something with how docker is setting up the containers. Also it should go without saying that you'd want |
I am using RHEL machines, so AppArmor is not an issue, but I have seen AppArmor deny mounts on Ubuntu machines. So far I have had no issues with seccomp. I use runc with this patch: #1657 |
I'm running into this issue as well. After some research, it seems to have to do with the kernel (according to the Arch wiki). Correct me if I'm wrong. I'm currently using kernel 4.9.x and am upgrading right now just to see if this solves the problem. |
@ulm0 Arch Linux did not support user namespaces at all for a really long time, that's what that wiki article is talking about. While you do need user namespaces for rootless containers, the issue reported here is more than just a lack of user namespaces support. |
Alright, roger that. There's always something new to learn (^_^) |
fwiw I created a repro image at r.j3ss.co/runc-rootless
$ docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined r.j3ss.co/runc-rootless
container_linux.go:297: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/home/user/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
testing various places I think this might be coming from now, but just wanted to give an easy repro |
Okay so the problem is the masked and readonly paths in docker itself that are there by default which is what I suspected, it works with this diff to docker:
|
I'm wondering whether it makes sense for us to block mounting over |
It is my understanding that unmasking those paths would allow an escape hatch out of containment, although I would figure most of these are also blocked by SELinux. If you set a flag that leads to easy breakout, then why not just use --privileged? I do agree that allowing mounting over the masked paths seems to make sense. I could even see allowing processes to write to a tmpfs there so they are fooled. The one advantage of having multiple flags to turn security on and off is for developers trying to figure out whether it is SELinux, AppArmor, dropped capabilities, the device cgroup, seccomp, NO_NEW_PRIVS, readonly mounts, or masked mounts that is causing their failure. Years ago we attempted to get a patch into the kernel called FriendlyEperm, which would have written something into the logs telling a process why it was getting an access denial. But it could not be done without being racy. |
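When narrowing down which of the mechanisms listed above is responsible, a few of them can be read directly from `/proc/<pid>/status`. The following is a minimal sketch, assuming the `CapEff`, `NoNewPrivs` and `Seccomp` fields are present (they are on reasonably recent kernels). `parse_status` is a hypothetical helper name, not something from runc or Docker, and it says nothing about SELinux, AppArmor, the device cgroup, or masked and readonly mounts.

```python
CAP_SYS_ADMIN = 21  # bit index of CAP_SYS_ADMIN in the capability bitmask

# Meaning of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {"0": "disabled", "1": "strict", "2": "filter"}


def parse_status(text):
    """Summarize a few security-related fields of a /proc/<pid>/status dump.

    Hypothetical helper for illustration only.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()

    # CapEff is a hexadecimal bitmask of the effective capabilities.
    cap_eff = int(fields.get("CapEff", "0"), 16)
    return {
        "has_cap_sys_admin": bool((cap_eff >> CAP_SYS_ADMIN) & 1),
        "no_new_privs": fields.get("NoNewPrivs") == "1",
        "seccomp_mode": SECCOMP_MODES.get(fields.get("Seccomp"), "unknown"),
    }
```

Inside a container, one would pass it the contents of `/proc/self/status`, for example `parse_status(open("/proc/self/status").read())`. As this thread shows, having CAP_SYS_ADMIN and no seccomp filter is still not sufficient for the procfs mount to succeed, so this only rules out some candidates.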
Yes but masking the paths again in nested containers would prevent breakout
|
Privileged allows more than just the unmasking of paths, this at least is
explicit
I do definitely agree that mounting over the masked paths would be more ideal
|
Sure, but it also gives you a false sense of security: "I only opened Pandora's box a little, so I feel better about opening it." Perhaps I am sensitive since I keep getting people asking me how to change SELinux to allow a container to write to the docker socket. When I tell users that they should just run a privileged container, they say no, since they want to lock it down a little. Then a security analyser comes by and runs some tool that says they are not running any privileged containers, so they are good to go... |
definitely agree with you @rhatdan (obviously if I did it it would be fine, but if anyone else did it then I would be horrified hahaha) |
@cyphar +1 for allowing mount /path even when /path/a/ is masked. Could you show the link for the CVE? |
I tried this repo with strace and I find:
The masked paths are e.g. on |
you are mounting a new proc... where the proc is already masked and set as readonly... |
As far as I understand, masked dirs are just bind mounts on |
Can you not reproduce what I did by removing the masked and readonly paths
or something?
…On Fri, Mar 23, 2018 at 12:21 PM, Alban Crequy ***@***.***> wrote:
As far as I understand, masked dirs are just bind mounts on /proc/kcore
(on the outer container) without using any further mechanisms like seccomp.
When preparing the inner container by mounting a new proc on
/home/user/rootfs/proc, how does the kernel know that a bind mount on a
unrelated directory (/proc/kcore) is supposed to block the mount()
syscall by returning EPERM? /proc itself (outer container) is not masked
or readonly, only some files inside are. And the mountpoint
/home/user/rootfs/proc (for the inner container) does not have anything
masked inside. So obviously I miss a detail in the story.
|
tl;dr: I have no problem with the patches in runc & Kubernetes. I am just exploring different possible workarounds for the same problem.
@jessfraz I could reproduce the bug and I have not yet tried your fix, but I have no reason to believe your fix would behave differently on my computer :) I just wanted to really understand the underlying mechanism in order to see whether an easier solution would be possible, because I would like to have unprivileged builds in Kubernetes as well, and I would prefer if it was possible without having to use a new "rawproc" option in Docker and Kubernetes.
What I learned today:
So, by adding any procfs fully visible in the outer container, that should work: it does not need to be the one located at
By adding That is making the host processes visible in the container though. But this could be avoided with:
Here I am using a procfs mount that refers to a dead pidns, so it does not have any processes inside. A bit hacky but it works fine :) I would prefer if there was kernel support for mounting new procfs in the inner container with the same masked paths as the outer container though...
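To check from inside the outer container whether such a fully visible procfs exists, one can look at `/proc/self/mountinfo`. The sketch below is a rough userspace approximation under stated assumptions, not the kernel's actual check: it treats a procfs mount as "fully visible" when it exposes the filesystem root, is not read-only, and has no other mount (such as a masked or read-only path) on top of it. The function names are hypothetical.

```python
def parse_mountinfo(text):
    """Parse the contents of /proc/<pid>/mountinfo into a list of dicts."""
    mounts = []
    for line in text.splitlines():
        if " - " not in line:
            continue  # skip malformed or empty lines
        # Fields before " - ": id, parent id, major:minor, root, mount point,
        # mount options, optional fields. After it: fs type, source, options.
        left, right = line.split(" - ", 1)
        pre, post = left.split(), right.split()
        mounts.append({
            "root": pre[3],
            "mount_point": pre[4],
            "options": pre[5].split(","),
            "fstype": post[0],
        })
    return mounts


def overmounts(mounts, proc_mount_point):
    """Return mount points located underneath the given procfs mount."""
    prefix = proc_mount_point.rstrip("/") + "/"
    return [
        m["mount_point"]
        for m in mounts
        if m["mount_point"] != proc_mount_point
        and m["mount_point"].startswith(prefix)
    ]


def has_fully_visible_proc(mounts):
    """Approximation (assumption): some procfs has nothing mounted over it."""
    for m in mounts:
        if m["fstype"] != "proc" or m["root"] != "/":
            continue  # not procfs, or only a sub-path such as /proc/sys
        if "ro" in m["options"]:
            continue
        if not overmounts(mounts, m["mount_point"]):
            return True
    return False
```

With Docker's default masked and read-only paths, `overmounts(mounts, "/proc")` lists entries such as `/proc/kcore` and `/proc/sys`, and `has_fully_visible_proc` returns `False` unless a second, untouched procfs is mounted somewhere else in the container.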
|
Ya I don't want to bind mount proc especially not the host one and the
latter would require the user setting up a different proc on the host... So
the rawproc option requires less user intervention, albeit more patching :)
Also thanks for all the kernel source places :) very interesting
|
Quickly about the mount protections in the kernel you're talking about -- I'm fairly sure that this is actually an unneeded restriction for Though, I'm questioning the benefit of protections like that (for pseudo-filesystems in particular). Because as far as I can tell, you should be able to first do a
Unfortunately having FUSE would add another daemon restriction which isn't something that I'm a huge fan of.
The purpose of rootless containers is to be able to run anything in a container (even services), not to only be used for Dockerfiles. So we need a solution for But if you want unprivileged builds without using |
Only the pseudo filesystems
And without this bit, this protection is not even checked and the kernel assumes there is no danger:
I think this escaping would not work:
|
My work-in-progress attempt to make unprivileged new proc mounts possible: |
Regarding the first two problems, they shouldn't be an issue if you do an As for your patch and the relevant discussion I really like Eric's idea of having an only-process-specific |
@cyphar, @alban, that's how @lxc has been doing nested containers for years since we're using https://github.com/lxc/lxcfs to partially virtualize |
I just ran into this with trying to use As far as I can tell, this should work now after @jessfraz's pull requests moby/moby#36644 and (for those of us on k8s) kubernetes/kubernetes#64283, right? Or is there something pending in runc for this? I agree it'd be nicer to have a "real" fix to this, either a way to mount a new /proc while preserving the hidden files or a way to do unrestricted mounts of a limited procfs, as discussed in the most recent comments. (Though for my use case, I need a writable /proc/sys, which is tricky because Docker mounts /proc/sys read-only. There should still be a way to write to namespace-specific sysctls like kernel.ns_last_pid within my namespace, even if Docker wants to block access to its namespace... I agree with @brauner's comment on the mailing list thread that I don't see the point of the restriction, root can unmount the hiding and non-root can't write to things they lack capabilities for anyway.) But I think that the approach of starting a Docker container that's unprivileged but has an unmasked /proc should work fine today. Thanks to everyone on this issue for both explaining the problem nicely and all the work on it :) |
There isn't a way to solve this within runc directly because you currently need privileges to mount an unmasked There is some current kernel work to introduce a |
Running rootless container inside Docker under non-root user fails with
(operation not permitted for mounting procfs).
Actually, the master version of runc fails a bit earlier because of an unhandled read-only cgroup filesystem, but I managed to fix this with #1657, so I assume that this PR is applied.
I built the following Docker image to reproduce this issue (with the master version of runc with #1657 applied).
I created a Docker image with a user with uid/gid 1000/1000 (which matches my host user id, for which I have entries in /etc/subuid and /etc/subgid), start a Docker container with this image, and run runc inside as the 1000/1000 user using su.
Dockerfile:
prepare.sh:
start.sh:
This image is pushed as rutsky/runc-rootless-in-docker:bugreport.
Steps to reproduce:
Part of strace that includes failed mount:
Tested on Ubuntu 16.04 on my desktop and Ubuntu 16.04 in GKE. Docker info-s from them:
If I run the Docker container with the --privileged option, runc works as expected.
If I run runc with a rootless configuration under my host user, it works as expected.
I tried to disable AppArmor system-wide, but that doesn't help.