Rootless containers don't work from unprivileged non-root Docker container (operation not permitted for mounting procfs) #1658

rutsky · 2017-11-20T23:10:04Z

Running a rootless container inside Docker as a non-root user fails with

container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/mycontainer/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""

(operation not permitted for mounting procfs).

Actually, the master version of runc fails a bit earlier because it does not handle a read-only cgroup filesystem, but I managed to fix that with #1657, so I assume below that this PR is applied.

I built the following Docker image to reproduce this issue (using the master version of runc with #1657 applied).
The image has a user with uid/gid 1000/1000 (matching my host user, for which I have entries in /etc/subuid and /etc/subgid). I start a Docker container from this image and run runc inside it as the 1000/1000 user using su.
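
For anyone reproducing this, here is a minimal sketch of a helper that shows which subordinate ID ranges a user has. It assumes the usual name:start:count line format of /etc/subuid and /etc/subgid. The function name subid_ranges is made up for illustration and is not part of runc or any other tool.

```shell
# Print "start count" for every subordinate ID range that a subuid/subgid
# style file (lines of the form name:start:count) grants to a user.
# Matching is done on the first field only, so an entry keyed by numeric UID
# needs the UID passed in instead of the login name.
# Usage: subid_ranges <user> [file]   (file defaults to /etc/subuid)
subid_ranges() {
    local user file name start count
    user=$1
    file=${2:-/etc/subuid}
    # The "|| [ -n ... ]" part keeps a final line without a trailing newline.
    while IFS=: read -r name start count || [ -n "$name" ]; do
        [ "$name" = "$user" ] || continue
        printf '%s %s\n' "$start" "$count"
    done < "$file"
}
```

For example, `subid_ranges user /etc/subgid` prints one "start count" pair per matching line and prints nothing if the user has no entry.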

Dockerfile:

FROM ubuntu:16.04
RUN apt-get update && apt-get install -y strace gdb less vim jq
# Busybox rootfs of some version.
COPY busybox.tar /
# Patched runc from master (with applied https://github.com/opencontainers/runc/pull/1657).
ADD runc /usr/local/bin/
RUN chmod +x /usr/local/bin/runc

RUN groupadd user -g 1000
RUN useradd -d /mycontainer -m -g user user

COPY prepare.sh /
COPY start.sh /

prepare.sh:

#!/bin/bash -eux

su -l user -c "mkdir -p /mycontainer/rootfs"
su -l user -c "mkdir -p /mycontainer/containerroot"
su -l user -c "tar -C /mycontainer/rootfs -xf /busybox.tar"
su -l user -c "cd /mycontainer/; runc spec --rootless"

start.sh:

#!/bin/bash -eux
su -l user -c "cd /mycontainer; runc --root /mycontainer/containerroot run mycontainerid"

This image is pushed as rutsky/runc-rootless-in-docker:bugreport.

Steps to reproduce:

$ sudo docker run --rm --cap-add SYS_ADMIN --security-opt seccomp:unconfined --security-opt=apparmor:unconfined -ti rutsky/runc-rootless-in-docker:bugreport
root@d4ff244031d9:/# ./prepare.sh 
+ su -l user -c 'mkdir -p /mycontainer/rootfs'
+ su -l user -c 'mkdir -p /mycontainer/containerroot'
+ su -l user -c 'tar -C /mycontainer/rootfs -xf /busybox.tar'
+ su -l user -c 'cd /mycontainer/; runc spec --rootless'
root@d4ff244031d9:/# ./start.sh 
+ su -l user -c 'cd /mycontainer; runc --root /mycontainer/containerroot run mycontainerid'
container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/mycontainer/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""
root@d4ff244031d9:/#

Part of strace that includes failed mount:

[pid    68] mount("", "/", 0xc42001b2ca, MS_REC|MS_SLAVE, NULL <unfinished ...>
[pid    69] <... pselect6 resumed> )    = 0 (Timeout)
[pid    68] <... mount resumed> )       = 0
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid    68] openat(AT_FDCWD, "/proc/self/mountinfo", O_RDONLY|O_CLOEXEC) = 8</proc/68/mountinfo>
[pid    68] epoll_ctl(7<anon_inode:[eventpoll]>, EPOLL_CTL_ADD, 8</proc/68/mountinfo>, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=1036353280, u64=140128639352576}}) = 0
[pid    69] <... pselect6 resumed> )    = 0 (Timeout)
[pid    68] fcntl(8</proc/68/mountinfo>, F_GETFL <unfinished ...>
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid    68] <... fcntl resumed> )       = 0x8000 (flags O_RDONLY|O_LARGEFILE)
[pid    68] fcntl(8</proc/68/mountinfo>, F_SETFL, O_RDONLY|O_NONBLOCK|O_LARGEFILE) = 0
[pid    68] read(8</proc/68/mountinfo>,  <unfinished ...>
[pid    69] <... pselect6 resumed> )    = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid    68] <... read resumed> "263 239 0:119 / / rw,relatime - "..., 4096) = 3855
[pid    69] <... pselect6 resumed> )    = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL) = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL) = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL) = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid    68] read(8</proc/68/mountinfo>, "", 4096) = 0
[pid    68] epoll_ctl(7<anon_inode:[eventpoll]>, EPOLL_CTL_DEL, 8</proc/68/mountinfo>, 0xc4200fab0c <unfinished ...>
[pid    69] <... pselect6 resumed> )    = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid    68] <... epoll_ctl resumed> )   = 0
[pid    68] close(8</proc/68/mountinfo>) = 0
[pid    68] mount("/mycontainer/rootfs", "/mycontainer/rootfs", 0xc42001b5d0, MS_BIND|MS_REC, NULL <unfinished ...>
[pid    69] <... pselect6 resumed> )    = 0 (Timeout)
[pid    69] pselect6(0, NULL, NULL, NULL, {0, 20000}, NULL <unfinished ...>
[pid    68] <... mount resumed> )       = 0
[pid    68] stat("/mycontainer/rootfs/proc", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid    68] mount("proc", "/mycontainer/rootfs/proc", "proc", 0, NULL) = -1 EPERM (Operation not permitted)

Tested on Ubuntu 16.04 on my desktop and Ubuntu 16.04 in GKE. The docker info output from both:

# Desktop
$ sudo docker info
Containers: 12
 Running: 1
 Paused: 0
 Stopped: 11
Images: 199
Server Version: 17.09.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 06b9cb35161009dcb7123345749fef02f7cea8e0
runc version: 3f2f8b84a77f73d38244dd690525642a72156c64
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.10.0-38-generic
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 11.63GiB
Name: bob-vaio
ID: EQPL:4SC2:YOP2:Z7IM:VEWI:ZSYQ:G7LG:UWWW:G24T:GSKL:3EJU:JT6H
Docker Root Dir: /srv/docker-data
Debug Mode (client): false
Debug Mode (server): false
Username: rutsky
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

# GKE
$ sudo docker info
Containers: 27
 Running: 25
 Paused: 0
 Stopped: 2
Images: 24
Server Version: 1.12.6
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 139
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge overlay null host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.4.0-1027-gke
Operating System: Ubuntu 16.04.3 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.755 GiB
Name: gke-cluster-1-default-pool-163751e2-sg48
ID: 46OX:MIU5:TESN:HGMY:KSKR:34H7:MLG6:GHVN:AOAZ:XN56:LFCF:AWBB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
 10.0.0.0/8
 127.0.0.0/8

If I run the Docker container with the --privileged option, runc works as expected.
If I run runc with the rootless configuration as my host user, it works as expected.
I tried disabling AppArmor system-wide; it doesn't help.

rutsky · 2017-11-20T23:14:52Z

May be related: #1456

cyphar · 2017-11-21T10:21:22Z

Hm, failing to mount proc isn't the first thing I'd think would fail. I'll look into this.

As an aside, it should be noted that Docker blocks unshare(CLONE_NEWUSER) by default in their seccomp profile because it has caused a lot of local privilege escalations over time. It's a little weird you didn't hit that issue first (we need CLONE_NEWUSER in order to function in rootless) -- but presumably you disabled the seccomp profile.
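
A quick way to see whether a seccomp filter is attached to the current process is the Seccomp field in /proc/self/status. The sketch below only reports the mode. It cannot tell which syscalls a filter blocks, so "filter" does not by itself prove that unshare(CLONE_NEWUSER) is denied. The function name seccomp_mode is hypothetical.

```shell
# Report the seccomp mode recorded in a /proc/<pid>/status style file.
# The kernel writes "Seccomp:<TAB><n>" where 0 = disabled, 1 = strict,
# 2 = filter.
# Usage: seccomp_mode [status-file]   (defaults to /proc/self/status)
seccomp_mode() {
    local file key value
    file=${1:-/proc/self/status}
    while read -r key value; do
        [ "$key" = "Seccomp:" ] || continue
        case $value in
            0) echo disabled ;;
            1) echo strict ;;
            2) echo filter ;;
            *) echo "unknown($value)" ;;
        esac
        return 0
    done < "$file"
    # Field not present, e.g. a kernel built without seccomp support.
    echo unavailable
    return 1
}
```

Inside a container started with the seccomp profile disabled one would expect this to print "disabled", and "filter" under Docker's default profile.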

rutsky · 2017-11-21T19:03:53Z

@cyphar I tried to disable all Docker security measures in order to get the rootless container working (while still using a "secure", i.e. non-privileged, Docker container). So yes, I disabled seccomp and AppArmor, and even tried granting all capabilities to the Docker container. I still can't work around this permission error (and I don't understand the reason for this behavior).

My current guess is that either I missed something in how this is supposed to work (a lot of non-trivial Linux kernel functionality is involved here), or there are custom Ubuntu kernel patches that prevent this from working (I saw in other issues that you, @cyphar, successfully debugged some problems caused by customized kernels, e.g. on CentOS, and I hoped that you might know some specifics of the Ubuntu patches).
I will try to reproduce this issue on a non-Ubuntu kernel.

frezbo · 2018-01-01T19:00:42Z

@rutsky you could try this: https://github.com/lxc/lxd/issues/2238, it worked for me. @cyphar so basically this means I have to give the container the CAP_SYS_ADMIN capability to mount /proc. Is there any way we can handle this in code? This is my docker run command:

docker run --rm -it --cap-add SYS_ADMIN <image with rootfs and runc>

and the subsequent runc command used:

runc --root /tmp/runc run --no-pivot --no-new-keyring -b <oci-image-path> hello

and here's my config.json:

{
	"ociVersion": "1.0.0",
	"process": {
		"terminal": true,
		"user": {
			"uid": 0,
			"gid": 0
		},
		"args": [
			"bash"
		],
		"env": [
			"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
			"TERM=xterm"
		],
		"cwd": "/opt/data",
		"capabilities": {
			"bounding": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"effective": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"inheritable": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"permitted": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			],
			"ambient": [
				"CAP_AUDIT_WRITE",
				"CAP_KILL",
				"CAP_NET_BIND_SERVICE"
			]
		},
		"rlimits": [
			{
				"type": "RLIMIT_NOFILE",
				"hard": 1024,
				"soft": 1024
			}
		],
		"noNewPrivileges": true
	},
	"root": {
		"path": "rootfs",
		"readonly": false
	},
	"hostname": "runc",
	"mounts": [
		{
			"destination": "/proc",
			"type": "proc",
			"source": "/mnt/proc"
		},
		{
			"destination": "/dev",
			"type": "tmpfs",
			"source": "tmpfs",
			"options": [
				"nosuid",
				"strictatime",
				"mode=755",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/pts",
			"type": "devpts",
			"source": "devpts",
			"options": [
				"nosuid",
				"noexec",
				"newinstance",
				"ptmxmode=0666",
				"mode=0620"
			]
		},
		{
			"destination": "/dev/shm",
			"type": "tmpfs",
			"source": "shm",
			"options": [
				"nosuid",
				"noexec",
				"nodev",
				"mode=1777",
				"size=65536k"
			]
		},
		{
			"destination": "/dev/mqueue",
			"type": "mqueue",
			"source": "mqueue",
			"options": [
				"nosuid",
				"noexec",
				"nodev"
			]
		},
		{
			"destination": "/sys",
			"type": "none",
			"source": "/sys",
			"options": [
				"rbind",
				"nosuid",
				"noexec",
				"nodev",
				"ro"
			]
		}
	],
	"linux": {
		"uidMappings": [
			{
				"hostID": 100,
				"containerID": 0,
				"size": 1
			}
		],
		"gidMappings": [
			{
				"hostID": 101,
				"containerID": 0,
				"size": 1
			}
		],
		"namespaces": [
			{
				"type": "pid"
			},
			{
				"type": "ipc"
			},
			{
				"type": "uts"
			},
			{
				"type": "mount"
			},
			{
				"type": "user"
			}
		],
		"maskedPaths": [
			"/proc/kcore",
			"/proc/latency_stats",
			"/proc/timer_list",
			"/proc/timer_stats",
			"/proc/sched_debug",
			"/sys/firmware",
			"/proc/scsi"
		],
		"readonlyPaths": [
			"/proc/asound",
			"/proc/bus",
			"/proc/fs",
			"/proc/irq",
			"/proc/sys",
			"/proc/sysrq-trigger"
		]
	}
}

jessfraz · 2018-02-26T06:43:15Z

I can reproduce this, it's super weird, trying to find the cause

frezbo · 2018-02-26T06:46:56Z

@jessfraz I would be super happy if you could fix this. This is how I got it working: #1658 (comment)

jessfraz · 2018-02-26T06:49:34Z

Yeah if I even run it with a different unprivileged user in the container even adding SYS_ADMIN fails... so weird, gotta be something with how docker is setting up the containers.

Also it should go without saying that you'd want --security-opt apparmor=unconfined --security-opt seccomp=unconfined since seccomp blocks new user namespaces and apparmor blocks mount.

frezbo · 2018-02-26T07:07:43Z

I am using RHEL machines, so AppArmor is not an issue, but I have seen AppArmor deny mount on Ubuntu machines. So far I have had no issues with seccomp. I use runc with this patch: #1657

ulm0 · 2018-02-27T03:08:53Z

I'm running into this issue as well. After some research, it seems to have to do with the kernel (according to the Arch wiki)

https://wiki.archlinux.org/index.php/Linux_Containers#Enable_support_to_run_unprivileged_containers_.28optional.29

Correct me if I'm wrong. I'm currently using kernel 4.9.x, and upgrading right now just to see if this solves the problem.

cyphar · 2018-02-27T05:27:23Z

@ulm0 Arch Linux did not support user namespaces at all for a really long time, that's what that wiki article is talking about. While you do need user namespaces for rootless containers, the issue reported here is more than just a lack of user namespaces support.

ulm0 · 2018-02-27T05:32:30Z

Alright, roger that. There's always something new to learn (^_^)

jessfraz · 2018-03-14T18:11:03Z

fwiw I created a repro image at r.j3ss.co/runc-rootless
https://github.com/jessfraz/dockerfiles/tree/master/runc-rootless

$ docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined r.j3ss.co/runc-rootless
container_linux.go:297: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/home/user/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""

testing various places I think this might be coming from now, but just wanted to give an easy repro

jessfraz · 2018-03-14T18:22:07Z

Okay so the problem is the masked and readonly paths in docker itself that are there by default which is what I suspected, it works with this diff to docker:

$ git diff
diff --git a/oci/defaults.go b/oci/defaults.go
index 4145412dd..67b55604a 100644
--- a/oci/defaults.go
+++ b/oci/defaults.go
@@ -114,22 +114,22 @@ func DefaultLinuxSpec() specs.Spec {
 
        s.Linux = &specs.Linux{
                MaskedPaths: []string{
-                       "/proc/kcore",
-                       "/proc/keys",
-                       "/proc/latency_stats",
-                       "/proc/timer_list",
-                       "/proc/timer_stats",
-                       "/proc/sched_debug",
-                       "/proc/scsi",
-                       "/sys/firmware",
+                       /*      "/proc/kcore",
+                               "/proc/keys",
+                               "/proc/latency_stats",
+                               "/proc/timer_list",
+                               "/proc/timer_stats",
+                               "/proc/sched_debug",
+                               "/proc/scsi",
+                               "/sys/firmware",*/
                },
                ReadonlyPaths: []string{
-                       "/proc/asound",
-                       "/proc/bus",
-                       "/proc/fs",
-                       "/proc/irq",
-                       "/proc/sys",
-                       "/proc/sysrq-trigger",
+                       /*      "/proc/asound",
+                               "/proc/bus",
+                               "/proc/fs",
+                               "/proc/irq",
+                               "/proc/sys",
+                               "/proc/sysrq-trigger",*/
                },
                Namespaces: []specs.LinuxNamespace{
                        {Type: "mount"},

cyphar · 2018-03-14T18:45:59Z

I'm wondering whether it makes sense for us to block mounting over /path if /path/a is masked -- since /path/a will still be masked afterwards. But I'd have to think about it a bit more (the protections were added in response to a CVE early in Docker's history).

rhatdan · 2018-03-16T10:56:36Z

It is my understanding that unmasking those paths would allow an escape hatch out of containment, although I would figure most of these are also blocked by SELinux. If you set a flag that leads to easy breakout, then why not just use --privileged?

I do agree that allowing mounting over the masked paths seems to make sense. I could even see allowing processes to write to tmpfs there so they are fooled.

The one advantage of having multiple flags to turn security features on and off is for developers trying to figure out whether it is SELinux, AppArmor, dropped capabilities, the device cgroup, seccomp, NO_NEW_PRIVS, read-only mounts, or masked mounts that is causing the failure. Years ago we attempted to get a patch into the kernel called FriendlyEperm, which would have written something into the logs telling a process why it was getting an access denial. But it could not be done without being racy.

jessfraz · 2018-03-16T10:59:13Z

Yes but masking the paths again in nested containers would prevent breakout

jessfraz · 2018-03-16T11:03:14Z

Privileged allows more than just the unmasking of paths; this at least is explicit. I do definitely agree that mounting over the masked paths would be more ideal.

rhatdan · 2018-03-16T13:32:20Z

Sure, but it also gives you a false sense of security: "I only opened Pandora's box a little, so I feel better about opening it." Perhaps I am sensitive since I keep getting people asking me how to change SELinux to allow a container to write to the Docker socket. When I tell users that they should just run a privileged container, they say no, since they want to lock it down a little. Then a security analyser comes by and runs some tool that says they are not running any privileged containers, so they are good to go...

jessfraz · 2018-03-16T13:44:06Z

definitely agree with you @rhatdan

(obviously if I did it it would be fine, but if anyone else did it then I would be horrified hahaha)

AkihiroSuda · 2018-03-21T03:35:11Z

I'm wondering whether it makes sense for us to block mounting over /path if /path/a is masked -- since /path/a will still be masked afterwards. But I'd have to think about it a bit more (the protections were added in response to a CVE early in Docker's history).

@cyphar +1 for allowing mount /path even when /path/a/ is masked.

Could you show the link for the CVE?

alban · 2018-03-23T12:50:24Z

fwiw I created a repro image at r.j3ss.co/runc-rootless
https://github.com/jessfraz/dockerfiles/tree/master/runc-rootless

$ docker run --rm -it --security-opt seccomp=unconfined --security-opt apparmor=unconfined r.j3ss.co/runc-rootless
container_linux.go:297: starting container process caused "process_linux.go:402: container init caused \"rootfs_linux.go:58: mounting \\\"proc\\\" to rootfs \\\"/home/user/rootfs\\\" at \\\"/proc\\\" caused \\\"operation not permitted\\\"\""

testing various places I think this might be coming from now, but just wanted to give an easy repro

I tried this repro with strace and found:

[pid    79] unshare(CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWPID) = 0
[pid    79] clone(strace: Process 80 attached
child_stack=0x7ffe01185ba8, flags=CLONE_PARENT|SIGCHLD) = 80
...
[pid    80] mount("", "/", 0xc42001b25a, MS_REC|MS_SLAVE, NULL) = 0
[pid    80] mount("/home/user/rootfs", "/home/user/rootfs", 0xc42001b620, MS_BIND|MS_REC, NULL) = 0
[pid    80] mount("proc", "/home/user/rootfs/proc", "proc", 0, NULL) = -1 EPERM (Operation not permitted)

The masked paths are e.g. on /proc/kcore but not on /home/user/rootfs/proc. Why would the masked paths interfere with a new procfs mount in an unrelated directory? I guess there is some kind of protection that I don't understand...

jessfraz · 2018-03-23T13:03:48Z

you are mounting a new proc... where the proc is already masked and set as readonly...

alban · 2018-03-23T16:21:32Z

As far as I understand, masked dirs are just bind mounts on /proc/kcore (on the outer container) without using any further mechanisms like seccomp. When preparing the inner container by mounting a new proc on /home/user/rootfs/proc, how does the kernel know that a bind mount on an unrelated directory (/proc/kcore) is supposed to block the mount() syscall by returning EPERM? /proc itself (outer container) is not masked or readonly, only some files inside are. And the mountpoint /home/user/rootfs/proc (for the inner container) does not have anything masked inside. So obviously I am missing a detail in the story.
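
To see what the outer container has mounted on top of its /proc, the mount table can be inspected from inside it. This is a minimal sketch, assuming the documented /proc/<pid>/mountinfo field layout (mount point in field 5, filesystem type right after the "-" separator). mounts_under is a made-up helper name.

```shell
# List "mountpoint fstype" for every mount located below a directory,
# reading a /proc/<pid>/mountinfo style file.
# Usage: mounts_under <dir> [mountinfo-file]
mounts_under() {
    awk -v dir="${1%/}" '
        {
            # Field 5 is the mount point; skip anything not below dir.
            if (index($5, (dir "/")) != 1) next
            # Optional fields end at the "-" separator; fstype comes next.
            for (i = 7; i <= NF; i++) if ($i == "-") break
            print $5, $(i + 1)
        }
    ' "${2:-/proc/self/mountinfo}"
}
```

Run as `mounts_under /proc` inside the Docker container, this should list the bind mounts that implement the masked and read-only paths.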

jessfraz · 2018-03-23T16:29:29Z

Can you not reproduce what I did by removing the masked and readonly paths or something?

alban · 2018-03-23T17:57:31Z

tl;dr: I have no problem with the patches in runc & Kubernetes. I am just exploring different possible workarounds for the same problem.

@jessfraz I could reproduce the bug and I have not yet tried your fix but I have no reason to believe your fix would behave differently on my computer :) I just wanted to really understand the underlying mechanism in order to see whether an easier solution would be possible, because I would like to have unprivileged builds in Kubernetes as well, and I would prefer if it was possible without having to use a new "rawproc" option in Docker and Kubernetes.

What I learned today:

This protection around masked paths in an unprivileged userns seems to apply only to procfs and sysfs (only those two have the flag SB_I_USERNS_VISIBLE that triggers this check, see mount_too_revealing). Too bad for us, the masked paths are in procfs. But maybe someone could implement a FUSE solution with lxcfs that would escape the kernel check on procfs and sysfs. Unprivileged FUSE in userns coming soon :)
The protection works by iterating over all procfs mounts in the current mount namespace to try to find one without masked paths (see mnt_already_visible).

So, by adding any procfs fully visible in the outer container, that should work: it does not need to be the one located at /proc and I don't need to remove the masked paths at all.
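
A rough user-space approximation of that check, for debugging only. It assumes the /proc/<pid>/mountinfo field layout and treats a procfs mount as "fully visible" when it exposes the filesystem root, is not read-only, and has nothing mounted beneath it. This is stricter than the kernel: mountinfo does not expose MNT_LOCKED, and the sketch also counts submounts over empty directories, so it may find no candidate in cases where the kernel would still allow the mount. visible_procfs is a hypothetical name.

```shell
# Print the mount point of every procfs mount that looks fully visible.
# Exits non-zero when no such mount is found.
# Usage: visible_procfs [mountinfo-file]
visible_procfs() {
    awk '
        {
            # Optional fields end at the "-" separator; fstype comes next.
            for (i = 7; i <= NF; i++) if ($i == "-") break
            fstype = $(i + 1)
            points[NR] = $5
            # Candidate: procfs, exposing its root, without the "ro" option.
            if (fstype == "proc" && $4 == "/" && ("," $6 ",") !~ /,ro,/)
                candidate[$5] = 1
        }
        END {
            status = 1
            for (c in candidate) {
                covered = 0
                # Any mount point below the candidate counts as covering it.
                for (n in points)
                    if (index(points[n], (c "/")) == 1) covered = 1
                if (!covered) { print c; status = 0 }
            }
            exit status
        }
    ' "${1:-/proc/self/mountinfo}"
}
```

With Docker's default masking this should print nothing and return 1; with an extra bind mount such as /newproc it should print that path.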

docker run --rm -it \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --security-opt=no-new-privileges \
    -v /proc:/newproc \
    r.j3ss.co/runc-rootless

By adding -v /proc:/newproc, it works without the "rawproc" branch. So we could use this without patching Kubernetes or Docker.

That is making the host processes visible in the container though. But this could be avoided with:

sudo unshare -p -f mount -t proc proc /mnt/proc
docker run --rm -it \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    --security-opt=no-new-privileges \
    -v /mnt/proc:/newproc \
    r.j3ss.co/runc-rootless

Here I am using a procfs mount that refers to a dead pidns, so it does not have any processes inside. A bit hacky but it works fine :)
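
To double-check that such a mount really exposes no processes, the numeric directories at its top level can be counted. This is a small sketch; count_pids is a made-up helper, and it only looks at directory names, not at their contents.

```shell
# Count the numeric (per-process) directories directly under a procfs mount.
# Usage: count_pids <procfs-dir>
count_pids() {
    local dir entry n
    dir=$1
    n=0
    for entry in "$dir"/*; do
        [ -d "$entry" ] || continue
        case ${entry##*/} in
            *[!0-9]*) ;;            # contains a non-digit: not a PID directory
            *) n=$((n + 1)) ;;
        esac
    done
    echo "$n"
}
```

Going by the description above, `count_pids /mnt/proc` would be expected to print 0 for the dead-pidns mount, while `count_pids /proc` prints the number of processes visible to the caller.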

I would prefer if there was kernel support for mounting new procfs in the inner container with the same masked paths as the outer container though...

/proc is needed to support generic Dockerfile builds with arbitrary commands in RUN but I might be lucky because I might need only to support a subset of Dockerfile that would play nice with a missing /proc.

jessfraz · 2018-03-23T18:09:00Z

Ya I don't want to bind mount proc especially not the host one and the latter would require the user setting up a different proc on the host... So the rawproc option requires less user intervention, albeit more patching :) Also thanks for all the kernel source places :) very interesting

cyphar · 2018-03-24T04:24:02Z

Quickly about the mount protections in the kernel you're talking about -- I'm fairly sure that this is actually an unneeded restriction for procfs, because it is made safe explicitly by the kernel. The protections are there for any other filesystems that may not have those protections.

Though, I'm questioning the benefit of protections like that (for pseudo-filesystems in particular). Because as far as I can tell, you should be able to first do a pivot_root somewhere where the /proc isn't visible. You then unmount it, and now there's no /proc visible that is masked -- and theoretically you should be able to mount procfs again. But I haven't actually tried this, so don't quote me on it. 😉

But maybe someone could implement a FUSE solution with lxcfs that would escape the kernel check on procfs and sysfs. Unprivileged FUSE in userns coming soon :)

Unfortunately having FUSE would add another daemon restriction which isn't something that I'm a huge fan of.

/proc is needed to support generic Dockerfile builds with arbitrary commands in RUN but I might be lucky because I might need only to support a subset of Dockerfile that would play nice with a missing /proc.

The purpose of rootless containers is to be able to run anything in a container (even services), not to only be used for Dockerfiles. So we need a solution for /proc so you can do ps inside a container. Several package managers also touch /proc quite a bit, so I'd be a little surprised if you can do most builds that you want to do without /proc.

But if you want unprivileged builds without using RUN you can just use https://github.com/openSUSE/umoci directly -- which doesn't use containers by itself at all (you can use them on top of it, but at its base it just generates layers from a rootfs -- no matter how you decided to modify it).

alban · 2018-03-25T19:27:00Z

Quickly about the mount protections in the kernel you're talking about -- I'm fairly sure that this is actually an unneeded restriction for procfs, because it is made safe explicitly by the kernel. The protections are there for any other filesystems that may not have those protections.

Only the pseudo filesystems procfs and sysfs have that restriction with the SB_I_USERNS_VISIBLE bit:

$ git grep -nw SB_I_USERNS_VISIBLE
fs/namespace.c:3418:    if (!(s_iflags & SB_I_USERNS_VISIBLE))
fs/proc/inode.c:486:    s->s_iflags |= SB_I_USERNS_VISIBLE | SB_I_NOEXEC | SB_I_NODEV;
fs/sysfs/mount.c:41:            root->d_sb->s_iflags |= SB_I_USERNS_VISIBLE;
include/linux/fs.h:1323:#define SB_I_USERNS_VISIBLE             0x00000010 /* fstype already mounted */

And without this bit, this protection is not even checked and the kernel assumes there is no danger:

        /* Can this filesystem be too revealing? */
        s_iflags = mnt->mnt_sb->s_iflags;
        if (!(s_iflags & SB_I_USERNS_VISIBLE))
                return false;
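
The early return is a plain bit test, which can be worked through by hand. In the sketch below the value of SB_I_USERNS_VISIBLE comes from the grep output above. The values of SB_I_NOEXEC and SB_I_NODEV are assumed from include/linux/fs.h of the same era, and the helper name and the "other filesystem" value are made up for illustration.

```shell
# SB_I_USERNS_VISIBLE is taken from the grep output above; the other two
# values are assumptions based on include/linux/fs.h.
SB_I_NOEXEC=0x02
SB_I_NODEV=0x04
SB_I_USERNS_VISIBLE=0x10

# Succeeds when the visibility check would be applied to a superblock with
# the given s_iflags, i.e. when the kernel would NOT take the early return.
visibility_check_applies() {
    [ $(( $1 & $SB_I_USERNS_VISIBLE )) -ne 0 ]
}

# procfs sets all three bits (see fs/proc/inode.c in the grep output).
procfs_iflags=$(( $SB_I_USERNS_VISIBLE | $SB_I_NOEXEC | $SB_I_NODEV ))
# Hypothetical filesystem that sets none of these bits.
other_iflags=0
```

`visibility_check_applies "$procfs_iflags"` succeeds because 0x16 & 0x10 is non-zero, while the same call with `$other_iflags` fails, which corresponds to the `return false` path.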

Though, I'm questioning the benefit of protections like that (for pseudo-filesystems in particular). Because as far as I can tell, you should be able to first do a pivot_root somewhere where the /proc isn't visible. You then unmount it, and now there's no /proc visible that is masked -- and theoretically you should be able to mount procfs again. But I haven't actually tried this, so don't quote me on it. 😉

I think this escaping would not work:

1. You cannot umount a mount marked as locked (MNT_LOCKED). A mount is marked as locked by the kernel if it is created as part of the creation of a less privileged (CL_UNPRIVILEGED) new mount namespace.
2. Similarly, MNT_LOCKED is checked for pivot_root and for moving a mountpoint.
3. If there's no /proc visible, you cannot mount a new one. The logic is the reverse: mnt_already_visible() requires that there is at least one fully visible procfs.

alban · 2018-04-04T12:15:27Z

My work-in-progress attempt to make unprivileged new proc mounts possible:
https://lists.linuxfoundation.org/pipermail/containers/2018-April/038840.html

cyphar · 2018-04-09T17:42:33Z

Regarding the first two problems, they shouldn't be an issue if you do an MNT_DETACH umount2 -- you can definitely pivot_root even if you have MNT_LOCKED mounts in the soon-to-be-oldroot. But you're quite right that the third restriction would probably cause an issue with that method of escaping.

As for your patch and the relevant discussion I really like Eric's idea of having an only-process-specific procfs so that masking is no longer necessary for procfs (though I imagine people will have strong opinions about merging something like that).

brauner · 2018-08-13T16:47:22Z

So, by adding any procfs fully visible in the outer container, that should work: it does not need to be the one located at /proc and I don't need to remove the masked paths at all.

@cyphar, @alban, that's how @lxc has been doing nested containers for years, since we're using https://github.com/lxc/lxcfs to partially virtualize procfs, which requires overmounting. For unprivileged containers you need to ensure that those sysfs and procfs mounts are inaccessible though.

geofft · 2020-04-14T17:07:16Z

I just ran into this with trying to use unshare -Urmpf --mount-proc inside a Kubernetes container - it took a while to dig up why the mount was failing but once I found mount_too_revealing this issue explains it nicely.

As far as I can tell, this should work now after @jessfraz's pull requests moby/moby#36644 and (for those of us on k8s) kubernetes/kubernetes#64283, right? Or is there something pending in runc for this?

I agree it'd be nicer to have a "real" fix to this, either a way to mount a new /proc while preserving the hidden files or a way to do unrestricted mounts of a limited procfs, as discussed in the most recent comments. (Though for my use case, I need a writable /proc/sys, which is tricky because Docker mounts /proc/sys read-only. There should still be a way to write to namespace-specific sysctls like kernel.ns_last_pid within my namespace, even if Docker wants to block access to its namespace... I agree with @brauner's comment on the mailing list thread that I don't see the point of the restriction, root can unmount the hiding and non-root can't write to things they lack capabilities for anyway.) But I think that the approach of starting a Docker container that's unprivileged but has an unmasked /proc should work fine today.

Thanks to everyone on this issue for both explaining the problem nicely and all the work on it :)

cyphar · 2020-04-16T10:47:27Z

@geofft

There isn't a way to solve this within runc directly because you currently need privileges to mount an unmasked /proc (even one inside a dead pidns which is what Jessie's patches do). The reason it works with Kubernetes is that Kubernetes is running as root.

There is some current kernel work to introduce a procfs2 (or perhaps procfs but with a special mount option) that would allow you to mount it in a rootless container, but that's definitely going to take a while to land.

williammartin mentioned this issue Nov 21, 2017

Treat EROFS in cgroups setup as skippable error #1657

Closed

frezbo mentioned this issue Jan 13, 2018

main: support rootless mode in userns #1688

Merged

cyphar added the rootless-containers label Feb 4, 2018

cyphar self-assigned this Feb 4, 2018

AkihiroSuda mentioned this issue Feb 9, 2018

docker service create doesn't allow --privileged flag moby/moby#24862

Open

cyphar assigned cyphar and unassigned cyphar Feb 23, 2018

jessfraz mentioned this issue Mar 1, 2018

we should run img within userns with subuid/subgid (especially for apt) genuinetools/img#49

Closed

jessfraz mentioned this issue Mar 14, 2018

[proposal]: option to not mask or set read-only paths in /proc moby/moby#36597

Closed

jessfraz mentioned this issue Mar 15, 2018

add ProcMount option kubernetes/community#1934

Merged

jonboulle mentioned this issue Mar 16, 2018

rootless: cgroup: treat EROFS as a skippable error #1759

Merged

AkihiroSuda mentioned this issue Mar 21, 2018

api: add MaskedPaths and ReadonlyPaths options moby/moby#36644

Merged

larssb mentioned this issue Oct 15, 2018

Permission denied when creating workers. Rootfs. concourse/concourse-docker#27

Closed

AkihiroSuda mentioned this issue Dec 27, 2018

(deleted) moby/buildkit#762

Closed

ncordon mentioned this issue Oct 23, 2019

Adds rootless containers support bblfsh/bblfshd#318

Merged

avikivity mentioned this issue Dec 8, 2019

patchelf breaks build-id in core files scylladb/scylladb#5429

Closed

AkihiroSuda closed this as completed Apr 14, 2020

89luca89 mentioned this issue Feb 20, 2022

OCI runtime create failed 89luca89/distrobox#170

Closed

bsilver8192 mentioned this issue Feb 24, 2023

Nested container can't start NVIDIA/nvidia-container-toolkit#168

Open

jedevc mentioned this issue Jul 28, 2023

rootless: mount proc:/proc (via /proc/self/fd/6), flags: 0xe: operation not permitted moby/buildkit#4073

Closed

terenceli mentioned this issue Dec 8, 2023

Cannot read mounts in rootless Podman google/gvisor#8205

Closed

EtiennePerot mentioned this issue Sep 23, 2024

Operation not permitted when mounting /proc to /tmp/proc google/gvisor#10944

Closed
