Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix --read-only containers under --userns-remap #1572

Merged
merged 1 commit into from
Aug 26, 2017

Conversation

tych0
Copy link
Member

@tych0 tych0 commented Aug 24, 2017

The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae2948042b08ad3d6d13cd09f40a50ffff4fc688-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen tycho@docker.com

The documentation here:
https://docs.docker.com/engine/security/userns-remap/#user-namespace-known-limitations

says that readonly containers can't be used with user namespaces do to some
kernel restriction. In fact, there is a special case in the kernel to be
able to do stuff like this, so let's use it.

This takes us from:

ubuntu@docker:~$ docker run -it --read-only ubuntu
docker: Error response from daemon: oci runtime error: container_linux.go:262: starting container process caused "process_linux.go:339: container init caused \"rootfs_linux.go:125: remounting \\\"/dev\\\" as readonly caused \\\"operation not permitted\\\"\"".

to:

ubuntu@docker:~$ docker-runc --version
runc version 1.0.0-rc4+dev
commit: ae29480-dirty
spec: 1.0.0
ubuntu@docker:~$ docker run -it --read-only ubuntu
root@181e2acb909a:/# touch foo
touch: cannot touch 'foo': Read-only file system

Signed-off-by: Tycho Andersen <tycho@docker.com>
@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

I actually was hitting a variant of this bug today. I noticed that in readonlyPath we also do MS_REMOUNT | MS_BIND which has caused me issues with readonly bindmounts (though I can't reproduce it outside of Docker weirdly). Would this change also apply to readonlyPath?

/cc @justincormack who was working on this stuff last.

@hqhq
Copy link
Contributor

hqhq commented Aug 25, 2017

Is it safe even if the original mount is not bind mount?

@rhatdan
Copy link
Contributor

rhatdan commented Aug 25, 2017

@rhvgoyal PTAL

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017 via email

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017 via email

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017 via email

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

@tych0

Not that I know of, the implementation looks fine to me. Do you have a
command line that will reproduce it?

I do, unfortunately it requires applying an out-of-tree patch to Docker. You can see the patch here, it's effectively forcefully adding a "secret" to every container and the mounting of /run/secrets is what's breaking. The strace log looks like this:

[pid 26131] mount("/var/lib/docker/100000.100000/containers/904f91a237a6c330a65fd4444635796fb25a67cfefc06213fcaa3df5d73d890c/secrets", "/var/lib/docker/100000.100000/vfs/dir/b252479c3020cdb9bedb93f1717be95714b449e3e0b302fe034149627ee37e08/run/secrets", 0xc4200b5df0, MS_RDONLY|MS_BIND|MS_REC, NULL) = 0
[pid 26131] mount("/var/lib/docker/100000.100000/containers/904f91a237a6c330a65fd4444635796fb25a67cfefc06213fcaa3df5d73d890c/secrets", "/var/lib/docker/100000.100000/vfs/dir/b252479c3020cdb9bedb93f1717be95714b449e3e0b302fe034149627ee37e08/run/secrets", 0xc4200b5df5, MS_RDONLY|MS_REMOUNT|MS_BIND|MS_REC, NULL) = -1 EPERM (Operation not permitted)

If you apply the above patch, every container run will fail.

I'm trying to reproduce the issue outside of Docker (which isn't working, even if I use the configuration generated by containerd directly!), or on a stock version of Docker, but I'm having trouble doing that. I'm going to try to reproduce it with the latest runc but the OCI updates makes it hard for me to test the latest runc with Docker because Docker doesn't support it (at least in the version I have set up on my machine at the moment).

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017 via email

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017 via email

@rhvgoyal
Copy link
Contributor

@tych0 So what does "MS_REMOUNT | MS_BIND" actually mean? Are we remounting the existing mount point or creating a new bind mount point?

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

@cyphar I just pushed a patched to this branch that I think will fix your issue as well.

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

@rhvgoyal it means hitting a special code path in the kernel, so it really has nothing to do with MS_BIND. I have no idea why this exists (presumably to do exactly this), but it works and I've used it elsewhere. See: https://github.com/torvalds/linux/blob/master/fs/namespace.c#L2270 in do_remount().

statfs := syscall.Statfs_t{}
syscall.Statfs(path, &statfs)

flags := statfs.Flags | unix.MS_BIND | unix.MS_REMOUNT | unix.MS_RDONLY | unix.MS_REC
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use syscall anymore, we use the extended-stdlib golang.org/x/sys/unix library. Also you need to handle the error. So this should look more like:

var statfs unix.Statfs_t
if err := unix.Statfs(path, &statfs); err != nil {
	return err
}

statfs.Flags |= unix.MS_BIND | unix.MS_REMOUNT | unix.MS_RDONLY | unix.MS_REC
return unix.Mount(path, path, "", uintptr(statfs.Flags), "")

Copy link
Member

@cyphar cyphar Aug 25, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also it looks like these changes weren't correctly gofmt'd. Run gofmt -s -w . to fix this up. Looks like you updated it while I was commenting.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

@tych0 I was thinking the same thing wrt dropping inherited flags. Thanks for pushing a patch to fix it, I'll test it now. I'm a bit confused why we are redoing the mount in this case (since we're not actually updating any of the flags) but this shouldn't hurt.

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017 via email

@crosbymichael
Copy link
Member

crosbymichael commented Aug 25, 2017

LGTM

Approved with PullApprove

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

/me is testing to make sure it fixes the issue I was seeing.

@rhvgoyal
Copy link
Contributor

I am seeing 3 places where we are trying to convert a mount point read only. Can we unify all that.

setReadonly()
readonlyPath()
remountReadonly()

Also if retaining old flags is an issue with readonlyPath(), why is it not an issue with setReadonly() and remountReadonly()?

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

@tych0 b8136d3 is not sufficient to fix the issue I was seeing. I'm going to play with it to see if we need to apply it more places.

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

remountReadonly() passes in the mount flags from the mount configuration, which should have nosuid or whatever; readonlyPath() makes a path that might not be a mount at all read only, setReadonly is only used for the rootfs, which presumably doesn't have stuff line noexec or nosuid, although it could. Perhaps we should use remountReadonly for that and find the "/" mount in the descriptor list?

@rhvgoyal
Copy link
Contributor

I think we can have a helper function whose job is to just remount an existing mount point read-only. It will be passed in flags and destination path. And then all the places can make use of that helper. Say remountReadOnly(path, flags).

remountReadOnly() {
return unix.Mount("", path, "", flags | unix.MS_BIND|unix.MS_REMOUNT|unix.MS_RDONLY, "")
}

And now this code can be reused.

May be we can get rid of setReadOnly() and replace that call with above helper.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

Alright, I've figured out why my error is happening. It's caused by remount in this section not doing the necessary Statfs flag preservation:

		// bind mount won't change mount options, we need remount to make mount options effective.
		// first check that we have non-default options required before attempting a remount
		if m.Flags&^(unix.MS_REC|unix.MS_REMOUNT|unix.MS_BIND) != 0 {
			// only remount if unique mount options are set
			if err := remount(m, rootfs); err != nil {
				return err
			}
		}

@rhvgoyal
Copy link
Contributor

So I was reading remount code path in kernel. It does not seem to automatically get existing mount flags (except some atime stuff). That means it is caller's responsibility to provide existing flags in remount call if caller wants to retain these flags. We probably also need to differentiate in what cases we want to retain existing flags and in what cases we don't want to do that.

I feel a PR can be there just to cleanup code, consolidate these mount calls, put proper comments and lets merge that first and make sure things are not broken.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

@rhvgoyal Yeah, that's what @justincormack noted a few months ago while cleaning up this code and I think @tych0 mentioned it above. I agree we really need to simplify the plethora of unix.Mount wrappers we have in libcontainer/rootfs_linux.go, and having a wrapper that encapsulated those cases (wanting and not wanting to preserve the original mount flags) is a good idea.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

This patch fixes the issue I had, but as @rhvgoyal said, this code really needs a more significant cleanup:

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go                                                                 
index 3c714b0905e7..e58ee76d4d4e 100644                             
--- a/libcontainer/rootfs_linux.go                                  
+++ b/libcontainer/rootfs_linux.go                                  
@@ -791,7 +791,14 @@ func remount(m *configs.Mount, rootfs string) error {                                                               
        if !strings.HasPrefix(dest, rootfs) {                       
                dest = filepath.Join(rootfs, dest)                  
        }                                                           
-       if err := unix.Mount(m.Source, dest, m.Device, uintptr(m.Flags|unix.MS_REMOUNT), ""); err != nil {                               
+                                                                   
+       statfs := unix.Statfs_t{}                                   
+       if err := unix.Statfs(path, &statfs); err != nil {          
+               return err                                          
+       }                                                           
+       flags := int(statfs.Flags) | m.Flags | unix.MS_REMOUNT           
+                                                                   
+       if err := unix.Mount(m.Source, dest, m.Device, uintptr(flags), ""); err != nil {                                                 
                return err                                          
        }                                                           
        return nil       

The reason why this fixes the issue is (as above) it's because the remount with bind is purely so that you can add mount options to a bind mount.

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

I think we need two functions: makeMountReadonly(m * configs.Mount) and makePathReadonly(path string); I'll clean that up

r.e. @cyphar's bug, that seems to me like whoever is providing the configuration for m should be providing all the right flags (presumably m itself is a bind mount, which needs to be inherited). Whatever is up one level should do that.

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

Actually, never mind. This does need a more significant cleanup than even that. I think we should merge this, and I'll put together another branch trying to make it better.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

@tych0 Here's the "higher up" version, but I agree this code all needs to be cleaner.

From 0f0054d22b1b2eadda635eb11089d826ecef2b81 Mon Sep 17 00:00:00 2001
From: Aleksa Sarai <asarai@suse.de>
Date: Sat, 26 Aug 2017 01:53:07 +1000
Subject: [PATCH] rootfs: preserve old mount flags when remounting bindmount

Fixes the case where a bind-mount is being defined in a user namespaced
container which requires a remount. Previously this code would attempt
to do a MS_REMOUNT that potentially dropped mount flags that were
inherited through the MS_BIND. Resolve this by intentionally adding the
mount flags to the mount configuration in that scenario.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
---
 libcontainer/rootfs_linux.go | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/libcontainer/rootfs_linux.go b/libcontainer/rootfs_linux.go
index d4f8595f2873..42a1b365e312 100644
--- a/libcontainer/rootfs_linux.go
+++ b/libcontainer/rootfs_linux.go
@@ -222,6 +222,15 @@ func mountToRootfs(m *configs.Mount, rootfs, mountLabel string) error {
 		// bind mount won't change mount options, we need remount to make mount options effective.
 		// first check that we have non-default options required before attempting a remount
 		if m.Flags&^(syscall.MS_REC|syscall.MS_REMOUNT|syscall.MS_BIND) != 0 {
+			// Make a copy of the previous mount flags, because in a user
+			// namespace we are not allowed to drop mount options from the host
+			// bindmount.
+			var statfs syscall.Statfs_t
+			if err := syscall.Statfs(dest, &statfs); err != nil {
+				return err
+			}
+			m.Flags |= int(statfs.Flags)
+
 			// only remount if unique mount options are set
 			if err := remount(m, rootfs); err != nil {
 				return err
-- 
2.14.1

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

I still don't think so: whoever creates the struct Mount (presumably the user specifying the config to runc) needs to get their mount flags right. The only time we should need statfs is when something wasn't necessarily specified via a struct Mount, so nobody knows what its flags are.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

@tych0 The problem is that if you want to create a readonly bindmount, you can't just do MS_BIND|MS_READONLY. You need to do an MS_BIND followed by MS_REMOUNT|MS_READONLY, but since MS_BIND will mirror any mount options that were present on the original mount now the remounter has to make sure they preserve those options (otherwise they're implicitly dropping them). So I'm not really sure where you would want those options to come from -- I don't think we can require users specify them in the OCI configuration, so we have to source them ourselves (because the original thing we're bind-mounting comes from somewhere outside of runc).

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

Sorry for derailing this patch discussion.

@tych0 How about we revert to just having the first patch (which should be fairly un-contentious), and we can discuss the rest of these issues in a separate issue/PR? If you'd prefer to work on it that's fine for me, alternatively I can work on it next week.

@tych0
Copy link
Member Author

tych0 commented Aug 25, 2017

IMO, the users should probably be specifying these things (i.e. not really users, but docker/whoever should be doing a statfs if it's going to specify bind mounts), but it would also be reasonable and user friendly to do it as you describe, especially if OCI doesn't support specifying options like nosuid.

Anyway, I've dropped the second patch for now. I'm happy to do some refactoring next week too, I won't be doing it today, though. We can coordinate off-list or something. Cheers!

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

IMO, the users should probably be specifying these things (i.e. not really users, but docker/whoever should be doing a statfs if it's going to specify bind mounts), but it would also be reasonable and user friendly to do it as you describe, especially if OCI doesn't support specifying options like nosuid.

On the one hand I agree (because it means we can do less guesswork), and OCI does support the full range of {no,}blah flags. The problem is that we already have a lot of "nice" features in rootfs_linux.go and this one seems like another logical addition. If we don't do it, now we require all OCI users that have bind mounts to have to additionally do statfs externally and then translate the flags to the OCI config.json which is a lot more work for them than it is for us.

In either case, the single patch PR is fine as-is and I'll LGTM it.

@cyphar
Copy link
Member

cyphar commented Aug 25, 2017

LGTM on 66eb2a3. Thanks @tych0!

Approved with PullApprove

@hqhq
Copy link
Contributor

hqhq commented Aug 26, 2017

LGTM

Approved with PullApprove

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants