Rootless containers without uid mapping to root #1800

Open
Madeeks opened this issue May 8, 2018 · 16 comments

@Madeeks

Madeeks commented May 8, 2018

Hello,

would it be possible to use runc to create a rootless container with the following characteristics:

  • a uid/gid mapping to the same values that a user has on the host
  • use the same uid/gid to run applications inside the container

In other words, I would like to know if the use case described here for LXC is supported by runc as well.

I tried setting up the config.json with the following details:

{
    "ociVersion": "1.0.0",
    "process": {
        "terminal":  true,
        "user": {
            "uid": 23689,
            "gid": 1000
        },
        [ ... ]
    }
    [ ... ]
    "linux": {
        "uidMappings": [
            {
                "hostID": 23689,
                "containerID": 23689,
                "size": 1
            }
        ],
        "gidMappings": [
            {
                "hostID": 1000,
                "containerID": 1000,
                "size": 1
            }
        ],
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "user"
            }
        ],
        [ ... ]
    }
}

However, runc run returns the following message: User namespaces enabled, but no user mapping found.

Thanks for any help provided.

@cyphar
Member

cyphar commented May 9, 2018

At the moment this is not supported, though this is something that I agree would be useful. I haven't yet taken a look at how much work it would take (and whether LXC does anything special in this case which we would have to replicate). I can talk to @brauner out-of-band and see how he solved the "you need root in the namespace in order to set up the container" problem (maybe it was done using capabilities -- I'm not sure).

@Madeeks
Author

Madeeks commented May 15, 2018

Thanks a lot for the reply @cyphar!
This feature would be very useful to me, so I'll keep an eye out for it in the future.

@llchan

llchan commented Sep 7, 2018

This would also be useful to me. I'd like to move some existing processes into (rootless) containers, and would like for them to think they run as the same unprivileged user as before.

I'll experiment a bit, but this will likely take me out of my depth and I may need some guidance. Do we already have a general idea of how to get this to work?

@cyphar
Member

cyphar commented Sep 8, 2018

Basically, the core idea is that you just need to lift the current restrictions and see what breaks. The main breakage will likely be that runc currently assumes that running as a non-zero uid (in the container) means that you want to drop capabilities. We need to stop this from happening, and this will likely be the only really big pain point.

Aside from that, most of it ought to work (there isn't anything particularly special about mapping 1000->1000 versus 1000->0 as an unprivileged user).
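
To make the 1000->1000 versus 1000->0 comparison concrete, here is a minimal C sketch (not runc code) of the single line that ends up in /proc/<pid>/uid_map in each case. The helper name format_id_map is hypothetical; the "inside outside count" line format is the one described in user_namespaces(7).

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: render one uid_map/gid_map line in the
 * "<id inside ns> <id outside ns> <count>" format.
 * Returns 0 on success, -1 if buf is too small. */
int format_id_map(char *buf, size_t len,
                  unsigned inside, unsigned outside, unsigned count)
{
    int n = snprintf(buf, len, "%u %u %u\n", inside, outside, count);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With inside=0 and outside=1000 this produces the 1000->0 line ("0 1000 1"); with inside=1000 it produces the identity mapping asked for in this issue ("1000 1000 1"). Only the first field differs, which is why the map itself is not the hard part.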

@llchan

llchan commented Sep 10, 2018

I think I have something minimally functional, but one snag I've hit is that the RHEL 7.5 kernel 3.10 doesn't allow unprivileged devpts mounts (it returns EINVAL on mount). A likely relevant conversation is apptainer/singularity#1186. As a workaround, I currently have to set "terminal": false and allow devpts mounts to fail, which is unfortunate but better than nothing. I could always do interactive work as container root if necessary. If you have any ideas for a better workaround let me know.

@cyphar
Member

cyphar commented Sep 11, 2018

As far as I am aware, this is something that we should have already fixed a long time ago by dropping gid=5 in the default mount options configuration (it was part of the original batch of changes in #774). Have you tried removing gid=5?

But the discussion you linked appears to argue that there is a kernel-side check for devpts mounts that is based on uid? That's a bit odd, I would've imagined it's purely based on whether you have CAP_SYS_ADMIN. I'll take a look at the relevant kernel code (hopefully it's not a RHEL-only patch because it's a nightmare to get usable RHEL kernel sources).

@llchan

llchan commented Sep 11, 2018

Yeah, saw some of the commits related to that. My config does not have a gid=5 option, and I verified via strace:

mount("devpts", "/path/to/bundle/rootfs/dev/pts", "devpts", MS_NOSUID|MS_NOEXEC, "newinstance,ptmxmode=0666,mode=0620") = -1 EINVAL (Invalid argument)

After re-reading that thread and peeking at the kernel source, I don't think this is RHEL-specific; it's just that the 3.10 kernel it comes with is fairly old and requires that uid=0 and gid=0 be valid in the user namespace. See the relevant 3.10 source at
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/fs/devpts/inode.c?h=v3.10#n249
and the commit that fixes this at
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=e98d41370392dbc3e94c8802ce4e9eec9efdf92e.

Is it possible for an unprivileged user to map host root to container root in the user namespace? I'm guessing not, for security reasons?

@cyphar
Member

cyphar commented Sep 11, 2018

> Is it possible for an unprivileged user to map host root to container root in the user namespace? I'm guessing not, for security reasons?

No, that's not possible (as an unprivileged user) because inside a user namespace you can change to any user (if you created the user namespace). As an unprivileged user, you can only map yourself into either uid=0 or uid=parent_uid.

@llchan

llchan commented Sep 12, 2018

Right, yeah, that's what I thought. We may just have to accept that devpts won't work with nonzero uid/gid on older kernels. We can output a hint message in the logs when the devpts mount returns EINVAL and uid/gid are nonzero.
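
A minimal C sketch of what such a hint could look like (runc itself is written in Go, so this only illustrates the condition). The function names are hypothetical, and reading "uid/gid are nonzero" as "either one is nonzero" is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/types.h>

/* Hypothetical check: does a failed devpts mount match the old-kernel case
 * described above (EINVAL while running as a nonzero uid or gid)? */
bool devpts_hint_applies(int mount_errno, uid_t uid, gid_t gid)
{
    return mount_errno == EINVAL && (uid != 0 || gid != 0);
}

/* Print the hint to stderr if it applies; returns whether it was printed. */
bool log_devpts_hint(int mount_errno, uid_t uid, gid_t gid)
{
    if (!devpts_hint_applies(mount_errno, uid, gid))
        return false;
    fprintf(stderr,
            "devpts mount failed with EINVAL: older kernels (e.g. 3.10) need "
            "uid 0 and gid 0 to be valid in the user namespace "
            "(running as %u:%u)\n", (unsigned)uid, (unsigned)gid);
    return true;
}
```

The caller would pass the errno saved right after the failed mount(2), together with the uid/gid from process.user in config.json.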

I'll put together a PR at some point.

@zokrezyl

Hi, I almost created a new issue for the same thing, but luckily found this one.

I think the basic solution is trivial and would cover lots of use cases (I will soon link to the way I solved it for myself). It involves three additional steps (the fourth is already present), in semi-pseudo-code:

-> unshare(CLONE_NEWUSER)
-> write("/proc/$$/uid_map", "1000 0 1")
-> write("/proc/$$/gid_map", "1000 0 1")
-> execve("user_process")

1000 is the original user's ID, which was mapped to 0 in the previous step.

I think it would be great if this made it into runc behind an additional flag.
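
The steps above can be sketched in C roughly as follows. This is an illustration under stated assumptions, not a tested runc patch: remap_and_exec and write_file are hypothetical names, the caller is assumed to be uid 0 and gid 0 inside the container's first user namespace, and execvp stands in for the execve step. The extra write to /proc/self/setgroups is not in the steps above; it is there because user_namespaces(7) requires setgroups to be denied before a process without CAP_SETGID in the parent namespace may write gid_map.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write `data` to an existing file with a single write(2), as the kernel
 * expects for uid_map/gid_map. Returns 0 on success, -1 on failure. */
int write_file(const char *path, const char *data)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    size_t len = strlen(data);
    ssize_t n = write(fd, data, len);
    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/* Hypothetical helper: `uid`/`gid` are the IDs the program should see
 * (1000 in the example above). Only returns on failure. */
int remap_and_exec(unsigned uid, unsigned gid, char *const argv[])
{
    char map[64];

    assert(argv != NULL && argv[0] != NULL);

    if (unshare(CLONE_NEWUSER) < 0)
        return -1;

    /* "<id inside> <id outside> <count>": new uid -> current uid 0. */
    snprintf(map, sizeof(map), "%u 0 1\n", uid);
    if (write_file("/proc/self/uid_map", map) < 0)
        return -1;

    /* setgroups must be denied before an unprivileged gid_map write. */
    if (write_file("/proc/self/setgroups", "deny") < 0)
        return -1;
    snprintf(map, sizeof(map), "%u 0 1\n", gid);
    if (write_file("/proc/self/gid_map", map) < 0)
        return -1;

    execvp(argv[0], argv);
    return -1; /* execvp only returns on error */
}
```

After the exec the program sees the requested uid and gid, but it holds no capabilities over the namespaces that were created before the second unshare.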

@zokrezyl

zokrezyl commented Oct 22, 2019

Just another note: a bunch of exploits probably could have been, and still could be, avoided (like https://seclists.org/oss-sec/2019/q1/119) if better tooling were provided for unprivileged containers without reinventing the wheel...

More on the technical solution I proposed in my previous post:
Before executing the second unshare, it would be great to have the opportunity to run an executable from the container's filesystem as initialisation, so the steps would be:

-> clone && execve user-specific init process as uid 0
-> unshare(CLONE_NEWUSER)
-> write("/proc/$$/uid_map", "1000 0 1")
-> write("/proc/$$/gid_map", "1000 0 1")
-> execve("user_process")

@zokrezyl

zokrezyl commented Oct 4, 2020

Found the solution. To implement it, one needs an additional step (sub-command) like "init"; let's call it "unsremap".

  • unsremap should import libcontainer/unsremap exactly as "init" imports libcontainer/nsenter
  • libcontainer/unsremap would "import" a simple C function like this (similar to nsexec):
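#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Unshare the user namespace, but only when UNSHARE_MODE is set. */
void unsremap(void)
{
    char *unshare_mode = getenv("UNSHARE_MODE");
    if (unshare_mode != NULL) {
        /* Fail loudly instead of continuing in the old namespace. */
        if (unshare(CLONE_NEWUSER) < 0) {
            perror("unshare(CLONE_NEWUSER)");
            exit(EXIT_FAILURE);
        }
    }
}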
 
  • the config.json should contain an additional re-mapping of root (0) to the desired UID/GID

  • if the second mapping is defined in config.json, then the "init" step will not call the process.args defined in config.json directly, but will call the "unsremap" sub-command instead, passing it the original process.args

  • the unsremap sub-command will write the new mapping to /proc/self/uid_map and /proc/self/gid_map

  • and exec the original process.args defined in config.json

I am happy to provide a patch if this description is accepted.

@cyphar
Member

cyphar commented Oct 5, 2020

I'm not sure it's necessary to have a separate re-exec stage, you should just be able to add an extra CLONE_NEWUSER in the existing nsexec.c setup stages (though because you have to do the mappings this may require adding a new stage to setup...). In addition, doing the unshare after all the other namespaces are set up wouldn't be a good idea -- the new user namespace wouldn't own any of them and containers wouldn't function correctly.

Also changes to the configuration format of config.json require runtime-spec changes, ideally we would specify this separately (though I'd hate for it to be done through a new flag -- maybe it could be specified by saying that you only want a single mapping for a non-root user and the user to run as is set as the same user?).

@zokrezyl

Doing this in nsexec may be too early, as you cannot do any mounting or other initialisation as non-root, which I believe is done in the init subcommand.

> In addition, doing the unshare after all the other namespaces are set up wouldn't be a good idea -- the new user namespace wouldn't own any of them and containers wouldn't function correctly.

Well, I am trading one thing for another. Obviously, a lot of containers will not work, as they may assume that they are running as uid 0. However, why would I need to keep owning the namespaces? The idea, at least as I understand it, is to run with the highest isolation and the lowest privileges, which assumes that the processes in the resulting context should not pretend to own anything significant.

My containers do not work for the opposite reason: some executables assume that they are not running as uid 0.

@cyphar
Member

cyphar commented Oct 11, 2020

> However, why would I need to keep owning the namespaces?

You cannot configure namespaces unless you own them (more specifically, have the correct capabilities in the user namespace which owns the namespace you're trying to configure), and since the configuration is done much later during setup you would need to do the unshare at the very end of setup which would make the logic much more complicated.

> My containers do not work for the opposite reason: some executables assume that they are not running as uid 0.

There are some ghetto solutions for this problem which I helped develop some time ago -- https://github.com/rootless-containers/subuidless is the latest iteration of this idea.

@zokrezyl

Not sure if you understood my initial proposal. The idea is that, with some magic configuration, once everything is configured by runc (namespaces, mounts, etc.), instead of calling the process.args from config.json you would call:

['/proc/self/exe', 'unsremap', '1000', '1000'] + process.args

the unsremap subcommand

  • would first unshare the user namespace (as in the C code above)
  • then write the new mapping into the uid_map and gid_map files
  • and finally exec the original program with its original arguments
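
As an illustration of the argument vector described above, here is a minimal C sketch that prepends the proposed prefix to process.args. The helper name build_unsremap_argv is hypothetical, and "unsremap" is only the sub-command proposed in this thread, not something runc provides.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: build
 *   {"/proc/self/exe", "unsremap", uid, gid, args[0..nargs-1], NULL}.
 * The strings are borrowed; only the array is allocated, and the caller
 * releases it with free(). Returns NULL on allocation failure. */
char **build_unsremap_argv(const char *uid, const char *gid,
                           char *const args[], size_t nargs)
{
    char **argv = calloc(nargs + 5, sizeof(*argv));
    if (argv == NULL)
        return NULL;
    argv[0] = "/proc/self/exe";
    argv[1] = "unsremap";
    argv[2] = (char *)uid;
    argv[3] = (char *)gid;
    for (size_t i = 0; i < nargs; i++)
        argv[4 + i] = args[i];
    /* argv[4 + nargs] is already NULL because calloc zeroes the array. */
    return argv;
}
```

The result can be handed straight to execv(argv[0], argv). The sub-command would then consume its two ID arguments and exec the remainder.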
