Jessie container with systemd fails to start with DebOps default configuration #15

Open
ganto opened this issue Mar 28, 2015 · 10 comments

@ganto
Contributor

ganto commented Mar 28, 2015

Although this was already discussed on IRC, I am opening an issue to track the problem and its progress.

Starting position
Create a Debian Jessie container on a Jessie LXC host with DebOps:

lxc_containers:
  - name: 'jessie01'
    template_options: '--release jessie'

This will install systemd by default.

Error
When trying to start the container, the following error appears:

# lxc-start -n jessie01
Failed to mount tmpfs at /dev/shm: Operation not permitted

Reason
'cap_sys_admin' is dropped in /var/lib/lxc/jessie01/config, as defined in defaults/main.yml, which prevents systemd from mounting some required file systems:

# List of default POSIX capabilities which should be dropped in all LXC containers
lxc_capabilities_drop: [ 'mknod', 'sys_admin', 'sys_rawio', 'syslog', 'wake_alarm' ]
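
To confirm which capabilities a given container actually drops, the generated config can be inspected directly. The following is a minimal sketch, assuming the list ends up as one or more `lxc.cap.drop` lines in the container config; the function name `list_dropped_caps` is hypothetical and only used for illustration.

```shell
#!/bin/sh
# Print every capability named on an lxc.cap.drop line of an LXC config
# file, one per line. The config path is passed as the first argument.
# NOTE: list_dropped_caps is an illustrative helper, not part of LXC or DebOps.
list_dropped_caps() {
    config_file=$1
    # Keep only lxc.cap.drop lines, strip the key and "=",
    # then split the remaining value on whitespace.
    sed -n 's/^[[:space:]]*lxc\.cap\.drop[[:space:]]*=[[:space:]]*//p' "$config_file" \
        | tr -s ' \t' '\n\n' \
        | sed '/^$/d'
}

# Example, using the container name from this report:
#   list_dropped_caps /var/lib/lxc/jessie01/config | grep -x sys_admin
```

If `sys_admin` shows up in the output, systemd inside the container cannot perform its own mounts.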

Known Work-Arounds

  • Remove systemd from your Jessie installation. NOTE: lxc.autodev = 1 and lxc.kmesg = 0 must be removed from the container configuration to make this work.
  • Don't drop 'cap_sys_admin' in your container. This lets systemd work fully without further
    configuration. NOTE: This has a severe negative security impact.

Unsuccessful Work-Around
I also tried to keep 'cap_sys_admin' dropped and make LXC mount the required file systems without systemd involvement. For this I added:

lxc.mount.entry = tmpfs dev/shm tmpfs nosuid,nodev 0 0
lxc.mount.entry = tmpfs run tmpfs nosuid,relatime 0 0
lxc.mount.entry = tmpfs run/lock tmpfs nosuid,nodev,noexec,relatime 0 0

Unfortunately this fails with the message that /run/lock doesn't exist:

lxc-start: No such file or directory - failed to mount 'tmpfs' on '/usr/lib/x86_64-linux-gnu/lxc/rootfs/run/lock'

Bugs

  • Debian #775067: journald cannot forward messages to syslog when 'cap_sys_admin' is dropped. So far this is fixed only in systemd_218-4 in experimental.

Since I could live with this systemd bug, I'm still trying to find a way to run systemd without 'cap_sys_admin'. The open questions are:

  • Is there any configuration twist for LXC which would allow me to create the nested mount path /run/lock before actually mounting it?
  • Or is there a configuration option for systemd to not mount a separate file system for /run/lock?

If there are other possible work-arounds or any hints regarding my open questions, please let me know. I'll post an update once I find out more.

@ganto
Contributor Author

ganto commented Mar 28, 2015

Hi everyone

I found the answer to the LXC mounting error in the thread "Re: [systemd-devel] logind vs CAP_SYS_ADMIN-lessness". There is a mount option, create=dir, which makes LXC create the mount point directory before mounting.

With the following additional entries in /var/lib/lxc/jessie01/config, it's possible to boot a Jessie systemd container without 'cap_sys_admin':

# Custom container options
lxc.mount.auto = cgroup:mixed
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0
lxc.mount.entry = debugfs sys/kernel/debug debugfs rw,relatime 0 0
lxc.mount.entry = mqueue dev/mqueue mqueue rw,relatime,create=dir 0 0
lxc.mount.entry = hugetlbfs dev/hugepages hugetlbfs rw,relatime,create=dir 0 0

Also make sure that you have the following line in your /etc/lxc/lxc.conf:

lxc.cgroup.use = @all

Otherwise the container start will fail with the following error:

# lxc-start -n jessie01 
Failed to mount cgroup at /sys/fs/cgroup/systemd: Operation not permitted
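
When adding these entries by hand to several containers, it helps to append each line only if it is not already there. A minimal sketch follows; the helper name `ensure_line` is hypothetical, and it assumes the config file ends with a newline.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already present.
# NOTE: ensure_line is an illustrative helper, not part of LXC or DebOps.
ensure_line() {
    file=$1
    line=$2
    # -x matches whole lines only, -F treats the line as a fixed string.
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example, using one of the mount entries listed above:
#   ensure_line /var/lib/lxc/jessie01/config \
#       'lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0'
```

Running it repeatedly leaves the config unchanged after the first call.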

@ganto
Contributor Author

ganto commented Mar 28, 2015

Now that #16 is merged, setting up a Jessie container on a Jessie LXC host should work out of the box.

@lyrixx

lyrixx commented Jan 5, 2016

Looks like this issue can be closed?

@drybjed
Member

drybjed commented Jan 5, 2016

Well, nobody else raised any issues, so I guess this can be closed. :-)

@drybjed drybjed closed this as completed Jan 5, 2016
@kartoffelheinz

Sorry to re-open this, but the issue came back with Linux kernel 4.6. None of the workarounds except "Don't drop 'cap_sys_admin' in your container" work. After reverting to kernel 4.5, everything works as expected. This might be related to the addition of cgroup namespace support in the kernel; see this pull request (and the ones that follow it): http://lkml.iu.edu/hypermail/linux/kernel/1603.2/02432.html

Do you guys know any workaround here?

@drybjed drybjed reopened this Jul 12, 2016
@drybjed
Member

drybjed commented Jul 12, 2016

@kartoffelheinz Unfortunately I haven't heard anything about this issue yet. If you find a solution, it would be great to hear about it. Thanks for the heads up; I reopened the issue in case anybody else is interested.

@geaaru

geaaru commented Jan 24, 2017

Hi, in case this helps: from kernel 4.6 onwards, the cgroup API and features were rewritten. As described on the Gentoo wiki
https://wiki.gentoo.org/wiki/LXC#Configuring_unprivileged_LXC
starting an unprivileged container requires mounting a cgroup filesystem with the name systemd:

# mkdir -p /sys/fs/cgroup/systemd
# mount -t cgroup -o none,name=systemd systemd /sys/fs/cgroup/systemd

I tested this with kernels 4.8 and 4.9.
This solution uses the cgroup v1 API. I currently don't know how to use the cgroup v2 API correctly with unprivileged containers.
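
Before mounting, it can be useful to check whether a named systemd hierarchy is already present, since hosts running systemd usually have one. This is a minimal sketch, assuming the hierarchy shows up in /proc/mounts with file system type `cgroup` and a `name=systemd` option; the function name `has_systemd_cgroup` is hypothetical.

```shell
#!/bin/sh
# Succeed if a cgroup v1 hierarchy named "systemd" is listed in the given
# mounts table (defaults to /proc/mounts).
# NOTE: has_systemd_cgroup is an illustrative helper, not an existing tool.
has_systemd_cgroup() {
    mounts_file=${1:-/proc/mounts}
    # Fields: device, mount point, fs type, options, dump, pass.
    awk '$3 == "cgroup" && $4 ~ /(^|,)name=systemd(,|$)/ { found = 1 }
         END { exit !found }' "$mounts_file"
}

# Example (run as root), mounting only when the hierarchy is missing:
#   has_systemd_cgroup || {
#       mkdir -p /sys/fs/cgroup/systemd
#       mount -t cgroup -o none,name=systemd systemd /sys/fs/cgroup/systemd
#   }
```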

@sherpya

sherpya commented Jun 26, 2017

@geaaru's method worked for me. If you don't use systemd on the host, you can add these lines to fstab:

cgroup  /sys/fs/cgroup  cgroup  defaults    0   0
systemd /sys/fs/cgroup/systemd  cgroup  name=systemd,x-mount.mkdir=0555 0   0

perhaps I'm still unable to mount with the name=systemd option

@kartoffelheinz

kartoffelheinz commented Aug 8, 2017

This issue is still a major PITA.

As of now, it is impossible to run privileged containers without the sys_admin capability on the latest Debian stable with the 4.9 kernel and systemd present in both host and guest. The container will not boot, and the following errors appear in the console / log file:

Freezing execution.
Failed to mount tmpfs at /sys/fs/cgroup: Operation not permitted
Failed to mount cgroup at /sys/fs/cgroup/systemd: No such file or directory
[!!!!!!] Failed to mount API filesystems, freezing.

None of the workarounds change that (adding cap_sys is not a workaround anybody should consider). The only way to make it work is to use the old Debian Jessie 3.16 kernel.

matthijskooijman added a commit to matthijskooijman/Tika that referenced this issue Sep 17, 2017
Previously, mounts were added inside the lxc configuration files. Now,
the normal filesystem (bind)mounts are made in /etc/fstab in the host
instead, which is allowed by making /containers a "shared" mountpoint
(meaning that any bindmounts / clones of /containers retain all
submounts). This means that, unlike before, extra mountpoints can be
added later, without restarting the container and without giving the
container cap_sys_admin (e.g. mount permissions).

Also, some additional special filesystems are now mounted through the
lxc configuration files. These files are normally mounted by systemd on
startup, but without cap_sys_admin, these mounts would fail. By making
sure the mounts are already there, systemd will not try to mount them
and it will not fail (some additional configuration is needed for
systemd too, coming up next).

As an additional side effect, /etc/skel is actually mounted read-only
now. Due to a limitation of mount (worked around by using a systemd
generator in the host), this bindmount was previously read-write in all
containers.

The /proc and /sys mounts now use the lxc.mount.auto directive (which is
effectively the same, just a bit shorter).

Finally, the global lxc.conf is modified to create all cgroups, which is
apparently needed to support the cgroup mounting for systemd.

See also debops/ansible-lxc#15 and
https://s3hh.wordpress.com/2011/09/22/sharing-mounts-with-a-container/
@luken

luken commented Feb 14, 2022

Note in case someone else runs into this.

I just updated an LXC host to Debian 11 (bullseye) and had some issues with old containers (config not managed by DebOps). I only had to add the following lines to each container's /var/lib/lxc//config file to get them to start.

# needed for drop_cap sys_admin
lxc.mount.entry = tmpfs dev/shm tmpfs rw,nosuid,nodev,create=dir 0 0
lxc.mount.entry = tmpfs run tmpfs rw,nosuid,nodev,mode=755,create=dir 0 0
lxc.mount.entry = tmpfs run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,create=dir 0 0

The symptom was:

$ lxc-start --foreground --logpriority debug --name container1
Failed to mount tmpfs at /dev/shm: Operation not permitted
Failed to mount tmpfs at /run: Operation not permitted
Failed to mount tmpfs at /run/lock: Operation not permitted
[!!!!!!] Failed to mount API filesystems.
Exiting PID 1...
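
To find out which old containers still lack these entries, each config can be scanned for the three tmpfs targets. A minimal sketch, assuming the entries are written in the `lxc.mount.entry = tmpfs <target> ...` form shown above; the function name `missing_tmpfs_mounts` is hypothetical.

```shell
#!/bin/sh
# Print each of dev/shm, run and run/lock that has no tmpfs
# lxc.mount.entry line in the given LXC config file.
# NOTE: missing_tmpfs_mounts is an illustrative helper, not part of LXC.
missing_tmpfs_mounts() {
    config_file=$1
    for target in dev/shm run run/lock; do
        # The trailing [[:space:]] keeps "run" from matching "run/lock".
        grep -Eq "^[[:space:]]*lxc\.mount\.entry[[:space:]]*=[[:space:]]*tmpfs[[:space:]]+${target}[[:space:]]" \
            "$config_file" || printf '%s\n' "$target"
    done
}

# Example, using the container name from the log above:
#   missing_tmpfs_mounts /var/lib/lxc/container1/config
```

An empty result means all three mount entries are present.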
