live-generator: Avoid tmpfs/overlayfs, add stronger deps #499

Merged
merged 1 commit into from
Jul 2, 2020

cgwalters
Member

@cgwalters cgwalters commented Jun 26, 2020

This has many things rolled into one commit because I just
needed to test them together. The high level goal is SELinux
enforcing for RHCOS and to mitigate a boot time race (still
not fully debugged, ended up adding After=systemd-udev-settle.service)

The SELinux enforcing one turned up a huge mess because as
it happens SELinux will nuke all labels on tmpfs after policy
is loaded. Which is very problematic in general (FCOS included)
but FCOS was papering over this problem by using the systemd relabel-extra.d
bits.

But we don't really want to use that anymore IMO - we should
have a clean model where files are always labeled correctly in the initramfs
before we switch root. Anything else is going to lead to pain.

In order to work around the SELinux/tmpfs bug, instead make a loopback-mounted
xfs filesystem (on tmpfs).

When the kernel is fixed to retain labels in tmpfs we can drop that
hack.
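
For readers following along, the workaround described above might look roughly like the sketch below. It is not the code from this PR: `run` and `setup_ephemeral_xfs` are hypothetical helper names, the size is a placeholder, and the `discard` mount option is borrowed from the later review discussion. Commands are only printed unless `RUN=1` is set, since actually running them needs root.

```shell
# Sketch only: a loopback-mounted xfs filesystem backed by a sparse file on tmpfs.
# `run` and `setup_ephemeral_xfs` are hypothetical helpers, not part of live-generator.

run() {
    # Print the command; execute it only when RUN=1 (requires root).
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    fi
}

setup_ephemeral_xfs() {
    local backing="$1"      # sparse file living on tmpfs
    local mountpoint="$2"   # where the xfs filesystem gets mounted
    local size_kib="$3"     # size of the backing file in KiB (placeholder value below)

    run truncate -s "${size_kib}K" "$backing"
    run mkfs.xfs -q "$backing"
    run mkdir -p "$mountpoint"
    run mount -t xfs -o loop,discard "$backing" "$mountpoint"
}

# Dry run: prints the four commands without executing them.
setup_ephemeral_xfs /run/ephemeral.xfsloop /run/ephemeral 495516
```

Because xfs stores labels as real xattrs, files relabeled in the initramfs should keep their labels across policy load, which is the property the tmpfs-backed copy was missing.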

@cgwalters
Member Author

Still debugging this.

cat >>"${UNIT_DIR}/sysroot-etc-copy.service" <<EOF
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/cp -a /sysroot/etc /writable/etc-copy
ExecStart=/bin/sh -c 'mkdir -p /writable/etccopy && /bin/cp -a /sysroot/etc /writable/etccopy/etc && /sbin/setfiles -F -r /writable/etccopy /sysroot/etc/selinux/targeted/contexts/files/file_contexts /writable/etccopy'
Member

WDYT about just adding another unit which runs after sysroot-etc.mount and does ExecStart=/usr/bin/coreos-relabel /etc? Feels cleaner since it's then part of the /sysroot rootfs proper too instead of having to create one just for setfiles. Not against doing it all in one unit though!

Member Author

I think if we can be sure no other units ran trying to operate on /sysroot/etc until the relabeling was done. Which...I think we can. I will investigate this.

Member

The guideline I try to follow for relabeling is that every component is responsible for relabeling what it touched. So ideally it shouldn't matter whether something needs to modify a file in /etc before or after the mass relabeling.

Member Author

I agree that relabeling can happen before or after files are changed, but it's clearly going to lead to races if we relabel concurrently with files changing, right?

Member

Hmm, can you expand on this? If every service always makes sure to call coreos-relabel after doing things in /etc, then it shouldn't matter if e.g. one service is changing a file while it's being relabeled since it will also trigger a relabel once it's done, right?

Member Author

Yeah, I guess so. In the end though these services are now moved to be Before=initrd-root-fs.target, so nothing else (most notably ignition-files.service) should be touching them.
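
A minimal sketch of the separate-unit idea floated in this thread, written in the same heredoc style the generator already uses. The function name and `sysroot-etc-relabel.service` are hypothetical, and `DefaultDependencies=false` plus `Requires=` are assumptions; only `sysroot-etc.mount`, `initrd-root-fs.target`, and the `coreos-relabel /etc` invocation come from the comments above.

```shell
# Sketch only: emit a oneshot unit that relabels /etc once it is mounted and
# finishes before initrd-root-fs.target. The unit name is hypothetical.

write_etc_relabel_unit() {
    local unit_dir="$1"
    cat >"${unit_dir}/sysroot-etc-relabel.service" <<EOF
# Hypothetical unit, modeled on the review discussion
[Unit]
DefaultDependencies=false
Requires=sysroot-etc.mount
After=sysroot-etc.mount
# Finish relabeling before the root fs is considered assembled
Before=initrd-root-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/coreos-relabel /etc
EOF
}
```

Calling `write_etc_relabel_unit "${UNIT_DIR}"` from the generator would drop the unit next to the others. The `Before=initrd-root-fs.target` line is what keeps later consumers from observing a half-relabeled /etc.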

@cgwalters cgwalters force-pushed the rhel-iso-selinux branch 3 times, most recently from 6f4f73f to 1e28db4, June 30, 2020 17:09
@cgwalters
Member Author

OK so...coreos-relabel here seems not to be working, it's not obvious to me why. I see a spam of "relabeling" messages...but when we get to the real root it's all still tmpfs_t. The absolutely maddening thing is

  1. Missing tooling for inspecting xattrs in the initramfs, I am manually adding attr
  2. Even after getting that, current kernel disallows reading them...

@cgwalters
Member Author

cgwalters commented Jun 30, 2020

If I manually do:

chroot /sysroot
mount -t sysfs sysfs /sys
mount -t selinuxfs selinuxfs /sys/fs/selinux
/sbin/load_policy
<ctrl-d> back to initramfs

Then I can see the labels are still tmpfs_t, but now running coreos-relabel works.

Hum...I notice there's a zero 0 in the setfiles arguments but we don't seem to be NUL separating them?

@cgwalters
Member Author

And notably this seems to be common to the things done via ignition-ostree-populate-var.service; I see e.g.

# journalctl -u ignition-ostree-populate-var | grep 'var/usrlocal '
Jun 30 16:10:49 localhost ignition-ostree-populate-var[823]: Relabeled /sysroot//var/usrlocal from unlabeled to system_u:object_r:var_t:s0
# ls -alZd /sysroot/var/usrlocal 
drwxr-xr-x. 11 root root system_u:object_r:tmpfs_t:s0 220 Jun 30 16:10 /sysroot/var/usrlocal
#
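
The mismatch above (the journal claims `var_t`, the inode still shows `tmpfs_t`) can be checked mechanically. The following is only a sketch with hypothetical function names; it assumes GNU `stat`, whose `%C` format prints the SELinux context, and simply compares the type field against `tmpfs_t`.

```shell
# Sketch only: flag paths whose label is still tmpfs_t after a relabel pass.
# Function names are hypothetical.

selinux_type() {
    # Extract the type (third field) from a user:role:type:level context.
    local ctx="$1"
    local rest="${ctx#*:}"   # drop the user field
    rest="${rest#*:}"        # drop the role field
    echo "${rest%%:*}"       # keep the type, drop the level
}

report_tmpfs_labels() {
    local path ctx
    for path in "$@"; do
        # GNU stat prints the SELinux security context for %C.
        ctx=$(stat -c %C -- "$path") || continue
        if [ "$(selinux_type "$ctx")" = tmpfs_t ]; then
            echo "still tmpfs_t: $path"
        fi
    done
}
```

For example, `report_tmpfs_labels /sysroot/var/usrlocal /sysroot/etc/nsswitch.conf` would list whichever of those paths did not keep their new label.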

@cgwalters
Member Author

With strace I see this:

[pid  1051] lgetxattr("/sysroot//etc/nsswitch.conf", "security.selinux", "unlabeled", 255) = 10
[pid  1051] access("/var/run/setrans/.setrans-unix", F_OK) = -1 ENOENT (No such file or directory)
[pid  1051] futex(0x7ff06b2315d0, FUTEX_WAKE_PRIVATE, 2147483647) = 0
[pid  1051] lsetxattr("/sysroot//etc/nsswitch.conf", "security.selinux", "system_u:object_r:etc_t:s0", 27, 0) = 0
[pid  1051] fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x5, 0x1), ...}) = 0
[pid  1051] ioctl(1, TCGETS, {B9600 opost isig icanon echo ...}) = 0
[pid  1051] write(1, "Relabeled /sysroot//etc/nsswitch"..., 83Relabeled /sysroot//etc/nsswitch.conf from unlabeled to system_u:object_r:etc_t:s0
) = 83

But...after doing the bits to load the policy so I can inspect the xattrs...it appears not to be set. With ftrace I can see that selinux_inode_setxattr is being invoked at least, but not what it's returning.

@cgwalters
Member Author

cgwalters commented Jun 30, 2020

Reading some of the kernel code here...I think this may be a behavioral difference in the initramfs between tmpfs and xfs?

When policy is loaded, the kernel walks over all the inodes from mounted filesystems and ensures they're initialized, and I think this is wiping out any changes we make to tmpfs inodes or something like that.

If so, then this would explain why this further code for rhel8 is only broken there - for FCOS we're doing overlayfs on the squashfs, and this is mostly going to work because it would only affect files we've modified.

Yeah, getting more confident in this theory, look at the policy:
https://github.com/fedora-selinux/selinux-policy/blob/862368c92def52e3bccce571a46cd99dce34fc78/policy/modules/kernel/filesystem.te
e.g.
fs_use_xattr xfs gen_context(system_u:object_r:fs_t,s0);
versus what's happening for tmpfs_t below.

@cgwalters
Member Author

Blah...maybe the simplest is to try to rework this so that rather than copying everything into tmpfs we use devmapper CoW device on the squashfs?

Clearly in parallel we need to get the fhandle bits backported into RHEL8's squashfs so that one can create an overlayfs on top - they go together so well.

@cgwalters
Member Author

OK I just tested, rather than tmpfs it works to use loopback-mounted xfs filesystems (backed by a file in tmpfs). So awesome.

@cgwalters
Member Author

Ahh but FCOS is also using a tmpfs for /var, so shouldn't we be hitting the failure there too? Theory: FCOS /var is only working because it has /run/systemd/relabel-extra.d/

@jlebon
Member

jlebon commented Jun 30, 2020

Hum...I notice there's a zero 0 in the setfiles arguments but we don't seem to be NUL separating them?

Ahh yeah, that's left over from when I copied this from the Ignition work where we feed the file list through setfiles' standard input. We pass it on the command-line here though, so I think that's probably a red herring (but still worth cleaning up).

Your comments about tmpfs ring a bell. I think @bgilbert might've encountered this as well when he was initially hacking on the live artifacts?

Theory: FCOS /var is only working because it has /run/systemd/relabel-extra.d/

Hmm good catch. I thought we had gotten rid of all post-switchroot relabeling in FCOS (but at least relabel-extra.d is the best alternative of all of those). Let me play with this locally as well.

@jlebon
Member

jlebon commented Jun 30, 2020

Yup, this is easy to reproduce on FCOS by booting with rd.break and then:

sh-5.0# rm /run/systemd/relabel-extra.d/coreos-writable.relabel
sh-5.0# coreos-relabel /etc
...
sh-5.0# coreos-relabel /var
...
sh-5.0# exit
...
(lots of AVC denials with tcontext=system_u:object_r:tmpfs_t:s0)
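
To confirm that the denials really cluster on `tmpfs_t`, one could tally them by target context. This is a sketch with a hypothetical function name; it assumes the usual `avc: denied ... tcontext=...` text format in the log lines fed to it on stdin.

```shell
# Sketch only: count AVC denials per target context from log text on stdin.
# `count_avc_tcontexts` is a hypothetical helper.

count_avc_tcontexts() {
    grep 'avc:.*denied' |
        grep -o 'tcontext=[^ ]*' |
        sort | uniq -c |
        awk '{ print $1, $2 }'   # normalize uniq's padded count column
}
```

Something like `journalctl -b | count_avc_tcontexts` would then print one line per target context with its denial count.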

@cgwalters
Member Author

cgwalters commented Jun 30, 2020

OK so...I'll rework both paths to use the loopback-mounted xfs trick instead of tmpfs, unless anyone has other ideas.

(Now, I think we should fix tmpfs + SELinux but...not going to block on that today)

@jlebon
Member

jlebon commented Jun 30, 2020

OK so...I'll rework both paths to use the loopback-mounted xfs trick instead of tmpfs, unless anyone has other ideas.

Hmm, I'm torn. I think using relabel-extra.d/ for FCOS is better than the loopback, though I know how nice it'd be to unify the approaches here. WDYT about trying to keep it to RHEL8 as a first approach and see how it goes?

@cgwalters
Member Author

It's going to be a huge mess in the code to have two paths though. I think the real bug here is to fix tmpfs + SELinux in the initramfs. It may not even be too hard; after that we can drop the loopback mounts.

@cgwalters
Member Author

I feel like this is making progress but now what I'm seeing...seems to be systemd-tmpfiles-setup.service running twice in the initramfs and not in the real root? A bunch of services fail because e.g. the /var/run symlink is missing.

I am currently baffled by what would cause it to run again in the initramfs. It's not happening outside of live runs AFAICS.

@cgwalters
Member Author

seems to be systemd-tmpfiles-setup.service running twice in the initramfs and not in the real root? A bunch of services fail because e.g. the /var/run symlink is missing.

And that problem turns out to be that we really do need the DefaultDependencies=false, otherwise the initrd-reload.service causes just that unit to start again...which really has to be some sort of bug.

Man, that was painful to figure out. I only stumbled into doing so because when initially testing changes for the race a week or two ago the first thing I tried was removing those DefaultDependencies= and got the same failure, but I'd forgotten about it and then only dimly recalled it when debugging this.

@cgwalters
Member Author

YES

[core@localhost ~]$ sudo su -
Last login: Wed Jul  1 13:26:56 UTC 2020 on tty1
[root@localhost ~]# rpm-ostree status
State: idle
Deployments:
* ostree://0959f1d922a78d3e58f9ad064e73b9dc82fcbf151b28d3d7d9c349f79a3457d1
                   Version: 46.82.202007011315-0 (2020-07-01T13:18:39Z)
[root@localhost ~]# getenforce 
Enforcing
[root@localhost ~]# ls -al /run/ostree-live 
-rw-r--r--. 1 root root 0 Jul  1 13:26 /run/ostree-live
[root@localhost ~]# 

OK now I've introduced some other race in bootup - seeing run-ephemeral.mount failed with Operation not permitted
and its dependency sysroot-xfs-ephemeral-mkfs.service be Inactive (dead) which shouldn't happen because it's RemainAfterExit=yes.

But this is feeling close...

@cgwalters
Member Author

OK now I've introduced some other race in bootup - seeing run-ephemeral.mount failed with Operation not permitted
and its dependency sysroot-xfs-ephemeral-mkfs.service be Inactive (dead) which shouldn't happen because it's RemainAfterExit=yes.

That seemed to somehow be related to the service starting too early.

@cgwalters cgwalters changed the title from "WIP: overlay/live-generator: Set labels on /etc in RHCOS Live ISO" to "live-generator: Avoid tmpfs/overlayfs, add stronger deps" Jul 1, 2020
@cgwalters
Member Author

OK, lifting WIP on this! Tested with both FCOS/RHCOS, we have SELinux enforcing and haven't seen a boot race yet, but I'm working on some coreos-assembler patches to make testing that in an automated fashion easier.

@cgwalters
Member Author

cgwalters commented Jul 1, 2020

And now building on top of coreos/coreos-assembler#1571 I have this going:

walters@toolbox /v/s/w/b/rhcos-master> jobs
Job	Group	CPU	State	Command
4	1866544	0%	running	bash -c 'while kola run --output-dir=tmp/kola-iso3 --qemu-image builds/latest/x86_64/rhcos-46.82.202007011714-0-live.x86_64.iso --qemu-memory 8192 basic; do :; done'&
3	1866037	0%	running	bash -c 'while kola run --output-dir=tmp/kola-iso2 --qemu-image builds/latest/x86_64/rhcos-46.82.202007011714-0-live.x86_64.iso --qemu-memory 8192 basic; do :; done'&
2	1865987	0%	running	bash -c 'while kola run --output-dir=tmp/kola-iso1 --qemu-image builds/latest/x86_64/rhcos-46.82.202007011714-0-live.x86_64.iso --qemu-memory 8192 basic; do :; done'&
1	1865911	0%	running	bash -c 'while kola run --output-dir=tmp/kola-iso0 --qemu-image builds/latest/x86_64/rhcos-46.82.202007011714-0-live.x86_64.iso --qemu-memory 8192 basic; do :; done'&

So let's see if we can find any races...

OK, did over 100 boots this time, no failures.
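
The `while ...; do :; done` loops above run until something fails. A bounded variant that also reports which iteration broke could look like the sketch below; `repeat_until_failure` is a hypothetical helper, not something from coreos-assembler.

```shell
# Sketch only: run a command up to MAX times, stopping at the first failure.
# `repeat_until_failure` is a hypothetical helper.

repeat_until_failure() {
    local max="$1"
    shift
    local i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@"; then
            echo "failed on iteration $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "passed $max iterations"
}
```

With the flags used in the jobs above, a run might be `repeat_until_failure 100 kola run --qemu-image path/to/live.iso --qemu-memory 8192 basic`, where the ISO path is a placeholder.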

Contributor

@darkmuggle darkmuggle left a comment

See my comment about disk-space exhaustion and discards.

As is, this could be interesting. Consider:

  1. the tmpfs is capped at 50% of the RAM
  2. the filesystem is allocated for the full amount of RAM
  3. writes could fail since the filesystem is bigger than the backing disk
  4. Removing files won't help since space is still allocated.

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/sh -c 'set -euo pipefail; mem=$$(grep "^MemTotal" /proc/meminfo | sed -e "s,^.*: *\\([0-9]*\\) .*,\\1,") && /bin/truncate -s $${mem}k /run/ephemeral.xfsloop'
Contributor

@darkmuggle darkmuggle Jul 1, 2020

What do you think about awk '/^MemTotal/{print$$2;exit;}' /proc/meminfo? I think we already have awk in the initramfs and it's a bit easier to read.

And if I am gawking this correctly this allows for 100% of the total memory. I think that having a percentage of RAM would be better.

edit: https://www.kernel.org/doc/Documentation/filesystems/tmpfs.txt says that it will be capped at half the RAM. So it won't OOM. But it could have write errors since the backing memory is smaller than the allocated space.
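
As a sketch of the awk suggestion combined with a percentage cap: the function names are hypothetical and the 50% figure is only an example mirroring the tmpfs default mentioned here. Note that this is plain shell; inside an `ExecStart=` line each `$` would need to be doubled, as in the unit above.

```shell
# Sketch only: read MemTotal with awk and take a percentage of it.
# `mem_total_kib` and `percent_of` are hypothetical helpers.

mem_total_kib() {
    # Print MemTotal in KiB from a meminfo-format file (default: /proc/meminfo).
    awk '/^MemTotal:/ { print $2; exit }' "${1:-/proc/meminfo}"
}

percent_of() {
    # Integer percentage: percent_of VALUE PERCENT
    echo $(( $1 * $2 / 100 ))
}
```

A backing file sized at half of RAM would then be `truncate -s "$(percent_of "$(mem_total_kib)" 50)K" /run/ephemeral.xfsloop`.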

Member Author

Right, actually tmpfs uses 50% of RAM by default (totally arbitrary and I think anyone using the Live ISO/PXE would want to reconfigure this in some situations). So what we're doing here is wrong in that we're creating a sparse file larger than could be written.

How about sizing the filesystem to the total available in /run? Something like

[root@cosa-devsh ~]# echo $(($(stat -f -c '%b * %s / 1024' /run)))
495516

None of this is really right but I think for most use cases it doesn't matter, and for those it does (say a lot of container images in /var/lib/containers), an admin can make that a tmpfs mount point with a specific size.
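
The sizing idea above could be wrapped as in the sketch below. `fs_total_kib` is a hypothetical name. It uses GNU `stat`'s `%S` (the fundamental block size that block counts are expressed in) rather than the `%s` shown in the comment; to my understanding the two match on tmpfs, so treat that substitution as an assumption.

```shell
# Sketch only: total size in KiB of the filesystem containing a path.
# `fs_total_kib` is a hypothetical helper.

fs_total_kib() {
    local path="$1"
    # %b = total data blocks, %S = fundamental block size for block counts.
    echo $(( $(stat -f -c '%b * %S' "$path") / 1024 ))
}
```

Sizing the loopback file from `/run` would then be `truncate -s "$(fs_total_kib /run)K" /run/ephemeral.xfsloop`, which avoids creating a sparse file larger than the tmpfs can ever hold.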

Member Author

Removing files won't help since space is still allocated.

Keep in mind an admin can still invoke fstrim on demand (and requiring them to do this is the current default) - discard is just turning it on by default synchronously (which is totally fine in this case because this is all just in memory and so there's no latency concerns).

@cgwalters
Member Author

Updated per above discussion. I still want to try to diagnose exactly the tmpfs/SELinux interaction suitable for a bug report; I'm not 100% sure it's the kernel and not e.g. how systemd relabels /run. But I don't think we should block on that either.

Contributor

@darkmuggle darkmuggle left a comment

lgtm

Good work @cgwalters

Member

@jlebon jlebon left a comment

Nice job debugging this! Just a couple of comments but LGTM overall.

[Unit]
After=${isosrc_escaped}
Requires=${isosrc_escaped}
EOF
cat >"${UNIT_DIR}/run-media-iso.mount" <<EOF
# Automatically generated by live-generator

[Unit]
DefaultDependencies=false
Member

Not new, though since we're on the hunt for unnecessary DefaultDependencies=, I think this is one of them since initrd-root-device.target is after basic.target.

cat >"${UNIT_DIR}/run-media-iso.mount" <<EOF
# Automatically generated by live-generator

[Unit]
DefaultDependencies=false
After=initrd-root-device.target
# HACK for https://github.com/coreos/fedora-coreos-config/issues/437
After=systemd-udev-settle.service
Member

Are you sure this is still needed? systemd-udev-settle.service is before sysinit.target, which is before basic.target (which is before initrd-root-device.target).

Member Author

I don't see much ordering around initrd-root-device.target:

sh-5.0# systemctl show initrd-root-device.target |grep -Ee 'Before|After'
Before=initrd.target
After=ignition-disks.service

Member

Hmm weird. I was going according to bootup(7): https://github.com/systemd/systemd/blob/c03ef420fa7157b8d4881636fe72596a06e08bb6/man/bootup.xml#L243-L251. But you're right, there's no obvious ordering that seems to implement that.

Member Author

Hmm, I think actually this may be not sufficient though, the idea is systemd-udev-settle.service doesn't run by default, and so one should actually use both Wants= and Requires= or so. Ah right, multipathd.service is:

sh-5.0# systemctl show systemd-udev-settle.service |grep WantedBy
WantedBy=multipathd.service multipathd-configure.service

[Unit]
DefaultDependencies=false
# Let's be sure we have basic devices, but other than that we
# can run really early.
After=systemd-tmpfiles-setup-dev.service
Member

Hmm, is this for /run? It's mounted by systemd itself on startup.

Member Author

Right, I thought so too but I was seeing weird failures until I added this, see #499 (comment)

Member Author

In general I'm skeptical of units with absolutely no dependencies at all.

DefaultDependencies=false
# Make sure our tmpfs is available
Requires=sysroot-xfs-ephemeral-setup.service
After=sysroot-xfs-ephemeral-setup.service
Member

I think this should also have Before=initrd-root-fs.target since the etc and var mounts should be considered part of setting up the root fs too, right?

Member Author

Yeah, no harm in adding that, but we already have that implicitly because sysroot-relabel.service (the final service in the chain) has it, and is ordered after those units, so it forces the whole chain.

@cgwalters
Member Author

If we're not seeing anything seriously wrong I'd like to merge this (tested extensively as is) PR, and do other cleanups as a followup that can be tested independently. Iterating on this is painful!

@cgwalters
Member Author

Testing this followup now:

From 56696592d5fec57e5362bfb5397e4512a937f476 Mon Sep 17 00:00:00 2001
From: Colin Walters <walters@verbum.org>
Date: Thu, 2 Jul 2020 16:20:44 +0000
Subject: [PATCH] wip

---
 .../usr/lib/dracut/modules.d/20live/live-generator   | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/overlay.d/05core/usr/lib/dracut/modules.d/20live/live-generator b/overlay.d/05core/usr/lib/dracut/modules.d/20live/live-generator
index 2a9c91e..855dfcc 100755
--- a/overlay.d/05core/usr/lib/dracut/modules.d/20live/live-generator
+++ b/overlay.d/05core/usr/lib/dracut/modules.d/20live/live-generator
@@ -90,9 +90,12 @@ EOF
 
 [Unit]
 DefaultDependencies=false
-After=initrd-root-device.target
 # HACK for https://github.com/coreos/fedora-coreos-config/issues/437
-After=systemd-udev-settle.service
+Wants=systemd-udev-settle.service
+# Note that `man bootup` implies that initrd-root-device is After=basic.target
+# but that appears to not be the case.  We explicitly order after sysinit.target
+After=sysinit.target
+After=initrd-root-device.target
 Before=initrd-root-fs.target
 
 [Mount]
@@ -164,9 +167,6 @@ After=sysroot.mount
 # And after OSTree has set up the chroot() equivalent
 After=ostree-prepare-root.service
 
-# We're part of assembling the root fs
-Before=initrd-root-fs.target
-
 [Service]
 Type=oneshot
 RemainAfterExit=yes
@@ -182,6 +182,8 @@ DefaultDependencies=false
 # Make sure our tmpfs is available
 Requires=sysroot-xfs-ephemeral-setup.service
 After=sysroot-xfs-ephemeral-setup.service
+# We're part of assembling the root fs
+Before=initrd-root-fs.target
 EOF
 }
 
-- 
2.26.2

@jlebon
Member

jlebon commented Jul 2, 2020

If we're not seeing anything seriously wrong I'd like to merge this (tested extensively as is) PR, and do other cleanups as a followup that can be tested independently. Iterating on this is painful!

Yup, WFM!

@jlebon jlebon merged commit 2ecc704 into coreos:testing-devel Jul 2, 2020
cgwalters added a commit to cgwalters/installer that referenced this pull request Jul 6, 2020
This brings in at least coreos/fedora-coreos-config#499
for the Live ISO, and we want those fixes for all of the work
happening on top of the Live ISO like the assisted installer, etc.
c4rt0 pushed a commit to c4rt0/fedora-coreos-config that referenced this pull request Mar 27, 2023