Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ubuntu] long duration hang during boot #2802

Closed
Tracked by #2804
sdwilsh opened this issue Aug 11, 2024 · 24 comments · Fixed by kairos-io/packages#1022
Closed
Tracked by #2804

[ubuntu] long duration hang during boot #2802

sdwilsh opened this issue Aug 11, 2024 · 24 comments · Fixed by kairos-io/packages#1022
Assignees
Labels
bug Something isn't working

Comments

@sdwilsh
Copy link
Contributor

sdwilsh commented Aug 11, 2024

Kairos version:

PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
KAIROS_VERSION="v3.1.1-v1.29.4-k3s1"
KAIROS_VERSION_ID="v3.1.1-v1.29.4-k3s1"
KAIROS_IMAGE_REPO="quay.io/kairos/ubuntu:24.04-standard-amd64-generic-v3.1.1-k3sv1.29.4-k3s1"
KAIROS_FLAVOR="ubuntu"
KAIROS_FLAVOR_RELEASE="24.04"
KAIROS_BUG_REPORT_URL="https://github.com/kairos-io/kairos/issues"
KAIROS_ID_LIKE="kairos-standard-ubuntu-24.04"
KAIROS_MODEL="generic"
KAIROS_RELEASE="v3.1.1"
KAIROS_GITHUB_REPO="kairos-io/kairos"
KAIROS_IMAGE_LABEL="24.04-standard-amd64-generic-v3.1.1-k3sv1.29.4-k3s1"
KAIROS_ARTIFACT="kairos-ubuntu-24.04-standard-amd64-generic-v3.1.1-k3sv1.29.4+k3s1"
KAIROS_REGISTRY_AND_ORG="quay.io/kairos"
KAIROS_HOME_URL="https://github.com/kairos-io/kairos"
KAIROS_SOFTWARE_VERSION_PREFIX="k3s"
KAIROS_ID="kairos"
KAIROS_NAME="kairos-standard-ubuntu-24.04"
KAIROS_PRETTY_NAME="kairos-standard-ubuntu-24.04 v3.1.1-v1.29.4-k3s1"
KAIROS_FAMILY="ubuntu"
KAIROS_VARIANT="standard"
KAIROS_TARGETARCH="amd64"
KAIROS_SOFTWARE_VERSION="v1.29.4+k3s1"

CPU architecture, OS, and Version:

Linux worker03.hogs.tswn.us 6.8.0-38-generic #38-Ubuntu SMP PREEMPT_DYNAMIC Fri Jun  7 15:25:01 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug

I'm trying to evaluate upgrading from the 22.04 ubuntu images to the 24.04 ubuntu ones, and there's a 11+ minute hang during boot prior to switching root. In the logs below, look for Aug 11 02:35:06, and the next log line is Aug 11 02:46:48. I actually thought this was completely hung, but had to step away to eat dinner and was surprised to come back and see it had booted.

To Reproduce

Boot the image.

Expected behavior

Booting takes much less time.

Logs

journalctl -b output up until the root switch happens: https://gist.github.com/sdwilsh/91e17e2ffbed5a912937df8427a2a415

Additional context

I was previously running quay.io/kairos/ubuntu:22.04-standard-amd64-generic-v3.1.0-k3sv1.29.4-k3s1 on this, and ran quay.io/kairos/ubuntu:24.04-standard-amd64-generic-v3.1.1-k3sv1.29.4-k3s1 via kairos-agent upgrade.

@sdwilsh sdwilsh added bug Something isn't working triage Add this label to issues that should be triaged and prioretized in the next planning call unconfirmed labels Aug 11, 2024
@robarnold
Copy link
Contributor

I've hit the same thing and have the same large pause point, though mine is much faster for some reason:

Aug 12 00:57:27 localhost kernel: audit: type=1403 audit(1723424246.970:4): auid=4294967295 ses=4294967295 lsm=selinux res=1
Aug 12 00:57:27 localhost kernel: audit: type=1400 audit(1723424246.972:5): avc:  denied  { entrypoint } for  pid=944 comm="sh" path="/usr/sbin/setfiles" dev="loop0" ino=147597 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=1
Aug 12 00:57:27 localhost kernel: audit: type=1400 audit(1723424246.989:6): avc:  denied  { associate } for  pid=944 comm="restorecon" name="#a" scontext=system_u:object_r:unlabeled_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=filesystem permissive=1
Aug 12 01:02:51 localhost passwd[980]: password for 'root' changed by 'root'
Aug 12 01:02:51 jetsam kernel: audit: type=1400 audit(1723424571.590:7): avc:  denied  { execmem } for  pid=1005 comm="kairos-agent" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process permissive=1
Aug 12 01:02:52 jetsam immucore[579]: 2024-08-12T01:02:52Z INF 1.

@jimmykarily
Copy link
Contributor

It seems that it's blocked by selinux for some reason. Selinux support is planned for later, we shouldn't have selinux enabled in the first place. Until we fix, you should be able to disable it through the extra_cmdline in the kairos config (selinux=0 ) . In the meantime, we'll try to disable it in the next release.

@jimmykarily jimmykarily moved this to In Progress 🏃 in 🧙Issue tracking board Aug 12, 2024
@jimmykarily jimmykarily removed the triage Add this label to issues that should be triaged and prioretized in the next planning call label Aug 12, 2024
@mudler mudler mentioned this issue Aug 12, 2024
35 tasks
@mauromorales
Copy link
Member

@jimmykarily I remember having troubles disabling it for 24.04, but haven't tested in a long time

@mauromorales mauromorales changed the title [ubuntu] 11+ minute hang during boot [ubuntu] long duration hang during boot Aug 12, 2024
@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 13, 2024

24.04 is explicitly installing selinux: https://github.com/kairos-io/kairos/blob/master/images/Dockerfile.ubuntu#L336-L338

This was added in #2625 by @Itxaka to fix #2577.

I think what we probably want to do is to set it to permissive mode until #111. Given the original issue that added selinux, I think it might be safer to do that instead of disabling at the command line.

@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 13, 2024

I spent some time tonight to try and workaround this by setting grub_options.extra_cmdline to enforcing=0 per selinux docs. However, that doesn't seem to modify the grub options, so I still hit the issue. I'm guessing that grub options only get set on install, not upgrade (not documented as far as I can tell, so it is unclear what the intended behavior is). /boot/grub/menu.lst doesn't seem to exist either, so I'm not really sure where this would be written out to. I'm guessing I'm seeing the file for after the root switch, so what I'm looking for probably lives somewhere else.

Additionally, looking at /etc/selinux/config, SELINUX=permissive is already present (again, post-root switch so the config pre-switch might be different), so I'm not really sure how to work around this.

@jimmykarily
Copy link
Contributor

yes you can also see permissive=1 in the logs above.

(this line:

Aug 12 00:57:27 localhost kernel: audit: type=1400 audit(1723424246.989:6): avc:  denied  { associate } for  pid=944 comm="restorecon" name="#a" scontext=system_u:object_r:unlabeled_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=filesystem permissive=1

)

@jimmykarily
Copy link
Contributor

The problem does not occur in qemu with the same artifact KAIROS_ARTIFACT="kairos-ubuntu-24.04-standard-amd64-generic-v3.1.1-k3sv1.29.4+k3s1"

I tried both with k3s enabled and disabled, not preproducible. Maybe it only happens on upgrade?

I also see some "audit" logs in journal but not for the same things probably:

~ $ cat j.txt  | grep -E "avc:.*denied"
Aug 13 13:43:35 localhost kernel: audit: type=1400 audit(1723556615.147:2): avc:  denied  { entrypoint } for  pid=931 comm="update-rc.d" path="/usr/bin/systemctl" dev="loop0" ino=79245 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=1
Aug 13 13:43:35 localhost kernel: audit: type=1400 audit(1723556615.157:3): avc:  denied  { associate } for  pid=932 comm="update-rc.d" name="#8" scontext=system_u:object_r:unlabeled_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=filesystem permissive=1
Aug 13 13:43:35 localhost kernel: audit: type=1400 audit(1723556615.161:4): avc:  denied  { entrypoint } for  pid=934 comm="immucore" path="/usr/bin/dash" dev="loop0" ino=78863 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=1
Aug 13 13:43:36 localhost kernel: audit: type=1400 audit(1723556616.740:6): avc:  denied  { execmem } for  pid=1007 comm="kairos-agent" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process permissive=1
Aug 13 13:43:37 localhost kernel: audit: type=1400 audit(1723556617.214:7): avc:  denied  { entrypoint } for  pid=1026 comm="systemd" path="/usr/lib/systemd/systemd-executor" dev="rootfs" ino=4722 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:root_t:s0 tclass=file permissive=1
Aug 13 13:43:35 localhost kernel: audit: type=1400 audit(1723556615.147:2): avc:  denied  { entrypoint } for  pid=931 comm="update-rc.d" path="/usr/bin/systemctl" dev="loop0" ino=79245 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=1
Aug 13 13:43:35 localhost kernel: audit: type=1400 audit(1723556615.157:3): avc:  denied  { associate } for  pid=932 comm="update-rc.d" name="#8" scontext=system_u:object_r:unlabeled_t:s0 tcontext=system_u:object_r:tmpfs_t:s0 tclass=filesystem permissive=1
Aug 13 13:43:35 localhost kernel: audit: type=1400 audit(1723556615.161:4): avc:  denied  { entrypoint } for  pid=934 comm="immucore" path="/usr/bin/dash" dev="loop0" ino=78863 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=1
Aug 13 13:43:36 localhost kernel: audit: type=1400 audit(1723556616.740:6): avc:  denied  { execmem } for  pid=1007 comm="kairos-agent" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process permissive=1
Aug 13 13:43:37 localhost kernel: audit: type=1400 audit(1723556617.214:7): avc:  denied  { entrypoint } for  pid=1026 comm="systemd" path="/usr/lib/systemd/systemd-executor" dev="rootfs" ino=4722 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:root_t:s0 tclass=file permissive=1
Aug 13 13:43:38 localhost kernel: audit: type=1400 audit(1723556617.630:9): avc:  denied  { module_load } for  pid=1 comm="systemd" path="/usr/lib/modules/6.8.0-38-generic/kernel/net/netfilter/x_tables.ko.zst" dev="loop0" ino=102997 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=system permissive=1
Aug 13 13:43:38 localhost kernel: audit: type=1107 audit(1723556618.199:10): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='avc:  denied  { status } for auid=n/a uid=0 gid=0 path="/usr/lib/systemd/system/systemd-udev-settle.service" cmdline="" function="mac_selinux_filter" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=service permissive=1 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
Aug 13 13:43:40 localhost kernel: audit: type=1400 audit(1723556620.830:11): avc:  denied  { execmem } for  pid=1278 comm="kairos-agent" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=process permissive=1
Aug 13 13:43:43 localhost kernel: audit: type=1107 audit(1723556623.266:12): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='avc:  denied  { start } for auid=n/a uid=0 gid=0 path="/etc/systemd/system/kairos-agent.service" cmdline="" function="bus_unit_method_start_generic" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:usr_t:s0 tclass=service permissive=1 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
Aug 13 13:43:43 localhost kernel: audit: type=1107 audit(1723556623.266:13): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='avc:  denied  { status } for auid=n/a uid=0 gid=0 path="/etc/systemd/system/kairos-agent.service" cmdline="" function="reply_unit_path" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:usr_t:s0 tclass=service permissive=1 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'
Aug 13 13:43:46 localhost kernel: audit: type=1400 audit(1723556626.707:14): avc:  denied  { watch_reads } for  pid=1616 comm="(agetty)" path="/dev/ttyS0" dev="devtmpfs" ino=90 scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:tty_device_t:s0 tclass=chr_file permissive=1
Aug 13 13:43:46 localhost kernel: audit: type=1400 audit(1723556626.720:15): avc:  denied  { checkpoint_restore } for  pid=1620 comm="agetty" capability=40  scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:system_r:kernel_t:s0 tclass=capability2 permissive=1
Aug 13 13:43:51 localhost kernel: audit: type=1107 audit(1723556631.766:16): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:kernel_t:s0 msg='avc:  denied  { start } for auid=n/a uid=0 gid=0 path="/usr/lib/systemd/system/user@.service" cmdline="/usr/lib/systemd/systemd-logind" function="bus_unit_method_start_generic" scontext=system_u:system_r:kernel_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=service permissive=1 exe="/usr/lib/systemd/systemd" sauid=0 hostname=? addr=? terminal=?'

In any case though, I see no delay.

@Itxaka
Copy link
Member

Itxaka commented Aug 13, 2024

Notice that hang its currently built from master https://github.com/kairos-io/kairos/blob/master/images/Dockerfile.ubuntu#L35 so what happens today migth not happen tomorrow, depending on if nohang master is broken or not

  • do we need to ship nohang?
  • do we need to ship nohag from master?
  • if selinux is set to permissive, then it will allow all kind of stuff, its basically disabled + log, are we sure this is the real issue?

@mauromorales
Copy link
Member

  • no i don't think we need to, maybe it's better to remove and let users who need it to install or build
  • we ship from master because:

    There's an issue between the nohang package provided by Ubuntu and the zfsutils-linux package, there is a fix in nohang upstream but it's not yet available in the Ubuntu package, so we build it from source

@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 14, 2024

if selinux is set to permissive, then it will allow all kind of stuff, its basically disabled + log, are we sure this is the real issue?

This is what I was wondering today, actually. I think in permissive mode it only logs when it would deny and does not deny anything (useful for when we want to support it, at least!), if I'm reading the docs correctly.

Looking more closely, the non-selinux logs just before the hang look like this:

Aug 11 02:35:02 localhost immucore[688]: 2024-08-11T02:35:02Z INF mount done options=["ro","suid","dev","exec","async"] type=ext4 what=/dev/disk/by-label/COS_ACTIVE where=/sysroot
Aug 11 02:35:02 localhost immucore[688]: 2024-08-11T02:35:02Z INF creating a runtime
Aug 11 02:35:02 localhost immucore[688]: 2024-08-11T02:35:02Z INF detecting boot state
Aug 11 02:35:03 localhost kernel: EXT4-fs (nvme0n1p2): mounted filesystem 234e242d-3ec8-490e-ab9d-f3d25e6f7440 r/w with ordered data mode. Quota mode: none.
Aug 11 02:35:03 localhost immucore[688]: 2024-08-11T02:35:03Z INF mount done options=["rw","suid","dev","exec","async"] type=ext4 what=/dev/disk/by-label/COS_OEM where=/sysroot/oem
Aug 11 02:35:03 localhost immucore[688]: 2024-08-11T02:35:03Z INF Running rootfs stage
Aug 11 02:35:03 localhost immucore[688]: 2024-08-11T02:35:03Z INF triggering udev to populate disk info
Aug 11 02:35:04 localhost kernel: EXT4-fs (nvme0n1p5): mounted filesystem 57896295-63a8-4fb4-97f2-31a7e273a6ba r/w with ordered data mode. Quota mode: none.
Aug 11 02:35:04 localhost immucore[688]: 2024-08-11T02:35:04Z INF mount done options=["rw"] type=ext4 what=/dev/disk/by-label/COS_PERSISTENT where=/sysroot/usr/local
Aug 11 02:35:04 localhost immucore[688]: 2024-08-11T02:35:04Z WRN
Aug 11 02:35:04 localhost kernel: overlayfs: "xino" feature enabled using 2 upper inode bits.
Aug 11 02:35:04 localhost kernel: overlayfs: "xino" feature enabled using 2 upper inode bits.
Aug 11 02:35:04 localhost kernel: overlayfs: "xino" feature enabled using 2 upper inode bits.
Aug 11 02:35:04 localhost immucore[688]: 2024-08-11T02:35:04Z WRN executing mount callback error="mkdir /sysroot/snap: read-only file system" options=["bind"] type=overlay what=/sysroot/usr/local/.state/snap.bind where=/sysroot/snap
Aug 11 02:35:04 localhost immucore[688]: 2024-08-11T02:35:04Z ERR error="mkdir /sysroot/snap: read-only file system"
Aug 11 02:35:06 localhost immucore[688]: 2024-08-11T02:35:06Z WRN executing mount callback error="mkdir /sysroot/usr/share/pki: read-only file system" options=["bind"] type=overlay what=/sysroot/usr/local/.state/usr-share-pki-trust.bind where=/sysroot/usr/share/pki/trust
Aug 11 02:35:06 localhost immucore[688]: 2024-08-11T02:35:06Z ERR error="mkdir /sysroot/usr/share/pki: read-only file system"
Aug 11 02:35:06 localhost immucore[688]: 2024-08-11T02:35:06Z WRN executing mount callback error="mkdir /sysroot/usr/share/pki: read-only file system" options=["bind"] type=overlay what=/sysroot/usr/local/.state/usr-share-pki-trust-anchors.bind where=/sysroot/usr/share/pki/trust/anchors
Aug 11 02:35:06 localhost immucore[688]: 2024-08-11T02:35:06Z ERR error="mkdir /sysroot/usr/share/pki: read-only file system"
Aug 11 02:35:06 localhost immucore[688]: 2024-08-11T02:35:06Z WRN error="3 errors occurred:\n\t* mkdir /sysroot/snap: read-only file system\n\t* mkdir /sysroot/usr/share/pki: read-only file system\n\t* mkdir /sysroot/usr/share/pki: read-only file system\n\n"
Aug 11 02:35:06 localhost kernel: evm: overlay not supported
Aug 11 02:35:06 localhost immucore[688]: 2024-08-11T02:35:06Z INF Running initramfs stage
Aug 11 02:46:48 localhost passwd[1105]: password for 'root' changed by 'root'

The errors about read-only file system are concerning, but probably unrelated. What happens after initramfs stage is started and before something updates the root password?

Maybe it only happens on upgrade?

I did reboot once after the upgrade and it took as long to boot the second time. I don't really want to re-image my whole machine, but if we think it's needed for debugging, I can try it. I suppose there could be a difference between reset and fresh install as well, right?

do we need to ship nohang?

Do we need to ship nohang and zfsutils-linux? It seems like Kairos is moving more towards minimal dependencies and having others add packages that they need in their own build pipeline.

@jimmykarily
Copy link
Contributor

The last message before the wait comes from here: https://github.com/kairos-io/immucore/blob/3574491f740536cb1d0e99b8d58bb0225fe997f6/pkg/state/steps_shared.go#L137

I don't think the immucore logs make it to journal. There should be more interesting logs in /run/immucore/*.log.

Better not re-image the machine yet. After all, we might not be able to reproduce if you do.

@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 16, 2024

I'm not going to upload this, because I don't think it's useful, but I think I've figured out what is taking so long. In initramfs_stage.log, lines 67-5494602 look something like this:

Relabeled /oem/90_custom.yaml from system_u:object_r:unlabeled_t:s0 to system_u:object_r:default_t:s0

The above was the very last line of it all. These 5,494,535 log lines for relabeling were preceded by this:

2024-08-16T00:25:26Z INF Processing stage step 'Relabelling'. ( commands: 1, files: 0, ... )
2024-08-16T00:25:26Z WRN (conditional) Skip 'Skipping stage (if statement error: failed to run (grep -q "kairos.reset" /proc/cmdline || [ -f /run/cos/autoreset_mode ]) && \
( [ -e "/sbin/systemctl" ] || [ -e "/usr/bin/systemctl" ] || [ -e "/usr/sbin/systemctl" ] || [ -e "/usr/bin/systemctl" ] )
: exit status 1)' stage name: Starts kairos-reset for systemd based systems
2024-08-16T00:25:26Z INF Command output:
2024-08-16T00:25:26Z ERR 3 errors occurred:
        * failed to run systemctl enable fail2ban: exit status 1
        * failed to run systemctl enable logrotate.timer: exit status 1
        * failed to run systemctl enable sshd: exit status 1


2024-08-16T00:38:24Z INF Command output:

So as best I can tell, it's relabeling every single file on the disk, and it's taking a really long time. Given that there are lots of log files in this output, some going all the way back to when this particular node was added in March of this year, that might be why this doesn't show up in a fresh install.

@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 16, 2024

Given that there are lots of log files in this output, some going all the way back to when this particular node was added in March of this year, that might be why this doesn't show up in a fresh install.

This is just a small fraction of the files it is relabeling, and after a reboot, it's not any faster. It still looks like it tries to relabel every file on disk, although most things are set correctly on the second boot.

@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 16, 2024

This does appear to be selinux related. On a node where I am still running 22.04, I looked at the initramfs_stage.log file, and there's an ERR logged that is very relevant:

2024-08-16T04:47:53Z INF Processing stage step 'Relabelling'. ( commands: 1, files: 0, ... )
2024-08-16T04:47:53Z ERR /bin/sh: 1: load_policy: not found
/bin/sh: 2: restorecon: not found
: failed to run load_policy -i
restorecon -R -i -v /etc /root /opt /srv /var /home /usr/local /oem
: exit status 127
2024-08-16T04:47:53Z ERR 1 error occurred:
        * failed to run load_policy -i
restorecon -R -i -v /etc /root /opt /srv /var /home /usr/local /oem
: exit status 127

restorecon will restore file default SELinux security contexts. So it looks like we're trying to restore it and it fails when selinux is not installed, but when it is, it can take a long time to walk all those files paths. This is done in this code, which seems like where selinux is not present, the if check isn't working right (like on ubuntu 22.04).
We probably want a set -e in this step too so that if load_policy failed, we then don't also restorecon.

Do we really need to run restorecon on all of these paths? Is this something where we can do it for some of these paths only on upgrade?

@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 18, 2024

Another quick data point: I tried a custom image using the Kairos Factory this weekend based on 24.04. That took 32 minutes to relabel everything on one of my oldest nodes clocking in at 268 days of running k3s.

@jimmykarily
Copy link
Contributor

So in 22.04, load_policy and restorecon are not installed while in 24.04 they are, so restorecon is creating that delay.

set -e is a good idea. On the other hand, the whole script is gated by selinux=1 in cmdline which makes me think that, since selinux is not properly configured, there shouldn't be selinux=1 in the cmdline in the first place, right?

I don't see where we enable selinux in Ubuntu, that must be coming from upstream. I'll keep looking.

@jimmykarily
Copy link
Contributor

Actually, that's us here: https://github.com/kairos-io/packages/blob/d464f1b18bf1bbff110e74059d6c0f6d71fe78b9/packages/static/kairos-overlay-files/files/etc/cos/bootargs.cfg#L9

Since we haven't put any work to make selinux work, I wonder why we enable it at all. If we just want it in permissive mode, to generate the logs, then should we check for that mode too and avoid running restorecon and load_policy when in permissive mode? I'm not a selinux expert, I'm just trying to understand what is the most reasonable fix for this.

@jimmykarily
Copy link
Contributor

we could change the condition to something like this:

cat /proc/cmdline | grep -v "selinux=1" && command -v sestatus && sestatus | grep -E "Current mode:.*enforcing"

(not tested)
This way we would only run those commands when selinux is actually enforced.
I still think, disabling selinux in the cmdline is the best option though.

@jimmykarily
Copy link
Contributor

Setting selinux=0 skips relabeling. Let's see what the tests have to say about this. If everything works, I suggest we make the change in the packages repo.

jimmykarily added a commit to kairos-io/packages that referenced this issue Aug 27, 2024
because in order to support selinux, we need to do more than just enable
it in the cmdline.

Fixes kairos-io/kairos#2802

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
@jimmykarily
Copy link
Contributor

I opened the packages PR to have it ready, if we decide to go that route.

@jimmykarily jimmykarily self-assigned this Aug 27, 2024
@sdwilsh
Copy link
Contributor Author

sdwilsh commented Aug 28, 2024

On the other hand, the whole script is gated by selinux=1 in cmdline which makes me think that, since selinux is not properly configured, there shouldn't be selinux=1 in the cmdline in the first place, right?

There's a few things I want to point out:

  1. On 22.04, I was seeing this script ran, even though as far as I can tell selinux=1 is not in the cmdline (maybe I was looking at the wrong file though?). I know there's an if check there for it, but are we sure that's working correctly?
  2. I don't know if setting selinux=0 is safe, given ubuntu 24.04 with k3s 1.28.9 crash at boot #2577, although I readily admit I don't fully follow what was going on there.

Regardless of the short-term fix we decide to do here, something better than what we have today also needs to be figured out before selinux is implemented. My 30+ minute hang on boot for a node that hasn't even been running for a year isn't going to be a good experience.

@jimmykarily
Copy link
Contributor

@sdwilsh thanks for sharing that link. I remembered we had an issue with selinux on Ubuntu but I couldn't remember what it was :D. On 2577, Itxaka reported on that ticket that it also works when selinux is disabled (which is the case for this ticket too). For some reason on 2577, we decided to just install the missing policy packages instead of disabling it completely, probably because it just worked and we found it a better idea (?).

I agree with you about the final solution. That's why I put a comment to find a better solution when the time comes.

@github-project-automation github-project-automation bot moved this from In Progress 🏃 to Done ✅ in 🧙Issue tracking board Aug 28, 2024
@jimmykarily jimmykarily reopened this Aug 28, 2024
@github-project-automation github-project-automation bot moved this from Done ✅ to Under review 🔍 in 🧙Issue tracking board Aug 28, 2024
jimmykarily added a commit that referenced this issue Aug 28, 2024
Fixes:
  - #2385
  - #2802

Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
@jimmykarily
Copy link
Contributor

Fix has been merged on master

@github-project-automation github-project-automation bot moved this from Under review 🔍 to Done ✅ in 🧙Issue tracking board Aug 28, 2024
@Itxaka
Copy link
Member

Itxaka commented Aug 28, 2024

@sdwilsh thanks for sharing that link. I remembered we had an issue with selinux on Ubuntu but I couldn't remember what it was :D. On 2577, Itxaka reported on that ticket that it also works when selinux is disabled (which is the case for this ticket too). For some reason on 2577, we decided to just install the missing policy packages instead of disabling it completely, probably because it just worked and we found it a better idea (?).

I agree with you about the final solution. That's why I put a comment to find a better solution when the time comes.

Can't remember why exactly but it was about the newer Linux kernel not allowing setting the selinux stuff on the fly anymore, and those tools missing caused it to crash on boot on any parameter or something similar. So installing the policy made it boot, but indeed it should be disabled if it takes that long to relabel on each boot

mauromorales added a commit that referenced this issue Aug 29, 2024
Focuses on these fixes:

- Kairos user ids change on upgrade, breaking ssh login #2797
- Long duration hang during boot #2802
- Support TRIM in LUKS partitions mounted by Kairos #2693
mauromorales added a commit that referenced this issue Aug 29, 2024
Focuses on these fixes:

- Kairos user ids change on upgrade, breaking ssh login #2797
- Long duration hang during boot #2802
- Support TRIM in LUKS partitions mounted by Kairos #2693
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

5 participants