[ubuntu] long duration hang during boot #2802
Comments
I've hit the same thing and have the same large pause point, though mine is much faster for some reason:
|
It seems that it's blocked by selinux for some reason. Selinux support is planned for later, so we shouldn't have selinux enabled in the first place. Until we fix it, you should be able to disable it through the extra_cmdline in the kairos config ( |
@jimmykarily I remember having troubles disabling it for 24.04, but haven't tested in a long time |
24.04 is explicitly installing This was added in #2625 by @Itxaka to fix #2577. I think what we probably want to do is to set it to |
I spent some time tonight to try and workaround this by setting Additionally, looking at |
yes you can also see (this line:
) |
The problem does not occur in qemu with the same artifact. I tried both with k3s enabled and disabled, and it was not reproducible. Maybe it only happens on upgrade? I also see some "audit" logs in the journal, but probably not for the same things:
In any case though, I see no delay. |
Note that nohang is currently built from master (https://github.com/kairos-io/kairos/blob/master/images/Dockerfile.ubuntu#L35), so what happens today might not happen tomorrow, depending on whether nohang master is broken or not.
|
|
This is what I was wondering today, actually. I think in permissive mode it only logs when it would deny and does not deny anything (useful for when we want to support it, at least!), if I'm reading the docs correctly. Looking more closely, the non-selinux logs just before the hang look like this:
The errors about read-only file system are concerning, but probably unrelated. What happens after
I did reboot once after the upgrade and it took as long to boot the second time. I don't really want to re-image my whole machine, but if we think it's needed for debugging, I can try it. I suppose there could be a difference between reset and fresh install as well, right?
Do we need to ship |
The last message before the wait comes from here: https://github.com/kairos-io/immucore/blob/3574491f740536cb1d0e99b8d58bb0225fe997f6/pkg/state/steps_shared.go#L137 I don't think the immucore logs make it to journal. There should be more interesting logs in Better not re-image the machine yet. After all, we might not be able to reproduce if you do. |
I'm not going to upload this, because I don't think it's useful, but I think I've figured out what is taking so long. In
The above was the very last line of it all. These 5,494,535 log lines for relabeling were preceded by this:
So as best I can tell, it's relabeling every single file on the disk, and it's taking a really long time. Given that there are lots of log files in this output, some going all the way back to when this particular node was added in March of this year, that might be why this doesn't show up in a fresh install. |
This is just a small fraction of the files it is relabeling, and after a reboot, it's not any faster. It still looks like it tries to relabel every file on disk, although most things are set correctly on the second boot. |
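To get a rough idea of how much work a full relabel implies on a given node, one option is to count the entries on the filesystem. The following is a minimal sketch, not something taken from immucore or the selinux tooling: `count_files` is a hypothetical helper, and staying on a single filesystem is an assumption about what the relabel actually visits.

```python
import os


def count_files(root):
    """Count files and directories under root without crossing mount points.

    Hypothetical helper: whether the relabel step stays on one filesystem
    is an assumption, not something confirmed in this thread.
    """
    root_dev = os.lstat(root).st_dev
    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune subdirectories that live on a different device so that
        # os.walk does not descend into other mounted filesystems.
        dirnames[:] = [
            d for d in dirnames
            if os.lstat(os.path.join(dirpath, d)).st_dev == root_dev
        ]
        total += len(dirnames) + len(filenames)
    return total
```

Running it against the persistent partition's mount point on an old node versus a fresh install should show whether the entry count alone explains the difference in boot time.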
This does appear to be
Do we really need to run |
Another quick data point: I tried a custom image using the Kairos Factory this weekend based on 24.04. That took 32 minutes to relabel everything on one of my oldest nodes clocking in at 268 days of running k3s. |
So in 22.04,
I don't see where we enable selinux in Ubuntu, that must be coming from upstream. I'll keep looking. |
Actually, that's us here: https://github.com/kairos-io/packages/blob/d464f1b18bf1bbff110e74059d6c0f6d71fe78b9/packages/static/kairos-overlay-files/files/etc/cos/bootargs.cfg#L9 Since we haven't put any work into making selinux work, I wonder why we enable it at all. If we just want it in permissive mode to generate the logs, should we check for that mode too and avoid running |
we could change the condition to something like this:
(not tested) |
Setting selinux=0 skips relabeling. Let's see what the tests have to say about this. If everything works, I suggest we make the change in the packages repo. |
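The idea discussed above (skip relabeling when selinux is disabled, and possibly also when it is only permissive) can be sketched as a check on the kernel command line. This is an illustrative Python sketch, not the immucore condition: `parse_selinux_args` and `should_relabel` are hypothetical names, and the policy of relabeling only when `selinux=1` is set explicitly and `enforcing=0` is absent is an assumption.

```python
def parse_selinux_args(cmdline):
    """Return the last value seen for selinux= and enforcing= on a cmdline."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in ("selinux", "enforcing"):
            found[key] = value
    return found


def should_relabel(cmdline):
    """Hypothetical policy: relabel only when selinux is on and not permissive."""
    args = parse_selinux_args(cmdline)
    if args.get("selinux") != "1":
        # selinux=0, or not requested at all: nothing to relabel for.
        return False
    if args.get("enforcing") == "0":
        # Explicitly permissive: assumed here to be a reason to skip.
        return False
    return True


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()
```

On a booted node, `should_relabel(read_cmdline())` would report what this policy decides for the current boot arguments.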
because in order to support selinux, we need to do more than just enable it in the cmdline. Fixes kairos-io/kairos#2802 Signed-off-by: Dimitris Karakasilis <dimitris@karakasilis.me>
I opened the packages PR to have it ready, if we decide to go that route. |
There are a few things I want to point out:
Regardless of the short-term fix we decide to do here, something better than what we have today also needs to be figured out before |
@sdwilsh thanks for sharing that link. I remembered we had an issue with selinux on Ubuntu but I couldn't remember what it was :D. On 2577, Itxaka reported on that ticket that it also works when selinux is disabled (which is the case for this ticket too). For some reason on 2577, we decided to just install the missing policy packages instead of disabling it completely, probably because it just worked and we found it a better idea (?). I agree with you about the final solution. That's why I put a comment to find a better solution when the time comes. |
Fix has been merged on master |
I can't remember why exactly, but it was about the newer Linux kernel no longer allowing the selinux stuff to be set on the fly, and the missing tools caused it to crash on boot on any parameter, or something similar. So installing the policy made it boot, but indeed it should be disabled if it takes that long to relabel on each boot.
Kairos version:
CPU architecture, OS, and Version:
Describe the bug
I'm trying to evaluate upgrading from the 22.04 ubuntu images to the 24.04 ubuntu ones, and there's an 11+ minute hang during boot prior to switching root. In the logs below, look for Aug 11 02:35:06, and the next log line is Aug 11 02:46:48. I actually thought this was completely hung, but had to step away to eat dinner and was surprised to come back and see it had booted.
To Reproduce
Boot the image.
Expected behavior
Booting takes much less time.
Logs
journalctl -b output up until the root switch happens: https://gist.github.com/sdwilsh/91e17e2ffbed5a912937df8427a2a415
Additional context
I was previously running quay.io/kairos/ubuntu:22.04-standard-amd64-generic-v3.1.0-k3sv1.29.4-k3s1 on this, and ran quay.io/kairos/ubuntu:24.04-standard-amd64-generic-v3.1.1-k3sv1.29.4-k3s1 via kairos-agent upgrade.
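A pause like the one between Aug 11 02:35:06 and Aug 11 02:46:48 can be located mechanically in saved journalctl -b output. The following is a minimal sketch, assuming the default short journal format where each line starts with a timestamp such as "Aug 11 02:35:06". `largest_gap` is a hypothetical helper, the `year` default is an assumption (that format carries no year), and the message text in the sample lines is made up.

```python
from datetime import datetime


def largest_gap(lines, year=2024):
    """Return (seconds, earlier_line, later_line) for the longest pause.

    Lines that do not start with a "Mon DD HH:MM:SS" timestamp are skipped.
    """
    best = (0.0, None, None)
    prev_time = prev_line = None
    for line in lines:
        try:
            stamp = datetime.strptime(
                f"{year} {line[:15]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue
        if prev_time is not None:
            gap = (stamp - prev_time).total_seconds()
            if gap > best[0]:
                best = (gap, prev_line, line)
        prev_time, prev_line = stamp, line
    return best


# Sample lines using the two timestamps from this report; the host name
# and messages are placeholders, not real log content.
sample = [
    "Aug 11 02:35:06 host example: last line before the pause",
    "Aug 11 02:46:48 host example: first line after the pause",
]
seconds, before, after = largest_gap(sample)
print(seconds)  # 702.0, i.e. 11 minutes 42 seconds
```

Feeding it the lines of a saved log file, for example `largest_gap(open("boot.log"))` with a hypothetical file name, points at the two lines surrounding the hang.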