[4.12 upgrade] Node upgrade fails because of SELinux policies preventing nm-dispatcher from working
#1475

Comments
MCO operator says:
MCO controller says:

Checking why okd-xwwxf-master-2 is not coming back from the reboot
In 4.11 -> 4.12 we upgrade from F36 to F37. The NM dispatcher on F37 expects scripts to be labelled with the newer SELinux context. Workaround:
Not sure why the MCD/rpm-ostree rebase didn't update the labels. Possibly an rpm-ostree/MCO regression? cc @cgwalters
No, that's effectively a one-way transition. We want … As far as the incorrect label... hmm, that definitely needs some debugging. Does the …
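Not from the thread itself: a minimal sketch, assuming the standard SELinux userland tools (`ls -Z`, `matchpathcon`, `restorecon`) are present on the node, of how the actual versus expected labels on the dispatcher scripts could be compared and repaired:

```shell
# Show the SELinux labels currently on the dispatcher scripts;
# on an affected node these may still carry the old F36 context.
ls -Z /etc/NetworkManager/dispatcher.d/

# Show what the loaded policy expects the labels to be.
matchpathcon /etc/NetworkManager/dispatcher.d/*

# Dry-run the relabel first (-n), then apply it.
restorecon -nvR /etc/NetworkManager/dispatcher.d/
restorecon -vR /etc/NetworkManager/dispatcher.d/
```

These commands must be run as root on the node itself.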
On an upgrading node after the OS update restart:
I booted 36.20221030.3.0 and that doesn't seem to be true; I see … in a stock node.
Some dispatcher related issues were documented in coreos/fedora-coreos-tracker#1218. Not sure if that's part of the problem here or not.
Using the following works for my 4.11 to 4.12 upgrade (vSphere IPI). Did not need to set enforcing=0 on boot.
Hi, I had the same issue. For those who find themselves in the situation of a blocked update, the workaround at the following URL worked in my case: #1317 (comment)
We've recently hit this while updating one of our clusters and are a bit concerned with the impact this has on MachineSet scaling or other "new provisioning" scenarios in existing clusters. Are there any potential workarounds aside from the MachineConfig workaround in #1317 (comment)? We've done some limited testing of that workaround and it doesn't appear to work for new systems. What we've seen is that systems get provisioned but never make it to a running Node. We're going to do some more testing to see what additional issues are encountered, but given that we're in uncharted territory I'm reluctant to post an issue about an environment that's had a workaround applied to it.
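For illustration only (the actual content of the #1317 workaround is behind the link above, and is not reproduced in this thread): a MachineConfig of that general shape, with the unit and resource names invented here, might relabel the dispatcher scripts before kubelet starts:

```yaml
# Hypothetical sketch; not the actual MachineConfig from #1317.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-relabel-nm-dispatcher   # name is illustrative
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
        - name: relabel-nm-dispatcher.service
          enabled: true
          contents: |
            [Unit]
            Description=Restore SELinux labels on NM dispatcher scripts
            Before=kubelet.service

            [Service]
            Type=oneshot
            ExecStart=/usr/sbin/restorecon -Rv /etc/NetworkManager/dispatcher.d/

            [Install]
            WantedBy=multi-user.target
```

A matching `master` variant would be needed for control-plane nodes.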
Is there anything else we can do to determine the cause of this? It seems to still be impacting new Machine builds in 4.12.0-0.okd-2023-03-05-022504. There's a FCOS issue mentioned upthread and then there's #1438 and #1450 where it seems selinux is at play as in this issue but there's no clear identification (to me at least) of where the root of the issue is and thus where we can focus for a fix. Happy to help test in any way that we can. |
Hello, I hit the same issue when upgrading a cluster from 4.11 to 4.12 and tried to apply the workaround mentioned above.
However, I now hit another issue related to OVN, with no idea whatsoever how to debug it :/
I get no network connectivity on the node. Or maybe OVN fails to come up because NetworkManager didn't manage to start properly, but that does not appear in the logs? Thanks a lot in advance for any help regarding this.
@Bengrunt To be clear: you shelled into the broken node(s) and executed the …? (If so, you'll probably want to open a new issue and attach a must-gather to get some visibility. Also be sure to mention that it's currently an OVN issue!)
@nate-duke Not exactly; what I did was:
But maybe you're right, I should open a new bug instead. Sorry about that.
Hello, just to let other users who hit this issue know: I eventually managed to make the above-mentioned workaround work, by running it manually in single mode on the nodes and overriding the MCD's validation process. I was thus able to carry out the cluster upgrade, and then ran two successive cluster upgrades without any issue. So I imagine that others with clusters deployed back in 4.6 or 4.7 could work around this issue using the same technique. Feels like I learned a lot about FCOS, rpm-ostree, and the MCO/MCD in the process 😆
I posted this in the bug referenced above, but the rhcos-selinux-policy-upgrade.service is supposed to be rebuilding the SELinux policy, and it's not running because it's trying to use a variable that doesn't exist in FCOS.
It probably needs to be updated to use just Version.
Hello, we experience the same issue with an IPI installation on OpenStack. The initial cluster version was 4.8, and we have been updating since then. After the update to 4.12, kubelet fails to start because the NetworkManager scripts have incorrect SELinux labels, and the file /run/resolv-prepender-kni-conf-done is never created. Running restorecon -vR /etc/NetworkManager/dispatcher.d/ seems to fix the issue for the kubelet, and it starts normally, but then afterburn-hostname.service fails on boot. A manual restart of afterburn-hostname.service runs OK, though.
I ran into the same error situation when upgrading from 4.11.0-0.okd-2023-01-14-152430 to 4.12.0-0.okd-2023-04-16-041331. During the upgrade process, if a node switched to the "Not Ready" state, executing restorecon -vR /etc/NetworkManager/dispatcher.d/; semodule -B was enough to continue the upgrade. The labels get reset at the beginning, so executing this command ahead of time on every node does not work.
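The recovery steps reported in the comments above can be summarised as the following commands, run as root on a node stuck in "Not Ready" (as described by the commenters; not independently verified here):

```shell
# Relabel the NetworkManager dispatcher scripts to match the F37 policy.
restorecon -vR /etc/NetworkManager/dispatcher.d/

# Rebuild and reload the SELinux policy modules.
semodule -B

# Check whether kubelet can now start.
systemctl status kubelet
```

Per the comment above, this only helps once the node is already stuck; running it pre-emptively does not work because the labels get reset during the upgrade.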
So, we're still dealing with this on every new node provision (and nearly if not every update?). Is there a recommended place we can file an issue to get this fixed in FCOS as mentioned in #1475 (comment)? |
Ah yes. Please file an issue on https://github.com/openshift/os/ with something like: … Please also include a link to this issue here.
Hi, We are not working on FCOS builds of OKD any more. Please see these documents... https://okd.io/blog/2024/06/01/okd-future-statement Please test with the OKD SCOS nightlies and file a new issue as needed. Many thanks, Jaime |
Describe the bug
Upgrading OKD from 4.11 to 4.12, I'm stopped by kubelets not starting on both master and worker nodes. The problem is the same each time: the file /run/resolv-prepender-kni-conf-done does not get created, so kubelet's pre-condition does not allow it to start. Logs are full of SELinux prohibiting nm-dispatcher from reading NetworkManager's configuration.

Version
IPI with vSphere, 4.11.0-0.okd-2023-01-14-152430 updating to 4.12.0-0.okd-2023-01-21-055900.
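As a hypothetical illustration only (this issue does not quote the actual kubelet unit, so the path and directive below are assumptions about how such a gate could be expressed): a systemd condition on that file might look like:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/kubelet.service.d/10-kni-conf.conf
[Unit]
# Do not start kubelet until the resolv-prepender dispatcher script
# has signalled completion by creating this file.
ConditionPathExists=/run/resolv-prepender-kni-conf-done
```

If the dispatcher script is blocked by SELinux, the file never appears and the condition never passes, which matches the observed symptom.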
How reproducible
100% so far. Adding a node works, but only because it starts with an earlier version of Fedora CoreOS, which will probably get updated in time and fail too.
Log bundle
https://drive.google.com/file/d/16oVumQ6SAHoiP2FlvItbAsIY87CvcW64/view?usp=sharing