How to support installing kernel modules #249
A hugely tricky question here is whether 3rd parties will want a mechanism that also works nearly the same for … One thing I mentioned in the Silverblue+nvidia discussion is that we could add rpm-ostree support for arbitrary hooks run during upgrades. Today …
One useful pattern then would be having a Kubernetes daemonset container inject its hook into the host on startup, ensuring that it gets executed when an upgrade is attempted.
The easiest/cleanest approach is to have all kernel modules built for every kernel and provided via an RPM that requires that kernel. For example, someone could set up a COPR that triggers on every kernel build and builds a matching kernel module RPM for that kernel. Then adding the yum repo and rpm-ostree installing the RPM should suffice, correct? It's a lot uglier when we have to recompile on every upgrade on the host, especially when that host is supposed to be minimal (hence the need to do it in a container).
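To make that concrete, here's a minimal sketch of the end-user side, assuming such a COPR existed; the COPR URL, repo file name, and package name below are made up for illustration.

```bash
# Hypothetical repo/package names; the point is only the workflow.
# 1. Drop in a repo definition for the COPR that rebuilds the module
#    for every new kernel.
sudo curl -L -o /etc/yum.repos.d/wireguard-kmod.repo \
    https://copr.fedorainfracloud.org/coprs/example/wireguard-kmod/repo/fedora/example-wireguard-kmod.repo

# 2. Layer the module package. Because the RPM Requires: the exact kernel
#    it was built against, an upgrade to a kernel with no matching build
#    fails to depsolve instead of silently dropping the module.
sudo rpm-ostree install kmod-wireguard
sudo systemctl reboot
```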
coreos/rpm-ostree#1882 is a quick hack I started on the hooks thing.
@lucab If an upgrade fails, will Zincati retry later, or give up immediately? This seems like a case where a later retry might succeed.
Zincati will keep retrying after some delay, both when trying to stage (i.e. …
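If you want to watch those retries happen, Zincati runs as an ordinary systemd unit on FCOS, so its attempts show up in the journal; a minimal sketch:

```bash
# Follow Zincati's update attempts; repeated staging failures and the
# eventual successful retry are all logged here.
journalctl -u zincati.service -f

# The staged/booted deployments themselves can be inspected with:
rpm-ostree status
```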
I think I agree with this. It works just as well on FCOS/RHCOS as on traditional yum/dnf-managed systems. In the context of immutable host clusters, it makes more sense to me to build the kernel module once than to have e.g. potentially thousands of nodes all compiling it on each upgrade. Not just for efficiency, but also for keeping down the number of things that could go wrong at upgrade time. The flip side of this, though, is that we're then on the hook (pun intended) to provide tooling for this. Not everyone can use COPR. For RHCOS... maybe what we want is a way to hook into the update payload delivery flow so one can work on top of the new …
Yeah, this is a fine approach.
A slightly tricky thing here though, at least for RHCOS, is that I'd like to support shipping the kernel modules in a container via e.g. a daemonset - this is a real-world practice. Doing that with the "multi-version rpm-md repo" approach... hm, maybe the simplest is actually to write a MachineConfig that injects the …
Been thinking about this a lot lately, and we've had a ton of discussions and the usual pile of private Google docs. I want to emphasize how much I have come to agree with … One issue with this is that we don't have any direct package layering support in the MCD; we'd probably have to document dropping the …
For reference, here is how people bring nvidia & wireguard modules to CL on k8s: https://github.com/squat/modulus
OK, now I got convinced in another meeting that: …
The core problem with atomic-wireguard and similar CL-related projects is that they don't have a good way to do the "strong binding" I think is really important - to, again, block the upgrade if the kernel module won't work with the new kernel. So that seems to take us back to coreos/rpm-ostree#1882
This is intended to support kernel module systems like [atomic-wireguard](https://github.com/jdoss/atomic-wireguard). See the Fedora CoreOS tracker issue: coreos/fedora-coreos-tracker#249. With a "roothook", one can perform arbitrary modifications to the *new* root filesystem; if a hook exits with an error, that also stops the upgrade. Specifically with this, atomic-wireguard could *block* an upgrade if the new kernel isn't compatible with a module.
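To make the "block the upgrade" behavior concrete, here's a rough sketch of what such a roothook could look like. The argument convention (new root passed as `$1`) and the module artifact path are assumptions - coreos/rpm-ostree#1882 was only a prototype, so the real contract may differ. The only property relied on is "runs against the new deployment's root; a nonzero exit aborts the upgrade".

```bash
#!/bin/bash
# Hypothetical roothook: block the upgrade unless a module build exists for
# the kernel shipped in the *new* deployment.
set -euo pipefail

new_root=$1                                   # assumed: new root passed as $1
new_kver=$(ls "${new_root}/usr/lib/modules")  # assumes exactly one staged kernel

# Assume prebuilt modules are published per kernel version under /var/lib
# (a real hook might instead kick off a rebuild here and fail if it fails).
if [ ! -e "/var/lib/atomic-wireguard/${new_kver}/wireguard.ko" ]; then
    echo "no wireguard build for kernel ${new_kver}; blocking upgrade" >&2
    exit 1
fi
```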
Hmm, exactly who are we concerned about exposing things to? Is it end users or is it module producers? For example, with wireguard we could work with the maintainer and set up one project that does the building of the RPMs and creation of repos for each new kernel. So we expose the pain of the "build service" to one person (or a small group of people), and the end users don't have pain. The end users simply add the yum repo, rpm-ostree install the RPM, and it should work from then on.
Dusty, I'm largely responsible for the back and forth on this so I'll try to re-frame a bit here. I'll summarize one proposal in two points. To use an out-of-tree module on *COS: 1) the module is packaged as an RPM that can be layered onto the host, and 2) a repo exists that hosts builds of that RPM for each *COS kernel, updated in coordination with kernel updates.
I've no doubt that if the two conditions above are met, the resulting behavior at the *COS level will be robust, bordering on bulletproof. Nothing prevents the community from trying to move forward with this. I have two concerns.

Firstly, the existence of 2) above is problematic. In the product context (by which I mean OpenShift running on RHCOS) I'm getting hard pushback on the idea of introducing a new service/container that is responsible for hosting such a repo and updating it with fresh RPM builds as needed, in coordination with the updates of the underlying *COS kernel. I don't know what else to say on this point, other than that if we don't have this repo, we do not have this solution.

My deeper concern is with point 1) above. Put bluntly, I suspect that if we require RPM-ification as a prerequisite for third-party modules on *COS, we will get far fewer third-party modules on *COS. To be clear, I'm not saying that it's not possible to RPM-ify all desirable modules. What I am saying is that it's extremely unlikely to happen organically. It has had plenty of time to happen organically on Fedora and RHEL and has not. There are very good tools and approaches that can be used to do this with RPMs, and they come with many of the same advantages that the proposal outlined above would give. In spite of this, after over a decade and a half of RHEL and Fedora, some third-party kernel modules are RPM-ified but many are not. If, as I fear, it doesn't happen organically, it will not happen. We simply do not have the bandwidth in the *COS teams and the broader community to maintain these SPECs and supporting scripts on our own, nor do we have the deployed base to provide the incentive to third parties to adopt this approach. (Again, if Fedora/RHEL/CentOS can't drive this, how will we?)

What has happened organically in the kube/container space are variations on the approach best represented by Joe's work on wireguard. I'd summarize this as: …
This is substantially less prescriptive than RPMs plus package layering, has the advantage of being container-native-ish, and uses packaging/bundling techniques with a much larger user base (container builds and running containers). Thoughts?
👍
I agree with this. As noted, there isn't anything wrong with RPMs, package layering, etc. - in fact they are quite powerful - but I tend to believe using OCI containers + builds has less friction, as it already has uptake.
I figured most things that people in the Fedora/RHEL/CentOS ecosystem care about can already be delivered as an RPM. I didn't know this was that big of a blocker.
Regarding steps 1/2: that's exactly what I was proposing we do on the build side somewhere, and then the output of that process would be RPMs that could then be consumed. I think my whole point here is that it would be much cleaner to do it this way than it would be to add hooks to execute things on the host (that may or may not fail) that then modify the host on every upgrade. I think you've laid out a few points about why it's too hard to do it that way.
As I've said, I am quite sure it'd be easy for us to provide a container image which accepts kernel module sources (or potentially a pre-built module) and generates an RPM.
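A rough sketch of how such a build container could be used - the image name, mount paths, spec file, and package set are all placeholders; the point is just "module sources in, kernel-matched RPM out":

```bash
# Placeholder image/paths; a real builder image would ship the toolchain
# and kernel headers pre-installed.
podman run --rm \
    -v ./wireguard-src:/src:Z \
    -v ./out:/out:Z \
    registry.example.com/kmod-builder:latest \
    bash -c '
        dnf install -y rpm-build gcc make "kernel-devel-$(uname -r)" &&
        rpmbuild -ba /src/wireguard-kmod.spec --define "_rpmdir /out"
    '
```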
But that doesn't solve the binding problem on its own. We're talking about kernel modules which execute fully on the host, so saying "OCI containers" is deceptive, as it's really host-tied. There are some blurry lines here about how much containers are used, but it's not just containers.
For the NVIDIA use-case, we have been using a DriverContainer for 3.10/3.11 (AtomicHost | RHEL) and for 4.x (RHCOS | RHEL): https://gitlab.com/nvidia/container-images/driver

The reference implementation of a GPU operator (https://github.com/openshift-psap/special-resource-operator), which NVIDIA uses as a template to implement their "official" GPU operator, uses the DriverContainer to install the drivers on a host (RHCOS or RHEL). NVIDIA uses source installs, but we have created a DriverContainer that uses released RPMs; this way we are only using tested driver versions. The GPU operator checks the kernel version and OS and deploys the correct DriverContainer to the node.

The benefits of a DriverContainer:
- We can easily update drivers and libraries. In the case of NVIDIA, a prestart hook injects libs, bins, and config files from the DriverContainer into GPU workload containers.
- It works on RHCOS and RHEL.
- We are not touching the base OS.
- If the node gets updated, the DriverContainer will not be scheduled on the new node, since it has a nodeSelector on kernel version, operating system, and library version.
- We can easily have several DriverContainers running on the same node to support several accelerator cards.
- DriverContainers take care of module loading/unloading and of starting the services needed for a specific accelerator card to work. If one removes the DriverContainer, it takes care of module unloading and cleanup.
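For readers unfamiliar with the pattern, here is a heavily simplified sketch of what a DriverContainer entrypoint tends to do; the package name, module name, and privilege details are placeholders, not NVIDIA's actual implementation.

```bash
#!/bin/bash
# Simplified DriverContainer entrypoint: install a driver build matching the
# host kernel the pod landed on, load it, and clean up when the pod is removed.
set -euo pipefail

KVER=$(uname -r)   # containers share the host kernel

# Install a pre-built, tested driver RPM for this exact kernel
# (placeholder package name).
dnf install -y "kmod-example-driver-${KVER}"

# Load the module; requires a privileged pod with host module dirs mounted in.
modprobe example_driver

# Unload on pod termination so removing the DriverContainer cleans up.
trap 'modprobe -r example_driver; exit 0' TERM EXIT
sleep infinity &
wait $!
```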
Thanks @zvonkok for the references! That should be useful |
On this topic, I have been looking recently at the atomic-wireguard implementation and have come up with a similar proof of concept called kmods-via-containers. Included in the project is a complete simple-kmod example. I have also done some work to make sure this works with a real-world example; for that I used Mellanox on RHEL 8.
One thing that came up on an RHT-internal thread is that we should probably support (for RHCOS) driver update disks; there are apparently some vendors that make pre-built RPMs that actually use kABI, so they don't need to be rebuilt for kernel updates.
I think for OpenShift, we should focus on https://github.com/openshift-psap/special-resource-operator |
@cgwalters thanks very much for your response on openshift/installer#3761, …
No additional technical content to add here, but I will say that I am seeing a lot more end users of OpenShift/CoreOS asking about this kind of functionality, especially to support their preferred security vendors. Tools like Falco, Sysdig, etc... It would be VERY useful to be able to say that there is a supported solution for getting kernel modules/settings into nodes without breaking the cluster. |
If you create your own image, should you be able to install drivers/kernel modules? I can't seem to get this to work: https://www.itix.fr/blog/build-your-own-distribution-on-fedora-coreos/ |
Of course. If you create your own image, you can do anything you like. By far the easiest way to get this to work (IMO) is to build your own RPM for the kernel module and include the rpm/yum repo in the definition manifest list that's fed into …
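Assuming the image build is done with coreos-assembler (cosa), as in the blog post linked above, the flow is roughly the following; the repo URL, repo id, and package name are placeholders, and the exact manifest layout may differ between config branches.

```bash
# Rough sketch: layer a custom kmod RPM into your own FCOS build.
cosa init https://github.com/coreos/fedora-coreos-config
cd src/config

# Add a repo file for wherever the per-kernel kmod RPMs are published
# (placeholder URL), then reference it and the package from the manifest.
cat > my-kmods.repo <<'EOF'
[my-kmods]
name=My kernel module builds
baseurl=https://example.com/repos/my-kmods/$basearch/
gpgcheck=0
EOF
# ...and add "my-kmods" under `repos:` plus e.g. "kmod-example" under
# `packages:` in the manifest, then build:
cd ../..
cosa fetch && cosa build
```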
Users may have a need to install kernel drivers on their hosts to support additional hardware. This could be required for boot (a day 1 operation) or could be required after install to enable adapters (a day 2 operation).
The straightforward way to accomplish this is to package the drivers in RPM format, so that they can be installed via `rpm-ostree install`. Users may want to be able to build these drivers on an FCOS host, which would require a container with the necessary dependencies installed. It would be useful to come up with a framework that is generic enough to be reused by multiple drivers and that makes it possible to produce multiple versions of the driver (per kernel version).
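As a sketch of the "build on an FCOS host inside a container" part - the module path and names are placeholders, and the exact dependency set varies by module:

```bash
# Build an out-of-tree module from inside a stock Fedora container that
# carries the toolchain; the host's exact kernel-devel is pulled in by version.
podman run --rm \
    -v /var/home/core/example-module:/src:Z \
    registry.fedoraproject.org/fedora:latest \
    bash -c '
        dnf install -y gcc make elfutils-libelf-devel "kernel-devel-$(uname -r)" &&
        make -C "/usr/src/kernels/$(uname -r)" M=/src modules
    '
# The resulting .ko lands back in the mounted directory on the host, where it
# can be loaded with insmod/modprobe.
```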
Copying notes from @cgwalters below: