
Split RHCOS into layers #1637

Draft
jlebon wants to merge 1 commit into master from pr/split-rhcos-into-layers

Conversation

jlebon
Member

@jlebon jlebon commented Jun 7, 2024

This enhancement describes improvements to the way RHEL CoreOS (RHCOS) is built so that it will better align with image mode for RHEL, all while also providing benefits on the OpenShift side. Currently, RHCOS is built as a single layer that includes both RHEL and OCP content. This enhancement proposes splitting it into three layers. Going from bottom to top:

  1. the (RHEL-versioned) bootc layer (i.e. the base rhel-bootc image shared with image mode for RHEL)
  2. the (RHEL-versioned) CoreOS layer (i.e. coreos-installer, ignition, afterburn, scripts, etc...)
  3. the (OCP-versioned) node layer (i.e. kubelet, cri-o, etc...)

The terms "bootc layer", "CoreOS layer", and "node layer" will be used throughout this enhancement to refer to these.

The details of this enhancement focus on doing the first split: creating the node layer as distinct from the CoreOS layer (which will not yet be rebased on top of a bootc layer). The two changes involved that most affect OCP are:

  1. bootimages will no longer contain OCP components (e.g. kubelet, cri-o, etc...)
  2. the rhel-coreos payload image will be built in Prow/Konflux (as any other)

Tracked at: https://issues.redhat.com/browse/OCPSTRAT-1190

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Jun 7, 2024

openshift-ci bot commented Jun 7, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Jun 7, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please assign mandre for approval. For more information see the Kubernetes Code Review Process.

Comment on lines +7 to +17
reviewers:
- "@patrickdillon, for installer impact"
- "@rphillips, for node impact"
- "@joepvd, for ART impact"
- "@sinnykumari, for MCO impact"
- "@LorbusChris, for OKD impact"
- "@zaneb, for agent installer impact"
- "@sdodson, for overall architecture"
- "@cgwalters, for overall architecture"
approvers:
- "@mrunalp"
Member Author

Apparently, the bot won't automatically tag the folks listed here, so manually doing it: @patrickdillon @rphillips @joepvd @sinnykumari @LorbusChris @zaneb @sdodson @cgwalters.

Member

@cgwalters cgwalters left a comment

Awesome work on this!

Comment on lines 136 to 138
To do this, we will start building two new streams in the RHCOS pipeline containing only pure
RHEL/CentOS Stream content (let's call these the "pure RHEL stream" and "pure CentOS stream").
Those streams will also be building the usual bootimages and uploading them to cloud.
Member

Per above, IMO we need a clear, concise and understandable term for this...I don't think many people understand "stream" in this context (even though it is applicable!). I was straw-manning rhel-coreos-base above. And to extend on that, I think there's a bit of an open question here whether in the bigger picture the pipeline really needs to churn bootimages as often given that we are producing supported tooling for materializing on-demand disk images from container images.

So in the more medium term I'd advocate trying to create a clearer split between containers and disk images versus rolling them into a "stream" concept.

Member Author

Yeah, wanted to avoid mentioning streams here, since it's so overloaded, but when it gets to implementation details, it'd be weird to not discuss it. I'll rework it to clarify that we're talking about "CoreOS pipeline streams" and not something else.

Re. lowering cadence of bootimages, yes that's definitely an overall goal of this. Part of it will happen naturally simply by the fact that we're decoupling OCP churn from the bootimages. The other part will be through conscious efforts (need to pick up coreos/fedora-coreos-pipeline#810 again). But I'd say this is more CoreOS pipeline implementation details.

Comment on lines +140 to +146
Once we have these bootimages, we can better start adapting components that will need it
to account for the lack of OpenShift components in the bootimages. Likely suspects here are
any components involved in the bringup of the cluster (installer, Assisted Installer, MCO, etc...).
Member

We have three choices broadly:

  • Per above, pull and reboot into the OCP container image, which is already done in the installer today for OCP. Note a subtle but very important point today...OCP sometimes ships kernels outside of RHEL cycles, and while that's not usually relevant for bootstrap, I wouldn't say it never would be
  • Special case bootstrap to just e.g. bootc usroverlay + dnf install
  • Run the installation process in a container (IMO...this is actually best)
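For the second option, a minimal sketch of what that special-cased bootstrap step could look like; the package names and the assumption that an RPM repo with the OCP packages is reachable are illustrative, not something the enhancement specifies:

```bash
# Hypothetical sketch of "bootc usroverlay + dnf install" during bootstrap.
# Assumes an RPM repo carrying the OCP packages is already configured and reachable.
bootc usroverlay                          # transient, in-memory writable overlay on /usr
dnf -y install openshift-hyperkube cri-o  # package names are placeholders
systemctl daemon-reload                   # pick up the newly installed unit files
```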

Member

That reads to me like a next step after this enhancement? Ideally we don't change the cluster installation flow (yet) and complete this transition first before looking at other installation options?

Member

That reads to me like a next step after this enhancement? Ideally we don't change the cluster installation flow (yet) and complete this transition first before looking at other installation options?

If the bootimages are changed to not contain OCP content as the enhancement says, then things will definitely need to change in the install process.

Or, that bit could be moved to a later phase, which also seems viable to me (but I think also lowers the value of things, because being able to reuse the same bootimages for multiple OCP versions would be quite a big improvement).

Member Author

Note a subtle but very important point today...OCP sometimes ships kernels outside of RHEL cycles, and while that's not usually relevant for bootstrap, I wouldn't say it never would be

Can you expand on this? E.g. FIPS/crypto-related things? I could imagine detecting those cases where we should reboot and only do so then. Though in the common case it'd be very desirable to try to avoid rebooting.

Run the installation process in a container (IMO...this is actually best)

Doesn't that imply trying to run the kubelet containerized? Not sure how supported that is nowadays.

Special case bootstrap to just e.g. bootc usroverlay + dnf install

Yeah, definitely a potential path forward that's not too complex. We could ship the kubelet RPM in the extensions container image, which is already in the payload.

Hmm, or another approach is to just copy the kubelet out of rhel-coreos, since it's almost static anyway. There's still glibc, but I don't think mismatch issues would be a concern here (this isn't the "new worker node" flow with arbitrarily old bootimages, here we're doing an install so we should be guaranteed to be using matching images + it's a known combination that can be extensively tested in CI).
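A sketch of what that copy could look like; the paths and target location are assumptions, and `image_for` is the bootstrap-script helper referenced later in this thread:

```bash
# Hypothetical: extract the kubelet binary from the node image without running it.
NODE_IMAGE=$(image_for rhel-coreos)     # image_for: helper from the bootstrap scripts
ctr=$(podman create "${NODE_IMAGE}")    # create (don't start) a container from the image
podman cp "${ctr}:/usr/bin/kubelet" /usr/local/bin/kubelet
podman rm "${ctr}"
```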

Member

@cgwalters cgwalters Jun 11, 2024

Doesn't that imply trying to run the kubelet containerized? Not sure how supported that is nowadays.

Yes, but the important thing here is that we don't need any non-core kubelet functionality. In particular, CSI (storage) drivers running containerized I think were the #1 problem. But we don't need any of that, it just needs to run pods.

There is also the angle that nowadays we could just pass pod definitions directly to podman...which I don't believe existed or was really fleshed out when OCP 4.0 was being designed. That may even be simplest.

Contributor

Hmm, or another approach is to just copy the kubelet out of rhel-coreos, since it's almost static anyway. There's still glibc, but I don't think mismatch issues would be a concern here

This might lead to bugs such as https://issues.redhat.com/browse/MGMT-16705 (in this case the kubelet is incompatible because it's doing late-binding, hence the node was booted with a default RHCOS and not the RHCOS of the release, yet the kubelet.conf is the one for the release).

Member

Likely suspects here are any components involved in the bringup of the cluster

I think it would be more helpful to focus on stages rather than components. There are 3 that will need to work. In reverse order:

  1. Booting a host as a node after it is installed. Presumably this is a fairly straightforward matter of having it boot the OCP layer from the release payload.
  2. Running the bootstrap node. This normally depends on OCP components. In the Assisted/ABI case, it necessarily runs on the base image, and we cannot stop services that were already running on the host prior to bootstrap starting.
  3. The Assisted discovery ISO/ABI agent ISO. I believe this is a no-op, as neither of them should depend on OCP components today.

Member Author

Hmm, or another approach is to just copy the kubelet out of rhel-coreos, since it's almost static anyway. There's still glibc, but I don't think mismatch issues would be a concern here

This might lead to bugs such as issues.redhat.com/browse/MGMT-16705 (in this case the kubelet is incompatible because it's doing late-binding, hence the node was booted with a default RHCOS and not the RHCOS of the release, yet the kubelet.conf is the one for the release).

Is this something we should support though? Having the bootstrap process also have to worry about different bootimage versions instead of only the blessed, heavily CI-tested, version in rhcos.json doesn't sound great.

Member Author

Running the bootstrap node. This normally depends on OCP components. In the Assisted/ABI case, it necessarily runs on the base image, and we cannot stop services that were already running on the host prior to bootstrap starting.

What does ABI refer to here?

Can you expand/link to more information re. the "we cannot stop services that were already running on the host prior to bootstrap starting"? Don't we own the bootstrap node?

Member

ABI = agent-based installer

In both assisted and ABI installs, the bootstrap-in-place runs inside the live ISO that is already running the agent. I believe assisted-service doesn't take well to the agent being restarted, but ABI has even tighter requirements: assisted-service itself is running on the bootstrap node, so interrupting it will break the entire installation process.


#### Hypershift / Hosted Control Planes

TODO
Member

One thing I've mentioned in other places here is that if we can get to the point of shrinking our use case for Ignition to be very small by using container images instead, then we don't require the MCS anymore, which would definitely simplify Hypershift.

Member Author

I'll add a note for that, though probably in the "Follow-ups" section instead.


#### Standalone Clusters

TODO
Member

This is what the enhancement is currently about right? I think we can just state:

This enhancement covers standalone clusters.


#### Single-node Deployments or MicroShift

TODO
Member

I think we can clearly say that the overarching changes here are going to much more strongly align MicroShift with OCP.

I also think (hope, pretty confidently) we aren't going to break anything with SNO here and will trend towards improving that too.

Member Author

Yeah, I wonder if it'd make sense to share with MicroShift the node layer definition file (i.e. the MicroShift layers would be: bootc layer, node layer, MicroShift layer). Would be cool to have it actually derive from the OCP node image, but it definitely includes a lot of stuff not needed there so that doesn't make sense.

For SNO, the way it does bootstrap-in-place using the live ISO I think should guide how we implement bootstrapping so that it works just as well there as it does in the standalone flow (which also then argues for not rebooting).


### Risks and Mitigations

TODO
Member

One thing I've learned here when we first did the layering switch is that it's a bit like changing the jet engine mid flight, and this is going to be bigger than that. OTOH, we know more, and we have a plan.

Member Author

Yeah, I have some items to add to this section!

additional testing, this layered image will replace the current `rhel-coreos` image in the
production release payload.

### Workflow Description
Member

We should probably note somewhere that users will have to look at RHEL versioned boot images in the future and not OCP versioned ones anymore. This will likely create some confusion at first.

Member Author

Not sure about that. I think it would still make sense on https://mirror.openshift.com/ to have the bootimages accessible through OCP-versioned directories (but e.g. it'd be silly not to symlink them on the server side). I think that would reduce the likelihood that users pick the wrong bootimages.

Member Author

@travier Any thoughts on this? Should we resolve this?

Comment on lines +110 to +113
- Introduce cluster administrator-visible changes. This change should be transparent to
administrators. CoreOS layering instructions should keep working as is, but documentation
should ideally be reworked to leverage rhel-bootc docs more.
Member

We'll have to reword that to mention the boot image versioning change.

@rphillips
Contributor

openshift/kubernetes has a specific workflow where jobs will build a new kubelet to use during the job run. This helps with rebase work and validating new kubernetes versions coming into OpenShift. We should preserve this workflow when migrating to RHCOS layering.

/cc @soltysh

@openshift-ci openshift-ci bot requested a review from soltysh June 12, 2024 16:00
@jlebon
Member Author

jlebon commented Jun 12, 2024

openshift/kubernetes has a specific workflow where jobs will build a new kubelet to use during the job run. This helps with rebase work and validating new kubernetes versions coming into OpenShift. We should preserve this workflow when migrating to RHCOS layering.

/cc @soltysh

I don't expect any issues there. That workflow should keep working as is.

@jlebon jlebon force-pushed the pr/split-rhcos-into-layers branch from f79684b to a6a7438 on June 20, 2024 21:15

Another important follow-up would be to hook up better CI testing to openshift/os. There is
currently no CI test on that repo which actually launches a cluster. The reason is that RHCOS
is just built differently. But now that the node image is simply yet another layered image build,
it fits perfectly in Prow's opinionated model of building the image and shoving it in the test
Member

Worth noting that the bootstrap node can't be tested this way. (The final node can, which is great.)

into the node image before continuing with bootstrapping, but that incurs an additional
reboot which is against the stated goals. Some other possibilities so far:
- Run the kubelet as a container
- Use podman to run the static pods
Member

FWIW my experience with podman kube play a couple of years ago was a bit frustrating. There wasn't a way to generate the correct systemd service file automatically (podman generate systemd didn't work for kube-played pods), and having all containers in the pod running under a single systemd service was not great - systemd can only watch one conmon, so you lost the ability to granularly control what happens when a container within the pod dies. Logs for the different containers aren't separated in the journal.
Kube Play support in quadlet is a big step forward (not sure if that landed in RHEL 9 yet?), but if there is any internal complexity in the pods we are starting then we might still come to regret relying on it.
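For reference, a minimal sketch of the quadlet Kube unit being discussed (podman >= 4.4); the unit name and the pod manifest path are made up for illustration:

```bash
# Hypothetical quadlet .kube unit; quadlet generates a matching .service from it.
cat > /etc/containers/systemd/bootstrap-pod.kube <<'EOF'
[Unit]
Description=Run a bootstrap static pod via podman kube play

[Kube]
Yaml=/etc/kubernetes/manifests/example-pod.yaml

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start bootstrap-pod.service   # generated by quadlet from bootstrap-pod.kube
```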

Member

quadlet does work on RHEL 9.

Member

Yeah, looks like Kube unit in quadlet was there when quadlet first rolled out in podman 4.4.
I misremembered, thinking it was only added later. It was Pod unit that was added later, in podman 5.0.

Contributor

@rphillips rphillips Jun 27, 2024

There are specific shutdown requirements for static and regular pods. Podman running these pods would not work out for the shutdown logic. Additionally, Kubelet also has to report status and mirror pods to the API. I do not believe having podman run static pods is a viable solution.

Member Author

@rphillips And to be clear, we rely on these parts of the kubelet surface even just in the bootstrapping phase? And it would be too onerous to try to adapt to the gaps that switching to podman kube play would create?

Contributor

@rphillips rphillips Jun 27, 2024

Correct. Static pods run in a normal OpenShift cluster, namely: kube-apiserver, controller-manager, and etcd. Networking may have 1-2 as well. All these static pods are reported to the API server, and operators interact with configuring and managing them via the K8S API.

podman kube play might be able to run them, and could be written to follow the correct termination sequence; however, the Kubelet reports the static pods to the API as "mirror" pods which includes their pod lifecycle state. If we were to go down this path, then there would be a lot of re-writing of core K8S functionality.

- How should we adapt the bootstrapping process of the installer to handle the lack
of oc and kubelet in the bootimage? The easy approach would be to download and pivot
into the node image before continuing with bootstrapping, but that incurs an additional
reboot which is against the stated goals. Some other possibilities so far:
Member

This is a non-starter for ABI and Assisted.

reboot which is against the stated goals. Some other possibilities so far:
- Run the kubelet as a container
- Use podman to run the static pods
- Install the kubelet RPM
Member

From where?

Contributor

@rphillips rphillips Jun 26, 2024

This is probably the best approach. OpenShift would have to publish an RPM repo. Another issue that would arise is offline installs: where do they get oc and the kubelet?

Member

I think the rpm repo would have to be running in a container from the release payload.
Even finding that without oc is challenging though, since it involves digging through the metadata to find the right image.

Member Author

We actually already ship RPMs today in the extensions container (part of the payload). I'm not sure yet whether it makes sense to ship the kubelet there (the package set is 95% RHEL, so ideally it would also be RHEL-versioned), but it's not hard to add another container image similar to it.

I think this is probably the lowest friction approach. It's just awkward to have the kubelet RPM in a container image and in the node image. It's not very far at that point from just copying the kubelet out of the image.

Even finding that without oc is challenging though, since it involves digging through the metadata to find the right image.

That part should be fine. `image_for` uses podman, and we can run `oc` using podman as well. Basically: `podman run ... $(image_for rhel-coreos) oc ...`.
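Roughly like the following; this is a hedged sketch, and the flags and pull-secret path are illustrative rather than the actual bootstrap script contents:

```bash
# Hypothetical: run oc from the node image itself, so the bootimage doesn't need it.
NODE_IMAGE=$(image_for rhel-coreos)
podman run --rm --authfile /root/.docker/config.json "${NODE_IMAGE}" oc version --client
```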

Contributor

@rphillips rphillips Jun 27, 2024

For reference, the hyperkube RPM spec was recently refactored to split controller-manager, api-server, and kubelet into separate RPMs. We used to ship the api-server and controller-manager in RHCOS, but never used them there; the container images for those components run them from the image. Since the recent change, the kubelet RPM is injected, but it wasn't always the case.

I worry about copying the kubelet out of the image due to openssl, FIPS, glibc, and perhaps other dependencies. I might vote that 'awkward' is OK if it's just a duplication of a binary; the alternative of copying the binary out of an image seems a bit more brittle.

- Run the kubelet as a container
- Use podman to run the static pods
- Install the kubelet RPM
- `bootc apply-live`
Member

Would this result in running systemd services being stopped? If so this is a non-starter for ABI and Assisted.

Member Author

It wouldn't in this case, no. (Also I didn't mention it there, but this would be rpm-ostree apply-live for now; containers/bootc#76).

But actually, I don't think this will work for the ISO case currently. Or at least, it'd need some work I think on the bootc/rpm-ostree side first.

- Use podman to run the static pods
- Install the kubelet RPM
- `bootc apply-live`
- `systemctl soft-reboot`
Member

This is a non-starter for ABI and Assisted.

Member

Do we have more details for ABI/Assisted and an option which will work for those scenarios?

Member

Do we have more details for ABI/Assisted

How long have you got? 🙃

and an option which will work for those scenarios?

There's a discussion in the main thread that seems promising: #1637 (comment)

@zaneb
Member

zaneb commented Jun 24, 2024

/cc @cybertron @andfasano

@soltysh
Member

soltysh commented Jun 26, 2024

I don't expect any issues there. That workflow should keep working as is.

I believe this was the pre-req work done in openshift/kubernetes#1805, which ensured we won't have problems in o/k.

of oc and kubelet in the bootimage? The easy approach would be to download and pivot
into the node image before continuing with bootstrapping, but that incurs an additional
reboot which is against the stated goals. Some other possibilities so far:
- Run the kubelet as a container
Contributor

Upstream no longer supports running the kubelet in a container. CoreOS and other Kubernetes distributions ran the kubelet in a container many years ago; the lesson learned was not to do it.

Member

Yes, we can rule this one out.

on the RHEL side is that we would have only one stream of RHCOS per RHEL release,
rather than one per OpenShift release. This greatly reduces the workload on the
CoreOS team. Another benefit is easier integration in the CI processes of
rhel-bootc and centos-bootc, as well as better shared documentation.


A user-facing enhancement is that hardware certified by partners for RHEL will also be RHCOS-certified, which is currently not the case.

Member

I don't think that's accurate. The installation methods and composition of RHCOS today may yield certain differences in use, but it's still RHEL and our position is still that it has the same hardware certification that RHEL carries.

- "@LorbusChris, for OKD impact"
- "@zaneb, for agent installer impact"
- "@sdodson, for overall architecture"
- "@cgwalters, for overall architecture"


@fabiendupont @bthurber @ybettan please review in the context of KMM and DTK


Since KMM and DTK are day 2 in the case of OCP, I don't see any real impact or change here. We aren't managing anything as day 0.


### Non-Goals

- Change the cluster installation flow. It should remain the same whether IPI/UPI/AI/etc...
Contributor

Suggested change
- Change the cluster installation flow. It should remain the same whether IPI/UPI/AI/etc...
- Change the cluster installation flow. It should remain the same whether IPI/UPI/AI/ABI/etc...

jlebon added a commit to jlebon/installer that referenced this pull request Jul 16, 2024
As per openshift/enhancements#1637, we're trying
to get rid of all OpenShift-versioned components from the bootimages.

This means that there will no longer be `oc`, `kubelet`, or `crio`
binaries for example, which bootstrapping obviously relies on.

Instead, now we change things up so that early on when booting the
bootstrap node, we pull down the node image, unencapsulate it (this just
means convert it back to an OSTree commit), then mount over its `/usr`,
and import new `/etc` content.

This is done by isolating to a different systemd target to only bring
up the minimum number of services to do the pivot and then carry on
with bootstrapping.

This does not incur additional reboots and should be compatible
with AI/ABI/SNO. But it is of course, a huge conceptual shift in how
bootstrapping works. With this, we would now always be sure that we're
using the same binaries as the target version as part of bootstrapping,
which should alleviate some issues such as AI late-binding (see e.g.
https://issues.redhat.com/browse/MGMT-16705).

The big exception of course being the kernel. Relatedly, currently
`/usr/lib/modules` is also shadowed by the mount, but we could re-mount
it if needed.

To be conservative, the new logic only kicks in when using bootimages
which do not have `oc`. This will allow us to ratchet this in more
easily.

Down the line, we should be able to replace some of this with
`bootc apply-live` once that's available (and also works in a live
environment). (See containers/bootc#76.)

For full context, see the linked enhancement and discussions there.
@jlebon
Member Author

jlebon commented Jul 16, 2024

OK, so let's resume the bootstrapping issue. Restating some of the things from above and from researching further:

  • We can't run the kubelet in a container because it's no longer supported.
  • The delta between kubelet and podman play is too large to make the latter a feasible replacement.
  • systemctl soft-reboot is not in RHEL9.
  • In the AI/ABI/SNO cases, bootstrapping happens in the live environment where e.g. rebooting is not possible.
  • I considered cobbling something around kexec, but in the limit, there are potential issues with kexec and hardware reliability, as well as how it meshes with Secure Boot.

What I'm playing with now is basically to have a special node-image-pivot.target that the node isolates to first. There, we pull the node image, unencapsulate it, check out its contents, and then mount over /usr and do a rough 3-way /etc merge. We then isolate back to multi-user.target to continue with the bootstrapping process.

This is in effect like a more aggressive bootc/rpm-ostree apply-live, though that doesn't currently work in live environments. (Though even in the non-live case, there are some issues there that would need to be resolved.) It's close to what OKD currently does when using FCOS live media today, though using the ostree stack and isolating targets should make this more robust.
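As a very rough sketch of that flow — the commands, refs, and paths below are assumptions for illustration (in particular the ostree-container invocation), not what the WIP actually does:

```bash
# Hypothetical outline of the live pivot; the real logic lives in the WIP installer PR.
NODE_IMAGE=$(image_for rhel-coreos)

# 1. Pull the node image and unencapsulate it back into an OSTree commit
#    (ostree-rs-ext tooling; verb and flags here are assumptions).
ostree container unencapsulate --repo=/ostree/repo \
    "ostree-unverified-registry:${NODE_IMAGE}"

# 2. Check out that commit and shadow the booted /usr with its /usr.
NODE_COMMIT=...   # checksum reported by step 1 (placeholder)
ostree --repo=/ostree/repo checkout "${NODE_COMMIT}" /run/node-image
mount --bind /run/node-image/usr /usr

# 3. Merge new /etc content (rough 3-way merge), then resume normal startup.
systemctl isolate multi-user.target
```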

WIP for this in openshift/installer#8742.

@rphillips
Contributor

@jlebon That sounds like it might work. Where will the Kubelet be coming from? An OpenShift built image?

@zaneb
Member

zaneb commented Jul 17, 2024

Won't doing `systemctl isolate node-image-pivot.target` have the effect of stopping the assisted/agent services that we need to avoid stopping?

@jlebon
Member Author

jlebon commented Jul 17, 2024

@jlebon That sounds like it might work. Where will the Kubelet be coming from? An OpenShift built image?

From the node image (i.e. for OCP, the rhel-coreos image in the release payload).

Won't doing `systemctl isolate node-image-pivot.target` have the effect of stopping the assisted/agent services that we need to avoid stopping?

No. The system boots into node-image-pivot.target first. Any other services hooked into multi-user.target aren't started until after we've finished the live pivot.

@cgwalters
Member

The system boots into node-image-pivot.target first.

Via a generator overriding default.target?

@jlebon
Member Author

jlebon commented Jul 17, 2024

The system boots into node-image-pivot.target first.

Via a generator overriding default.target?

Yup, exactly. You can see that in openshift/installer#8742.
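For readers unfamiliar with the pattern, a minimal sketch of such a generator follows; the script, names, and guard condition are hypothetical and not copied from the installer PR:

```bash
# Hypothetical systemd generator: divert default.target to node-image-pivot.target.
cat > /usr/lib/systemd/system-generators/node-image-pivot-generator <<'EOF'
#!/bin/bash
# Generators are invoked with three output dirs: normal, early, late.
# The "early" dir takes precedence over unit files in /etc and /usr.
early_dir="$2"
# Only divert the default target if the pivot hasn't completed yet (guard is illustrative).
if [ ! -e /run/node-image-pivot.done ]; then
    ln -sf /usr/lib/systemd/system/node-image-pivot.target "${early_dir}/default.target"
fi
EOF
chmod +x /usr/lib/systemd/system-generators/node-image-pivot-generator
```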

@zaneb
Member

zaneb commented Jul 17, 2024

The bootstrap ignition that's modified in openshift/installer#8742 doesn't exist at boot time, so we'd need to also add this into the agent ISO and assisted Discovery ISO.
That's not impossible, but it'll be a challenge to maintain.
I guess if we are not able to pull the release image then nothing's going to work anyway, but doing it at boot time also limits our options for reporting any failure.

@jlebon
Member Author

jlebon commented Jul 18, 2024

The bootstrap ignition that's modified in openshift/installer#8742 doesn't exist at boot time, so we'd need to also add this into the agent ISO and assisted Discovery ISO. That's not impossible, but it'll be a challenge to maintain. I guess if we are not able to pull the release image then nothing's going to work anyway, but doing it at boot time also limits our options for reporting any failure.

Ahh right, tricky. Hmm, I think once the bootstrap bits are written out (by the --once-from MCD call), we could still just isolate back to node-image-pivot.target before continuing with bootstrapping. It should be possible to keep both the service and agent alive throughout. This will need more experimenting.

@jlebon
Member Author

jlebon commented Jul 18, 2024

so we'd need to also add this into the agent ISO and assisted Discovery ISO

A clarification on this: if we can split out those units somehow so they're accessible in the release payload and not baked into the installer, would whatever generates the ISOs be able to pull from the payload so we don't have to duplicate them across codebases? That'd make the more vanilla (non-AI/ABI) install flows a bit more awkward, though, by adding a level of indirection.

@zaneb
Member

zaneb commented Jul 19, 2024

if we can split out those units somehow so they're accessible in the release payload and not baked into the installer, would whatever generates the ISOs be able to pull from the payload so we don't have to duplicate them across codebases?

We try to avoid having a hard dependency on the release payload in ABI, because the ISO may be generated in a different environment from where the cluster is going to run.

I think it could work to group all of the agent services under e.g. agent.target and in that target set WantedBy: multi-user.target node-image-pivot.target. That way I think we could isolate without stopping those services. Assuming it's safe to do the pivot while stuff is running.
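Sketching that idea; the unit and service names are hypothetical:

```bash
# Hypothetical agent.target that survives isolating to node-image-pivot.target.
cat > /etc/systemd/system/agent.target <<'EOF'
[Unit]
Description=Assisted/agent installer services

[Install]
# Wanted by both targets, so isolating to node-image-pivot.target keeps this
# target (and the services that are WantedBy=agent.target) running.
WantedBy=multi-user.target node-image-pivot.target
EOF

# Each agent service would then declare WantedBy=agent.target in its [Install] section.
systemctl enable agent.target
```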

@jlebon
Member Author

jlebon commented Jul 19, 2024

I think it could work to group all of the agent services under e.g. agent.target and in that target set WantedBy: multi-user.target node-image-pivot.target. That way I think we could isolate without stopping those services. Assuming it's safe to do the pivot while stuff is running.

Yeah, that's along the lines of what I was thinking as well. I like the agent.target idea.

I might reach out with questions for getting a test environment up to iterate on this.

benefits on the OpenShift side. Currently, RHCOS is built as a single layer that
includes both RHEL and OCP content. This enhancement proposes splitting it into
three layers. Going from bottom to top:
1. the bootc layer; i.e. the base rhel-bootc image shared with image mode for RHEL (RHEL-versioned)

One thing that I have been thinking about is that RHCOS often ships RHEL packages early (a few weeks before they are available in RHEL), in particular the kernel. These RHEL package versions will not be available in the rhel-bootc image and would therefore need to be replaced in the CoreOS layer. This happens often enough that I wonder whether actually using the rhel-bootc image as the base image for OCP is going to be useful.

@cgwalters, @sdodson and @jlebon what's your point of view for this?

Member

Overriding things shipped in the base image to me is totally fine and expected. At a certain size and scale, squashing and optimization can be desirable - and as you point out, especially for the kernel where it necessarily implies regenerating the initramfs.

But I would clarify something here, I think we should go from:

"I wonder if actually using the rhel-bootc image as base image for OCP is going to be useful."

to

"I wonder if we may need to go from simple derivation to using supported tooling to build or optimize derived images"

https://gitlab.com/fedora/bootc/tracker/-/issues/32

I think it's of high importance to the wider "us" that even if we don't use "simple derivation" but some sort of more sophisticated build, it doesn't go all the way to a "fork of the base image", or use tools that aren't also supported for customers.

Member Author

This is why the sentence is left fuzzy; i.e. "shared with" instead of "deriving from". How exactly we'll share is still to be discussed. A key ingredient to success for me in this sharing is centralizing CI, and for that to be effective, the NEVRAs in the RHCOS image should ideally be a superset of the NEVRAs in the rhel-bootc image that passed CI. And for that, we don't necessarily need to literally derive from it.

That said, obviously the kernel is a pretty big one to deviate on, and we can manage that, but ideally we'd have a rhel-bootc stream that also consumes those early kernels, since ISTM it shouldn't interest only OCP, if only for testing purposes (though that widening of scope would need to be cleared by the kernel maintainers).

Member

@sdodson sdodson Aug 21, 2024

This could also just be a layer internal to the build process. I think the main goal is that we start from an image that the RHCOS team doesn't produce; the RHEL team does.

Prematurely hit send...

The RHEL weekly kernel process is now actually shipping the kernel every two weeks to RHEL. OCP picks it up early in our pipeline because it's about 5 days from RHCOS integration to OCP errata shipping, but if this were just moved to the bootc image build process, and they shipped every two weeks alongside the kernel, AND we had access to that build 5 days ahead, we could just use their output too.
