
Split RHCOS into layers #1637

Draft
jlebon wants to merge 1 commit into master from pr/split-rhcos-into-layers

Conversation

jlebon
Member

@jlebon jlebon commented Jun 7, 2024

This enhancement describes improvements to the way RHEL CoreOS (RHCOS) is built so that it will better align with image mode for RHEL, all while also providing benefits on the OpenShift side. Currently, RHCOS is built as a single layer that includes both RHEL and OCP content. This enhancement proposes splitting it into three layers. Going from bottom to top:

  1. the (RHEL-versioned) bootc layer (i.e. the base rhel-bootc image shared with image mode for RHEL)
  2. the (RHEL-versioned) CoreOS layer (i.e. coreos-installer, ignition, afterburn, scripts, etc...)
  3. the (OCP-versioned) node layer (i.e. kubelet, cri-o, etc...)

The terms "bootc layer", "CoreOS layer", and "node layer" will be used throughout this enhancement to refer to these.

The details of this enhancement focus on doing the first split: creating the node layer as distinct from the CoreOS layer (which will not yet be rebased on top of a bootc layer). The two changes involved that most affect OCP are:

  1. bootimages will no longer contain OCP components (e.g. kubelet, cri-o, etc...)
  2. the rhel-coreos payload image will be built in Prow/Konflux (as any other)

Tracked at: https://issues.redhat.com/browse/OCPSTRAT-1190

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress label Jun 7, 2024

openshift-ci bot commented Jun 7, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot commented Jun 7, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please assign mandre for approval. For more information see the Kubernetes Code Review Process.

Comment on lines +7 to +17
reviewers:
- "@patrickdillon, for installer impact"
- "@rphillips, for node impact"
- "@joepvd, for ART impact"
- "@sinnykumari, for MCO impact"
- "@LorbusChris, for OKD impact"
- "@zaneb, for agent installer impact"
- "@sdodson, for overall architecture"
- "@cgwalters, for overall architecture"
approvers:
- "@mrunalp"
Member Author

Apparently, the bot won't automatically tag the folks listed here, so manually doing it: @patrickdillon @rphillips @joepvd @sinnykumari @LorbusChris @zaneb @sdodson @cgwalters.

Member

@cgwalters cgwalters left a comment

Awesome work on this!

Comment on lines 136 to 138
To do this, we will start building two new streams in the RHCOS pipeline containing only pure
RHEL/CentOS Stream content (let's call these the "pure RHEL stream" and "pure CentOS stream").
Those streams will also be building the usual bootimages and uploading them to cloud.
Member

Per above, IMO we need a clear, concise and understandable term for this...I don't think many people understand "stream" in this context (even though it is applicable!). I was straw-manning rhel-coreos-base above. And to extend on that, I think there's a bit of an open question here whether in the bigger picture the pipeline really needs to churn bootimages as often given that we are producing supported tooling for materializing on-demand disk images from container images.

So in the more medium term I'd advocate trying to create a clearer split between containers and disk images versus rolling them into a "stream" concept.

Member Author

Yeah, wanted to avoid mentioning streams here, since it's so overloaded, but when it gets to implementation details, it'd be weird to not discuss it. I'll rework it to clarify that we're talking about "CoreOS pipeline streams" and not something else.

Re. lowering cadence of bootimages, yes that's definitely an overall goal of this. Part of it will happen naturally simply by the fact that we're decoupling OCP churn from the bootimages. The other part will be through conscious efforts (need to pick up coreos/fedora-coreos-pipeline#810 again). But I'd say this is more CoreOS pipeline implementation details.

Comment on lines +140 to +146
Once we have these bootimages, we can better start adapting components that will need it
to account for the lack of OpenShift components in the bootimages. Likely suspects here are
any components involved in the bringup of the cluster (installer, Assisted Installer, MCO, etc...).
Member

We have three choices broadly:

  • Per above, pull and reboot into the OCP container image, which is already done in the installer today for OCP. Note a subtle but very important point today...OCP sometimes ships kernels outside of RHEL cycles, and while that's not usually relevant for bootstrap, I wouldn't say it never would be
  • Special case bootstrap to just e.g. bootc usroverlay + dnf install
  • Run the installation process in a container (IMO...this is actually best)
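For the second option, a minimal sketch of what that special-cased bootstrap step could look like; the package names and the assumption that an RPM repo with the OCP packages is reachable are illustrative, not something the enhancement specifies:

```bash
# Hypothetical sketch of "bootc usroverlay + dnf install" during bootstrap.
# Assumes an RPM repo carrying the OCP packages is already configured and reachable.
bootc usroverlay                          # transient, in-memory writable overlay on /usr
dnf -y install openshift-hyperkube cri-o  # package names are placeholders
systemctl daemon-reload                   # pick up the newly installed unit files
```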

Member

That reads to me like a next step after this enhancement? Ideally we don't change the cluster installation flow (yet) and complete this transition first before looking at other installation options?

Member

That reads to me like a next step after this enhancement? Ideally we don't change the cluster installation flow (yet) and complete this transition first before looking at other installation options?

If the bootimages are changed to not contain OCP content as the enhancement says, then things will definitely need to change in the install process.

Or, that bit could be moved to a later phase, which also seems viable to me (but I think also lowers the value of things, because being able to reuse the same bootimages for multiple OCP versions would be quite a big improvement).

Member Author

Note a subtle but very important point today...OCP sometimes ships kernels outside of RHEL cycles, and while that's not usually relevant for bootstrap, I wouldn't say it never would be

Can you expand on this? E.g. FIPS/crypto-related things? I could imagine detecting those cases where we should reboot and only do so then. Though in the common case it'd be very desirable to try to avoid rebooting.

Run the installation process in a container (IMO...this is actually best)

Doesn't that imply trying to run the kubelet containerized? Not sure how supported that is nowadays.

Special case bootstrap to just e.g. bootc usroverlay + dnf install

Yeah, definitely a potential path forward that's not too complex. We could ship the kubelet RPM in the extensions container image, which is already in the payload.

Hmm, or another approach is to just copy the kubelet out of rhel-coreos, since it's almost static anyway. There's still glibc, but I don't think mismatch issues would be a concern here (this isn't the "new worker node" flow with arbitrarily old bootimages, here we're doing an install so we should be guaranteed to be using matching images + it's a known combination that can be extensively tested in CI).
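A sketch of what that copy could look like; the paths and target location are assumptions, and `image_for` is the bootstrap-script helper referenced later in this thread:

```bash
# Hypothetical: extract the kubelet binary from the node image without running it.
NODE_IMAGE=$(image_for rhel-coreos)     # image_for: helper from the bootstrap scripts
ctr=$(podman create "${NODE_IMAGE}")    # create (don't start) a container from the image
podman cp "${ctr}:/usr/bin/kubelet" /usr/local/bin/kubelet
podman rm "${ctr}"
```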

Member

@cgwalters cgwalters Jun 11, 2024

Doesn't that imply trying to run the kubelet containerized? Not sure how supported that is nowadays.

Yes, but the important thing here is that we don't need any non-core kubelet functionality. In particular, CSI (storage) drivers running containerized I think were the #1 problem. But we don't need any of that, it just needs to run pods.

There is also the angle that nowadays we could just pass pod definitions directly to podman...which I don't believe existed or was really fleshed out when OCP 4.0 was being designed. That may even be simplest.

Contributor

Hmm, or another approach is to just copy the kubelet out of rhel-coreos, since it's almost static anyway. There's still glibc, but I don't think mismatch issues would be a concern here

This might lead to bugs such as https://issues.redhat.com/browse/MGMT-16705 (in this case the kubelet is incompatible because it's doing late-binding, hence the node was booted with a default RHCOS and not the RHCOS of the release, yet the kubelet.conf is the one for the release).

Member

Likely suspects here are any components involved in the bringup of the cluster

I think it would be more helpful to focus on stages rather than components. There are 3 that will need to work. In reverse order:

  1. Booting a host as a node after it is installed. Presumably this is a fairly straightforward matter of having it boot the OCP layer from the release payload.
  2. Running the bootstrap node. This normally depends on OCP components. In the Assisted/ABI case, it necessarily runs on the base image, and we cannot stop services that were already running on the host prior to bootstrap starting.
  3. The Assisted discovery ISO/ABI agent ISO. I believe this is a no-op, as neither of them should depend on OCP components today.

Member Author

Hmm, or another approach is to just copy the kubelet out of rhel-coreos, since it's almost static anyway. There's still glibc, but I don't think mismatch issues would be a concern here

This might lead to bugs such as issues.redhat.com/browse/MGMT-16705 (in this case the kubelet is incompatible because it's doing late-binding, hence the node was booted with a default RHCOS and not the RHCOS of the release, yet the kubelet.conf is the one for the release).

Is this something we should support though? Having the bootstrap process also have to worry about different bootimage versions instead of only the blessed, heavily CI-tested, version in rhcos.json doesn't sound great.

Member Author

Running the bootstrap node. This normally depends on OCP components. In the Assisted/ABI case, it necessarily runs on the base image, and we cannot stop services that were already running on the host prior to bootstrap starting.

What does ABI refer to here?

Can you expand/link to more information re. the "we cannot stop services that were already running on the host prior to bootstrap starting"? Don't we own the bootstrap node?

Member

ABI = agent-based installer

In both assisted and ABI installs, the bootstrap-in-place runs inside the live ISO that is already running the agent. I believe assisted-service doesn't take well to the agent being restarted, but ABI has even tighter requirements: assisted-service itself is running on the bootstrap node, so interrupting it will break the entire installation process.


#### Hypershift / Hosted Control Planes

TODO
Member

One thing I've mentioned in other places here is that if we can get to the point of shrinking our use case for Ignition to be very small by using container images instead, then we don't require the MCS anymore, which would definitely simplify Hypershift.

Member Author

I'll add a note for that, though probably in the "Follow-ups" section instead.


#### Standalone Clusters

TODO
Member

This is what the enhancement is currently about right? I think we can just state:

This enhancement covers standalone clusters.


#### Single-node Deployments or MicroShift

TODO
Member

I think we can clearly say that the overarching changes here are going to much more strongly align MicroShift with OCP.

I also think (hope, pretty confidently) we aren't going to break anything with SNO here and will trend towards improving that too.

Member Author

Yeah, I wonder if it'd make sense to share with MicroShift the node layer definition file (i.e. the MicroShift layers would be: bootc layer, node layer, MicroShift layer). Would be cool to have it actually derive from the OCP node image, but it definitely includes a lot of stuff not needed there so that doesn't make sense.

For SNO, the way it does bootstrap-in-place using the live ISO I think should guide how we implement bootstrapping so that it works just as well there as it does in the standalone flow (which also then argues for not rebooting).


### Risks and Mitigations

TODO
Member

One thing I've learned here when we first did the layering switch is that it's a bit like changing the jet engine mid flight, and this is going to be bigger than that. OTOH, we know more, and we have a plan.

Member Author

Yeah, I have some items to add to this section!

additional testing, this layered image will replace the current `rhel-coreos` image in the
production release payload.

### Workflow Description
Member

We should probably note somewhere that users will have to look at RHEL versioned boot images in the future and not OCP versioned ones anymore. This will likely create some confusion at first.

Member Author

Not sure about that. I think it would still make sense on https://mirror.openshift.com/ to have the bootimages accessible through OCP-versioned directories (but e.g. it'd be silly not to symlink them on the server side). I think that would reduce the likelihood that users pick the wrong bootimages.

Member Author

@travier Any thoughts on this? Should we resolve this?

Comment on lines +110 to +113
- Introduce cluster administrator-visible changes. This change should be transparent to
administrators. CoreOS layering instructions should keep working as is, but documentation
should ideally be reworked to leverage rhel-bootc docs more.
Member

We'll have to reword that to mention the boot image versioning change.

@rphillips
Contributor

openshift/kubernetes has a specific workflow where jobs will build a new kubelet to use during the job run. This helps with rebase work and validating new kubernetes versions coming into OpenShift. We should preserve this workflow when migrating to RHCOS layering.

/cc @soltysh

@openshift-ci openshift-ci bot requested a review from soltysh June 12, 2024 16:00
@jlebon
Member Author

jlebon commented Jun 12, 2024

openshift/kubernetes has a specific workflow where jobs will build a new kubelet to use during the job run. This helps with rebase work and validating new kubernetes versions coming into OpenShift. We should preserve this workflow when migrating to RHCOS layering.

/cc @soltysh

I don't expect any issues there. That workflow should keep working as is.

@jlebon jlebon force-pushed the pr/split-rhcos-into-layers branch from f79684b to a6a7438 on June 20, 2024 21:15

Another important follow-up would be to hook up better CI testing to openshift/os. There is
currently no CI test on that repo which actually launches a cluster. The reason is that RHCOS
is just built differently. But now that the node image is simply yet another layered image build,
it fits perfectly in Prow's opinionated model of building the image and shoving it in the test
Member

Worth noting that the bootstrap node can't be tested this way. (The final node can, which is great.)

into the node image before continuing with bootstrapping, but that incurs an additional
reboot which is against the stated goals. Some other possibilities so far:
- Run the kubelet as a container
- Use podman to run the static pods
Member

FWIW my experience with podman kube play a couple of years ago was a bit frustrating. There wasn't a way to generate the correct systemd service file automatically (podman generate systemd didn't work for kube-played pods), and having all containers in the pod running under a single systemd service was not great - systemd can only watch one conmon, so you lost the ability to granularly control what happens when a container within the pod dies. Logs for the different containers aren't separated in the journal.
Kube Play support in quadlet is a big step forward (not sure if that landed in RHEL 9 yet?), but if there is any internal complexity in the pods we are starting then we might still come to regret relying on it.
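For reference, a minimal sketch of the quadlet Kube unit being discussed (podman >= 4.4); the unit name and the pod manifest path are made up for illustration:

```bash
# Hypothetical quadlet .kube unit; quadlet generates a matching .service from it.
cat > /etc/containers/systemd/bootstrap-pod.kube <<'EOF'
[Unit]
Description=Run a bootstrap static pod via podman kube play

[Kube]
Yaml=/etc/kubernetes/manifests/example-pod.yaml

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start bootstrap-pod.service   # generated by quadlet from bootstrap-pod.kube
```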

Member

quadlet does work on RHEL 9.

Member

Yeah, looks like Kube unit in quadlet was there when quadlet first rolled out in podman 4.4.
I misremembered, thinking it was only added later. It was Pod unit that was added later, in podman 5.0.

Contributor

@rphillips rphillips Jun 27, 2024

There are specific shutdown requirements for static and regular pods. Podman running these pods would not work out for the shutdown logic. Additionally, Kubelet also has to report status and mirror pods to the API. I do not believe having podman run static pods is a viable solution.

Member Author

@rphillips And to be clear, we rely on these parts of the kubelet surface even just in the bootstrapping phase? And it would be too onerous to try to adapt to the gaps that switching to podman kube play would create?

Contributor

@rphillips rphillips Jun 27, 2024

Correct. Static pods run in a normal OpenShift cluster, namely: kube-apiserver, controller-manager, and etcd. Networking may have 1-2 as well. All these static pods are reported to the API server, and operators interact with configuring and managing them via the K8S API.

podman kube play might be able to run them, and could be written to follow the correct termination sequence; however, the Kubelet reports the static pods to the API as "mirror" pods which includes their pod lifecycle state. If we were to go down this path, then there would be a lot of re-writing of core K8S functionality.

- How should we adapt the bootstrapping process of the installer to handle the lack
of oc and kubelet in the bootimage? The easy approach would be to download and pivot
into the node image before continuing with bootstrapping, but that incurs an additional
reboot which is against the stated goals. Some other possibilities so far:
Member

This is a non-starter for ABI and Assisted.

reboot which is against the stated goals. Some other possibilities so far:
- Run the kubelet as a container
- Use podman to run the static pods
- Install the kubelet RPM
Member

From where?

Contributor

@rphillips rphillips Jun 26, 2024

This is probably the best approach. OpenShift would have to publish an RPM repo. Another issue that would arise is offline installs: where do they get oc and the kubelet?

Member

I think the rpm repo would have to be running in a container from the release payload.
Even finding that without oc is challenging though, since it involves digging through the metadata to find the right image.

Member Author

We actually already ship RPMs today in the extensions container (part of the payload). I'm not sure yet whether it makes sense to ship the kubelet there (the package set is 95% RHEL, so ideally it would also be RHEL-versioned), but it's not hard to add another container image similar to it.

I think this is probably the lowest friction approach. It's just awkward to have the kubelet RPM in a container image and in the node image. It's not very far at that point from just copying the kubelet out of the image.

Even finding that without oc is challenging though, since it involves digging through the metadata to find the right image.

That part should be fine. `image_for` uses podman, and we can run `oc` using podman as well. Basically: `podman run ... $(image_for rhel-coreos) oc ...`.
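Roughly like the following; this is a hedged sketch, and the flags and pull-secret path are illustrative rather than the actual bootstrap script contents:

```bash
# Hypothetical: run oc from the node image itself, so the bootimage doesn't need it.
NODE_IMAGE=$(image_for rhel-coreos)
podman run --rm --authfile /root/.docker/config.json "${NODE_IMAGE}" oc version --client
```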

Contributor

@rphillips rphillips Jun 27, 2024

For reference, the hyperkube RPM spec was recently refactored to split controller-manager, api-server, and kubelet into separate RPMs. We used to ship the api-server and controller-manager in RHCOS, but never used them there; the container images for those components run them from the image. Since the recent change, the kubelet RPM is injected, but it wasn't always the case.

I worry about copying the kubelet out of the image due to openssl, FIPS, glibc, and perhaps other dependencies. I might vote that 'awkward' is OK if it's just a duplication of a binary; the alternative of copying the binary out of an image seems a bit more brittle.

- Run the kubelet as a container
- Use podman to run the static pods
- Install the kubelet RPM
- `bootc apply-live`
Member

Would this result in running systemd services being stopped? If so this is a non-starter for ABI and Assisted.

Member Author

It wouldn't in this case, no. (Also I didn't mention it there, but this would be rpm-ostree apply-live for now; containers/bootc#76).

But actually, I don't think this will work for the ISO case currently. Or at least, it'd need some work I think on the bootc/rpm-ostree side first.

- Use podman to run the static pods
- Install the kubelet RPM
- `bootc apply-live`
- `systemctl soft-reboot`
Member

This is a non-starter for ABI and Assisted.

Member

Do we have more details for ABI/Assisted and an option which will work for those scenarios?

Member

Do we have more details for ABI/Assisted

How long have you got? 🙃

and an option which will work for those scenarios?

There's a discussion in the main thread that seems promising: #1637 (comment)

@zaneb
Member

zaneb commented Jun 24, 2024

/cc @cybertron @andfasano

@soltysh
Member

soltysh commented Jun 26, 2024

I don't expect any issues there. That workflow should keep working as is.

I believe this was the pre-req work done in openshift/kubernetes#1805, which ensured we won't have problems in o/k.

of oc and kubelet in the bootimage? The easy approach would be to download and pivot
into the node image before continuing with bootstrapping, but that incurs an additional
reboot which is against the stated goals. Some other possibilities so far:
- Run the kubelet as a container
Contributor

Upstream no longer supports running the kubelet in a container. CoreOS and other Kubernetes distributions ran the kubelet in a container many years ago; the lesson learned was not to do it.

Member

Yes, we can rule this one out.

on the RHEL side is that we would have only one stream of RHCOS per RHEL release,
rather than one per OpenShift release. This greatly reduces the workload on the
CoreOS team. Another benefit is easier integration in the CI processes of
rhel-bootc and centos-bootc, as well as better shared documentation.


A user-facing enhancement is that hardware certified by partners for RHEL will also be RHCOS-certified, which is currently not the case.

Member

I don't think that's accurate. The installation methods and composition of RHCOS today may yield certain differences in use, but it's still RHEL and our position is still that it has the same hardware certification that RHEL carries.

- "@LorbusChris, for OKD impact"
- "@zaneb, for agent installer impact"
- "@sdodson, for overall architecture"
- "@cgwalters, for overall architecture"


@fabiendupont @bthurber @ybettan please review in the context of KMM and DTK


Since KMM and DTK are day 2 in the case of OCP, I don't see any real impact or change here. We aren't managing anything as day 0.


### Non-Goals

- Change the cluster installation flow. It should remain the same whether IPI/UPI/AI/etc...
Contributor

Suggested change
- Change the cluster installation flow. It should remain the same whether IPI/UPI/AI/etc...
- Change the cluster installation flow. It should remain the same whether IPI/UPI/AI/ABI/etc...

jlebon added a commit to jlebon/installer that referenced this pull request Jul 16, 2024
As per openshift/enhancements#1637, we're trying
to get rid of all OpenShift-versioned components from the bootimages.

This means that there will no longer be `oc`, `kubelet`, or `crio`
binaries for example, which bootstrapping obviously relies on.

Instead, now we change things up so that early on when booting the
bootstrap node, we pull down the node image, unencapsulate it (this just
means convert it back to an OSTree commit), then mount over its `/usr`,
and import new `/etc` content.

This is done by isolating to a different systemd target to only bring
up the minimum number of services to do the pivot and then carry on
with bootstrapping.

This does not incur additional reboots and should be compatible
with AI/ABI/SNO. But it is of course, a huge conceptual shift in how
bootstrapping works. With this, we would now always be sure that we're
using the same binaries as the target version as part of bootstrapping,
which should alleviate some issues such as AI late-binding (see e.g.
https://issues.redhat.com/browse/MGMT-16705).

The big exception of course being the kernel. Relatedly, currently
`/usr/lib/modules` is also shadowed by the mount, but we could re-mount
it if needed.

To be conservative, the new logic only kicks in when using bootimages
which do not have `oc`. This will allow us to ratchet this in more
easily.

Down the line, we should be able to replace some of this with
`bootc apply-live` once that's available (and also works in a live
environment). (See containers/bootc#76.)

For full context, see the linked enhancement and discussions there.
@jlebon
Member Author

jlebon commented Jul 16, 2024

OK, so let's resume the bootstrapping issue. Restating some of the things from above and from researching further:

  • We can't run the kubelet in a container because it's no longer supported.
  • The delta between kubelet and podman play is too large to make the latter a feasible replacement.
  • systemctl soft-reboot is not in RHEL9.
  • In the AI/ABI/SNO cases, bootstrapping happens in the live environment where e.g. rebooting is not possible.
  • I considered cobbling something around kexec, but in the limit, there are potential issues with kexec and hardware reliability, as well as how it meshes with Secure Boot.

What I'm playing with now is basically to have a special node-image-pivot.target that the node isolates to first. There, we pull the node image, unencapsulate it, check out its contents, and then mount over /usr and do a rough 3-way /etc merge. We then isolate back to multi-user.target to continue with the bootstrapping process.

This is in effect like a more aggressive bootc/rpm-ostree apply-live, though that doesn't currently work in live environments. (Though even in the non-live case, there are some issues there that would need to be resolved.) It's close to what OKD currently does when using FCOS live media today, though using the ostree stack and isolating targets should make this more robust.
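As a very rough sketch of that flow — the commands, refs, and paths below are assumptions for illustration (in particular the ostree-container invocation), not what the WIP actually does:

```bash
# Hypothetical outline of the live pivot; the real logic lives in the WIP installer PR.
NODE_IMAGE=$(image_for rhel-coreos)

# 1. Pull the node image and unencapsulate it back into an OSTree commit
#    (ostree-rs-ext tooling; verb and flags here are assumptions).
ostree container unencapsulate --repo=/ostree/repo \
    "ostree-unverified-registry:${NODE_IMAGE}"

# 2. Check out that commit and shadow the booted /usr with its /usr.
NODE_COMMIT=...   # checksum reported by step 1 (placeholder)
ostree --repo=/ostree/repo checkout "${NODE_COMMIT}" /run/node-image
mount --bind /run/node-image/usr /usr

# 3. Merge new /etc content (rough 3-way merge), then resume normal startup.
systemctl isolate multi-user.target
```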

WIP for this in openshift/installer#8742.

@rphillips
Contributor

@jlebon That sounds like it might work. Where will the Kubelet be coming from? An OpenShift built image?

@zaneb
Member

zaneb commented Jul 17, 2024

Won't doing `systemctl isolate node-image-pivot.target` have the effect of stopping the assisted/agent services that we need to avoid stopping?

@jlebon
Member Author

jlebon commented Jul 17, 2024

@jlebon That sounds like it might work. Where will the Kubelet be coming from? An OpenShift built image?

From the node image (i.e. for OCP, the rhel-coreos image in the release payload).

Won't doing `systemctl isolate node-image-pivot.target` have the effect of stopping the assisted/agent services that we need to avoid stopping?

No. The system boots into node-image-pivot.target first. Any other services hooked into multi-user.target aren't started until after we've finished the live pivot.

@cgwalters
Member

The system boots into node-image-pivot.target first.

Via a generator overriding default.target?

@jlebon
Member Author

jlebon commented Jul 17, 2024

The system boots into node-image-pivot.target first.

Via a generator overriding default.target?

Yup, exactly. You can see that in openshift/installer#8742.
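For readers unfamiliar with the pattern, a minimal sketch of such a generator follows; the script, names, and guard condition are hypothetical and not copied from the installer PR:

```bash
# Hypothetical systemd generator: divert default.target to node-image-pivot.target.
cat > /usr/lib/systemd/system-generators/node-image-pivot-generator <<'EOF'
#!/bin/bash
# Generators are invoked with three output dirs: normal, early, late.
# The "early" dir takes precedence over unit files in /etc and /usr.
early_dir="$2"
# Only divert the default target if the pivot hasn't completed yet (guard is illustrative).
if [ ! -e /run/node-image-pivot.done ]; then
    ln -sf /usr/lib/systemd/system/node-image-pivot.target "${early_dir}/default.target"
fi
EOF
chmod +x /usr/lib/systemd/system-generators/node-image-pivot-generator
```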

@zaneb
Member

zaneb commented Jul 17, 2024

The bootstrap ignition that's modified in openshift/installer#8742 doesn't exist at boot time, so we'd need to also add this into the agent ISO and assisted Discovery ISO.
That's not impossible, but it'll be a challenge to maintain.
I guess if we are not able to pull the release image then nothing's going to work anyway, but doing it at boot time also limits our options for reporting any failure.

@jlebon
Member Author

jlebon commented Jul 18, 2024

The bootstrap ignition that's modified in openshift/installer#8742 doesn't exist at boot time, so we'd need to also add this into the agent ISO and assisted Discovery ISO. That's not impossible, but it'll be a challenge to maintain. I guess if we are not able to pull the release image then nothing's going to work anyway, but doing it at boot time also limits our options for reporting any failure.

Ahh right, tricky. Hmm, I think once the bootstrap bits are written out (by the --once-from MCD call), we could still just isolate back to node-image-pivot.target before continuing with bootstrapping. It should be possible to keep both the service and agent alive throughout. This will need more experimenting.

@jlebon
Member Author

jlebon commented Jul 18, 2024

so we'd need to also add this into the agent ISO and assisted Discovery ISO

A clarification on this: if we can split out those units somehow so they're accessible in the release payload and not baked into the installer, would whatever generates the ISOs be able to pull from the payload so we don't have to duplicate them across codebases? That'd make the more vanilla (non-AI/ABI) install flows a bit more awkward, though, by adding a level of indirection.

@zaneb
Member

zaneb commented Jul 19, 2024

if we can split out those units somehow so they're accessible in the release payload and not baked into the installer, would whatever generates the ISOs be able to pull from the payload so we don't have to duplicate them across codebases?

We try to avoid having a hard dependency on the release payload in ABI, because the ISO may be generated in a different environment from where the cluster is going to run.

I think it could work to group all of the agent services under e.g. agent.target and in that target set WantedBy: multi-user.target node-image-pivot.target. That way I think we could isolate without stopping those services. Assuming it's safe to do the pivot while stuff is running.
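Sketching that idea; the unit and service names are hypothetical:

```bash
# Hypothetical agent.target that survives isolating to node-image-pivot.target.
cat > /etc/systemd/system/agent.target <<'EOF'
[Unit]
Description=Assisted/agent installer services

[Install]
# Wanted by both targets, so isolating to node-image-pivot.target keeps this
# target (and the services that are WantedBy=agent.target) running.
WantedBy=multi-user.target node-image-pivot.target
EOF

# Each agent service would then declare WantedBy=agent.target in its [Install] section.
systemctl enable agent.target
```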

@jlebon
Member Author

jlebon commented Jul 19, 2024

I think it could work to group all of the agent services under e.g. agent.target and in that target set WantedBy: multi-user.target node-image-pivot.target. That way I think we could isolate without stopping those services. Assuming it's safe to do the pivot while stuff is running.

Yeah, that's along the lines of what I was thinking as well. I like the agent.target idea.

I might reach out with questions for getting a test environment up to iterate on this.

benefits on the OpenShift side. Currently, RHCOS is built as a single layer that
includes both RHEL and OCP content. This enhancement proposes splitting it into
three layers. Going from bottom to top:
1. the bootc layer; i.e. the base rhel-bootc image shared with image mode for RHEL (RHEL-versioned)

One thing that I have been thinking about is that RHCOS often ships RHEL packages early (a few weeks before they are available in RHEL), in particular the kernel. These RHEL package versions will not be available in the rhel-bootc image and would therefore need to be replaced in the CoreOS layer. This happens often enough that I wonder whether actually using the rhel-bootc image as the base image for OCP is going to be useful.

@cgwalters, @sdodson and @jlebon what's your point of view for this?

Member

Overriding things shipped in the base image to me is totally fine and expected. At a certain size and scale, squashing and optimization can be desirable - and as you point out, especially for the kernel where it necessarily implies regenerating the initramfs.

But I would clarify something here, I think we should go from:

"I wonder if actually using the rhel-bootc image as base image for OCP is going to be useful."

to

"I wonder if we may need to go from simple derivation to using supported tooling to build or optimize derived images"

https://gitlab.com/fedora/bootc/tracker/-/issues/32

I think it's of high importance to the wider "us" that even if we don't use "simple derivation" but some sort of more sophisticated build, it doesn't go all the way to a "fork of the base image", or use tools that aren't also supported for customers.

Member Author

This is why the sentence is left fuzzy; i.e. "shared with" instead of "deriving from". How exactly we'll share is still to be discussed. A key ingredient to success for me in this sharing is centralizing CI, and for that to be effective, the NEVRAs in the RHCOS image should ideally be a superset of the NEVRAs in the rhel-bootc image that passed CI. And for that, we don't necessarily need to literally derive from it.

That said, obviously the kernel is a pretty big one to deviate on, and we can manage that, but ideally we'd have a rhel-bootc stream that also consumes those early kernels, since ISTM it shouldn't interest only OCP, if only for testing purposes (though that widening of scope would need to be cleared by the kernel maintainers).

Member

@sdodson sdodson Aug 21, 2024

This could also just be a layer internal to the build process. I think the main goal is that we start from an image that the RHCOS team doesn't produce; the RHEL team does.

Prematurely hit send...

The RHEL weekly kernel process is now actually shipping the kernel every two weeks to RHEL. OCP picks it up early in our pipeline because it's about 5 days from RHCOS integration to OCP errata shipping, but if this were just moved to the bootc image build process, and they shipped every two weeks alongside the kernel, AND we had access to that build 5 days ahead, we could just use their output too.
