Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCO-1026: Enhancement for bootc integration into MCO #1631

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

inesqyx
Copy link

@inesqyx inesqyx commented May 28, 2024

No description provided.

@openshift-ci-robot
Copy link

openshift-ci-robot commented May 28, 2024

@inesqyx: This pull request references MCO-1026 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 28, 2024
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 28, 2024
Copy link
Contributor

openshift-ci bot commented May 28, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@inesqyx inesqyx force-pushed the MCO-1026-bootc-enhancement branch from 6260522 to ad46715 Compare May 29, 2024 14:29
@inesqyx
Copy link
Author

inesqyx commented Jun 20, 2024

Tagging manually for review, apology for the spam: @cheesesashimi @yuqi-zhang @sinnykumari @mrunalp @jlebon
@cgwalters @sdodson

@inesqyx
Copy link
Author

inesqyx commented Jun 25, 2024

Update on Kernel Argument:
Doc: containers/bootc@a67286f
PR: containers/bootc#630

@travier
Copy link
Member

travier commented Jun 27, 2024

Copy link
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some suggestions inline, thanks for starting this enhancement!

My main question after reading is I'm not really sure if we're trying to describe the tech preview (and just enabling bootc in the MCO) or the overall plan for bootc via this enhancement. I think it would help me understand better if we directly state that, and have clearly defined goals around it

enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved
enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved
enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved
enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved
enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved
enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved
enhancements/machine-config/bootc_update_path.md Outdated Show resolved Hide resolved

### Topology Considerations

#### Hypershift / Hosted Control Planes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will have to align with Hypershift in the future. As it stands Hypershift wouldn't utilize this codepath

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, Jerry! Do you by any chance know if anyone from the Hypershift team would be interested in this as well? We could include him/her/them into this conversation and review process to get better alignment in early stages.

@jlebon
Copy link
Member

jlebon commented Jul 4, 2024

My main question after reading is I'm not really sure if we're trying to describe the tech preview (and just enabling bootc in the MCO) or the overall plan for bootc via this enhancement. I think it would help me understand better if we directly state that, and have clearly defined goals around it

Agreed, yeah.

I think for this enhancement, we should take a broader approach and talk more about the overall story and phases than focus on the bootc part. E.g. we could retitle it "Image-mode integration into MCO" and have phases like (this is a rough list, we'd need to agree on importance and ordering):

  • Phase 1: always-on on-cluster layering for updates: cluster updates always result in an on-cluster rebuild to layer MachineConfigs (i.e. no more direct writes to the host's /etc)
  • Phase 2: move extensions handling to on-cluster layering
  • Phase 3: move kernel arguments handling to on-cluster layering (via https://containers.github.io/bootc/building/kernel-arguments.html)
  • Phase 4: use bootc switch instead of rpm-ostree rebase

Each one of those phases has a lot of details to work through, and ironically the last phase to move from calling rpm-ostree to bootc is likely the simplest of them (but obviously opens the door to then have tighter integration with bootc, like the mentioned configmap support).

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jul 24, 2024

@inesqyx: This pull request references MCO-1026 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.17.0" version, but no target version was set.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@inesqyx inesqyx force-pushed the MCO-1026-bootc-enhancement branch from 78cc643 to 80442b2 Compare July 24, 2024 17:47
Copy link
Contributor

openshift-ci bot commented Aug 8, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hasbro17 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


### User Stories

* As an OpenShift cluster administrator I would like MCO to leverage RHEL disk images to boot/update my nodes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* As an OpenShift cluster administrator I would like MCO to leverage RHEL disk images to boot/update my nodes
* As an OpenShift cluster administrator I would like MCO to leverage image-mode RHEL to boot/update my nodes

Disk images are only involved in the first boot of the node. After that it's container images.


After On Clustering GA, we will gradually deprecate the current configuration method of direct file writing to disk and move to layered update path (Phase 1) by default. At the end of this phase, OCP nodes will be configured in a full image-based way and will have a RHEL-bootc layer, a CoreOS layer, an OCP node layer, and day-2 configuration layers added on top built and rolled out by On Cluster Layering discussed in Phase 1. Please note here that the proposal and implementation details of the first three layers (the RHEL-bootc layer, the CoreOS layer and the OCP node layer) are explained in the Splitting the RHCOS Image into layers enhancement, thus a non-goal for this enhancement. Some of the gaps we observed here are:
* "Default" MCO managed configuration moved to /usr and shipped as a container image layer: Includes all MCO specific config rendered to /usr + MachineConfigs which are rendered to /etc -> /usr/etc (to be confirmed)
* Configs that involve file writing to immutable local file system directories need to be handled: The immutable CoreOS model guards the immutability of certain directories. With that being said, if we write a file (e.g. SSH key) to an immutable directory during an OS image build and use rpm-ostree to rebase onto the newly built OS image, that file will not actually be there when the node reboots.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand what you're getting at here, but let me clarify a few things.

In a container build, there are no immutable directories. That's why you can e.g. install packages or change stuff in /usr.

/var is special. It's not immutable either but in this context, files written to it will simply be ignored when deploying to the node because nodes have their own /var that wins.

SSH keys are a big one, but it's basically any path in the MachineConfig under /var will have this issue. I think in the short/mid-term, we'll have to keep having the MCO manually write those to the nodes (though that could still be done transactionally). But ideally longer term, we can work towards moving all big use cases for this to other directories under /usr or /etc or to systemd-tmpfiles dropins instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


### Phase 3: Deprecate rpm-ostree and Move Layered Image Update Path to Bootc

Upon completion of the above, the last step is to update the image rollout mechanism. In the foreseeable future, we will be expecting a dual-stream RHCOS where rpm-ostree will be gradually deprecated and transition its responsibility to bootc. Following this trend, MCO will also enable image deployment by both “rpm-ostree rebase” and “bootc switch” and, in the end, settle down on “bootc switch” for image rollout.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the "dual-stream RHCOS" referring to here? I.e. that we'll have e.g. 9.x and 10.x versions? Why wouldn't they be handled the same way?

### Workflow Description

1. The cluster administrator wants to add new configuration to the OS on each of their cluster nodes.
2. The cluster administrator specifies base OS image pull spec & secret, base OS extension pull spec, rendered image pull spec & secret, rendered image push spec & secret and custom content through a MachineOSConfig object.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hopefully we don't have a need for the extensions container in the future.

Also ideally there's an optimized path for the common case where the rendered image pull and push secrets are the same?

After On Clustering GA, we will gradually deprecate the current configuration method of direct file writing to disk and move to layered update path (Phase 1) by default. At the end of this phase, OCP nodes will be configured in a full image-based way and will have a RHEL-bootc layer, a CoreOS layer, an OCP node layer, and day-2 configuration layers added on top built and rolled out by On Cluster Layering discussed in Phase 1. Please note here that the proposal and implementation details of the first three layers (the RHEL-bootc layer, the CoreOS layer and the OCP node layer) are explained in the Splitting the RHCOS Image into layers enhancement, thus a non-goal for this enhancement. Some of the gaps we observed here are:
* "Default" MCO managed configuration moved to /usr and shipped as a container image layer: Includes all MCO specific config rendered to /usr + MachineConfigs which are rendered to /etc -> /usr/etc (to be confirmed)
* Configs that involve file writing to immutable local file system directories need to be handled: The immutable CoreOS model guards the immutability of certain directories. With that being said, if we write a file (e.g. SSH key) to an immutable directory during an OS image build and use rpm-ostree to rebase onto the newly built OS image, that file will not actually be there when the node reboots.
* Address functionality gap between bootc and rpm-ostree: figure out a story for kernel argument changes as a container image layer/label
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

- A: Go bindings for bootc is not a must and, as a result, will be outside of the scope of this enhancement and will not be implemented. The main purpose of creating go bindings for rpm-ostree/bootc is to have an easier way to parse the created json and to read the status of rpm-ostree/bootc. This can be done via creating a simple wrapper function inside of the MCO. It will make sense to separate it out to a standalone helper/library in the future when the demand raises, but is a non-goal for now.

* Discussion question for gaps highlighted in phase 2
- Q: A problem with the current model is that configs that involve file writing to local file system directories are not respected by “rpm-ostree rebase” / “bootc switch”. For example, if we write an SSH key to “/home/core/.ssh/authorized_keys'' directory during an OS image build and use rpm-ostree to rebase onto the newly built OS image, that file will not actually be there when the node reboots. Will this be handled/considered in the future planning of RHCOS and bootc?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is related to my earlier comment about /var.

Comment on lines +143 to +145
- Everything rebootless should not be in the image, unless we have live image changes. It is better for those changes to be in writable directories on the disk. Need to keep in mind the priority order of /usr vs /etc.
- MCO would have to apply its own internal configs first then the MachineConfigs.
- [Actionable item] Get a list of files managed by the MCO to further refine this.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoiding reboots when possible isn't unique to OCP of course. I think ideally here we try to lower this functionality into bootc directly. See e.g. containers/bootc#76 related to this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants