
WIP: MachineConfigPool Surge and NodeDrainTimeout Support #1616

Closed
jupierce wants to merge 2 commits

Conversation

jupierce (Contributor)

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 30, 2024

openshift-ci bot commented Apr 30, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jupierce. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@petr-muller (Member)

/cc

@openshift-ci openshift-ci bot requested a review from petr-muller May 7, 2024 12:46
Comment on lines 70 to 78
As previously mentioned, there are other reasons that drains can stall. For example, `PodDisruptionBudgets` can
be configured in such a way as to prevent pods from draining even if there is sufficient capacity for them
to be rescheduled on other nodes. A powerful (though blunt) tool to prevent drain stalls is to limit the amount of time
a drain operation is permitted to run before forcibly terminating pods and allowing an update to proceed.
`NodeDrainTimeout`, in HCP's `NodePools`, allows users to configure this timeout.
The Managed Update Operator also supports this feature with [`PDBForceDrainTimeout`](https://github.com/openshift/managed-upgrade-operator/blob/master/docs/faq.md).

This enhancement includes adding `NodeDrainTimeout` to `MachineConfigPools` to provide this feature in standalone
cluster environments.
Member:

It feels like surge and drain timeouts should be separate enhancements: the same high-level problem motivation, but a different set of involved components, API concerns, etc.
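For illustration only, here is a rough sketch of how the field described in the excerpt above could surface on a `MachineConfigPool` (the `nodeDrainTimeout` field is the proposal under review, not an existing MCO API; its name and placement are assumptions):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  # Proposed (hypothetical) field: if a node drain has not completed within
  # this duration, remaining pods are forcibly terminated and the update proceeds.
  nodeDrainTimeout: 600s
```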


Like HCP, `UpgradeType` will support:
- `InPlace` where no additional nodes are brought online to support draining workloads.
Member:

"InPlace" is the way nodes are updated today. Is that changing for non-HCP? If yes, where is the reference? I do not see this in the spec for MCP. MUO implements surge for the current in-place node updates. These need to be decoupled, upgrade type independent of max surge configuration.

Contributor:

I agree here, I don't think we should switch to having multiple update types as part of this enhancement.


## Summary

Add `MaxSurge` and `NodeDrainTimeout` semantics to `MachineConfigPool` to improve the predictability
@sinnykumari (Contributor) commented Jun 6, 2024:

Looking at the enhancement, it feels like MaxSurge and NodeDrainTimeout could be two separate enhancements. MaxSurge looks more like a MAPI/CAPI feature, and NodeDrainTimeout is something that fits into MCO's MachineConfigPool.

jupierce (author):

You can find these values in CAPI MachineDeployment today: https://github.com/Nordix/cluster-api/blob/9a2d8cdc5ad681ba407e47106cb159bfa708763c/config/crd/bases/cluster.x-k8s.io_machinedeployments.yaml .
I'm assuming we will not be suggesting direct manipulation of MachineDeployment once we are on CAPI; i.e. we will still be using MachineConfigPool as the abstraction / customer touch point for this type of configuration.

Contributor:

> I'm assuming we will not be suggesting direct manipulation of MachineDeployment once we are on CAPI; i.e. we will still be using MachineConfigPool as the abstraction / customer touch point for this type of configuration.

I disagree. We are expecting users to move towards MachineDeployment instead of MachineSets to provide greater abstraction and upgrade features such as RollingUpdates. MachineConfigPool is really about the software that is configured on a group of nodes; the MachineDeployment, IMO, is the one that specifies how that gets onto the nodes, whether that is in-place or via a rolling update.

@JoelSpeed (Contributor) left a comment:

What is the priority of this enhancement? I think what you're looking for really is the behaviour of the CAPI MachineDeployment.

We are intending to bring this into OCP at some point, probably 4.19/4.20 ish.

We still need to work out exactly how upgrades will happen with MachineDeployments, and will need to teach the MCO to allow us to choose the most up-to-date rendered MachineConfig to boot from. But otherwise, I think that achieves exactly what you're looking for here, and is how HCP implements this already.

CC @enxebre do you agree?
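For readers less familiar with it, the Cluster API `MachineDeployment` rollout behaviour being referenced looks roughly like this (a sketch against the upstream `cluster.x-k8s.io/v1beta1` types; values are illustrative and unrelated fields are omitted):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: worker-md
spec:
  clusterName: my-cluster          # illustrative name
  replicas: 6
  strategy:
    type: RollingUpdate            # machines are replaced rather than updated in place
    rollingUpdate:
      maxSurge: 1                  # at most one extra machine above replicas during the rollout
      maxUnavailable: 0            # never dip below the desired replica count
  # selector and template (bootstrap/infrastructure references) omitted for brevity
```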

Comment on lines 58 to 60
One cost-effective approach to ensure capacity is called "surging". With a surge strategy, during an update, the platform
is permitted to bring additional worker nodes online, to accommodate workloads being drained from existing nodes. After an
update concludes, the surged nodes are scaled down and the cluster resumes its steady state.
Contributor:

2 issues I can think of here:

  1. How do you determine which Machines to surge? A user may have many differently shaped MachineSets, and perhaps the workloads need particular qualities of those MachineSets to schedule. Can we reliably say that this Machine belongs to this MachineSet, so we create another Machine from the MachineSet and all will be well? I don't think we can, since MachineSets are mutable.
  2. By scaling up and then scaling down additional nodes, you are creating a double disruption for the user workloads. What would be better is to scale up one more machine, and then scale down to remove the last not-updated Machine at the end of the update. Is that possible?

Member:

In OSD/ROSA we scale all machinesets by creating a new machineset with replicas=1 that matches the original machineset in all but name (we append some string). In this proposal I'd expect as a machineset is being updated the same happens, blindly scaling everything by the surge amount.

Agree on the disruption! If it were possible to reduce that, it would be great. A challenge we have had for managed OpenShift to date is that MachineSets are not immutable while upgrades happen, hence the new MachineSet creation to isolate from any changes that might come in while an upgrade is in progress. This means we can't get away from this disruption. If surge were a first-class citizen, with this tweak to simply destroy non-updated machines to get back to the non-surge replicas, it would be very nice.
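A hedged sketch of the kind of temporary surge `MachineSet` described above (the name and suffix are hypothetical; the real MUO implementation may differ in detail):

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: worker-us-east-1a-surge    # original name plus an appended string
  namespace: openshift-machine-api
spec:
  replicas: 1                      # the surge capacity added for the duration of the upgrade
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machineset: worker-us-east-1a-surge
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-machineset: worker-us-east-1a-surge
    spec: {}                       # copied from the original MachineSet's template spec
```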

jupierce (author):

I am adding implementation details after further discussion with lifecycle that mirror what OSD has done. This should address the machineset shape question directly.
For disruption, I don't think any update mechanism tries to prevent double disruption (of non-HA workloads). Unless the kube-scheduler can add weight to "newest nodes", there is every chance a drained workload will find a seat on an old node that is, itself, soon to be drained.

Contributor:

> In OSD/ROSA we scale all machinesets by creating a new machineset with replicas=1 that matches the original machineset in all but name

You just mean during upgrades right? Not during normal operation?

> If surge were a first-class citizen, with this tweak to simply destroy non-updated machines to get back to the non-surge replicas, it would be very nice.

I believe that is the behaviour of the Cluster API MachineDeployment already, do you know if that is the behaviour on HCP?

> For disruption, I don't think any update mechanism tries to prevent double disruption (of non-HA workloads). Unless the kube-scheduler can add weight to "newest nodes", there is every chance a drained workload will find a seat on an old node that is, itself, soon to be drained.

Yeah, true, but I think the current suggestion of scaling up and then down again makes it more likely, and perhaps opens us up to triple disruption?

Member:

Yes, only in upgrade.

In HCP we're using default "Replace" behavior. I hope it's not updating a node just to delete one. Asking to confirm, will follow up here.

I haven't heard a requirement to optimize for least disruption of workloads. That feels like a larger thing to tackle and is not a new problem brought on by this enhancement. Scheduling to a yet-to-be-updated node is going to happen today 100% of the time for the first node update with additional chances on subsequent nodes.

Contributor:

> I haven't heard a requirement to optimize for least disruption of workloads. That feels like a larger thing to tackle and is not a new problem brought on by this enhancement. Scheduling to a yet-to-be-updated node is going to happen today 100% of the time for the first node update with additional chances on subsequent nodes.

Yes, agreed, but if we know that this is a problem and we are making choices that we know make it worse, do we really want to make those choices? We don't need to fix it now, but I'd have thought we would not want to add something we know makes the situation worse.

Member:

> I'd have thought we would not want to add something we know makes the situation worse

Would it be considered a regression for this behavior to change and for restarts of workloads to happen more often? If not, who makes the call to make that a concern we're willing to consider for regression?

Contributor:

I think that depends on whether this is something we test. I know that we do test disruption of workloads during upgrades today. For example, there are load balancer tests with example workloads that test how much downtime a customer application would face as we go through an upgrade.
Currently, because that works with PDBs and the LB health checks for the various clouds, I think that this wouldn't necessarily fall foul of those.

But I'd be interested to see what the TRT folks think, since they are the ones who look after our testing and pick up regressions like this.

Comment on lines +62 to +64
HyperShift Hosted Control Planes (HCP) already support the surge concept. HyperShift `NodePools`
expose `maxUnavailable` and `maxSurge` as configurable options during updates: https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.RollingUpdate .
Unfortunately, standalone OpenShift, which uses `MachineConfigPools`, does not. To work around this
Contributor:

These are features of MachineDeployments. I'm not sure how HyperShift handles OS updates, so perhaps they are replacing workers for OS updates, but in either case MachineDeployments are already on the CAPI roadmap and will exist within OCP at some point in the future.

jupierce (author) commented Jun 14, 2024:

They are indeed in MachineDeployment. I was thinking MachineConfigPool would still be the touchpoint for configuration in standalone though.

Contributor:

I think it would probably be preferable all round to not duplicate the features here, since we want MachineDeployment in the standalone product in the future. Perhaps instead of spending cycles implementing this here, the time could be spent helping to bring MachineDeployment in and solving the issues within OCP that this would cause (like the MCS being able to serve a particular new Ignition version of the rendered configuration).

Contributor:

Hypershift does indeed replace all workers. I think one thing we can look to do in standalone is allow a hybrid update model, where you can configure maxSurge and maxUnavailable on top.

What I mean by that is, let's say you have 6 worker nodes, and you want the end result to be 6 worker nodes, and you don't really care how that's achieved, so long as:

  1. you never go above 9 workers (maxSurge of 3), and
  2. 4 workers are always available for scheduling (maxUnavailable of 4)

In that case something in the cluster can configure it such that:

  1. some nodes are updated in place, in parallel
  2. some nodes are surged into the cluster on the newest configuration directly
  3. some nodes are simply removed

In that case this something needs to preserve the maxSurge and maxUnavailable we set above. Could that be done via a hybrid of CAPI and MCO? Or should there be a central management point for this?
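For reference, the HyperShift `NodePool` API linked in the excerpt expresses these knobs roughly as follows (a sketch; see the linked v1beta1 reference for the authoritative schema):

```yaml
apiVersion: hypershift.openshift.io/v1beta1
kind: NodePool
metadata:
  name: example
spec:
  management:
    upgradeType: Replace           # or InPlace
    replace:
      strategy: RollingUpdate      # or OnDelete
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 0
  # platform, release, and replicas fields omitted for brevity
```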

### Preventing Other Stalled Drains

As previously mentioned, there are other reasons that drains can stall. For example, `PodDisruptionBudgets` can
be configured in such a way as to prevent pods from draining even if there is sufficient capacity for them
Contributor:

What if there is a legitimate reason for this, e.g. a user workload is genuinely disrupted and cannot get back to ready, and therefore the PDB is just doing its job?

Member:

When self-managed, it's fine for the admin to handle this on a case-by-case basis. For RH managed OpenShift it cannot scale; we have to automate this. And we don't know customer workloads; we actually cannot even see them. The best (only) thing we can do is ensure the upgrade progresses after a reasonable and documented waiting period.

Contributor:

I know I'm about to cross into a whole other territory here, but bear with me.

IIUC, the other enhancement is currently talking about separating control plane and worker upgrades so that our end users have greater control over when they are disrupted, right? I'm wondering if the "policy" for force upgrading should be a part of that configuration.

Users can't run workloads on the control plane, so I don't think this is a control plane issue.

Once the other EP is merged, how much do we care about workload nodes being upgraded, how much drift can there be without it causing headaches?

Member:

Customers can set the timeout for forcing drain today at a cluster level. For drift we care about infra nodes, which managed OpenShift does deploy; those should not fall behind. There should be a max allowed skew on control plane upgrade. I'll check the other EP for both of these.

jupierce (author):

I separated the enhancements because I think they have value in isolation, but it is most impactful for SD if we execute on both (i.e. they can dramatically reduce the MUO). I would still suggest separate configuration stanzas:

  1. It allows this enhancement config stanza to mirror HCP and MachineDeployment.
  2. It is valid and simplifying to separate (a) when an update should be performed (WIP: Add Change Management and Maintenance Schedules #1571) and (b) how an update should be performed (this enhancement).
  3. No one wants the maintenance schedule enhancement to be even longer :).

jupierce (author) commented Jun 14, 2024:

> Once the other EP is merged, how much do we care about workload nodes being upgraded?

Much less. I expect most customers will update worker nodes less often.

> how much drift can there be without it causing headaches?

N-2 today and N-3 tomorrow when we adopt upstream kube's posture.

Contributor:

I agree it makes sense to separate the two, but I'm wondering if this becomes much simpler if we assume that the maintenance schedules already exist (i.e. doing that one first and implementing this after).

> N-2 today and N-3 tomorrow when we adopt upstream kube's posture.

So realistically a maximum drain timeout could be set to a year and not cause an issue, interesting 🤔

* As an engineer in the Service Delivery organization, I want to use core platform
features instead of developing, evolving, and testing the Managed Update Operator.
* As an Operations team managing standalone and HCP based OpenShift clusters, I
want a consistent update experience leveraging a surge strategy and/or
Contributor:

Correct me if I'm wrong, but doesn't HCP have surge because it rolls out new nodes to handle updates rather than doing in-place upgrades like OCP does today?

I wonder if that's what we should be aiming for: rolling infrastructure replacements to get updates out might be simpler than trying to handle surge by adding additional MachineSets that are later removed.

Member:

HCP does replace nodes by default; I believe you can opt for in-place updates also. For managed OpenShift we use replace. I'm +1 on getting the replace option for classic if that's the direction OCP wants to go. We do have some customers in HCP that require in-place upgrades, if I remember correctly, so it needs to be configurable.

Contributor:

As far as I'm concerned, yes, it is something we want to bring in. I have this on the discussion topics already for the CAPI F2F at the end of July, where I'm hoping to get more consensus on how we are going to handle upgrades (CVO/MCO) in a more CAPI-driven OpenShift land.

I imagine that, yes, in the future both HCP and OCP will support both, and it would be preferable, I think, to leverage the same building blocks to do so rather than inventing something new.

Comment on lines +157 to +162
```yaml
upgradeType: "Replace"
replace:
  strategy: "RollingUpdate"
  rollingUpdate:
    maxUnavailable: 0
    maxSurge: 4
```
Contributor:

This is very much the Machine API/Cluster API side of things rather than MCP, especially if we are talking about Replace as a strategy.


The `InPlace` update type is similar to `MachineConfigPools` traditional behavior where a
user can configure the `MaxUnavailable` nodes. This approach assumes the number (or percentage)
of nodes specified by `MaxUnavailable` can be drained simultaneously with workloads finding
Contributor:

What happens here today? Does it do nodes 1 by 1 or in groups? Is there any existing configuration for this?

Member:

MCP has maxUnavailable today; the difference here would be the scope. It has been ages since I poked at the behavior of maxUnavailable > 1 for MCP, but what I recall is that it wouldn't care what zones or MachineSets a machine was a member of. It simply picked machines and went to town against all machines in the pool. That pool being "worker" meant it could be very disruptive to workloads.
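The existing knob being described is the `MachineConfigPool`'s `maxUnavailable` field, which looks roughly like this today (sketch):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
spec:
  # Existing field: how many nodes in the pool may be cordoned, drained, and
  # updated in parallel. The MCO picks candidates without regard to zone or MachineSet.
  maxUnavailable: 2
```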

Contributor:

I think if we did try to leverage the MachineDeployment for this, then the disruption would be tied to the MD, and so would allow more disruption across the cluster while staying limited within classes of Machines.


The `Replace` update type removes old machines and replaces them with new instances. It supports
two strategies:
- `OnDelete` where a new machine is brought online only after the old machine is deleted.
Contributor:

You may want to clarify here that "deleted" means marked for deletion; it doesn't mean that the node has gone away. E.g. the cluster might not have space to move the evicted pods, so a new node needs to come up before this deleted one can go away.

jupierce (author):

According to kubernetes-sigs/cluster-api#4346, OnDelete in CAPI implies:

> The machineDeployment controller waits for the machine to get deleted completely and then will proceed to provision the new replica with the new configuration.

This can absolutely lead to wedging in a cluster with insufficient capacity. The main use case I can see here is if you are truly limited to N nodes by your provider and attempts at N+1 would fail indefinitely. OnDelete with a drain timeout achieves an update (with disruption) in this scenario; you can at least get through the update.
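For context, the CAPI shape being discussed here, `OnDelete` combined with a drain timeout, would look roughly like this (a sketch against `cluster.x-k8s.io/v1beta1`; field placement is my reading of the upstream types, so treat it as approximate):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: capacity-constrained-pool
spec:
  strategy:
    type: OnDelete                 # a replacement machine is created only after the old one is fully deleted
  template:
    spec:
      # Per-machine drain timeout: after this duration the drain is abandoned
      # and deletion proceeds even if some pods could not be evicted gracefully.
      nodeDrainTimeout: 600s
  # clusterName, selector, and bootstrap/infrastructure references omitted for brevity
```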

Contributor:

Ack, that is not how I remember that working... Hmm, wonder if that changed 🤔

Then yes, fair with the drain timeout.

Out of interest, what would you expect as an acceptable range of values for the drain timeout? What if I tried to set it to 1 year?

jupierce (author):

> Out of interest, what would you expect as an acceptable range of values for the drain timeout? What if I tried to set it to 1 year?

Between 1 second (the admin just wants the update to unfold immediately) and int64 seconds (the grace period data type in the pod spec). If the pod spec doesn't have a perspective on what is reasonable, I don't think we should either.

to work across all zones, all such `MachineSets` should be associated with a `MachineConfigPool` with well-considered
values for `MaxSurge` and `NodeDrainTimeout`.

Each `MachineSet` associated with a `MachineConfigPool` will be permitted to scale by the `MaxSurge` number of nodes.
Contributor:

This will be confusing for end users; we may want to name the field so that it indicates this better than it does today.

Also, this really makes me think we want to reverse this and look at it as a CAPI problem rather than an MCP problem.
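To make the concern concrete with a hedged example: if a `MachineConfigPool` covering three `MachineSets` sets `MaxSurge: 2`, the semantics described in the excerpt would let each `MachineSet` grow by two machines during the update, so the pool as a whole could temporarily add up to six nodes rather than two.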


openshift-ci bot commented Jun 14, 2024

@jupierce: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/markdownlint | 784f341 | link | true | /test markdownlint |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-bot

Inactive enhancement proposals go stale after 28d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle stale.
Stale proposals rot after an additional 7d of inactivity and eventually close.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 16, 2024
@openshift-bot

Stale enhancement proposals rot after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Rotten proposals close after an additional 7d of inactivity.
Exclude this proposal from closing by commenting /lifecycle frozen.

If this proposal is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jul 23, 2024
@yuqi-zhang (Contributor) left a comment:

Some initial comments



### Goals

- Implement an update configuration, including `MaxSurge`, similar to HCP's `NodePool`
in standalone OpenShift's `MachineConfigPool`.
- Implement `NodeDrainTimeout`, similar to HCP's `NodePool`, in standalone OpenShift's `MachineConfigPool`.
Contributor:

This should be a bit more straightforward to do since it shouldn't be platform-specific.



Nodes are rebooted after they are drained.

#### Replace Update Type
Contributor:

There's an implicit difference between HCP and MCO in that HCP's replace will always use the latest payload bootimage, and MCO will always use the install-time bootimage. If we do want to get into the business of update types, maybe we should make bootimage management fully explicit as a dependency of this (see the linked enhancement above).

@openshift-bot

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Aug 3, 2024

openshift-ci bot commented Aug 3, 2024

@openshift-bot: Closed this PR.

In response to this:

Rotten enhancement proposals close after 7d of inactivity.

See https://github.com/openshift/enhancements#life-cycle for details.

Reopen the proposal by commenting /reopen.
Mark the proposal as fresh by commenting /remove-lifecycle rotten.
Exclude this proposal from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
