---
title: machine-config-pool-update-surge-and-nodedraintimeout
authors:
  - jupierce
reviewers:
  - TBD
approvers:
  - TBD
api-approvers:
  - TBD
creation-date: 2024-04-29
last-updated: 2024-04-29
tracking-link:
  - TBD
see-also:
  - https://github.com/openshift/enhancements/pull/1571
---

# MachineConfigPool Update Surge and NodeDrainTimeout

## Summary

Add `MaxSurge` and `NodeDrainTimeout` semantics to `MachineConfigPool` to improve the predictability
of standalone OpenShift cluster updates. `MaxSurge` allows clusters to scale above configured replica
counts during an update -- helping to ensure worker node capacity is available for drained workloads.
`NodeDrainTimeout` limits the amount of time an update operation will block waiting for a potentially
stalled drain to succeed -- helping to ensure that updates can proceed (by incurring disruption) even
in the presence of poorly configured workloads.

## Motivation

During a typical worker node update for an OpenShift cluster, it is necessary to "cordon" nodes (prevent new pods from being scheduled on a node)
and "drain" them (attempt to migrate workloads by rescheduling their pods onto uncordoned nodes). Workers generally need to be rebooted during a
cluster update, and draining nodes is standard practice before rebooting them. If they were not drained first, pods running
on a node targeted by the update process could be terminated with no other viable pods on the cluster to
handle the workload. This outcome can cause a disruption in the service the terminated pod was attempting to provide. For example,
an incoming web request may not be routable to a pod for a given Kubernetes service - resulting in errors being returned
to the consumers of that service.

With appropriate cluster management, node draining can be used to ensure that sufficient pods are running to satisfy workload requirements
at all times - even during updates. "Appropriate cluster management," though, is a multi-faceted challenge involving considerations
from pod replicas to cluster topology.

### Managing Worker Node Capacity

One aspect of this challenge is ensuring that, while a node is being drained, there is sufficient worker node capacity (CPU/memory/other
resources/topology) available for new pods to take the place of old pods from the node being drained. Consider the reductive example of
a static cluster with a single worker node. If there is an attempt to drain the node in this example, there is no additional worker
node capacity available to schedule new pods to replace the pods being drained. This can result in a stalled drain -- one that
does not terminate until there is external intervention.

Stalled drains create a frustrating experience for operations teams as they require analysis and intervention. They can also
make it impossible to predict when an update will complete -- complicating work schedules and communications. There are a number of reasons
drains can stall, but simple lack of spare worker node capacity is a common one. One solution to this problem is
to turn on autoscaling - allowing a cluster to add nodes if pods are unschedulable. This reduces the likelihood of the problem
without eliminating it (i.e. if the cluster is at capacity and has provisioned the maximum number of nodes permitted by its
autoscaler configuration). Administrators may also be hesitant to use autoscaling (e.g. they prefer a fixed number of
nodes to guarantee they do not significantly exceed expected opex).

Capacity related stalled drains are particularly troublesome for our managed fleet. Our SRE team needs to be able to
ensure that updates across the fleet can proceed without individual manual attention. With customer managed
configurations and workloads, the ability to drain nodes in a customer environment is highly unpredictable.

One cost-effective approach to ensure capacity is called "surging". With a surge strategy, during an update and
only during an update, the platform is permitted to bring additional worker nodes online to accommodate workloads
being drained from existing nodes. After the update concludes, the surged nodes are scaled down and the cluster
resumes its steady state.

HyperShift Hosted Control Planes (HCP) already support the surge concept. HyperShift `NodePools`
expose `maxUnavailable` and `maxSurge` as configurable options during updates: https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.RollingUpdate .
Unfortunately, standalone OpenShift, which uses `MachineConfigPools`, does not. To work around this
limitation for managed services customers, Service Delivery developed a custom Managed Upgrade Operator (MUO)
which can surge a standalone cluster during an update (see [reserved capacity feature](https://github.com/openshift/managed-upgrade-operator/blob/a56079fda6ab4088f350b05ed007896a4cabcd97/docs/faq.md)).

### Preventing Other Stalled Drains

There are other reasons that drains can stall. For example, `PodDisruptionBudgets` can
be configured in such a way as to prevent pods from draining even if there is sufficient capacity for them
to be rescheduled on other nodes. A powerful (though blunt) tool to prevent drain stalls is to limit the amount of time
a drain operation is permitted to run before forcibly terminating pods and allowing an update to proceed.
`NodeDrainTimeout`, in HCP's `NodePools`, allows users to configure this timeout.
The Managed Upgrade Operator also supports this feature with [`PDBForceDrainTimeout`](https://github.com/openshift/managed-upgrade-operator/blob/master/docs/faq.md).

This enhancement includes adding `NodeDrainTimeout` to `MachineConfigPools` to provide this feature in standalone
cluster environments. The timeout will only apply to drains triggered by the Machine Config Operator (e.g.
it will not impact drains triggered by the CLI).

### User Stories

Implementing surge and node drain timeout support in `MachineConfigPools` can simplify cluster management for self-managed
standalone clusters as well as managed clusters (i.e. Service Delivery can remove this customized behavior from the MUO and use
more of the core platform).

* As an Operations team managing one or more standalone clusters, I want to
  help ensure smooth updates by surging my worker node count without constantly
  having my cluster over-provisioned.
* As an Operations team managing one or more standalone clusters, I want to
  ensure my cluster update makes steady progress by limiting the amount of time a node drain can
  consume.
* As an engineer in the Service Delivery organization, I want to use core platform
  features instead of developing, evolving, and testing the Managed Upgrade Operator.
* As an Operations team managing standalone and HCP based OpenShift clusters, I
  want a consistent update experience leveraging a surge strategy and/or
  node drain timeouts regardless of the cluster profile.

### Goals

- Implement an update configuration, including `MaxSurge`, similar to HCP's `NodePool`,
  in standalone OpenShift's `MachineConfigPool`.
- Implement `NodeDrainTimeout`, similar to HCP's `NodePool`, in standalone OpenShift's `MachineConfigPool`.
- Provide consistent update controls between standalone and HCP cluster profiles.
- Allow Service Delivery to deprecate their MUO reserved capacity & `PDBForceDrainTimeout` features and use more of the core platform.

### Non-Goals

- Address all causes of problematic updates.
- Prevent workload disruption when `NodeDrainTimeout` is utilized.
- Fully unify the update experience for Standalone vs HCP.

## Proposal

The HyperShift HCP `NodePool` exposes a [`NodePoolManagement`](https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.NodePoolManagement)
stanza which captures traditional `MachineConfigPool` update semantics ([`MaxUnavailable`](https://docs.openshift.com/container-platform/4.14/rest_api/machine_apis/machineconfigpool-machineconfiguration-openshift-io-v1.html#spec))
as well as the ability to specify a `MaxSurge` preference. HCP's `NodePool` also exposes a `NodeDrainTimeout` configuration
option.

This enhancement proposes that an analog for `NodePoolManagement` and `NodeDrainTimeout` be added
to standalone OpenShift's `MachineConfigPool` custom resource.

### Workflow Description

**Cluster Lifecycle Administrator** is a human user responsible for triggering, monitoring, and
managing all aspects of a cluster update. They are operating a standalone OpenShift cluster.

1. The cluster lifecycle administrator desires to ensure that there is sufficient worker node capacity during
   updates to handle graceful termination of pods and rescheduling of workloads.
2. They want to avoid other causes of drain stalls by limiting the amount of time permitted for any drain operation.
3. They configure worker `MachineConfigPools` on the cluster with a `MaxSurge`
   value that will bring additional worker node capacity online for the duration of an update.
4. They configure worker `MachineConfigPools` on the cluster with a `NodeDrainTimeout` value of 30 minutes to
   limit the amount of time non-capacity related draining issues can stall the overall update.

### API Extensions

#### API Overview

The Standalone `MachineConfigPool` custom resource is updated to include new update strategies (one of which
supports `MaxSurge`) and `NodeDrainTimeout` semantics identical to HCP's `NodePool`.

Documentation for these configuration options can be found in HyperShift's API reference:
- https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.NodePoolManagement exposes `MaxSurge`.
- https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.NodePoolSpec exposes `NodeDrainTimeout`.

Example `MachineConfigPool` including both `NodeDrainTimeout` and a `MaxSurge` setting:
```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  # Adopted from NodePool to create consistency and further our goal
  # to improve the reliability of worker updates. This only applies
  # to drains triggered by the MCO (e.g. CLI triggered drains will
  # not be impacted).
  nodeDrainTimeout: 10m

  # New policy analog to NodePool.NodePoolManagement.
  machineManagement:
    upgradeType: "Replace"
    replace:
      strategy: "RollingUpdate"
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 4
```

Like HCP, `UpgradeType` will support:
- `InPlace` where no additional nodes are brought online to support draining workloads.
- `Replace` where new nodes will be brought online with `MaxSurge` support.

#### InPlace Update Type

The `InPlace` update type is similar to the traditional `MachineConfigPool` behavior where a
user can configure the `MaxUnavailable` number of nodes. This approach assumes the number (or percentage)
of nodes specified by `MaxUnavailable` can be drained simultaneously with workloads finding
sufficient resources on other nodes to avoid stalled drains.

```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  machineManagement:
    upgradeType: "InPlace"
    inPlace:
      maxUnavailable: 10%
```

Nodes are rebooted after they are drained.

#### Replace Update Type

The `Replace` update type removes old machines and replaces them with new instances. It supports
two strategies:
- `OnDelete` where a new machine is brought online only after the old machine is deleted.
- `RollingUpdate` which supports the `MaxSurge` option.

```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  machineManagement:
    upgradeType: "Replace"
    replace:
      strategy: "RollingUpdate"
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 4
```

`MaxSurge` applies independently to each associated MachineSet. For example, if three MachineSets are associated
with a MachineConfigPool, and `MaxSurge` is set to 4, then it is possible for the cluster to surge up to 12 nodes
(4 for each of the 3 MachineSets).

The `OnDelete` strategy is included for consistency with HCP. It does not directly support
the consistent update experience motivation driving this enhancement. However, it does provide
value to customers with highly static environments. Consider a standalone customer using a
provider where they have a fixed quota of machines. Autoscaling and surging are not options
in this case. To provide a reliable update, they would select `OnDelete` and specify a `NodeDrainTimeout`.
This will likely result in workload disruption for an at-capacity cluster during an upgrade, but the administrator
is at least empowered to make that tradeoff.

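For illustration only, a configuration for the fixed-quota scenario above might look like the following sketch. It reuses the proposed `machineManagement` stanza shown earlier; the exact field shape is still subject to API review, and the 30 minute timeout is simply an example value.

```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  # Force stalled drains to resolve within 30 minutes since no surge capacity is available.
  nodeDrainTimeout: 30m

  machineManagement:
    upgradeType: "Replace"
    replace:
      # New machines are only created after old machines are deleted,
      # so the fixed machine quota is never exceeded.
      strategy: "OnDelete"
```
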
### Topology Considerations

Multi-AZ (availability zone) clusters function by using one or more `MachineSets` per zone. In order for this enhancement
to work across all zones, all such `MachineSets` should be associated with a `MachineConfigPool` with well considered
values for `MaxSurge` and `NodeDrainTimeout`.

Each `MachineSet` associated with a `MachineConfigPool` will be permitted to scale by the `MaxSurge` number of nodes.

The alternative (trying to spread a surge value evenly across `MachineSets`) is problematic. Consider a cluster with two `MachineSets`:
- machine-set-1a which creates nodes in availability zone us-east-1a.
- machine-set-1b which creates nodes in availability zone us-east-1b.

Further, assume that `MaxSurge` is set to 1 for the `MachineConfigPool` associated with these `MachineSets`.

There may be pods running on 1b nodes that can only be scheduled on 1b nodes (e.g. due to taints / affinity /
topology constraints, machine type, etc.). If `MaxSurge` was interpreted in such a way as to only surge machine-set-1a by 1 node,
constrained pods requiring 1b nodes could not benefit from this additional capacity.

Instead, this enhancement proposes that each `MachineSet` be permitted to surge up to the `MachineConfigPool` surge
value independently.

#### Hypershift / Hosted Control Planes

N/A. Hosted Control Planes provide the model for the settings this enhancement seeks to expose in standalone clusters.

#### Standalone Clusters

The `MachineConfigPool` custom resource must be updated to expose the new semantics. The existing `spec.maxUnavailable`
will be deprecated in favor of the more expressive `MachineManagement` stanza.

#### Single-node Deployments or MicroShift

N/A.

### Implementation Details/Notes/Constraints

#### MaxSurge Implementation

##### Surge Setup

During a configuration update rollout, the Machine Config Operator (MCO) will determine which `MachineSets` are associated with
a `MachineConfigPool` with `MaxSurge` greater than 0. For each `MachineSet` meeting this requirement (if it does not possess
a proposed annotation `machineconfiguration.openshift.io/noSurge`), the MCO will create a near duplicate of the `MachineSet` with a few
key differences:
- The name of the resource will be `<~machineset-name>-surge-<nonce>`. The implementation must handle:
  - the truncation of the original `MachineSet` name if appending `surge-<nonce>` would violate k8s name length limitations.
  - the calculation of a nonce value that does not conflict with any existing resource that the controller did not, itself, create (as indicated by a special label).
- The new `MachineSet` will be labeled to clearly indicate that the MCO created the resource in order to satisfy a surge operation.
- The new `MachineSet` will be set with a replica count of 0 if `ClusterAutoscaler` exists and `MaxSurge` if `ClusterAutoscaler` does not exist.

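As a rough sketch (the `MachineSet` name, nonce, and label key below are illustrative assumptions, not final names), the surge `MachineSet` created for a `MachineSet` named `worker-us-east-1a` in a pool with `MaxSurge: 4` might look like:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  # Original name plus a surge suffix; truncated if it would exceed name length limits.
  name: worker-us-east-1a-surge-7f3a2
  namespace: openshift-machine-api
  labels:
    # Hypothetical label marking this MachineSet as MCO-created for a surge operation.
    machineconfiguration.openshift.io/surge: "true"
spec:
  # 0 when a ClusterAutoscaler exists (a MachineAutoscaler scales it on demand);
  # otherwise set directly to the pool's MaxSurge value.
  replicas: 0
  # selector and template are copied from the original MachineSet (not shown).
```
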
##### Surge With ClusterAutoscaler

If `ClusterAutoscaler` exists, for each surge `MachineSet`, an associated `MachineAutoscaler` will be instantiated with its
minimum replica value set to 0 and its maximum replica count set to `MaxSurge`. The `MachineAutoscaler` instance will also be labeled to indicate it
was created programmatically for the surge procedure.

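A sketch of the `MachineAutoscaler` that might accompany the surge `MachineSet` above. The `MachineAutoscaler` API shown is the existing OpenShift resource; the name and label are the same illustrative assumptions used in the previous sketch:

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a-surge-7f3a2
  namespace: openshift-machine-api
  labels:
    # Hypothetical label marking this resource as MCO-created for the surge procedure.
    machineconfiguration.openshift.io/surge: "true"
spec:
  # Scale from zero up to the MachineConfigPool's MaxSurge value (4 in the earlier example).
  minReplicas: 0
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-us-east-1a-surge-7f3a2
```
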
As nodes are drained, any unschedulable pods will cause the `ClusterAutoscaler` to scale an appropriate surge
`MachineSet` to supply the necessary capacity.

##### Surge Without ClusterAutoscaler

If the `ClusterAutoscaler` does not exist, `MachineAutoscalers` will not work. Instead, the surge `MachineSets` will have
their replica count set to `MaxSurge`. This is a less efficient use of cloud resources, so customer facing documentation
should suggest the use of `ClusterAutoscaler` when a surge strategy is being used.

##### Surge Teardown

Once a `MachineConfigPool` has consistent, up-to-date machines associated with it, the surge `MachineSet` and
(optional) `MachineAutoscaler` resources will be deleted. This will cause the nodes created for the surge to be
drained. This drain should obey the `NodeDrainTimeout` set in the `MachineConfigPool`.

#### Node Drain Timeout

When `NodeDrainTimeout` is non-zero, a normal cordon and drain should be attempted. However, if the duration of the attempt
surpasses `NodeDrainTimeout`, the node can be forcibly terminated.

### Risks and Mitigations

Service Delivery believes this enhancement is key to dramatically simplifying the MUO in conjunction with
https://github.com/openshift/enhancements/pull/1571. Without this enhancement,
https://github.com/openshift/enhancements/pull/1571 may not be useful to Service Delivery.

### Drawbacks

The primary drawback is that alternative priorities are not pursued or that the investment is not ultimately
warranted by the proposed business value.

### Removing a deprecated feature

- `MachineConfigPool.spec.maxUnavailable` will be deprecated.

## Upgrade / Downgrade Strategy

This feature is integral to standalone updates. Preceding sections discuss its behavior.

## Version Skew Strategy

N/A.

## Operational Aspects of API Extensions

The new stanzas are specifically designed to be tools used to improve update predictability
and reliability for operations teams. Preceding sections discuss their behavior.

## Support Procedures

The machine-api-operator logs will indicate the decisions being made to actuate the new configuration
fields. If machines are scaled into the cluster during an update but are unable to successfully join
the cluster, this scenario is debugged just as if the problem occurred during normal scaling operations.

## Alternatives

1. The status quo of standalone updates and the MUO can be maintained. We can assume
   that customers impacted by the existing operational burden of drain timeouts will
   find their own solutions or migrate to HCP.
2. Aspects of the MUO could be incorporated into the OpenShift core. Unfortunately, the MUO
   is deeply integrated into SD's architecture and is not easily productized.