---
title: machine-config-pool-update-surge-and-nodedraintimeout
authors:
  - jupierce
reviewers:
  - TBD
approvers:
  - TBD
api-approvers:
  - TBD
creation-date: 2024-04-29
last-updated: 2024-04-29
tracking-link:
  - TBD
see-also:
  - https://github.com/openshift/enhancements/pull/1571
---

# MachineConfigPool Update Surge and NodeDrainTimeout

## Summary

Add `MaxSurge` and `NodeDrainTimeout` semantics to `MachineConfigPool` to improve the predictability
of standalone OpenShift cluster updates. `MaxSurge` allows clusters to scale above configured replica
counts during an update -- helping to ensure worker node capacity is available for drained workloads.
`NodeDrainTimeout` limits the amount of time an update operation will block waiting for a potentially
stalled drain to succeed -- helping to ensure that updates can proceed (by incurring disruption) even
in the presence of poorly configured workloads.

## Motivation

During a typical worker node update for an OpenShift cluster, it is necessary to "cordon" nodes (prevent new pods from being scheduled on a node)
and "drain" them (attempt to migrate workloads by rescheduling their pods onto uncordoned nodes). Workers generally need to be rebooted during a
cluster update, and draining nodes is standard practice before rebooting them. If they were not drained first, pods running
on a node targeted by the update process could be terminated with no other viable pods on the cluster to
handle the workload. This outcome can cause a disruption in the service the terminated pod was attempting to provide. For example,
an incoming web request may not be routable to a pod for a given Kubernetes service - resulting in errors being returned
to the consumers of that service.

With appropriate cluster management, node draining can be used to ensure that sufficient pods are running to satisfy workload requirements
at all times - even during updates. "Appropriate cluster management," though, is a multi-faceted challenge involving considerations
from pod replicas to cluster topology.

### Managing Worker Node Capacity

One aspect of this challenge is ensuring that, while a node is being drained, there is sufficient worker node capacity (CPU/memory/other
resources/topology) available for new pods to take the place of old pods from the node being drained. Consider the reductive example of
a static cluster with a single worker node. If there is an attempt to drain the node in this example, there is no additional worker
node capacity available to schedule new pods to replace the pods being drained. This can result in a stalled drain -- one that
does not terminate until there is external intervention.

Stalled drains create a frustrating experience for operations teams as they require analysis and intervention. They can also
make it impossible to predict when an update will complete -- complicating work schedules and communications. There are a number of reasons
drains can stall, but simple lack of spare worker node capacity is a common one. One solution to this problem is
to turn on autoscaling - allowing a cluster to add nodes if pods are unschedulable. This reduces the likelihood of the problem
without eliminating it (i.e. if the cluster is at capacity and has provisioned the maximum number of nodes permitted by its
autoscaler configuration). Administrators may also be hesitant to use autoscaling (e.g. they prefer a fixed number of
nodes to guarantee they do not significantly exceed expected opex).

Capacity related stalled drains are particularly troublesome for our managed fleet. Our SRE team needs to be able to
ensure that updates across the fleet can proceed without individual manual attention. With customer managed
configurations and workloads, the ability to drain nodes in a customer environment is highly unpredictable.

One cost-effective approach to ensure capacity is called "surging". With a surge strategy, during an update and
only during an update, the platform is permitted to bring additional worker nodes online to accommodate workloads
being drained from existing nodes. After the update concludes, the surged nodes are scaled down and the cluster
resumes its steady state.

HyperShift Hosted Control Planes (HCP) already support the surge concept. HyperShift `NodePools`
expose `maxUnavailable` and `maxSurge` as configurable options during updates: https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.RollingUpdate .
Unfortunately, standalone OpenShift, which uses `MachineConfigPools`, does not. To work around this
limitation for managed services customers, Service Delivery developed a custom Managed Upgrade Operator (MUO)
which can surge a standalone cluster during an update (see [reserved capacity feature](https://github.com/openshift/managed-upgrade-operator/blob/a56079fda6ab4088f350b05ed007896a4cabcd97/docs/faq.md)).

### Preventing Other Stalled Drains

There are other reasons that drains can stall. For example, `PodDisruptionBudgets` can
be configured in such a way as to prevent pods from draining even if there is sufficient capacity for them
to be rescheduled on other nodes. A powerful (though blunt) tool to prevent drain stalls is to limit the amount of time
a drain operation is permitted to run before forcibly terminating pods and allowing an update to proceed.
`NodeDrainTimeout`, in HCP's `NodePools`, allows users to configure this timeout.
The Managed Upgrade Operator also supports this feature with [`PDBForceDrainTimeout`](https://github.com/openshift/managed-upgrade-operator/blob/master/docs/faq.md).

This enhancement includes adding `NodeDrainTimeout` to `MachineConfigPools` to provide this feature in standalone
cluster environments. The timeout will only apply to drains triggered by the Machine Config Operator (e.g.
it will not impact drains triggered by the CLI).

### User Stories

Implementing surge and node drain timeout support in `MachineConfigPools` can simplify cluster management for self-managed
standalone clusters as well as managed clusters (i.e. Service Delivery can remove this customized behavior from the MUO and use
more of the core platform).

* As an Operations team managing one or more standalone clusters, I want to
  help ensure smooth updates by surging my worker node count without constantly
  having my cluster over-provisioned.
* As an Operations team managing one or more standalone clusters, I want to
  ensure my cluster update makes steady progress by limiting the amount of time a node drain can
  consume.
* As an engineer in the Service Delivery organization, I want to use core platform
  features instead of developing, evolving, and testing the Managed Upgrade Operator.
* As an Operations team managing standalone and HCP based OpenShift clusters, I
  want a consistent update experience leveraging a surge strategy and/or
  node drain timeouts regardless of the cluster profile.

### Goals

- Implement an update configuration, including `MaxSurge`, similar to HCP's `NodePool`,
  in standalone OpenShift's `MachineConfigPool`.
- Implement `NodeDrainTimeout`, similar to HCP's `NodePool`, in standalone OpenShift's `MachineConfigPool`.
- Provide consistent update controls between standalone and HCP cluster profiles.
- Allow Service Delivery to deprecate their MUO reserved capacity & `PDBForceDrainTimeout` features and use more of the core platform.

### Non-Goals

- Address all causes of problematic updates.
- Prevent workload disruption when `NodeDrainTimeout` is utilized.
- Fully unify the update experience for Standalone vs HCP.

## Proposal

The HyperShift HCP `NodePool` exposes a [`NodePoolManagement`](https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.NodePoolManagement)
stanza which captures traditional `MachineConfigPool` update semantics ([`MaxUnavailable`](https://docs.openshift.com/container-platform/4.14/rest_api/machine_apis/machineconfigpool-machineconfiguration-openshift-io-v1.html#spec))
as well as the ability to specify a `MaxSurge` preference. HCP's `NodePool` also exposes a `NodeDrainTimeout` configuration
option.

This enhancement proposes that an analog for `NodePoolManagement` and `NodeDrainTimeout` be added
to standalone OpenShift's `MachineConfigPool` custom resource.

### Workflow Description

**Cluster Lifecycle Administrator** is a human user responsible for triggering, monitoring, and
managing all aspects of a cluster update. They are operating a standalone OpenShift cluster.

1. The cluster lifecycle administrator desires to ensure that there is sufficient worker node capacity during
   updates to handle graceful termination of pods and rescheduling of workloads.
2. They want to avoid other causes of drain stalls by limiting the amount of time permitted for any drain operation.
3. They configure worker `MachineConfigPools` on the cluster with a `MaxSurge`
   value that will bring additional worker node capacity online for the duration of an update.
4. They configure worker `MachineConfigPools` on the cluster with a `NodeDrainTimeout` value of 30 minutes to
   limit the amount of time non-capacity related draining issues can stall the overall update.

### API Extensions

#### API Overview

The Standalone `MachineConfigPool` custom resource is updated to include new update strategies (one of which
supports `MaxSurge`) and `NodeDrainTimeout` semantics identical to HCP's `NodePool`.

Documentation for these configuration options can be found in HyperShift's API reference:
- https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.NodePoolManagement exposes `MaxSurge`.
- https://hypershift-docs.netlify.app/reference/api/#hypershift.openshift.io/v1beta1.NodePoolSpec exposes `NodeDrainTimeout`.

Example `MachineConfigPool` including both `NodeDrainTimeout` and a `MaxSurge` setting:
```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  # Adopted from NodePool to create consistency and further our goal
  # to improve the reliability of worker updates. This only applies
  # to drains triggered by the MCO (e.g. CLI triggered drains will
  # not be impacted).
  nodeDrainTimeout: 10m

  # New policy analog to NodePool.NodePoolManagement.
  machineManagement:
    upgradeType: "Replace"
    replace:
      strategy: "RollingUpdate"
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 4
```

Like HCP, `UpgradeType` will support:
- `InPlace` where no additional nodes are brought online to support draining workloads.
- `Replace` where new nodes will be brought online with `MaxSurge` support.

#### InPlace Update Type

The `InPlace` update type is similar to the traditional `MachineConfigPool` behavior where a
user can configure the `MaxUnavailable` number of nodes. This approach assumes the number (or percentage)
of nodes specified by `MaxUnavailable` can be drained simultaneously with workloads finding
sufficient resources on other nodes to avoid stalled drains.

```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  machineManagement:
    upgradeType: "InPlace"
    inPlace:
      maxUnavailable: 10%
```

Nodes are rebooted after they are drained.

#### Replace Update Type

The `Replace` update type removes old machines and replaces them with new instances. It supports
two strategies:
- `OnDelete` where a new machine is brought online only after the old machine is deleted.
- `RollingUpdate` which supports the `MaxSurge` option.

```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  machineManagement:
    upgradeType: "Replace"
    replace:
      strategy: "RollingUpdate"
      rollingUpdate:
        maxUnavailable: 0
        maxSurge: 4
```

`MaxSurge` applies independently to each associated MachineSet. For example, if three MachineSets are associated
with a MachineConfigPool, and `MaxSurge` is set to 4, then it is possible for the cluster to surge up to 12 nodes
(4 for each of the 3 MachineSets).

The `OnDelete` strategy is included for consistency with HCP. It does not directly support
the consistent update experience motivation driving this enhancement. However, it does provide
value to customers with highly static environments. Consider a standalone customer using a
provider where they have a fixed quota of machines. Autoscaling and surging are not options
in this case. To provide a reliable update, they would select `OnDelete` and specify a `NodeDrainTimeout`.
This will likely result in workload disruption for an at-capacity cluster during an upgrade, but the administrator
is at least empowered to make that tradeoff.

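For illustration only, a configuration for the fixed-quota scenario above might look like the following sketch. It reuses the proposed `machineManagement` stanza shown earlier; the exact field shape is still subject to API review, and the 30 minute timeout is simply an example value.

```yaml
kind: MachineConfigPool
spec:
  # Existing spec fields are not shown.

  # Force stalled drains to resolve within 30 minutes since no surge capacity is available.
  nodeDrainTimeout: 30m

  machineManagement:
    upgradeType: "Replace"
    replace:
      # New machines are only created after old machines are deleted,
      # so the fixed machine quota is never exceeded.
      strategy: "OnDelete"
```
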
### Topology Considerations

Multi-AZ (availability zone) clusters function by using one or more `MachineSets` per zone. In order for this enhancement
to work across all zones, all such `MachineSets` should be associated with a `MachineConfigPool` with well considered
values for `MaxSurge` and `NodeDrainTimeout`.

Each `MachineSet` associated with a `MachineConfigPool` will be permitted to scale by the `MaxSurge` number of nodes.

The alternative (trying to spread a surge value evenly across `MachineSets`) is problematic. Consider a cluster with two `MachineSets`:
- machine-set-1a which creates nodes in availability zone us-east-1a.
- machine-set-1b which creates nodes in availability zone us-east-1b.

Further, assume that `MaxSurge` is set to 1 for the `MachineConfigPool` associated with these `MachineSets`.

There may be pods running on 1b nodes that can only be scheduled on 1b nodes (e.g. due to taints / affinity /
topology constraints, machine type, etc.). If `MaxSurge` was interpreted in such a way as to only surge machine-set-1a by 1 node,
constrained pods requiring 1b nodes could not benefit from this additional capacity.

Instead, this enhancement proposes that each `MachineSet` be permitted to surge up to the `MachineConfigPool` surge
value independently.

#### Hypershift / Hosted Control Planes

N/A. Hosted Control Planes provide the model for the settings this enhancement seeks to expose in standalone clusters.

#### Standalone Clusters

The `MachineConfigPool` custom resource must be updated to expose the new semantics. The existing `spec.maxUnavailable`
will be deprecated in favor of the more expressive `MachineManagement` stanza.

#### Single-node Deployments or MicroShift

N/A.

### Implementation Details/Notes/Constraints

#### MaxSurge Implementation

##### Surge Setup

During a configuration update rollout, the Machine Config Operator (MCO) will determine which `MachineSets` are associated with
a `MachineConfigPool` with `MaxSurge` greater than 0. For each `MachineSet` meeting this requirement (if it does not possess
a proposed annotation `machineconfiguration.openshift.io/noSurge`), the MCO will create a near duplicate of the `MachineSet` with a few
key differences:
- The name of the resource will be `<~machineset-name>-surge-<nonce>`. The implementation must handle:
  - the truncation of the original `MachineSet` name if appending `surge-<nonce>` would violate k8s name length limitations.
  - the calculation of a nonce value that does not conflict with any existing resource that the controller did not, itself, create (as indicated by a special label).
- The new `MachineSet` will be labeled to clearly indicate that the MCO created the resource in order to satisfy a surge operation.
- The new `MachineSet` will be set with a replica count of 0 if `ClusterAutoscaler` exists and `MaxSurge` if `ClusterAutoscaler` does not exist.

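As a rough sketch (the `MachineSet` name, nonce, and label key below are illustrative assumptions, not final names), the surge `MachineSet` created for a `MachineSet` named `worker-us-east-1a` in a pool with `MaxSurge: 4` might look like:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  # Original name plus a surge suffix; truncated if it would exceed name length limits.
  name: worker-us-east-1a-surge-7f3a2
  namespace: openshift-machine-api
  labels:
    # Hypothetical label marking this MachineSet as MCO-created for a surge operation.
    machineconfiguration.openshift.io/surge: "true"
spec:
  # 0 when a ClusterAutoscaler exists (a MachineAutoscaler scales it on demand);
  # otherwise set directly to the pool's MaxSurge value.
  replicas: 0
  # selector and template are copied from the original MachineSet (not shown).
```
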
##### Surge With ClusterAutoscaler

If `ClusterAutoscaler` exists, for each surge `MachineSet`, an associated `MachineAutoscaler` will be instantiated with its
minimum replica value set to 0 and its maximum replica count set to `MaxSurge`. The `MachineAutoscaler` instance will also be labeled to indicate it
was created programmatically for the surge procedure.

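A sketch of the `MachineAutoscaler` that might accompany the surge `MachineSet` above. The `MachineAutoscaler` API shown is the existing OpenShift resource; the name and label are the same illustrative assumptions used in the previous sketch:

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-us-east-1a-surge-7f3a2
  namespace: openshift-machine-api
  labels:
    # Hypothetical label marking this resource as MCO-created for the surge procedure.
    machineconfiguration.openshift.io/surge: "true"
spec:
  # Scale from zero up to the MachineConfigPool's MaxSurge value (4 in the earlier example).
  minReplicas: 0
  maxReplicas: 4
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: worker-us-east-1a-surge-7f3a2
```
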
As nodes are drained, any unschedulable pods will cause the `ClusterAutoscaler` to scale an appropriate surge
`MachineSet` to supply the necessary capacity.

##### Surge Without ClusterAutoscaler

If the `ClusterAutoscaler` does not exist, `MachineAutoscalers` will not work. Instead, the surge `MachineSets` will have
their replica count set to `MaxSurge`. This is a less efficient use of cloud resources, so customer facing documentation
should suggest the use of `ClusterAutoscaler` when a surge strategy is being used.

##### Surge Teardown

Once a `MachineConfigPool` has consistent, up-to-date machines associated with it, the surge `MachineSet` and
(optional) `MachineAutoscaler` resources will be deleted. This will cause the nodes created for the surge to be
drained. This drain should obey the `NodeDrainTimeout` set in the `MachineConfigPool`.

#### Node Drain Timeout

When `NodeDrainTimeout` is non-zero, a normal cordon and drain should be attempted. However, if the duration of the attempt
surpasses `NodeDrainTimeout`, the node can be forcibly terminated.

### Risks and Mitigations

Service Delivery believes this enhancement is key to dramatically simplifying the MUO in conjunction with
https://github.com/openshift/enhancements/pull/1571. Without this enhancement,
https://github.com/openshift/enhancements/pull/1571 may not be useful to Service Delivery.

### Drawbacks

The primary drawback is that alternative priorities are not pursued or that the investment is not ultimately
warranted by the proposed business value.

### Removing a deprecated feature

- `MachineConfigPool.spec.maxUnavailable` will be deprecated.

## Upgrade / Downgrade Strategy

This feature is integral to standalone updates. Preceding sections discuss its behavior.

## Version Skew Strategy

N/A.

## Operational Aspects of API Extensions

The new stanzas are specifically designed to be tools used to improve update predictability
and reliability for operations teams. Preceding sections discuss their behavior.

## Support Procedures

The machine-api-operator logs will indicate the decisions being made to actuate the new configuration
fields. If machines are scaled into the cluster during an update but are unable to successfully join
the cluster, this scenario is debugged just as if the problem occurred during normal scaling operations.

## Alternatives

1. The status quo of standalone updates and the MUO can be maintained. We can assume
   that customers impacted by the existing operational burden of drain timeouts will
   find their own solutions or migrate to HCP.
2. Aspects of the MUO could be incorporated into the OpenShift core. Unfortunately, the MUO
   is deeply integrated into SD's architecture and is not easily productized.