- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes adding a declarative API to manage node maintenance. This API can be used to implement additional capabilities around node draining.
The goal of this KEP is to analyze and improve node maintenance in Kubernetes.
Node maintenance is a request from a cluster administrator to remove all pods from a node(s) so that it can be disconnected from the cluster to perform a software upgrade (OS, kubelet, etc.), hardware or firmware upgrade, or simply to remove the node as it is no longer needed.
Kubernetes has existing support for this use case in the following way with `kubectl drain`:
- There are running pods on node A, some of which are protected with PodDisruptionBudgets (PDB).
- Set the node `Unschedulable` (cordon) to prevent new pods from being scheduled there.
- Evict (default behavior) pods from node A by using the eviction API (see the kubectl drain workflow).
- Proceed with the maintenance and shut down or restart the node.
- On platforms and nodes that support it, the kubelet will try to detect the imminent shutdown and
then attempt to perform a Graceful Node Shutdown:
- delay the shutdown pending graceful termination of remaining pods
- terminate remaining pods in reverse priority order (see pod-priority-graceful-node-shutdown)
The main problem is that the current approach tries to solve this in an application-agnostic way and simply attempts to remove all the pods currently running on the node. Since this approach cannot be applied generically to all pods, the Kubernetes project has defined special drain filters that either skip certain groups of pods or require the admin's explicit consent to skip or delete them. This means that, without knowledge of all the underlying applications on the cluster, the admin has to make a potentially harmful decision.
From an application owner or developer perspective, the only standard tool they have is a PodDisruptionBudget (PDB). This is sufficient in a basic scenario with a simple multi-replica application. The edge-case applications where this does not work are very important to the cluster admin, as they can block the node drain; and, in turn, to the application owner, as the admin may then override the pod disruption budget and disrupt their sensitive application anyway.
List of cases where the current solution is not optimal:
- Without extra manual effort, an application running with a single replica has to settle for
  experiencing application downtime during the node drain. They cannot use PDBs with
  `minAvailable: 1` or `maxUnavailable: 0`, or they will block node maintenance. Not every user
  needs high availability either, due to a preference for a simpler deployment model, lack of
  application support for HA, or to minimize compute costs. Also, any automated solution needs to
  edit the PDB to account for the additional pod that needs to be spun up to move the workload from
  one node to another. This has been discussed in the issues kubernetes/kubernetes#66811 and
  kubernetes/kubernetes#114877.
- Similar to the first point, it is difficult to use PDBs for applications that can have a variable
  number of pods; for example, applications with a configured horizontal pod autoscaler (HPA).
  These applications cannot be disrupted during a low load when they have only one pod. However, it
  is possible to disrupt the pods during a high load of the application (pods > 1) without
  experiencing application downtime. If the minimum number of pods is 1, PDBs cannot be used
  without blocking the node drain. This has been discussed in the issue kubernetes/kubernetes#93476.
- Graceful termination of DaemonSet pods is currently only supported on Linux as part of the
  Graceful Node Shutdown feature. The length of the shutdown is again not application specific and
  is set cluster-wide (optionally by priority) by the cluster admin. This only partially takes into
  account the `.spec.terminationGracePeriodSeconds` of each pod and may cause premature termination
  of the application. This has been discussed in the issues kubernetes/kubernetes#75482 and
  kubernetes-sigs/cluster-api#6158.
- There are cases during a node shutdown when data corruption can occur due to premature node
  shutdown. It would be great if applications could perform data migration and synchronization of
  cached writes to the underlying storage before the pod deletion occurs. This is not easy to
  quantify even with the pod's `.spec.shutdownGracePeriod`, as the time depends on the size of the
  data and the speed of the storage. This has been discussed in the issues
  kubernetes/kubernetes#116618 and kubernetes/kubernetes#115148.
- During the Graceful Node Shutdown, the kubelet terminates the pods in order of their priority.
  The DaemonSet controller runs its own scheduling logic and creates the pods again. This causes a
  race. Such pods should be removed and not recreated, but higher priority pods that have not yet
  been terminated should be recreated if they are missing. This has been discussed in the issue
  kubernetes/kubernetes#122912.
- The Graceful Node Shutdown feature is not always reliable. If Dbus or the kubelet is restarted during the shutdown, pods may be ungracefully terminated, leading to application disruption and data loss. New applications can get scheduled on such a node, which can also be harmful. This has been discussed in the issues kubernetes/kubernetes#122674 and kubernetes/kubernetes#120613.
- There is no way to gracefully terminate static pods during a node shutdown kubernetes/kubernetes#122674, and the lifecycle/termination is not clearly defined for static pods kubernetes/kubernetes#16627.
- Different pod termination mechanisms are not synchronized with each other. So, for example, the taint manager may prematurely terminate pods that are currently under Graceful Node Shutdown. This can also happen with other mechanisms (e.g., different types of evictions). This has been discussed in the issues kubernetes/kubernetes#124448 and kubernetes/kubernetes#72129.
- There is not enough metadata about why the node drain was requested or why the pods are terminating. This has been discussed in the issue kubernetes/kubernetes#30586 and in the issue kubernetes/kubernetes#116965.
Approaches and workarounds used by other projects to deal with these shortcomings:
- https://github.com/medik8s/node-maintenance-operator uses a declarative approach that tries to
  mimic `kubectl drain` (and uses the kubectl implementation under the hood).
- https://github.com/kubereboot/kured performs automatic node reboots and relies on the
  `kubectl drain` implementation to achieve that.
- https://github.com/strimzi/drain-cleaner prevents Kafka or ZooKeeper pods from being drained
  until they are fully synchronized. Implemented by intercepting eviction requests with a
  validating admission webhook. The synchronization is also protected by a PDB with the
  `.spec.maxUnavailable` field set to 0. See the experience reports for more information.
- https://github.com/kubevirt/kubevirt intercepts eviction requests with a validating admission
  webhook to block eviction and to start a virtual machine live migration from one node to another.
  Normally, the workload is also guarded by a PDB with the `.spec.minAvailable` field set to 1.
  During the migration the value is increased to 2.
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler has an eviction process
  that takes inspiration from kubectl and builds additional logic on top of it. See Cluster
  Autoscaler for more details.
- https://github.com/kubernetes-sigs/karpenter taints the node during the node drain. It then attempts to evict all the pods on the node by calling the Eviction API. It prioritizes non-critical pods and non-DaemonSet pods.
- https://github.com/aws/aws-node-termination-handler watches for a predefined set of events (spot
  instance termination, EC2 termination, etc.), then cordons and drains the node. It relies on the
  `kubectl` implementation.
- https://github.com/openshift/machine-config-operator updates/drains nodes by using a cordon and
  relies on the `kubectl drain` implementation.
Experience Reports:
- Federico Valeri, Drain Cleaner: What's this?, Sep 24, 2021, description of the use case and implementation of drain cleaner
- Tommer Amber, Solution!! Avoid Kubernetes/Openshift Node Drain Failure due to active PodDisruptionBudget, Apr 30, 2022 -
  the user is unhappy about the manual intervention required to perform node maintenance and gives
  the unfortunate advice to cluster admins to simply override the PDBs. This can have negative
  consequences for user applications, including data loss. It also discourages the use of PDBs.
  We have also seen interest in the issue kubernetes/kubernetes#83307 for overriding evictions,
  which led to the addition of the `--disable-eviction` flag to `kubectl drain`. There are other
  examples of this approach on the web.
- Kevin Reeuwijk, How to handle blocking PodDisruptionBudgets on K8s with distributed storage, June 6, 2022 -
  a simple shell script example of how to drain the node in a safer way. It does a normal eviction,
  then looks for a pet application (Rook-Ceph in this case) and does a hard delete if it does not
  see it. This approach is not plagued by the loss of data resiliency, but it does require
  maintaining a list of pet applications, which can be prone to mistakes. In the end, the cluster
  admin has to do the job of the application maintainer.
- Artur Rodrigues, Impossible Kubernetes node drains, 30 Mar, 2023 - discusses the problem with
  node drains and offers a workaround to restart the application without the application owner's
  consent, but acknowledges that this may be problematic without knowledge of the application.
- Jack Roper, How to Delete Pods from a Kubernetes Node with Examples, 05 Jul, 2023 - also
  discusses the problem of blocking PDBs and offers several workarounds. Similar to others, it
  offers force deletion, but also a less destructive method of scaling up the application.
  However, this also interferes with the application deployment and has to be supported by the
  application.
The Cluster Autoscaler accepts a `drain-priority-config` option, which is similar to Graceful Node
Shutdown in that it gives each priority a shutdown grace period. It also has a
`max-graceful-termination-sec` option for pod termination and a `max-pod-eviction-time` option
after which the eviction is forfeited.
Each pod is first analyzed to see if it is drainable. Part of the logic is similar to kubectl and its drain filters (see Cluster Autoscaler rules):
- Mirror pods are skipped.
- Terminating pods are skipped.
- Pods and ReplicaSets/ReplicationControllers without owning controllers are blocking by default
  (the check can be modified with the `skip-nodes-with-custom-controller-pods` option).
- System pods (in the `kube-system` namespace) without a matching PDB are blocking by default (the
  check can be modified with the `skip-nodes-with-system-pods` option).
- Pods with the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation are blocking.
- Pods with local storage are blocking unless they have a
  `cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes` annotation (the check can be
  modified with the `skip-nodes-with-local-storage` option; e.g., this check is skipped on AKS).
- Pods with PDBs that do not have `disruptionsAllowed` are blocking.
This can be enhanced with other rules and overrides.
It uses this logic to first check if all pods can be removed from a node. If not, it will report those nodes. Then it will group all pods by a priority and evict them gradually from the lowest to highest priority. This may include DaemonSet pods.
Graceful Node Shutdown is a part of the current solution for node maintenance. Unfortunately, it is not possible to rely solely on this feature as a go-to solution for graceful node and workload termination.
- The Graceful Node Shutdown feature is not application aware and may prematurely disrupt workloads and lead to data loss.
- The kubelet controls the shutdown process using Dbus and systemd, and can delay (but not entirely block) it using the systemd inhibitor. However, if Dbus or the kubelet is restarted during the node shutdown, the shutdown might not be registered again, and pods might be terminated ungracefully. Also, new workloads can get scheduled on the node while the node is shutting down. Cluster admins should, therefore, plan the maintenance in advance and ensure that pods are gracefully removed before attempting to shut down or restart the machine.
- The kubelet has no way of reliably detecting ongoing maintenance if the node is restarted in the meantime.
- Graceful termination of static pods during a shutdown is not possible today. It is also not currently possible to prevent them from starting back up immediately after the machine has been restarted and the kubelet has started again, if the node is still under maintenance.
To sum up: some applications solve the disruption problem by introducing validating admission webhooks. This has some drawbacks. The webhooks are not easily discoverable by cluster admins, and they can block evictions for other applications if they are misconfigured or misbehave. The eviction API is not intended to be extensible in this way, so the webhook approach is not recommended.
Some drainers solve the node drain by depending on the kubectl logic, or by extending/rewriting it with additional rules and logic.
As seen in the experience reports and GitHub issues, some admins solve their problems by simply ignoring PDBs, which can cause unnecessary disruptions or data loss. Others solve this by manipulating the application deployment, but they have to make sure that the application supports this.
The kubelet's Graceful Node Shutdown feature is a best-effort solution for unplanned shutdowns, but it is not sufficient to ensure application and data safety.
- Introduce NodeMaintenance API.
- Introduce a node maintenance controller that creates evacuations.
- Deprecate kubectl drain in favor of NodeMaintenance.
- Implement the NodeMaintenanceAwareKubelet feature to support a lifecycle for static pods during a maintenance.
- Implement the NodeMaintenanceAwareDaemonSet feature to prevent the scheduling of DaemonSet pods on nodes during a maintenance.
- Introduce a node maintenance period, nodeDrainTimeout (similar to cluster-api nodeDrainTimeout) or a TTL optional field as an upper bound on the duration of node maintenance. Then the node maintenance would be garbage collected and the node made schedulable again.
- Solve the node lifecycle management or automatic shutdown after the node drain is completed. Implementation of this is better suited for other cluster components and actors who can use the node maintenance as a building block to achieve their desired goals.
- Synchronize all pod termination mechanisms (see #8 in the Motivation section), so that they do not terminate pods under NodeMaintenance/Evacuation.
Most of these issues stem from the lack of a standardized way of detecting the start of a node
drain. This KEP proposes the introduction of a NodeMaintenance object that would signal an intent
to gracefully remove pods from given nodes. The intent will be implemented by the newly proposed
Evacuation API KEP, which ensures graceful pod removal or migration, the ability to measure
progress, and a fallback to eviction if progress is lost. The NodeMaintenance implementation should
also utilize the existing node's `.spec.unschedulable` field, which prevents new pods from being
scheduled on such a node.
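For reference, cordoning at the API level means setting this existing field on the Node object; the node name below is illustrative:

apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  # Reconciled by the node maintenance controller during the Cordon and Drain stages;
  # prevents new pods from being scheduled on the node.
  unschedulable: true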
We will deprecate `kubectl drain` as the main mechanism for draining nodes and drive the whole
process via a declarative API. This API can be used either manually or programmatically by other
drain implementations (e.g., cluster autoscalers).
To support workload migration, a new controller should be introduced to observe the NodeMaintenance
objects and then select pods for evacuation. The pods should be selected by node (`nodeSelector`)
and gradually evacuated according to the workload they are running.
Controllers can then implement the migration/termination either by reacting to the Evacuation API
or by reacting to the NodeMaintenance API if they need more details.
As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without any required manual interventions. I want to have the ability to manually switch between the maintenance phases (Planning, Cordon, Drain, Drain Complete, Maintenance Complete). I also want to observe the node drain via the API and check on its progress. I also want to be able to discover workloads that are blocking the node drain.
As an application owner, I want to run single replica applications without disruptions and have the ability to easily migrate the workload pods from one node to another. This also applies to applications with a larger number of replicas that prefer to surge (upscale) pods first rather than downscale.
Cluster or node autoscalers that take on the role of `kubectl drain` want to signal the intent to
drain a node using the same API and provide a similar experience to the CLI counterpart.
- This KEP depends on Evacuation API KEP.
A misconfigured .spec.nodeSelector could select all the nodes (or just all master nodes) in the cluster. This can cause the cluster to get into a degraded and unrecoverable state.
An admission plugin (NodeMaintenance Admission) is introduced to issue a warning in this scenario.
`kubectl drain`: as we can see in the Motivation section, kubectl is heavily used either manually
or as a library by other projects. It is safer to keep the old behavior of this command. However,
we will deprecate it along with all the library functions. We can print a deprecation warning when
this command is used and promote the NodeMaintenance. Additionally, pods that support evacuation
and have `evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS}` annotations will block
eviction requests.
The `kubectl cordon` and `kubectl uncordon` commands will be enhanced to warn the user if making a
node un/schedulable collides with an existing NodeMaintenance object. As a consequence, the node
maintenance controller will reconcile the node back to the old value. Because of this, we can make
these commands a no-op when the node is under a NodeMaintenance.
NodeMaintenance objects serve as an intent to remove or migrate pods from a set of nodes. We will include Cordon and Drain toggles to support the following states/stages of the maintenance:
- Planning: this is to let the users know that maintenance will be performed on a particular set of
  nodes in the future. Configured with `.spec.stage=Idle`.
- Cordon: stop accepting (scheduling) new pods. Configured with `.spec.stage=Cordon`.
- Drain: gives an intent to drain all selected nodes by creating Evacuation objects for the node's
  pods. Configured with `.spec.stage=Drain`.
- Drain Complete: all targeted pods have been drained from all the selected nodes. The nodes can be
  upgraded, restarted, or shut down. The configuration is still kept at `.spec.stage=Drain` and the
  `Drained` condition is set to `"True"` on the node maintenance object.
- Maintenance Complete: make the nodes schedulable again once the node maintenance is done.
  Configured with `.spec.stage=Complete`.
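For illustration, a NodeMaintenance that drains the nodes of a single zone might look as follows; the object name, label key, and values are illustrative, and the `nodeSelector` serialization assumes the core `v1.NodeSelector` format:

apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
  name: os-upgrade-zone-a
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
  # Start with Cordon (or Idle for planning) and later update the stage;
  # setting Drain directly also cordons the selected nodes.
  stage: Drain
  reason: "OS upgrade"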
// +enum
type NodeMaintenanceStage string
const (
// Idle does not interact with the cluster.
Idle NodeMaintenanceStage = "Idle"
// Cordon cordons all selected nodes by making them unschedulable.
Cordon NodeMaintenanceStage = "Cordon"
// Drain:
// 1. Cordons all selected nodes by making them unschedulable.
// 2. Gives an intent to drain all selected nodes by creating Evacuation objects for the
// node's pods.
Drain NodeMaintenanceStage = "Drain"
// Complete:
// 1. Removes all Evacuation objects requested by this NodeMaintenance.
// 2. Uncordons all selected nodes by making them schedulable again, unless there is another
//    maintenance in progress.
Complete NodeMaintenanceStage = "Complete"
)
type NodeMaintenance struct {
...
Spec NodeMaintenanceSpec
Status NodeMaintenanceStatus
}
type NodeMaintenanceSpec struct {
// NodeSelector selects nodes for this node maintenance.
// +required
NodeSelector *v1.NodeSelector
// The order of the stages is Idle -> Cordon -> Drain -> Complete.
//
// - The Cordon or Drain stage can be skipped by setting the stage to Complete.
// - The NodeMaintenance object is moved to the Complete stage on deletion unless the Idle stage has been set.
//
// The default value is Idle.
Stage NodeMaintenanceStage
// DrainPlan is executed from the first entry to the last entry during the Drain stage.
// DrainPlanEntry podType fields should be in the following order:
// nil -> DaemonSet -> Static
// DrainPlanEntry priority fields should be in ascending order for each podType.
// If the priority and podType are the same, concrete selectors are executed first.
//
// The following entries are injected into the drainPlan on the NodeMaintenance admission:
// - podPriority: 1000000000 // highest priority for user defined priority classes
// podType: "Default"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "Default"
// - podPriority: 2000001000 // system-node-critical priority class
// podType: "Default"
// - podPriority: 2147483647 // maximum value
// podType: "Default"
// - podPriority: 1000000000 // highest priority for user defined priority classes
// podType: "DaemonSet"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "DaemonSet"
// - podPriority: 2000001000 // system-node-critical priority class
// podType: "DaemonSet"
// - podPriority: 2147483647 // maximum value
// podType: "DaemonSet"
// - podPriority: 1000000000 // highest priority for user defined priority classes
// podType: "Static"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "Static"
// - podPriority: 2000001000 // system-node-critical priority class
// podType: "Static"
// - podPriority: 2147483647 // maximum value
// podType: "Static"
//
// Duplicate entries are not allowed.
// This field is immutable.
DrainPlan []DrainPlanEntry
// Reason for the maintenance.
Reason string
}
// +enum
type PodType string

const (
// Default selects all pods except DaemonSet and Static pods.
Default PodType = "Default"
// DaemonSet selects DaemonSet pods.
DaemonSet PodType = "DaemonSet"
// Static selects static pods.
Static PodType = "Static"
)
type DrainPlanEntry struct {
// PodSelector selects pods according to their labels.
// This can help to select which pods of the same priority should be evacuated first.
// +optional
PodSelector *metav1.LabelSelector
// PodPriority specifies a pod priority.
// Pods with a priority less than or equal to this value are selected.
PodPriority int32
// PodType selects pods according to the pod type:
// - Default selects all pods except DaemonSet and Static pods.
// - DaemonSet selects DaemonSet pods.
// - Static selects static pods.
PodType PodType
}
type NodeMaintenanceStatus struct {
// StageStatuses tracks the statuses of started stages.
StageStatuses []StageStatus
// List of maintenance statuses for all nodes targeted by this maintenance.
NodeStatuses []NodeMaintenanceNodeStatus
Conditions []metav1.Condition
}
type StageStatus struct {
// Name of the Stage.
Name NodeMaintenanceStage
// StartTimestamp is the time that indicates the start of this stage.
StartTimestamp *metav1.Time
}
type NodeReference struct {
// Name of the node.
Name string
}
type NodeMaintenanceNodeStatus struct {
// NodeRef identifies a Node.
NodeRef NodeReference
// DrainTargets specifies which pods on this node are currently being targeted for evacuation.
// Once evacuation of the Default PodType finishes, DaemonSet PodType entries appear.
// Once the evacuation of DaemonSet PodType finishes, Static PodType entries appear.
// The PodPriority for these entries is increased over time according to the .spec.DrainPlan
// as the lower-priority pods finish evacuation.
// The next entry in the .spec.DrainPlan is selected once all the nodes have reached their
// DrainTargets.
// If there are multiple NodeMaintenances for a node, the least powerful DrainTargets among
// them are selected and set for all the NodeMaintenances for that node. Thus, the DrainTargets
// do not have to correspond to the entries in .spec.drainPlan for a single NodeMaintenance
// instance.
// DrainTargets cannot backtrack and will target more pods with each update until all pods on
// the node are targeted.
DrainTargets []DrainPlanEntry
// DrainMessage may specify a state of the drain on this node and a reason why the drain
// targets are set to particular values.
DrainMessage string
// Number of pods that have not yet started evacuation.
PodsPendingEvacuation int32
// Number of pods that have started evacuation and have a matching Evacuation object.
PodsEvacuating int32
}
const (
// DrainedCondition is a condition set by the node-maintenance controller that signals
// whether all pods pending termination have terminated on all target nodes when drain is
// requested by the maintenance object.
DrainedCondition = "Drained"
)
A `nodemaintenance` admission plugin will be introduced.
It will validate all incoming requests for CREATE, UPDATE, and DELETE operations on the
NodeMaintenance objects. All nodes matching the `.spec.nodeSelector` must pass an authorization
check for the DELETE operation.
Also, if the `.spec.nodeSelector` matches all cluster nodes, a warning will be produced indicating
that the cluster may get into a degraded and unrecoverable state. The warning is non-blocking, and
such a NodeMaintenance is still valid and can proceed.
A node maintenance controller will be introduced and added to `kube-controller-manager`. It will
observe NodeMaintenance objects and have the following main features:
The controller should not touch the pods or nodes that match the selector of the NodeMaintenance
object in any way in the `Idle` stage.
When a stage is not `Idle`, a `nodemaintenance.k8s.io/maintenance-completion` finalizer is placed
on the NodeMaintenance object to ensure uncordon and removal of Evacuations upon deletion.
When a deletion of the NodeMaintenance object is detected, its `.spec.stage` is set to `Complete`.
The finalizer is not removed until the `Complete` stage has been completed.
When a `Cordon` or `Drain` stage is detected on the NodeMaintenance object, the controller will set
(and reconcile) `.spec.unschedulable` to `true` on all nodes that satisfy `.spec.nodeSelector`. It
should alert via events if too many updates occur and a race to change this field is detected.
When a `Complete` stage is detected on the NodeMaintenance object, the controller sets
`.spec.unschedulable` back to `false` on all nodes that satisfy `.spec.nodeSelector`, unless there
is another maintenance in progress for those nodes.
When the node maintenance is canceled (reaches the `Complete` stage without all of its pods
terminating), the controller will attempt to remove all Evacuations that match the node
maintenance, unless they are still required by another maintenance in progress.
- If there are foreign finalizers on the Evacuation, it should only remove its own instigator
  finalizer (see Drain).
- If the evacuator does not support cancellation and has set `.status.evacuationCancellationPolicy`
  to `Forbid`, deletion of the Evacuation object will not be attempted.
Consequences for pods:
- Pods whose evacuators have not yet initiated evacuation will continue to run unchanged.
- Pods whose evacuators have initiated evacuation and support cancellation
  (`.status.evacuationCancellationPolicy=Allow`) should cancel the evacuation and keep the pods
  available.
- Pods whose evacuators have initiated evacuation and do not support cancellation
  (`.status.evacuationCancellationPolicy=Forbid`) should continue the evacuation and eventually
  terminate the pods.
When a `Drain` stage is detected on the NodeMaintenance object, Evacuation objects are created for
selected pods (Pod Selection).
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
name: f5823a89-e03f-4752-b013-445643b8c7a0-muffin-orders-6b59d9cb88-ks7wb
namespace: blue-deployment
spec:
podRef:
name: muffin-orders-6b59d9cb88-ks7wb
uid: f5823a89-e03f-4752-b013-445643b8c7a0
progressDeadlineSeconds: 1800
On admission, this is resolved according to the pod into the following Evacuation object:
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
labels:
app: muffin-orders
name: f5823a89-e03f-4752-b013-445643b8c7a0-muffin-orders-6b59d9cb88-ks7wb
namespace: blue-deployment
spec:
podRef:
name: muffin-orders-6b59d9cb88-ks7wb
uid: f5823a89-e03f-4752-b013-445643b8c7a0
progressDeadlineSeconds: 1800
evacuators:
- evacuatorClass: deployment.apps.k8s.io
priority: 10000
role: controller
The node maintenance controller requests the removal of a pod from a node by the presence of the
Evacuation. Setting `progressDeadlineSeconds` to 1800 (30m) should give potential evacuators enough
time to recover from a disruption and continue with the graceful evacuation. If the evacuators are
unable to evacuate the pod, or if there are no evacuators, the evacuation controller will attempt
to evict these pods until they are deleted.
The only job of the node maintenance controller is to make sure that the Evacuation object exists
and has the `evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io` finalizer.
The pods for evacuation would first be selected by node (`.spec.nodeSelector`). NodeMaintenance
should eventually remove all the pods from each node. To do this in a graceful manner, the
controller will ensure that lower priority pods are evacuated first for the same pod type. The user
can also target some pods earlier than others with a label selector.
DaemonSet and static pods typically run critical workloads that should be scaled down last.
<<[UNRESOLVED Pod Selection Priority]>> Should user daemon sets (priority up to 1000000000) be scaled down first? <<[/UNRESOLVED]>>
To achieve this, we will ensure that the NodeMaintenance `.spec.drainPlan` always contains the
following entries:
spec:
drainPlan:
- podPriority: 1000000000 # highest priority for user defined priority classes
podType: "Default"
- podPriority: 2000000000 # system-cluster-critical priority class
podType: "Default"
- podPriority: 2000001000 # system-node-critical priority class
podType: "Default"
- podPriority: 2147483647 # maximum value
podType: "Default"
- podPriority: 1000000000 # highest priority for user defined priority classes
podType: "DaemonSet"
- podPriority: 2000000000 # system-cluster-critical priority class
podType: "DaemonSet"
- podPriority: 2000001000 # system-node-critical priority class
podType: "DaemonSet"
- podPriority: 2147483647 # maximum value
podType: "DaemonSet"
- podPriority: 1000000000 # highest priority for user defined priority classes
podType: "Static"
- podPriority: 2000000000 # system-cluster-critical priority class
podType: "Static"
- podPriority: 2000001000 # system-node-critical priority class
podType: "Static"
- podPriority: 2147483647 # maximum value
podType: "Static"
...
If not, they will be added during the NodeMaintenance admission.
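For illustration, a user could submit only their own entries and let the admission plugin inject the defaults; the priorities and the selector below are illustrative:

spec:
  drainPlan:
  # User-supplied: evacuate regular pods with priority <= 1000 first.
  - podPriority: 1000
    podType: "Default"
  # User-supplied: then also target the postgres pods up to priority 2000.
  - podPriority: 2000
    podType: "Default"
    podSelector:
      matchLabels:
        app: postgres
  # The default entries listed above (priorities 1000000000, 2000000000, 2000001000, and
  # 2147483647 for the Default, DaemonSet, and Static pod types) are injected on admission,
  # so the plan always ends by removing every remaining pod from the node.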
The node maintenance controller resolves this plan across intersecting NodeMaintenances. To
indicate which pods are being evacuated on which node, the controller populates
`.status.nodeStatuses[0].drainTargets`. This status field is updated during the `Drain` stage to
incrementally select pods with higher priority and pod type (`Default` -> `DaemonSet` -> `Static`).
It is also possible to partition the updates for the same priorities according to the pod labels.
If there is only a single NodeMaintenance present, it selects the first entry from the
`.spec.drainPlan` and makes sure that all the targeted pods are evacuated/removed. It then selects
the next entry and repeats the process. If a new pod appears that matches the previous entries, it
will also be evacuated.
If there are multiple NodeMaintenances, we have to first resolve the lowest priority entry from the
`.spec.drainPlan` among them for the intersecting nodes. Non-intersecting nodes may have a higher
priority or pod type. The next entry in the plan can be selected once all the nodes of a
NodeMaintenance have finished evacuation and all the NodeMaintenances of intersecting nodes have
finished evacuation for the current drain targets. See the Pod Selection and DrainTargets Example
for additional details.
A similar kind of drain plan, albeit with fewer features, is offered today by the Graceful Node
Shutdown feature and by the Cluster Autoscaler's `drain-priority-config`. The downside of these
configurations is that they have a `shutdownGracePeriodSeconds`, which sets a limit on how long the
termination of pods should take. This is not application-aware, and some applications may require
more time to gracefully shut down. Allowing such hard-coded timeouts may result in unnecessary
application disruptions or data corruption.
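For comparison, the existing time-bounded kubelet configuration looks roughly like this; the priority bands and grace periods below are illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriodByPodPriority:
# Critical pods (priority >= 2000000000) get at most 20 seconds to terminate.
- priority: 2000000000
  shutdownGracePeriodSeconds: 20
# All other pods (priority >= 0) get at most 120 seconds to terminate.
- priority: 0
  shutdownGracePeriodSeconds: 120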
To support the evacuation of `DaemonSet` and `Static` pods, the DaemonSet controller and the
kubelet should observe NodeMaintenance objects and Evacuations to coordinate the scale down of the
pods on the targeted nodes.
To ensure a more streamlined experience, we will not support the default kubectl drain filters.
Instead, it should be possible to create the NodeMaintenance object with just a
`spec.nodeSelector` (see the minimal example after the following list). The only thing that can be
configured is which pods should be scaled down first.
NodeMaintenance alternatives to kubectl drain filters:
- `daemonSetFilter`: Removal of these pods should be supported by the DaemonSet controller.
- `mirrorPodFilter`: Removal of these pods should be supported by the kubelet.
- `skipDeletedFilter`: Creating an evacuation of already terminating pods should have no downside
  and be informative for the user.
- `unreplicatedFilter`: Actors who own pods without a controller owner reference should have the
  opportunity to register an evacuator to evacuate their pods. Many drain solutions today evict
  these types of pods indiscriminately.
- `localStorageFilter`: Actors who own pods with local storage (having `EmptyDir` volumes) should
  have the opportunity to register an evacuator to evacuate their pods. Many drain solutions today
  evict these types of pods indiscriminately.
If two NodeMaintenances are created at the same time for the same node, then, for the intersecting nodes, the entry with the lowest priority in the drainPlan is resolved first.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-a"
...
spec:
stage: Drain
drainPlan:
- podPriority: 5000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 3000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 10
drainMessage: "Evacuating"
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: two
podsPendingEvacuation: 30
podsEvacuating: 7
drainMessage: "Evacuating"
drainTargets:
- podPriority: 5000
podType: Default
...
---
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 10
drainMessage: "Evacuating (limited by maintenance-a)"
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 25
drainMessage: "Evacuating"
drainTargets:
- podPriority: 10000
podType: Default
...
If node three finishes draining, it has to wait for node one, because the drain plan specifies that all the pods with priority 10000 or lower should be evacuated first before moving on to the next entry.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 5
drainMessage: "Evacuating (limited by maintenance-a)"
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 0
drainMessage: "Waiting for node one."
drainTargets:
- podPriority: 10000
podType: Default
...
If node one is drained, we still have to wait for `maintenance-a` to drain node two. If we were to
start evacuating higher priority pods from node one earlier, we would not conform to the drainPlan
of `maintenance-a`. The plan specifies that all the pods with priority 5000 or lower should be
evacuated first before moving on to the next entry.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-a"
...
spec:
stage: Drain
drainPlan:
- podPriority: 5000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 3000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 0
drainMessage: "Waiting for node two."
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: two
podsPendingEvacuation: 30
podsEvacuating: 2
drainMessage: "Evacuating"
drainTargets:
- podPriority: 5000
podType: Default
...
---
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 0
drainMessage: "Waiting for node two (maintenance-a)."
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 0
drainMessage: "Waiting for node two (maintenance-a)."
drainTargets:
- podPriority: 10000
podType: Default
...
Once node two drains, we can increment the drainTargets.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-a"
...
spec:
stage: Drain
drainPlan:
- podPriority: 5000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 3000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 70
podsEvacuating: 30
drainMessage: "Evacuating (limited by maintenance-b)"
drainTargets:
- podPriority: 10000
podType: Default
- nodeRef:
name: two
podsPendingEvacuation: 21
podsEvacuating: 9
drainMessage: "Evacuating"
drainTargets:
- podPriority: 15000
podType: Default
...
---
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 70
podsEvacuating: 30
drainMessage: "Evacuating"
drainTargets:
- podPriority: 10000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 0
drainMessage: "Waiting for node one."
drainTargets:
- podPriority: 10000
podType: Default
...
The progress of the drain should not be backtracked. If an intersecting `maintenance-c` is created,
it will be fast-forwarded for node one regardless of its drainPlan.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-c"
...
spec:
stage: Drain
drainPlan:
- podPriority: 2000
podType: Default
- podPriority: 15000
podType: Default
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 70
podsEvacuating: 30
drainMessage: "Evacuating (fast-forwarded by older maintenance-b)"
drainTargets:
- podPriority: 10000
podType: Default
- nodeRef:
name: four
podsPendingEvacuation: 20
podsEvacuating: 5
drainMessage: "Evacuating"
drainTargets:
- podPriority: 2000
podType: Default
...
This is done to ensure that the pre-conditions of the older maintenances (`maintenance-a` and
`maintenance-b`) are not broken. When we remove workloads with priority 15000, our pre-condition is
that workloads with priority 5000 that might depend on these 15000 priority workloads are gone. If
we allow rescheduling of the lower priority pods, this assumption is broken.
Unfortunately, a similar precondition is broken for `maintenance-c`, so we can at least emit an
event saying that we are fast-forwarding `maintenance-c` due to existing older maintenance(s). In
the extreme scenario, node one may already be turned off, and creating a new maintenance that
assumes priority X pods are still running will not help to bring it back. Emitting an event would
help with observability and might help cluster admins better schedule node maintenances.
An example progression for the following drain plan might look as follows:
spec:
stage: Drain
drainPlan:
- podPriority: 1000
podType: Default
- podPriority: 2000
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 2147483647
podType: Default
- podPriority: 1000
podType: DaemonSet
- podPriority: 2147483647
podType: DaemonSet
- podPriority: 2147483647
podType: Static
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 1000
podType: Default
- podPriority: 1000
podType: Default
podSelector:
matchLabels:
app: postgres
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 1000
podType: Default
- podPriority: 2000
podType: Default
podSelector:
matchLabels:
app: postgres
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 1000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 2147483647
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 2147483647
podType: DaemonSet
- podPriority: 2147483647
podType: Static
...
The controller can show progress by reconciling:
- `.status.stageStatuses` should be amended when a new stage is selected. This is used to track
  which stages have been started. Additional metadata can be added to this struct in the future.
- `.status.nodeStatuses[0].drainTargets` should be updated during a `Drain` stage. The drain
  targets should be resolved according to Pod Selection and the Pod Selection and DrainTargets
  Example.
- `.status.nodeStatuses[0].drainMessage` should be updated during a `Drain` stage. The message
  should be resolved according to the Pod Selection and DrainTargets Example.
- `.status.nodeStatuses[0].podsPendingEvacuation`, to indicate how many pods are left to start
  evacuation from the first node.
- `.status.nodeStatuses[0].podsEvacuating`, to indicate how many pods are being evacuated from the
  first node. These are the pods that have matching Evacuation objects.
- To keep track of the entire maintenance, the controller will reconcile a `Drained` condition and
  set it to true if all pods pending evacuation/termination have terminated on all target nodes
  when drain is requested by the maintenance object.
- A NodeMaintenance condition or annotation can be set on the node object to advertise the current
  phase of the maintenance.
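For illustration, a NodeMaintenance status at the end of a successful drain might look as follows; the timestamps, node name, reason, and messages are illustrative:

status:
  stageStatuses:
  - name: Cordon
    startTimestamp: "2024-04-22T20:00:00Z"
  - name: Drain
    startTimestamp: "2024-04-22T20:05:00Z"
  nodeStatuses:
  - nodeRef:
      name: worker-3
    drainTargets:
    - podPriority: 2147483647
      podType: Default
    - podPriority: 2147483647
      podType: DaemonSet
    - podPriority: 2147483647
      podType: Static
    podsPendingEvacuation: 0
    podsEvacuating: 0
    drainMessage: "Drained"
  conditions:
  - type: Drained
    status: "True"
    reason: AllPodsTerminated
    message: "All pods pending termination have terminated on all target nodes."
    lastTransitionTime: "2024-04-22T20:35:00Z"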
The following transitions should be validated by the API server.
- Idle -> Deletion
- Planning a maintenance in the future and canceling/deleting it without any consequence.
- (Idle) -> Cordon -> (Complete) -> Deletion.
- Make a set of nodes unschedulable and then schedulable again.
- The Complete stage will always be run, even without specifying it.
- (Idle) -> (Cordon) -> Drain -> (Complete) -> Deletion.
- Make a set of nodes unschedulable, drain them, and then make them schedulable again.
- Cordon and Complete stages will always be run, even without specifying them.
- (Idle) -> Complete -> Deletion.
- Make a set of nodes schedulable.
The stage transitions are invoked either manually by the cluster admin or by a higher-level
controller. For a simple drain, the cluster admin can simply create the NodeMaintenance with
`stage: Drain` directly.
The DaemonSet workloads should be tied to the node lifecycle because they typically run critical
workloads where availability is paramount. Therefore, the DaemonSet controller should respond to
the Evacuation only if there is a NodeMaintenance happening on that node and the DaemonSet is in
the `drainTargets`. For example, if we observe the following NodeMaintenance:
apiVersion: v1alpha1
kind: NodeMaintenance
...
status:
nodeStatuses:
- nodeRef:
name: six
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 5000
podType: DaemonSet
...
To fulfil the Evacuation API, the DaemonSet controller should register itself as a controller evacuator. To do this, it should ensure that the following annotation is present on its own pods.
evacuation.coordination.k8s.io/priority_daemonset.apps.k8s.io: "10000/controller"
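For illustration, on a DaemonSet pod this could look as follows; the pod name and namespace are taken from the example further below, and the annotation value encodes the evacuator priority and role:

apiVersion: v1
kind: Pod
metadata:
  name: critical-ds-5nxjs
  namespace: critical-workloads
  annotations:
    # Registers the DaemonSet controller as a controller evacuator with priority 10000.
    evacuation.coordination.k8s.io/priority_daemonset.apps.k8s.io: "10000/controller"
  ...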
The controller should respond to the Evacuation object when it observes its own class
(`daemonset.apps.k8s.io`) in `.status.activeEvacuatorClass`.
For the above node maintenance, the controller should not react to Evacuations of DaemonSet pods
with a priority greater than 5000. This state should not normally occur, as Evacuation requests
should be coordinated with NodeMaintenance. If it does occur, we should not encourage this flow by
updating the `.status.activeEvacuatorCompleted` field, although it is required to update this
field for normal workloads.
If the DaemonSet pods have a priority equal to or less than 5000, the Evacuation status should be updated appropriately as follows, and the targeted pod should be deleted by the DaemonSet controller:
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
labels:
app: critical-ds
name: ae9b4bc6-e4ca-4f8e-962b-2d4459b1f684-critical-ds-5nxjs
namespace: critical-workloads
spec:
podRef:
name: critical-ds-5nxjs
uid: ae9b4bc6-e4ca-4f8e-962b-2d4459b1f684
progressDeadlineSeconds: 1800
evacuators:
- evacuatorClass: daemonset.apps.k8s.io
priority: 10000
role: controller
status:
activeEvacuatorClass: daemonset.apps.k8s.io
activeEvacuatorCompleted: false
evacuationProgressTimestamp: "2024-04-22T21:40:32Z"
expectedEvacuationFinishTime: "2024-04-22T21:41:32Z" # now + terminationGracePeriodSeconds
failedEvictionCounter: 0
message: "critical-ds is terminating the pod due to node maintenance (OS upgrade)."
conditions: []
Once the pod is terminated and removed from the node, it should not be re-scheduled on the node by the DaemonSet controller until the node maintenance is complete.
The current Graceful Node Shutdown feature has a couple of drawbacks when compared to NodeMaintenance:
- It is application agnostic as it only provides a static grace period before the shutdown based on priority. This does not always give the application enough time to react and can lead to data loss or application availability loss.
- The DaemonSet pods may be running important services (critical priority) that should be available even during part of the shutdown. The daemon set controller does not have the observability of the kubelet shutdown procedure and cannot infer which DaemonSets should stop running. The controller needs to know which DaemonSets should run on each node with which priorities and reconcile accordingly.
To support these use cases, we could introduce a new configuration option to the kubelet called
`preferNodeMaintenanceDuringGracefulShutdown`.
This would result in the following behavior:
When a shutdown is detected, the kubelet would create a NodeMaintenance object for that node. Then
it would block the shutdown indefinitely, until all the pods are terminated. The kubelet could pass
the priorities from the `shutdownGracePeriodByPodPriority` to the NodeMaintenance, just without the
`shutdownGracePeriodSeconds`. This would give applications a chance to react and gracefully leave
the node without a timeout. Pod Selection would ensure that user workloads are terminated first and
critical pods are terminated last.
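A sketch of such a kubelet configuration, assuming the proposed `preferNodeMaintenanceDuringGracefulShutdown` option is added next to the existing fields; the option does not exist today and the values are illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Proposed option (does not exist today): create a NodeMaintenance and block the shutdown
# instead of relying only on the time-bounded Graceful Node Shutdown.
preferNodeMaintenanceDuringGracefulShutdown: true
# Existing fields: only the priorities would be passed to the NodeMaintenance,
# without the shutdownGracePeriodSeconds.
shutdownGracePeriodByPodPriority:
- priority: 2000000000
  shutdownGracePeriodSeconds: 20
- priority: 0
  shutdownGracePeriodSeconds: 120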
By default, all user workloads will be asked to terminate at once. The Evacuation API ensures that an evacuator is selected or an eviction API is called. This should result in a fast start of a pod termination. NodeMaintenance could then be used even by spot instances.
The NodeMaintenance object should survive kubelet restarts, and the kubelet would always know if the node is under shutdown (maintenance). The cluster admin would have to remove the NodeMaintenance object after the node restart to indicate that the node is healthy and can run pods again. Admins are expected to deal with the lifecycle of planned NodeMaintenances, so reacting to the unplanned one should not be a big issue.
If there is no connection to the apiserver (apiserver down, network issues, etc.) and the NodeMaintenance object cannot be created, we would fall back to the original behavior of Graceful Node Shutdown feature. If the connection is restored, we would stop the Graceful Node Shutdown and proceed with the NodeMaintenance.
The NodeMaintenance would ensure that all pods are removed. This also includes the DaemonSet and static pods.
Currently, there is no standard solution for terminating static pods. We can advertise what state each node should be in, declaratively with NodeMaintenance. This can include static pods as well.
Since static pods usually run the most critical workloads, they should be terminated last according to Pod Selection.
Similar to DaemonSets, static pods should be tied to the node lifecycle because they typically run
critical workloads where availability is paramount. Therefore, the kubelet should respond to the
Evacuation only if there is a NodeMaintenance happening on that node and the `Static` pod is in the
`drainTargets`. For example, if we observe the following NodeMaintenance:
apiVersion: v1alpha1
kind: NodeMaintenance
...
status:
nodeStatuses:
- nodeRef:
name: six
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: DaemonSet
- podPriority: 7000
podType: Static
...
To fulfil the Evacuation API, the kubelet should register itself as a controller evacuator. To do this, it should ensure that the following annotation is present on its static pods.
evacuation.coordination.k8s.io/priority_kubelet.k8s.io: "10000/controller"
The kubelet should respond to the Evacuation object when it observes its own class
(`kubelet.k8s.io`) in `.status.activeEvacuatorClass`.
For the above node maintenance, the kubelet should not react to Evacuations of static pods with a
priority greater than 7000. This state should not normally occur, as Evacuation requests should be
coordinated with NodeMaintenance. If it does occur, we should not encourage this flow by updating
the `.status.activeEvacuatorCompleted` field, although it is required to update this field for
normal workloads.
If the static pods have a priority equal to or less than 7000, the Evacuation status should be updated appropriately as follows, and the targeted pod should be terminated by the kubelet:
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
labels:
app: critical-static-workload
name: 08deef1c-1838-42a5-a3a8-3a6d0558c7f9-critical-static-workload
namespace: critical-workloads
spec:
podRef:
name: critical-static-workload
uid: 08deef1c-1838-42a5-a3a8-3a6d0558c7f9
progressDeadlineSeconds: 1800
evacuators:
- evacuatorClass: kubelet.k8s.io
priority: 10000
role: controller
status:
activeEvacuatorClass: kubelet.k8s.io
activeEvacuatorCompleted: false
evacuationProgressTimestamp: "2024-04-22T22:10:05Z"
expectedEvacuationFinishTime: "2024-04-22T22:11:05Z" # now + terminationGracePeriodSeconds
failedEvictionCounter: 0
message: "critical-static-workload is terminating the pod due to node maintenance (OS upgrade)."
conditions: []
Once the pod is terminated and removed from the node, it should not be started on the node by the kubelet again until the node maintenance is complete.
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- `<package>`: `<date>` - `<test coverage>`
- :
- :
- Feature gate
  - Feature gate name: DeclarativeNodeMaintenance - this feature gate enables the NodeMaintenance
    API and the node maintenance controller, which creates `Evacuation` objects.
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
We could implement the NodeMaintenance or Evacuation API out-of-tree first as a CRD.
The KEP aims to solve graceful termination of any pod in the cluster. This is not possible with a 3rd party CRD as we need an integration with core components.
- We would like to solve the lifecycle of static pods during a node maintenance. This means that
  static pods should be terminated during the drain according to the `drainPlan`, and they should
  stay terminated after the kubelet restart if the node is still under maintenance. This requires
  integration with the kubelet. See kubelet: Static Pods for more details.
- We would like to improve the Graceful Node Shutdown feature. Terminating pods via NodeMaintenance
  will improve application safety and availability. It will also improve the reliability of the
  Graceful Node Shutdown feature. However, this also requires the kubelet to interact with a
  NodeMaintenance. See kubelet and kubelet: Graceful Node Shutdown for more details.
- We would like to also solve the lifecycle of DaemonSet pods during the NodeMaintenance. Usually these pods run important or critical services. These should be terminated at the right time during the node drain. To solve this, integration with NodeMaintenance is required. See DaemonSet Controller for more details.
Also, one of the disadvantages of using a CRD is that it would be more difficult to get real-world adoption and thus important feedback on this feature. This is mainly because the NodeMaintenance feature coordinates the node drain and provides good observability of the whole process. Third-party components that face both cluster admins and application developers can depend on this feature, use it, and build on top of it.
As an alternative, it would be possible to signal the node maintenance by marking the node object instead of introducing a new API. But, it is probably better to decouple this from the node for reasons of extensibility and complexity.
Advantages of the NodeMaintenance API approach:
- It allows us to implement incremental scale down of pods by various attributes according to a drainPlan across multiple nodes.
- There may be additional features that can be added to the NodeMaintenance in the future.
- It helps to decouple RBAC permissions and general update responsibility from the node object.
- It is easier to manage a NodeMaintenance lifecycle compared to the node object.
- Two or more different actors may want to maintain the same node in two different overlapping time slots. Creating two different NodeMaintenance objects would help with tracking each maintenance along with the reason behind it.
- Observability is better achieved with an additional object.
To signal the start of the eviction, we could simply taint a node with the `NoExecute` taint. This
taint should be easily recognizable and have a standard name, such as
`node.kubernetes.io/maintenance`. Other actors could observe the creation of such a taint and
migrate or delete the pod. To ensure pods are not removed prematurely, application owners would
have to set a toleration on their pods for this maintenance taint. Such applications could also set
`.spec.tolerations[].tolerationSeconds`, which would give a deadline for the pods to be removed by
the NoExecuteTaintManager.
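Under this alternative, the taint and a pod's opt-in toleration might look roughly as follows; the taint key is the standard name suggested above, and the node name, pod name, and one-hour deadline are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: worker-9
spec:
  taints:
  - key: node.kubernetes.io/maintenance
    effect: NoExecute
---
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-app-0
spec:
  tolerations:
  # Tolerate the maintenance taint for up to one hour before the NoExecuteTaintManager
  # removes the pod.
  - key: node.kubernetes.io/maintenance
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 3600
  ...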
This approach has the following disadvantages:
- Taints and tolerations do not support PDBs, which is the main mechanism for preventing voluntary disruptions. People who want to avoid the disruptions caused by the maintenance taint would have to specify the toleration in the pod definition and ensure it is present at all times. This would also have an impact on the controllers, who would have to pollute the pod definitions with these tolerations, even though the users did not specify them in their pod template. The controllers could override users' tolerations, which the users might not be happy about. It is also hard to make such behaviors consistent across all the controllers.
- Taints are used as a mechanism for involuntary disruption; to get pods out of the node for some reason (e.g. node is not ready). Modifying the taint mechanism to be less harmful (e.g. by adding a PDB support) is not possible due to the original requirements.
- It is not possible to incrementally scale down according to pod priorities, labels, etc.
These names were considered as alternatives to NodeMaintenance:
- NodeIsolation
- NodeDetachment
- NodeClearance
- NodeQuarantine
- NodeDisengagement
- NodeVacation