- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
This KEP proposes adding a declarative API to manage node maintenance. This API can be used to implement additional capabilities around node draining.
The goal of this KEP is to analyze and improve node maintenance in Kubernetes.
Node maintenance is a request from a cluster administrator to remove all pods from a node(s) so that it can be disconnected from the cluster to perform a software upgrade (OS, kubelet, etc.), hardware or firmware upgrade, or simply to remove the node as it is no longer needed.
Kubernetes has existing support for this use case in the following way with `kubectl drain`:
- There are running pods on node A, some of which are protected with PodDisruptionBudgets (PDB).
- Set the node `Unschedulable` (cordon) to prevent new pods from being scheduled there.
- Evict (default behavior) pods from node A by using the eviction API (see the kubectl drain workflow).
- Proceed with the maintenance and shut down or restart the node.
- On platforms and nodes that support it, the kubelet will try to detect the imminent shutdown and
then attempt to perform a Graceful Node Shutdown:
- delay the shutdown pending graceful termination of remaining pods
- terminate remaining pods in reverse priority order (see pod-priority-graceful-node-shutdown)
The main problem is that the current approach tries to solve this in an application-agnostic way and simply attempts to remove all the pods currently running on the node. Since this approach cannot be applied generically to all pods, the Kubernetes project has defined special drain filters that either skip certain groups of pods or require the admin's explicit consent to skip or delete them. This means that, without knowledge of all the underlying applications on the cluster, the admin has to make a potentially harmful decision.
From an application owner or developer perspective, the only standard tool they have is a PodDisruptionBudget (PDB). This is sufficient in a basic scenario with a simple multi-replica application. The edge-case applications where this does not work are very important to the cluster admin, as they can block the node drain; and, in turn, to the application owner, as the admin may then override the pod disruption budget and disrupt their sensitive application anyway.
List of cases where the current solution is not optimal:
- Without extra manual effort, an application running with a single replica has to settle for
  experiencing application downtime during the node drain. They cannot use PDBs with
  `minAvailable: 1` or `maxUnavailable: 0`, or they will block node maintenance. Not every user
  needs high availability either, due to a preference for a simpler deployment model, lack of
  application support for HA, or to minimize compute costs. Also, any automated solution needs to
  edit the PDB to account for the additional pod that needs to be spun up to move the workload from
  one node to another. This has been discussed in the issues kubernetes/kubernetes#66811 and
  kubernetes/kubernetes#114877.
- Similar to the first point, it is difficult to use PDBs for applications that can have a variable
  number of pods; for example, applications with a configured horizontal pod autoscaler (HPA).
  These applications cannot be disrupted during a low load when they have only one pod. However, it
  is possible to disrupt the pods during a high load of the application (pods > 1) without
  experiencing application downtime. If the minimum number of pods is 1, PDBs cannot be used
  without blocking the node drain. This has been discussed in the issue kubernetes/kubernetes#93476.
- Graceful termination of DaemonSet pods is currently only supported on Linux as part of the
  Graceful Node Shutdown feature. The length of the shutdown is again not application specific and
  is set cluster-wide (optionally by priority) by the cluster admin. This only partially takes into
  account the `.spec.terminationGracePeriodSeconds` of each pod and may cause premature termination
  of the application. This has been discussed in the issues kubernetes/kubernetes#75482 and
  kubernetes-sigs/cluster-api#6158.
- There are cases during a node shutdown when data corruption can occur due to premature node
  shutdown. It would be great if applications could perform data migration and synchronization of
  cached writes to the underlying storage before the pod deletion occurs. This is not easy to
  quantify even with the pod's `.spec.shutdownGracePeriod`, as the time depends on the size of the
  data and the speed of the storage. This has been discussed in the issues
  kubernetes/kubernetes#116618 and kubernetes/kubernetes#115148.
- During the Graceful Node Shutdown, the kubelet terminates the pods in order of their priority.
  The DaemonSet controller runs its own scheduling logic and creates the pods again. This causes a
  race. Such pods should be removed and not recreated, but higher priority pods that have not yet
  been terminated should be recreated if they are missing. This has been discussed in the issue
  kubernetes/kubernetes#122912.
- The Graceful Node Shutdown feature is not always reliable. If Dbus or the kubelet is restarted during the shutdown, pods may be ungracefully terminated, leading to application disruption and data loss. New applications can get scheduled on such a node, which can also be harmful. This has been discussed in the issues kubernetes/kubernetes#122674 and kubernetes/kubernetes#120613.
- There is no way to gracefully terminate static pods during a node shutdown kubernetes/kubernetes#122674, and the lifecycle/termination is not clearly defined for static pods kubernetes/kubernetes#16627.
- Different pod termination mechanisms are not synchronized with each other. So, for example, the taint manager may prematurely terminate pods that are currently under Graceful Node Shutdown. This can also happen with other mechanisms (e.g., different types of evictions). This has been discussed in the issues kubernetes/kubernetes#124448 and kubernetes/kubernetes#72129.
- There is not enough metadata about why the node drain was requested or why the pods are terminating. This has been discussed in the issue kubernetes/kubernetes#30586 and in the issue kubernetes/kubernetes#116965.
Approaches and workarounds used by other projects to deal with these shortcomings:
- https://github.com/medik8s/node-maintenance-operator uses a declarative approach that tries to
  mimic `kubectl drain` (and uses the kubectl implementation under the hood).
- https://github.com/kubereboot/kured performs automatic node reboots and relies on the
  `kubectl drain` implementation to achieve that.
- https://github.com/strimzi/drain-cleaner prevents Kafka or ZooKeeper pods from being drained
  until they are fully synchronized. Implemented by intercepting eviction requests with a
  validating admission webhook. The synchronization is also protected by a PDB with the
  `.spec.maxUnavailable` field set to 0. See the experience reports for more information.
- https://github.com/kubevirt/kubevirt intercepts eviction requests with a validating admission
  webhook to block eviction and to start a virtual machine live migration from one node to another.
  Normally, the workload is also guarded by a PDB with the `.spec.minAvailable` field set to 1.
  During the migration the value is increased to 2.
- https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler has an eviction process
  that takes inspiration from kubectl and builds additional logic on top of it. See Cluster
  Autoscaler for more details.
- https://github.com/kubernetes-sigs/karpenter taints the node during the node drain. It then attempts to evict all the pods on the node by calling the Eviction API. It prioritizes non-critical pods and non-DaemonSet pods.
- https://github.com/aws/aws-node-termination-handler watches for a predefined set of events (spot
  instance termination, EC2 termination, etc.), then cordons and drains the node. It relies on the
  `kubectl` implementation.
- https://github.com/openshift/machine-config-operator updates/drains nodes by using a cordon and
  relies on the `kubectl drain` implementation.
Experience Reports:
- Federico Valeri, Drain Cleaner: What's this?, Sep 24, 2021, description of the use case and implementation of drain cleaner
- Tommer Amber, Solution!! Avoid Kubernetes/Openshift Node Drain Failure due to active PodDisruptionBudget, Apr 30, 2022 -
  the user is unhappy about the manual intervention required to perform node maintenance and gives
  the unfortunate advice to cluster admins to simply override the PDBs. This can have negative
  consequences for user applications, including data loss. It also discourages the use of PDBs.
  We have also seen interest in the issue kubernetes/kubernetes#83307 for overriding evictions,
  which led to the addition of the `--disable-eviction` flag to `kubectl drain`. There are other
  examples of this approach on the web.
- Kevin Reeuwijk, How to handle blocking PodDisruptionBudgets on K8s with distributed storage, June 6, 2022 -
  a simple shell script example of how to drain the node in a safer way. It does a normal eviction,
  then looks for a pet application (Rook-Ceph in this case) and does a hard delete if it does not
  see it. This approach is not plagued by the loss of data resiliency, but it does require
  maintaining a list of pet applications, which can be prone to mistakes. In the end, the cluster
  admin has to do the job of the application maintainer.
- Artur Rodrigues, Impossible Kubernetes node drains, 30 Mar, 2023 - discusses the problem with
  node drains and offers a workaround to restart the application without the application owner's
  consent, but acknowledges that this may be problematic without knowledge of the application.
- Jack Roper, How to Delete Pods from a Kubernetes Node with Examples, 05 Jul, 2023 - also
  discusses the problem of blocking PDBs and offers several workarounds. Similar to others, it
  offers force deletion, but also a less destructive method of scaling up the application.
  However, this also interferes with the application deployment and has to be supported by the
  application.
The Cluster Autoscaler accepts a `drain-priority-config` option, which is similar to Graceful Node
Shutdown in that it gives each priority a shutdown grace period. It also has a
`max-graceful-termination-sec` option for pod termination and a `max-pod-eviction-time` option
after which the eviction is forfeited.
Each pod is first analyzed to see if it is drainable. Part of the logic is similar to kubectl and its drain filters (see Cluster Autoscaler rules):
- Mirror pods are skipped.
- Terminating pods are skipped.
- Pods and ReplicaSets/ReplicationControllers without owning controllers are blocking by default
  (the check can be modified with the `skip-nodes-with-custom-controller-pods` option).
- System pods (in the `kube-system` namespace) without a matching PDB are blocking by default (the
  check can be modified with the `skip-nodes-with-system-pods` option).
- Pods with the `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` annotation are blocking.
- Pods with local storage are blocking unless they have a
  `cluster-autoscaler.kubernetes.io/safe-to-evict-local-volumes` annotation (the check can be
  modified with the `skip-nodes-with-local-storage` option; e.g., this check is skipped on AKS).
- Pods with PDBs that do not have `disruptionsAllowed` are blocking.
This can be enhanced with other rules and overrides.
It uses this logic to first check if all pods can be removed from a node. If not, it will report those nodes. Then it will group all pods by a priority and evict them gradually from the lowest to highest priority. This may include DaemonSet pods.
Graceful Node Shutdown is a part of the current solution for node maintenance. Unfortunately, it is not possible to rely solely on this feature as a go-to solution for graceful node and workload termination.
- The Graceful Node Shutdown feature is not application aware and may prematurely disrupt workloads and lead to data loss.
- The kubelet controls the shutdown process using Dbus and systemd, and can delay (but not entirely block) it using the systemd inhibitor. However, if Dbus or the kubelet is restarted during the node shutdown, the shutdown might not be registered again, and pods might be terminated ungracefully. Also, new workloads can get scheduled on the node while the node is shutting down. Cluster admins should, therefore, plan the maintenance in advance and ensure that pods are gracefully removed before attempting to shut down or restart the machine.
- The kubelet has no way of reliably detecting ongoing maintenance if the node is restarted in the meantime.
- Graceful termination of static pods during a shutdown is not possible today. It is also not currently possible to prevent them from starting back up immediately after the machine has been restarted and the kubelet has started again, if the node is still under maintenance.
To sum up: some applications solve the disruption problem by introducing validating admission webhooks. This has some drawbacks. The webhooks are not easily discoverable by cluster admins, and they can block evictions for other applications if they are misconfigured or misbehave. The eviction API is not intended to be extensible in this way, so the webhook approach is not recommended.
Some drainers solve the node drain by depending on the kubectl logic, or by extending/rewriting it with additional rules and logic.
As seen in the experience reports and GitHub issues, some admins solve their problems by simply ignoring PDBs, which can cause unnecessary disruptions or data loss. Others solve this by manipulating the application deployment, but they have to make sure that the application supports this.
The kubelet's Graceful Node Shutdown feature is a best-effort solution for unplanned shutdowns, but it is not sufficient to ensure application and data safety.
- Introduce NodeMaintenance API.
- Introduce a node maintenance controller that creates evacuations.
- Deprecate kubectl drain in favor of NodeMaintenance.
- Implement the NodeMaintenanceAwareKubelet feature to support a lifecycle for static pods during a maintenance.
- Implement the NodeMaintenanceAwareDaemonSet feature to prevent the scheduling of DaemonSet pods on nodes during a maintenance.
- Introduce a node maintenance period, nodeDrainTimeout (similar to cluster-api nodeDrainTimeout) or a TTL optional field as an upper bound on the duration of node maintenance. Then the node maintenance would be garbage collected and the node made schedulable again.
- Solve the node lifecycle management or automatic shutdown after the node drain is completed. Implementation of this is better suited for other cluster components and actors who can use the node maintenance as a building block to achieve their desired goals.
- Synchronize all pod termination mechanisms (see #8 in the Motivation section), so that they do not terminate pods under NodeMaintenance/Evacuation.
Most of these issues stem from the lack of a standardized way of detecting the start of a node
drain. This KEP proposes the introduction of a NodeMaintenance object that would signal an intent
to gracefully remove pods from given nodes. The intent will be implemented by the newly proposed
Evacuation API KEP, which ensures graceful pod removal or migration, the ability to measure
progress, and a fallback to eviction if progress is lost. The NodeMaintenance implementation should
also utilize the existing node's `.spec.unschedulable` field, which prevents new pods from being
scheduled on such a node.
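For reference, cordoning at the API level means setting this existing field on the Node object; the node name below is illustrative:

apiVersion: v1
kind: Node
metadata:
  name: worker-1
spec:
  # Reconciled by the node maintenance controller during the Cordon and Drain stages;
  # prevents new pods from being scheduled on the node.
  unschedulable: true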
We will deprecate `kubectl drain` as the main mechanism for draining nodes and drive the whole
process via a declarative API. This API can be used either manually or programmatically by other
drain implementations (e.g., cluster autoscalers).
To support workload migration, a new controller should be introduced to observe the NodeMaintenance
objects and then select pods for evacuation. The pods should be selected by node (`nodeSelector`)
and gradually evacuated according to the workload they are running.
Controllers can then implement the migration/termination either by reacting to the Evacuation API
or by reacting to the NodeMaintenance API if they need more details.
As a cluster admin, I want to have a simple interface to initiate a node drain/maintenance without any required manual interventions. I want to have the ability to manually switch between the maintenance phases (Planning, Cordon, Drain, Drain Complete, Maintenance Complete). I also want to observe the node drain via the API and check on its progress. I also want to be able to discover workloads that are blocking the node drain.
As an application owner, I want to run single replica applications without disruptions and have the ability to easily migrate the workload pods from one node to another. This also applies to applications with a larger number of replicas that prefer to surge (upscale) pods first rather than downscale.
Cluster or node autoscalers that take on the role of `kubectl drain` want to signal the intent to
drain a node using the same API and provide a similar experience to the CLI counterpart.
- This KEP depends on Evacuation API KEP.
A misconfigured .spec.nodeSelector could select all the nodes (or just all master nodes) in the cluster. This can cause the cluster to get into a degraded and unrecoverable state.
An admission plugin (NodeMaintenance Admission) is introduced to issue a warning in this scenario.
`kubectl drain`: as we can see in the Motivation section, kubectl is heavily used either manually
or as a library by other projects. It is safer to keep the old behavior of this command. However,
we will deprecate it along with all the library functions. We can print a deprecation warning when
this command is used and promote the NodeMaintenance. Additionally, pods that support evacuation
and have `evacuation.coordination.k8s.io/priority_${EVACUATOR_CLASS}` annotations will block
eviction requests.
The `kubectl cordon` and `kubectl uncordon` commands will be enhanced to warn the user if making a
node un/schedulable collides with an existing NodeMaintenance object. As a consequence, the node
maintenance controller will reconcile the node back to the old value. Because of this, we can make
these commands a no-op when the node is under a NodeMaintenance.
NodeMaintenance objects serve as an intent to remove or migrate pods from a set of nodes. We will include Cordon and Drain toggles to support the following states/stages of the maintenance:
- Planning: this is to let the users know that maintenance will be performed on a particular set of
  nodes in the future. Configured with `.spec.stage=Idle`.
- Cordon: stop accepting (scheduling) new pods. Configured with `.spec.stage=Cordon`.
- Drain: gives an intent to drain all selected nodes by creating Evacuation objects for the node's
  pods. Configured with `.spec.stage=Drain`.
- Drain Complete: all targeted pods have been drained from all the selected nodes. The nodes can be
  upgraded, restarted, or shut down. The configuration is still kept at `.spec.stage=Drain` and the
  `Drained` condition is set to `"True"` on the node maintenance object.
- Maintenance Complete: make the nodes schedulable again once the node maintenance is done.
  Configured with `.spec.stage=Complete`.
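For illustration, a NodeMaintenance that drains the nodes of a single zone might look as follows; the object name, label key, and values are illustrative, and the `nodeSelector` serialization assumes the core `v1.NodeSelector` format:

apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
  name: os-upgrade-zone-a
spec:
  nodeSelector:
    nodeSelectorTerms:
    - matchExpressions:
      - key: topology.kubernetes.io/zone
        operator: In
        values:
        - zone-a
  # Start with Cordon (or Idle for planning) and later update the stage;
  # setting Drain directly also cordons the selected nodes.
  stage: Drain
  reason: "OS upgrade"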
// +enum
type NodeMaintenanceStage string
const (
// Idle does not interact with the cluster.
Idle NodeMaintenanceStage = "Idle"
// Cordon cordons all selected nodes by making them unschedulable.
Cordon NodeMaintenanceStage = "Cordon"
// Drain:
// 1. Cordons all selected nodes by making them unschedulable.
// 2. Gives an intent to drain all selected nodes by creating Evacuation objects for the
// node's pods.
Drain NodeMaintenanceStage = "Drain"
// Complete:
// 1. Removes all Evacuation objects requested by this NodeMaintenance.
// 2. Uncordons all selected nodes by making them schedulable again, unless there is another
//    maintenance in progress.
Complete NodeMaintenanceStage = "Complete"
)
type NodeMaintenance struct {
...
Spec NodeMaintenanceSpec
Status NodeMaintenanceStatus
}
type NodeMaintenanceSpec struct {
// NodeSelector selects nodes for this node maintenance.
// +required
NodeSelector *v1.NodeSelector
// The order of the stages is Idle -> Cordon -> Drain -> Complete.
//
// - The Cordon or Drain stage can be skipped by setting the stage to Complete.
// - The NodeMaintenance object is moved to the Complete stage on deletion unless the Idle stage has been set.
//
// The default value is Idle.
Stage NodeMaintenanceStage
// DrainPlan is executed from the first entry to the last entry during the Drain stage.
// DrainPlanEntry podType fields should be in the following order:
// nil -> DaemonSet -> Static
// DrainPlanEntry priority fields should be in ascending order for each podType.
// If the priority and podType are the same, concrete selectors are executed first.
//
// The following entries are injected into the drainPlan on the NodeMaintenance admission:
// - podPriority: 1000000000 // highest priority for user defined priority classes
// podType: "Default"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "Default"
// - podPriority: 2000001000 // system-node-critical priority class
// podType: "Default"
// - podPriority: 2147483647 // maximum value
// podType: "Default"
// - podPriority: 1000000000 // highest priority for user defined priority classes
// podType: "DaemonSet"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "DaemonSet"
// - podPriority: 2000001000 // system-node-critical priority class
// podType: "DaemonSet"
// - podPriority: 2147483647 // maximum value
// podType: "DaemonSet"
// - podPriority: 1000000000 // highest priority for user defined priority classes
// podType: "Static"
// - podPriority: 2000000000 // system-cluster-critical priority class
// podType: "Static"
// - podPriority: 2000001000 // system-node-critical priority class
// podType: "Static"
// - podPriority: 2147483647 // maximum value
// podType: "Static"
//
// Duplicate entries are not allowed.
// This field is immutable.
DrainPlan []DrainPlanEntry
// Reason for the maintenance.
Reason string
}
// +enum
type PodType string

const (
// Default selects all pods except DaemonSet and Static pods.
Default PodType = "Default"
// DaemonSet selects DaemonSet pods.
DaemonSet PodType = "DaemonSet"
// Static selects static pods.
Static PodType = "Static"
)
type DrainPlanEntry struct {
// PodSelector selects pods according to their labels.
// This can help to select which pods of the same priority should be evacuated first.
// +optional
PodSelector *metav1.LabelSelector
// PodPriority specifies a pod priority.
// Pods with a priority less than or equal to this value are selected.
PodPriority int32
// PodType selects pods according to the pod type:
// - Default selects all pods except DaemonSet and Static pods.
// - DaemonSet selects DaemonSet pods.
// - Static selects static pods.
PodType PodType
}
type NodeMaintenanceStatus struct {
// StageStatuses tracks the statuses of started stages.
StageStatuses []StageStatus
// List of maintenance statuses for all nodes targeted by this maintenance.
NodeStatuses []NodeMaintenanceNodeStatus
Conditions []metav1.Condition
}
type StageStatus struct {
// Name of the Stage.
Name NodeMaintenanceStage
// StartTimestamp is the time that indicates the start of this stage.
StartTimestamp *metav1.Time
}
type NodeReference struct {
// Name of the node.
Name string
}
type NodeMaintenanceNodeStatus struct {
// NodeRef identifies a Node.
NodeRef NodeReference
// DrainTargets specifies which pods on this node are currently being targeted for evacuation.
// Once evacuation of the Default PodType finishes, DaemonSet PodType entries appear.
// Once the evacuation of DaemonSet PodType finishes, Static PodType entries appear.
// The PodPriority for these entries is increased over time according to the .spec.DrainPlan
// as the lower-priority pods finish evacuation.
// The next entry in the .spec.DrainPlan is selected once all the nodes have reached their
// DrainTargets.
// If there are multiple NodeMaintenances for a node, the least powerful DrainTargets among
// them are selected and set for all the NodeMaintenances for that node. Thus, the DrainTargets
// do not have to correspond to the entries in .spec.drainPlan for a single NodeMaintenance
// instance.
// DrainTargets cannot backtrack and will target more pods with each update until all pods on
// the node are targeted.
DrainTargets []DrainPlanEntry
// DrainMessage may specify a state of the drain on this node and a reason why the drain
// targets are set to particular values.
DrainMessage string
// Number of pods that have not yet started evacuation.
PodsPendingEvacuation int32
// Number of pods that have started evacuation and have a matching Evacuation object.
PodsEvacuating int32
}
const (
// DrainedCondition is a condition set by the node-maintenance controller that signals
// whether all pods pending termination have terminated on all target nodes when drain is
// requested by the maintenance object.
DrainedCondition = "Drained"
)
A `nodemaintenance` admission plugin will be introduced.
It will validate all incoming requests for CREATE, UPDATE, and DELETE operations on the
NodeMaintenance objects. All nodes matching the `.spec.nodeSelector` must pass an authorization
check for the DELETE operation.
Also, if the `.spec.nodeSelector` matches all cluster nodes, a warning will be produced indicating
that the cluster may get into a degraded and unrecoverable state. The warning is non-blocking, and
such a NodeMaintenance is still valid and can proceed.
A node maintenance controller will be introduced and added to `kube-controller-manager`. It will
observe NodeMaintenance objects and have the following main features:
The controller should not touch the pods or nodes that match the selector of the NodeMaintenance
object in any way in the `Idle` stage.
When a stage is not `Idle`, a `nodemaintenance.k8s.io/maintenance-completion` finalizer is placed
on the NodeMaintenance object to ensure uncordon and removal of Evacuations upon deletion.
When a deletion of the NodeMaintenance object is detected, its `.spec.stage` is set to `Complete`.
The finalizer is not removed until the `Complete` stage has been completed.
When a `Cordon` or `Drain` stage is detected on the NodeMaintenance object, the controller will set
(and reconcile) `.spec.unschedulable` to `true` on all nodes that satisfy `.spec.nodeSelector`. It
should alert via events if too many updates occur and a race to change this field is detected.
When a `Complete` stage is detected on the NodeMaintenance object, the controller sets
`.spec.unschedulable` back to `false` on all nodes that satisfy `.spec.nodeSelector`, unless there
is another maintenance in progress for those nodes.
When the node maintenance is canceled (reaches the `Complete` stage without all of its pods
terminating), the controller will attempt to remove all Evacuations that match the node
maintenance, unless they are still required by another maintenance in progress.
- If there are foreign finalizers on the Evacuation, it should only remove its own instigator
  finalizer (see Drain).
- If the evacuator does not support cancellation and has set `.status.evacuationCancellationPolicy`
  to `Forbid`, deletion of the Evacuation object will not be attempted.
Consequences for pods:
- Pods whose evacuators have not yet initiated evacuation will continue to run unchanged.
- Pods whose evacuators have initiated evacuation and support cancellation
  (`.status.evacuationCancellationPolicy=Allow`) should cancel the evacuation and keep the pods
  available.
- Pods whose evacuators have initiated evacuation and do not support cancellation
  (`.status.evacuationCancellationPolicy=Forbid`) should continue the evacuation and eventually
  terminate the pods.
When a `Drain` stage is detected on the NodeMaintenance object, Evacuation objects are created for
selected pods (Pod Selection).
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
name: f5823a89-e03f-4752-b013-445643b8c7a0-muffin-orders-6b59d9cb88-ks7wb
namespace: blue-deployment
spec:
podRef:
name: muffin-orders-6b59d9cb88-ks7wb
uid: f5823a89-e03f-4752-b013-445643b8c7a0
progressDeadlineSeconds: 1800
On admission, this is resolved according to the pod into the following Evacuation object:
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
labels:
app: muffin-orders
name: f5823a89-e03f-4752-b013-445643b8c7a0-muffin-orders-6b59d9cb88-ks7wb
namespace: blue-deployment
spec:
podRef:
name: muffin-orders-6b59d9cb88-ks7wb
uid: f5823a89-e03f-4752-b013-445643b8c7a0
progressDeadlineSeconds: 1800
evacuators:
- evacuatorClass: deployment.apps.k8s.io
priority: 10000
role: controller
The node maintenance controller requests the removal of a pod from a node by the presence of the
Evacuation. Setting `progressDeadlineSeconds` to 1800 (30m) should give potential evacuators enough
time to recover from a disruption and continue with the graceful evacuation. If the evacuators are
unable to evacuate the pod, or if there are no evacuators, the evacuation controller will attempt
to evict these pods until they are deleted.
The only job of the node maintenance controller is to make sure that the Evacuation object exists
and has the `evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io` finalizer.
The pods for evacuation would first be selected by node (`.spec.nodeSelector`). NodeMaintenance
should eventually remove all the pods from each node. To do this in a graceful manner, the
controller will ensure that lower priority pods are evacuated first for the same pod type. The user
can also target some pods earlier than others with a label selector.
DaemonSet and static pods typically run critical workloads that should be scaled down last.
<<[UNRESOLVED Pod Selection Priority]>> Should user daemon sets (priority up to 1000000000) be scaled down first? <<[/UNRESOLVED]>>
To achieve this, we will ensure that the NodeMaintenance `.spec.drainPlan` always contains the
following entries:
spec:
drainPlan:
- podPriority: 1000000000 # highest priority for user defined priority classes
podType: "Default"
- podPriority: 2000000000 # system-cluster-critical priority class
podType: "Default"
- podPriority: 2000001000 # system-node-critical priority class
podType: "Default"
- podPriority: 2147483647 # maximum value
podType: "Default"
- podPriority: 1000000000 # highest priority for user defined priority classes
podType: "DaemonSet"
- podPriority: 2000000000 # system-cluster-critical priority class
podType: "DaemonSet"
- podPriority: 2000001000 # system-node-critical priority class
podType: "DaemonSet"
- podPriority: 2147483647 # maximum value
podType: "DaemonSet"
- podPriority: 1000000000 # highest priority for user defined priority classes
podType: "Static"
- podPriority: 2000000000 # system-cluster-critical priority class
podType: "Static"
- podPriority: 2000001000 # system-node-critical priority class
podType: "Static"
- podPriority: 2147483647 # maximum value
podType: "Static"
...
If not, they will be added during the NodeMaintenance admission.
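For illustration, a user could submit only their own entries and let the admission plugin inject the defaults; the priorities and the selector below are illustrative:

spec:
  drainPlan:
  # User-supplied: evacuate regular pods with priority <= 1000 first.
  - podPriority: 1000
    podType: "Default"
  # User-supplied: then also target the postgres pods up to priority 2000.
  - podPriority: 2000
    podType: "Default"
    podSelector:
      matchLabels:
        app: postgres
  # The default entries listed above (priorities 1000000000, 2000000000, 2000001000, and
  # 2147483647 for the Default, DaemonSet, and Static pod types) are injected on admission,
  # so the plan always ends by removing every remaining pod from the node.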
The node maintenance controller resolves this plan across intersecting NodeMaintenances. To
indicate which pods are being evacuated on which node, the controller populates
`.status.nodeStatuses[0].drainTargets`. This status field is updated during the `Drain` stage to
incrementally select pods with higher priority and pod type (`Default` -> `DaemonSet` -> `Static`).
It is also possible to partition the updates for the same priorities according to the pod labels.
If there is only a single NodeMaintenance present, it selects the first entry from the
`.spec.drainPlan` and makes sure that all the targeted pods are evacuated/removed. It then selects
the next entry and repeats the process. If a new pod appears that matches the previous entries, it
will also be evacuated.
If there are multiple NodeMaintenances, we have to first resolve the lowest priority entry from the
`.spec.drainPlan` among them for the intersecting nodes. Non-intersecting nodes may have a higher
priority or pod type. The next entry in the plan can be selected once all the nodes of a
NodeMaintenance have finished evacuation and all the NodeMaintenances of intersecting nodes have
finished evacuation for the current drain targets. See the Pod Selection and DrainTargets Example
for additional details.
A similar kind of drain plan, albeit with fewer features, is offered today by the Graceful Node
Shutdown feature and by the Cluster Autoscaler's `drain-priority-config`. The downside of these
configurations is that they have a `shutdownGracePeriodSeconds`, which sets a limit on how long the
termination of pods should take. This is not application-aware, and some applications may require
more time to gracefully shut down. Allowing such hard-coded timeouts may result in unnecessary
application disruptions or data corruption.
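For comparison, the existing time-bounded kubelet configuration looks roughly like this; the priority bands and grace periods below are illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
shutdownGracePeriodByPodPriority:
# Critical pods (priority >= 2000000000) get at most 20 seconds to terminate.
- priority: 2000000000
  shutdownGracePeriodSeconds: 20
# All other pods (priority >= 0) get at most 120 seconds to terminate.
- priority: 0
  shutdownGracePeriodSeconds: 120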
To support the evacuation of `DaemonSet` and `Static` pods, the DaemonSet controller and the
kubelet should observe NodeMaintenance objects and Evacuations to coordinate the scale down of the
pods on the targeted nodes.
To ensure a more streamlined experience, we will not support the default kubectl drain filters.
Instead, it should be possible to create the NodeMaintenance object with just a
`spec.nodeSelector` (see the minimal example after the following list). The only thing that can be
configured is which pods should be scaled down first.
NodeMaintenance alternatives to kubectl drain filters:
- `daemonSetFilter`: Removal of these pods should be supported by the DaemonSet controller.
- `mirrorPodFilter`: Removal of these pods should be supported by the kubelet.
- `skipDeletedFilter`: Creating an evacuation of already terminating pods should have no downside
  and be informative for the user.
- `unreplicatedFilter`: Actors who own pods without a controller owner reference should have the
  opportunity to register an evacuator to evacuate their pods. Many drain solutions today evict
  these types of pods indiscriminately.
- `localStorageFilter`: Actors who own pods with local storage (having `EmptyDir` volumes) should
  have the opportunity to register an evacuator to evacuate their pods. Many drain solutions today
  evict these types of pods indiscriminately.
If two NodeMaintenances are created at the same time for the same node, then, for the intersecting nodes, the entry with the lowest priority in the drainPlan is resolved first.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-a"
...
spec:
stage: Drain
drainPlan:
- podPriority: 5000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 3000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 10
drainMessage: "Evacuating"
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: two
podsPendingEvacuation: 30
podsEvacuating: 7
drainMessage: "Evacuating"
drainTargets:
- podPriority: 5000
podType: Default
...
---
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 10
drainMessage: "Evacuating (limited by maintenance-a)"
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 25
drainMessage: "Evacuating"
drainTargets:
- podPriority: 10000
podType: Default
...
If node three finishes draining, it has to wait for node one, because the drain plan specifies that all the pods with priority 10000 or lower should be evacuated first before moving on to the next entry.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 5
drainMessage: "Evacuating (limited by maintenance-a)"
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 0
drainMessage: "Waiting for node one."
drainTargets:
- podPriority: 10000
podType: Default
...
If node one is drained, we still have to wait for `maintenance-a` to drain node two. If we were to
start evacuating higher priority pods from node one earlier, we would not conform to the drainPlan
of `maintenance-a`. The plan specifies that all the pods with priority 5000 or lower should be
evacuated first before moving on to the next entry.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-a"
...
spec:
stage: Drain
drainPlan:
- podPriority: 5000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 3000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 0
drainMessage: "Waiting for node two."
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: two
podsPendingEvacuation: 30
podsEvacuating: 2
drainMessage: "Evacuating"
drainTargets:
- podPriority: 5000
podType: Default
...
---
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 100
podsEvacuating: 0
drainMessage: "Waiting for node two (maintenance-a)."
drainTargets:
- podPriority: 5000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 0
drainMessage: "Waiting for node two (maintenance-a)."
drainTargets:
- podPriority: 10000
podType: Default
...
Once node two drains, we can increment the drainTargets.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-a"
...
spec:
stage: Drain
drainPlan:
- podPriority: 5000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 3000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 70
podsEvacuating: 30
drainMessage: "Evacuating (limited by maintenance-b)"
drainTargets:
- podPriority: 10000
podType: Default
- nodeRef:
name: two
podsPendingEvacuation: 21
podsEvacuating: 9
drainMessage: "Evacuating"
drainTargets:
- podPriority: 15000
podType: Default
...
---
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-b"
...
spec:
stage: Drain
drainPlan:
- podPriority: 10000
podType: Default
- podPriority: 15000
podType: Default
- podPriority: 4000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 70
podsEvacuating: 30
drainMessage: "Evacuating"
drainTargets:
- podPriority: 10000
podType: Default
- nodeRef:
name: three
podsPendingEvacuation: 45
podsEvacuating: 0
drainMessage: "Waiting for node one."
drainTargets:
- podPriority: 10000
podType: Default
...
The progress of the drain should not be backtracked. If an intersecting `maintenance-c` is created,
it will be fast-forwarded for node one regardless of its drainPlan.
apiVersion: v1alpha1
kind: NodeMaintenance
metadata:
name: "maintenance-c"
...
spec:
stage: Drain
drainPlan:
- podPriority: 2000
podType: Default
- podPriority: 15000
podType: Default
...
status:
nodeStatuses:
- nodeRef:
name: one
podsPendingEvacuation: 70
podsEvacuating: 30
drainMessage: "Evacuating (fast-forwarded by older maintenance-b)"
drainTargets:
- podPriority: 10000
podType: Default
- nodeRef:
name: four
podsPendingEvacuation: 20
podsEvacuating: 5
drainMessage: "Evacuating"
drainTargets:
- podPriority: 2000
podType: Default
...
This is done to ensure that the pre-conditions of the older maintenances (`maintenance-a` and
`maintenance-b`) are not broken. When we remove workloads with priority 15000, our pre-condition is
that workloads with priority 5000 that might depend on these 15000 priority workloads are gone. If
we allow rescheduling of the lower priority pods, this assumption is broken.
Unfortunately, a similar precondition is broken for `maintenance-c`, so we can at least emit an
event saying that we are fast-forwarding `maintenance-c` due to existing older maintenance(s). In
the extreme scenario, node one may already be turned off, and creating a new maintenance that
assumes priority X pods are still running will not help to bring it back. Emitting an event would
help with observability and might help cluster admins better schedule node maintenances.
An example progression for the following drain plan might look as follows:
spec:
stage: Drain
drainPlan:
- podPriority: 1000
podType: Default
- podPriority: 2000
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 2147483647
podType: Default
- podPriority: 1000
podType: DaemonSet
- podPriority: 2147483647
podType: DaemonSet
- podPriority: 2147483647
podType: Static
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 1000
podType: Default
- podPriority: 1000
podType: Default
podSelector:
matchLabels:
app: postgres
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 1000
podType: Default
- podPriority: 2000
podType: Default
podSelector:
matchLabels:
app: postgres
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 1000
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 2147483647
podType: DaemonSet
...
status:
nodeStatuses:
- nodeRef:
name: five
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: Default
podSelector:
matchLabels:
app: postgres
- podPriority: 2147483647
podType: DaemonSet
- podPriority: 2147483647
podType: Static
...
The controller can show progress by reconciling:
- `.status.stageStatuses` should be amended when a new stage is selected. This is used to track
  which stages have been started. Additional metadata can be added to this struct in the future.
- `.status.nodeStatuses[0].drainTargets` should be updated during a `Drain` stage. The drain
  targets should be resolved according to Pod Selection and the Pod Selection and DrainTargets
  Example.
- `.status.nodeStatuses[0].drainMessage` should be updated during a `Drain` stage. The message
  should be resolved according to the Pod Selection and DrainTargets Example.
- `.status.nodeStatuses[0].podsPendingEvacuation`, to indicate how many pods are left to start
  evacuation from the first node.
- `.status.nodeStatuses[0].podsEvacuating`, to indicate how many pods are being evacuated from the
  first node. These are the pods that have matching Evacuation objects.
- To keep track of the entire maintenance, the controller will reconcile a `Drained` condition and
  set it to true if all pods pending evacuation/termination have terminated on all target nodes
  when drain is requested by the maintenance object.
- A NodeMaintenance condition or annotation can be set on the node object to advertise the current
  phase of the maintenance.
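For illustration, a NodeMaintenance status at the end of a successful drain might look as follows; the timestamps, node name, reason, and messages are illustrative:

status:
  stageStatuses:
  - name: Cordon
    startTimestamp: "2024-04-22T20:00:00Z"
  - name: Drain
    startTimestamp: "2024-04-22T20:05:00Z"
  nodeStatuses:
  - nodeRef:
      name: worker-3
    drainTargets:
    - podPriority: 2147483647
      podType: Default
    - podPriority: 2147483647
      podType: DaemonSet
    - podPriority: 2147483647
      podType: Static
    podsPendingEvacuation: 0
    podsEvacuating: 0
    drainMessage: "Drained"
  conditions:
  - type: Drained
    status: "True"
    reason: AllPodsTerminated
    message: "All pods pending termination have terminated on all target nodes."
    lastTransitionTime: "2024-04-22T20:35:00Z"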
The following transitions should be validated by the API server.
- Idle -> Deletion
- Planning a maintenance in the future and canceling/deleting it without any consequence.
- (Idle) -> Cordon -> (Complete) -> Deletion.
- Make a set of nodes unschedulable and then schedulable again.
- The Complete stage will always be run, even without specifying it.
- (Idle) -> (Cordon) -> Drain -> (Complete) -> Deletion.
- Make a set of nodes unschedulable, drain them, and then make them schedulable again.
- Cordon and Complete stages will always be run, even without specifying them.
- (Idle) -> Complete -> Deletion.
- Make a set of nodes schedulable.
The stage transitions are invoked either manually by the cluster admin or by a higher-level
controller. For a simple drain, the cluster admin can simply create the NodeMaintenance with
`stage: Drain` directly.
The DaemonSet workloads should be tied to the node lifecycle because they typically run critical
workloads where availability is paramount. Therefore, the DaemonSet controller should respond to
the Evacuation only if there is a NodeMaintenance happening on that node and the DaemonSet is in
the `drainTargets`. For example, if we observe the following NodeMaintenance:
apiVersion: v1alpha1
kind: NodeMaintenance
...
status:
nodeStatuses:
- nodeRef:
name: six
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 5000
podType: DaemonSet
...
To fulfil the Evacuation API, the DaemonSet controller should register itself as a controller evacuator. To do this, it should ensure that the following annotation is present on its own pods.
evacuation.coordination.k8s.io/priority_daemonset.apps.k8s.io: "10000/controller"
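For illustration, on a DaemonSet pod this could look as follows; the pod name and namespace are taken from the example further below, and the annotation value encodes the evacuator priority and role:

apiVersion: v1
kind: Pod
metadata:
  name: critical-ds-5nxjs
  namespace: critical-workloads
  annotations:
    # Registers the DaemonSet controller as a controller evacuator with priority 10000.
    evacuation.coordination.k8s.io/priority_daemonset.apps.k8s.io: "10000/controller"
  ...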
The controller should respond to the Evacuation object when it observes its own class
(`daemonset.apps.k8s.io`) in `.status.activeEvacuatorClass`.
For the above node maintenance, the controller should not react to Evacuations of DaemonSet pods
with a priority greater than 5000. This state should not normally occur, as Evacuation requests
should be coordinated with NodeMaintenance. If it does occur, we should not encourage this flow by
updating the `.status.activeEvacuatorCompleted` field, although it is required to update this
field for normal workloads.
If the DaemonSet pods have a priority equal to or less than 5000, the Evacuation status should be updated appropriately as follows, and the targeted pod should be deleted by the DaemonSet controller:
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
labels:
app: critical-ds
name: ae9b4bc6-e4ca-4f8e-962b-2d4459b1f684-critical-ds-5nxjs
namespace: critical-workloads
spec:
podRef:
name: critical-ds-5nxjs
uid: ae9b4bc6-e4ca-4f8e-962b-2d4459b1f684
progressDeadlineSeconds: 1800
evacuators:
- evacuatorClass: daemonset.apps.k8s.io
priority: 10000
role: controller
status:
activeEvacuatorClass: daemonset.apps.k8s.io
activeEvacuatorCompleted: false
evacuationProgressTimestamp: "2024-04-22T21:40:32Z"
expectedEvacuationFinishTime: "2024-04-22T21:41:32Z" # now + terminationGracePeriodSeconds
failedEvictionCounter: 0
message: "critical-ds is terminating the pod due to node maintenance (OS upgrade)."
conditions: []
Once the pod is terminated and removed from the node, it should not be re-scheduled on the node by the DaemonSet controller until the node maintenance is complete.
The current Graceful Node Shutdown feature has a couple of drawbacks when compared to NodeMaintenance:
- It is application agnostic as it only provides a static grace period before the shutdown based on priority. This does not always give the application enough time to react and can lead to data loss or application availability loss.
- The DaemonSet pods may be running important services (critical priority) that should be available even during part of the shutdown. The daemon set controller does not have the observability of the kubelet shutdown procedure and cannot infer which DaemonSets should stop running. The controller needs to know which DaemonSets should run on each node with which priorities and reconcile accordingly.
To support these use cases, we could introduce a new configuration option to the kubelet called
`preferNodeMaintenanceDuringGracefulShutdown`.
This would result in the following behavior:
When a shutdown is detected, the kubelet would create a NodeMaintenance object for that node. Then
it would block the shutdown indefinitely, until all the pods are terminated. The kubelet could pass
the priorities from the `shutdownGracePeriodByPodPriority` to the NodeMaintenance, just without the
`shutdownGracePeriodSeconds`. This would give applications a chance to react and gracefully leave
the node without a timeout. Pod Selection would ensure that user workloads are terminated first and
critical pods are terminated last.
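A sketch of such a kubelet configuration, assuming the proposed `preferNodeMaintenanceDuringGracefulShutdown` option is added next to the existing fields; the option does not exist today and the values are illustrative:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Proposed option (does not exist today): create a NodeMaintenance and block the shutdown
# instead of relying only on the time-bounded Graceful Node Shutdown.
preferNodeMaintenanceDuringGracefulShutdown: true
# Existing fields: only the priorities would be passed to the NodeMaintenance,
# without the shutdownGracePeriodSeconds.
shutdownGracePeriodByPodPriority:
- priority: 2000000000
  shutdownGracePeriodSeconds: 20
- priority: 0
  shutdownGracePeriodSeconds: 120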
By default, all user workloads will be asked to terminate at once. The Evacuation API ensures that an evacuator is selected or an eviction API is called. This should result in a fast start of a pod termination. NodeMaintenance could then be used even by spot instances.
The NodeMaintenance object should survive kubelet restarts, and the kubelet would always know if the node is under shutdown (maintenance). The cluster admin would have to remove the NodeMaintenance object after the node restart to indicate that the node is healthy and can run pods again. Admins are expected to deal with the lifecycle of planned NodeMaintenances, so reacting to the unplanned one should not be a big issue.
If there is no connection to the apiserver (apiserver down, network issues, etc.) and the NodeMaintenance object cannot be created, we would fall back to the original behavior of Graceful Node Shutdown feature. If the connection is restored, we would stop the Graceful Node Shutdown and proceed with the NodeMaintenance.
The NodeMaintenance would ensure that all pods are removed. This also includes the DaemonSet and static pods.
Currently, there is no standard solution for terminating static pods. We can advertise what state each node should be in, declaratively with NodeMaintenance. This can include static pods as well.
Since static pods usually run the most critical workloads, they should be terminated last according to Pod Selection.
Similar to DaemonSets, static pods should be tied to the node lifecycle because they typically run
critical workloads where availability is paramount. Therefore, the kubelet should respond to the
Evacuation only if there is a NodeMaintenance happening on that node and the `Static` pod is in the
`drainTargets`. For example, if we observe the following NodeMaintenance:
apiVersion: v1alpha1
kind: NodeMaintenance
...
status:
nodeStatuses:
- nodeRef:
name: six
drainTargets:
- podPriority: 2147483647
podType: Default
- podPriority: 2147483647
podType: DaemonSet
- podPriority: 7000
podType: Static
...
To fulfil the Evacuation API, the kubelet should register itself as a controller evacuator. To do this, it should ensure that the following annotation is present on its static pods.
evacuation.coordination.k8s.io/priority_kubelet.k8s.io: "10000/controller"
The kubelet should respond to the Evacuation object when it observes its own class
(`kubelet.k8s.io`) in `.status.activeEvacuatorClass`.
For the above node maintenance, the kubelet should not react to Evacuations of static pods with a
priority greater than 7000. This state should not normally occur, as Evacuation requests should be
coordinated with NodeMaintenance. If it does occur, we should not encourage this flow by updating
the `.status.activeEvacuatorCompleted` field, although it is required to update this field for
normal workloads.
If the static pods have a priority equal to or less than 7000, the Evacuation status should be updated appropriately as follows, and the targeted pod should be terminated by the kubelet:
apiVersion: v1alpha1
kind: Evacuation
metadata:
finalizers:
evacuation.coordination.k8s.io/instigator_nodemaintenance.k8s.io
labels:
app: critical-static-workload
name: 08deef1c-1838-42a5-a3a8-3a6d0558c7f9-critical-static-workload
namespace: critical-workloads
spec:
podRef:
name: critical-static-workload
uid: 08deef1c-1838-42a5-a3a8-3a6d0558c7f9
progressDeadlineSeconds: 1800
evacuators:
- evacuatorClass: kubelet.k8s.io
priority: 10000
role: controller
status:
activeEvacuatorClass: kubelet.k8s.io
activeEvacuatorCompleted: false
evacuationProgressTimestamp: "2024-04-22T22:10:05Z"
expectedEvacuationFinishTime: "2024-04-22T22:11:05Z" # now + terminationGracePeriodSeconds
failedEvictionCounter: 0
message: "critical-static-workload is terminating the pod due to node maintenance (OS upgrade)."
conditions: []
Once the pod is terminated and removed from the node, it should not be started on the node by the kubelet again until the node maintenance is complete.
[ ] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
- `<package>`: `<date>` - `<test coverage>`
- :
- :
- Feature gate
  - Feature gate name: DeclarativeNodeMaintenance - this feature gate enables the NodeMaintenance
    API and the node maintenance controller, which creates `Evacuation` objects.
  - Components depending on the feature gate: kube-apiserver, kube-controller-manager
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
- Event Reason:
- API .status
- Condition name:
- Other field:
- Other (treat as last resort)
- Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name:
- [Optional] Aggregation method:
- Components exposing the metric:
- Other (treat as last resort)
- Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
We could implement the NodeMaintenance or Evacuation API out-of-tree first as a CRD.
The KEP aims to solve graceful termination of any pod in the cluster. This is not possible with a 3rd party CRD as we need an integration with core components.
- We would like to solve the lifecycle of static pods during a node maintenance. This means that
  static pods should be terminated during the drain according to the `drainPlan`, and they should
  stay terminated after the kubelet restart if the node is still under maintenance. This requires
  integration with the kubelet. See kubelet: Static Pods for more details.
- We would like to improve the Graceful Node Shutdown feature. Terminating pods via NodeMaintenance
  will improve application safety and availability. It will also improve the reliability of the
  Graceful Node Shutdown feature. However, this also requires the kubelet to interact with a
  NodeMaintenance. See kubelet and kubelet: Graceful Node Shutdown for more details.
- We would like to also solve the lifecycle of DaemonSet pods during the NodeMaintenance. Usually these pods run important or critical services. These should be terminated at the right time during the node drain. To solve this, integration with NodeMaintenance is required. See DaemonSet Controller for more details.
Also, one of the disadvantages of using a CRD is that it would be more difficult to get real-world adoption and thus important feedback on this feature. This is mainly because the NodeMaintenance feature coordinates the node drain and provides good observability of the whole process. Third-party components that face both cluster admins and application developers can depend on this feature, use it, and build on top of it.
As an alternative, it would be possible to signal the node maintenance by marking the node object instead of introducing a new API. But, it is probably better to decouple this from the node for reasons of extensibility and complexity.
Advantages of the NodeMaintenance API approach:
- It allows us to implement incremental scale down of pods by various attributes according to a drainPlan across multiple nodes.
- There may be additional features that can be added to the NodeMaintenance in the future.
- It helps to decouple RBAC permissions and general update responsibility from the node object.
- It is easier to manage a NodeMaintenance lifecycle compared to the node object.
- Two or more different actors may want to maintain the same node in two different overlapping time slots. Creating two different NodeMaintenance objects would help with tracking each maintenance along with the reason behind it.
- Observability is better achieved with an additional object.
To signal the start of the eviction, we could simply taint a node with the `NoExecute` taint. This
taint should be easily recognizable and have a standard name, such as
`node.kubernetes.io/maintenance`. Other actors could observe the creation of such a taint and
migrate or delete the pod. To ensure pods are not removed prematurely, application owners would
have to set a toleration on their pods for this maintenance taint. Such applications could also set
`.spec.tolerations[].tolerationSeconds`, which would give a deadline for the pods to be removed by
the NoExecuteTaintManager.
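Under this alternative, the taint and a pod's opt-in toleration might look roughly as follows; the taint key is the standard name suggested above, and the node name, pod name, and one-hour deadline are illustrative:

apiVersion: v1
kind: Node
metadata:
  name: worker-9
spec:
  taints:
  - key: node.kubernetes.io/maintenance
    effect: NoExecute
---
apiVersion: v1
kind: Pod
metadata:
  name: sensitive-app-0
spec:
  tolerations:
  # Tolerate the maintenance taint for up to one hour before the NoExecuteTaintManager
  # removes the pod.
  - key: node.kubernetes.io/maintenance
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 3600
  ...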
This approach has the following disadvantages:
- Taints and tolerations do not support PDBs, which is the main mechanism for preventing voluntary disruptions. People who want to avoid the disruptions caused by the maintenance taint would have to specify the toleration in the pod definition and ensure it is present at all times. This would also have an impact on the controllers, who would have to pollute the pod definitions with these tolerations, even though the users did not specify them in their pod template. The controllers could override users' tolerations, which the users might not be happy about. It is also hard to make such behaviors consistent across all the controllers.
- Taints are used as a mechanism for involuntary disruption; to get pods out of the node for some reason (e.g. node is not ready). Modifying the taint mechanism to be less harmful (e.g. by adding a PDB support) is not possible due to the original requirements.
- It is not possible to incrementally scale down according to pod priorities, labels, etc.
These names were considered as alternatives to NodeMaintenance:
- NodeIsolation
- NodeDetachment
- NodeClearance
- NodeQuarantine
- NodeDisengagement
- NodeVacation