
Add clarifications to RFC
AndrewSirenko committed Jun 12, 2024
1 parent 955acf6 commit 9b64f29
Showing 1 changed file with 15 additions and 14 deletions: designs/statefulset-disruption.md
… Today, customers with default Karpenter `v0.37.0` and EBS CSI Driver `v1.31` …

**Problem A. If step 2 doesn't happen, there will be a 6+ minute delay.**

If volumes are not *confirmed* as unmounted *by the CSI Node Service*, Kubernetes cannot be sure the volumes are not in use and will wait 6 minutes before treating them as unmounted. See the [EBS CSI 6-minute delay FAQ](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes)

Relevant PVC Event (Note the 6+ minute delay):

Warning  FailedAttachVolume  6m51s  attachdetach-controller  Multi-Attach error …

**Problem B. If step 3 doesn't happen before step 4, there will be a 1+ minute delay**

If Karpenter calls EC2 TerminateInstance **before** the EBS CSI Driver controller pod's EC2 DetachVolume calls finish, then the volumes won't be detached **until the old instance terminates**. This delay depends on the instance type: about 1 minute for `m5a.large`, and up to 15 minutes for certain GPU/Windows instances.
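To make the required ordering concrete, here is a minimal Go sketch of a B1-style wait: block until no `VolumeAttachment` still binds a volume to the node before issuing EC2 TerminateInstances. This is one way to observe detach completion; the helper name, polling interval, and timeout are assumptions for this sketch, not Karpenter's actual implementation.

```go
// Illustrative sketch of the B1 ordering: before calling EC2
// TerminateInstances, wait until the AttachDetach Controller has finished
// detaching every volume from the node.
package termination

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForVolumeDetach polls until no VolumeAttachment still binds a volume
// to nodeName, i.e. the EBS CSI controller's EC2 DetachVolume calls finished.
func waitForVolumeDetach(ctx context.Context, kube kubernetes.Interface, nodeName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			vas, err := kube.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, va := range vas.Items {
				if va.Spec.NodeName == nodeName && va.Status.Attached {
					return false, nil // a volume is still attached; keep waiting
				}
			}
			return true, nil // all detached; EC2 TerminateInstances is now safe
		})
}
```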

Relevant PVC Events (Note the 1-min delay between `Multi-Attach error` AND `AttachVolume.Attach failed`):

… Customers can determine which delay they are suffering from based on whether …

**A1: To solve A long-term, Kubernetes should ensure volumes are unmounted before CSI Driver Node pods are terminated.**

**A2: To solve A today, Karpenter should confirm that volumes are not in use, and that the AttachDetach Controller knows this, before deleting the node's finalizer.**

**B1: To solve B today, Karpenter should wait for volumes to detach before terminating the node.**

See [WIP Kubernetes 1.31 A1 solution in PR #125070](https://github.com/kubernetes/kubernetes/pull/125070)

See [a proof-of-concept implementation of A2 & B1 in PR #1294](https://github.com/kubernetes-sigs/karpenter/pull/1294)

… Finally, we should add the following EBS x Karpenter end-to-end test in karpenter …

## Problem A. Preventing 6+ minute delays

If `ReadWriteOnce` volumes are not unmounted *by the CSI Node Service*, Kubernetes cannot confirm the volumes are not in use and safe to attach to a new node. Kubernetes will wait 6 minutes[^4] and ensure the Node is `unhealthy` before treating the volume as unmounted and moving forward with a volume detach. See the [EBS CSI 6-minute delay FAQ](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes)
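As a reading aid, the force-detach rule can be summarized in a few lines of Go. This is a simplified paraphrase of the behavior described above and in the linked FAQ, not the actual AttachDetach Controller source.

```go
package attachdetach

import "time"

// maxWaitForUnmountedVolume mirrors the 6-minute force-detach timer
// described above.
const maxWaitForUnmountedVolume = 6 * time.Minute

// canForceDetach paraphrases the rule: detach immediately once kubelet
// stops reporting the volume in node.status.volumesInUse; otherwise only
// force-detach after the timer expires on a node known to be unhealthy.
func canForceDetach(volumeInUse, nodeHealthy bool, detachRequested time.Time) bool {
	if !volumeInUse {
		return true
	}
	return !nodeHealthy && time.Since(detachRequested) > maxWaitForUnmountedVolume
}
```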

The cluster operator will see a `FailedAttachVolume` event with a `Multi-Attach error`.

… However, the shutdown manager does not wait for all volumes to be unmounted, …

![kubelet_shutdown_race](https://github.com/AndrewSirenko/karpenter-provider-aws/assets/68304519/9ccb6c94-ef02-40af-84bf-9471827f3303)

Today, the EBS CSI Driver attempts to work around this race by utilizing a [PreStop hook](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-steps-can-be-taken-to-mitigate-this-issue) that tries to keep the Node Service alive until all volumes are unmounted. We will explore the shortcomings of this solution later.
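The gist of such a hook, as a hedged sketch: poll until kubelet has unmounted everything, and let the pod's terminationGracePeriod bound the wait. The `listMountedEBSVolumes` helper is hypothetical; the real hook's logic lives in the driver repo linked above.

```go
package hook

import (
	"context"
	"time"
)

// listMountedEBSVolumes is a hypothetical stand-in for the real check the
// driver performs (inspecting attached/mounted volumes on the node).
func listMountedEBSVolumes() ([]string, error) { return nil, nil }

// preStop keeps the Node Service alive until all volumes are unmounted or
// the pod's terminationGracePeriod (the ctx deadline) runs out.
func preStop(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		mounted, err := listMountedEBSVolumes()
		if err != nil {
			return err
		}
		if len(mounted) == 0 {
			return nil // kubelet finished unmounting; safe to exit
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // grace period expired before unmounts finished
		case <-ticker.C:
		}
	}
}
```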

*Note: In addition to the Kubelet race, this delay can happen if stateful pods outlive the CSI Node Pod, e.g. the operator has a StatefulSet that tolerates all taints and has a longer terminationGracePeriod than the EBS CSI Driver.*

### Solutions:

We should:

#### A1: Fix race at Kubelet level

The Kubelet Shutdown Manager should not kill CSI Driver Pods before volumes are unmounted. This change must be made in `kubernetes/kubernetes` itself.

See: [Active PR #125070](https://github.com/kubernetes/kubernetes/pull/125070) for more information.
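Conceptually, the fix reorders kubelet's graceful shutdown roughly as follows. This is a sketch of the intent only, not the actual mechanics of PR #125070; the types and helpers are stand-ins.

```go
package shutdown

// Pod is a stand-in type for this sketch.
type Pod struct{ Name string }

func stopAll(pods []Pod)         { /* SIGTERM each pod and wait for exit */ }
func waitForUnmounts(pods []Pod) { /* block until the volume manager unmounts their volumes */ }

// gracefulShutdown sketches the intended ordering: regular pods stop and
// their volumes unmount BEFORE critical pods (e.g. the CSI Node Service)
// are killed, closing the race described in Problem A.
func gracefulShutdown(regular, critical []Pod) {
	stopAll(regular)
	waitForUnmounts(regular)
	stopAll(critical)
}
```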

**Pros:**

…

**Cons:**

- Unavailable until merged into a Kubernetes release (likely `v1.31`; possibly backportable)

#### A2: Taint node as `out-of-service` after termination

While this race should be fixed at the Kubelet level long-term, we still need a solution for earlier versions of Kubernetes.

One solution is to mark terminated nodes with the `out-of-service` taint.

In `v1.26`, Kubernetes enabled the [Non-graceful node shutdown handling feature](https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown) by default. This introduces the `node.kubernetes.io/out-of-service` taint, which can be used to mark a node as terminated.

Once Karpenter confirms that an instance is terminated, adding this taint to the node object will allow the AttachDetach Controller to treat the volume as not in use, preventing the 6+ minute delay.
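For illustration, applying the taint with client-go looks roughly like this. This is a sketch only; Karpenter's actual implementation is in the proof-of-concept PR linked below. The taint key, value, and effect match the documented non-graceful node shutdown taint.

```go
package termination

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintOutOfService marks a node whose instance is already confirmed
// terminated, letting the AttachDetach Controller treat its volumes as no
// longer in use without waiting out the 6-minute timer.
// (Production code would check for an existing taint and use a patch
// rather than a blind update.)
func taintOutOfService(ctx context.Context, kube kubernetes.Interface, nodeName string) error {
	node, err := kube.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	_, err = kube.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```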

By modifying Karpenter to apply this taint and then waiting ~5 seconds until the volumes are marked as not in use, the following sequence will occur:

![taint_solution](https://github.com/AndrewSirenko/karpenter-provider-aws/assets/68304519/d1bc5704-7592-4aa1-a72e-83d33f824b8b)

… See [this commit](https://github.com/kubernetes-sigs/karpenter/pull/1294/commits …
**Pros:**

- Solves 6+ minute delays by default
- No additional latency before starting instance termination.
- Minor latency in deleting the Node's finalizer IF Karpenter does not treat a `shutting down` instance as terminated. (In my tests, a 5-second wait was sufficient for the AttachDetach Controller to recognize the out-of-service taint and allow the volume detach.)
- If Kubernetes makes the 6-minute ForceDetach timer infinite by default (as currently planned for Kubernetes v1.32), the out-of-service taint will be the only way to ensure the workload starts on another node

**Cons:**

- Unavailable until Kubernetes `v1.26`. Customers running Karpenter on Kubernetes ≤ `v1.25` will require solution B1 to be implemented.
- Requires Karpenter-Provider-AWS to not treat `Shutting Down` instances as terminated (though Karpenter had already planned on this via [PR #5979](https://github.com/aws/karpenter-provider-aws/pull/5979))
- Problem B's delay still occurs, because volumes will not be detached until the consolidated instance terminates.

#### Alternatives Considered

