
Add clarifications to RFC
AndrewSirenko committed Jun 12, 2024
1 parent 955acf6 commit 9b64f29
Showing 1 changed file with 15 additions and 14 deletions: designs/statefulset-disruption.md
… Today, customers with default Karpenter `v0.37.0` and EBS CSI Driver `v1.31` …

**Problem A. If step 2 doesn't happen, there will be a 6+ minute delay.**

If volumes are not *confirmed* as unmounted *by the CSI Node Service*, Kubernetes cannot be sure the volumes are not in use and will wait 6 minutes before treating them as unmounted. See the [EBS CSI 6-minute delay FAQ](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes)

Relevant PVC Event (Note the 6+ minute delay):

Warning  FailedAttachVolume  6m51s  attachdetach-controller  Multi-Attach error …

**Problem B. If step 3 doesn't happen before step 4, there will be a 1+ minute delay**

If Karpenter calls EC2 TerminateInstance **before** the EBS CSI Driver controller pod's EC2 DetachVolume calls finish, then the volumes won't be detached **until the old instance terminates**. This delay depends on the instance type: about 1 minute for `m5a.large`, and up to 15 minutes for certain GPU/Windows instances.
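To make the required ordering concrete, here is a minimal Go sketch of a B1-style wait: block until no `VolumeAttachment` still binds a volume to the node before issuing EC2 TerminateInstances. This is one way to observe detach completion; the helper name, polling interval, and timeout are assumptions for this sketch, not Karpenter's actual implementation.

```go
// Illustrative sketch of the B1 ordering: before calling EC2
// TerminateInstances, wait until the AttachDetach Controller has finished
// detaching every volume from the node.
package termination

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForVolumeDetach polls until no VolumeAttachment still binds a volume
// to nodeName, i.e. the EBS CSI controller's EC2 DetachVolume calls finished.
func waitForVolumeDetach(ctx context.Context, kube kubernetes.Interface, nodeName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true,
		func(ctx context.Context) (bool, error) {
			vas, err := kube.StorageV1().VolumeAttachments().List(ctx, metav1.ListOptions{})
			if err != nil {
				return false, err
			}
			for _, va := range vas.Items {
				if va.Spec.NodeName == nodeName && va.Status.Attached {
					return false, nil // a volume is still attached; keep waiting
				}
			}
			return true, nil // all detached; EC2 TerminateInstances is now safe
		})
}
```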

Relevant PVC Events (Note the 1-min delay between `Multi-Attach error` AND `AttachVolume.Attach failed`):

… Customers can determine which delay they are suffering from based on whether …

**A1: To solve A long-term, Kubernetes should ensure volumes are unmounted before CSI Driver Node pods are terminated.**

**A2: To solve A today, Karpenter should confirm that volumes are not in use, and that the AttachDetach Controller knows this, before deleting the node's finalizer.**

**B1: To solve B today, Karpenter should wait for volumes to detach before terminating the node.**

See [WIP Kubernetes 1.31 A1 solution in PR #125070](https://github.com/kubernetes/kubernetes/pull/125070)

See [a proof-of-concept implementation of A2 & B1 in PR #1294](https://github.com/kubernetes-sigs/karpenter/pull/1294)

… Finally, we should add the following EBS x Karpenter end-to-end test in karpenter …

## Problem A. Preventing 6+ minute delays

If `ReadWriteOnce` volumes are not unmounted *by the CSI Node Service*, Kubernetes cannot confirm the volumes are not in use and safe to attach to a new node. Kubernetes will wait 6 minutes[^4] and ensure the Node is `unhealthy` before treating the volume as unmounted and moving forward with a volume detach. See the [EBS CSI 6-minute delay FAQ](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#6-minute-delays-in-attaching-volumes)
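As a reading aid, the force-detach rule can be summarized in a few lines of Go. This is a simplified paraphrase of the behavior described above and in the linked FAQ, not the actual AttachDetach Controller source.

```go
package attachdetach

import "time"

// maxWaitForUnmountedVolume mirrors the 6-minute force-detach timer
// described above.
const maxWaitForUnmountedVolume = 6 * time.Minute

// canForceDetach paraphrases the rule: detach immediately once kubelet
// stops reporting the volume in node.status.volumesInUse; otherwise only
// force-detach after the timer expires on a node known to be unhealthy.
func canForceDetach(volumeInUse, nodeHealthy bool, detachRequested time.Time) bool {
	if !volumeInUse {
		return true
	}
	return !nodeHealthy && time.Since(detachRequested) > maxWaitForUnmountedVolume
}
```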

The cluster operator will see a `FailedAttachVolume` event with a `Multi-Attach error`.

… However, the shutdown manager does not wait for all volumes to be unmounted, …

![kubelet_shutdown_race](https://github.com/AndrewSirenko/karpenter-provider-aws/assets/68304519/9ccb6c94-ef02-40af-84bf-9471827f3303)

Today, the EBS CSI Driver attempts to work around this race by utilizing a [PreStop hook](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-steps-can-be-taken-to-mitigate-this-issue) that tries to keep the Node Service alive until all volumes are unmounted. We will explore the shortcomings of this solution later.
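The gist of such a hook, as a hedged sketch: poll until kubelet has unmounted everything, and let the pod's terminationGracePeriod bound the wait. The `listMountedEBSVolumes` helper is hypothetical; the real hook's logic lives in the driver repo linked above.

```go
package hook

import (
	"context"
	"time"
)

// listMountedEBSVolumes is a hypothetical stand-in for the real check the
// driver performs (inspecting attached/mounted volumes on the node).
func listMountedEBSVolumes() ([]string, error) { return nil, nil }

// preStop keeps the Node Service alive until all volumes are unmounted or
// the pod's terminationGracePeriod (the ctx deadline) runs out.
func preStop(ctx context.Context) error {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		mounted, err := listMountedEBSVolumes()
		if err != nil {
			return err
		}
		if len(mounted) == 0 {
			return nil // kubelet finished unmounting; safe to exit
		}
		select {
		case <-ctx.Done():
			return ctx.Err() // grace period expired before unmounts finished
		case <-ticker.C:
		}
	}
}
```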

*Note: In addition to the Kubelet race, this delay can happen if stateful pods outlive the CSI Node Pod, e.g. the operator has a StatefulSet that tolerates all taints and has a longer terminationGracePeriod than the EBS CSI Driver.*

### Solutions:

We should:

#### A1: Fix race at Kubelet level

The Kubelet Shutdown Manager should not kill CSI Driver Pods before volumes are unmounted. This change must be made in `kubernetes/kubernetes` itself.

See: [Active PR #125070](https://github.com/kubernetes/kubernetes/pull/125070) for more information.
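Conceptually, the fix reorders kubelet's graceful shutdown roughly as follows. This is a sketch of the intent only, not the actual mechanics of PR #125070; the types and helpers are stand-ins.

```go
package shutdown

// Pod is a stand-in type for this sketch.
type Pod struct{ Name string }

func stopAll(pods []Pod)         { /* SIGTERM each pod and wait for exit */ }
func waitForUnmounts(pods []Pod) { /* block until the volume manager unmounts their volumes */ }

// gracefulShutdown sketches the intended ordering: regular pods stop and
// their volumes unmount BEFORE critical pods (e.g. the CSI Node Service)
// are killed, closing the race described in Problem A.
func gracefulShutdown(regular, critical []Pod) {
	stopAll(regular)
	waitForUnmounts(regular)
	stopAll(critical)
}
```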

**Pros:**

…

**Cons:**

- Unavailable until merged into a Kubernetes release (likely `v1.31`; possibly backportable)

#### A2: Taint node as `out-of-service` after termination

While this race should be fixed at the Kubelet level long-term, we still need a solution for earlier versions of Kubernetes.

One solution is to mark terminated nodes with the `out-of-service` taint.

In `v1.26`, Kubernetes enabled the [Non-graceful node shutdown handling feature](https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#non-graceful-node-shutdown) by default. This introduces the `node.kubernetes.io/out-of-service` taint, which can be used to mark a node as terminated.

Once Karpenter confirms that an instance is terminated, adding this taint to the node object will allow the AttachDetach Controller to treat the volume as not in use, preventing the 6+ minute delay.
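For illustration, applying the taint with client-go looks roughly like this. This is a sketch only; Karpenter's actual implementation is in the proof-of-concept PR linked below. The taint key, value, and effect match the documented non-graceful node shutdown taint.

```go
package termination

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintOutOfService marks a node whose instance is already confirmed
// terminated, letting the AttachDetach Controller treat its volumes as no
// longer in use without waiting out the 6-minute timer.
// (Production code would check for an existing taint and use a patch
// rather than a blind update.)
func taintOutOfService(ctx context.Context, kube kubernetes.Interface, nodeName string) error {
	node, err := kube.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "node.kubernetes.io/out-of-service",
		Value:  "nodeshutdown",
		Effect: corev1.TaintEffectNoExecute,
	})
	_, err = kube.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```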

By modifying Karpenter to apply this taint and then waiting ~5 seconds until the volumes are marked as not in use, the following sequence will occur:

![taint_solution](https://github.com/AndrewSirenko/karpenter-provider-aws/assets/68304519/d1bc5704-7592-4aa1-a72e-83d33f824b8b)

… See [this commit](https://github.com/kubernetes-sigs/karpenter/pull/1294/commits …
**Pros:**

- Solves 6+ minute delays by default
- No additional latency before starting instance termination.
- Minor latency in deleting the Node's finalizer IF Karpenter does not treat a `shutting down` instance as terminated. (In my tests, a 5-second wait was sufficient for the AttachDetach Controller to recognize the out-of-service taint and allow the volume detach.)
- If Kubernetes makes the 6-minute ForceDetach timer infinite by default (as currently planned for Kubernetes v1.32), the out-of-service taint will be the only way to ensure the workload starts on another node

**Cons:**

- Unavailable until Kubernetes `v1.26`. Customers running Karpenter on Kubernetes ≤ `v1.25` will require solution B1 to be implemented.
- Requires Karpenter-Provider-AWS to not treat `Shutting Down` instances as terminated (though Karpenter had already planned on this via [PR #5979](https://github.com/aws/karpenter-provider-aws/pull/5979))
- Problem B's delay still occurs, because volumes will not be detached until the consolidated instance terminates.

#### Alternatives Considered

