
Volume still hangs on Karpenter Node Consolidation/Termination #1955

Open
levanlongktmt opened this issue Mar 4, 2024 · 83 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@levanlongktmt

levanlongktmt commented Mar 4, 2024

/kind bug

What happened?
As discussed in #1665, @torredil said this was fixed in v1.27 (#1665 (comment)), but we still hit the problem with v1.28.

  • A pod using volume pv-A is running on node N1
  • Karpenter terminates the pod and terminates node N1
  • K8s starts a new pod and tries to attach volume pv-A, but it still has to wait 6 minutes for the volume to be released and attached to the new pod

What you expected to happen?

  • After the old pod has been terminated, pv-A should be released and able to attach to the new pod

How to reproduce it (as minimally and precisely as possible)?

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: dev
spec:
  version: 8.12.2
  volumeClaimDeletePolicy: DeleteOnScaledownAndClusterDeletion
  updateStrategy:
    changeBudget:
      maxSurge: 2
      maxUnavailable: 1
  nodeSets:
  - name: default
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data # Do not change this name unless you set up a volume mount for the data path.
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
    podTemplate:
      spec:
        nodeSelector:
          kubernetes.io/arch: arm64
          topology.kubernetes.io/zone: eu-central-1a
        containers:
        - name: elasticsearch
          env:
            - name: ES_JAVA_OPTS
              value: -Xms4g -Xmx4g
          resources:
            requests:
              memory: 5Gi
              cpu: 1
            limits:
              memory: 5Gi
              cpu: 2
    config:
      node.store.allow_mmap: false
  • Trigger a spot instance termination or just delete one EC2 instance
  • The node is removed from k8s very quickly, the old pod is Terminated, and k8s starts a new pod
  • The pod is stuck for 6 minutes with the error Multi-Attach error for volume "pvc-xxxxx-xxxxx-xxx" Volume is already exclusively attached to one node and can't be attached to another
  • After 6 minutes the new pod can attach the volume
  • Here are the logs of ebs-csi-controller:
I0302 06:12:10.305080       1 controller.go:430] "ControllerPublishVolume: attached" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1" devicePath="/dev/xvdaa"
<< at 06:14 the node has been terminated but no logs here >>
I0302 06:20:18.486042       1 controller.go:471] "ControllerUnpublishVolume: detaching" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1"
I0302 06:20:18.584737       1 cloud.go:792] "DetachDisk: called on non-attached volume" volumeID="vol-02b33186429105461"
I0302 06:20:18.807752       1 controller.go:474] "ControllerUnpublishVolume: attachment not found" volumeID="vol-02b33186429105461" nodeID="i-0715ec90e486bb8a1"
I0302 06:20:19.124534       1 controller.go:421] "ControllerPublishVolume: attaching" volumeID="vol-02b33186429105461" nodeID="i-0ee2a470112401ffb"
I0302 06:20:20.635493       1 controller.go:430] "ControllerPublishVolume: attached" volumeID="vol-02b33186429105461" nodeID="i-0ee2a470112401ffb" devicePath="/dev/xvdaa"

Anything else we need to know?:
I set up the CSI driver using the EKS add-on.
Environment

  • Kubernetes version (use kubectl version):
Client Version: v1.29.1
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.0-eks-c417bb3
  • Driver version: v1.28.0-eksbuild.1
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Mar 4, 2024
@levanlongktmt
Author

I updated to Karpenter 0.35 and switched to the AL2023 image; the problem still happens.

@levanlongktmt
Author

Some update: it seems to be caused by the Elasticsearch StatefulSet. The ES pod is very slow to terminate, so it's possible the driver pod is killed before the ES pod is, and then the volume is not released.

@levanlongktmt
Author

Update: when I set PRE_STOP_ADDITIONAL_WAIT_SECONDS to 5 seconds, the new pod can attach the PVC normally, so I think the long pod-stop delay is the reason the volume gets stuck.

@levanlongktmt
Author

Update: when I set PRE_STOP_ADDITIONAL_WAIT_SECONDS to 5 seconds, the new pod can attach the PVC normally, so I think the long pod-stop delay is the reason the volume gets stuck.

Nope, in another try the PV is still stuck 🤦

@torredil
Member

torredil commented Mar 4, 2024

Hey there @levanlongktmt thank you for reporting this.

It appears your spot instances are being ungracefully terminated. Take a look at the documentation for guidance on navigating this issue: 6-Minute Delays in Attaching Volumes - What steps can be taken to mitigate this issue?

If you are still running into delays but have already configured the Kubelet and enabled interruption handling in Karpenter, please let us know. Thanks!
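
For reference, the kubelet-side mitigation from that FAQ amounts to enabling graceful node shutdown in the node's user data, along these lines (a sketch assembled from the snippets posted later in this thread; the exact timeout values are illustrative, not recommendations):

#!/bin/bash
# Give systemd-logind a long enough inhibitor delay, then enable kubelet
# graceful node shutdown so pods (and the EBS CSI node pod) get time to stop.
echo -e "InhibitDelayMaxSec=45\n" >> /etc/systemd/logind.conf
systemctl restart systemd-logind
echo "$(jq ".shutdownGracePeriod=\"45s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
systemctl restart kubelet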

@levanlongktmt
Author

@torredil oops... my user data was missing the last line:

systemctl restart kubelet

let me try with it

@levanlongktmt
Author

hey @torredil I tried but still no luck :(
I already added SQS to handle interruptions, so when I trigger an interruption in the AWS console, Karpenter handles it and launches a new instance very quickly. I also followed the logs of ebs-csi-node, but it didn't show anything related to the preStop hook; here are all of its logs:

I0304 15:40:38.249081       1 driver.go:83] "Driver Information" Driver="ebs.csi.aws.com" Version="v1.28.0"
I0304 15:40:38.250134       1 node.go:93] "regionFromSession Node service" region=""
I0304 15:40:38.252657       1 metadata.go:85] "retrieving instance data from ec2 metadata"
I0304 15:40:38.257835       1 metadata.go:92] "ec2 metadata is available"
I0304 15:40:38.258488       1 metadata_ec2.go:25] "Retrieving EC2 instance identity metadata" regionFromSession=""
I0304 15:47:29.651847       1 mount_linux.go:243] Detected OS without systemd

After the graceful shutdown of the Elasticsearch pod, the ebs-csi-node pod was killed immediately, so the preStop hook was ignored.
I thought it might be a bug in AL2023, so I switched back to AL2, but still no luck.

@levanlongktmt
Author

I will try setting InhibitDelayMaxSec to 60 (45 + 15).

@levanlongktmt
Author

Still no luck; the volume is still stuck for 6 minutes and the preStop hook is not working 😭

@torredil
Member

torredil commented Mar 4, 2024

@levanlongktmt Thanks for the followup info. Something that came to mind here is that it might be possible for newer versions of Karpenter to be incompatible with the pre-stop lifecycle hook due to this change: kubernetes-sigs/karpenter#508.

You'll notice that the LCH won't run if the following taint is not present on the node during a termination event:

node.kubernetes.io/unschedulable

We'll take a closer look and report back with more details or a fix. Thank you!
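
One way to check this during a disruption is to watch the taints Karpenter puts on the node while it drains (node name is a placeholder):

# If node.kubernetes.io/unschedulable never shows up here during the drain,
# the driver's pre-stop hook will be skipped.
kubectl get node <node-name> -o jsonpath='{.spec.taints}{"\n"}'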

@levanlongktmt
Author

@torredil here are the logs of Karpenter and the Elasticsearch pod; as I see it:

  • At 16:22:20 Karpenter gets the interruption message
  • Very quickly Karpenter taints the node and the Elasticsearch pod gets the shutdown signal
  • After 46 seconds the Elasticsearch pod is terminated
  • At 16:23:09 Karpenter starts deleting the node, and the node is deleted very quickly

Logs of Karpenter

{"level":"INFO","time":"2024-03-04T16:22:20.270Z","logger":"controller.interruption","message":"initiating delete from interruption message","commit":"2c8f2a5","queue":"Karpenter-kollekt-eks","messageKind":"SpotInterruptionKind","nodeclaim":"default-2bd27","action":"CordonAndDrain","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:22:20.336Z","logger":"controller.node.termination","message":"tainted node","commit":"2c8f2a5","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:22:21.364Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"24.462475ms"}
{"level":"INFO","time":"2024-03-04T16:22:21.364Z","logger":"controller.provisioner","message":"computed new nodeclaim(s) to fit pod(s)","commit":"2c8f2a5","nodeclaims":1,"pods":1}
{"level":"INFO","time":"2024-03-04T16:22:21.379Z","logger":"controller.provisioner","message":"created nodeclaim","commit":"2c8f2a5","nodepool":"default","nodeclaim":"default-pk44s","requests":{"cpu":"680m","memory":"5240Mi","pods":"5"},"instance-types":"a1.2xlarge, a1.4xlarge, a1.metal, a1.xlarge, c6g.12xlarge and 91 other(s)"}
{"level":"INFO","time":"2024-03-04T16:22:23.782Z","logger":"controller.nodeclaim.lifecycle","message":"launched nodeclaim","commit":"2c8f2a5","nodeclaim":"default-pk44s","provider-id":"aws:///eu-central-1a/i-09a52eea14ff0b6c9","instance-type":"r6gd.medium","zone":"eu-central-1a","capacity-type":"spot","allocatable":{"cpu":"940m","ephemeral-storage":"17Gi","memory":"7075Mi","pods":"8","vpc.amazonaws.com/pod-eni":"4"}}
{"level":"INFO","time":"2024-03-04T16:22:31.356Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"16.586479ms"}
{"level":"INFO","time":"2024-03-04T16:22:41.353Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"13.509969ms"}
{"level":"INFO","time":"2024-03-04T16:22:51.355Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"14.534034ms"}
{"level":"INFO","time":"2024-03-04T16:22:58.280Z","logger":"controller.nodeclaim.lifecycle","message":"registered nodeclaim","commit":"2c8f2a5","nodeclaim":"default-pk44s","provider-id":"aws:///eu-central-1a/i-09a52eea14ff0b6c9","node":"i-09a52eea14ff0b6c9.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:23:01.356Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"13.781092ms"}
{"level":"INFO","time":"2024-03-04T16:23:09.571Z","logger":"controller.node.termination","message":"deleted node","commit":"2c8f2a5","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal"}
{"level":"INFO","time":"2024-03-04T16:23:09.944Z","logger":"controller.nodeclaim.termination","message":"deleted nodeclaim","commit":"2c8f2a5","nodeclaim":"default-2bd27","node":"i-0502b920c79f40b5a.eu-central-1.compute.internal","provider-id":"aws:///eu-central-1a/i-0502b920c79f40b5a"}
{"level":"INFO","time":"2024-03-04T16:23:10.455Z","logger":"controller.disruption","message":"pod \"datalayer/dev-es-default-0\" has a preferred Anti-Affinity which can prevent consolidation","commit":"2c8f2a5"}
{"level":"INFO","time":"2024-03-04T16:23:10.698Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"83.215948ms"}
{"level":"INFO","time":"2024-03-04T16:23:10.731Z","logger":"controller.nodeclaim.lifecycle","message":"initialized nodeclaim","commit":"2c8f2a5","nodeclaim":"default-pk44s","provider-id":"aws:///eu-central-1a/i-09a52eea14ff0b6c9","node":"i-09a52eea14ff0b6c9.eu-central-1.compute.internal","allocatable":{"cpu":"940m","ephemeral-storage":"18233774458","hugepages-1Gi":"0","hugepages-2Mi":"0","hugepages-32Mi":"0","hugepages-64Ki":"0","memory":"7552796Ki","pods":"8"}}
{"level":"INFO","time":"2024-03-04T16:23:12.487Z","logger":"controller.provisioner","message":"found provisionable pod(s)","commit":"2c8f2a5","pods":"datalayer/dev-es-default-0","duration":"12.009019ms"}

Logs of elasticsearch pod

{"@timestamp": "2024-03-04T16:22:20+00:00", "message": "retrieving node ID", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-03-04T16:22:21+00:00", "message": "initiating node shutdown", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp":"2024-03-04T16:22:21.338Z", "log.level": "INFO", "message":"creating shutdown record {nodeId=[qElhr0BPQnufjz5_CdDGMg], type=[RESTART], reason=[pre-stop hook]}", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][masterService#updateTask][T#2]","log.logger":"org.elasticsearch.xpack.shutdown.TransportPutShutdownNodeAction","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:22:21.530Z", "log.level": "INFO", "message":"Aborting health node task due to node [{dev-es-default-0}{qElhr0BPQnufjz5_CdDGMg}] shutting down.", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.health.node.selection.HealthNodeTaskExecutor","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:22:21.535Z", "log.level": "INFO", "message":"Starting node shutdown sequence for ML", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][clusterApplierService#updateTask][T#1]","log.logger":"org.elasticsearch.xpack.ml.MlLifeCycleService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp": "2024-03-04T16:22:21+00:00", "message": "waiting for node shutdown to complete", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp": "2024-03-04T16:22:21+00:00", "message": "delaying termination for 44 seconds", "ecs.version": "1.2.0", "event.dataset": "elasticsearch.pre-stop-hook"}
{"@timestamp":"2024-03-04T16:23:05.782Z", "log.level": "INFO", "message":"stopping ...", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.783Z", "log.level": "INFO", "message":"shutting down watcher thread", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[file-watcher[/usr/share/elasticsearch/config/operator/settings.json]]","log.logger":"org.elasticsearch.common.file.AbstractFileWatchingService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.785Z", "log.level": "INFO", "message":"watcher service stopped", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.common.file.AbstractFileWatchingService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.790Z", "log.level": "INFO", "message":"[controller/83] [Main.cc@176] ML controller exiting", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"ml-cpp-log-tail-thread","log.logger":"org.elasticsearch.xpack.ml.process.logging.CppLogMessageHandler","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.792Z", "log.level": "INFO", "message":"Native controller process has stopped - no new native processes can be started", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"ml-cpp-log-tail-thread","log.logger":"org.elasticsearch.xpack.ml.process.NativeController","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.793Z", "log.level": "INFO", "message":"stopping watch service, reason [shutdown initiated]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.xpack.watcher.WatcherService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:05.794Z", "log.level": "INFO", "message":"watcher has stopped and shutdown", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[dev-es-default-0][generic][T#3]","log.logger":"org.elasticsearch.xpack.watcher.WatcherLifeCycleService","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"timestamp": "2024-03-04T16:23:06+00:00", "message": "readiness probe failed", "curl_rc": "7"}
{"@timestamp":"2024-03-04T16:23:06.552Z", "log.level": "INFO", "message":"stopped", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.554Z", "log.level": "INFO", "message":"closing ...", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.576Z", "log.level": "INFO", "message":"evicted [0] entries from cache after reloading database [/tmp/elasticsearch-3976228296045036888/geoip-databases/qElhr0BPQnufjz5_CdDGMg/GeoLite2-Country.mmdb]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.577Z", "log.level": "INFO", "message":"evicted [0] entries from cache after reloading database [/tmp/elasticsearch-3976228296045036888/geoip-databases/qElhr0BPQnufjz5_CdDGMg/GeoLite2-ASN.mmdb]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.578Z", "log.level": "INFO", "message":"evicted [25] entries from cache after reloading database [/tmp/elasticsearch-3976228296045036888/geoip-databases/qElhr0BPQnufjz5_CdDGMg/GeoLite2-City.mmdb]", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.ingest.geoip.DatabaseReaderLazyLoader","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}
{"@timestamp":"2024-03-04T16:23:06.586Z", "log.level": "INFO", "message":"closed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch-shutdown","log.logger":"org.elasticsearch.node.Node","elasticsearch.cluster.uuid":"Z_NFF7KcT3OJ6ekUUoJ3Qg","elasticsearch.node.id":"qElhr0BPQnufjz5_CdDGMg","elasticsearch.node.name":"dev-es-default-0","elasticsearch.cluster.name":"dev"}

@levanlongktmt
Author

@torredil do you think this quick and dirty fix will work 😆?
(screenshot of the proposed change attached)

@levanlongktmt
Author

@torredil seemingly k8s does not call preStop, because if it did I would at least see the log PreStop: executing PreStop lifecycle hook, but I didn't see it
(screenshot attached)

@levanlongktmt
Author

@torredil any good news for this 😀?

@ConnorJC3
Contributor

Hi, sorry about the wait - your issue is probably caused by what #1969 solves - Karpenter changed the taints they used when draining nodes and our LCH needs to be changed to account for it. That fix should be available in the next release of the EBS CSI Driver, expected to happen later this month.

@levanlongktmt
Author

levanlongktmt commented Mar 18, 2024

Amazing, thanks so much @ConnorJC3 😍
I will test again when the new version is released.

@primeroz

@ConnorJC3 I was trying to understand in which version of Karpenter this changed. I was about to upgrade my CSI driver to pick up the fix, but I guess I will wait for the next release.

I see kubernetes-sigs/karpenter#508 was merged in October 🤔

@levanlongktmt
Author

@primeroz here is the list of affected Karpenter versions:
(screenshot of affected versions attached)

@alexandermarston
Contributor

v1.29.0 has been released, which contains the fix.

@primeroz

primeroz commented Mar 22, 2024

I just upgraded my dev cluster to 1.29.0 while using karpenter-aws 0.35.2, but I still have the same multi-attach problem as before:

  • Have a stateful workload running with a PVC mounted
  • Delete a node with kubectl delete node
    • Karpenter intercepts it through its finalizer and sets the taint
- effect: NoSchedule
  key: karpenter.sh/disruption
  value: disrupting
  • Pod is evicted - 26s Normal Evicted pod/c-pod-0 Evicted pod
  • A new node is created by Karpenter and the pod is moved to it
  • Multi-Attach error for volume "pvc-9b50cf31-5a24-4783-8039-a362ad2a7c0d" Volume is already exclusively attached to one node and can't be attached to another for 6+ minutes until timeout
  • Pod starts on the new node

@alexandermarston
Contributor

alexandermarston commented Mar 22, 2024

@primeroz do you have the logs from the driver from when your node was disrupted? (Also, it would be useful to increase your log level while collecting these.)

@primeroz

primeroz commented Mar 22, 2024

@alexandermarston there are no events whatsoever in the EBS CSI pod running on that node when I delete the node.

I can see the preStop is in the pod spec... so I guess when we delete a node, rather than replace it through a Karpenter disruption, the preStop on the ebs-csi-node pod is never run?

Note that in my case I am deleting the node to get it replaced by Karpenter.

+ ebs-csi-node-rj6qg › node-driver-registrar
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.381506       1 main.go:135] Version: v2.10.0
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.381558       1 main.go:136] Running node-driver-registrar in mode=
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.381565       1 main.go:157] Attempting to open a gRPC connection with: "/csi/csi.sock"
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.382280       1 main.go:164] Calling CSI driver to discover driver name
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.385954       1 main.go:173] CSI driver name: "ebs.csi.aws.com"
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.385996       1 node_register.go:55] Starting Registration Server at: /registration/ebs.csi.aws.com-reg.sock
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.386143       1 node_register.go:64] Registration Server started at: /registration/ebs.csi.aws.com-reg.sock
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.386255       1 node_register.go:88] Skipping HTTP server because endpoint is set to: ""
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.832022       1 main.go:90] Received GetInfo call: &InfoRequest{}
ebs-csi-node-rj6qg node-driver-registrar I0322 14:33:43.857979       1 main.go:101] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
ebs-csi-node-rj6qg liveness-probe I0322 14:33:44.130483       1 main.go:133] "Calling CSI driver to discover driver name"
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022623.64,"caller":"driver/driver.go:84","msg":"Driver Information","v":0,"Driver":"ebs.csi.aws.com","Version":"v1.29.0"}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022623.6697,"caller":"driver/node.go:97","msg":"regionFromSession Node service","v":0,"region":""}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022623.6895,"caller":"cloud/metadata.go:85","msg":"retrieving instance data from ec2 metadata","v":0}
ebs-csi-node-rj6qg liveness-probe I0322 14:33:44.133455       1 main.go:141] "CSI driver name" driver="ebs.csi.aws.com"
ebs-csi-node-rj6qg liveness-probe I0322 14:33:44.133475       1 main.go:170] "ServeMux listening" address="0.0.0.0:9808"
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022625.4724,"caller":"cloud/metadata.go:92","msg":"ec2 metadata is available","v":0}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118022626.2078,"caller":"cloud/metadata_ec2.go:25","msg":"Retrieving EC2 instance identity metadata","v":0,"regionFromSession":""}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118023644.7207,"caller":"driver/node.go:875","msg":"Unexpected failure when attempting to remove node taint(s)","err":"isAllocatableSet: driver not found on node ip-10-1-5-195.us-west-2.compute.internal"}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118024155.2048,"caller":"driver/node.go:961","msg":"CSINode Allocatable value is set","v":0,"nodeName":"ip-10-1-5-195.us-west-2.compute.internal","count":25}
ebs-csi-node-rj6qg ebs-plugin {"ts":1711118326119.883,"caller":"mount-utils@v0.29.2/mount_linux.go:243","msg":"Detected OS without systemd","v":2}



### Delete node here 






- ebs-csi-node-rj6qg › ebs-plugin
- ebs-csi-node-rj6qg › liveness-probe
+ ebs-csi-node-rj6qg › liveness-probe
- ebs-csi-node-rj6qg › node-driver-registrar
+ ebs-csi-node-rj6qg › ebs-plugin
+ ebs-csi-node-rj6qg › node-driver-registrar
- ebs-csi-node-rj6qg › ebs-plugin
- ebs-csi-node-rj6qg › node-driver-registrar
- ebs-csi-node-rj6qg › liveness-probe

@ConnorJC3
Contributor

@primeroz what exactly do you mean by "delete a node"? Assuming you mean something like deleting the node via the AWS Console - that sounds like ungraceful termination, which will not run pre-stop hooks as it results in an immediate and uninterruptible shutdown.

@primeroz

primeroz commented Mar 22, 2024

@ConnorJC3 I mean kubectl delete node XXX, which is something Karpenter supports and manages through its finalizer; it goes through the same steps as a Karpenter-managed disruption - see https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/#8-delete-karpenter-nodes-manually

Anyway, I am now testing with a Karpenter disruption by adding an annotation to the nodepool, which triggers a full replacement of all nodes created by that nodepool since the specs have changed. The same multi-attach error happens.

I'm collecting logs on the next node termination.

@alexandermarston
Contributor

OK, I've been able to test this by manually tainting a node and then running the preStop hook with:

k exec ebs-csi-node-l7zzf -- "/bin/aws-ebs-csi-driver" "pre-stop-hook"

The preStop hook is doing everything it should, from what I understand.

Again, from my limited understanding, I imagine the issue is that your service (which is using the volume) is taking longer than the terminationGracePeriodSeconds of the EBS CSI driver to shut down and release the volume. If your service takes longer than 30 seconds to terminate, then the EBS driver will never have a chance to do its work.

You could either try increasing the terminationGracePeriod of the EBS CSI Driver or lowering the grace period of your service.
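
A quick way to compare the two grace periods (pod and namespace names are taken from the logs earlier in this thread; adjust for your cluster):

# Grace period of the stateful workload pod
kubectl get pod dev-es-default-0 -n datalayer \
  -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'
# Grace period the ebs-csi-node daemonset gives its pods
# (empty output means the Kubernetes default of 30s)
kubectl get ds ebs-csi-node -n kube-system \
  -o jsonpath='{.spec.template.spec.terminationGracePeriodSeconds}{"\n"}'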

@primeroz

If your service takes longer than 30 seconds to terminate, then the EBS driver will never have a chance to do its work.

You could either try increasing the terminationGracePeriod of the EBS CSI Driver or lowering the grace period of your service.

This is entirely possible; I am indeed testing with a service that has a higher-than-normal grace period.

I will test this 🙏

@primeroz

@alexandermarston

Increasing terminationGracePeriod: 3700 on the ebs-csi-node daemonset did not help. I triggered an update of the nodes in the nodepool by updating an annotation, but it still fails with multi-attach errors.

Also what's interesting, to me at least, is that in the logs of the ebs-csi-node pod I never see klog.InfoS("PreStop: executing PreStop lifecycle hook"), so I guess that when Karpenter replaces a node it never triggers a termination of the daemonsets running on it?

I know for sure that Karpenter does not drain the node in the same way that kubectl drain does. I can see that the non-daemonset pods are being evicted one by one, but I see nothing on the ebs-csi-node pods.

I have been watching the ebs-csi-node pod and tailing its events... nothing. It seems like once the non-daemonset pods are all evicted, the node just goes away, and so do all the daemonset pods on it 🤔

@alexandermarston
Contributor

You won't see the logs from the preStop hook handler, as they are not logged in the same way normal pod stdout is. You will be able to see if the preStop hook failed in the Kubernetes Events though.

Can you try deleting the node again, but manually execute the preStop hook and share the output?

k exec <EBS-CSI-POD-NAME> -- "/bin/aws-ebs-csi-driver" "pre-stop-hook"

@torredil
Member

Hi @levanlongktmt, we have been tracking this closely and have plans to address it:

  • Short to medium term: the team is actively working with Karpenter maintainers to implement a fix on Karpenter's side.
  • Long term: we are also collaborating with the Kubernetes community to implement a fix that addresses the race condition in Kubelet.

The team will provide updates as the plans above materialize.

@torredil
Member

@levanlongktmt If you are able to, we strongly encourage users to run the driver with tolerateAllTaints disabled - that should allow for the execution of the lifecycle hook as tolerating the Karpenter taint introduces a conflict.
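
As a sketch, disabling it via Helm might look like the following; the node.tolerateAllTaints / node.tolerations value names follow the chart settings discussed later in this thread, and the extra toleration is a hypothetical example for a tainted nodepool:

# Repo alias assumes https://kubernetes-sigs.github.io/aws-ebs-csi-driver was added
# as "aws-ebs-csi-driver"; --set-json needs Helm 3.10+.
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.tolerateAllTaints=false \
  --set-json 'node.tolerations=[{"key":"example.com/dedicated","operator":"Exists","effect":"NoSchedule"}]'

EKS add-on users would pass the equivalent keys through the add-on's configuration values instead.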

@levanlongktmt
Author

I've hit a very weird problem: after I set node.tolerateAllTaints to false, the StatefulSet pods got stuck in Pending, as in this issue: aws/karpenter-provider-aws#4392

I tried deleting the node with the pending pods several times with no luck; then, after I rolled back node.tolerateAllTaints, all the pending pods started running normally.

@levanlongktmt
Author

Whoops, I think I know why: the nodepool has special taints, so I need to add tolerations.

@levanlongktmt
Author

Update: I still get the multi-attach error with tolerateAllTaints set to false; as @ConnorJC3 said, the hook is still unreliable and often breaks.

Hopefully it will be fixed on the Karpenter side soon.

@johnjeffers

  • Short to medium term: the team is actively working with Karpenter maintainers to implement a fix on Karpenter's side.

We've been waiting months for 0.37 to be released, which we were told would fix this problem, and now I see that the proposed fix for 0.37 was reverted. What is the plan for this? We can't rely on workloads that use EBS volumes because this bug makes them susceptible to 6+ minute outages any time a pod or node termination occurs. In my mind, this means EKS with Karpenter is not currently production-ready.

@AndrewSirenko
Contributor

AndrewSirenko commented Jun 4, 2024

Hi there, I'm from the EBS CSI Driver team and have been working on a fix in Karpenter.

A TL;DR of the Karpenter issue:

In order for a stateful pod to smoothly migrate from the terminating node to the new node...

  1. Consolidation event starts
  2. Stateful pods must terminate
  3. EBS CSI Node pod must unmount all filesystems (NodeUnpublish & NodeUnstage RPCs)
  4. EBS CSI Controller pod must detach all volumes from instance
  5. Karpenter terminates EC2 Instance
  6. Karpenter ensures Node object deleted from Kubernetes

Problems:
A. If 2 doesn't happen, today there's a 6+ minute delay in stateful pod migration, because Kubernetes is afraid the volume is still attached and mounted to the instance (6+ min delay).
B. If 3 doesn't happen, the new stateful pod can't start until the consolidated instance is terminated, which auto-detaches volumes (1+ min delay). (Sidenote: @levanlongktmt might this be the problem you ran into even with tolerateAllTaints set to false? The key difference is that you'd see a ~1 min delay instead of 6+.)

Solutions:

  • If you follow our FAQ instructions and set tolerateAllTaints to false, then A is solved via our pre-stop lifecycle hook.
  • [Scope Medium] We can increase the likelihood of solving both A and B by having Karpenter wait (on VolumeAttachment objects disappearing) between 3 & 4.
  • [Scope Small] We can 100% solve A (that 6 min) by applying the node.kubernetes.io/out-of-service:nodeshutdown:NoExecute taint on the node between 4 and 5 (but we must make sure the underlying instance is actually terminated) - see the sketch below.
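
For context, the out-of-service taint in that last bullet is the standard Kubernetes non-graceful node shutdown taint; applied by hand it would look like this (node name is a placeholder, and it must only be set once the instance is confirmed gone):

kubectl taint nodes <terminated-node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute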

These last two solutions work locally on my fork of the Karpenter project against v0.36.2. A member of the Karpenter team is answering a few design-decision questions of mine before I submit the PR. I hope that we can have this fix for v0.37.1 and v0.36.3, but I'll know more after today's sync with the Karpenter team.

I'll provide two more updates by the end of today: one once I sync with the Karpenter team during their PST business hours, and a second once the PR is up on the Karpenter project.

Update: WIP PR is up against v0.37.0 on kubernetes-sigs/karpenter. Feel free to comment any concerns there. Expect a refactor today to fit more in-line with Karpenter team's style.

Update 2: There will be a public Request For Comment doc outlining the situation and interactions between EBS/Karpenter/Kubernetes in more detail by the end of the week (Friday, 7th of June).

@johnjeffers

@AndrewSirenko Thank you! It's good to see that this is finally being addressed with some urgency. Much appreciated.

@primeroz

primeroz commented Jun 4, 2024

@AndrewSirenko thanks a lot for the detailed explanation, one question:

[Scope Small] We can 100% solve A (that 6 min / soon-to-be-infinite delay)

Do you have any link to the soon-to-be-infinite delay change? That is so scary.

@AndrewSirenko
Contributor

AndrewSirenko commented Jun 4, 2024

@primeroz Apologies, I have edited my statement. I was referencing the ability to disable the 6-minute force-detach timeout, which was added in Kubernetes 1.30 (off by default); you can see the details here.

According to the SIG Storage planning sheet, this would not be enabled by default until at least Kubernetes 1.32 (and most likely later than that if this general issue is not fixed at the upstream Kubernetes level).

@levanlongktmt
Author

levanlongktmt commented Jun 5, 2024

A. If 2 doesn't happen, today there's a 6+ minute delay in stateful pod migration, because Kubernetes is afraid the volume is still attached and mounted to the instance (6+ min delay).

Seemingly this is what happened to me; I did 2 quick tests.
First test: interrupt the spot node with 2 MySQL pods, both with terminationGracePeriodSeconds of 20

  • 1 pod terminated very quickly and was then scheduled on another node; it attached the PVC normally
  • 1 pod was stuck in Terminating for a while, then it was scheduled and got the multi-attach error

Second test: interrupt the spot node with 1 ES pod, with terminationGracePeriodSeconds of 50

  • The pod was stuck in Terminating for a while, even after the EC2 instance had been deleted; then it was scheduled and got the multi-attach error

It seems that somehow the node is deleted before all pods terminate (by AWS or Karpenter), which causes the 6+ min delay.

@ConnorJC3
Contributor

If your pod is stuck in terminating there is nothing we can do, because volumes are not unmounted until all pods using them are terminated.

You would need to work out whatever is preventing the pod from terminating correctly and fix that issue.
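
A few generic commands can help pin down why a pod is wedged in Terminating (pod and namespace names reused from the logs earlier in this thread):

# Recent events and container states for the stuck pod
kubectl describe pod dev-es-default-0 -n datalayer
# Finalizers that may be blocking deletion
kubectl get pod dev-es-default-0 -n datalayer -o jsonpath='{.metadata.finalizers}{"\n"}'
# Per-container state (running vs terminated)
kubectl get pod dev-es-default-0 -n datalayer \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state}{"\n"}{end}'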

@jmdeal
Member

jmdeal commented Jun 5, 2024

I've also been doing some testing tonight to validate the workaround from the Karpenter side. What I've done is:

  • Install the latest version of the EBS CSI driver addon (v1.31.0-eksbuild.1) with tolerateAllTaints set to false.
  • Ensured that the terminationGracePeriod for the ebs-csi-node daemonset is greater than any workload pods.

I whipped up a simple test where I provisioned a stateful set with a prestop hook, waited for the Karpenter provisioned node to come ready and the pod to begin running, and then deleted the Karpenter node. This kicks off Karpenter's termination flow and the drain begins. I expect this to work; the workload pods begin draining before the CSI driver, and the CSI driver has a greater terminationGracePeriodSeconds, ensuring that the workload pods should be terminated before the CSI driver.

Where it gets weird is the StatefulSet pods are left in a terminating state well after their pre-stop hook and terminationGracePeriodSeconds have expired. I wasn't sure why this would happen since the container was just a pause container, but I happened to notice that the liveness-probe and node-driver-registrar containers on the ebs-csi-node pod were stopped as soon as the drain began. I tried adding a sleep prestop hook to each of these containers to keep them alive until the StatefulSet termination should have completed, and now saw the volume being detached at the end of the StatefulSet's terminationGracePeriod and the StatefulSet pod successfully terminated. It was then migrated to the new node with no MultiAttach errors.

I'm honestly not familiar enough with the Kubernetes internals around StatefulSet termination and CSI drivers to understand why this fixes the problem, but with this change the suggested mitigation strategy appears to work consistently in my testing. Would love to know if either of you have any insights here @ConnorJC3 @AndrewSirenko.
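
For anyone wanting to reproduce that sidecar workaround, a strategic-merge patch along these lines adds the sleep preStop hooks (the 120s sleep is arbitrary, and this assumes the sidecar images actually ship a sleep binary):

kubectl patch ds -n kube-system ebs-csi-node --patch "$(cat <<EOF
spec:
  template:
    spec:
      containers:
      - name: node-driver-registrar
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "120"]
      - name: liveness-probe
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "120"]
EOF
)"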

@StepanS-Enverus

Maybe I'm missing something, but by default aws-ebs-csi-driver runs on all nodes as a DaemonSet (tolerations: Exists), so nobody needs to care about setting up nodes when deployments/STS start using EBS. Pretty cool feature.

We are using STS/Deployments with EBS, and after setting tolerateAllTaints to false we finally got rid of the Multi-Attach error, but it brings us other issues with nodes that have existing taints. We need to handle extra work around nodes and deployments/STS and configure the ebs driver to be on all nodes with taints.

Or will the functionality of all nodes in the cluster remain the same? I mean, can any sts/deployment on any node claim EBS and free EBS without any issue, even on nodes with taints?

And finally, can someone explain to me how this simple change solves this issue? Because during node draining daemonset pods are ignored, so why doesn't volume detaching work when the ebs-node pod is deployed to all nodes, no matter which taints they have?

@youwalther65

Maybe I'm missing something, but by default aws-ebs-csi-driver runs on all nodes as a DaemonSet (tolerations: Exists), so nobody needs to care about setting up nodes when deployments/STS start using EBS. Pretty cool feature.

We are using STS/Deployments with EBS, and after setting tolerateAllTaints to false we finally got rid of the Multi-Attach error, but it brings us other issues with nodes that have existing taints. We need to handle extra work around nodes and deployments/STS and configure the ebs driver to be on all nodes with taints.

Or will the functionality of all nodes in the cluster remain the same? I mean, can any sts/deployment on any node claim EBS and free EBS without any issue, even on nodes with taints?

And finally, can someone explain to me how this simple change solves this issue? Because during node draining daemonset pods are ignored, so why doesn't volume detaching work when the ebs-node pod is deployed to all nodes, no matter which taints they have?

Regarding tolerateAllTaints: just look into how it is implemented in the Helm values.yaml and the node DaemonSet template, i.e. if you set tolerateAllTaints: false you'll get the tolerations you set in the Helm values.yaml.

Regarding the preStop hook: a graceful node shutdown, or calling the eviction API to stop the EBS CSI node DaemonSet pod, invokes the preStop hook, which cleans up volumes; see the FAQ.

@StepanS-Enverus

StepanS-Enverus commented Jun 5, 2024

Regarding tolerateAllTaints: just look into how it is implemented in the Helm values.yaml and the node DaemonSet template, i.e. if you set tolerateAllTaints: false you'll get the tolerations you set in the Helm values.yaml.

If we use the default value (tolerateAllTaints: true), we can easily mount an EBS volume on any node we have.
If we want to destroy the node under a pod and reuse the volume on any other node, we need to wait 6 minutes or disable the default behaviour, which means additional configuration of the ebs driver.

Why can't volume reuse work with the default values?

Regarding the preStop hook: a graceful node shutdown, or calling the eviction API to stop the EBS CSI node DaemonSet pod, invokes the preStop hook, which cleans up volumes; see the FAQ.

This does not explain to us why volume reuse starts working if we run the ebs driver daemonset only on specific nodes (nodes without taints, or nodes with specific taints defined in the tolerations for the ebs driver).

Why can't volume reuse work with the default values, where the ebs driver daemonset is deployed on any node with any taint?

@primeroz

primeroz commented Jun 5, 2024

This does not explain to us why volume reuse starts working if we run the ebs driver daemonset only on specific nodes (nodes without taints, or nodes with specific taints defined in the tolerations for the ebs driver).
Why can't volume reuse work with the default values, where the ebs driver daemonset is deployed on any node with any taint?

To my understanding this is because of the combination of the EBS CSI driver and Karpenter.

During node draining, Karpenter will evict (and so, if present, trigger the preStop hook of) all pods on the node except those that tolerate karpenter.sh/disruption.

So with the default values.yaml the ebs-csi-driver will not be evicted when the node is drained (because it tolerates everything, including the Karpenter taint) and the preStop hook will not run.

If the pre-stop hook does not run, the needed cleanup does not happen and you have the 6-minute problem.
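
You can check which situation you are in by looking at the tolerations the daemonset was actually rendered with; a single operator: Exists toleration means it tolerates everything, including the Karpenter disruption taint:

kubectl get ds ebs-csi-node -n kube-system \
  -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'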

@levanlongktmt
Author

levanlongktmt commented Jun 5, 2024

@ConnorJC3 @jmdeal how can I set terminationGracePeriodSeconds for ebs-csi-node? I didn't see any value in the helm chart to configure it.

Do you mean the shutdownGracePeriod and shutdownGracePeriodCriticalPods in the userdata?

    #!/bin/bash
    echo -e "InhibitDelayMaxSec=55\n" >> /etc/systemd/logind.conf
    systemctl restart systemd-logind
    echo "$(jq ".shutdownGracePeriod=\"55s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json
    echo "$(jq ".shutdownGracePeriodCriticalPods=\"15s\"" /etc/kubernetes/kubelet/kubelet-config.json)" > /etc/kubernetes/kubelet/kubelet-config.json

@jmdeal
Member

jmdeal commented Jun 5, 2024

how can I set terminationGracePeriodSeconds for ebs-csi-node? I didn't see any value in the helm chart to configure it.

It's not currently surfaced as part of the helm chart or addon configuration. I set it by patching the daemonset after installing the addon.

kubectl patch ds -n kube-system ebs-csi-node --patch "$(cat <<EOF
spec:
  template:
    spec:
      terminationGracePeriodSeconds: ...
EOF
)"

By setting the resolveConflicts policy to preserve when updating the addon, this change is maintained across upgrades, so it only needs to be made once. The patch would also work with the helm chart, but I'm not sure if there's also a method to preserve the change across upgrades.

Do you mean the shutdownGracePeriod and shutdownGracePeriodCriticalPods in the userdata?

@primeroz's summary was spot on. Setting tolerateAllTaints to false enables Karpenter to drain the daemonset when terminating the instance. This triggers the pre-stop hook, and Karpenter will not be able to proceed until the pre-stop hook has completed, i.e. until all volumes have been detached or the terminationGracePeriodSeconds on the ebs-csi-node pod has been exceeded. You don't need to configure Graceful Node Shutdown in this case because the driver finishes detaching volumes before the instance starts terminating.
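
For completeness, setting that policy when updating the managed add-on might look like this (cluster name, region, and add-on version are placeholders):

# PRESERVE keeps out-of-band edits, such as the terminationGracePeriodSeconds
# patch above, from being overwritten during add-on updates.
aws eks update-addon \
  --cluster-name my-cluster \
  --region eu-central-1 \
  --addon-name aws-ebs-csi-driver \
  --addon-version v1.31.0-eksbuild.1 \
  --resolve-conflicts PRESERVE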

@primeroz

primeroz commented Jun 6, 2024

terminationGracePeriodSeconds

I have been wondering why we need this but could not understand it.

If I look at the Karpenter termination code https://github.com/kubernetes-sigs/karpenter/blob/5bc07be72a1acd553a2a692edd27e79c20e0e1c1/pkg/controllers/node/termination/terminator/terminator.go#L118-L130

it will evict in order (as long as they don't tolerate the karpenter.sh/disruption taint):

	// a. non-critical non-daemonsets
	// b. non-critical daemonsets
	// c. critical non-daemonsets
	// d. critical daemonsets

So I would think it would:

  • drain all common workloads, including those with EBS volumes
  • drain random daemonsets
  • drain all the special/critical workloads, including those with EBS volumes
  • drain the ebs-csi-driver node pod, triggering the preStop hook

So, unless you are running stateful workloads that are daemonsets and have the critical priorityClass, you should not need to increase the termination grace time... right?

@jmdeal
Member

jmdeal commented Jun 6, 2024

Currently Karpenter does not wait for one set of pods to complete termination before moving on to the next set. As long as all of the pods in the previous set have begun terminating, eviction of the next set can begin. If we changed this behavior and waited for the previous set to complete termination then you would be right, there shouldn't be a need to configure terminationGracePeriodSeconds on the ebs-csi-node pod. However, it is required with Karpenter's current drain logic.

@levanlongktmt
Author

Some additional information: in my cluster (EKS v1.30, Karpenter v0.37), when k8s evicts a StatefulSet pod, I randomly see the pod stay in Terminating for a long time, even though terminationGracePeriodSeconds is just 20s and the pod doesn't have any PDB; then the node is deleted while the pod is still Terminating. This isn't a problem in the EBS driver, but I don't know whether it's a problem in Karpenter or the kubelet.

@AndrewSirenko
Contributor

Hi folks, I have just posted a draft Request For Comment on aws/karpenter-provider-aws: docs: RFC for disrupted EBS-Backed StatefulSet delays

The Karpenter + EBS CSI Driver teams will hopefully decide which solutions we are moving forward with via this RFC.

@AndrewSirenko
Contributor

It's not currently surfaced as part of the helm chart or addon configuration.

You will be able to increase EBS CSI Driver Node Pod terminationGracePeriod in v1.32.0 of aws-ebs-csi-driver via helm and via add-on configuration.

Thanks to @ElijahQuinones for #2060
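
As a sketch of what that will look like once v1.32.0 of the chart is out (assuming the value added in #2060 is exposed as node.terminationGracePeriodSeconds):

# Give the ebs-csi-node pod more time than the longest-terminating stateful workload.
helm upgrade --install aws-ebs-csi-driver aws-ebs-csi-driver/aws-ebs-csi-driver \
  --namespace kube-system \
  --set node.terminationGracePeriodSeconds=300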

I wasn't sure why this would happen since the container was just a pause container, but I happened to notice that the liveness-probe and node-driver-registrar containers on the ebs-csi-node pod were stopped as soon as the drain began. I tried adding a sleep prestop hook to each of these containers to keep them alive until the StatefulSet termination should have completed, and now saw the volume being detached at the end of the StatefulSet's terminationGracePeriod and the StatefulSet pod successfully terminated.

At the moment, we cannot add a sleep pre-stop hook to the liveness-probe and node-driver-registrar containers because we cannot inject the wait-for-volume-detachments logic that our ebs-plugin pre-stop hook currently has. Without this smarter detachment logic, there would always be an extra node.terminationGracePeriod seconds before the EBS CSI node pod can terminate, and this might not be a good trade-off by default.

@AndrewSirenko
Contributor

AndrewSirenko commented Jun 18, 2024

@jmdeal FYI we may be able to avoid needing to add a sleep pre-stop hook to liveness-probe altogether for users on newer versions of Kubernetes by using a gRPC liveness probe instead of a separate container.

This would increase the robustness of our pre-stop lifecycle hook. Our team will look into this.

Thanks to Connor for mentioning this feature.

@cnmcavoy
Contributor

cnmcavoy commented Jun 25, 2024

I was made aware of this issue and that we have some workloads (statefulsets with PVs) hitting it with Karpenter v0.37, EKS 1.28, ebs-csi v1.32.0.

The ideal behavior would be for the ebs-csi controller to detect that the node was de-provisioned and force-detach any remaining volumes that escaped cleanup by the daemonset driver pod. Most of the suggestions on this issue are "hacks" or workarounds to try and ensure the daemonset driver pod can always succeed at cleanup. This seems like a mistake to me, as the controller should have the responsibility to clean up volumes if the node is permanently destroyed. However, when I started looking into how this might be implemented, I discovered this is a gap in the CSI spec: container-storage-interface/spec#512

Has AWS considered reaching out to the SIG-node folks or the authors of that original issue? There seems to have been a push in late 2022 to fix the issue that did not cross the finish line: container-storage-interface/spec#512 (comment) & container-storage-interface/spec#477

@johnjeffers

This has been brought up before, but if you follow the advice in this thread and set tolerateAllTaints = false, it seems to break statefulsets. I'm not sure what's going on, but when Karpenter deletes a node, any statefulset pods on that node get stuck in Terminating because they won't reschedule on another node. If I delete the Karpenter finalizer on the node, then it will delete and the statefulset pods will start on another node. If I don't... I'm not sure how long it will sit there and do nothing.
