
EFS volume cannot unmount because device is busy #112

Closed
houdinisparks opened this issue Dec 12, 2019 · 13 comments
@houdinisparks

/kind bug

What happened?
During a rolling update, one of the pods was stuck in Terminating status due to the error below:

Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720521    2577 reconciler.go:181] operationExecutor.UnmountVolume started for volume "efs-claim" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-51aad710") pod "7a84d253-11c9-11ea-b310-066b5c788196" (UID: "7a84d253-11c9-11ea-b310-066b5c788196")
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720669    2577 clientconn.go:440] parsed scheme: ""
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720689    2577 clientconn.go:440] scheme "" not registered, fallback to default scheme
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720746    2577 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock 0  <nil>}]
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720764    2577 clientconn.go:796] ClientConn switching balancer to "pick_first"
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720813    2577 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc0003f24b0, CONNECTING
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720846    2577 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.721004    2577 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc0003f24b0, READY
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: E1212 18:57:22.728275    2577 csi_mounter.go:365] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount": Unmount failed: exit status 16
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: Unmounting arguments: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: Output: umount.nfs4: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount: device is busy
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: E1212 18:57:22.728339    2577 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/efs.csi.aws.com^fs-51aad710\" (\"7a84d253-11c9-11ea-b310-066b5c788196\")" failed. No retries permitted until 2019-12-12 18:59:24.728314138 +0800 +08 m=+2348866.786363694 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"efs-claim\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-51aad710\") pod \"7a84d253-11c9-11ea-b310-066b5c788196\" (UID: \"7a84d253-11c9-11ea-b310-066b5c788196\") : rpc error: code = Internal desc = Could not unmount \"/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount\": Unmount failed: exit status 16\nUnmounting arguments: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount\nOutput: umount.nfs4: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount: device is busy\n\n"

What you expected to happen?
Device is unmounted successfully.

How to reproduce it (as minimally and precisely as possible)?
Not sure; this is the first time it has happened.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:49Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 12, 2019
@leakingtapan
Contributor

@houdinisparks how did you do the rolling update? And do you have the log for efs-node pod?

@houdinisparks
Author

houdinisparks commented Dec 13, 2019

@houdinisparks how did you do the rolling update? And do you have the log for efs-node pod?

Below are my yaml files for the pod in question:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: ror-app-main
spec:
  replicas: 6
  strategy:
    rollingUpdate:
      maxSurge: "50%"
      maxUnavailable: "50%"
    type: RollingUpdate
  template:
    spec:
      ...
      volumes:
        - name: efs-claim
          persistentVolumeClaim:
            claimName: efs-claim
---
# sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

---
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi

---
# pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-*******

logs below:

efs-csi-node-qb54t efs-plugin I1212 12:04:30.661934       1 node.go:118] NodeUnpublishVolume: called with args volume_id:"fs-51aad710" target_path:"/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount"
efs-csi-node-qb54t efs-plugin I1212 12:04:30.662004       1 node.go:125] NodeUnpublishVolume: unmounting /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
efs-csi-node-qb54t efs-plugin I1212 12:04:30.662013       1 mount_linux.go:211] Unmounting /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
efs-csi-node-qb54t efs-plugin E1212 12:04:30.668714       1 driver.go:71] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount": Unmount failed: exit status 16
efs-csi-node-qb54t efs-plugin Unmounting arguments: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
efs-csi-node-qb54t efs-plugin Output: umount.nfs4: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount: device is busy
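
In case it helps with debugging: "device is busy" normally means some process on the node still has files open under the mount. A rough way to check, run on the affected node (the path is the one from the errors above; fuser/lsof availability depends on the AMI):

# Show which processes still hold files open under the mount point.
fuser -vm /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount

# Alternatively, list open files beneath it (this can itself hang if the NFS server is unreachable).
lsof +D /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount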

UPDATE
I have tried the various methods below, but none of them rectified the issue:

  1. Deleting the efs-csi-driver daemonset pod and spinning up a new one.

Does not work; no difference.


  2. Manually unmounting and remounting on the node.

Does not work. After unmounting, the efs-plugin complains of a "no directory to unmount" error; after remounting, it says the device is busy again.


  3. Force-deleting the pod:
kubectl delete pod foo --grace-period=0 --force

Does not work; the efs-csi-driver keeps trying to unmount the volume.

The only remediation I haven't tried is to spin up another node, drain the problem node of its pods, and then terminate the problem node (rough sketch below). However, this issue will likely occur again in future deployments. Hope to have a resolution.
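
For reference, a rough sketch of that drain-and-replace approach (using the node name from the kubelet logs above; flags as of Kubernetes 1.14):

# Stop new pods from being scheduled onto the problem node.
kubectl cordon ip-100-64-36-72.ap-southeast-1.compute.internal

# Evict its pods; DaemonSet pods such as efs-csi-node are skipped.
kubectl drain ip-100-64-36-72.ap-southeast-1.compute.internal --ignore-daemonsets --delete-local-data

# Once the workloads have rescheduled elsewhere, terminate the EC2 instance so it gets replaced.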

@houdinisparks
Copy link
Author

houdinisparks commented Dec 16, 2019

Update 16/12/2019

  1. Unmount and remount from within the efs-plugin container:
kubectl exec -it efs-csi-node-4zn8z -c efs-plugin sh -n kube-system
umount -l /var/lib/kubelet/pods/ac6124d1-1d97-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
mount -t efs fs-51aad710:/ /var/lib/kubelet/pods/ac6124d1-1d97-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount

After remounting from within the container, the efs-plugin is able to successfully unmount the volume. Not sure why there's a discrepancy between remounting from the node and from the container.
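
One possible explanation (an assumption on my part, not verified): the host and the efs-plugin container do not necessarily see the same mount table, so a mount/umount done on the node may not act on the mount the plugin sees. The two views can be compared like this:

# On the node: the host's view of the mount
grep efs-pv /proc/self/mountinfo

# Inside the efs-plugin container: the driver's view of the same path
kubectl exec -it efs-csi-node-4zn8z -c efs-plugin -n kube-system -- grep efs-pv /proc/self/mountinfo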

@leakingtapan leakingtapan added this to the 0.4 milestone Feb 22, 2020
@cmcconnell1

Thanks for filing/posting this. It was very helpful.

Our process for our cluster, which had hanging (missing EFS persistent volume) mounts on its nodes, was as follows. Hopefully this helps others out.

A simple df command would hang on the EKS kube nodes.

Running ps -ef | grep umount showed over a thousand umount processes on the nodes:

kube-node # ps -ef | grep umount | wc -l
1128

The processes looked like this:

[root@ip-10-2-2-8 ts-aws-efs]# ps -ef | grep umount | head -5
root       340 19368  0 05:09 ?        00:00:00 umount /var/lib/kubelet/pods/2e378c53-63b5-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/efs-pv/mount
root       341 19368  0 05:09 ?        00:00:00 umount /var/lib/kubelet/pods/57db20a4-67e2-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/wordpress-dev-efs/mount
root       346 19368  0 05:09 ?        00:00:00 umount /var/lib/kubelet/pods/57dc6ba7-67e2-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/wordpress-dev-efs/mount
root       458 19368  0 02:12 ?        00:00:00 umount /var/lib/kubelet/pods/2e378c53-63b5-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/efs-pv/mount
root       459 19368  0 02:12 ?        00:00:00 umount /var/lib/kubelet/pods/57db20a4-67e2-11ea-aaa9-

Then we needed to strace df on each of our EKS kube nodes, which showed the problematic mount in the last line of output before the hang.
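
Something along these lines (a sketch; the exact filtering may vary by strace version, and strace may need to be installed on the node):

# df walks every mount; the last path printed before strace stalls is the wedged mount.
strace -e trace=statfs df -h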

Next we had to determine which efs-csi-node pods were running on each kube node:
k get po -n kube-system -o wide | grep efs-csi

Now, inside the corresponding efs-csi-node pod, run the above-noted umount commands to unmount (all of) the missing mounts identified by the strace df output on the EKS nodes, i.e.:

k exec -it po/efs-csi-node-zr6r5 -n kube-system bash
Defaulting container name to efs-plugin.
Use 'kubectl describe pod/efs-csi-node-zr6r5 -n kube-system' to see all of the containers in this pod.
bash-4.2# umount -l /var/lib/kubelet/pods/57dc5ca1-67e2-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/wordpress-dev-efs/mount

Note that in our case, on EKS, we had to take a few more steps:

k version | grep Ser
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-502bfb", GitCommit:"502bfb383169b124d87848f89e17a04b9fc1f6f0", GitTreeState:"clean", BuildDate:"2020-02-07T01:31:02Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Stop and start the kubelet on each node (and of course validate the restart; simple restarts apparently are not always successful):

systemctl stop kubelet.service
systemctl start kubelet.service
systemctl status kubelet.service

Kill the current efs-csi pods in kube-system:

k delete po/efs-csi-node-vbvzb -n kube-system

Note that until we performed these final two steps, the previously unmounted volumes (the ones manually unmounted in the efs-csi pods) continued to show up in the logs as still failing to unmount.
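
Something like this on each node can confirm the unmount errors have actually stopped (assuming journald, as on the stock EKS AMI):

journalctl -u kubelet --since "10 min ago" | grep -i "could not unmount"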

@wongma7
Contributor

wongma7 commented Apr 16, 2020

What version of the driver is it, 0.2.0 or 0.3.0?

@wongma7
Contributor

wongma7 commented May 8, 2020

@cmcconnell1 do you know how the "hanging" mounts came about? I've not been able to reproduce this issue by repeatedly triggering rolling updates on a deployment with 60 replicas.

@cmcconnell1

cmcconnell1 commented May 8, 2020

Hey @wongma7

I was not 100% sure about the specific cause of the hanging mounts, but I believe it was related to our initial first pass at full-lifecycle deployment automation for EFS resources.

I.e., I think something got stuck/hung (upon destroy) and not enough time was allowed after tear-down before it was redeployed. I confirmed this possibility with AWS EKS support, but we were unable to find the exact cause.

After I refactored the deploy/destroy automation scripts to allow the requisite time and enforce an explicit order of operations, we did not experience this again and haven't seen any similar issues since.

Thanks for following-up.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 6, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bradmann

bradmann commented Mar 7, 2024

/reopen

seeing this same issue on a kubernetes v1.23.8 cluster running efs-csi-driver version 1.5.8.

@k8s-ci-robot
Contributor

@bradmann: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

seeing this same issue on a kubernetes v1.23.8 cluster running efs-csi-driver version 1.5.8.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
