
EFS volume cannot unmount because device is busy #112

Closed
houdinisparks opened this issue Dec 12, 2019 · 13 comments
@houdinisparks

/kind bug

What happened?
During a rolling update, one of the pods was stuck in Terminating status due to the error below:

Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720521    2577 reconciler.go:181] operationExecutor.UnmountVolume started for volume "efs-claim" (UniqueName: "kubernetes.io/csi/efs.csi.aws.com^fs-51aad710") pod "7a84d253-11c9-11ea-b310-066b5c788196" (UID: "7a84d253-11c9-11ea-b310-066b5c788196")
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720669    2577 clientconn.go:440] parsed scheme: ""
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720689    2577 clientconn.go:440] scheme "" not registered, fallback to default scheme
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720746    2577 asm_amd64.s:1337] ccResolverWrapper: sending new addresses to cc: [{/var/lib/kubelet/plugins/efs.csi.aws.com/csi.sock 0  <nil>}]
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720764    2577 clientconn.go:796] ClientConn switching balancer to "pick_first"
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720813    2577 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc0003f24b0, CONNECTING
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.720846    2577 clientconn.go:1016] blockingPicker: the picked transport is not ready, loop back to repick
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: I1212 18:57:22.721004    2577 balancer_conn_wrappers.go:131] pickfirstBalancer: HandleSubConnStateChange: 0xc0003f24b0, READY
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: E1212 18:57:22.728275    2577 csi_mounter.go:365] kubernetes.io/csi: mounter.TearDownAt failed: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount": Unmount failed: exit status 16
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: Unmounting arguments: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: Output: umount.nfs4: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount: device is busy
Dec 12 18:57:22 ip-100-64-36-72.ap-southeast-1.compute.internal kubelet[2577]: E1212 18:57:22.728339    2577 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/csi/efs.csi.aws.com^fs-51aad710\" (\"7a84d253-11c9-11ea-b310-066b5c788196\")" failed. No retries permitted until 2019-12-12 18:59:24.728314138 +0800 +08 m=+2348866.786363694 (durationBeforeRetry 2m2s). Error: "UnmountVolume.TearDown failed for volume \"efs-claim\" (UniqueName: \"kubernetes.io/csi/efs.csi.aws.com^fs-51aad710\") pod \"7a84d253-11c9-11ea-b310-066b5c788196\" (UID: \"7a84d253-11c9-11ea-b310-066b5c788196\") : rpc error: code = Internal desc = Could not unmount \"/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount\": Unmount failed: exit status 16\nUnmounting arguments: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount\nOutput: umount.nfs4: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount: device is busy\n\n"

What you expected to happen?
Device is unmounted successfully.

How to reproduce it (as minimally and precisely as possible)?
Not sure; this is the first time it has happened.

Anything else we need to know?:

Environment

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"14", GitVersion:"v1.14.6", GitCommit:"96fac5cd13a5dc064f7d9f4f23030a6aeface6cc", GitTreeState:"clean", BuildDate:"2019-08-19T11:13:49Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.8-eks-b7174d", GitCommit:"b7174db5ee0e30c94a0b9899c20ac980c0850fc8", GitTreeState:"clean", BuildDate:"2019-10-18T17:56:01Z", GoVersion:"go1.12.10", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 12, 2019
@leakingtapan
Contributor

@houdinisparks how did you do the rolling update? And do you have the log for efs-node pod?

@houdinisparks
Author

houdinisparks commented Dec 13, 2019

@houdinisparks how did you do the rolling update? And do you have the log for efs-node pod?

Below are my yaml files for the pod in question:

apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: ror-app-main
spec:
  replicas: 6
  strategy:
    rollingUpdate:
      maxSurge: "50%"
      maxUnavailable: "50%"
    type: RollingUpdate
  template:
    spec:
      ...
      volumes:
        - name: efs-claim
          persistentVolumeClaim:
            claimName: efs-claim
---
# sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

---
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi

---
# pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-*******

logs below:

efs-csi-node-qb54t efs-plugin I1212 12:04:30.661934       1 node.go:118] NodeUnpublishVolume: called with args volume_id:"fs-51aad710" target_path:"/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount"
efs-csi-node-qb54t efs-plugin I1212 12:04:30.662004       1 node.go:125] NodeUnpublishVolume: unmounting /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
efs-csi-node-qb54t efs-plugin I1212 12:04:30.662013       1 mount_linux.go:211] Unmounting /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
efs-csi-node-qb54t efs-plugin E1212 12:04:30.668714       1 driver.go:71] GRPC error: rpc error: code = Internal desc = Could not unmount "/var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount": Unmount failed: exit status 16
efs-csi-node-qb54t efs-plugin Unmounting arguments: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
efs-csi-node-qb54t efs-plugin Output: umount.nfs4: /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount: device is busy
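
In case it helps with debugging: "device is busy" normally means some process on the node still has files open under the mount. A rough way to check, run on the affected node (the path is the one from the errors above; fuser/lsof availability depends on the AMI):

# Show which processes still hold files open under the mount point.
fuser -vm /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount

# Alternatively, list open files beneath it (this can itself hang if the NFS server is unreachable).
lsof +D /var/lib/kubelet/pods/7a84d253-11c9-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount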

UPDATE
I have tried the various methods below, but none of them rectified the issue:

  1. Deleting the efs-csi-driver daemonset pod and spinning up a new one.

Does not work; no difference.


  2. Manually unmounting and remounting on the node.

Does not work. After unmounting, the efs-plugin complains of a "no directory to unmount" error; after remounting, it says the device is busy again.


  3. Force-deleting the pod:
kubectl delete pod foo --grace-period=0 --force

Does not work; the efs-csi-driver keeps trying to unmount the volume.

The only remediation I haven't tried is to spin up another node, drain the problem node of its pods, and then terminate the problem node (rough sketch below). However, this issue will likely occur again in future deployments. Hope to have a resolution.
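
For reference, a rough sketch of that drain-and-replace approach (using the node name from the kubelet logs above; flags as of Kubernetes 1.14):

# Stop new pods from being scheduled onto the problem node.
kubectl cordon ip-100-64-36-72.ap-southeast-1.compute.internal

# Evict its pods; DaemonSet pods such as efs-csi-node are skipped.
kubectl drain ip-100-64-36-72.ap-southeast-1.compute.internal --ignore-daemonsets --delete-local-data

# Once the workloads have rescheduled elsewhere, terminate the EC2 instance so it gets replaced.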

@houdinisparks
Copy link
Author

houdinisparks commented Dec 16, 2019

Update 16/12/2019

  1. Unmount and remount from within the efs-plugin container:
kubectl exec -it efs-csi-node-4zn8z -c efs-plugin sh -n kube-system
umount -l /var/lib/kubelet/pods/ac6124d1-1d97-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
mount -t efs fs-51aad710:/ /var/lib/kubelet/pods/ac6124d1-1d97-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount

After remounting from within the container, the efs-plugin is able to successfully unmount the volume. Not sure why there's a discrepancy between remounting from the node and from the container.
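
One possible explanation (an assumption on my part, not verified): the host and the efs-plugin container do not necessarily see the same mount table, so a mount/umount done on the node may not act on the mount the plugin sees. The two views can be compared like this:

# On the node: the host's view of the mount
grep efs-pv /proc/self/mountinfo

# Inside the efs-plugin container: the driver's view of the same path
kubectl exec -it efs-csi-node-4zn8z -c efs-plugin -n kube-system -- grep efs-pv /proc/self/mountinfo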

@leakingtapan leakingtapan added this to the 0.4 milestone Feb 22, 2020
@cmcconnell1

Thanks for filing/posting this. It was very helpful.

Our process for our cluster, which had hanging (missing EFS persistent volume) mounts on its nodes, was as follows. Hopefully this helps others out.

A simple df command would hang on the EKS kube nodes.

Running ps -ef | grep umount showed over a thousand umount processes on the nodes:

kube-node # ps -ef | grep umount | wc -l
1128

The processes looked like this:

[root@ip-10-2-2-8 ts-aws-efs]# ps -ef | grep umount | head -5
root       340 19368  0 05:09 ?        00:00:00 umount /var/lib/kubelet/pods/2e378c53-63b5-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/efs-pv/mount
root       341 19368  0 05:09 ?        00:00:00 umount /var/lib/kubelet/pods/57db20a4-67e2-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/wordpress-dev-efs/mount
root       346 19368  0 05:09 ?        00:00:00 umount /var/lib/kubelet/pods/57dc6ba7-67e2-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/wordpress-dev-efs/mount
root       458 19368  0 02:12 ?        00:00:00 umount /var/lib/kubelet/pods/2e378c53-63b5-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/efs-pv/mount
root       459 19368  0 02:12 ?        00:00:00 umount /var/lib/kubelet/pods/57db20a4-67e2-11ea-aaa9-

Then we needed to strace df on each of our EKS kube nodes, which showed the problematic mount in the last line of output before the hang.
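
Something along these lines (a sketch; the exact filtering may vary by strace version, and strace may need to be installed on the node):

# df walks every mount; the last path printed before strace stalls is the wedged mount.
strace -e trace=statfs df -h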

Next we had to determine which efs-csi-node pods were running on each kube node:
k get po -n kube-system -o wide | grep efs-csi

Now, inside the corresponding efs-csi-node pod, run the above-noted umount commands to unmount (all of) the missing mounts identified by the strace df output on the EKS nodes, i.e.:

k exec -it po/efs-csi-node-zr6r5 -n kube-system bash
Defaulting container name to efs-plugin.
Use 'kubectl describe pod/efs-csi-node-zr6r5 -n kube-system' to see all of the containers in this pod.
bash-4.2# umount -l /var/lib/kubelet/pods/57dc5ca1-67e2-11ea-aaa9-023df131cbfe/volumes/kubernetes.io~csi/wordpress-dev-efs/mount

Note that in our case, on EKS, we had to take a few more steps:

k version | grep Ser
Server Version: version.Info{Major:"1", Minor:"14+", GitVersion:"v1.14.9-eks-502bfb", GitCommit:"502bfb383169b124d87848f89e17a04b9fc1f6f0", GitTreeState:"clean", BuildDate:"2020-02-07T01:31:02Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}

Stop and start the kubelet on each node (and of course validate the restart; simple restarts apparently are not always successful):

systemctl stop kubelet.service
systemctl start kubelet.service
systemctl status kubelet.service

Kill the current efs-csi pods in kube-system:

k delete po/efs-csi-node-vbvzb -n kube-system

Note that until we performed these final two steps, the previously unmounted volumes (the ones manually unmounted in the efs-csi pods) continued to show up in the logs as still failing to unmount.
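
Something like this on each node can confirm the unmount errors have actually stopped (assuming journald, as on the stock EKS AMI):

journalctl -u kubelet --since "10 min ago" | grep -i "could not unmount"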

@wongma7
Contributor

wongma7 commented Apr 16, 2020

What version of the driver is it, 0.2.0 or 0.3.0?

@wongma7
Contributor

wongma7 commented May 8, 2020

@cmcconnell1 do you know how the "hanging" mounts came about? I've not been able to reproduce this issue by repeatedly triggering rolling updates on a deployment with 60 replicas.

@cmcconnell1

cmcconnell1 commented May 8, 2020

Hey @wongma7

I was not 100% sure about the specific cause of the hanging mounts, but I believe it was related to our initial first pass at full-lifecycle deployment automation for EFS resources.

I.e., I think something got stuck/hung (upon destroy) and not enough time was allowed after tear-down before it was redeployed. I confirmed this possibility with AWS EKS support, but we were unable to find the exact cause.

After I refactored the deploy/destroy automation scripts to allow the requisite time and enforce an explicit order of operations, we did not experience this again and haven't seen any similar issues since.

Thanks for following-up.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 6, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Sep 6, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@bradmann

bradmann commented Mar 7, 2024

/reopen

seeing this same issue on a kubernetes v1.23.8 cluster running efs-csi-driver version 1.5.8.

@k8s-ci-robot
Contributor

@bradmann: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

seeing this same issue on a kubernetes v1.23.8 cluster running efs-csi-driver version 1.5.8.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
