EFS volume cannot unmount because device is busy #112
@houdinisparks how did you do the rolling update? And do you have the log for the efs-node pod?
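(For readers following along: the node driver's logs live in the efs-plugin container of the efs-csi-node DaemonSet pod on the affected node. A minimal sketch of pulling them, using the pod name that appears later in this thread:

# List the efs-csi-node DaemonSet pods and the nodes they run on
kubectl get pods -n kube-system -o wide | grep efs-csi-node
# Fetch logs from the efs-plugin container of the pod on the affected node
kubectl logs efs-csi-node-4zn8z -n kube-system -c efs-plugin
)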
Below is my yaml file for the pod in question:

# deployment.yaml
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  name: ror-app-main
spec:
  replicas: 6
  strategy:
    rollingUpdate:
      maxSurge: "50%"
      maxUnavailable: "50%"
    type: RollingUpdate
  template:
    spec:
      ...
      volumes:
        - name: efs-claim
          persistentVolumeClaim:
            claimName: efs-claim
---
# sc.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
---
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
---
# pv.yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-*******

Logs below:
UPDATE
I tried force deleting the stuck pod:
kubectl delete pod foo --grace-period=0 --force
The only remediation I haven't tried is to spin up another node, drain the problem node of its pods, and then terminate the problem node. However, this issue will likely occur again in future deployments. Hope to have a resolution. A sketch of that drain remediation follows.
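(A sketch of that untried drain remediation, with a placeholder node name:

# Prevent new pods from scheduling onto the problem node
kubectl cordon <problem-node-name>
# Evict its pods; DaemonSet pods such as efs-csi-node must be skipped
kubectl drain <problem-node-name> --ignore-daemonsets --delete-local-data
# (newer kubectl versions spell the last flag --delete-emptydir-data)
# Then terminate the instance, e.g. from the EC2 console or the AWS CLI
)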
Update 16/12/2019
kubectl exec -it efs-csi-node-4zn8z -c efs-plugin sh -n kube-system
umount -l /var/lib/kubelet/pods/ac6124d1-1d97-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
mount -t efs fs-51aad710:/ /var/lib/kubelet/pods/ac6124d1-1d97-11ea-b310-066b5c788196/volumes/kubernetes.io~csi/efs-pv/mount
After remounting from within the container, the efs-plugin is able to successfully unmount the volume. Not sure why there's a discrepancy between remounting from the node versus from within the container.
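(A consolidated sketch of that workaround, with placeholders for the node name, pod UID, and file system ID:

# Find the efs-csi-node pod running on the node that hosts the stuck pod
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name> | grep efs-csi-node
# Shell into its efs-plugin container
kubectl exec -it <efs-csi-node-pod> -c efs-plugin -n kube-system -- sh
# Inside the container: lazily detach the stale mount, then remount it
umount -l /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/efs-pv/mount
mount -t efs <file-system-id>:/ /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/efs-pv/mount
# The stuck pod's UID can be recovered with:
kubectl get pods -A -o custom-columns=NAME:.metadata.name,UID:.metadata.uid | grep <stuck-pod-name>
)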
Thanks for filing/posting this; it was very helpful. Our process was as follows for our cluster, which had hanging (missing EFS persistent volume) mounts on its nodes. Hopefully this helps others out. A simple df command would hang on the EKS kube nodes, and running ps -ef | grep umount showed over a thousand umount processes on the nodes.
Which looked like this:
Next we had to determine which efs-csi-node pods were running on each kube node. Then, inside the requisite efs-csi-node pod, we ran the above-noted umount commands to unmount all of the missing mounts identified earlier.
Note that in our case, for EKS kube, we had to do some more steps (see the sketch after this list):
- Stop and start (and validate the restart, of course) the kubelet on each node (simple restarts apparently are not always successful)
- Kill the current efs-csi pods in kube-system
Note that before doing these final two steps, the previously unmounted volumes (those manually unmounted in the efs-csi pods) continued to show up in the logs as still not being able to be unmounted.
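(A sketch of that whole cleanup sequence; node and pod names are placeholders, and the label selector assumes the driver's stock manifests:

# On the affected node: a hung df and piles of umount processes are the symptom
df
ps -ef | grep umount | wc -l
# From a workstation: find the efs-csi-node pod on each affected node
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=<node-name>
# Exec into each one and run the umount -l / mount -t efs commands shown earlier
kubectl exec -it <efs-csi-node-pod> -c efs-plugin -n kube-system -- sh
# On each node: full stop/start of kubelet (a plain restart was not always enough)
sudo systemctl stop kubelet
sudo systemctl start kubelet
# Finally, delete the efs-csi pods so the DaemonSet recreates them
kubectl delete pod -n kube-system -l app=efs-csi-node
)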
What version of the driver is it, 0.2.0 or 0.3.0?
@cmcconnell1 do you know how the "hanging" mounts came about? I've not been able to reproduce this issue by repeatedly triggering rolling updates on a deployment with 60 replicas.
Hey @wongma7, I was not 100% sure about the specific cause of the hanging mounts, but I believe it was the coding of our initial first pass at full-lifecycle deployment automation for EFS resources. I.e., I think something got stuck/hung (upon destroy) and not enough time was provided after tear-down before it was redeployed. I confirmed this possibility with AWS EKS support, but we were unable to find the exact cause. After I refactored the deploy/destroy automation scripts to provide the requisite time and ensure an explicit order of operations, we did not experience this again, so we are no longer affected and haven't seen any similar issues since. Thanks for following up.
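(A sketch of the kind of ordering guard described above, assuming the AWS CLI and a placeholder file system ID; this is not the poster's actual script:

# Wait until all mount targets for the file system are gone before re-creating
FS_ID=fs-XXXXXXXX   # placeholder
while aws efs describe-mount-targets --file-system-id "$FS_ID" \
      --query 'MountTargets[0].MountTargetId' --output text | grep -qv None; do
  echo "mount targets still present; waiting..."
  sleep 10
done
)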
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/reopen
Seeing this same issue on a Kubernetes v1.23.8 cluster running efs-csi-driver version 1.5.8.
@bradmann: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Original issue description:
/kind bug
What happened?
During a rolling update, one of the pods is stuck in Terminating status due to the error below:
What you expected to happen?
Device is unmounted successfully.
How to reproduce it (as minimally and precisely as possible)?
Not sure; this is the first time it has happened.
Anything else we need to know?:
Environment
Kubernetes version (use kubectl version):