
Pods are getting stuck in ContainerCreating because of EFS CSI driver error #325

Closed
mikalai-t opened this issue Feb 10, 2021 · 51 comments · May be fixed by #1074
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mikalai-t

/kind bug

Hello.

What happened?
Sometimes, on a new EC2 instance, the EFS CSI driver can't handle PVCs from Pods running on that node, which causes the issue described in the subject.

What you expected to happen?
The PVC should be handled correctly so the volume can be mounted and attached to the Pod.

How to reproduce it (as minimally and precisely as possible)?
It happens intermittently. Generally the process is as follows:

  1. Create an EKS cluster and Node group (it doesn't matter which EC2 instance type you choose - it happened on both t3.medium and m5.large).

  2. Create an EFS file system and a SecurityGroup for EFS that allows connections to 2049/tcp from the VPC CIDR. Say the EFS ID is fs-33559c06.

  3. Create an EFS AccessPoint for the particular application (e.g. ID fsap-085397e3f5ef75ead). A rough AWS CLI sketch of steps 2-3 follows the manifests below.

  4. Create StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

  5. Create PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-thanos-compactor
  labels:
    'app.kubernetes.io/name': 'thanos-compactor'
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: 'fs-33559c06::fsap-085397e3f5ef75ead'
    volumeAttributes:
      encryptInTransit: 'true'

  6. Create PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/name: thanos-compactor
  name: thanos-compactor
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: efs-sc
  volumeMode: Filesystem
  selector:
    matchLabels:
      "app.kubernetes.io/name": thanos-compactor

  7. Create Pod:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app.kubernetes.io/component: compactor
    app.kubernetes.io/instance: monitoring-thanos
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: thanos
    helm.sh/chart: thanos-3.8.1
    pod-template-hash: 7c8b88b68
  name: thanos-compactor-7c8b88b68-6hflm
  namespace: monitoring

# ...
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: thanos-compactor

# ...
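
For reference, a rough AWS CLI sketch of steps 2-3 above (all IDs, names and the CIDR are placeholders, and the flags may need adjusting to your setup):

# step 2: file system, security group allowing NFS (2049/tcp) from the VPC CIDR, and mount targets
aws efs create-file-system --encrypted --tags Key=Name,Value=eks-efs
aws ec2 create-security-group --group-name eks-efs-sg \
  --description "Allow NFS from VPC" --vpc-id vpc-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
  --protocol tcp --port 2049 --cidr 10.0.0.0/16
aws efs create-mount-target --file-system-id fs-xxxxxxxx \
  --subnet-id subnet-xxxxxxxx --security-groups sg-xxxxxxxx   # repeat per subnet

# step 3: access point for the application
aws efs create-access-point --file-system-id fs-xxxxxxxx \
  --posix-user Uid=1000,Gid=1000 \
  --root-directory 'Path=/thanos-compactor,CreationInfo={OwnerUid=1000,OwnerGid=1000,Permissions=755}'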

Anything else we need to know?:

  1. There was an error message as the last line of the csi-driver-registrar container (by the time of writing I had lost the exact timestamp):
I0210 08:47:20.032981       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Exxxx HH:MM:SS.mmmmmm       1 efs connection.go:129] Lost connection to unix:///csi/csi.sock

  2. After restarting the EFS CSI Pod the error above disappeared, but it didn't resolve the issue - the only workaround is to taint/remove the node from the cluster and create a new EC2 instance (see the drain sketch after the events below).

  3. Meanwhile, the logs from the efs-plugin before and after the restart are not very helpful. It keeps repeating the same sequence every 4 minutes 2 seconds:

I0210 09:43:51.664790       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:43:51.673852       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:43:51.674749       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" > 
I0210 09:43:51.674810       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:43:51.674986       1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:43:51.675089       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)

I0210 09:47:53.762514       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:47:53.769379       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:47:53.770295       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" > 
I0210 09:47:53.770365       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:47:53.770390       1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:47:53.770410       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)

  4. Events (kubectl get events -n monitoring --sort-by=metadata.creationTimestamp):
4m46s       Warning   FailedMount               pod/thanos-compactor-7c8b88b68-6hflm              MountVolume.SetUp failed for volume "efs-pv-thanos-compactor" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
33m         Warning   FailedMount               pod/thanos-compactor-7c8b88b68-6hflm              Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[objstore-config data thanos-token-f7njj aws-iam-token]: timed out waiting for the condition
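
For reference, the node-replacement workaround mentioned in point 2 looks roughly like this (a hedged sketch using cordon/drain; the node name is a placeholder, and older kubectl versions use --delete-local-data instead of --delete-emptydir-data):

kubectl cordon ip-10-0-0-1.ec2.internal
kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-emptydir-data
# then terminate the EC2 instance so the node group replaces it with a fresh node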

Environment

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
    Installed using Helm chart aws-efs-csi-driver version 1.1.0 from the repo https://kubernetes-sigs.github.io/aws-efs-csi-driver with default values.
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2021
@2tim

2tim commented Feb 12, 2021

Very similar behavior here on EKS 1.18.
We were running fine for about 2 weeks, then bam - a user pings and says his pods won't start, and we get this error:

  Type     Reason              Age                  From                                    Message
  ----     ------              ----                 ----                                    -------
  Normal   Scheduled           29m                  default-scheduler                       Successfully assigned default/coco-webserver-7658b7b74c-75mgq to ip-10-xx-xx-107.ec2.internal
  Warning  FailedMount         4m36s (x6 over 20m)  kubelet, ip-10-xx-xx-107.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[persistent-storage-efs], unattached volumes=[default-token persistent-storage-efs]: timed out waiting for the condition
  Warning  FailedAttachVolume  80s (x10 over 25m)   attachdetach-controller                 AttachVolume.Attach failed for volume "efs-ai-annotations" : attachdetachment timeout for volume fs-d140xxxx
  Warning  FailedMount         1s (x7 over 27m)     kubelet, ip-10-xx-xx-107.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[persistent-storage-efs], unattached volumes=[persistent-storage-efs default-token-xxxxx]: timed out waiting for the condition

@wongma7
Contributor

wongma7 commented Feb 12, 2021

Is there any relation between this issue happening and how recently the fs/access point was created?

Or did the same fs already exist 2 weeks beforehand?

Did this only start happening with 1.1.0?
There is a similar unresolved issue happening in our CI testing master where a node will "go bad" and no amount of restarting the driver, etc. will fix mounts being stuck.

@2tim

2tim commented Feb 15, 2021

It has been the same FS the entire time. We are using 1.0.0. We manage our nodes and routinely update the AMIs with a pipeline that handles the drain and reschedule of the pods. We have not observed this process causing issues.

@2tim

2tim commented Feb 15, 2021

I also noticed the default logging level seems to produce no helpful logs for this. Are we possibly missing a system log that might show more of what is going on?

@2tim

2tim commented Feb 15, 2021

I just upgraded the CSI to the latest version (1.1.0) using the default settings in the values.yaml for the helm install. The daemonsets each restarted cleanly. The PV and PVC still show Bound as the status.

I attempted to restart the pods and they still came up with a FailedMount status.

It did not reattach until I issued the following:

kubectl patch pv pv-name -p '{"spec":{"claimRef": null}}'

and then deleted/re-applied the PVC:

kubectl delete -f pvc.yaml

The PVC hangs in a Terminating status with no end in sight. 💀

@2tim

2tim commented Feb 15, 2021

I was able to patch the PVC with the following, then delete it:

kubectl -n team1 patch pvc efs-team1 -p '{"metadata":{"finalizers":null}}'

but after reapplying the pvc the volumes are not attaching.

kubectl get volumeattachment

shows ATTACHED = false.

@wongma7 do you have any suggestions? We are unable to use the volume at this point from the cluster. This is the recommended tool from AWS for attaching EFS volumes in EKS but so far it has been very unreliable and the logs show nothing new after setting the logging level higher.

@wongma7
Contributor

wongma7 commented Feb 15, 2021

EFS shouldn't involve an attachment stage, there should be no volumeattachment object, and looking back at the error, I am not sure why there is a FailedAttach object. Can you share the efs CSIDriver object?

Could you try this log collector tool? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/log_collector.py There should be logs from efs-utils (the app that provides mount -t efs) in /var/log/amazon/efs in the container.
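
For example, something along these lines should dump those efs-utils logs from the node pod (the pod name is a placeholder, and the exact file names under /var/log/amazon/efs may differ):

kubectl -n kube-system exec efs-csi-node-xxxxx -c efs-plugin -- \
  sh -c 'ls /var/log/amazon/efs && tail -n 100 /var/log/amazon/efs/*.log'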

@2tim

2tim commented Feb 15, 2021

@wongma7 It appears we hit this issue as well: #338 during the upgrade with helm.

@wongma7
Contributor

wongma7 commented Feb 15, 2021

The default attachRequired value for a driver, if its CSIDriver object does not exist, is true, so that could explain the attach errors. Please try creating the object and let me know if pods still get stuck.

On my part, let me prioritize getting the fix out into the helm chart.

@2tim

2tim commented Feb 15, 2021

After the helm upgrade --install ... I see this:

kubectl get CSIDriver efs.csi.aws.com -o yaml
Error from server (NotFound): csidrivers.storage.k8s.io "efs.csi.aws.com" not found

@2tim

2tim commented Feb 15, 2021

EFS shouldn't involve an attachment stage, there should be no volumeattachment object, and looking back at the error, I am not sure why there is a FailedAttach object. Can you share the efs CSIDriver object?

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/resource-policy: keep
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"CSIDriver","metadata":{"annotations":{"helm.sh/hook":"pre-install","helm.sh/hook-delete-policy":"before-hook-creation","helm.sh/resource-policy":"keep"},"name":"efs.csi.aws.com"},"spec":{"attachRequired":false}}
  creationTimestamp: "2021-02-15T18:22:06Z"
  managedFields:
  - apiVersion: storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:helm.sh/hook: {}
          f:helm.sh/hook-delete-policy: {}
          f:helm.sh/resource-policy: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        f:attachRequired: {}
        f:podInfoOnMount: {}
        f:volumeLifecycleModes:
          .: {}
          v:"Persistent": {}
    manager: kubectl
    operation: Update
    time: "2021-02-15T18:22:06Z"
  name: efs.csi.aws.com
  resourceVersion: "210947747"
  selfLink: /apis/storage.k8s.io/v1/csidrivers/efs.csi.aws.com
  uid: e6230f1f-1e17-44a2-9f86-e67ad665ea3a
spec:
  attachRequired: false
  podInfoOnMount: false
  volumeLifecycleModes:
  - Persistent

Could you try this log collector tool? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/log_collector.py There should be logs from efs-utils (the app that provides mount -t efs) in /var/log/amazon/efs in the container.

cat efs_utils_state_dir
kubectl exec efs-csi-node-wvj6n -n kube-system -c efs-plugin -- find /var/run/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;


:\

@2tim

2tim commented Feb 15, 2021

@wongma7 I'm afraid that when the helm upgrade removed the CSIDriver, it left my cluster in an unknown state. I still see the volumeattachments that seem to have been created during the period when we did not have the CSIDriver. Should I delete these?

kubectl get volumeattachment
NAME                                                                   ATTACHER          PV                   NODE                            ATTACHED   AGE
csi-b8be6de6a26fa325043e2f2d295d83d80d429e37caf54bc86740ef70f81a7f8a   efs.csi.aws.com   efs-ai-annotations   ip-10-xxx-xxx-xxx.ec2.internal   false      2d23h
csi-eb827277d3edb2746f2dd98cf2de8e541e2f30b5a340aae554d946fee5bd4f6a   efs.csi.aws.com   efs-ai-annotations   ip-10-xxx-xxx-xxx.ec2.internal    false      2d23h

The nodes mentioned are the ones the scheduler is trying to run the disk-mounting pods on. I am guessing this is creating a conflict?

@wongma7
Contributor

wongma7 commented Feb 15, 2021

Yes, it is safe to delete them. Now that CSIDriver has been fixed to say attachRequired false, kubelet should not even be attempting to get these volumeattachments

@2tim

2tim commented Feb 15, 2021

@wongma7 after removing the volumeattachments mentioned above I am still unable to mount the volumes. I removed the PVC and reset the PV claimRef to null then recreated the PVC. Everything looks ok with the PV and PVC (STATUS=Bound) but the pods are still unable to mount the EFS volume. Am I missing something?

@wongma7
Contributor

wongma7 commented Feb 15, 2021

Can you share the latest events on the pod?

@2tim

2tim commented Feb 15, 2021

@wongma7

Warning  FailedMount  61s (x7 over 35m)     kubelet, ip-10-133-31-48.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[persistent-storage-efs], unattached volumes=[default-token-k7dxb persistent-storage-efs]: timed out waiting for the condition

Also, I don't see anything useful in the log_collector.py output. It's as though the driver does not see the PV/PVC?

@wongma7
Contributor

wongma7 commented Feb 15, 2021

The driver should at least be logging "NodePublish" requests if it's receiving them. These would be visible in the efs-plugin logs, e.g. kubectl logs efs-node-pod-asdf efs-plugin.

If the driver is really not logging anything, then we'll have to look at the kubelet logs to see why kubelet is not sending requests to the driver.
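
Roughly something like this (the pod name is a placeholder; journalctl assumes a systemd-based AMI):

# does the node plugin receive NodePublishVolume at all?
kubectl -n kube-system logs efs-csi-node-xxxxx -c efs-plugin | grep -i nodepublish

# if not, check kubelet on that node
journalctl -u kubelet --since "30 min ago" | grep -iE 'efs|csi'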

@2tim

2tim commented Feb 15, 2021

@wongma7

kubectl -n kube-system logs efs-csi-node-wvj6n -c efs-plugin
I0215 15:52:35.184083       1 config_dir.go:87] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0215 15:52:35.186681       1 driver.go:83] Node Service capability for Get Volume Stats Not enabled
I0215 15:52:35.189204       1 mount_linux.go:173] Cannot run systemd-run, assuming non-systemd OS
I0215 15:52:35.189214       1 mount_linux.go:174] systemd-run failed with: exit status 1
I0215 15:52:35.189223       1 mount_linux.go:175] systemd-run output: Failed to create bus connection: No such file or directory
I0215 15:52:35.189386       1 driver.go:114] Starting watchdog
I0215 15:52:35.189508       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0215 15:52:35.189592       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0215 15:52:35.195020       1 driver.go:120] Staring subreaper
I0215 15:52:35.195036       1 driver.go:123] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0215 15:52:36.956813       1 node.go:295] NodeGetInfo: called with args

@2tim

2tim commented Feb 15, 2021

This log output is the same on all the nodes.

@wongma7
Contributor

wongma7 commented Feb 15, 2021

Could you send me kubelet logs? We can continue on slack to share files if needed.

@mikalai-t
Author

Is there any relation between this issue happening and recency of fs/access point creation?

Or did the same fs already exist 2 weeks before hand?

No it didn't. Why?
Actually, it was created a couple of minutes before the EFS CSI driver was deployed.

Did this only start happening with 1.1.0?

I don't know. I started using the EFS CSI driver very recently, with version 1.1.0.

Currently we're not using EFS, but I'd like to help, so if more info is required I'll re-deploy it again. Unfortunately the issue happens intermittently, as @2tim said:

Very similar behavior here on EKS 1.18.
We were running fine for about 2 weeks, then bam - a user pings and says his pods won't start, and we get this error:

@wongma7
Contributor

wongma7 commented Feb 16, 2021

No it didn't. Why?
Actually, it was created a couple of minutes before EFS CSI driver deployed

Apparently there is some undocumented EFS-side "bootstrap" or "cooldown" soon after it is created during which mounts may not succeed.

@2tim's issue, we determined, was because the csidriver got deleted by helm chart 1.1.0/1.1.1 (#338). This meant attachRequired became true, which is wrong, so all Pods got stuck waiting for attach to complete. The solution is to recreate the csidriver with attachRequired false AND (restart kubelet OR reschedule the stuck pods to other kubelets). If your pods' events also say something like "FailedAttach" then it is the same issue.
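
For reference, a hedged sketch of that workaround using only objects already shown in this thread (the CSIDriver spec matches the one the chart is supposed to install; namespaces and names may need adjusting):

# recreate the CSIDriver object that the helm upgrade deleted (#338)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false
EOF

# delete stale VolumeAttachments created while attachRequired defaulted to true
kubectl get volumeattachment | awk '$2 == "efs.csi.aws.com" {print $1}' \
  | xargs -r kubectl delete volumeattachment

# then restart kubelet on the affected nodes, or delete the stuck pods so they reschedule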

@mikalai-t
Author

Apparently there is some undocumented EFS-side "bootstrap" or "cooldown" soon after it is created during which mounts may not succeed.

E.g. Terraform deployment log:

module.eks_cluster.aws_eks_cluster.this[0]: Creating...
module.efs.aws_efs_file_system.default[0]: Creation complete after 7s [id=fs-14df6d21]
module.efs.aws_efs_mount_target.default[2]: Creating...
module.efs.aws_efs_mount_target.default[0]: Creating...
module.efs.aws_efs_mount_target.default[1]: Creating...

# ...

aws_efs_access_point.ap_for_pv["thanos-storegateway"]: Creation complete after 3s [id=fsap-061861f64c202128f]

# ...

module.eks_cluster.aws_eks_cluster.this[0]: Still creating... [10s elapsed]
module.efs.aws_efs_mount_target.default[0]: Still creating... [10s elapsed]
module.efs.aws_efs_mount_target.default[1]: Still creating... [10s elapsed]
module.efs.aws_efs_mount_target.default[2]: Still creating... [10s elapsed]

# ...

module.efs.aws_efs_mount_target.default[0]: Creation complete after 1m27s [id=fsmt-3cb35909]

# ...

module.efs.aws_efs_mount_target.default[1]: Creation complete after 2m39s [id=fsmt-3eb3590b]
module.efs.aws_efs_mount_target.default[2]: Creation complete after 2m39s [id=fsmt-3db35908]
module.eks_cluster.aws_eks_cluster.this[0]: Still creating... [2m50s elapsed]

# ...

module.eks_cluster.aws_eks_cluster.this[0]: Still creating... [13m20s elapsed]
module.eks_cluster.aws_eks_cluster.this[0]: Creation complete after 13m23s [id=eks007]

Roughly, the cluster finished creating about 10 minutes after the last mount target was created, so the EFS CSI resources were created after that point in time...

Do you happen to know how long we should wait before attempting to use EFS for the first time?

@khurram-s

I am having the same issue with EKS 1.18
MountVolume.SetUp failed for volume "pvc-baaf6f17-6eb7-4189-89e8-450286d38856" : rpc error: code = InvalidArgument desc = Volume context property storage.kubernetes.io/csiProvisionerIdentity not supported
Warning FailedMount 7m20s (x3 over 11m) kubelet, ip-172-31-26-97.ap-southeast-2.compute.internal Unable to attach or mount volumes: unmounted volumes=[trustedexec-netcert], unattached volumes=[trustgrid-secret-path trustgrid-db-volume trustedexec-netcert default-token-v2gzz]: timed out waiting for the condition

@mikalai-t
Author

@khurram-s It doesn't seem to be the same issue. At least you have an exact error message:

code = InvalidArgument  ... storage.kubernetes.io/csiProvisionerIdentity not supported

Try removing it from your manifest.

@j-trotter-rl

j-trotter-rl commented Mar 1, 2021

I'm getting the same timeout and "unrecognized init system" errors (attached)
chart 1.1.2 on EKS 1.19
This is my first time using EFS on EKS with this AWS account

@wongma7
Contributor

wongma7 commented Mar 1, 2021

@kbasv

issue related to access point creation making an instance unable to mount efs

Roughly, the cluster started 10 minutes later the last mount target was created so the EFS CSI resources were created after that point in time...
Do you happen to know for how long we should wait before attempting to use EFS for the first time?

@kbasv

kbasv commented Mar 2, 2021

@mikalai-t
Is your application waiting for all of the EFS resources to become available before it proceeds to mount, i.e. before you create your PV & PVC?

@mikalai-t
Author

@kbasv not quite:

  1. I use Terraform to deploy VPC, EFS, EKS and so on
  2. Terraform builds its own dependency graph and creates resource in that order, e.g.: first VPC, then EFS and EKS in parallel
  3. It takes a minute to deploy EFS and AccessPoints, then 2-3 minutes to deploy EFS MountTargets (which are not explicitly used in our case).
  4. ~ 10 minutes later EKS deployment is completed and...
    4.1. PVs and PVCs are being created with kubectl apply -f ...
    4.2 HelmOperator is being deployed
  5. It takes a minute...
  6. As soon as HelmOperator is running it starts deploying e.g. Thanos chart that tries to use that PVC

As you can see there is at minimum a ~10 minute delay between the point when EFS is reported to be ready and the first PV operation. That must be enough, I think...

I can't control the Thanos application's behavior to make it wait for something... but I can include an artificial delay somewhere between p.[2] and p.[5].

Correct me if I'm wrong, but even if the application tries to use a PVC which is not ready at the moment, it should succeed on a later attempt once the PV/PVC objects are ready, thanks to Kubernetes' declarative architecture. The only question is how to get the PV and PVC ready, which is impossible due to the EFS driver error.
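
For the artificial delay mentioned above, one hedged option would be to poll the EFS mount targets instead of sleeping for a fixed time (the file system ID and manifest path are placeholders, and this assumes the mount targets have already been created):

FS_ID=fs-xxxxxxxx
# loop while any mount target is not yet "available"
while aws efs describe-mount-targets --file-system-id "$FS_ID" \
        --query 'MountTargets[].LifeCycleState' --output text \
      | tr '\t' '\n' | grep -qv '^available$'; do
  echo "waiting for EFS mount targets to become available..."
  sleep 15
done
kubectl apply -f pv-and-pvc.yaml   # then create the PV/PVC objects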

@wmgroot

wmgroot commented May 4, 2021

Alright, I have some new repro steps to produce the issue, and some log_collector.py dumps from the plugin on the node where it happened.

My test setup:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: efs-stress-test
  labels:
    app: efs-stress
spec:
  replicas: 100
  selector:
    matchLabels:
      app: efs-stress
  template:
    metadata:
      labels:
        app: efs-stress
    spec:
      containers:
      - name: efs-stress
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: efs-claim
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: efs-stress-test-2
  labels:
    app: efs-stress
spec:
  replicas: 0
  selector:
    matchLabels:
      app: efs-stress
  template:
    metadata:
      labels:
        app: efs-stress
    spec:
      containers:
      - name: efs-stress
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: efs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    # - ReadWriteOnce
    - ReadWriteMany
  storageClassName: efs-test
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-stress-test
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-test
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-39402c41

Repro steps:

  1. Ensure you already have an EFS store created with a volumeHandle available.
  2. Apply the resources above, and allow all 100 pods to become Running.
  3. Change the replicas for efs-stress-test-2 to 100, and efs-stress-test to 0. Apply.
  4. Part way through the Termination of efs-stress-test, change efs-stress-test's replicas to 0, efs-stress-test-2's replicas to 100, and Apply.
  5. Continue swapping the replica counts and applying until you notice Pods becoming stuck in ContainerCreating status.
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    11m                  default-scheduler  Successfully assigned default/efs-stress-test-7b98f4d47d-ppp45 to ip-10-118-4-139.us-east-2.compute.internal
  Warning  FailedMount  2m38s                kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[default-token-qwtk4 persistent-storage]: timed out waiting for the condition
  Warning  FailedMount  22s (x4 over 9m24s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[persistent-storage default-token-qwtk4]: timed out waiting for the condition

I went through several successful iterations of these steps until our cluster started hitting image pull failures from our internal Harbor registry because of the number of pull requests we were causing. It's possible that pods sitting in an image pull backoff or another failed state are related to causing the issue to appear.

  Warning  Failed     10s               kubelet            Failed to pull image "harbor.<internal-domain>/dockerhub-proxy/library/centos:latest": rpc error: code = Unknown desc = failed to pull and unpack image "harbor.<internal-domain>/dockerhub-proxy/library/centos:latest": failed to resolve reference "harbor.<internal-domain>/dockerhub-proxy/library/centos:latest": unexpected status code [manifests latest]: 429 Too Many Requests

Finally, the log_collector.py results are attached below, run against the efs-node pod on the same node as the pod message above.
results.zip
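
For reference, the replica swapping from the repro steps above can be driven with something like this (iteration count and sleep interval are arbitrary; deployment and label names match the manifests):

for i in $(seq 1 20); do
  kubectl scale deployment efs-stress-test --replicas=0
  kubectl scale deployment efs-stress-test-2 --replicas=100
  sleep 90
  kubectl scale deployment efs-stress-test --replicas=100
  kubectl scale deployment efs-stress-test-2 --replicas=0
  sleep 90
  # list any pods stuck in ContainerCreating after the swap
  kubectl get pods -l app=efs-stress | grep ContainerCreating || true
done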

@k8s-triage-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2021
@wmgroot

wmgroot commented Aug 11, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2021
@spence95

+1, getting same issue as above on 1.21

@salemgolemugoo

Have the same issue on 1.21 and latest helm chart of this driver

@mohannmurthy

@wmgroot Have the same issue with EKS 1.20 and aws efs csi driver 1.3.3.

@hrpatel

hrpatel commented Oct 20, 2021

Updates from the last month

  • We revisited our resource requests and limits to spread long-lived application pods and cron jobs (15K+/day) more evenly across the cluster. This eliminated load-average hotspots in our cluster.
  • We also removed unnecessary EFS volumes from cron jobs (approx. 11K fewer mounts and unmounts per day!)

Both of these changes combined significantly reduced the number of EFS errors in our deployment.

I suspect the issues in our case were a combination of high load average on the nodes and high load on EFS (I can't be sure because I have no visibility into the EFS infrastructure).

@mohannmurthy

@wongma7 Just to confirm this issue is still seen in EKS 1.20 and aws efs csi driver 1.3.3.

@mohannmurthy

mohannmurthy commented Oct 20, 2021

(quoting @wmgroot's repro steps and manifests from the comment above)
@wmgroot Were you able to find any workaround for this?

@enniomorricone123

We are seeing the same issue. It is intermittent, and once a node "goes bad", everything starts to collapse and no more pods can be scheduled to the node.

@lisenet

lisenet commented Oct 21, 2021

The Helm chart does not specify default resources. Our findings suggest that a high load average on worker nodes may cause efs csi driver CPU resource starvation. We have a case where we can replicate this behaviour every time.

Setting resource requests for containers seems to solve the problem for us, e.g.:

[...]
      containers:
        - name: efs-plugin
          image: amazon/aws-efs-csi-driver:v1.3.3
          resources:
            limits:
              memory: "1024Mi"
              cpu: "1000m"
            requests:
              memory: "32Mi"
              cpu: "20m"
[...]
        - name: csi-driver-registrar
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "16Mi"
              cpu: "10m"
[...]
        - name: liveness-probe
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "16Mi"
              cpu: "10m"
[...]

I'm not entirely sure if liveness-probe and csi-driver-registrar resource requests are required, as these containers don't show spikes in CPU usage.
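
For anyone who wants to apply the same requests/limits without editing the manifests by hand, a hedged one-liner equivalent (the DaemonSet and container names assume the default kube-system install from the chart):

kubectl -n kube-system set resources daemonset/efs-csi-node -c efs-plugin \
  --requests=cpu=20m,memory=32Mi --limits=cpu=1000m,memory=1024Mi
kubectl -n kube-system set resources daemonset/efs-csi-node -c csi-driver-registrar \
  --requests=cpu=10m,memory=16Mi --limits=cpu=500m,memory=512Mi
kubectl -n kube-system set resources daemonset/efs-csi-node -c liveness-probe \
  --requests=cpu=10m,memory=16Mi --limits=cpu=500m,memory=512Mi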

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2022
@miketexas007

I am having the same issue with EKS 1.18: MountVolume.SetUp failed for volume "pvc-baaf6f17-6eb7-4189-89e8-450286d38856" : rpc error: code = InvalidArgument desc = Volume context property storage.kubernetes.io/csiProvisionerIdentity not supported. Warning FailedMount 7m20s (x3 over 11m) kubelet, ip-172-31-26-97.ap-southeast-2.compute.internal Unable to attach or mount volumes: unmounted volumes=[trustedexec-netcert], unattached volumes=[trustgrid-secret-path trustgrid-db-volume trustedexec-netcert default-token-v2gzz]: timed out waiting for the condition

Hi Khurram, I had the same issue ("storage.kubernetes.io/csiProvisionerIdentity not supported"). What is the resolution for this? Can you please help me?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@V0lantis

Seeing the same issue on the latest version of CSI driver

@V0lantis

/reopen

@k8s-ci-robot
Contributor

@V0lantis: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
