
Pods are getting stuck in ContainerCreating because of EFS CSI driver error #325

Closed
mikalai-t opened this issue Feb 10, 2021 · 51 comments · May be fixed by #1074
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@mikalai-t

/kind bug

Hello.

What happened?
Sometimes, on a new EC2 instance, the EFS CSI driver can't handle PVCs from Pods running on that node, which causes the issue described in the subject.

What you expected to happen?
The PVC should be handled correctly so the volume can be mounted and attached to the Pod.

How to reproduce it (as minimally and precisely as possible)?
It happens intermittently. Generally the process is as follows:

  1. Create an EKS cluster and Node group (it doesn't matter which EC2 instance type you choose - it happened on both t3.medium and m5.large).

  2. Create an EFS file system and a SecurityGroup for EFS that allows connections to 2049/tcp from the VPC CIDR. Say the EFS ID is fs-33559c06.

  3. Create an EFS AccessPoint for the particular application (e.g. ID fsap-085397e3f5ef75ead). A rough AWS CLI sketch of steps 2-3 follows the manifests below.

  4. Create StorageClass:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com

  5. Create PersistentVolume:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv-thanos-compactor
  labels:
    'app.kubernetes.io/name': 'thanos-compactor'
spec:
  capacity:
    storage: 100Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: 'fs-33559c06::fsap-085397e3f5ef75ead'
    volumeAttributes:
      encryptInTransit: 'true'

  6. Create PersistentVolumeClaim:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app.kubernetes.io/name: thanos-compactor
  name: thanos-compactor
  namespace: monitoring
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: efs-sc
  volumeMode: Filesystem
  selector:
    matchLabels:
      "app.kubernetes.io/name": thanos-compactor

  7. Create Pod:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/psp: eks.privileged
  labels:
    app.kubernetes.io/component: compactor
    app.kubernetes.io/instance: monitoring-thanos
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: thanos
    helm.sh/chart: thanos-3.8.1
    pod-template-hash: 7c8b88b68
  name: thanos-compactor-7c8b88b68-6hflm
  namespace: monitoring

# ...
spec:
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: thanos-compactor

# ...
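
For reference, a rough AWS CLI sketch of steps 2-3 above (all IDs, names and the CIDR are placeholders, and the flags may need adjusting to your setup):

# step 2: file system, security group allowing NFS (2049/tcp) from the VPC CIDR, and mount targets
aws efs create-file-system --encrypted --tags Key=Name,Value=eks-efs
aws ec2 create-security-group --group-name eks-efs-sg \
  --description "Allow NFS from VPC" --vpc-id vpc-xxxxxxxx
aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
  --protocol tcp --port 2049 --cidr 10.0.0.0/16
aws efs create-mount-target --file-system-id fs-xxxxxxxx \
  --subnet-id subnet-xxxxxxxx --security-groups sg-xxxxxxxx   # repeat per subnet

# step 3: access point for the application
aws efs create-access-point --file-system-id fs-xxxxxxxx \
  --posix-user Uid=1000,Gid=1000 \
  --root-directory 'Path=/thanos-compactor,CreationInfo={OwnerUid=1000,OwnerGid=1000,Permissions=755}'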

Anything else we need to know?:

  1. There was an error message as the last line of the csi-driver-registrar container (by the time of writing I had lost the exact timestamp):
I0210 08:47:20.032981       1 main.go:87] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}
Exxxx HH:MM:SS.mmmmmm       1 efs connection.go:129] Lost connection to unix:///csi/csi.sock

  2. After restarting the EFS CSI Pod the error above disappeared, but it didn't resolve the issue - the only workaround is to taint/remove the node from the cluster and create a new EC2 instance (see the drain sketch after the events below).

  3. Meanwhile, the logs from the efs-plugin before and after the restart are not very helpful. It keeps repeating the same sequence every 4 minutes 2 seconds:

I0210 09:43:51.664790       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:43:51.673852       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:43:51.674749       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" > 
I0210 09:43:51.674810       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:43:51.674986       1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:43:51.675089       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)

I0210 09:47:53.762514       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:47:53.769379       1 node.go:279] NodeGetCapabilities: called with args 
I0210 09:47:53.770295       1 node.go:51] NodePublishVolume: called with args volume_id:"fs-33559c06::fsap-085397e3f5ef75ead" target_path:"/var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount" volume_capability:<mount:<> access_mode:<mode:MULTI_NODE_MULTI_WRITER > > volume_context:<key:"encryptInTransit" value:"true" > 
I0210 09:47:53.770365       1 node.go:167] NodePublishVolume: creating dir /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount
I0210 09:47:53.770390       1 node.go:172] NodePublishVolume: mounting fs-33559c06:/ at /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount with options [accesspoint=fsap-085397e3f5ef75ead tls]
I0210 09:47:53.770410       1 mount_linux.go:146] Mounting cmd (mount) with arguments (-t efs -o accesspoint=fsap-085397e3f5ef75ead,tls fs-33559c06:/ /var/lib/kubelet/pods/c67a556f-c9eb-466c-baaf-11d8a1161508/volumes/kubernetes.io~csi/efs-pv-thanos-compactor/mount)

  4. Events (kubectl get events -n monitoring --sort-by=metadata.creationTimestamp):
4m46s       Warning   FailedMount               pod/thanos-compactor-7c8b88b68-6hflm              MountVolume.SetUp failed for volume "efs-pv-thanos-compactor" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
33m         Warning   FailedMount               pod/thanos-compactor-7c8b88b68-6hflm              Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[objstore-config data thanos-token-f7njj aws-iam-token]: timed out waiting for the condition
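
For reference, the node-replacement workaround mentioned in point 2 looks roughly like this (a hedged sketch using cordon/drain; the node name is a placeholder, and older kubectl versions use --delete-local-data instead of --delete-emptydir-data):

kubectl cordon ip-10-0-0-1.ec2.internal
kubectl drain ip-10-0-0-1.ec2.internal --ignore-daemonsets --delete-emptydir-data
# then terminate the EC2 instance so the node group replaces it with a fresh node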

Environment

  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.9-eks-d1db3c", GitCommit:"d1db3c46e55f95d6a7d3e5578689371318f95ff9", GitTreeState:"clean", BuildDate:"2020-10-20T22:18:07Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
  • Driver version:
    Installed using Helm chart aws-efs-csi-driver version 1.1.0 from the repo https://kubernetes-sigs.github.io/aws-efs-csi-driver with default values.
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2021
@2tim

2tim commented Feb 12, 2021

Very similar behavior here on EKS 1.18.
We were running fine for about 2 weeks, then bam - a user pings and says his pods won't start, and we get this error:

  Type     Reason              Age                  From                                    Message
  ----     ------              ----                 ----                                    -------
  Normal   Scheduled           29m                  default-scheduler                       Successfully assigned default/coco-webserver-7658b7b74c-75mgq to ip-10-xx-xx-107.ec2.internal
  Warning  FailedMount         4m36s (x6 over 20m)  kubelet, ip-10-xx-xx-107.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[persistent-storage-efs], unattached volumes=[default-token persistent-storage-efs]: timed out waiting for the condition
  Warning  FailedAttachVolume  80s (x10 over 25m)   attachdetach-controller                 AttachVolume.Attach failed for volume "efs-ai-annotations" : attachdetachment timeout for volume fs-d140xxxx
  Warning  FailedMount         1s (x7 over 27m)     kubelet, ip-10-xx-xx-107.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[persistent-storage-efs], unattached volumes=[persistent-storage-efs default-token-xxxxx]: timed out waiting for the condition

@wongma7
Contributor

wongma7 commented Feb 12, 2021

Is there any relation between this issue happening and how recently the fs/access point was created?

Or did the same fs already exist 2 weeks beforehand?

Did this only start happening with 1.1.0?
There is a similar unresolved issue happening in our CI testing master where a node will "go bad" and no amount of restarting the driver, etc. will fix mounts being stuck.

@2tim

2tim commented Feb 15, 2021

It has been the same FS the entire time. We are using 1.0.0. We manage our nodes and routinely update the AMIs with a pipeline that handles the drain and reschedule of the pods. We have not observed this process causing issues.

@2tim

2tim commented Feb 15, 2021

I also noticed the default logging level seems to produce no helpful logs for this. Are we possibly missing a system log that might show more of what is going on?

@2tim

2tim commented Feb 15, 2021

I just upgraded the CSI to the latest version (1.1.0) using the default settings in the values.yaml for the helm install. The daemonsets each restarted cleanly. The PV and PVC still show Bound as the status.

I attempted to restart the pods and they still came up with a FailedMount status.

It did not reattach until I issued the following:

kubectl patch pv pv-name -p '{"spec":{"claimRef": null}}'

and then deleted/re-applied the PVC:

kubectl delete -f pvc.yaml

The PVC hangs in a Terminating status with no end in sight. 💀

@2tim

2tim commented Feb 15, 2021

I was able to patch the PVC with the following, then delete it:

kubectl -n team1 patch pvc efs-team1 -p '{"metadata":{"finalizers":null}}'

but after reapplying the pvc the volumes are not attaching.

kubectl get volumeattachment

shows ATTACHED = false.

@wongma7 do you have any suggestions? We are unable to use the volume at this point from the cluster. This is the recommended tool from AWS for attaching EFS volumes in EKS but so far it has been very unreliable and the logs show nothing new after setting the logging level higher.

@wongma7
Contributor

wongma7 commented Feb 15, 2021

EFS shouldn't involve an attachment stage, there should be no volumeattachment object, and looking back at the error, I am not sure why there is a FailedAttach object. Can you share the efs CSIDriver object?

Could you try this log collector tool? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/log_collector.py There should be logs from efs-utils (the app that provides mount -t efs) in /var/log/amazon/efs in the container.
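
For example, something along these lines should dump those efs-utils logs from the node pod (the pod name is a placeholder, and the exact file names under /var/log/amazon/efs may differ):

kubectl -n kube-system exec efs-csi-node-xxxxx -c efs-plugin -- \
  sh -c 'ls /var/log/amazon/efs && tail -n 100 /var/log/amazon/efs/*.log'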

@2tim

2tim commented Feb 15, 2021

@wongma7 It appears we hit this issue as well: #338 during the upgrade with helm.

@wongma7
Contributor

wongma7 commented Feb 15, 2021

The default attachRequired value for a driver, if its CSIDriver object does not exist, is true, so that could explain the attach errors. Please try creating the object and let me know if pods still get stuck.

On my part, let me prioritize getting the fix out into the helm chart.

@2tim

2tim commented Feb 15, 2021

After the helm upgrade --install ... I see this:

kubectl get CSIDriver efs.csi.aws.com -o yaml
Error from server (NotFound): csidrivers.storage.k8s.io "efs.csi.aws.com" not found

@2tim

2tim commented Feb 15, 2021

EFS shouldn't involve an attachment stage, there should be no volumeattachment object, and looking back at the error, I am not sure why there is a FailedAttach object. Can you share the efs CSIDriver object?

apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  annotations:
    helm.sh/hook: pre-install
    helm.sh/hook-delete-policy: before-hook-creation
    helm.sh/resource-policy: keep
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"storage.k8s.io/v1","kind":"CSIDriver","metadata":{"annotations":{"helm.sh/hook":"pre-install","helm.sh/hook-delete-policy":"before-hook-creation","helm.sh/resource-policy":"keep"},"name":"efs.csi.aws.com"},"spec":{"attachRequired":false}}
  creationTimestamp: "2021-02-15T18:22:06Z"
  managedFields:
  - apiVersion: storage.k8s.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:helm.sh/hook: {}
          f:helm.sh/hook-delete-policy: {}
          f:helm.sh/resource-policy: {}
          f:kubectl.kubernetes.io/last-applied-configuration: {}
      f:spec:
        f:attachRequired: {}
        f:podInfoOnMount: {}
        f:volumeLifecycleModes:
          .: {}
          v:"Persistent": {}
    manager: kubectl
    operation: Update
    time: "2021-02-15T18:22:06Z"
  name: efs.csi.aws.com
  resourceVersion: "210947747"
  selfLink: /apis/storage.k8s.io/v1/csidrivers/efs.csi.aws.com
  uid: e6230f1f-1e17-44a2-9f86-e67ad665ea3a
spec:
  attachRequired: false
  podInfoOnMount: false
  volumeLifecycleModes:
  - Persistent

Could you try this log collector tool? https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/troubleshooting/log_collector.py There should be logs from efs-utils (the app that provides mount -t efs) in /var/log/amazon/efs in the container.

cat efs_utils_state_dir
kubectl exec efs-csi-node-wvj6n -n kube-system -c efs-plugin -- find /var/run/efs -type f -exec echo {} \; -exec cat {} \; -exec echo \;


:\

@2tim

2tim commented Feb 15, 2021

@wongma7 I'm afraid that when the helm upgrade removed the CSIDriver, it left my cluster in an unknown state. I still see the volumeattachments that seem to have been created during the period when we did not have the CSIDriver. Should I delete these?

kubectl get volumeattachment
NAME                                                                   ATTACHER          PV                   NODE                            ATTACHED   AGE
csi-b8be6de6a26fa325043e2f2d295d83d80d429e37caf54bc86740ef70f81a7f8a   efs.csi.aws.com   efs-ai-annotations   ip-10-xxx-xxx-xxx.ec2.internal   false      2d23h
csi-eb827277d3edb2746f2dd98cf2de8e541e2f30b5a340aae554d946fee5bd4f6a   efs.csi.aws.com   efs-ai-annotations   ip-10-xxx-xxx-xxx.ec2.internal    false      2d23h

The nodes mentioned are the ones the scheduler is trying to run the disk-mounting pods on. I am guessing this is creating a conflict?

@wongma7
Contributor

wongma7 commented Feb 15, 2021

Yes, it is safe to delete them. Now that CSIDriver has been fixed to say attachRequired false, kubelet should not even be attempting to get these volumeattachments

@2tim

2tim commented Feb 15, 2021

@wongma7 after removing the volumeattachments mentioned above I am still unable to mount the volumes. I removed the PVC and reset the PV claimRef to null then recreated the PVC. Everything looks ok with the PV and PVC (STATUS=Bound) but the pods are still unable to mount the EFS volume. Am I missing something?

@wongma7
Contributor

wongma7 commented Feb 15, 2021

Can you share the latest events on the pod?

@2tim

2tim commented Feb 15, 2021

@wongma7

Warning  FailedMount  61s (x7 over 35m)     kubelet, ip-10-133-31-48.ec2.internal  Unable to attach or mount volumes: unmounted volumes=[persistent-storage-efs], unattached volumes=[default-token-k7dxb persistent-storage-efs]: timed out waiting for the condition

Also, I don't see anything useful in the log_collector.py output. It's as though the driver does not see the PV/PVC?

@wongma7
Contributor

wongma7 commented Feb 15, 2021

The driver should at least be logging "NodePublish" requests if it's receiving them. These would be visible in the efs-plugin logs, e.g. kubectl logs efs-node-pod-asdf efs-plugin.

If the driver is really not logging anything, then we'll have to look at the kubelet logs to see why kubelet is not sending requests to the driver.
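
Roughly something like this (the pod name is a placeholder; journalctl assumes a systemd-based AMI):

# does the node plugin receive NodePublishVolume at all?
kubectl -n kube-system logs efs-csi-node-xxxxx -c efs-plugin | grep -i nodepublish

# if not, check kubelet on that node
journalctl -u kubelet --since "30 min ago" | grep -iE 'efs|csi'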

@2tim

2tim commented Feb 15, 2021

@wongma7

kubectl -n kube-system logs efs-csi-node-wvj6n -c efs-plugin
I0215 15:52:35.184083       1 config_dir.go:87] Creating symlink from '/etc/amazon/efs' to '/var/amazon/efs'
I0215 15:52:35.186681       1 driver.go:83] Node Service capability for Get Volume Stats Not enabled
I0215 15:52:35.189204       1 mount_linux.go:173] Cannot run systemd-run, assuming non-systemd OS
I0215 15:52:35.189214       1 mount_linux.go:174] systemd-run failed with: exit status 1
I0215 15:52:35.189223       1 mount_linux.go:175] systemd-run output: Failed to create bus connection: No such file or directory
I0215 15:52:35.189386       1 driver.go:114] Starting watchdog
I0215 15:52:35.189508       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.conf since it doesn't exist
I0215 15:52:35.189592       1 efs_watch_dog.go:174] Copying /etc/amazon/efs/efs-utils.crt since it doesn't exist
I0215 15:52:35.195020       1 driver.go:120] Staring subreaper
I0215 15:52:35.195036       1 driver.go:123] Listening for connections on address: &net.UnixAddr{Name:"/csi/csi.sock", Net:"unix"}
I0215 15:52:36.956813       1 node.go:295] NodeGetInfo: called with args

@2tim

2tim commented Feb 15, 2021

This log output is the same on all the nodes.

@wongma7
Contributor

wongma7 commented Feb 15, 2021

Could you send me kubelet logs? We can continue on slack to share files if needed.

@mikalai-t
Author

Is there any relation between this issue happening and recency of fs/access point creation?

Or did the same fs already exist 2 weeks before hand?

No it didn't. Why?
Actually, it was created a couple of minutes before the EFS CSI driver was deployed.

Did this only start happening with 1.1.0?

I don't know. I started using the EFS CSI driver very recently, with version 1.1.0.

Currently we're not using EFS, but I'd like to help, so if more info is required I'll re-deploy it again. Unfortunately the issue happens intermittently, as @2tim said:

Very similar behavior here on EKS 1.18.
We were running fine for about 2 weeks, then bam - a user pings and says his pods won't start, and we get this error:

@wongma7
Contributor

wongma7 commented Feb 16, 2021

No it didn't. Why?
Actually, it was created a couple of minutes before EFS CSI driver deployed

Apparently there is some undocumented EFS-side "bootstrap" or "cooldown" soon after it is created during which mounts may not succeed.

@2tim's issue, we determined, was because the csidriver got deleted by helm chart 1.1.0/1.1.1 (#338). This meant attachRequired became true, which is wrong, so all Pods got stuck waiting for attach to complete. The solution is to recreate the csidriver with attachRequired false AND (restart kubelet OR reschedule the stuck pods to other kubelets). If your pods' events also say something like "FailedAttach" then it is the same issue.
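
For reference, a hedged sketch of that workaround using only objects already shown in this thread (the CSIDriver spec matches the one the chart is supposed to install; namespaces and names may need adjusting):

# recreate the CSIDriver object that the helm upgrade deleted (#338)
cat <<'EOF' | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: efs.csi.aws.com
spec:
  attachRequired: false
EOF

# delete stale VolumeAttachments created while attachRequired defaulted to true
kubectl get volumeattachment | awk '$2 == "efs.csi.aws.com" {print $1}' \
  | xargs -r kubectl delete volumeattachment

# then restart kubelet on the affected nodes, or delete the stuck pods so they reschedule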

@mikalai-t
Author

Apparently there is some undocumented EFS-side "bootstrap" or "cooldown" soon after it is created during which mounts may not succeed.

E.g. Terraform deployment log:

module.eks_cluster.aws_eks_cluster.this[0]: Creating...
module.efs.aws_efs_file_system.default[0]: Creation complete after 7s [id=fs-14df6d21]
module.efs.aws_efs_mount_target.default[2]: Creating...
module.efs.aws_efs_mount_target.default[0]: Creating...
module.efs.aws_efs_mount_target.default[1]: Creating...

# ...

aws_efs_access_point.ap_for_pv["thanos-storegateway"]: Creation complete after 3s [id=fsap-061861f64c202128f]

# ...

module.eks_cluster.aws_eks_cluster.this[0]: Still creating... [10s elapsed]
module.efs.aws_efs_mount_target.default[0]: Still creating... [10s elapsed]
module.efs.aws_efs_mount_target.default[1]: Still creating... [10s elapsed]
module.efs.aws_efs_mount_target.default[2]: Still creating... [10s elapsed]

# ...

module.efs.aws_efs_mount_target.default[0]: Creation complete after 1m27s [id=fsmt-3cb35909]

# ...

module.efs.aws_efs_mount_target.default[1]: Creation complete after 2m39s [id=fsmt-3eb3590b]
module.efs.aws_efs_mount_target.default[2]: Creation complete after 2m39s [id=fsmt-3db35908]
module.eks_cluster.aws_eks_cluster.this[0]: Still creating... [2m50s elapsed]

# ...

module.eks_cluster.aws_eks_cluster.this[0]: Still creating... [13m20s elapsed]
module.eks_cluster.aws_eks_cluster.this[0]: Creation complete after 13m23s [id=eks007]

Roughly, the cluster finished creating about 10 minutes after the last mount target was created, so the EFS CSI resources were created after that point in time...

Do you happen to know how long we should wait before attempting to use EFS for the first time?

@khurram-s

I am having the same issue with EKS 1.18
MountVolume.SetUp failed for volume "pvc-baaf6f17-6eb7-4189-89e8-450286d38856" : rpc error: code = InvalidArgument desc = Volume context property storage.kubernetes.io/csiProvisionerIdentity not supported
Warning FailedMount 7m20s (x3 over 11m) kubelet, ip-172-31-26-97.ap-southeast-2.compute.internal Unable to attach or mount volumes: unmounted volumes=[trustedexec-netcert], unattached volumes=[trustgrid-secret-path trustgrid-db-volume trustedexec-netcert default-token-v2gzz]: timed out waiting for the condition

@mikalai-t
Author

@khurram-s It doesn't seem to be the same issue. At least you have an exact error message:

code = InvalidArgument  ... storage.kubernetes.io/csiProvisionerIdentity not supported

Try removing it from your manifest.

@j-trotter-rl

j-trotter-rl commented Mar 1, 2021

I'm getting the same timeout and "unrecognized init system" errors (attached)
chart 1.1.2 on EKS 1.19
This is my first time using EFS on EKS with this AWS account

@wongma7
Contributor

wongma7 commented Mar 1, 2021

@kbasv

issue related to access point creation making an instance unable to mount efs

Roughly, the cluster started 10 minutes later the last mount target was created so the EFS CSI resources were created after that point in time...
Do you happen to know for how long we should wait before attempting to use EFS for the first time?

@kbasv

kbasv commented Mar 2, 2021

@mikalai-t
Is your application waiting for all of the EFS resources to become available before it proceeds to mount, i.e. before you create your PV & PVC?

@mikalai-t
Author

@kbasv not quite:

  1. I use Terraform to deploy VPC, EFS, EKS and so on
  2. Terraform builds its own dependency graph and creates resource in that order, e.g.: first VPC, then EFS and EKS in parallel
  3. It takes a minute to deploy EFS and AccessPoints, then 2-3 minutes to deploy EFS MountTargets (which are not explicitly used in our case).
  4. ~ 10 minutes later EKS deployment is completed and...
    4.1. PVs and PVCs are being created with kubectl apply -f ...
    4.2 HelmOperator is being deployed
  5. It takes a minute...
  6. As soon as HelmOperator is running it starts deploying e.g. Thanos chart that tries to use that PVC

As you can see there is at minimum a ~10 minute delay between the point when EFS is reported to be ready and the first PV operation. That must be enough, I think...

I can't control the Thanos application's behavior to make it wait for something... but I can include an artificial delay somewhere between p.[2] and p.[5].

Correct me if I'm wrong, but even if the application tries to use a PVC which is not ready at the moment, it should succeed on a later attempt once the PV/PVC objects are ready, thanks to Kubernetes' declarative architecture. The only question is how to get the PV and PVC ready, which is impossible due to the EFS driver error.
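
For the artificial delay mentioned above, one hedged option would be to poll the EFS mount targets instead of sleeping for a fixed time (the file system ID and manifest path are placeholders, and this assumes the mount targets have already been created):

FS_ID=fs-xxxxxxxx
# loop while any mount target is not yet "available"
while aws efs describe-mount-targets --file-system-id "$FS_ID" \
        --query 'MountTargets[].LifeCycleState' --output text \
      | tr '\t' '\n' | grep -qv '^available$'; do
  echo "waiting for EFS mount targets to become available..."
  sleep 15
done
kubectl apply -f pv-and-pvc.yaml   # then create the PV/PVC objects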

@wmgroot

wmgroot commented May 4, 2021

Alright, I have some new repro steps to produce the issue, and some log_collector.py dumps from the plugin on the node where it happened.

My test setup:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: efs-stress-test
  labels:
    app: efs-stress
spec:
  replicas: 100
  selector:
    matchLabels:
      app: efs-stress
  template:
    metadata:
      labels:
        app: efs-stress
    spec:
      containers:
      - name: efs-stress
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: efs-claim
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: efs-stress-test-2
  labels:
    app: efs-stress
spec:
  replicas: 0
  selector:
    matchLabels:
      app: efs-stress
  template:
    metadata:
      labels:
        app: efs-stress
    spec:
      containers:
      - name: efs-stress
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
        volumeMounts:
        - name: persistent-storage
          mountPath: /data
      volumes:
      - name: persistent-storage
        persistentVolumeClaim:
          claimName: efs-claim
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    # - ReadWriteOnce
    - ReadWriteMany
  storageClassName: efs-test
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-stress-test
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-test
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-39402c41

Repro steps:

  1. Ensure you already have an EFS store created with a volumeHandle available.
  2. Apply the resources above, and allow all 100 pods to become Running.
  3. Change the replicas for efs-stress-test-2 to 100, and efs-stress-test to 0. Apply.
  4. Part way through the Termination of efs-stress-test, change efs-stress-test's replicas to 0, efs-stress-test-2's replicas to 100, and Apply.
  5. Continue swapping the replica counts and applying until you notice Pods becoming stuck in ContainerCreating status.
Events:
  Type     Reason       Age                  From               Message
  ----     ------       ----                 ----               -------
  Normal   Scheduled    11m                  default-scheduler  Successfully assigned default/efs-stress-test-7b98f4d47d-ppp45 to ip-10-118-4-139.us-east-2.compute.internal
  Warning  FailedMount  2m38s                kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[default-token-qwtk4 persistent-storage]: timed out waiting for the condition
  Warning  FailedMount  22s (x4 over 9m24s)  kubelet            Unable to attach or mount volumes: unmounted volumes=[persistent-storage], unattached volumes=[persistent-storage default-token-qwtk4]: timed out waiting for the condition

I went through several successful iterations of these steps until our cluster started hitting image pull failures from our internal Harbor registry because of the number of pull requests we were causing. It's possible that pods sitting in an image pull backoff or another failed state are related to causing the issue to appear.

  Warning  Failed     10s               kubelet            Failed to pull image "harbor.<internal-domain>/dockerhub-proxy/library/centos:latest": rpc error: code = Unknown desc = failed to pull and unpack image "harbor.<internal-domain>/dockerhub-proxy/library/centos:latest": failed to resolve reference "harbor.<internal-domain>/dockerhub-proxy/library/centos:latest": unexpected status code [manifests latest]: 429 Too Many Requests

Finally, the log_collector.py results are attached below, run against the efs-node pod on the same node as the pod message above.
results.zip
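
For reference, the replica swapping from the repro steps above can be driven with something like this (iteration count and sleep interval are arbitrary; deployment and label names match the manifests):

for i in $(seq 1 20); do
  kubectl scale deployment efs-stress-test --replicas=0
  kubectl scale deployment efs-stress-test-2 --replicas=100
  sleep 90
  kubectl scale deployment efs-stress-test --replicas=100
  kubectl scale deployment efs-stress-test-2 --replicas=0
  sleep 90
  # list any pods stuck in ContainerCreating after the swap
  kubectl get pods -l app=efs-stress | grep ContainerCreating || true
done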

@k8s-triage-robot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 2, 2021
@wmgroot

wmgroot commented Aug 11, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2021
@spence95

+1, getting same issue as above on 1.21

@salemgolemugoo

Have the same issue on 1.21 and latest helm chart of this driver

@mohannmurthy

@wmgroot Have the same issue with EKS 1.20 and aws efs csi driver 1.3.3.

@hrpatel

hrpatel commented Oct 20, 2021

Updates from the last month

  • We revisited our resource requests and limits to spread long-lived application pods and cron jobs (15K+/day) more evenly across the cluster. This eliminated load-average hotspots in our cluster.
  • We also removed unnecessary EFS volumes from cron jobs (approx. 11K fewer mounts and unmounts per day!)

Both of these changes combined significantly reduced the number of EFS errors in our deployment.

I suspect the issues in our case were a combination of high load average on the nodes and high load on EFS (I can't be sure because I have no visibility into the EFS infrastructure).

@mohannmurthy

@wongma7 Just to confirm this issue is still seen in EKS 1.20 and aws efs csi driver 1.3.3.

@mohannmurthy

mohannmurthy commented Oct 20, 2021

(quoting @wmgroot's repro steps and manifests from the comment above)
@wmgroot Were you able to find any workaround for this?

@enniomorricone123

We are seeing the same issue. It is intermittent, and once a node "goes bad", everything starts to collapse and no more pods can be scheduled to the node.

@lisenet

lisenet commented Oct 21, 2021

The Helm chart does not specify default resources. Our findings suggest that a high load average on worker nodes may cause efs csi driver CPU resource starvation. We have a case where we can replicate this behaviour every time.

Setting resource requests for containers seems to solve the problem for us, e.g.:

[...]
      containers:
        - name: efs-plugin
          image: amazon/aws-efs-csi-driver:v1.3.3
          resources:
            limits:
              memory: "1024Mi"
              cpu: "1000m"
            requests:
              memory: "32Mi"
              cpu: "20m"
[...]
        - name: csi-driver-registrar
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "16Mi"
              cpu: "10m"
[...]
        - name: liveness-probe
          resources:
            limits:
              memory: "512Mi"
              cpu: "500m"
            requests:
              memory: "16Mi"
              cpu: "10m"
[...]

I'm not entirely sure if liveness-probe and csi-driver-registrar resource requests are required, as these containers don't show spikes in CPU usage.
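
For anyone who wants to apply the same requests/limits without editing the manifests by hand, a hedged one-liner equivalent (the DaemonSet and container names assume the default kube-system install from the chart):

kubectl -n kube-system set resources daemonset/efs-csi-node -c efs-plugin \
  --requests=cpu=20m,memory=32Mi --limits=cpu=1000m,memory=1024Mi
kubectl -n kube-system set resources daemonset/efs-csi-node -c csi-driver-registrar \
  --requests=cpu=10m,memory=16Mi --limits=cpu=500m,memory=512Mi
kubectl -n kube-system set resources daemonset/efs-csi-node -c liveness-probe \
  --requests=cpu=10m,memory=16Mi --limits=cpu=500m,memory=512Mi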

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 21, 2022
@miketexas007

I am having the same issue with EKS 1.18: MountVolume.SetUp failed for volume "pvc-baaf6f17-6eb7-4189-89e8-450286d38856" : rpc error: code = InvalidArgument desc = Volume context property storage.kubernetes.io/csiProvisionerIdentity not supported. Warning FailedMount 7m20s (x3 over 11m) kubelet, ip-172-31-26-97.ap-southeast-2.compute.internal Unable to attach or mount volumes: unmounted volumes=[trustedexec-netcert], unattached volumes=[trustgrid-secret-path trustgrid-db-volume trustedexec-netcert default-token-v2gzz]: timed out waiting for the condition

Hi Khurram, I had the same issue ("storage.kubernetes.io/csiProvisionerIdentity not supported"). What is the resolution for this? Can you please help me?

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@V0lantis

Seeing the same issue on the latest version of CSI driver

@V0lantis

/reopen

@k8s-ci-robot
Contributor

@V0lantis: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
