VolumeFailedDelete, when i deleted pvc but pv wasn't deleted #595

Closed
liuyuexizhi opened this issue Jan 24, 2024 · 27 comments · Fixed by #602 or #729
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@liuyuexizhi commented Jan 24, 2024

Warning VolumeFailedDelete 7s (x5 over 22s) nfs.csi.k8s.io_linux-k8s-master-2_c99df0ef-f12f-40a3-998f-f99dbc758a17 rpc error: code = Internal desc = archive subdirectory(/tmp/pvc-ee9206d9-a4fc-4897-93ab-0bb3f25c893a/default/test-busybox-pvc, /tmp/pvc-ee9206d9-a4fc-4897-93ab-0bb3f25c893a/archived-default/test-busybox-pvc) failed with rename /tmp/pvc-ee9206d9-a4fc-4897-93ab-0bb3f25c893a/default/test-busybox-pvc /tmp/pvc-ee9206d9-a4fc-4897-93ab-0bb3f25c893a/archived-default/test-busybox-pvc: no such file or directory

@liuyuexizhi (Author)

It seems the subdirectory is archived after the unmount operation.

StorageClass YAML:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: 56-nfs-sc
provisioner: nfs.csi.k8s.io
parameters:
  server: "xxx.xxx.xxx.xxx"
  share: "/vol1/k8s"
  subDir: "${pvc.metadata.namespace}/${pvc.metadata.name}"
  mountPermissions: "0"
  onDelete: "archive"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
mountOptions:
  - nfsvers=3
  - rsize=1048576
  - wsize=1048576
  - tcp
  - hard
  - nolock

PVC YAML:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc-nfs-dynamic
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
  storageClassName: 56-nfs-sc

Can somebody help me?

@liuyuexizhi changed the title from VolumeFailedDelete to VolumeFailedDelete, when i deleted pvc but pv wasn't deleted Jan 25, 2024
@liuyuexizhi (Author) commented Jan 25, 2024

New info: when subDir is not set, it works!

StorageClass YAML:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: 56-nfs-sc
provisioner: nfs.csi.k8s.io
parameters:
  server: "xxx.xxx.xxx.xxx"
  share: "/vol1/k8s"
  #subDir: "${pvc.metadata.namespace}/${pvc.metadata.name}"
  mountPermissions: "0"
  onDelete: "archive"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
mountOptions:
  - nfsvers=3
  - rsize=1048576
  - wsize=1048576
  - tcp
  - hard
  - nolock

@andyzhangx (Member)

so archive mode with subDir set does not work, right? @liuyuexizhi

@andyzhangx (Member)

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 25, 2024
@liuyuexizhi (Author)

so archive mode with subDir set does not work, right? @liuyuexizhi

Yes, it does not work with subDir set!

@liuyuexizhi (Author)

@andyzhangx Hi, I found a new issue: I cannot expand the PVC, even though I set allowVolumeExpansion: true on the StorageClass.

  Normal   ExternalProvisioning   7m5s (x2 over 7m5s)  persistentvolume-controller                                             waiting for a volume to be created, either by external provisioner "nfs.csi.k8s.io" or manually created by system administrator
  Normal   Provisioning           7m5s                 nfs.csi.k8s.io_linux-k8s-master-3_365b3c6f-b74a-493b-9058-6ee923c437ce  External provisioner is provisioning volume for claim "default/test-pvc-nfs-dynamic"
  Normal   ProvisioningSucceeded  7m5s                 nfs.csi.k8s.io_linux-k8s-master-3_365b3c6f-b74a-493b-9058-6ee923c437ce  Successfully provisioned volume pvc-ff340d9e-8d99-4dda-8d4f-93d132e20765
  Warning  ExternalExpanding      6m40s                volume_expand                                                           Ignoring the PVC: didn't find a plugin capable of expanding the volume; waiting for an external controller to process this PVC.

Do you know anything about this?

@liuyuexizhi (Author)

Also, I cannot limit the PVC storage!

@andyzhangx (Member)

This driver does not support PVC expansion.

@andyzhangx (Member)

so archive mode with subDir set does not work, right? @liuyuexizhi

Yes, it does not work with subDir set!

@liuyuexizhi if you set subDir: "${pvc.metadata.namespace}-${pvc.metadata.name}", it should work.
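
For reference, a minimal sketch of that workaround applied to the StorageClass from the original report — the only change is the flat, hyphen-separated subDir instead of the nested namespace/name form:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: 56-nfs-sc
provisioner: nfs.csi.k8s.io
parameters:
  server: "xxx.xxx.xxx.xxx"
  share: "/vol1/k8s"
  subDir: "${pvc.metadata.namespace}-${pvc.metadata.name}"
  mountPermissions: "0"
  onDelete: "archive"
reclaimPolicy: Delete
volumeBindingMode: Immediate
allowVolumeExpansion: true
mountOptions:
  - nfsvers=3
  - rsize=1048576
  - wsize=1048576
  - tcp
  - hard
  - nolock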

@MRColorR commented Aug 7, 2024

Hello, I'm having the same issue, but my subDir already contains hyphens:

  • event from kubectl describe pv : Warning VolumeFailedDelete 3m44s (x13 over 21m) nfs.csi.k8s.io_kube-master.test.intranet_3e1b5747-bd74-4af1-a568-e28fc997d6e3 rpc error: code = Internal desc = archive subdirectory(/tmp/pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc/coder-coder-mrengo-test-sidecar-home-pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc, /tmp/pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc/archived-coder-coder-mrengo-test-sidecar-home-pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc) failed with rename /tmp/pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc/coder-coder-mrengo-test-sidecar-home-pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc /tmp/pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc/archived-coder-coder-mrengo-test-sidecar-home-pvc-3df3470d-60ba-4f12-9517-bfcf25884ddc: no such file or directory

  • my storage class is the following:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-csi
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: nfs.csi.k8s.io
parameters:
  server: kube-nfs
  share: /shared/pv_nfs-csi-k8s-io
  # csi.storage.k8s.io/provisioner-secret is only needed for providing mountOptions in DeleteVolume
  # csi.storage.k8s.io/provisioner-secret-name: "mount-options"
  # csi.storage.k8s.io/provisioner-secret-namespace: "default"
  subDir: ${pvc.metadata.namespace}-${pvc.metadata.name}-${pv.metadata.name}
  onDelete: archive
reclaimPolicy: Delete
volumeBindingMode: Immediate
mountOptions:
  - nfsvers=4.1

  • I'm using csi-driver-nfs v4.8.0, installed using Helm with the following parameters: helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs --namespace kube-system --set externalSnapshotter.enabled=true --set controller.runOnControlPlane=true

Update: commenting out the subDir part did not change the behaviour, though.

@andyzhangx (Member)

@MRColorR this PR should have fixed the issue: #729, are you able to try this image: gcr.io/k8s-staging-sig-storage/nfsplugin:canary ?

@MRColorR commented Aug 8, 2024

@MRColorR this PR should have fixed the issue: #729, are you able to try this image: gcr.io/k8s-staging-sig-storage/nfsplugin:canary ?

Yes, I could try it. I've used Helm so far, but for testing purposes I could edit the image used by the pod with kubectl edit. Or maybe you have a better approach I could follow?

@andyzhangx (Member)

@MRColorR just kubectl edit deployment -n kube-system csi-nfs-controller and then replace the image with gcr.io/k8s-staging-sig-storage/nfsplugin:canary
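
A non-interactive equivalent is sketched below, assuming the relevant container in the csi-nfs-controller deployment is named nfs (the same container name used in the kubectl logs command later in this thread):

# Sketch: point the nfs container at the canary image in one command
kubectl -n kube-system set image deployment/csi-nfs-controller \
  nfs=gcr.io/k8s-staging-sig-storage/nfsplugin:canary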

@MRColorR commented Aug 8, 2024

OK, these are the results of my test:

  • I've edited the image inside the deployment, and the controller came up and was ready.
  • Then I deleted my current StorageClass and re-applied it with the onDelete: value set to archive instead of retain.
  • Then I deployed a pod with a PVC, and its PV was correctly provisioned as expected.
  • Unfortunately, upon deletion I got the error again.
    These are the logs from the controller pod:
    I0808 08:17:46.924884 1 controller.go:873] "Started provisioner controller" component="nfs.csi.k8s.io_kube-master.eustema.intranet_692bda9b-bb13-47a4-bfb8-3b09e19e3d0c"
    I0808 08:23:19.328970 1 event.go:389] "Event occurred" object="coder/coder-mrengo-tes-archive-home" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="Provisioning" message="External provisioner is provisioning volume for claim \"coder/coder-mrengo-tes-archive-home\""
    I0808 08:23:19.543596 1 controller.go:955] successfully created PV pvc-cf22bc4f-0c82-4a35-b590-57393d466465 for PVC coder-mrengo-tes-archive-home and csi volume name kube-nfs#shared/pv_nfs-csi-k8s-io#coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465#pvc-cf22bc4f-0c82-4a35-b590-57393d466465#archive
    I0808 08:23:19.550242 1 event.go:389] "Event occurred" object="coder/coder-mrengo-tes-archive-home" fieldPath="" kind="PersistentVolumeClaim" apiVersion="v1" type="Normal" reason="ProvisioningSucceeded" message="Successfully provisioned volume pvc-cf22bc4f-0c82-4a35-b590-57393d466465"
    I0808 08:27:07.291142 1 controller.go:1312] volume pvc-cf22bc4f-0c82-4a35-b590-57393d466465 does not need any deletion secrets
    I0808 08:27:09.448240 1 controller.go:1312] volume pvc-cf22bc4f-0c82-4a35-b590-57393d466465 does not need any deletion secrets
    E0808 08:27:11.602889 1 controller.go:1558] "Volume deletion failed" err="rpc error: code = Internal desc = archive subdirectory(/tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465, /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465) failed with rename /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465 /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465: no such file or directory" PV="pvc-cf22bc4f-0c82-4a35-b590-57393d466465"
    I0808 08:27:11.603067 1 controller.go:1007] "Retrying syncing volume" key="pvc-cf22bc4f-0c82-4a35-b590-57393d466465" failures=0
    E0808 08:27:11.603199 1 controller.go:1025] error syncing volume "pvc-cf22bc4f-0c82-4a35-b590-57393d466465": rpc error: code = Internal desc = archive subdirectory(/tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465, /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465) failed with rename /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465 /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465: no such file or directory
    I0808 08:27:11.603286 1 event.go:389] "Event occurred" object="pvc-cf22bc4f-0c82-4a35-b590-57393d466465" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="rpc error: code = Internal desc = archive subdirectory(/tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465, /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465) failed with rename /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465 /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465: no such file or directory"
    I0808 08:27:12.603745 1 controller.go:1312] volume pvc-cf22bc4f-0c82-4a35-b590-57393d466465 does not need any deletion secrets
    E0808 08:27:14.698056 1 controller.go:1558] "Volume deletion failed" err="rpc error: code = Internal desc = archive subdirectory(/tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465, /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465) failed with rename /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465 /tmp/pvc-cf22bc4f-0c82-4a35-b590-57393d466465/archived-coder-coder-mrengo-tes-archive-home-pvc-cf22bc4f-0c82-4a35-b590-57393d466465: no such file or directory" PV="pvc-cf22bc4f-0c82-4a35-b590-57393d466465"
  • Also, the folder of the PV inside the NFS has been deleted/moved, as I cannot find it anymore, but the PV is still in the Terminating state: pvc-cf22bc4f-0c82-4a35-b590-57393d466465 5Gi RWO Delete Terminating coder/coder-mrengo-tes-archive-home nfs-csi

@andyzhangx (Member)

@MRColorR can you provide the nfs container logs:

kubectl logs csi-nfs-controller-xxx -c nfs -n kube-system > csi-nfs-controller.log

@MRColorR commented Aug 8, 2024

@MRColorR can you provide the nfs container logs:

kubectl logs csi-nfs-controller-xxx -c nfs -n kube-system > csi-nfs-controller.log

Sorry, I've already reverted the changes because I can't keep the k8s cluster on hold any longer. Luckily, I had already included in the previous comment the part of the controller log that runs from startup, through the volume creation, to the errors during the archiving phase that follow the PVC decommission. I hope it will suffice.

@andyzhangx (Member)

@MRColorR this PR should fix the issue; it just waits for the rename operation to complete. Can you verify with the canary image again?

#735

@MRColorR

Hello, and sorry for the delay. I’ve been offline for a week. I’ve tested it, and now the behavior is as follows:

When I apply a manifest of a test PVC using the defined StorageClass with onDelete: archive, the PV for the PVC is correctly created inside the defined folder on my NFS server.

Then, I delete the PVC, and the PV enters the terminating phase but never completes it. The folder of the PV is deleted from the NFS, but no “archived-” folder appears. Notice that the PV just hangs in the terminating state until I manually remove the finalizer with an edit (but the relative folder inside the NFS has been automatically removed).
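
For reference, a sketch of the manual finalizer removal mentioned above, with <pv-name> as a placeholder for the stuck PV; this bypasses the driver's cleanup and should only be used as a last resort once the backing data has been dealt with:

kubectl patch pv <pv-name> --type=merge -p '{"metadata":{"finalizers":null}}'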

The following is the dump of the logs csi-nfs-controller.log: https://pastebin.com/iyUPiQJq

@andyzhangx (Member)

@MRColorR can you set imagePullPolicy: Always for the nfs container inside the csi-nfs-controller pod? You are not using the fixed canary image; the issue was fixed last week.

imagePullPolicy: IfNotPresent
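
For clarity, a minimal sketch of the relevant container fragment of the csi-nfs-controller deployment after that change (the container name nfs is taken from the kubectl logs command earlier in this thread):

      containers:
        - name: nfs
          image: gcr.io/k8s-staging-sig-storage/nfsplugin:canary
          imagePullPolicy: Always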

@MRColorR commented Aug 19, 2024

Yes, you’re right. Sorry, I didn’t think about the fact that the image tag for the canary image is always the same. I’ve edited the pull policy and checked in the pod description that the new canary image has been pulled.

I've tested again, but the behavior seems almost the same. From the log's timestamps it seems that it archives the PV folder correctly (it reaches and prints in the logs line 268,

klog.V(2).Infof("archived subdirectory %s --> %s", internalVolumePath, archivedInternalVolumePath)

), then unmounts it, but then it retries the whole operation again and again, this time failing to archive because the PV folder is no longer there. And since it reaches

klog.V(2).Infof("archiving subdirectory %s --> %s", internalVolumePath, archivedInternalVolumePath)
if err = os.RemoveAll(archivedInternalVolumePath); err != nil {

again on every attempt, maybe this is where the first successfully created archived folder gets deleted, because it is treated as a stale archive by the runs after the first one.

Info: I’ve updated the Pastebin in the previous comment with the new logs.

Edit: If I check the folder in my NFS where the PV folders are stored, I cannot see either the original PV folder or the archived folder. Currently it behaves like a delete policy but hangs in the terminating state.
Edit: Or maybe the error is in a function that returns or exits with a status code that is interpreted as an error by Kubernetes, which then retries the operation.
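
To make the idempotency idea discussed here concrete, below is a rough sketch in Go. It is not the driver's actual code; it only illustrates the behavior hinted at above and by the later fix: if the source subdirectory is already gone but the archived copy exists, treat the archive step as done instead of removing the archive and retrying the rename.

package archive

import (
	"fmt"
	"os"
)

// archiveSubdirectory is an illustrative, idempotent archive step (not the
// csi-driver-nfs implementation). It renames the volume subdirectory to its
// archived path, but tolerates a retry that runs after a previous attempt
// already succeeded.
func archiveSubdirectory(internalVolumePath, archivedInternalVolumePath string) error {
	if _, err := os.Stat(internalVolumePath); os.IsNotExist(err) {
		if _, err := os.Stat(archivedInternalVolumePath); err == nil {
			// The source is gone and the archive exists: a previous attempt
			// already did the work, so report success instead of failing.
			return nil
		}
		return fmt.Errorf("subdirectory %s is missing and no archive was found", internalVolumePath)
	}
	// Source still present: perform the rename-based archive.
	return os.Rename(internalVolumePath, archivedInternalVolumePath)
}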

@titou10titou10 commented Aug 21, 2024

Same problem here with v4.8.0. Rolling back to v4.7.0 makes it work again.

@andyzhangx (Member)

(quoting @MRColorR's test report above)

@MRColorR thanks for the test. Could you upgrade csi-provisioner to v5.0.2 and try again? I suspect it's related to this bug: kubernetes-csi/external-provisioner#1235

@andyzhangx (Member)

@MRColorR never mind, I have made a fix to disable removing the archived volume path, since csi-provisioner v5.0.2 does not fix the issue of the volume being deleted twice. Please try the canary image again: gcr.io/k8s-staging-sig-storage/nfsplugin:canary

@MRColorR

@MRColorR nvm, I have made a fix to disable removing archived volume path since csi-provisioner v5.0.2 does not fix the volume deletion twice issue, pls try canary image again: gcr.io/k8s-staging-sig-storage/nfsplugin:canary

OK, I'll try it as soon as I can, thank you. I'm quite overloaded these days, but I hope I'll have some spare time this weekend.

@titou10titou10 commented Aug 29, 2024

IMHO this issue should have been reopened... as it makes the driver unusable in real life.
I have the same issue, and "gcr.io/k8s-staging-sig-storage/nfsplugin:canary" + "csi-provisioner:v5.0.2" seems to solve the "problem"... at least in my case.

@andyzhangx (Member)

Please try the newly released version: https://github.com/kubernetes-csi/csi-driver-nfs/releases/tag/v4.9.0, it should have the fix, thanks.
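
For a Helm-based install like the one described earlier in this thread, a sketch of the upgrade (assuming the chart version matches the v4.9.0 driver release):

helm repo update
helm upgrade --install csi-driver-nfs csi-driver-nfs/csi-driver-nfs \
  --namespace kube-system --version 4.9.0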

@MRColorR commented Sep 6, 2024

pls try with new released version: https://github.com/kubernetes-csi/csi-driver-nfs/releases/tag/v4.9.0, it should have the fix, thx

I've just tested release v4.9.0 through Helm and it works. The issue regarding onDelete: archive seems resolved.
Thank you for your work, @andyzhangx.
