failed to delete pv #192

Closed
Madhu-1 opened this issue Feb 12, 2019 · 9 comments
Comments

@Madhu-1
Collaborator

Madhu-1 commented Feb 12, 2019

Describe the bug
Failed to delete the PV when the PVC was deleted.

Kubernetes and Ceph CSI Versions
Kubernetes: 1.13
Ceph CSI: 1.0.0

To Reproduce
Create 100 PVCs and then delete all 100 PVCs.

Expected behavior
All PVs should be deleted when their PVCs are deleted.

[🎩︎]mrajanna@localhost ceph $]kube get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                      STORAGECLASS   REASON   AGE
pvc-6247d340-2e96-11e9-af82-525400b639ff   5Gi        RWX            Delete           Released   default/csi-cephfs-pvc93   csi-cephfs              32m
pvc-6276de8e-2e96-11e9-af82-525400b639ff   5Gi        RWX            Delete           Released   default/csi-cephfs-pvc94   csi-cephfs              32m
pvc-62a2be3a-2e96-11e9-af82-525400b639ff   5Gi        RWX            Delete           Released   default/csi-cephfs-pvc95   csi-cephfs              32m
pvc-62cce258-2e96-11e9-af82-525400b639ff   5Gi        RWX            Delete           Released   default/csi-cephfs-pvc96   csi-cephfs              32m
[🎩︎]mrajanna@localhost ceph $]kube get pvc
No resources found.
[🎩︎]mrajanna@localhost ceph $]
[🎩︎]mrajanna@localhost ceph $]kube describe pv/pvc-6247d340-2e96-11e9-af82-525400b639ff
Name:            pvc-6247d340-2e96-11e9-af82-525400b639ff
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: csi-cephfsplugin
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    csi-cephfs
Status:          Released
Claim:           default/csi-cephfs-pvc93
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        5Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            csi-cephfsplugin
    VolumeHandle:      csi-cephfs-pvc-6247d340-2e96-11e9-af82-525400b639ff
    ReadOnly:          false
    VolumeAttributes:      monitors=rook-ceph-mon-a.rook-ceph.svc.cluster.local:6789
                           pool=myfs-data0
                           provisionVolume=true
                           storage.kubernetes.io/csiProvisionerIdentity=1549952614808-8081-
Events:
  Type     Reason              Age                From                                                                                  Message
  ----     ------              ----               ----                                                                                  -------
  Warning  VolumeFailedDelete  28m                csi-cephfsplugin_csi-cephfsplugin-provisioner-0_b709bc9c-2e8f-11e9-a968-0a580ae94212  rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  VolumeFailedDelete  12m (x6 over 27m)  csi-cephfsplugin_csi-cephfsplugin-provisioner-0_b709bc9c-2e8f-11e9-a968-0a580ae94212  rpc error: code = Internal desc = configmaps "csi-cephfs-pvc-6247d340-2e96-11e9-af82-525400b639ff" not found

[🎩︎]mrajanna@localhost ceph $]kube get  cm
No resources found.
@rootfs
Member

rootfs commented Feb 12, 2019

As discussed with @ShyamsundarR, the current per-PV configmap has a scaling issue. We aim to shift to a per-Ceph-cluster configmap and embed the configmap name in the PV name, so the controller server can look up the configmap and get the Ceph config from it.
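For illustration only, a minimal sketch of what such a lookup could look like, assuming a recent client-go with a context-aware Get, a hypothetical ceph-cluster-&lt;id&gt; configmap name, and a cluster ID embedded as the first dash-separated field of the volume handle; none of these names come from the driver itself:

```go
package cephconfig

import (
	"context"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// getCephConfig is a hypothetical helper illustrating the proposed scheme:
// derive a per-cluster configmap name from the volume handle and read the
// Ceph connection details from that single shared configmap, instead of
// keeping one configmap per PV.
func getCephConfig(ctx context.Context, client kubernetes.Interface, namespace, volumeHandle string) (map[string]string, error) {
	// Assumption: the cluster ID is the first dash-separated field of the
	// volume handle, e.g. "cluster1-csi-cephfs-pvc-6247d340-...".
	clusterID := strings.SplitN(volumeHandle, "-", 2)[0]

	// One configmap per Ceph cluster; its name is derived, not stored per PV.
	cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, "ceph-cluster-"+clusterID, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	return cm.Data, nil
}
```

The point of the design is that deleting a PV would then only require the shared per-cluster configmap to exist, rather than a per-PV configmap that may already have been cleaned up.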

@gman0
Contributor

gman0 commented Feb 12, 2019

This might be related to kubernetes-csi/external-provisioner#131: what might be happening is that while one DeleteVolume RPC (A) is still in progress, the 10s timeout is hit and another DeleteVolume RPC (B) is issued. Meanwhile, (A) finishes successfully and deletes the configmap. From (B)'s perspective, it fails to find the CM and returns an error - and that's what the external-provisioner ultimately sees. This is just a guess, but nevertheless a plausible one.
@Madhu-1, could you please check whether the actual volume directories on the Ceph cluster are deleted? Output from ceph auth ls should suffice (once the volume is deleted, the Ceph user is deleted as well - no volume = no user).

@Madhu-1
Collaborator Author

Madhu-1 commented Feb 12, 2019

@gman0 I destroyed the setup, but I will try once again. From the logs, it looks like the issue is with the configmap (I need to investigate in which case I am hitting this issue, why I am getting the "configmap not found" error, and when the configmap got deleted).

@Madhu-1
Collaborator Author

Madhu-1 commented Feb 12, 2019

@gman0 what should we do if the configmap is not found? Currently we are returning an internal server error.

@gman0
Contributor

gman0 commented Feb 12, 2019

@Madhu-1 That's one of the harder problems - there is no way to distinguish between "volume already deleted" and "cannot delete a volume because of missing metadata". See the issue I've linked above.

@dllozhkin

Hello,
I have the same problem; I can't delete the PV.

Events:
  Type     Reason              Age                  From                                                                            Message
  ----     ------              ----                 ----                                                                            -------
  Warning  VolumeFailedDelete  7m10s (x7 over 38m)  csi-rbdplugin_csi-rbdplugin-provisioner-0_a8c39bb5-2dfb-11e9-a5c5-02fa171bccad  rpc error: code = Unknown desc = configmaps "csi-rbd-vol-77234b68-3037-11e9-8f75-0050562c0165" not found

When will this be fixed?

@gman0
Contributor

gman0 commented Feb 14, 2019

Got hit by this in rbdplugin as well.

Some relevant logs:

I0214 20:41:49.129431       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0214 20:41:49.129783       1 utils.go:98] GRPC request: {"secrets":"***stripped***","volume_id":"csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"}
I0214 20:41:49.133904       1 hashed.go:47] hashedKeyMutex.LockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:41:49.133970       1 hashed.go:49] hashedKeyMutex.LockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
I0214 20:41:49.160288       1 controllerserver.go:266] deleting volume pvc-bda0240b-3098-11e9-9a4d-52540013b951
I0214 20:41:49.160557       1 rbd_util.go:166] rbd: status pvc-bda0240b-3098-11e9-9a4d-52540013b951 using mon 192.168.122.15:6789, pool rbd
W0214 20:41:49.279994       1 rbd_util.go:188] rbd: no watchers on pvc-bda0240b-3098-11e9-9a4d-52540013b951
I0214 20:41:49.280017       1 rbd_util.go:213] rbd: rm pvc-bda0240b-3098-11e9-9a4d-52540013b951 using mon 192.168.122.15:6789, pool rbd
I0214 20:42:14.415483       1 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0214 20:42:14.420742       1 utils.go:98] GRPC request: {}
I0214 20:42:14.426830       1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0214 20:42:14.438932       1 utils.go:97] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0214 20:42:14.439447       1 utils.go:98] GRPC request: {}
I0214 20:42:14.443723       1 controllerserver-default.go:62] Using default ControllerGetCapabilities
I0214 20:42:14.443735       1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":6}}},{"Type":{"Rpc":{"type":7}}}]}
I0214 20:42:14.463962       1 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginInfo
I0214 20:42:14.464232       1 utils.go:98] GRPC request: {}
I0214 20:42:14.468072       1 identityserver-default.go:32] Using default GetPluginInfo
I0214 20:42:14.468082       1 utils.go:103] GRPC response: {"name":"csi-rbdplugin","vendor_version":"1.0.0"}
I0214 20:42:15.343733       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0214 20:42:15.344088       1 utils.go:98] GRPC request: {"secrets":"***stripped***","volume_id":"csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"}
I0214 20:42:15.347633       1 hashed.go:47] hashedKeyMutex.LockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:42:35.647674       1 k8scmcache.go:177] k8s-cm-cache: successfully deleted metadata configmap csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086
I0214 20:42:35.653706       1 hashed.go:54] hashedKeyMutex.UnlockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:42:35.653738       1 hashed.go:56] hashedKeyMutex.UnlockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
I0214 20:42:35.653759       1 utils.go:103] GRPC response: {}
I0214 20:42:35.654202       1 hashed.go:49] hashedKeyMutex.LockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
I0214 20:42:35.662184       1 hashed.go:54] hashedKeyMutex.UnlockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:42:35.662446       1 hashed.go:56] hashedKeyMutex.UnlockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
E0214 20:42:35.667241       1 utils.go:101] GRPC error: configmaps "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" not found
I0214 20:42:56.323471       1 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginCapabilities
I0214 20:42:56.323740       1 utils.go:98] GRPC request: {}
I0214 20:42:56.331733       1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Service":{"type":1}}}]}
I0214 20:42:56.342643       1 utils.go:97] GRPC call: /csi.v1.Controller/ControllerGetCapabilities
I0214 20:42:56.342822       1 utils.go:98] GRPC request: {}
I0214 20:42:56.345258       1 controllerserver-default.go:62] Using default ControllerGetCapabilities
I0214 20:42:56.345269       1 utils.go:103] GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":5}}},{"Type":{"Rpc":{"type":6}}},{"Type":{"Rpc":{"type":7}}}]}
I0214 20:42:56.366441       1 utils.go:97] GRPC call: /csi.v1.Identity/GetPluginInfo
I0214 20:42:56.370258       1 utils.go:98] GRPC request: {}
I0214 20:42:56.370936       1 identityserver-default.go:32] Using default GetPluginInfo
I0214 20:42:56.370945       1 utils.go:103] GRPC response: {"name":"csi-rbdplugin","vendor_version":"1.0.0"}
I0214 20:42:57.348685       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0214 20:42:57.354895       1 utils.go:98] GRPC request: {"secrets":"***stripped***","volume_id":"csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"}
I0214 20:42:57.355676       1 hashed.go:47] hashedKeyMutex.LockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:42:57.358931       1 hashed.go:49] hashedKeyMutex.LockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
I0214 20:42:57.898229       1 hashed.go:54] hashedKeyMutex.UnlockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:42:57.898596       1 hashed.go:56] hashedKeyMutex.UnlockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
E0214 20:42:57.902772       1 utils.go:101] GRPC error: configmaps "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" not found

But on the Ceph cluster:

$ rbd ls

So the image is actually deleted, but the external-provisioner ignores the success (kubernetes-csi/external-provisioner#131).

Looking at the logs, the first delete attempt (A) takes too long to finish, so the external-provisioner issues a new DeleteVolume RPC (B):

I0214 20:42:15.343733       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume

(A) holds the lock, though, so even though (B) is the "active" RPC (which will have the definitive answer on the result of the operation), it has to wait until (A) finishes:

I0214 20:42:15.343733       1 utils.go:97] GRPC call: /csi.v1.Controller/DeleteVolume
I0214 20:42:15.344088       1 utils.go:98] GRPC request: {"secrets":"***stripped***","volume_id":"csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"}
I0214 20:42:15.347633       1 hashed.go:47] hashedKeyMutex.LockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"

Once (A) is successfully done (along with the deletion of the image AND the metadata stored in the configmap), it releases the lock, so that (B) may continue on its merry way:

I0214 20:42:35.647674       1 k8scmcache.go:177] k8s-cm-cache: successfully deleted metadata configmap csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086
I0214 20:42:35.653706       1 hashed.go:54] hashedKeyMutex.UnlockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"
I0214 20:42:35.653738       1 hashed.go:56] hashedKeyMutex.UnlockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
I0214 20:42:35.653759       1 utils.go:103] GRPC response: {}
I0214 20:42:35.654202       1 hashed.go:49] hashedKeyMutex.LockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
I0214 20:42:35.662184       1 hashed.go:54] hashedKeyMutex.UnlockKey(...) called for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086"

Since the metadata is now gone, (B) immediately exits because it cannot continue without it:

I0214 20:42:35.662446       1 hashed.go:56] hashedKeyMutex.UnlockKey(...) for id "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" completed.
E0214 20:42:35.667241       1 utils.go:101] GRPC error: configmaps "csi-rbd-vol-beb98884-3098-11e9-8e07-0a580af40086" not found

And that's what the external-provisioner sees: the most recent "active" DeleteVolume RPC returned an error, and that is the source of truth here, regardless of the fact that (A) was successful in the end.

I'll create a new issue tomorrow for possible solutions to this situation.
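To make the race easier to follow, here is a self-contained sketch of the behaviour described above, using an in-memory map as a stand-in for the per-volume configmap metadata and a single mutex playing the role of the per-volume-ID keyed mutex (hashedKeyMutex in the logs); all names are illustrative, not the driver's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// volumeStore is an in-memory stand-in for the per-PV configmap metadata.
type volumeStore struct {
	mu   sync.Mutex      // plays the role of the per-volume-ID keyed mutex
	meta map[string]bool // volume ID -> metadata exists
}

// deleteVolume mimics the DeleteVolume handler: take the lock, look up the
// metadata, delete the backing image (simulated by a sleep), then delete
// the metadata.
func (s *volumeStore) deleteVolume(id string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if !s.meta[id] {
		// This is what RPC (B) hits: (A) already removed the metadata.
		return fmt.Errorf("configmaps %q not found", id)
	}
	time.Sleep(50 * time.Millisecond) // simulate the slow rbd rm / ceph call
	delete(s.meta, id)                // metadata is gone once (A) succeeds
	return nil
}

func main() {
	s := &volumeStore{meta: map[string]bool{"csi-rbd-vol-xyz": true}}

	var wg sync.WaitGroup
	for _, rpc := range []string{"A", "B"} {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			err := s.deleteVolume("csi-rbd-vol-xyz")
			fmt.Printf("RPC (%s): err=%v\n", name, err)
		}(rpc)
	}
	wg.Wait()
	// One RPC returns nil; the one that acquires the lock second fails with
	// "not found" - the error the external-provisioner ends up seeing.
}
```

Run as-is, one goroutine succeeds and the other returns the "not found" error, mirroring how (B) reports failure even though (A) already deleted the volume.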

@Madhu-1
Collaborator Author

Madhu-1 commented Feb 15, 2019

@gman0 @rootfs don't we need to continue with volume deletion if the configmap is not found?
The CSI spec says all operations should be idempotent, even if the resource doesn't exist anymore.
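For reference, a minimal sketch of what such idempotent handling could look like (assuming a recent client-go with a context-aware Get; the function name deleteVolumeMetadata and the use of the volume ID as the configmap name are illustrative, not the driver's actual code):

```go
package metadata

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteVolumeMetadata illustrates idempotent handling of missing metadata:
// if the per-volume configmap is already gone, assume the volume was deleted
// by an earlier call and report success instead of an internal error.
func deleteVolumeMetadata(ctx context.Context, client kubernetes.Interface, namespace, volID string) (alreadyDeleted bool, err error) {
	_, err = client.CoreV1().ConfigMaps(namespace).Get(ctx, volID, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Metadata is gone: treat DeleteVolume as already satisfied.
		return true, nil
	}
	if err != nil {
		return false, err
	}
	// Metadata exists: the caller proceeds with the actual image/volume
	// deletion and then removes the configmap.
	return false, nil
}
```

As @gman0 notes above, the caveat is that this cannot distinguish "volume already deleted" from "metadata lost for a volume that still exists".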

@Madhu-1
Collaborator Author

Madhu-1 commented Mar 5, 2019

Tested and not reproducible now; closing this one.

Madhu-1 closed this as completed Mar 5, 2019