PVC deletion results in leaked volume on the storage provider, if CreateVolume never completed in time #311

Closed
ShyamsundarR opened this issue Jul 4, 2019 · 4 comments · Fixed by #312

Comments

@ShyamsundarR
Contributor

I am facing an issue with the Kubernetes CSI provisioner sidecar that leads to a volume leak on Ceph. The steps to reproduce are as follows:

  1. Create a PVC (pointing to a CSI provisioner, in my case ceph-csi)
  2. Ensure the CreateVolume request made by the provisioner sidecar always times out
     • IOW, ensure it does not succeed until the test is complete (I did this with a simple sleep that was longer than the --timeout configured on the provisioner sidecar)
  3. After 1 or 2 retries by the provisioner sidecar to create the volume, delete the PVC

My expectation was that the provisioner would keep attempting creates until it got an error or a success, and then delete the created volume, since the PVC (and PV) are no longer around.

What happened after step 3 was that the provisioner stopped requesting the create. As the original calls to Ceph had succeeded, a volume was leaked on the Ceph side. (The provisioner did not call delete either, but that is expected, as it does not have the volume ID to call delete with.)

Provisioner sidecar versions attempted: v1.2.0 and v1.3.0

Logs from the test (provisioner logs and ceph-csi logs): https://paste.fedoraproject.org/paste/J1mddHUmrtMmEbmla8gF9Q

I opened this issue to understand whether this behavior is expected or whether there is a bug somewhere.

@ShyamsundarR
Contributor Author

@jsafrane marking this for your attention.

@ShyamsundarR
Contributor Author

It looks like the CSI external-provisioner [1] has not implemented the ProvisionerExt interface [2], which is meant to track the state of provisioning and take appropriate action (retry until error or success, and delete the volume if the PVC is deleted). Hence, the result from CSI on timeout failures is always ProvisioningFinished from this piece of code.

If my understanding and code reading are right, fixing this requires that the CSI external-provisioner implement the ProvisionerExt interface and return appropriate results to denote timeout errors.

Thanks to @jsafrane for pointing to this PR to understand the past context.

[1] csi external-provisioner provisioner registration (missing ProvisionerExt):

var _ controller.Provisioner = &csiProvisioner{}
var _ controller.BlockProvisioner = &csiProvisioner{}

[2] ProvisionerExt definition to ensure reported issue does not occur: https://github.com/kubernetes-sigs/sig-storage-lib-external-provisioner/blob/d22b74e900af4bf90174d259c3e52c3680b41ab4/controller/volume.go#L66-L78

@ShyamsundarR
Contributor Author

I tested with an implementation of the ProvisionerExt interface in the external-provisioner code, and the required DeleteVolume requests are now made on PVC deletion. The changes are present here (ShyamsundarR@6a23bad). (IOW, the provisioner keeps attempting creates until it gets an error or a success, and then deletes the created volume, as the PVC (and PV) are no longer around.)

I still need to test how DeleteVolume retries behave when the request times out.

Could someone confirm that the approach above is the right way forward to resolve this issue? Thanks!

@jsafrane
Contributor

jsafrane commented Jul 8, 2019

This is embarrassing, I forgot to push the last patch in my provisioner refactoring sequence :-).

See #312
