This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Stray Clone Left in CephFS Filesystem When Deleting PVC During Cloning Process #4043

Closed
pehlert opened this issue Aug 8, 2023 · 3 comments
Labels
component/cephfs Issues related to CephFS

Comments

@pehlert

pehlert commented Aug 8, 2023

Background

We are still struggling with Kasten.io exports and the fact that they use the old RW snapshot clone method, which on volumes of any significant size results in a timeout.

The trouble is not just that the backups aren't working, forcing us to put a workaround in place (which is something for Kasten to solve). Whenever the issue occurs, we also get a stray folder on CephFS that eventually eats up our disk space and is difficult, and a bit scary, to clean up when there is no clear reference to where it came from.

We blame it on the issue below, and I'd love to hear if anyone else has experienced this. If there is interest in working on this and you need me to reproduce it manually, I'm happy to share details.

Issue

When cloning a CephFS snapshot, if the PVC is deleted while the cloning process is still ongoing, a stray clone remains in the CephFS filesystem.

Affected Versions

Tested on ceph-csi 3.8

Steps to Reproduce

1. Initiate a clone from a snapshot of significant size in traditional RW mode.
2. Before the cloning process completes and the volume becomes available, delete the PVC.
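The two steps above can be sketched with kubectl; the PVC, snapshot, and storage-class names below are hypothetical placeholders, not taken from the original report:

```shell
# Hypothetical reproduction sketch; all names are placeholders.
# Step 1: create a PVC that clones from an existing CephFS VolumeSnapshot (RW mode).
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-pvc          # placeholder name
spec:
  storageClassName: cephfs-sc # placeholder CephFS storage class
  dataSource:
    name: my-snapshot         # placeholder VolumeSnapshot of significant size
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 500Gi
EOF

# Step 2: delete the PVC while the clone is still in progress
# (the PVC is still Pending at this point).
kubectl delete pvc restored-pvc --wait=false
```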

Expected Behavior

The cloning process should either be interrupted and the clone should be removed from CephFS, or there should be a reference retained in Kubernetes.

Actual Behavior

The cloning process continues uninterrupted, resulting in a new folder appearing on CephFS. However, there is no reference to this folder in Kubernetes, neither as a PV nor a PVC.
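For anyone needing to clean up, the stray clone can usually be inspected and removed with the ceph CLI. A sketch, assuming the default `csi` subvolume group used by ceph-csi; the filesystem and clone names are placeholders:

```shell
# Placeholder names: <fs-name> is the CephFS filesystem, <csi-vol-xxxx> the stray clone.
# List subvolumes in the group ceph-csi uses by default; stray clones show up
# here even though no PV or PVC references them.
ceph fs subvolume ls <fs-name> --group_name csi

# Check whether the clone is still in progress, complete, or failed.
ceph fs clone status <fs-name> <csi-vol-xxxx> --group_name csi

# If it is still cloning, cancel it first, then remove the subvolume.
ceph fs clone cancel <fs-name> <csi-vol-xxxx> --group_name csi
ceph fs subvolume rm <fs-name> <csi-vol-xxxx> --group_name csi --force
```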

@Rakshith-R Rakshith-R added the component/cephfs Issues related to CephFS label Aug 8, 2023
@Rakshith-R
Copy link
Contributor

Rakshith-R commented Aug 8, 2023

Yes, this is a known issue, especially when backup software takes backups of CephFS PVCs too frequently.
By default, CephFS can only handle 4 simultaneous clone creation requests (the cloner thread limit); all others are queued.
refer: https://tracker.ceph.com/issues/46892
This limit is configurable, but is bounded by the number of cores the MDS has.
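For reference, the cloner thread limit mentioned above is exposed as the `max_concurrent_clones` option of the mgr volumes module; a sketch of inspecting and raising it (the value 8 is an arbitrary example):

```shell
# Show the current limit (defaults to 4 concurrent clones).
ceph config get mgr mgr/volumes/max_concurrent_clones

# Raise it; 8 is an arbitrary example value, bounded in practice by available cores.
ceph config set mgr mgr/volumes/max_concurrent_clones 8
```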

There is a tracker in external provisioner for PVC being deleted when provisioner pod restarts kubernetes-csi/external-provisioner#486 .

We've also requested Ceph to reject CephFS clone creation requests preemptively to avoid this scenario.
refer: https://tracker.ceph.com/issues/59714

@pehlert
Author

pehlert commented Aug 8, 2023

Thank you very much for summarizing this here. It definitely helps us understand that it's a known issue and get the full perspective on it.

We'll try to find a solution that uses the new RO snapshot clones to prevent this from happening in the first place.

@Rakshith-R
Contributor

There seems to be progress on the Ceph tracker: https://tracker.ceph.com/issues/59714.

Converting this issue into a discussion, since issue #3996 already exists to track and fix this.

@ceph ceph locked and limited conversation to collaborators Aug 9, 2023
@Rakshith-R Rakshith-R converted this issue into discussion #4045 Aug 9, 2023

