
Retention policy removes last valid snapshot, leaving no possibility of recovery #688

Open
mnacharov opened this issue Aug 30, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@mnacharov

Describe the bug
VolumeSnapshot has a .status.readyToUse flag that indicates whether a snapshot is ready to be used to restore a volume.
snapscheduler does not take this flag into account when deciding whether the maxCount retention limit has been reached.
As a result, it can delete the last usable snapshot, leaving no possibility of recovery.
Steps to reproduce
In GKE (v1.28.11 in my case) with snapscheduler (v3.4.0) installed:

  1. create PVC:
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: snapscheduler-test
      namespace: default
      labels:
        snapscheduler-test: "true"
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      storageClassName: standard-rwo
    
  2. run a pod with the new PVC so that the volume is actually provisioned:
    $ kubectl -n default run -it --rm snapscheduler-test --image=gcr.io/distroless/static-debian12 --overrides='{"spec": {"restartPolicy": "Never", "volumes": [{"name": "pvc", "persistentVolumeClaim":{"claimName": "snapscheduler-test"}}]}}' -- sh
  3. create SnapshotSchedule:
    apiVersion: snapscheduler.backube/v1
    kind: SnapshotSchedule
    metadata:
      name: snapscheduler-test
      namespace: default
    spec:
      claimSelector:
        matchLabels:
          snapscheduler-test: "true"
      retention:
        maxCount: 3
      schedule: "*/5 * * * *"
  4. wait 5-10 minutes and verify that VolumeSnapshots are being created successfully:
    $ kubectl -n default get volumesnapshot
    NAME                                                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   87s            2m6s
    
  5. delete the compute disk in GCP (via the web UI or a gcloud command) -- a human error has happened:
    $ pv=$(kubectl -n default get pvc snapscheduler-test -ojsonpath='{.spec.volumeName}')
    $ zone=$(gcloud --project=$GCP_PROJECT compute disks list --filter="name=($pv)"|grep pvc|awk '{print $2}')
    $ gcloud --project=$GCP_PROJECT compute disks delete $pv --zone $zone
    
  6. after 10 minutes there are two VolumeSnapshots with READYTOUSE=false:
    $ kubectl -n default get volumesnapshot
    NAME                                                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    snapscheduler-test-snapscheduler-test-202408301525   true         snapscheduler-test                           1Gi           p2p-csi         snapcontent-4f748e4d-80d8-4353-8819-a6efb2836821   10m            11m
    snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  6m38s
    snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  98s
    
  7. after 15 minutes no valid snapshot is left (maxCount: 3 retention policy):
    $ kubectl -n default get volumesnapshot
    NAME                                                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS   SNAPSHOTCONTENT                                    CREATIONTIME   AGE
    snapscheduler-test-snapscheduler-test-202408301530   false        snapscheduler-test                                         p2p-csi         snapcontent-cec59c70-c186-44fd-99f8-9226192d7a6a                  13m
    snapscheduler-test-snapscheduler-test-202408301535   false        snapscheduler-test                                         p2p-csi         snapcontent-d81644f4-eb28-4da9-94b5-d57f1972aeb3                  8m6s
    snapscheduler-test-snapscheduler-test-202408301540   false        snapscheduler-test                                         p2p-csi         snapcontent-b6113f79-3219-435d-8321-812ddc096154                  3m6s
    

Expected behavior
❗ the retention policy must not count VolumeSnapshots with .status.readyToUse==false.
❔ if possible, create a new snapshot only after the previous one has reached the ready state

Actual results
The retention policy removes the last valid snapshot, leaving no possibility of recovery.

@mnacharov mnacharov added the bug Something isn't working label Aug 30, 2024
@JohnStrunk
Member

I agree... that's not good. I'm happy to have thoughts/suggestions on a good fix.

A few ideas:

  1. Only count readyToUse snapshots when implementing the cleanup policy
    This runs the risk of creating an unbounded number of (unready) snapshots, potentially consuming all available space (or incurring excessive expense)
  2. Skip the next snapshot if the previous one is not ready
    This will cause problems in environments where snapshots take a long time to become ready (e.g., AWS), causing SnapScheduler to miss intervals
  3. If the policy determines that a snapshot should be deleted, delete unready snapshots (starting with the oldest) before ready ones
    This has the same problem as (2) in being unable to handle intervals that are shorter than the time it takes a snapshot to become ready
