[BUG] Longhorn Snapshots are not deleted after expired Backups (Velero) #5802

Closed
R-Studio opened this issue Apr 24, 2023 · 27 comments

Labels: area/snapshot, area/space-efficiency, area/upstream, area/volume-backup-restore, kind/improvement, priority/0, require/auto-e2e-test, require/backport, require/doc, wontfix
@R-Studio

R-Studio commented Apr 24, 2023

Describe the bug (🐛 if you encounter this issue)

We are using Velero to create backups of our Kubernetes manifests and persistent volumes (in this example we back up Harbor).
When a backup runs, Velero saves the K8s manifests to object storage (MinIO) and creates snapshot resources that trigger Longhorn backups via the velero-plugin-for-csi. Longhorn writes the backups to another MinIO bucket.
If we delete a Velero backup, or the backup expires, the Longhorn snapshots (snapshots.longhorn.io) are not deleted.

We are using Velero v1.9.4 with the EnableCSI feature and the following plugins:

  • velero/velero-plugin-for-csi:v0.4.0
  • velero/velero-plugin-for-aws:v1.6.0

We have the same issue with Velero v1.11.0 with the EnableCSI feature and the following plugins:

  • velero/velero-plugin-for-csi:v0.5.0
  • velero/velero-plugin-for-aws:v1.6.0

To Reproduce

Steps to reproduce the behavior:

  1. Install the newest versions of Velero and Rancher-Longhorn.
  2. In Longhorn, configure an S3 backup target (we are using MinIO for this).
  3. Enable CSI snapshot support for Longhorn.
  4. Create a backup (for example with the Schedule below): velero backup create --from-schedule harbor-daily-0200
  5. Delete the backup: velero backup delete <BACKUPNAME>
  6. The snapshot (snapshots.longhorn.io) is not deleted.
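
A quick way to confirm the leftovers after step 6 (a sketch, assuming kubectl access and the namespaces from the example above): the CSI-level objects are cleaned up, but the in-cluster Longhorn snapshot CRs remain.

kubectl get volumesnapshots -n harbor                 # cleaned up by Velero
kubectl get volumesnapshotcontents                    # cleaned up by Velero
kubectl get snapshots.longhorn.io -n longhorn-system  # leftover snapshot-* objects remain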

Expected behavior

The snapshot is deleted.

Environment

  • Longhorn version: 102.2.0+up1.4.1
  • Velero version:
  • Installation method (e.g. Rancher Catalog App/Helm/Kubectl): Rancher-Longhorn Helm Chart
  • Kubernetes distro (e.g. RKE/K3s/EKS/OpenShift) and version: RKE2, v1.25.7+rke2r1
    • Number of management node in the cluster: 1x
    • Number of worker node in the cluster: 3x
  • Node config
    • OS type and version: Ubuntu
  • Underlying Infrastructure (e.g. on AWS/GCE, EKS/GKE, VMWare/KVM, Baremetal): VMs on Proxmox
  • Number of Longhorn volumes in the cluster: 17

Additional context

Velero Backup Schedule for Harbor

---
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: harbor-daily-0200
  namespace: velero #Must be the namespace of the Velero server
spec:
  schedule: 0 0 * * *
  template:
    includedNamespaces:
    - 'harbor'
    includedResources:
    - '*'
    snapshotVolumes: true
    storageLocation: minio
    volumeSnapshotLocations:
      - longhorn
    ttl: 168h0m0s #7 Days retention
    defaultVolumesToRestic: false
    hooks:
      resources:
        - name: postgresql
          includedNamespaces:
          - 'harbor'
          includedResources:
          - pods
          excludedResources: []
          labelSelector:
            matchLabels:
              statefulset.kubernetes.io/pod-name: harbor-database-0
          pre:
            - exec:
                container: database
                command:
                  - /bin/bash
                  - -c
                  - "psql -U postgres -c \"CHECKPOINT\";"
                onError: Fail
                timeout: 30s

VolumeSnapshotClass

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: "true"
driver: driver.longhorn.io
deletionPolicy: Delete

VolumeSnapshotClass

In our second cluster, with Velero v1.11.0 installed, we created the following resource (but same issue here):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak

VolumeSnapshotLocation

apiVersion: velero.io/v1
kind: VolumeSnapshotLocation
metadata:
  name: longhorn
  namespace: velero
spec:
  provider: longhorn.io/longhorn
@R-Studio
Author

I have also opened an issue in Velero: vmware-tanzu/velero#6179

@weizhe0422
Contributor

Would you mind providing a support bundle that includes the logs of the CSI sidecars? It may contain clues about whether the CSI external-snapshotter was triggered when Velero deleted the backups.

@R-Studio
Author

R-Studio commented May 1, 2023

@weizhe0422 here is the support bundle. Thanks for any help.
Info

  • Start Backup: 2023-05-01 09:12:59 +0200 CEST
  • Complete Backup: 2023-05-01 09:14:29 +0200 CEST
  • Delete Backup: 2023-05-01 09:17:34 +0200 CEST

supportbundle_a3236774-99ca-4ab5-a2a5-74c925273bb4_2023-05-01T07-20-00Z.zip

@tcoupin

tcoupin commented May 6, 2023

Related to #5797?

@R-Studio
Author

R-Studio commented May 8, 2023

@tcoupin thanks, but this is not a solution for me: with --snapshot-volumes=false, Velero does not trigger a backup of the persistent volumes at all, so it only backs up the manifests/YAMLs.

@vineetsingh5

vineetsingh5 commented May 9, 2023

I am also facing the same issue with:
velero: v1.9.1
velero-plugin-for-csi: v0.4.1
longhorn version: 1.3.1

Deleting a Velero backup cleans up most of the related resources (backups.longhorn.io, volumesnapshotcontent, volumesnapshot), but the snapshot.longhorn.io objects are still present in the system.

Backups also started failing once the number of snapshot objects exceeded roughly 250.

Sharing both the Longhorn support bundle and the snapshot-controller logs:

longhorn-support-bundle_8afefff1-085e-4f4e-97ae-3f0a518555ab_2023-05-09T05-19-35Z.zip

snapshot-controller.log

@tcoupin

tcoupin commented May 9, 2023

I do not use --snapshot-volumes=false; instead I added a CronJob that deletes the snapshot.longhorn.io objects referenced by backups.longhorn.io (i.e., snapshots whose backup already exists in the backup target).

kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: longhorn-system
  name: snapshot-cleaner
rules:
- apiGroups:
  - longhorn.io
  resources:
  - backups
  verbs:
  - 'list'
- apiGroups:
  - longhorn.io
  resources:
  - snapshots
  verbs:
  - 'list'
  - 'delete'
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: snapshot-cleaner
  namespace: longhorn-system
subjects:
- kind: ServiceAccount
  name: sa-snapshot-cleaner
  namespace: longhorn-system
roleRef:
  kind: Role
  name: snapshot-cleaner
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sa-snapshot-cleaner
  namespace: longhorn-system
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: snapshot-cleaner
  namespace: longhorn-system
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      activeDeadlineSeconds: 30
      template:
        metadata:
          creationTimestamp: null
        spec:
          affinity: {}
          containers:
            - args:
                - '-c'
                - >-
                  cat <(kubectl get backups.longhorn.io -n longhorn-system -o
                  custom-columns=SNAPSHOT:.spec.snapshotName | grep
                  '^snapshot-'|sort|uniq) <(kubectl get snapshot.longhorn.io -n
                  longhorn-system -o custom-columns=SNAPSHOT:.metadata.name |
                  grep '^snapshot-'|sort|uniq)|sort|uniq -c|awk '$1==2 {print
                  $2}'|grep -v '^\n$'|xargs kubectl delete snapshot.longhorn.io
                  -n longhorn-system --wait=false
              command:
                - /bin/bash
              image: bitnami/kubectl:latest
              imagePullPolicy: Always
              name: snapshot-cleaner
              resources: {}
              terminationMessagePath: /dev/termination-log
              terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: OnFailure
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: sa-snapshot-cleaner
          serviceAccountName: sa-snapshot-cleaner
          terminationGracePeriodSeconds: 30
  schedule: '*/5 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
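
For readability, the one-liner in the container args expands to roughly the following (a sketch of the same logic, using the same custom-columns output and name prefixes as the job above):

#!/bin/bash
# Snapshot names referenced by Longhorn backup objects:
kubectl get backups.longhorn.io -n longhorn-system \
  -o custom-columns=SNAPSHOT:.spec.snapshotName | grep '^snapshot-' | sort -u > /tmp/backed-up
# Longhorn snapshot objects currently present in the cluster:
kubectl get snapshot.longhorn.io -n longhorn-system \
  -o custom-columns=SNAPSHOT:.metadata.name | grep '^snapshot-' | sort -u > /tmp/existing
# Names appearing in both lists (count == 2) are snapshots whose backup
# already exists in the backup target; delete those snapshot CRs.
cat /tmp/backed-up /tmp/existing | sort | uniq -c | awk '$1==2 {print $2}' \
  | xargs -r kubectl delete snapshot.longhorn.io -n longhorn-system --wait=false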

@R-Studio
Author

R-Studio commented May 9, 2023

@tcoupin, thanks for your help, but this is just a workaround for me.

@R-Studio
Author

When I create a backup via the Longhorn GUI and then delete this backup, no snapshots remain (everything works as expected).

@R-Studio
Author

R-Studio commented May 15, 2023

I reproduced the issue by creating a VolumeSnapshot resource and then deleting it; the same behavior occurs.

VolumeSnapshotClass

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak

VolumeSnapshot

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice


Remove the VolumeSnapshot: kubectl delete volumesnapshot new-snapshot-test -n harbor


  • Why does Longhorn retain the snapshot?
  • Why does it work via the GUI but not otherwise?

@R-Studio
Author

R-Studio commented May 15, 2023

I also found something interesting in the snapshot-controller logs, but there is no error in them for deleting the snapshot.

@R-Studio
Author

R-Studio commented Jun 5, 2023

Today I upgraded Longhorn from v1.4.1 to v1.4.2 and the issue still occurs. 😔😔

@R-Studio
Author

R-Studio commented Jun 6, 2023

Today I noticed that I had deployed a snapshot controller as described in this documentation, even though the cluster already had an "rke2-snapshot-controller". I am not sure whether that came with a Rancher update or something else. Anyway, I removed my own snapshot-controller and tested the issue again.

VolumeSnapshotClass

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: longhorn
  namespace: longhorn-system
  labels:
    velero.io/csi-volumesnapshot-class: 'true'
driver: driver.longhorn.io
deletionPolicy: Delete
parameters:
  type: bak

VolumeSnapshot

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: new-snapshot-test
  namespace: harbor
spec:
  volumeSnapshotClassName: longhorn
  source:
    persistentVolumeClaimName: harbor-jobservice


Remove the VolumeSnapshot: kubectl delete volumesnapshot new-snapshot-test -n harbor

The same issue occurs, but there are some interesting logs written after I created the VolumeSnapshot.

@R-Studio
Author

R-Studio commented Jun 6, 2023

After I upgraded to the newest RKE2 Helm charts, the "finalizers" error logs mentioned above no longer appear, but the issue still occurs.

I upgraded the following Helm releases:

  • rke2-snapshot-controller: 1.7.201 -> 1.7.202
  • rke2-snapshot-controller-crd: 1.7.201 -> 1.7.202
  • rke2-snapshot-validation-webhook: 1.7.200 -> 1.7.201

Here are the log messages:

2023-06-06T14:36:39+02:00	I0606 12:36:39.815026       1 snapshot_controller_base.go:213] deletion of content "snapcontent-20d81f05-864b-489e-8875-3ea71832a743" was already processed
2023-06-06T14:36:38+02:00	E0606 12:36:38.814935       1 snapshot_controller_base.go:265] could not sync content "snapcontent-20d81f05-864b-489e-8875-3ea71832a743": snapshot controller failed to update snapcontent-20d81f05-864b-489e-8875-3ea71832a743 on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-20d81f05-864b-489e-8875-3ea71832a743": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshotcontents/snapcontent-20d81f05-864b-489e-8875-3ea71832a743, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: d376de84-4aa4-426e-8c97-f96df3073b73, UID in object meta: 
2023-06-06T14:36:38+02:00	time="2023-06-06T12:36:38Z" level=info msg="DeleteSnapshot: rsp: {}"
2023-06-06T14:36:38+02:00	time="2023-06-06T12:36:38Z" level=info msg="DeleteSnapshot: req: {\"snapshot_id\":\"bak://pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c/backup-3e65b0ef494940b8\"}"
2023-06-06T14:35:12+02:00	I0606 12:35:12.301474       1 snapshot_controller.go:998] checkandRemovePVCFinalizer[new-snapshot-test]: Remove Finalizer for PVC harbor-jobservice as it is not used by snapshots in creation
2023-06-06T14:35:12+02:00	I0606 12:35:12.296417       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshot", Namespace:"harbor", Name:"new-snapshot-test", UID:"20d81f05-864b-489e-8875-3ea71832a743", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"181825436", FieldPath:""}): type: 'Normal' reason: 'SnapshotReady' Snapshot harbor/new-snapshot-test is ready to use.
2023-06-06T14:35:12+02:00	I0606 12:35:12.296353       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshot", Namespace:"harbor", Name:"new-snapshot-test", UID:"20d81f05-864b-489e-8875-3ea71832a743", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"181825436", FieldPath:""}): type: 'Normal' reason: 'SnapshotCreated' Snapshot harbor/new-snapshot-test was successfully created by the CSI driver.
2023-06-06T14:35:12+02:00	time="2023-06-06T12:35:12Z" level=info msg="CreateSnapshot: rsp: {\"snapshot\":{\"creation_time\":{\"seconds\":1686054902},\"ready_to_use\":true,\"size_bytes\":1073741824,\"snapshot_id\":\"bak://pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c/backup-3e65b0ef494940b8\",\"source_volume_id\":\"pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c\"}}"
2023-06-06T14:35:12+02:00	time="2023-06-06T12:35:12Z" level=debug msg="ControllerServer CreateSnapshot rsp: snapshot:<size_bytes:1073741824 snapshot_id:\"bak://pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c/backup-3e65b0ef494940b8\" source_volume_id:\"pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c\" creation_time:<seconds:1686054902 > ready_to_use:true > "
2023-06-06T14:35:12+02:00	time="2023-06-06T12:35:12Z" level=info msg="createCSISnapshotTypeLonghornBackup: volume pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c backup backup-3e65b0ef494940b8 of snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743 in progress"
2023-06-06T14:35:12+02:00	time="2023-06-06T12:35:12Z" level=info msg="Backup backup-3e65b0ef494940b8 initiated for volume pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c for snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743"
2023-06-06T14:35:08+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:08Z" level=info msg="Done initiating backup creation, received backupID: backup-3e65b0ef494940b8"
2023-06-06T14:35:06+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:06Z" level=info msg="Loaded driver for s3://t1-longhorn-snapshots@minio/" pkg=s3
2023-06-06T14:35:06+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:06Z" level=info msg="Start creating backup backup-3e65b0ef494940b8" pkg=backup
2023-06-06T14:35:06+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:06Z" level=info msg="Initializing backup backup-3e65b0ef494940b8 for volume pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743" pkg=backup
2023-06-06T14:35:06+02:00	[longhorn-instance-manager] time="2023-06-06T12:35:06Z" level=info msg="Backing up snapshot-20d81f05-864b-489e-8875-3ea71832a743 on tcp://10.42.1.154:10060, to s3://t1-longhorn-snapshots@minio/"
2023-06-06T14:35:06+02:00	[longhorn-instance-manager] time="2023-06-06T12:35:06Z" level=info msg="Backing up snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743 to backup backup-3e65b0ef494940b8" serviceURL="10.42.2.8:10009"
2023-06-06T14:35:04+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:04Z" level=info msg="Done initiating backup creation, received backupID: backup-3e65b0ef494940b8"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:02Z" level=info msg="Loaded driver for s3://t1-longhorn-snapshots@minio/" pkg=s3
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:02Z" level=info msg="Start creating backup backup-3e65b0ef494940b8" pkg=backup
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:02Z" level=info msg="Initializing backup backup-3e65b0ef494940b8 for volume pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743" pkg=backup
2023-06-06T14:35:02+02:00	[longhorn-instance-manager] time="2023-06-06T12:35:02Z" level=info msg="Backing up snapshot-20d81f05-864b-489e-8875-3ea71832a743 on tcp://10.42.1.154:10060, to s3://t1-longhorn-snapshots@minio/"
2023-06-06T14:35:02+02:00	[longhorn-instance-manager] time="2023-06-06T12:35:02Z" level=info msg="Backing up snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743 to backup backup-3e65b0ef494940b8" serviceURL="10.42.2.8:10009"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="createCSISnapshotTypeLonghornBackup: volume pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c initiating backup for snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Finished snapshot" snapshot=snapshot-20d81f05-864b-489e-8875-3ea71832a743 volume=pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Finished to snapshot: 10.42.3.102:10105 snapshot-20d81f05-864b-489e-8875-3ea71832a743 UserCreated true Created at 2023-06-06T12:35:02Z, Labels map[type:bak]"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-e-2e5a625b] time="2023-06-06T12:35:02Z" level=info msg="Finished to snapshot: 10.42.1.154:10060 snapshot-20d81f05-864b-489e-8875-3ea71832a743 UserCreated true Created at 2023-06-06T12:35:02Z, Labels map[type:bak]"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Removing disk volume-head-004.img"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Finished creating disk" disk=snapshot-20d81f05-864b-489e-8875-3ea71832a743
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk file /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-99a60236/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img before linking"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk checksum file /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-99a60236/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img.checksum before linking"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-46492401] time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk metadata file path /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-99a60236/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img.meta before linking"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Removing disk volume-head-004.img"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Finished creating disk" disk=snapshot-20d81f05-864b-489e-8875-3ea71832a743
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk file /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-5286f5ef/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img before linking"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk checksum file /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-5286f5ef/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img.checksum before linking"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk metadata file path /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-5286f5ef/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img.meta before linking"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Starting to create disk" disk=snapshot-20d81f05-864b-489e-8875-3ea71832a743
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-e-2e5a625b] time="2023-06-06T12:35:02Z" level=info msg="Finished to snapshot: 10.42.2.93:10045 snapshot-20d81f05-864b-489e-8875-3ea71832a743 UserCreated true Created at 2023-06-06T12:35:02Z, Labels map[type:bak]"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-46492401] time="2023-06-06T12:35:02Z" level=info msg="Replica server starts to snapshot [snapshot-20d81f05-864b-489e-8875-3ea71832a743] volume, user created true, created time 2023-06-06T12:35:02Z, labels map[type:bak]"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Removing disk volume-head-003.img"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Finished creating disk" disk=snapshot-20d81f05-864b-489e-8875-3ea71832a743
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk file /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-90ade5bb/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img before linking"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk checksum file /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-90ade5bb/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img.checksum before linking"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-ff57e501] time="2023-06-06T12:35:02Z" level=info msg="Cleaning up new disk metadata file path /host/var/lib/longhorn/replicas/pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-90ade5bb/volume-snap-snapshot-20d81f05-864b-489e-8875-3ea71832a743.img.meta before linking"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Starting to create disk" disk=snapshot-20d81f05-864b-489e-8875-3ea71832a743
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-62d5874e] time="2023-06-06T12:35:02Z" level=info msg="Replica server starts to snapshot [snapshot-20d81f05-864b-489e-8875-3ea71832a743] volume, user created true, created time 2023-06-06T12:35:02Z, labels map[type:bak]"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Starting to create disk" disk=snapshot-20d81f05-864b-489e-8875-3ea71832a743
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="Starting to snapshot: 10.42.1.154:10060 snapshot-20d81f05-864b-489e-8875-3ea71832a743 UserCreated true Created at 2023-06-06T12:35:02Z, Labels map[type:bak]"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-e-2e5a625b] time="2023-06-06T12:35:02Z" level=info msg="Starting to snapshot: 10.42.2.93:10045 snapshot-20d81f05-864b-489e-8875-3ea71832a743 UserCreated true Created at 2023-06-06T12:35:02Z, Labels map[type:bak]"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-r-ff57e501] time="2023-06-06T12:35:02Z" level=info msg="Replica server starts to snapshot [snapshot-20d81f05-864b-489e-8875-3ea71832a743] volume, user created true, created time 2023-06-06T12:35:02Z, labels map[type:bak]"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-e-2e5a625b] time="2023-06-06T12:35:02Z" level=info msg="Starting to snapshot: 10.42.3.102:10105 snapshot-20d81f05-864b-489e-8875-3ea71832a743 UserCreated true Created at 2023-06-06T12:35:02Z, Labels map[type:bak]"
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-e-2e5a625b] time="2023-06-06T12:35:02Z" level=info msg="Requesting system sync before snapshot" snapshot=snapshot-20d81f05-864b-489e-8875-3ea71832a743 volume=pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c
2023-06-06T14:35:02+02:00	[pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c-e-2e5a625b] time="2023-06-06T12:35:02Z" level=info msg="Starting snapshot" snapshot=snapshot-20d81f05-864b-489e-8875-3ea71832a743 volume=pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c
2023-06-06T14:35:02+02:00	[longhorn-instance-manager] time="2023-06-06T12:35:02Z" level=info msg="Snapshotting volume: snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743" serviceURL="10.42.2.8:10009"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="createCSISnapshotTypeLonghornBackup: volume pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c initiating snapshot snapshot-20d81f05-864b-489e-8875-3ea71832a743"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="CreateSnapshot: req: {\"name\":\"snapshot-20d81f05-864b-489e-8875-3ea71832a743\",\"parameters\":{\"type\":\"bak\"},\"source_volume_id\":\"pvc-21bf4b76-ac60-48d9-b5ba-fe15b50dd87c\"}"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="GetPluginInfo: rsp: {\"name\":\"driver.longhorn.io\",\"vendor_version\":\"v1.4.2\"}"
2023-06-06T14:35:02+02:00	time="2023-06-06T12:35:02Z" level=info msg="GetPluginInfo: req: {}"
2023-06-06T14:35:02+02:00	I0606 12:35:02.112599       1 event.go:285] Event(v1.ObjectReference{Kind:"VolumeSnapshot", Namespace:"harbor", Name:"new-snapshot-test", UID:"20d81f05-864b-489e-8875-3ea71832a743", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"181825428", FieldPath:""}): type: 'Normal' reason: 'CreatingSnapshot' Waiting for a snapshot harbor/new-snapshot-test to be created by the CSI driver.
2023-06-06T14:35:02+02:00	I0606 12:35:02.112195       1 snapshot_controller.go:291] createSnapshotWrapper: Creating snapshot for content snapcontent-20d81f05-864b-489e-8875-3ea71832a743 through the plugin ...
2023-06-06T14:35:02+02:00	I0606 12:35:02.106794       1 snapshot_controller.go:919] Added protection finalizer to persistent volume claim harbor/harbor-jobservice
2023-06-06T14:35:02+02:00	I0606 12:35:02.093901       1 snapshot_controller.go:638] createSnapshotContent: Creating content for snapshot harbor/new-snapshot-test through the plugin ...
2023-06-06T14:35:01+02:00	time="2023-06-06T12:35:01Z" level=debug msg="Setting allow-recurring-job-while-volume-detached is false

The following error looks interesting:

snapshot_controller_base.go:265] could not sync content "snapcontent-20d81f05-864b-489e-8875-3ea71832a743": snapshot controller failed to update snapcontent-20d81f05-864b-489e-8875-3ea71832a743 on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-20d81f05-864b-489e-8875-3ea71832a743": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshotcontents/snapcontent-20d81f05-864b-489e-8875-3ea71832a743, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: d376de84-4aa4-426e-8c97-f96df3073b73, UID in object meta:


@R-Studio
Author

@vineetsingh5 did you find a solution for this issue?

@R-Studio
Author

@weizhe0422 did you find anything interesting in the support bundle?

@anthony-pastor

I have almost the same setup and versions, and the same issue!
One interesting log line found in the longhorn-csi-plugin:

longhorn-csi-plugin-5k8lg longhorn-csi-plugin time="2023-10-06T08:12:20Z" level=info msg="DeleteSnapshot: req: {\"snapshot_id\":\"bak://pvc-c57da450-ce82-44c8-ac83-0a039634a334/backup-04db0d0fe4ef49f1\"}"
longhorn-csi-plugin-5k8lg longhorn-csi-plugin time="2023-10-06T08:12:20Z" level=info msg="DeleteSnapshot: rsp: {}"
csi-snapshotter-5d899fdcfc-xv627 csi-snapshotter E1006 08:12:20.143392 1 snapshot_controller_base.go:265] could not sync content "snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e": snapshot controller failed to update snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e on API server: Operation cannot be fulfilled on volumesnapshotcontents.snapshot.storage.k8s.io "snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e": StorageError: invalid object, Code: 4, Key: /registry/snapshot.storage.k8s.io/volumesnapshotcontents/snapcontent-55c4399b-1dec-4cf2-b9bd-a4eff27f315e, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: d4fa9b1d-416e-4df5-ad74-d3ac6bec3b66, UID in object meta:

@R-Studio
Author

R-Studio commented Dec 5, 2023

I updated to Velero v1.12.2 with velero-plugin-for-csi v0.6.2 and velero-plugin-for-aws v1.8.2, but the issue is still the same.

@innobead innobead added the area/upstream label Dec 5, 2023
@innobead
Member

innobead commented Dec 5, 2023

Delete the backup velero backup delete
The snapshot (snapshots.longhorn.io) is not deleted.

I believe what Velero triggered is to delete the corresponding Longhorn backup (out-of-cluster) instead of the Longhorn snapshot (in-cluster, immutable COW layers for volume).

In the current design, deleting a backup is independent of deleting the corresponding snapshot generated by that backup. What is your expectation here or a feature you are looking for?

Please check whether https://longhorn.io/docs/1.5.3/snapshots-and-backups/scheduling-backups-and-snapshots/ works for you; note that this is our built-in mechanism and has nothing to do with Velero.

@mantissahz please follow up.

@R-Studio
Author

R-Studio commented Dec 5, 2023

@innobead thanks for your reply.
What I want: I have a Velero schedule that creates backups of my persistent volumes with a retention period of e.g. 7 days. After this retention period Velero deletes these backups, but the corresponding snapshots are not deleted and consume disk space that I don't want to spend.
As a workaround, I have a recurring job that deletes these snapshots (retain 7), but it has two disadvantages:

  • I'm using up disk space for snapshots I don't want, which are already stored in my object store.
  • If I trigger, for example, 3 manual backups with Velero, the recurring job doesn't delete the snapshots based on the creation timestamp the way Velero does. This means I can lose backup data that is older than 4 days.

@julienvincent

@R-Studio @innobead

I think this issue can be simplified to completely exclude Velero.

At its core, the issue is that Longhorn does not delete snapshots or backups when the backing CSI VolumeSnapshot resource is deleted.

As a user of Longhorn that is interfacing with CSI and not native Longhorn resources, I expect the state of Longhorn resources to reflect the state of my CSI resources.

  • If I create a CSI VolumeSnapshot I expect Longhorn to create a snapshot/backup/bi. This works!
  • If I delete a CSI VolumeSnapshot I expect Longhorn to delete the backing snapshot/backup/bi that it created. This doesn't work.

Therefore I think it's fair to state that Longhorn is currently only providing a partial implementation of the CSI interface/spec.

Velero is just using this common CSI interface as it is intended to be used and expecting it to have the desired effect. This is not a Velero issue.


Perhaps this should be opened as a new issue with a smaller scope (CSI spec conformance).

@innobead innobead added this to the v1.7.0 milestone Dec 26, 2023
@innobead innobead added priority/0, require/auto-e2e-test, require/doc, and area/volume-backup-restore labels and removed the kind/bug label Dec 26, 2023
@innobead innobead added kind/improvement and area/snapshot labels Dec 26, 2023
@innobead
Member

Thanks for the valuable info.

We will improve this, as it's quite important for space efficiency.

@innobead innobead added area/space-efficiency and require/backport labels Dec 26, 2023
@larssb

larssb commented Apr 22, 2024

Kind request for a status update here. What's the status? This issue makes combining Velero with Longhorn somewhat quirky, as we seem to have to clean up snapshots ourselves instead of every backup artefact being deleted when a Velero backup is removed by Velero's internal clean-up jobs.

@derekbit derekbit modified the milestones: v1.7.0, v1.8.0 May 17, 2024
@ChanYiLin
Contributor

ChanYiLin commented Jul 29, 2024

VolumeSnapshot type: snap

Both creation and deletion work:

  • Create a CSI VolumeSnapshot => Longhorn creates a Snapshot
  • Delete the CSI VolumeSnapshot => Longhorn deletes the Snapshot

Note that Longhorn won't delete the latest snapshot, which sits just behind the volume-head; it only marks it as removed.
Reference: https://longhorn.io/docs/1.6.2/concepts/#243-deleting-snapshots

VolumeSnapshot type: bak

Both creation and deletion work:

  • Create a CSI VolumeSnapshot => Longhorn creates a Backup
  • Delete the CSI VolumeSnapshot => Longhorn deletes the Backup

As mentioned by @innobead above and by @PhanLe1010 in another thread (vmware-tanzu/velero#6179),
Longhorn first creates the Snapshot (Longhorn snapshot) and then creates the Backup (Longhorn backup) based on that snapshot.
After the backup is created, the snapshot is no longer bound to the backup.
Thus, when users delete the CSI VolumeSnapshot, Longhorn only deletes the backup (Longhorn backup) but not the snapshot (Longhorn snapshot).
That is why the snapshots (Longhorn snapshots) remain in this issue. This is expected, because one Longhorn snapshot can correspond to multiple Longhorn backups.

There are two ways to auto-delete these snapshots:

  1. Set up a snapshot-delete recurring job to periodically delete the Longhorn Snapshots (see the sketch after this list).

  2. In the upcoming v1.7.0 release, we have a new setting auto-cleanup-when-delete-backup which can automatically clean up the snapshot when the backup is deleted. Reference: feat: remove related snapshot when removing backup longhorn-manager#2783
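
A minimal sketch of option 1, assuming Longhorn >= v1.6 (where the snapshot-delete task type is available); the name, schedule, and retain count below are illustrative:

apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-delete-daily        # illustrative name
  namespace: longhorn-system
spec:
  name: snapshot-delete-daily
  task: snapshot-delete              # delete snapshots exceeding the retain count
  cron: "0 3 * * *"                  # daily at 03:00
  retain: 7                          # keep the 7 newest snapshots per volume
  concurrency: 2
  groups:
  - default                          # volumes in the default group
  labels: {}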

VolumeSnapshot type: bi

Both creation and deletion work:

  • Create a CSI VolumeSnapshot => Longhorn creates a BackingImage
  • Delete the CSI VolumeSnapshot => Longhorn deletes the BackingImage

cc @R-Studio @julienvincent: the CSI VolumeSnapshot creation and deletion functions are implemented in Longhorn. I think there is just some confusion between Snapshot and Backup in Longhorn.
cc @larssb: yes, since a Longhorn Backup and a Longhorn Snapshot are not bound together after creation, you have to delete the Longhorn Snapshot manually or via a snapshot-delete recurring job. From v1.7.0 on, the new auto-cleanup-when-delete-backup setting can clean up the Snapshot automatically.
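
Once on v1.7.0, enabling that behavior should amount to flipping the corresponding Longhorn setting, e.g. (a sketch using the setting name quoted above; Longhorn settings are settings.longhorn.io objects with a top-level value field, so verify the exact name against the v1.7.0 docs):

kubectl -n longhorn-system patch settings.longhorn.io auto-cleanup-when-delete-backup \
  --type merge -p '{"value": "true"}'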

@R-Studio
Author

R-Studio commented Aug 12, 2024

@ChanYiLin thanks for the great summary. The new setting auto-cleanup-when-delete-backup sounds really helpful and we will test it after Rancher releases the Helm chart for v1.7.0 (and close this issue if it works).
FYI: @lucatr

@ChanYiLin
Contributor

I am going to close this issue for now.
Feel free to reopen it if there are any new comments.
