On restore PV is not getting restored correctly in Azure AKS #6595

Closed
soumyapattnaik opened this issue Aug 3, 2023 · 26 comments
Labels: Area/CSI, area/datamover, Bug, kind/refactor, Need E2E Test Case, Needs investigation, Needs triage, pv-backup-info, Reviewed Q3 2023, target/1.12.2

@soumyapattnaik

soumyapattnaik commented Aug 3, 2023

What steps did you take and what happened:
I created a namespace soumya and created a deployment, a PVC, and a PV (a dynamically provisioned CSI disk) in the namespace. The PV was created with the Retain reclaim policy.
Then I triggered a Velero backup for the cluster.

After the backup completed, I deleted the deployment, PVC, PV, and the underlying Azure disk, and then triggered a restore.

Post restore, I see that the deployment and PVC are restored. The PV is restored with the same name it had at backup time, but the underlying disk is missing.

What did you expect to happen:

On restore, the PV should have been created with a different name since it is dynamically provisioned. To confirm this behaviour, I triggered another restore from the same backup, but this time I restored the backed-up resources into a different namespace, soumya1. Post restore I could see the deployment, PVC, PV (with a different name), and the underlying disk being created successfully.

Sharing the data post restore. During backup, the PV name was also pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4.

PS C:\Users\sopattna\aks> kubectl get deployment --namespace=soumya
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment1   0/2     2            0           162m
nginx-deployment2   0/2     2            0           162m
PS C:\Users\sopattna\aks> kubectl get pvc --namespace=soumya
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-azuredisk2   Bound    pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4   10Gi       RWO            managed-csi    163m
PS C:\Users\sopattna\aks> kubectl get deployment --namespace=soumya1
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment1   2/2     2            2           157m
nginx-deployment2   2/2     2            2           157m
PS C:\Users\sopattna\aks> kubectl get pvc --namespace=soumya1
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
pvc-azuredisk2   Bound    pvc-3eb1b795-b4d2-48bf-89d9-3ce518861d53   10Gi       RWO            managed-csi    158m
PS C:\Users\sopattna\aks> kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                    STORAGECLASS   REASON   AGE
pvc-3eb1b795-b4d2-48bf-89d9-3ce518861d53   10Gi       RWO            Delete           Bound    soumya1/pvc-azuredisk2   managed-csi             158m
pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4   10Gi       RWO            Retain           Bound    soumya/pvc-azuredisk2    managed-csi             163m

Attaching the logs for the backup and restores:
clusterbackup-dataprotection-microsoft-backup-47e70f1e-c1fa-4615-bc20-b01bf772c1d5-logs - backup logs
restore-clusterbackup-dataprotection-microsoft-restore-6f926513-fc7e-439d-a20c-a8cda64973a4-logs - restore to the soumya namespace. Here a PV with the same name is created, which is not expected.
restore-clusterbackup-dataprotection-microsoft-restore-8aa648c4-4ac2-4729-acef-220a613e6eeb-logs - restore to namespace soumya1 with a dynamically created PV, which is the desired behavior.
restore-clusterbackup-dataprotection-microsoft-restore-6f926513-fc7e-439d-a20c-a8cda64973a4-logs.gz
clusterbackup-dataprotection-microsoft-backup-47e70f1e-c1fa-4615-bc20-b01bf772c1d5-logs.gz
restore-clusterbackup-dataprotection-microsoft-restore-8aa648c4-4ac2-4729-acef-220a613e6eeb-logs.gz

@ywk253100
Contributor

Could you provide the YAML of the PV post restore? The status should contain information about why no underlying disk is created.

@ywk253100 ywk253100 added the Needs info Waiting for information label Aug 3, 2023
@soumyapattnaik
Author

soumyapattnaik commented Aug 4, 2023

pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4.txt

Please find the YAML of the restored PV.

@pradeepkchaturvedi pradeepkchaturvedi added the 1.13-candidate issue/pr that should be considered to target v1.13 minor release label Aug 4, 2023
@ywk253100
Contributor

The PV is bound and its underlying disk is /subscriptions/f0c630e0-2995-4853-b056-0b3c09cb673f/resourceGroups/mc_sopattnaeacan_sopattnaeatest_eastasia/providers/Microsoft.Compute/disks/pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4, so why do you think the disk is missing?

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2023-08-03T03:58:40Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/disk-csi-azure-com
  labels:
    velero.io/backup-name: clusterbackup-dataprotection-microsoft-backup-47e70f1e-c11bbb41
    velero.io/restore-name: clusterbackup-dataprotection-microsoft-restore-6f926513-f9d48e2
  name: pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4
  resourceVersion: "1519305"
  uid: 2d8a19ec-5498-45a4-b136-a69b3cc1193a
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: pvc-azuredisk2
    namespace: soumya
    resourceVersion: "1519303"
    uid: 8bc0a92c-07bd-4eb7-9008-3ea94c97562b
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      csi.storage.k8s.io/pv/name: pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4
      csi.storage.k8s.io/pvc/name: pvc-azuredisk2
      csi.storage.k8s.io/pvc/namespace: soumya
      requestedsizegib: "10"
      skuname: StandardSSD_LRS
      storage.kubernetes.io/csiProvisionerIdentity: 1690791203796-6707-disk.csi.azure.com
    volumeHandle: /subscriptions/f0c630e0-2995-4853-b056-0b3c09cb673f/resourceGroups/mc_sopattnaeacan_sopattnaeatest_eastasia/providers/Microsoft.Compute/disks/pvc-41f06d44-68f8-4e82-a3e3-f479ae7c09c4
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - ""
  persistentVolumeReclaimPolicy: Retain
  storageClassName: managed-csi
  volumeMode: Filesystem
status:
  phase: Bound

@soumyapattnaik
Author

This YAML got restored as-is from the backup point; the PV name was the same during backup and post restore. Before the restore, the PV and the underlying disk referenced in the volume handle were deleted. During restore, the PV YAML was restored as-is without creating the underlying disk, so post restore the underlying disk is missing.
For the PV that was restored to the different namespace soumya1, the volume was dynamically provisioned with a different PV name, not the same name the PV had during backup. I think this should have been the behavior during the first restore as well.

@ywk253100
Contributor

@soumyapattnaik Could you upload the debug bundle by running velero debug --backup xxx --restore xxx?

@ywk253100 ywk253100 self-assigned this Aug 6, 2023
@soumyapattnaik
Author

soumyapattnaik commented Aug 8, 2023

I had to repro the issue again as my backup point had been cleaned up. This time I created two deployments, one with a PVC using the Retain reclaim policy and the other using Delete.
Snapshot of the resources during backup:

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                   STORAGECLASS         REASON   AGE
pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f   10Gi       RWO            Retain           Bound    soumya/pvc-azuredisk2   managed-csi-retain            91m
pvc-7771cac2-982c-4f6a-a9c1-aee48e6b1d3d   10Gi       RWO            Delete           Bound    soumya/pvc-azuredisk1   managed-csi                   92m

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
pvc-azuredisk1   Bound    pvc-7771cac2-982c-4f6a-a9c1-aee48e6b1d3d   10Gi       RWO            managed-csi          94m
pvc-azuredisk2   Bound    pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f   10Gi       RWO            managed-csi-retain   93m

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get deploy
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment1   2/2     2            2           93m
nginx-deployment2   2/2     2            2           93m

Then I deleted all the data in my namespace for both types of volumes:

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl delete deploy nginx-deployment1 nginx-deployment2
deployment.apps "nginx-deployment1" deleted
deployment.apps "nginx-deployment2" deleted

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl delete pvc pvc-azuredisk1 pvc-azuredisk2
persistentvolumeclaim "pvc-azuredisk1" deleted
persistentvolumeclaim "pvc-azuredisk2" deleted

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                   STORAGECLASS         REASON   AGE
pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f   10Gi       RWO            Retain           Released   soumya/pvc-azuredisk2   managed-csi-retain            95m
soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl delete pv pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f
persistentvolume "pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f" deleted

Post restore

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get deploy
NAME                READY   UP-TO-DATE   AVAILABLE   AGE
nginx-deployment1   2/2     2            2           75m
nginx-deployment2   0/2     2            0           75m
soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get pvc
NAME             STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
pvc-azuredisk1   Bound    pvc-25fc81fe-f034-4e24-b795-f40ad7dc4888   10Gi       RWO            managed-csi          75m
pvc-azuredisk2   Bound    pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f   10Gi       RWO            managed-csi-retain   75m
soumya@MININT-JHGO7F8:~/velero/tilt-resources$ kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                   STORAGECLASS         REASON   AGE
pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f   10Gi       RWO            Retain           Bound    soumya/pvc-azuredisk2   managed-csi-retain            75m
pvc-25fc81fe-f034-4e24-b795-f40ad7dc4888   10Gi       RWO            Delete           Bound    soumya/pvc-azuredisk1   managed-csi                   75m

As I can see, for the PV with the Retain reclaim policy, the PV was again created with the same name as during backup (pvc-0e6b64d1-f141-4bf4-b044-ffdc02c99b3f), with the underlying Azure disk not created.

soumya@MININT-JHGO7F8:~/velero/tilt-resources$ velero debug --backup clusterbackup-dataprotection-microsoft-backup-4862cd59-9685-429a-bd5a-69e607f6bc29 --restore clusterbackup-dataprotection-microsoft-restore-24c230a6-c7ad-4174-8672-b405b025c312
2023/08/08 22:46:34 Collecting velero resources in namespace: dataprotection-microsoft
2023/08/08 22:46:38 Collecting velero deployment logs in namespace: dataprotection-microsoft
2023/08/08 22:46:44 Collecting log and information for backup: clusterbackup-dataprotection-microsoft-backup-4862cd59-9685-429a-bd5a-69e607f6bc29
2023/08/08 22:46:50 Collecting log and information for restore: clusterbackup-dataprotection-microsoft-restore-24c230a6-c7ad-4174-8672-b405b025c312
2023/08/08 22:46:57 Generated debug information bundle: /home/soumya/velero/tilt-resources/bundle-2023-08-08-22-46-32.tar.gz

Attaching the bundle for debugging.

bundle-2023-08-08-22-46-32.tar.gz

@ywk253100 ywk253100 added Area/CSI Related to Container Storage Interface support Bug and removed Needs info Waiting for information labels Aug 9, 2023
@ywk253100
Contributor

ywk253100 commented Aug 14, 2023

@blackpiglet I reproduced this issue in my env.
It seems related to the code that still restores the PV if its reclaim policy is Retain, even though CSI snapshotting is enabled.
Please also note that the underlying disk was removed before performing the restoration.
Could you take a look at it?

@ywk253100 ywk253100 assigned blackpiglet and unassigned ywk253100 Aug 14, 2023
@ywk253100
Contributor

PV:

k get pv pvc-05889bc3-1007-46d9-baa3-eed008af3b56 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: disk.csi.azure.com
    volume.kubernetes.io/provisioner-deletion-secret-name: ""
    volume.kubernetes.io/provisioner-deletion-secret-namespace: ""
  creationTimestamp: "2023-08-14T06:03:30Z"
  finalizers:
  - kubernetes.io/pv-protection
  - external-attacher/disk-csi-azure-com
  labels:
    velero.io/backup-name: csi-03
    velero.io/restore-name: csi-03
  name: pvc-05889bc3-1007-46d9-baa3-eed008af3b56
  resourceVersion: "55261213"
  uid: cdf6b35b-76e9-47ea-aade-f1d5cd775f1c
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: etcd0-pv-claim
    namespace: default
    resourceVersion: "55261210"
    uid: e98c2bc6-63b7-42a1-9526-633c2720b97f
  csi:
    driver: disk.csi.azure.com
    volumeAttributes:
      csi.storage.k8s.io/pv/name: pvc-05889bc3-1007-46d9-baa3-eed008af3b56
      csi.storage.k8s.io/pvc/name: etcd0-pv-claim
      csi.storage.k8s.io/pvc/namespace: default
      requestedsizegib: "1"
      skuname: StandardSSD_LRS
      storage.kubernetes.io/csiProvisionerIdentity: 1691152563681-3939-disk.csi.azure.com
    volumeHandle: /subscriptions/xxx/resourceGroups/xxx/providers/Microsoft.Compute/disks/pvc-05889bc3-1007-46d9-baa3-eed008af3b56
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.disk.csi.azure.com/zone
          operator: In
          values:
          - ""
  persistentVolumeReclaimPolicy: Retain
  storageClassName: test
  volumeMode: Filesystem
status:
  phase: Bound

PVC:

k get pvc etcd0-pv-claim -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    backup.velero.io/must-include-additional-items: "true"
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"PersistentVolumeClaim","metadata":{"annotations":{},"name":"etcd0-pv-claim","namespace":"default"},"spec":{"accessModes":["ReadWriteOnce"],"resources":{"requests":{"storage":"1Gi"}},"storageClassName":"test"}}
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    velero.io/backup-name: csi-03
    velero.io/volume-snapshot-name: velero-etcd0-pv-claim-rm9zd
  creationTimestamp: "2023-08-14T06:03:31Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    backup.velero.io/must-include-additional-items: "true"
    velero.io/backup-name: csi-03
    velero.io/restore-name: csi-03
    velero.io/volume-snapshot-name: velero-etcd0-pv-claim-rm9zd
  name: etcd0-pv-claim
  namespace: default
  resourceVersion: "55261215"
  uid: e98c2bc6-63b7-42a1-9526-633c2720b97f
spec:
  accessModes:
  - ReadWriteOnce
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: velero-etcd0-pv-claim-rm9zd
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: velero-etcd0-pv-claim-rm9zd
  resources:
    requests:
      storage: 1Gi
  storageClassName: test
  volumeMode: Filesystem
  volumeName: pvc-05889bc3-1007-46d9-baa3-eed008af3b56
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 1Gi
  phase: Bound

The pod cannot start up because the disk is not found:

k describe po etcd0
Name:         etcd0
Namespace:    default
Priority:     0
Node:         aks-sysnodepool-28167335-vmss000000/xxx
Start Time:   Mon, 14 Aug 2023 14:03:31 +0800
Labels:       app=etcd
              etcd_node=etcd0
              velero.io/backup-name=csi-03
              velero.io/restore-name=csi-03
Annotations:  <none>
Status:       Pending
IP:
IPs:          <none>
Containers:
  etcd0:
    Container ID:
    Image:         quay.io/coreos/etcd:latest
    Image ID:
    Ports:         2379/TCP, 2380/TCP
    Host Ports:    0/TCP, 0/TCP
    Command:
      /usr/local/bin/etcd
      --name
      etcd0
      --initial-advertise-peer-urls
      http://etcd0:2380
      --listen-peer-urls
      http://0.0.0.0:2380
      --listen-client-urls
      http://0.0.0.0:2379
      --advertise-client-urls
      http://etcd0:2379
      --initial-cluster
      etcd0=http://etcd0:2380,etcd1=http://etcd1:2380,etcd2=http://etcd2:2380
      --initial-cluster-state
      new
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /etcd0.etcd from etcd0-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bb29t (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  etcd0-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  etcd0-pv-claim
    ReadOnly:   false
  kube-api-access-bb29t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason              Age                 From                     Message
  ----     ------              ----                ----                     -------
  Warning  FailedMount         28m (x2 over 64m)   kubelet                  Unable to attach or mount volumes: unmounted volumes=[etcd0-storage], unattached volumes=[kube-api-access-bb29t etcd0-storage]: timed out waiting for the condition
  Warning  FailedMount         90s (x27 over 62m)  kubelet                  Unable to attach or mount volumes: unmounted volumes=[etcd0-storage], unattached volumes=[etcd0-storage kube-api-access-bb29t]: timed out waiting for the condition
  Warning  FailedAttachVolume  86s (x40 over 66m)  attachdetach-controller  AttachVolume.Attach failed for volume "pvc-05889bc3-1007-46d9-baa3-eed008af3b56" : rpc error: code = NotFound desc = Volume not found, failed with error: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 404, RawError: {"error":{"code":"ResourceNotFound","message":"The Resource 'Microsoft.Compute/disks/pvc-05889bc3-1007-46d9-baa3-eed008af3b56' under resource group 'mc_yinw-resource-group-01_yinw-cluster-11_westus' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix"}}

@blackpiglet
Contributor

blackpiglet commented Aug 22, 2023

@soumyapattnaik
This is a corner case, and we may need to add some logic to avoid the error.
IMO, this should be handled similarly to volumes backed up by Velero native snapshots: if a CSI snapshot can be found, do not restore the PV and let Kubernetes create the PV dynamically; if no CSI snapshot can be found and the PV's reclaim policy is Retain, restore the PV. A sketch of this decision is below.
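For illustration, a minimal Go sketch of that proposed decision follows. The function and parameter names are hypothetical, not Velero's actual restore code; it assumes the caller already knows whether a CSI snapshot exists for the volume.

package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// shouldRestorePV sketches the proposed decision: skip the PV object when a
// CSI snapshot exists, so Kubernetes dynamically provisions a new volume
// (and a new underlying disk) from the snapshot, and only restore the PV
// object as-is when there is no snapshot and its reclaim policy is Retain,
// i.e. the original disk is expected to still exist.
// This is a hypothetical helper, not part of Velero.
func shouldRestorePV(pv *corev1.PersistentVolume, csiSnapshotExists bool) bool {
	if csiSnapshotExists {
		// Let the CSI provisioner re-create the volume from the snapshot.
		return false
	}
	// No snapshot to restore from: keep the PV object only if it points at a
	// retained, pre-existing disk.
	return pv.Spec.PersistentVolumeReclaimPolicy == corev1.PersistentVolumeReclaimRetain
}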

@Lyndon-Li
Contributor

Looks like the CSI snapshot data mover is also affected.

@reasonerjt
Contributor

If we want to track how PVs are snapshotted, we need a design, as this may be a breaking change.

@reasonerjt reasonerjt added the Needs triage We need discussion to understand problem and decide the priority label Sep 6, 2023
@reasonerjt reasonerjt self-assigned this Sep 20, 2023
@anshulahuja98
Collaborator

@soumyapattnaik This is a corner case, and we may need to add some logic to avoid the error. IMO, this should be handled similarly to volumes backed up by Velero native snapshots: if a CSI snapshot can be found, do not restore the PV and let Kubernetes create the PV dynamically; if no CSI snapshot can be found and the PV's reclaim policy is Retain, restore the PV.

@Lyndon-Li will this change be relatively trivial to add, or will it require a design, as @reasonerjt commented above?

@anshulahuja98
Collaborator

I feel this is a critical scenario that could impact restores for end customers, and I wanted to see how we can prioritize this...

@blackpiglet
Contributor

@anshulahuja98
At first glance it looks like a completely new feature, but after discussion we think the skipped-PV summary can perhaps be extended to address this issue. Of course, I can write a design document for this.

#6496

@anshulahuja98
Collaborator

Can't we just enhance the hasSnapshots check to also cover CSI?
Hopefully that would be enough?

@anshulahuja98
Collaborator

Skipping PVCs which had the Retain reclaim policy but still had snapshots and no PVC in the target cluster:
this behavior is not correct.

@blackpiglet
Contributor

blackpiglet commented Sep 22, 2023

The reason for not preferring that approach is that it's better not to read the plugin results: enhancing hasSnapshots would require reading the VolumeSnapshots and DataUploads generated by the CSI plugin.

It's not just the skipped PVCs. The plan is to extend the summary and add handling information for all PVCs, including which method each volume was backed up by, its status, snapshot, and other useful information, then store that in the backup object storage as a separate metadata file.
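For illustration only, one entry in such a per-volume metadata file could look roughly like the struct below. Every field name here is hypothetical, not the actual Velero format; the point is only to show the kind of information being discussed.

package sketch

// VolumeBackupInfo is an illustrative shape for one per-PVC entry in the
// proposed metadata file stored next to the backup in object storage.
// All field names are hypothetical, not the final Velero design.
type VolumeBackupInfo struct {
	PVCName        string `json:"pvcName"`
	PVCNamespace   string `json:"pvcNamespace"`
	PVName         string `json:"pvName"`
	ReclaimPolicy  string `json:"reclaimPolicy"`
	BackupMethod   string `json:"backupMethod"`             // e.g. "CSISnapshot", "NativeSnapshot", "PodVolumeBackup"
	SnapshotHandle string `json:"snapshotHandle,omitempty"` // set when a snapshot was taken
	Skipped        bool   `json:"skipped"`
	SkippedReason  string `json:"skippedReason,omitempty"`
}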

@anshulahuja98
Collaborator

Are you suggesting that once this information is backed up in a separate JSON file in storage, you'll rely on that value in the hasSnapshot function?
That will take some design to mature and use. In the short term, can we think of a fix to ensure restore doesn't break for such PVCs?
I currently see this as a bug in CSI restore, and there should be a way to fix it, and even potentially backport it to previous versions, without the long route of saving this info.

@anshulahuja98
Collaborator

Also, I believe hasSnapshot currently checks for snapshots taken by the plugins of different clouds;
I am just suggesting extending that to CSI for now, as sketched below.
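A rough sketch of what extending a hasSnapshot-style check to CSI could look like, assuming the restore has the backup's VolumeSnapshot objects available. The function name and the external-snapshotter client version in the import path are assumptions, not Velero's actual code.

package sketch

import (
	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
)

// pvcHasCSISnapshot reports whether any VolumeSnapshot taken during the backup
// references the given PVC. It mirrors the idea of extending the existing
// hasSnapshot-style check to CSI; it is a sketch, not Velero's implementation.
func pvcHasCSISnapshot(volumeSnapshots []snapshotv1.VolumeSnapshot, pvcNamespace, pvcName string) bool {
	for _, vs := range volumeSnapshots {
		if vs.Namespace != pvcNamespace {
			continue
		}
		src := vs.Spec.Source.PersistentVolumeClaimName
		if src != nil && *src == pvcName {
			return true
		}
	}
	return false
}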

@blackpiglet
Contributor

That makes sense; the backported change should be more concise.
I will create a design PR for the main branch, and the backport PR will use the existing behavior.

@anshulahuja98
Collaborator

Great that we have consensus!
I will add this to the 1.13 milestone. Hope that's okay, @blackpiglet / @reasonerjt.

@anshulahuja98 anshulahuja98 added this to the v1.13 milestone Sep 22, 2023
@reasonerjt
Contributor

@anshulahuja98
It's OK but we'll make sure there's a design about the structure of the additional metadata for PV. I'll try to discuss that in Community meeting after we are back from holiday.

@reasonerjt reasonerjt removed the 1.13-candidate issue/pr that should be considered to target v1.13 minor release label Oct 18, 2023
@blackpiglet
Contributor

@soumyapattnaik @anshulahuja98 @shubham-pampattiwar
After discussion, the Velero team thinks this is a legacy issue that only impacts some corner cases.
As a result, we have put back-porting the fix to existing releases on hold.

Please note that the main branch will introduce new backup volume information metadata, and Velero will decide based on the content of the new metadata file.

The fix is already on the main branch and will be included in the coming v1.13 release. The fix PR is #7061; it lets Velero also work with backups in the old format.

According to the N-2 support policy, #7061 will be kept for two releases, which means it will be removed in v1.15.x. After that, the backup volume information metadata will be used instead.

Is this acceptable to you?

@anshulahuja98
Collaborator

anshulahuja98 commented Nov 10, 2023

@blackpiglet I understand that you would prefer not to cut a release for older versions <= 1.11,
but I would request that we at least take this to 1.12, since we will have another patch release for it anyway.

Just having it on main is not sufficient; at least one GA'd release should ideally have the changes.

@blackpiglet
Contributor

@anshulahuja98
Got it. #7087 is created to cherry-pick the fix to release-1.12.

@anshulahuja98
Collaborator

Great, thank you so much @blackpiglet.
