
Kopia fails with data path backup failed: Failed to run kopia backup: unable to get local block device entry: resolveSymlink: lstat /var/lib/kubelet: no such file or directory #7947

e3b0c442 opened this issue Jun 28, 2024 · 11 comments
e3b0c442 commented Jun 28, 2024

What steps did you take and what happened:

I attempted a backup with a CSI snapshot data mover.
Command: velero backup create one-backup --include-namespaces vms -l kubevirt.io/vm=one --snapshot-move-data --wait
The backup ended PartiallyFailed, with the volume data never reaching the remote backup location:

Backup Item Operations:
  Operation for persistentvolumeclaims vms/one-rootdisk:
    Backup Item Action Plugin:  velero.io/csi-pvc-backupper
    Operation ID:               du-aaaf854f-a5d7-4648-a87c-7de3a7e0dcb9.1104f25b-5c11-40196df57
    Items to Update:
                           datauploads.velero.io velero/one-backup-9vfl8
    Phase:                 Failed
    Operation Error:       data path backup failed: Failed to run kopia backup: unable to get local block device entry: resolveSymlink: lstat /var/lib/kubelet: no such file or directory
    Progress description:  Failed
    Created:               2024-06-27 08:02:39 -0500 CDT
    Started:               2024-06-27 08:02:54 -0500 CDT
    Updated:               2024-06-27 08:02:56 -0500 CDT

What did you expect to happen:
The backup to complete successfully.

The following information will help us better understand what's going on:
bundle-2024-06-27-20-50-21.tar.gz

Anything else you would like to add:

I created a debug pod with a mount for /var/lib/kubelet/pods matching the node-agent pod spec and was able to interact with the directory without issue, so I do not believe this is an OS-level problem. Unfortunately, the backup mover pod disappears so quickly that I wasn't able to capture its pod spec.
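For context, the debug pod was along these lines (a minimal sketch, not the exact manifest; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: kubelet-debug        # placeholder name
  namespace: velero
spec:
  nodeName: go               # pin to the affected node
  containers:
  - name: debug
    image: busybox
    command: ["sleep", "infinity"]
    volumeMounts:
    - name: host-pods
      mountPath: /host_pods  # same mount point the node-agent uses
  volumes:
  - name: host-pods
    hostPath:
      path: /var/lib/kubelet/pods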

Environment:

  • Velero version (use velero version):
	Client:
		Version: v1.14.0
		Git commit: -
	Server:
		Version: v1.14.0
  • Velero features (use velero client config get features):
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.29.6
  • Kubernetes installer & version: Talos v1.7.5
  • Cloud provider or hardware configuration: bare metal
  • OS (e.g. from /etc/os-release): Talos Linux v1.7.5

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
e3b0c442 (Author) commented:

I was finally able to capture the pod spec from the backup mover pod:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2024-06-28T01:58:54Z"
  labels:
    velero.io/data-upload: one-backup-8cv5q
    velero.io/exposer-pod-group: snapshot-exposer
  name: one-backup-8cv5q
  namespace: velero
  ownerReferences:
  - apiVersion: velero.io/v2alpha1
    controller: true
    kind: DataUpload
    name: one-backup-8cv5q
    uid: a9f1adae-b258-4fd6-bb0c-63a44a5b4105
  resourceVersion: "2931830"
  uid: 8c834182-aaaf-4c99-9ee1-6a1d8afe2899
spec:
  containers:
  - command:
    - /velero-helper
    - pause
    image: velero/velero:v1.14.0
    imagePullPolicy: Never
    name: a9f1adae-b258-4fd6-bb0c-63a44a5b4105
    resources: {}
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeDevices:
    - devicePath: /a9f1adae-b258-4fd6-bb0c-63a44a5b4105
      name: a9f1adae-b258-4fd6-bb0c-63a44a5b4105
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-jr452
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  nodeName: go
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: velero-server
  serviceAccountName: velero-server
  terminationGracePeriodSeconds: 0
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  topologySpreadConstraints:
  - labelSelector:
      matchLabels:
        velero.io/exposer-pod-group: snapshot-exposer
    maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway
  volumes:
  - name: a9f1adae-b258-4fd6-bb0c-63a44a5b4105
    persistentVolumeClaim:
      claimName: one-backup-8cv5q
  - name: kube-api-access-jr452
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-06-28T01:58:55Z"
    status: "False"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-06-28T01:58:55Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-06-28T01:58:55Z"
    message: 'containers with unready status: [a9f1adae-b258-4fd6-bb0c-63a44a5b4105]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-06-28T01:58:55Z"
    message: 'containers with unready status: [a9f1adae-b258-4fd6-bb0c-63a44a5b4105]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-06-28T01:58:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: velero/velero:v1.14.0
    imageID: ""
    lastState: {}
    name: a9f1adae-b258-4fd6-bb0c-63a44a5b4105
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        reason: ContainerCreating
  hostIP: 192.168.227.5
  hostIPs:
  - ip: 192.168.227.5
  phase: Pending
  qosClass: BestEffort
  startTime: "2024-06-28T01:58:55Z"


blackpiglet commented Jun 28, 2024

Please check the directory where the kubelet saves pod data.
The node-agent reads that directory via the host path /var/lib/kubelet.
Your environment may not use the same directory; please check, and update the node-agent volume setting accordingly:

      volumes:
      - hostPath:
          path: /var/lib/kubelet/pods
          type: ""
        name: host-pods
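To double-check what the deployed node-agent currently mounts, something like this should print the hostPath (a sketch; assumes the default DaemonSet name node-agent in the velero namespace):

kubectl -n velero get daemonset node-agent \
  -o jsonpath='{.spec.template.spec.volumes[?(@.name=="host-pods")].hostPath.path}'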


e3b0c442 commented Jun 28, 2024

I created a debug pod with a mount for /var/lib/kubelet/pods matching the node-agent pod spec and was able to interact with the directory without issue, so I do not believe this is an OS-level problem.

e3b0c442 (Author) commented:

I also tried setting the node-agent container's securityContext to privileged: true, with no improvement.
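For reference, that change looked roughly like this on the node-agent container (a sketch):

    securityContext:
      privileged: true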

e3b0c442 (Author) commented:

I tried rolling back to Velero 1.13 as well; no improvement.


e3b0c442 commented Jun 28, 2024

Verifying the correct directory:

~ talosctl ls -d3 -n go -Hl /var/lib/kubelet
NODE   MODE         UID   GID   SIZE(B)   LASTMOD         NAME
go     drwx------   0     0     215 B     1 day ago       kubelet
go     -rw-------   0     0     62 B      4 days ago      kubelet/cpu_manager_state
go     drwxr-xr-x   0     0     173 B     1 day ago       kubelet/device-plugins
go     Srwxr-xr-x   0     0     0 B       1 day ago       kubelet/device-plugins/gpu.intel.com-i915.sock
go     Srwxr-xr-x   0     0     0 B       1 day ago       kubelet/device-plugins/kubelet.sock
go     -rw-------   0     0     34 kB     1 day ago       kubelet/device-plugins/kubelet_internal_checkpoint
go     Srwxr-xr-x   0     0     0 B       1 day ago       kubelet/device-plugins/kubevirt-kvm.sock
go     Srwxr-xr-x   0     0     0 B       1 day ago       kubelet/device-plugins/kubevirt-tun.sock
go     Srwxr-xr-x   0     0     0 B       1 day ago       kubelet/device-plugins/kubevirt-vhost-net.sock
go     -rw-r--r--   0     0     89 B      1 day ago       kubelet/graceful_node_shutdown_state
go     -rw-------   0     0     61 B      4 days ago      kubelet/memory_manager_state
go     drwxr-xr-x   0     0     124 B     1 day ago       kubelet/pki
go     -rw-------   0     0     842 B     4 days ago      kubelet/pki/kubelet-client-2024-06-23-20-17-53.pem
go     Lrwxrwxrwx   0     0     59 B      4 days ago      kubelet/pki/kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2024-06-23-20-17-53.pem
go     -rw-r--r--   0     0     2.2 kB    1 day ago       kubelet/pki/kubelet.crt
go     -rw-------   0     0     1.7 kB    1 day ago       kubelet/pki/kubelet.key
go     drwxr-x---   0     0     98 B      4 days ago      kubelet/plugins
go     drwxr-x---   0     0     17 B      4 days ago      kubelet/plugins/kubernetes.io
go     drwxr-xr-x   0     0     39 B      1 day ago       kubelet/plugins/rook-ceph.cephfs.csi.ceph.com
go     drwxr-xr-x   0     0     22 B      1 day ago       kubelet/plugins/rook-ceph.rbd.csi.ceph.com
go     drwxr-x---   0     0     95 B      1 day ago       kubelet/plugins_registry
go     Srwx------   0     0     0 B       1 day ago       kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock
go     Srwx------   0     0     0 B       1 day ago       kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock
go     drwxr-x---   0     0     26 B      1 day ago       kubelet/pod-resources
go     Srwxr-xr-x   0     0     0 B       1 day ago       kubelet/pod-resources/kubelet.sock
go     drwxr-x---   0     0     8.2 kB    7 minutes ago   kubelet/pods
go     drwxr-x---   0     0     94 B      1 day ago       kubelet/pods/00fe8484-abc6-4778-a58a-3995a0686214
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/01800aea-6dbc-4d1b-ab75-c9f852627e5d
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/0330bf48-d894-438b-bf50-e56f544fd6c0
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/0943d7f9-dd30-47c7-8a6b-8df1e193a12a
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/1c949679-a7d1-443d-8654-ee09e4df1798
go     drwxr-xr-x   0     0     71 B      1 day ago       kubelet/pods/1cd491e3-b677-40e8-a121-4e246dea0135
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/21bce9fc-3557-4aea-b6ef-3ebd382a9528
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/2e4fb0b5-b00e-42a6-8bed-3d8574d77e6d
go     drwxr-x---   0     0     92 B      1 day ago       kubelet/pods/30b88225-f47e-4309-9ac9-b5d985f582d8
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/36cfc4bf-1b99-4102-8882-94b01b4f8208
go     drwxr-x---   0     0     94 B      1 day ago       kubelet/pods/375d7bf5-1b0e-40e2-88db-707cc8932d7c
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/37aa08cb-84ed-4042-acf6-29ddbcef2ed8
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/39233f8b-82eb-4377-a844-997bac9d30f8
go     drwxr-x---   0     0     94 B      1 day ago       kubelet/pods/394ab777-f75c-4a75-95a5-cfb0281c2eb8
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/3c853306-2cc8-48a0-80cf-65782e1a189a
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/4e3978bf-3850-4427-bfe3-033704babab8
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/551602cd-db8f-4459-a085-1ac342f44e74
go     drwxr-x---   0     0     94 B      13 hours ago    kubelet/pods/585bed97-dbdd-40a3-a525-ee76e50c9325
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/5df5d242-d8ba-4f11-bb02-f5c37843cbb8
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/61189325-b7c0-442f-95e7-2faf43d4317d
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/61a0d23e-5697-4335-b7a3-f4eef9a33e0b
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/63dd0b78-8c36-46b6-ab7d-ddedf2cf9151
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/6702d1ad-be5d-4470-84be-fc62babedae2
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/6c36f47e-06b4-4b48-844e-4fd4de313bed
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/6cc0bfd8-02fc-4156-b929-e47ceba7a76e
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/75245a84-8fbe-4463-8a1b-8573f3903161
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/75cbcc8a-3c7f-4e18-ba74-e417b33a89b4
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/76c25ca7-c55d-4763-ab39-ae779a76d7eb
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/77f243b2-7d26-4a0a-b10a-2bb24443b185
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/788fe973-4d51-4896-a419-e1f6b0f1117d
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/7b6a34b4-f63b-489f-9ccb-4f882035df06
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/7d99e6be-ebc9-4374-9389-5f2c02ecc7de
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/7ff62351-8169-4af0-90f1-be0574a9b631
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/869991e4-e584-4f5e-9ddf-526db1d29747
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/8d3beb6a-d8a5-4a0f-9d19-3228627b7019
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/904de1c0-ef44-4d2b-aad4-7c76dd898829
go     drwxr-xr-x   0     0     71 B      1 day ago       kubelet/pods/9266b9a2-1189-4ae3-91b9-f11a343af1a4
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/9607c310-e85c-47e5-8ede-ba1328c4b515
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/961e0e8d-5d60-4d8e-97f1-5d003376b87e
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/965cecd5-b749-4a20-a327-f9bd76007eed
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/a6a8f181-7060-42a3-bd05-d30d3592db47
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/a8f49850-fc11-4680-b50a-170da2da07b2
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/afc92b25-6be3-4fbe-b9a2-c87fdaec2220
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/b23280c8-4819-432e-b645-3895a83dc604
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/b6241eca-425f-492e-a388-7092c320e061
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/b7e866c4-0585-4236-9dc5-cfe3c0a9fa5b
go     drwxr-x---   0     0     94 B      1 day ago       kubelet/pods/bc8f408b-7a31-4be4-9cc1-5776951148cf
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/bfa93d74-7c1f-4421-afff-7a442e9b26c2
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/c885be9f-003d-4e52-b9f0-0fb72979d0d6
go     drwxr-x---   0     0     71 B      23 hours ago    kubelet/pods/c9c7949d-c27c-4eb6-b884-e87edef9d89a
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/cffc298d-36b5-40c1-8d3f-34110bdcf5b1
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/d0c7438a-d5f2-489b-8d95-19e44d776625
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/d3d7cffa-1e3c-4777-b919-fbd58cbb559b
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/d9ba9327-cde8-4573-8b3a-90b338766919
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/e24a3315-5459-49ad-ba79-ee587699fe47
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/e3644ff3-04df-478a-a60b-704f635ba1d2
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/e56ffb9c-4244-4d4e-a52a-f9a6810f93d6
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/e612f8b6-4b03-47e1-adda-257f0d255b89
go     drwxr-x---   0     0     71 B      8 minutes ago   kubelet/pods/e67e97eb-7bbe-4475-98b2-d2f6175d1eb7
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/e900eda3-9f43-4e5c-ad28-a4e156b121a9
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/e9ccf61f-25c2-409d-953a-15eca13a8c21
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/f4926bb6-f14f-40ad-9be1-2cdc533f5f00
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/f5e9ad10-b438-4bd3-9e09-3ce97cea619c
go     drwxr-x---   0     0     71 B      1 day ago       kubelet/pods/f95e5f63-b239-4fa1-8b5b-dff909adb81b
go     drwx------   0     0     22 B      4 days ago      kubelet/seccomp
go     drwx------   0     0     6 B       4 days ago      kubelet/seccomp/profiles

blackpiglet self-assigned this Jul 1, 2024
blackpiglet (Contributor) commented:

Cons:

It backs up data from the live file system, so the data is not captured at a single point in time and is less consistent than the snapshot approaches.
It accesses the file system from the mounted hostPath directory, so Velero node-agent pods need to run as the root user, and even under privileged mode in some environments.

Please check whether the node-agent runs as the root user. This can be done by setting runAsUser to 0 on the node-agent DaemonSet:

      securityContext:
        runAsUser: 0
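One quick way to verify against the deployed spec (a sketch; assumes the default DaemonSet name):

kubectl -n velero get daemonset node-agent \
  -o jsonpath='{.spec.template.spec.securityContext.runAsUser}'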


e3b0c442 commented Jul 1, 2024

Yes, the node-agent runs as root.


e3b0c442 commented Jul 1, 2024

OK, making some headway here.

I'm fairly sure this message originates from a symlink in the pod directory of the kubevirt PVC I'm trying to back up, which points back into /var/lib/kubelet:

./volumeDevices/kubernetes.io~csi:
total 0
drwxr-x--- 2 root root  54 Jul  1 20:48 .
drwxr-x--- 3 root root  31 Jul  1 20:48 ..
lrwxrwxrwx 1 root root 142 Jul  1 20:48 pvc-eecd1d33-cefd-42bf-a99b-c45f6cae5759 -> /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/pvc-eecd1d33-cefd-42bf-a99b-c45f6cae5759/88e93be8-befb-4a7f-a670-1a87b081aedb

(where . is /var/lib/kubelet/pods/POD_UUID on the node the pod is currently running on)
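In other words, inside the container's mount namespace the absolute symlink target cannot resolve (an illustrative reconstruction, not captured output):

# The host's /var/lib/kubelet/pods is mounted in the container at /host_pods:
ls -l /host_pods/POD_UUID/volumeDevices/kubernetes.io~csi/
# pvc-eecd1d33-... -> /var/lib/kubelet/plugins/kubernetes.io/csi/volumeDevices/publish/...
ls /var/lib/kubelet
# ls: /var/lib/kubelet: No such file or directory   <- the lstat error kopia reports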

What I am still uncertain of is why this is a problem. I feel like there is something at the intersection of Talos, KubeVirt, and Velero/kopia that makes this failure unique to this combination. I suspect that if the node-agent mounted the host's /var/lib/kubelet at the same path inside the container, instead of at /host_pods, this would not fail; however, I'm not familiar enough with the Kubernetes machinery to know whether that would cause other unintended consequences.

I am going to create an issue in the kubevirt-velero-plugin repo so that they can take a look as well.


e3b0c442 commented Jul 1, 2024

Adding an extra volume mount that exposes the host's /var/lib/kubelet inside the pod at the same path as on the host allowed the backup to succeed, confirming my hypothesis (see the sketch below). This doesn't seem like it should be normal practice, though.
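For reference, the workaround amounts to adding something like this, presumably to the node-agent DaemonSet (a sketch; the volume name is arbitrary):

        volumeMounts:
        - name: host-kubelet
          mountPath: /var/lib/kubelet   # same path as on the host, so absolute CSI symlink targets resolve
      volumes:
      - name: host-kubelet
        hostPath:
          path: /var/lib/kubelet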
