Setup backups for K8s #37

Closed
7 tasks done
nickatnceas opened this issue Sep 12, 2023 · 17 comments

@nickatnceas
Contributor

nickatnceas commented Sep 12, 2023

We would like to have both K8s clusters, k8s-prod and k8s-dev, backed up to prevent data loss in the event of hardware failure, human error, malicious actors, etc.

  • Automatic nightly backups of k8s-dev (exclusions allowed)
  • Automatic nightly backups of k8s-prod (exclusions allowed)
  • Monitoring and alerts for failed or stale backups

Since we are using Velero in File System Backup mode, the underlying storage does not matter, and this earlier requirements list can be ignored:

Our K8s clusters store data in four different places on the DataONE Ceph cluster:

Ceph Storage         K8s Storage    Provisioning   K8s cluster
libvirt-pool         RBD VM image   manual         production and dev
k8s-pool-ec42-*      RBD PV         automatic      production
k8sdev-pool-ec42-*   RBD PV         automatic      dev
cephfs               CephFS PV      manual         production and dev

We are currently backing up the VM images in libvirt-pool. Before moving forward on some other issues, like #1, we should be able to restore a broken cluster from a backup.

Backups:

  • RBD VM images
  • RBD PVs (dev)
  • RBD PVs (prod)
  • CephFS PVs
nickatnceas self-assigned this Sep 12, 2023
@nickatnceas
Contributor Author

After spending some time Googling, it looks like Velero is a frequently recommended, free, open-source backup option for K8s, which offers (from their website):

  • Take backups of your cluster and restore in case of loss.
  • Migrate cluster resources to other clusters.
  • Replicate your production cluster to development and testing clusters.
  • You can back up or restore all objects in your cluster, or you can filter objects by type, namespace, and/or label.
  • Velero is ideal for the disaster recovery use case, as well as for snapshotting your application state, prior to performing system operations on your cluster, like upgrades.

I would like to try this option instead of backend RBD snapshot mirroring and rsync, since it appears to solve the initial problem of backing up and then deleting unused pods from the cluster, and it has other features which look useful, such as snapshotting the entire K8s state before upgrades and being able to restore back to that state.
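
As a rough sketch of the disaster-recovery workflow described above (the backup name below is made up; these are not commands we have run yet):

# Snapshot the full cluster state before a risky operation such as an upgrade
velero backup create pre-upgrade-test --wait

# Review the result
velero backup describe pre-upgrade-test
velero backup logs pre-upgrade-test

# Roll back to that state if something goes wrong
velero restore create --from-backup pre-upgrade-test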

@nickatnceas
Contributor Author

Some additional information:

Velero uses object storage to store backups and associated artifacts. It also optionally integrates with supported block storage systems to snapshot your persistent volumes. Before beginning the installation process, you should identify the object storage provider and optional block storage provider(s) you’ll be using from the list of compatible providers.

For the block storage provider, it looks like we can use the Container Storage Interface (CSI), with File System Backup as a fallback if that doesn't work.

For the object storage provider, we can set up a Ceph Object Gateway on the Anacapa Ceph cluster. This is another project, but it will be useful for learning how to set up a Ceph Object Gateway for production use on the DataONE Ceph cluster later on.
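
For reference, the install would look roughly like the following against any S3-compatible endpoint; the bucket name, plugin version, and URL below are placeholders (not the exact command we would run), and the last flags depend on whether CSI snapshots or File System Backup ends up being used:

velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.8.0 \
  --bucket k8s-dev-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=default,s3ForcePathStyle=true,s3Url=https://objects.example.org \
  --use-node-agent \
  --default-volumes-to-fs-backup   # for FSB; for CSI snapshots use --features=EnableCSI instead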

@nickatnceas
Contributor Author

Object storage is now available on the Anacapa Ceph cluster: https://github.nceas.ucsb.edu/NCEAS/Computing/issues/254

@nickatnceas
Contributor Author

I installed Velero on k8s-dev and, after some issues, have a partially successful backup:

outin@halt:~/velero/velero-v1.12.0-darwin-amd64$ velero backup describe backup-fsb-6
Name:         backup-fsb-6
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.22.0
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=22

Phase:  PartiallyFailed (run `velero backup logs backup-fsb-6` for more information)


Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    hwitw:    resource: /pods name: /cdstool-job--1-n7n6n error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-n7n6n, namespace=hwitw, phase=Failed: pod is not running
              resource: /pods name: /cdstool-job--1-q5hj2 error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-q5hj2, namespace=hwitw, phase=Failed: pod is not running
              resource: /pods name: /cdstool-job--1-sdv4x error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-sdv4x, namespace=hwitw, phase=Failed: pod is not running
              resource: /pods name: /cdstool-job--1-sqlrk error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-sqlrk, namespace=hwitw, phase=Failed: pod is not running
              resource: /pods name: /cdstool-job--1-st98r error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-st98r, namespace=hwitw, phase=Failed: pod is not running
              resource: /pods name: /cdstool-job--1-xqp78 error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-xqp78, namespace=hwitw, phase=Failed: pod is not running
              resource: /pods name: /cdstool-job--1-xxnrv error: /backup for volume hwitw-data is skipped: pod is not in the expected status, name=cdstool-job--1-xxnrv, namespace=hwitw, phase=Failed: pod is not running
    polder:   resource: /pods name: /setup-gleaner--1-5qvwv error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-5qvwv, namespace=polder, phase=Failed: pod is not running
              resource: /pods name: /setup-gleaner--1-8fwt9 error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-8fwt9, namespace=polder, phase=Failed: pod is not running
              resource: /pods name: /setup-gleaner--1-bdbzr error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-bdbzr, namespace=polder, phase=Failed: pod is not running
              resource: /pods name: /setup-gleaner--1-lp2lx error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-lp2lx, namespace=polder, phase=Failed: pod is not running
              resource: /pods name: /setup-gleaner--1-pd9dm error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-pd9dm, namespace=polder, phase=Failed: pod is not running
              resource: /pods name: /setup-gleaner--1-swmmc error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-swmmc, namespace=polder, phase=Failed: pod is not running
              resource: /pods name: /setup-gleaner--1-vffcn error: /backup for volume gleaner-context is skipped: pod is not in the expected status, name=setup-gleaner--1-vffcn, namespace=polder, phase=Failed: pod is not running

Errors:
  Velero:   name: /hwitw-7c7b669857-lz7vc error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
            name: /d1index-idxworker-5fd67f856b-hdbvx error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
            name: /metadig-controller-595d76dc6c-j6ht7 error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
            name: /dev-gleaner-74bf949f4f-555nm error: /pod volume backup failed: data path backup failed: Failed to run kopia backup: Error when processing data/repositories/polder/storage/pos: ConcatenateObjects is not supported
Error when processing data/repositories/polder/storage/pso: ConcatenateObjects is not supported
             name: /dev-gleaner-74bf949f4f-555nm error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
  Cluster:    <none>
  Namespaces: <none>

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2023-09-29 13:28:45 -0700 PDT
Completed:  2023-09-29 14:22:19 -0700 PDT

Expiration:  2023-10-29 13:28:45 -0700 PDT

Total items to be backed up:  2206
Items backed up:              2206

Velero-Native Snapshots: <none included>

kopia Backups (specify --details for more information):
  Completed:  66
  Failed:     5

I started docs at https://github.com/DataONEorg/k8s-cluster/blob/main/admin/backup.md

@nickatnceas
Contributor Author

I ran new backups and the previous errors are gone; however, some new pods are reporting errors:

Errors:
  Velero:    name: /hwitw-5d876bf94c-khh26 error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /d1index-idxworker-5fd67f856b-hdbvx error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /metadig-controller-595d76dc6c-j6ht7 error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
  Cluster:    <none>
  Namespaces: <none>

I ran two backups, and the same three pods failed in both. When viewing the pods with kubectl, they show as Running, for example:

outin@halt:~/velero/velero-v1.12.0-darwin-amd64$ kubectl get pod -n hwitw hwitw-5d876bf94c-khh26 -o wide
NAME                     READY   STATUS    RESTARTS   AGE     IP               NODE             NOMINATED NODE   READINESS GATES
hwitw-5d876bf94c-khh26   1/1     Running   0          3d17h   192.168.108.38   k8s-dev-node-1   <none>           <none>

While I don't yet understand why these backups are failing, it's possible these are transient errors. I'm going to try the backup again in a few days to see if they clear.

I also set up a nightly backup schedule:

velero schedule create k8s-dev-daily --schedule="0 1 * * *"
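
Since the requirements allow exclusions, namespaces (or resources) could be left out of the schedule later if needed; a hypothetical variant with made-up names and a 30-day retention:

velero schedule create k8s-dev-daily-filtered --schedule="0 1 * * *" \
  --exclude-namespaces scratch-namespace \
  --ttl 720h0m0s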

@nickatnceas
Contributor Author

nickatnceas commented Feb 17, 2024

I need to reconfigure Velero K8s backups as the Anacapa Ceph cluster has been shut down. My current plan is to install MinIO on host-ucsb-26 at Anacapa and use it to receive backups from Velero (Velero requires backing up to object storage). I started setting up host-ucsb-26 today, and hope to get K8s backups running next week.
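
Once MinIO is up, repointing Velero should mostly be a matter of updating the backup storage location to the new S3 endpoint; one way to do it (assuming MinIO ends up listening on host-ucsb-26 on the usual port 9000):

# Check the current backup storage location, then edit spec.config.s3Url
# to point at the new MinIO endpoint (e.g. http://host-ucsb-26:9000)
velero backup-location get
kubectl -n velero edit backupstoragelocation default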

@nickatnceas
Contributor Author

After spending the time to enable RBD and CephFS snapshots on the two K8s clusters, I set up snapshot-based backups with Velero. I tested these successfully using the csi-rbd-sc storage class; however, PVs without a storage class (i.e. any that have been created manually, such as gnis/cephfs-gnis-pvc) do not have snapshot support and were not backed up. Adding snapshot support to all manually created PVs seems time-consuming and not practical.

Errors:
  Velero:    name: /gnis-74c6f7c6df-8vzf2 message: /Error backing up item error: /error executing custom action (groupResource=persistentvolumeclaims, namespace=gnis, name=cephfs-gnis-pvc): rpc error: code = Unknown desc = Cannot snapshot PVC gnis/cephfs-gnis-pvc, PVC has no storage class.

I switched to Velero FSB backups, which can back up all types of PVs but don't have the consistency of snapshots. I have not figured out a way to get Velero to use both types of backups, but that may be an option in the future. With FSB instead of CSI snapshots, the backup of the GNIS namespace completed successfully. I updated the docs with both CSI and FSB install commands.
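
For reference, FSB can also be steered per pod with annotations rather than only cluster-wide flags; the pod and volume names below are placeholders:

# Opt a specific pod volume in to FSB (when FSB is not the cluster-wide default)
kubectl -n <namespace> annotate pod/<pod-name> backup.velero.io/backup-volumes=<volume-name>

# Opt a volume out (when --default-volumes-to-fs-backup is set)
kubectl -n <namespace> annotate pod/<pod-name> backup.velero.io/backup-volumes-excludes=<volume-name>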

I started a full backup of the k8s-dev cluster and was able to back up everything except for the following:

Errors:
  Velero:    name: /ekbrooke-elasticsearch-data-1 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2: expected one matching path: /host_pods/e23f6cc5-acfe-411f-8ea9-69fb2bc075c8/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
             name: /metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw: expected one matching path: /host_pods/0bb2f50c-04a0-4a1d-9559-49aa7f73a80e/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
             name: /metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx: expected one matching path: /host_pods/204f09ca-7e9f-402d-b0e2-12f5b62257b9/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
             name: /hwitw-67ccd577-rltrp message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /d1index-idxworker-5fd67f856b-hdbvx message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /metadig-controller-5ddff7d9fb-jxx77 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"

I started checking on hwitw-67ccd577-rltrp and found OOM-kill error messages on the host server. This led me to the docs, where I discovered that Velero's default memory config is sized for about 100 GiB of data, while cephfs-hwitw-0 maps to /volumes/hwitw-subvol-group/hwitw-subvol/7cb7d655-7ba9-49d2-8dd6-c83a47ff38a1, which contains about 14 TiB of data.

I increased the default memory limits 8x and restarted the backup for hwitw. It appears to be working: it has not run out of memory quickly like it did during the test runs. I'm going to let it run over the weekend out of curiosity, but will probably end up skipping this volume in Velero and backing it up with rsync instead (if it even needs to be backed up).
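
The bump was presumably a patch to the node-agent daemonset's resource limits along these lines (the exact k8s-prod values appear in a later comment; the memory numbers here are a guess at half those values, and CPU settings are omitted):

kubectl patch daemonset node-agent -n velero --patch \
  '{"spec":{"template":{"spec":{"containers":[{"name":"node-agent","resources":{"limits":{"memory":"8192Mi"},"requests":{"memory":"4096Mi"}}}]}}}}'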

@nickatnceas
Contributor Author

Increasing the pod memory settings for Velero and its agents appears to have fixed the error with large backups, and I was able to finish a backup of the hwitw namespace, which is about 14 TiB:

outin@halt:~/velero$ velero backup get hwitw-backup-5
NAME             STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
hwitw-backup-5   Completed   0        0          2024-02-26 09:22:05 -0800 PST   29d       default            <none>

outin@halt:~$ mc du -r velero/k8s-dev
...
257KiB	12 objects	k8s-dev/backups/hwitw-backup-5
14TiB	710551 objects	k8s-dev/kopia/hwitw
...

I started a second incremental backup run of the entire cluster.
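
For reference, the mc output above assumes the MinIO client has been pointed at the backup server with an alias along these lines (the endpoint and credentials are placeholders):

mc alias set velero http://host-ucsb-26:9000 <access-key> <secret-key>
mc du -r velero/k8s-dev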

@nickatnceas
Contributor Author

A full backup of k8s-dev completed except for three resources:

    Failed:
      brooke/metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2: metacatbrooke-temp-tripledb-volume
      brooke/metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw: metacatbrooke-temp-tripledb-volume
      brooke/metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx: metacatbrooke-temp-tripledb-volume

Longer error messages are:

Errors:
  Velero:    name: /metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2: expected one matching path: /host_pods/e23f6cc5-acfe-411f-8ea9-69fb2bc075c8/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
             name: /metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw: expected one matching path: /host_pods/0bb2f50c-04a0-4a1d-9559-49aa7f73a80e/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
             name: /metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx message: /Error backing up item error: /pod volume backup failed: error exposing host path for pod volume: error identifying unique volume path on host for volume metacatbrooke-temp-tripledb-volume in pod metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx: expected one matching path: /host_pods/204f09ca-7e9f-402d-b0e2-12f5b62257b9/volumes/*/metacatbrooke-temp-tripledb-volume, got 0
  Cluster:    <none>
  Namespaces: <none>

I'm looking into fixing these errors.

@nickatnceas
Contributor Author

It appears that the pod volumes have an issue with their storage:

$ kubectl describe pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 -n brooke

...
  metacatbrooke-temp-tripledb-volume:
    Type:          EphemeralVolume (an inline specification for a volume that gets created and deleted with the pod)
    StorageClass:  csi-cephfs-sc-ephemeral
    Volume:
    Labels:            <none>
    Annotations:       <none>
    Capacity:
    Access Modes:
    VolumeMode:    Filesystem
...

I'll check with @artntek before proceeding...

@artntek
Contributor

artntek commented Feb 27, 2024

@nickatnceas - those volumes are ephemeral, basically acting as a short-term local cache - by definition, they are non-critical and will be regenerated as needed. If there's a way of excluding them from the backups, that would be the best bet, I think.

@nickatnceas
Contributor Author

I excluded the three pod volumes from the backup, and the namespace backup for brooke now completes successfully.

velero backup create brooke-backup-1 --include-namespaces brooke

kubectl -n brooke annotate pod/metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 backup.velero.io/backup-volumes-excludes=metacatbrooke-temp-tripledb-volume

velero backup create brooke-backup-2 --include-namespaces brooke

kubectl -n brooke annotate pod/metacatbrooke-dataone-indexer-845fd4c5f5-rzzcw backup.velero.io/backup-volumes-excludes=metacatbrooke-temp-tripledb-volume
kubectl -n brooke annotate pod/metacatbrooke-dataone-indexer-845fd4c5f5-vkbcx backup.velero.io/backup-volumes-excludes=metacatbrooke-temp-tripledb-volume

velero backup create brooke-backup-3 --include-namespaces brooke
NAME                STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
brooke-backup-1     PartiallyFailed   3        0          2024-02-27 16:11:08 -0800 PST   29d       default            <none>
brooke-backup-2     PartiallyFailed   2        0          2024-02-27 16:15:33 -0800 PST   29d       default            <none>
brooke-backup-3     Completed         0        0          2024-02-27 16:21:39 -0800 PST   29d       default            <none>

I added the exclusion instructions to the backup docs, and started another full namespace backup.
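
A quick way to double-check that the annotation is in place on each pod (not from the original thread, just a sanity check):

kubectl -n brooke describe pod metacatbrooke-dataone-indexer-845fd4c5f5-cnwc2 | grep backup-volumes-excludes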

@nickatnceas
Contributor Author

A full backup run reported it completed!

There are a couple of warnings in the backup log, but they appear to be for broken pods, which are probably safe to ignore:

$ velero backup create full-backup-4
Backup request "full-backup-4" submitted successfully.
Run `velero backup describe full-backup-4` or `velero backup logs full-backup-4` for more details.

$ velero get backups
NAME                STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SE
full-backup-1       PartiallyFailed   7        0          2024-02-23 11:51:44 -0800 PST   25d       default            <none>
full-backup-2       PartiallyFailed   3        2          2024-02-26 09:41:42 -0800 PST   28d       default            <none>
full-backup-3       PartiallyFailed   3        2          2024-02-27 10:54:16 -0800 PST   29d       default            <none>
full-backup-4       Completed         0        2          2024-02-27 16:24:51 -0800 PST   29d       default            <none>

$ velero backup describe full-backup-4
Name:         full-backup-4
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.22.0
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=22

Phase:  Completed


Warnings:
  Velero:     <none>
  Cluster:    <none>
  Namespaces:
    pdgrun:   resource: /pods name: /parsl-worker-1708968600101 message: /Skip pod volume pdgrun-dev-0 error: /pod is not in the expected status, name=parsl-worker-1708968600101, namespace=pdgrun, phase=Pending: pod is not running
              resource: /pods name: /parsl-worker-1708968600248 message: /Skip pod volume pdgrun-dev-0 error: /pod is not in the expected status, name=parsl-worker-1708968600248, namespace=pdgrun, phase=Pending: pod is not running

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-02-27 16:24:51 -0800 PST
Completed:  2024-02-27 16:39:35 -0800 PST

Expiration:  2024-03-28 17:24:51 -0700 PDT

Total items to be backed up:  2654
Items backed up:              2654

Backup Volumes:
  Velero-Native Snapshots: <none included>

  CSI Snapshots: <none included>

  Pod Volume Backups - kopia (specify --details for more information):
    Completed:  85

HooksAttempted:  0
HooksFailed:     0

$ kubectl get pods -n pdgrun
NAME                         READY   STATUS             RESTARTS   AGE
parsl-worker-1708968600101   0/1     InvalidImageName   0          31h
parsl-worker-1708968600248   0/1     InvalidImageName   0          31h

@nickatnceas
Contributor Author

I made the following changes:

  • Set up a nightly backup schedule for k8s-dev
  • Started backing up k8s-prod
    - Created new bucket and user in MinIO at Anacapa
    - Installed Velero pods on k8s-prod
    - Ran a test backup of the purser namespace
    - Started a full backup

I updated the first comment to include new ticket requirements.

@nickatnceas
Contributor Author

Received the following errors after running the full backup of k8s-prod:

Errors:
  Velero:    name: /gnis-6c7f9d9bb7-8mg4j message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /metadig-controller-7db96b7585-zb2dk message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /metadig-solr-0 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /prod-gleaner-76df9dfc54-rp9x8 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"
             name: /prod-gleaner-76df9dfc54-rp9x8 message: /Error backing up item error: /pod volume backup failed: get a podvolumebackup with status "InProgress" during the server starting, mark it as "Failed"

    Failed:
      gnis/gnis-6c7f9d9bb7-8mg4j: gnis-volume
      metadig/metadig-controller-7db96b7585-zb2dk: metadig-pv
      metadig/metadig-solr-0: data
      polder/prod-gleaner-76df9dfc54-rp9x8: s3system-volume, triplestore-volume

@nickatnceas
Contributor Author

I was able to fix the failing backups by increasing the memory limits on Velero and its pods to double what I used on k8s-dev:

kubectl patch daemonset node-agent -n velero --patch '{"spec":{"template":{"spec":{"containers":[{"name": "node-agent", "resources": {"limits":{"cpu": "4", "memory": "16384Mi"}, "requests": {"cpu": "2", "memory": "8192Mi"}}}]}}}}'
kubectl patch deployment velero -n velero --patch '{"spec":{"template":{"spec":{"containers":[{"name": "velero", "resources": {"limits":{"cpu": "4", "memory": "8192Mi"}, "requests": {"cpu": "2", "memory": "2048Mi"}}}]}}}}'

I also set a nightly backup schedule with 90 days of retention, and the first scheduled backup ran successfully last night:

outin@halt:~$ velero schedule get
NAME          STATUS    CREATED                         SCHEDULE    BACKUP TTL   LAST BACKUP   SELECTOR   PAUSED
full-backup   Enabled   2024-03-20 17:12:47 -0700 PDT   0 1 * * *   2160h0m0s    22h ago       <none>     false

outin@halt:~$ velero backup get
NAME                         STATUS            ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
full-backup-20240321010020   Completed         0        33         2024-03-20 18:00:20 -0700 PDT   89d       default            <none>
...
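
For reference, a schedule matching the output above would be created with something like the following (the exact command is an assumption; 2160h is 90 days):

velero schedule create full-backup --schedule="0 1 * * *" --ttl 2160h0m0s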

@nickatnceas
Contributor Author

I set up backup monitoring on optimal-squirrel.nceas.ucsb.edu for the nightly backups of k8s-prod and k8s-dev, to alert when Velero fails to complete a backup or when the backups silently stop running.

The backup scripts are in https://github.nceas.ucsb.edu/outin/check_velero_backups
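
The gist of such a check (a hypothetical sketch only; the real scripts live in the repo above) is to parse velero backup get and alert if the newest backup is not Completed or is too old:

#!/usr/bin/env bash
# Hypothetical sketch of a backup check: alert if the newest Velero backup
# is not Completed or is older than MAX_AGE_HOURS.
set -euo pipefail

MAX_AGE_HOURS=26

# velero backup get columns: NAME STATUS ERRORS WARNINGS CREATED ...
latest=$(velero backup get | tail -n +2 | sort -k5,6 -r | head -n 1)
name=$(echo "$latest" | awk '{print $1}')
status=$(echo "$latest" | awk '{print $2}')
created=$(echo "$latest" | awk '{print $5 " " $6}')

if [ "$status" != "Completed" ]; then
  echo "CRITICAL: latest backup $name has status $status"
  exit 2
fi

age_hours=$(( ( $(date +%s) - $(date -d "$created" +%s) ) / 3600 ))
if [ "$age_hours" -gt "$MAX_AGE_HOURS" ]; then
  echo "CRITICAL: latest backup $name is ${age_hours}h old"
  exit 2
fi

echo "OK: latest backup $name completed ${age_hours}h ago"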

Check_MK alerts: (screenshot omitted)
