Setup backups for K8s #37
After spending some time Googling, it looks like Velero is a frequently recommended, free, open-source backup option for K8s, which offers (from their website):
I would like to try this option instead of backend RBD snapshot mirroring and rsync, since it appears to solve the initial problem of backing up and deleting unused pods from the cluster, and has other useful features, such as snapshotting the entire K8s state before upgrades and being able to restore back to that state.
Some additional information:
For the block storage provider, it looks like we can use the existing Ceph CSI driver. For the object storage provider, we can set up a Ceph Object Gateway on the Anacapa Ceph cluster. This is another project, but it will be useful for learning how to set up a Ceph Object Gateway for production use on the DataONE Ceph cluster later on.
Object storage is now available on the Anacapa Ceph cluster: https://github.nceas.ucsb.edu/NCEAS/Computing/issues/254
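For reference, provisioning S3 credentials and a bucket for Velero on a Ceph Object Gateway can be sketched as follows; the user ID, alias, endpoint, and bucket name here are assumptions, not the exact values used:

# Create an S3 user for Velero on the gateway (run on a Ceph admin node);
# the output includes the generated access_key and secret_key.
$ radosgw-admin user create --uid=velero --display-name="Velero backups"
# Register the endpoint with the MinIO client and create a backup bucket.
$ mc alias set anacapa http://rgw.example.org:8080 ACCESS_KEY SECRET_KEY
$ mc mb anacapa/k8s-dev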
I installed Velero on k8s-dev, and after some issues have a partially successful backup:
I started docs at https://github.com/DataONEorg/k8s-cluster/blob/main/admin/backup.md
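For context, a minimal sketch of a Velero install against an S3-compatible endpoint; the plugin version, bucket, endpoint URL, and credentials file below are assumptions, not the exact command used:

# Install Velero with the AWS S3 plugin, pointing at an S3-compatible
# object store, with the node agent enabled for pod volume backups.
$ velero install \
    --provider aws \
    --plugins velero/velero-plugin-for-aws:v1.8.0 \
    --bucket k8s-dev \
    --secret-file ./credentials-velero \
    --backup-location-config region=default,s3ForcePathStyle="true",s3Url=http://rgw.example.org:8080 \
    --use-node-agent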
I ran new backups and the previous errors are gone; however, some new pods are reporting errors:
I ran two backups, and the same three pods failed in both. When viewing the pods with kubectl they show as
While I don't yet understand why these backups are failing, it's possible these are transient errors, so I'm going to try the backup again in a few days to see if they clear. I also set up a nightly backup schedule:
I need to reconfigure Velero K8s backups, as the Anacapa Ceph cluster has been shut down. My current plan is to install Minio on host-ucsb-26 at Anacapa to receive backups from Velero (Velero requires backing up to object storage). I started setting up host-ucsb-26 today, and hope to get K8s backups running next week.
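A sketch of what pointing Velero at the new endpoint could look like; the location name, bucket, and MinIO URL are assumptions:

# Register the MinIO instance on host-ucsb-26 as the default backup
# storage location.
$ velero backup-location create minio-default \
    --provider aws \
    --bucket k8s-dev \
    --config region=minio,s3ForcePathStyle="true",s3Url=http://host-ucsb-26:9000 \
    --default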
After spending the time to enable RBD and CephFS snapshots on the two K8s clusters, I set up snapshot-based backups with Velero. I tested these successfully with Velero using the csi-rbd-sc storage class; however, PVs without a storage class (i.e., any that have been created manually, such as
I switched to Velero FSB backups, which can back up all types of PVs but don't have the consistency of snapshots. I have not figured out a way to get Velero to use both types of backups, but that may be an option in the future. With FSB instead of CSI snapshots, the backup of the GNIS namespace completed successfully. I updated the docs with both CSI and FSB install commands. I started a full backup of the k8s-dev cluster and was able to back up everything except for the following:
I started checking on hwitw-67ccd577-rltrp and found OOM-killed error messages on the host server. This led me to the docs, where I discovered that Velero's default memory config is intended for 100 GiB of data. I increased the default memory limits 8x and restarted the backup for hwitw. It appears to be working; it has not run out of memory quickly like it did during the test runs. I'm going to let it run over the weekend out of curiosity, but will probably end up skipping this volume in Velero and backing it up with rsync instead (if it even needs to be backed up).
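For reference, raising the limits can be done by patching the Velero deployment and node-agent daemonset; the values below are illustrative assumptions, not the exact settings applied:

# Raise memory requests/limits on the Velero server deployment.
$ kubectl -n velero patch deployment velero --patch \
    '{"spec":{"template":{"spec":{"containers":[{"name":"velero","resources":{"requests":{"memory":"1Gi"},"limits":{"memory":"4Gi"}}}]}}}}'
# Raise memory requests/limits on the node-agent daemonset, which runs
# the kopia uploads for FSB.
$ kubectl -n velero patch daemonset node-agent --patch \
    '{"spec":{"template":{"spec":{"containers":[{"name":"node-agent","resources":{"requests":{"memory":"2Gi"},"limits":{"memory":"8Gi"}}}]}}}}'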
Increasing the pod memory settings for Velero and its agents appears to have fixed the error with large backups, and I was able to finish a backup of the hwitw volume:

outin@halt:~/velero$ velero backup get hwitw-backup-5
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
hwitw-backup-5 Completed 0 0 2024-02-26 09:22:05 -0800 PST 29d default <none>
outin@halt:~$ mc du -r velero/k8s-dev
...
257KiB 12 objects k8s-dev/backups/hwitw-backup-5
14TiB 710551 objects k8s-dev/kopia/hwitw
...

I started a second incremental backup run of the entire cluster.
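With FSB, the full-cluster runs are plain backup invocations; a hedged sketch of what such a run looks like:

# Back up all namespaces, using file-system backup (kopia) for pod
# volumes by default.
$ velero backup create full-backup-2 --default-volumes-to-fs-backup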
A full backup of k8s-dev completed except for three resources:
Longer error messages are:
I'm looking into fixing these errors.
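The failing items can be inspected from Velero itself, e.g.:

# List per-item and pod volume backup status, then search the server
# log for errors.
$ velero backup describe full-backup-3 --details
$ velero backup logs full-backup-3 | grep -i error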
It appears that the pod volumes have an issue with their storage:
I'll check with @artntek before proceeding...
@nickatnceas - those volumes are ephemeral, basically acting as a short-term local cache - by definition, they are non-critical and will be regenerated as needed. If there's a way of excluding them from the backups, that would be the best bet, I think.
I excluded the three pod volumes from backup and the namespace backup for
I added the exclusion instructions to the backup docs and started another full namespace backup.
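For reference, the exclusion is a pod annotation; a sketch with placeholder names:

# Exclude a named volume from FSB backups of this pod.
$ kubectl -n NAMESPACE annotate pod POD_NAME \
    backup.velero.io/backup-volumes-excluded=VOLUME_NAME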
A full backup run reported that it completed! There are a couple of warnings in the backup log, but they appear to be for broken pods, which are probably safe to ignore:

$ velero backup create full-backup-4
Backup request "full-backup-4" submitted successfully.
Run `velero backup describe full-backup-4` or `velero backup logs full-backup-4` for more details.
$ velero get backups
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
full-backup-1 PartiallyFailed 7 0 2024-02-23 11:51:44 -0800 PST 25d default <none>
full-backup-2 PartiallyFailed 3 2 2024-02-26 09:41:42 -0800 PST 28d default <none>
full-backup-3 PartiallyFailed 3 2 2024-02-27 10:54:16 -0800 PST 29d default <none>
full-backup-4 Completed 0 2 2024-02-27 16:24:51 -0800 PST 29d default <none>
$ velero backup describe full-backup-4
Name: full-backup-4
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/resource-timeout=10m0s
velero.io/source-cluster-k8s-gitversion=v1.22.0
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=22
Phase: Completed
Warnings:
Velero: <none>
Cluster: <none>
Namespaces:
pdgrun: resource: /pods name: /parsl-worker-1708968600101 message: /Skip pod volume pdgrun-dev-0 error: /pod is not in the expected status, name=parsl-worker-1708968600101, namespace=pdgrun, phase=Pending: pod is not running
resource: /pods name: /parsl-worker-1708968600248 message: /Skip pod volume pdgrun-dev-0 error: /pod is not in the expected status, name=parsl-worker-1708968600248, namespace=pdgrun, phase=Pending: pod is not running
Namespaces:
Included: *
Excluded: <none>
Resources:
Included: *
Excluded: <none>
Cluster-scoped: auto
Label selector: <none>
Or label selector: <none>
Storage Location: default
Velero-Native Snapshot PVs: auto
Snapshot Move Data: false
Data Mover: velero
TTL: 720h0m0s
CSISnapshotTimeout: 10m0s
ItemOperationTimeout: 4h0m0s
Hooks: <none>
Backup Format Version: 1.1.0
Started: 2024-02-27 16:24:51 -0800 PST
Completed: 2024-02-27 16:39:35 -0800 PST
Expiration: 2024-03-28 17:24:51 -0700 PDT
Total items to be backed up: 2654
Items backed up: 2654
Backup Volumes:
Velero-Native Snapshots: <none included>
CSI Snapshots: <none included>
Pod Volume Backups - kopia (specify --details for more information):
Completed: 85
HooksAttempted: 0
HooksFailed: 0
$ kubectl get pods -n pdgrun
NAME READY STATUS RESTARTS AGE
parsl-worker-1708968600101 0/1 InvalidImageName 0 31h
parsl-worker-1708968600248 0/1 InvalidImageName 0 31h
I made the following changes:
I updated the first comment to include new ticket requirements.
Received the following errors after running the full backup of k8s-prod:
I was able to fix the failing backup by increasing the memory limits on Velero and its pods to double what I used on k8s-dev:
I also set a nightly backup schedule with 90 days of retention, and the first scheduled backup ran successfully last night:

outin@halt:~$ velero schedule get
NAME STATUS CREATED SCHEDULE BACKUP TTL LAST BACKUP SELECTOR PAUSED
full-backup Enabled 2024-03-20 17:12:47 -0700 PDT 0 1 * * * 2160h0m0s 22h ago <none> false
outin@halt:~$ velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
full-backup-20240321010020 Completed 0 33 2024-03-20 18:00:20 -0700 PDT 89d default <none>
...
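That schedule corresponds to a command along these lines (matching the name, cron expression, and TTL shown above):

# Nightly at 01:00 with 90-day (2160h) retention.
$ velero schedule create full-backup --schedule="0 1 * * *" --ttl 2160h0m0s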
I set up backup monitoring. The backup scripts are in https://github.nceas.ucsb.edu/outin/check_velero_backups
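The actual checks live in that repo; purely as an illustration, a minimal version might look like the following (assumes the velero CLI and jq are available on the monitoring host):

#!/bin/bash
# Hypothetical minimal check: warn if the most recent Velero backup did
# not complete successfully.
latest=$(velero backup get --output json |
  jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .status.phase')
if [ "$latest" != "Completed" ]; then
  echo "WARNING: latest Velero backup phase is ${latest}" >&2
  exit 1
fi
echo "OK: latest Velero backup completed"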
We would like to have both K8s clusters, k8s-prod and k8s-dev, backed up to prevent data loss in the event of hardware failure, human error, malicious actors, etc.
Since we are using Velero in File System Backup mode, the underlying storage does not matter, and this earlier requirements list can be ignored:
Our K8s clusters store data in four different places on the DataONE Ceph cluster:
libvirt-pool
k8s-pool-ec42-*
k8sdev-pool-ec42-*
cephfs
We are currently backing up the VM images in libvirt-pool. Before moving forward on some other issues, like #1, we should be able to restore a broken cluster from a backup.

Backups: