Velero Metrics Mostly Zero and Prometheus Metrics Incorrectly Functioning #7951

vladislav-curvetech · 2024-06-28T12:26:10Z

I am experiencing issues with Velero where most of the metrics are always zero, and basic Prometheus metrics are not functioning correctly. This issue significantly affects our ability to monitor the backup status and reliability.

A few problematic metrics:

velero_backup_failure_total
velero_backup_items_errors
velero_backup_partial_failure_total
velero_backup_warning_total
velero_backup_attempt_total

These metrics are crucial for us to monitor the health and status of our backup operations, but they consistently report zero values, which is not accurate.

Expected Behavior:
The above metrics should provide accurate and non-zero values reflecting the actual state of Velero backups.

Environment:
Velero version: 1.13.0
Kubernetes version: 1.28
Cloud provider: AWS EKS

Additional Context:
Any insights or solutions to this issue would be greatly appreciated as these metrics are critical for our backup monitoring and alerting.

Thank you for your assistance!


komljen · 2024-08-02T07:46:40Z

Did you find the issue?

kaovilai · 2024-08-02T21:31:29Z

Check for any pod restarts. IIUC, these metrics are incremented as new backups/restores are processed. Velero does not list the backups that existed before its startup to compute attempt/failure totals.

vladislav-curvetech · 2024-08-12T21:25:55Z

Thank you, Kaovilai, for your participation.
Yes, the pod sometimes restarts for an unknown reason. Before the restart, I see just one warning:

level=warning msg="active indexes ....blabla.....12b-c1] deletion watermark 2024-08-10 20:30:46 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error.

Did I understand correctly that if the pod restarts, the metrics important to me are reset?

kaovilai · 2024-08-13T01:25:40Z

yes
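Since the counters start again from zero after a restart, any alerting on them needs to be reset-aware. The sketch below shows, in simplified form, the kind of reset handling that Prometheus functions such as `increase()` apply to counters: a drop in the value is treated as a restart from zero. The function name and the sample values are hypothetical, and the real `increase()` also extrapolates over the query window, which is not modeled here.

```python
def increase_with_resets(samples):
    """Return the total increase across counter samples.

    Any drop in value is treated as a counter reset to zero
    (for example, after a pod restart).
    """
    total = 0.0
    prev = None
    for value in samples:
        if prev is not None:
            if value >= prev:
                # Normal case: the counter only grew.
                total += value - prev
            else:
                # The counter dropped, so assume it restarted from 0
                # and everything it shows now is new.
                total += value
        prev = value
    return total


# Hypothetical scrapes of velero_backup_failure_total;
# the pod restarts between the third and fourth sample.
samples = [0, 1, 2, 0, 1]
print(increase_with_resets(samples))  # 3.0
```

Reading the raw counter value after the restart would show 1, while the reset-aware increase over the same samples is 3.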

kaovilai · 2024-08-13T01:27:01Z

One reason is that Velero syncs backups from object storage (which could have been created by a different cluster) into the cluster.

Many of those will have a status of Completed.

If the metrics counted completed backups in the cluster, they would overcount what this cluster has actually completed.
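The overcounting argument can be illustrated with a small sketch. The `Backup` class, the `processed_here` flag, and the backup names are all hypothetical and only stand in for "this Velero instance ran the backup" versus "the backup was synced in from object storage"; they are not Velero fields or APIs.

```python
from dataclasses import dataclass


@dataclass
class Backup:
    name: str
    phase: str            # e.g. "Completed" or "Failed"
    processed_here: bool  # hypothetical flag, not a real Velero field


backups = [
    Backup("daily-1", "Completed", processed_here=True),
    Backup("daily-2", "Failed", processed_here=True),
    # Synced from object storage, created by another cluster.
    Backup("other-cluster-1", "Completed", processed_here=False),
    Backup("other-cluster-2", "Completed", processed_here=False),
]


def count_all_completed(items):
    """Count every Completed backup visible in the cluster."""
    return sum(1 for b in items if b.phase == "Completed")


def count_completed_here(items):
    """Count only Completed backups this instance actually ran."""
    return sum(1 for b in items if b.phase == "Completed" and b.processed_here)


print(count_all_completed(backups))   # 3
print(count_completed_here(backups))  # 1
```

Counting everything visible in the cluster reports three completed backups, although this cluster completed only one.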

blackpiglet added the Metrics (Related to prometheus metrics) label Jun 29, 2024

blackpiglet assigned reasonerjt Jul 1, 2024

reasonerjt added the Needs investigation label Jul 1, 2024
