Velero Metrics Mostly Zero and Prometheus Metrics Incorrectly Functioning #7951

Open
vladislav-curvetech opened this issue Jun 28, 2024 · 5 comments
Assignees
Labels
Metrics Related to prometheus metrics Needs investigation

Comments

@vladislav-curvetech

I am experiencing issues with Velero where most of the metrics are always zero, and basic Prometheus metrics are not functioning correctly. This issue significantly affects our ability to monitor the backup status and reliability.

A few problematic metrics:

velero_backup_failure_total
velero_backup_items_errors
velero_backup_partial_failure_total
velero_backup_warning_total
velero_backup_attempt_total

These metrics are crucial for us to monitor the health and status of our backup operations, but they consistently report zero values, which is not accurate.

Expected Behavior:
The above metrics should provide accurate and non-zero values reflecting the actual state of Velero backups.

Environment:
Velero version: 1.13.0
Kubernetes version: 1.28
Cloud provider: AWS EKS

Additional Context:
Any insights or solutions to this issue would be greatly appreciated as these metrics are critical for our backup monitoring and alerting.

Thank you for your assistance!

@blackpiglet blackpiglet added the Metrics Related to prometheus metrics label Jun 29, 2024
@komljen

komljen commented Aug 2, 2024

Did you find the issue?

@kaovilai
Contributor

kaovilai commented Aug 2, 2024

Check for any pod restarts. IIUC, these metrics are incremented as new backups/restores are processed. Velero does not list the backups that existed before it started in order to count attempt/failure totals.
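
To illustrate the point about restarts, here is a minimal Python sketch. It is not Velero's actual code (Velero is written in Go); `BackupMetrics` and `record_backup` are hypothetical names. It only models counters that live in process memory and are incremented as backups are processed.

```python
from collections import Counter


class BackupMetrics:
    """Hypothetical stand-in for a process-local metrics registry."""

    def __init__(self):
        # Counters live only in memory, so a new process starts from zero.
        self.counters = Counter()

    def record_backup(self, phase):
        # Called once per backup processed by *this* process.
        self.counters["attempt_total"] += 1
        if phase == "Failed":
            self.counters["failure_total"] += 1


# First pod lifetime: two backups processed, one of them failed.
pod = BackupMetrics()
pod.record_backup("Completed")
pod.record_backup("Failed")
before_restart = dict(pod.counters)

# Pod restart: a fresh process, and nothing re-lists the existing backups.
pod = BackupMetrics()
after_restart = dict(pod.counters)
```

In this model, `before_restart` holds the counts accumulated so far, while `after_restart` is empty even though the earlier backups still exist.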

@vladislav-curvetech
Author

Thank you, Kaovilai, for your participation.
Yes, the pod sometimes restarts for an unknown reason. Before the restart, I see just one warning:

level=warning msg="active indexes ....blabla.....12b-c1] deletion watermark 2024-08-10 20:30:46 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error.

Did I understand correctly that if the pod restarts, the metrics important to me are reset?

@kaovilai
Contributor

yes
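
Since the counters restart from zero, reading the raw value can hide earlier failures. Prometheus's `rate()` and `increase()` functions adjust for counter resets by treating a decrease as a reset. The Python sketch below shows that idea in simplified form, with no extrapolation; the function name and the sample data are made up for illustration.

```python
def increase_with_resets(samples):
    """Total growth of a counter across samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went down: assume it restarted from zero,
            # so the current value is all new growth.
            total += curr
    return total


# Hypothetical scraped values of a failure counter; the pod restarted
# between the third and fourth scrape.
scrapes = [0, 1, 3, 0, 2]
failures = increase_with_resets(scrapes)
```

Any increments that happen after the last scrape and before the restart are still lost, because they were never observed.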

@kaovilai
Contributor

One reason is that Velero syncs backups from object storage (which could have been created by a different cluster) into the cluster.

Many of those will have a status of Completed.

If the metrics counted completed backups present in the cluster, they would overcount what this cluster has actually completed.
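
A small Python sketch of the overcounting concern, using made-up data. The `origin` field is a hypothetical marker added only for illustration; real Backup objects may not carry such a field.

```python
# Hypothetical backups visible in this cluster after syncing from a shared
# object store. "origin" marks which cluster actually ran each backup.
synced_backups = [
    {"name": "daily-1", "phase": "Completed", "origin": "cluster-a"},
    {"name": "daily-2", "phase": "Completed", "origin": "cluster-a"},
    {"name": "daily-3", "phase": "Completed", "origin": "cluster-b"},
]

this_cluster = "cluster-b"

# Counting every Completed backup present in the cluster after the sync.
counted_from_listing = sum(b["phase"] == "Completed" for b in synced_backups)

# Counting only the backups this cluster processed itself.
counted_from_processing = sum(
    b["phase"] == "Completed" and b["origin"] == this_cluster
    for b in synced_backups
)
```

Here the listing-based count is larger than the number of backups this cluster actually ran, which is the discrepancy described above.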
