
Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time #7445

Merged · 1 commit · Feb 21, 2024

Conversation

allenxu404 (Contributor) commented on Feb 20, 2024:

Thank you for contributing to Velero!

Please add a summary of your change

Does your change fix a particular issue?

Fixes #7378

Please indicate you've done the following:

  • Accepted the DCO. Commits without the DCO will delay acceptance.
  • Created a changelog file or added /kind changelog-not-required as a comment on this pull request.
  • Updated the corresponding documentation in site/content/docs/main.

allenxu404 changed the title to "Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time" on Feb 20, 2024

codecov bot commented Feb 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison: base (d45b313) 61.64% vs. head (84fb88c) 61.64%.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7445   +/-   ##
=======================================
  Coverage   61.64%   61.64%           
=======================================
  Files         262      262           
  Lines       28477    28477           
=======================================
  Hits        17556    17556           
  Misses       9689     9689           
  Partials     1232     1232           

☔ View full report in Codecov by Sentry.

Commit: Adjust the logic for the backup_last_status metric to stop incorrectly incrementing over time

Signed-off-by: allenxu404 <qix2@vmware.com>
@blackpiglet blackpiglet merged commit 2a1ae0e into vmware-tanzu:main Feb 21, 2024
66 checks passed
@@ -468,7 +468,7 @@ func (m *ServerMetrics) InitSchedule(scheduleName string) {
 		c.WithLabelValues(scheduleName).Add(0)
 	}
 	if c, ok := m.metrics[backupLastStatus].(*prometheus.GaugeVec); ok {
-		c.WithLabelValues(scheduleName).Add(1)
+		c.WithLabelValues(scheduleName).Set(float64(1))
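For context, a minimal standalone sketch of the difference (the metric name comes from the diff above; the label key and registration code are illustrative assumptions, not Velero's actual setup): Add on a gauge accumulates across repeated reconciles, while Set is idempotent.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	// Gauge mirroring backup_last_status: 1 = success, 0 = failure.
	lastStatus := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "backup_last_status", Help: "Last status of the backup."},
		[]string{"schedule"}, // assumed label key for illustration
	)

	// Simulate three reconcile loops that each re-initialize the metric.
	for i := 0; i < 3; i++ {
		lastStatus.WithLabelValues("daily").Add(1) // old code: value climbs 1, 2, 3...
	}
	fmt.Println(testutil.ToFloat64(lastStatus.WithLabelValues("daily"))) // 3

	lastStatus.Reset()
	for i := 0; i < 3; i++ {
		lastStatus.WithLabelValues("daily").Set(float64(1)) // fixed code: stays at 1
	}
	fmt.Println(testutil.ToFloat64(lastStatus.WithLabelValues("daily"))) // 1
}

Running this prints 3 and then 1, matching the unbounded growth reported in the linked issue and the behavior after the fix.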
Snaipe commented:

Found this PR when trying to figure out why the metric was increasing; if I understand correctly, this ends up running periodically (every time the schedule is reconciled), right?

If so, wouldn't this incorrectly clear a failing status? Say the schedule created a backup that failed: the metric would become 0, then get reset back to 1 a couple of minutes later, which seems problematic since it can hide failures.

If anything, I think lines 470-471 should be omitted entirely. That would avoid emitting a data point until we actually have one, which should still fix what #6838 was aiming for. If an accurate status must be set, it should be done as part of scheduleReconciler.Reconcile, using the status of the last backup created by the schedule.
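As a minimal sketch of that suggestion (illustrative, using client_golang's testutil helpers rather than code from this PR): a GaugeVec child series only exists once WithLabelValues has been called for its label set, so skipping the initialization means the scrape simply carries no backup_last_status sample for the schedule until a real result is recorded.

package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/testutil"
)

func main() {
	gv := prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "backup_last_status", Help: "Last status of the backup."},
		[]string{"schedule"}, // assumed label key for illustration
	)

	// No init call: the scrape exposes no backup_last_status series at all.
	fmt.Println(testutil.CollectAndCount(gv)) // 0

	// First real result arrives: the series appears with a meaningful value.
	gv.WithLabelValues("daily").Set(1)
	fmt.Println(testutil.CollectAndCount(gv)) // 1
}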

allenxu404 (Contributor, Author) replied:

@Snaipe yes, you're right.

This PR only aimed to address the specific bug introduced by #6838. I will create a separate PR to resolve the issue of incorrectly clearing the failure status and keep you posted. Thanks for highlighting this larger problem!

The purpose of setting the default value of backup_last_status to 1 was to prevent mistaken failure reports that could occur if the code exits without updating that metric, so I'd prefer to leave that part unchanged. Upon further analysis, the true underlying issue seems to be that the metrics are re-initialized every time the schedule is reconciled; instead, they should be initialized only when a new backup is due to be created. So I think the InitSchedule call would be better placed below, where backups are created, to avoid the unnecessary reset. What are your thoughts?

if c.ifDue(schedule, cronSchedule) && !c.checkIfBackupInNewOrProgress(schedule) {
	if err := c.submitBackup(ctx, schedule); err != nil {
		return ctrl.Result{}, errors.Wrapf(err, "error submit backup for schedule %s", req.String())
	}
	c.metrics.InitSchedule(schedule.Name)
}

Snaipe replied:

I think it'd help, but I'm not sure it would fix the underlying problem. As far as I can tell, if your schedule is failing and a new backup is created, the schedule would then be reported as "fixed" metrics-wise for the entire duration of that backup.

It's less bad than the first situation, but it would definitely confuse alerting.

I think not advertising a value for the metric until there is a data point seems slightly better, in the sense that users can choose to ignore the missing series or substitute a default of their choice (0 or 1) in grafana/alertmanager. A better solution would be to set the initial value of the metric directly from the result of the schedule's last backup.
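A hypothetical sketch of that last idea (the helper name and phase strings are illustrative assumptions, not Velero's actual API or reconciler wiring):

package metrics

import "github.com/prometheus/client_golang/prometheus"

// seedLastStatus seeds backup_last_status from the schedule's most recent
// backup instead of a hard-coded 1. An empty lastPhase means no backup
// exists yet, so no series is emitted at all.
func seedLastStatus(gv *prometheus.GaugeVec, scheduleName, lastPhase string) {
	switch lastPhase {
	case "Completed": // last backup succeeded
		gv.WithLabelValues(scheduleName).Set(1)
	case "": // no backup yet: leave "no data" for dashboards to handle
		gv.DeleteLabelValues(scheduleName)
	default: // failed, partially failed, etc.
		gv.WithLabelValues(scheduleName).Set(0)
	}
}

With something like this, a missing series unambiguously means "no backup has run yet", and 0/1 always reflect the last real outcome.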

allenxu404 (Contributor, Author) replied:

> A better solution would be to set the initial value of the metric directly from the result of the schedule's last backup.

I agree, will raise another issue to track it.

allenxu404 (Contributor, Author) added:

It seems we already have a similar issue tracking this: #6936.

Development

Successfully merging this pull request may close these issues.

v1.13.0 changes velero_backup_last_status metric
4 participants