
Alert store not calling callback when an alert is deleted #2622

Closed
gotjosh wants to merge 2 commits from the fix-store-resolved-callback branch

Conversation

@gotjosh (Member) commented Jun 15, 2021

The garbage collection process within the store is in charge of determining whether an alert is resolved, deleting it, and then communicating this back to the callback that was set.

When an alert was explicitly deleted, the deletion was not communicated back to the callback, which caused the metric to report incorrect results.

Fixes #2619
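
As a rough sketch of the intended behaviour (illustrative only; the field and method names below are simplified assumptions, not the actual store implementation), the point is that an explicit Delete now notifies the same callback the GC loop already uses:

package store

import (
	"sync"

	"github.com/prometheus/alertmanager/types"
	"github.com/prometheus/common/model"
)

// Sketch only: simplified stand-in for the real alert store.
type Alerts struct {
	sync.Mutex
	c        map[model.Fingerprint]*types.Alert
	resolved func([]*types.Alert) // set via SetResolvedCallback
}

// SetResolvedCallback registers the function invoked whenever alerts leave the
// store, whether through garbage collection or an explicit Delete.
func (a *Alerts) SetResolvedCallback(cb func([]*types.Alert)) {
	a.Lock()
	defer a.Unlock()
	a.resolved = cb
}

// Delete removes an alert explicitly and, with this change, also reports it to
// the resolved callback so the callback's bookkeeping (e.g. metrics) stays in sync.
func (a *Alerts) Delete(fp model.Fingerprint) error {
	a.Lock()
	defer a.Unlock()
	alert, ok := a.c[fp]
	if !ok {
		return nil
	}
	delete(a.c, fp)
	if a.resolved != nil {
		a.resolved([]*types.Alert{alert})
	}
	return nil
}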

@gotjosh force-pushed the fix-store-resolved-callback branch 6 times, most recently from 7391f4a to d78b5aa on June 15, 2021 18:45
@gotjosh force-pushed the fix-store-resolved-callback branch from d78b5aa to 2436b98 on June 15, 2021 18:51
store/store_test.go: 3 review threads (outdated, resolved)
@@ -77,7 +77,7 @@ func NewAlerts(ctx context.Context, m types.Marker, intervalGC time.Duration, al
 		logger:   log.With(l, "component", "provider"),
 		callback: alertCallback,
 	}
-	a.alerts.SetGCCallback(func(alerts []*types.Alert) {
+	a.alerts.SetResolvedCallback(func(alerts []*types.Alert) {
A contributor commented:

Apart from m.Delete(alert.Fingerprint()), is it correct to execute the rest of the logic when this callback is called because of an explicit call to Alerts.Delete()? Could this lead to any side effects? Have you checked?

@gotjosh (Member, Author) replied:

I verified that Delete is only called from within dispatch.

But I think you might be right: given we now have a chance of calling this callback more than once concurrently, I should double-check that the operation within it is safe. I'll do that tomorrow and let you know.

@gotjosh (Member, Author) commented Jun 23, 2021:

Thanks for raising this, I've now taken a look. The way I understand it, an explicit call to Alerts.Delete() would only occur if (due to a timing issue) the alert resolves just before we're about to send it:

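// In the dispatcher's flush path (the only caller of Delete): an alert that is
// resolved and unchanged since the flush snapshot (same UpdatedAt) is deleted explicitly.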
if a.Resolved() && got.UpdatedAt == a.UpdatedAt {
	if err := ag.alerts.Delete(fp); err != nil {
		level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String())
	}
}

While (to the best of my knowledge) deleting no-longer-used listeners at a regular interval (each receiver and the inhibitor subscribes to the alerts store) doesn't produce any side effects, I don't think a single alert resolving should trigger that garbage collection.

But there's more... After testing more thoroughly, I realised this doesn't really solve what we wanted in the first place (accurate reporting of alertmanager_alerts), because an alert that resolves naturally wouldn't be GCed until the next run anyway.

An example of this behaviour (with this change included): despite the UI showing 67 alerts, the metric reported 72, and it was not until the GC cycle ran (every 30 minutes by default) that I could see an accurate measurement:

[Screenshot: Metrics, 2021-06-23 18:18]

[Screenshot: API, all filters enabled by default, 2021-06-23 18:20]

Given there's no way to see resolved alerts through the UI, my proposal now is to add a new "resolved" value to this metric's state label, which would allow us to accurately track the value over time in accordance with what we see in the UI.

With that we can do

sum(alertmanager_alerts{state=~"active|suppressed"}) - sum(alertmanager_alerts{state="resolved"})

to match what we see in the UI.
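
As a rough illustration of the proposal (the registration code below is a sketch with assumed names and help text, not the current Alertmanager wiring), the metric would simply gain a third state value:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Sketch of the proposed "resolved" state; variable names are assumptions.
var alertsByState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "alertmanager_alerts",
		Help: "How many alerts by state.",
	},
	[]string{"state"},
)

func init() {
	prometheus.MustRegister(alertsByState)
	// Initialise every state, including the proposed "resolved" one, so the
	// series exist even while their value is zero.
	for _, state := range []string{"active", "suppressed", "resolved"} {
		alertsByState.WithLabelValues(state).Set(0)
	}
}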

@roidelapluie (Member) commented:
The metrics in Alertmanager are for purely operational reasons IIRC and are not meant to reflect the UI or the state of Alertmanager. "Fixing" them might break users. (Note: I could be wrong.)

@roidelapluie (Member) commented:

My point is that if you want accurate metrics it might be better to implement a new collector.
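
For reference, a dedicated collector along these lines (a minimal sketch; the lister interface and metric name are assumptions made for illustration) would compute the counts at scrape time rather than relying on store callbacks:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// alertLister is an assumed interface for whatever can count alerts by state.
type alertLister interface {
	CountByState() map[string]int // e.g. "active", "suppressed", "resolved"
}

type alertsCollector struct {
	lister alertLister
	desc   *prometheus.Desc
}

func NewAlertsCollector(l alertLister) prometheus.Collector {
	return &alertsCollector{
		lister: l,
		desc: prometheus.NewDesc(
			"alertmanager_alerts_by_state", // hypothetical name, to avoid clashing with the existing metric
			"Number of alerts currently in the store, by state.",
			[]string{"state"}, nil,
		),
	}
}

func (c *alertsCollector) Describe(ch chan<- *prometheus.Desc) { ch <- c.desc }

// Collect reads the current counts on every scrape, so the metric cannot drift
// from the store the way a callback-maintained gauge can.
func (c *alertsCollector) Collect(ch chan<- prometheus.Metric) {
	for state, n := range c.lister.CountByState() {
		ch <- prometheus.MustNewConstMetric(c.desc, prometheus.GaugeValue, float64(n), state)
	}
}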

@stale stale bot added the stale label Aug 22, 2021
@gotjosh (Member, Author) commented Sep 2, 2021

Still valid, I'll come back to this soon.

@stale stale bot removed the stale label Sep 2, 2021
@stale stale bot added the stale label Nov 9, 2021
@gotjosh gotjosh closed this Mar 16, 2022
@gotjosh gotjosh deleted the fix-store-resolved-callback branch June 20, 2022 15:26
Successfully merging this pull request may close these issues:

Metric alertmanager_alerts report incorrect number of alerts (#2619)

3 participants