-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alert store not calling callback when an alert is deleted #2622
Conversation
7391f4a
to
d78b5aa
Compare
The garbage collection process within the store is in charge of determining if an alert is resolved, deleting it, and then communicating this back to the callback set. When an alert was explicitly deleted, these were not being communicated back to the callback and caused the metric to report incorrect results. Fixes prometheus#2619 Signed-off-by: gotjosh <josue@grafana.com>
d78b5aa
to
2436b98
Compare
@@ -77,7 +77,7 @@ func NewAlerts(ctx context.Context, m types.Marker, intervalGC time.Duration, al | |||
logger: log.With(l, "component", "provider"), | |||
callback: alertCallback, | |||
} | |||
a.alerts.SetGCCallback(func(alerts []*types.Alert) { | |||
a.alerts.SetResolvedCallback(func(alerts []*types.Alert) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A part from m.Delete(alert.Fingerprint())
, is it correct to execute the rest of the logic for the case this callback is called because an explicit call to Alerts.Delete()
? Could this lead to any side effect? Have you checked it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified that Delete
is only called from within dispatch.
But I think you might be right, given we now have a chance of calling this callback more than once concurrently I should double-check that the operation within it is safe. I'll do that tomorrow and let you know.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this, I've now taken a look. The way I understand it, an explicit call to Alerts.Delete()
would only occur if (due to a timing issue) the alert would resolve just before we're about to send it.
alertmanager/dispatch/dispatch.go
Lines 529 to 533 in fd0929b
if a.Resolved() && got.UpdatedAt == a.UpdatedAt { | |
if err := ag.alerts.Delete(fp); err != nil { | |
level.Error(ag.logger).Log("msg", "error on delete alert", "err", err, "alert", a.String()) | |
} | |
} |
While (to the best of my knowledge) the deletion of no longer used listeners at a more regular interval (each receiver/inhibitions subscribes to the alerts store) doesn't produce any side effects, I don't think a single alert resolving should trigger the garbage collection of this.
But, there's more... After testing more thoroughly I realised that this doesn't really solve what we wanted in the first place (accurate reporting of the alertmanager_alerts
) because if an alert is resolved naturally it wouldn't be GCed until the next run anyway.
An example of this behaviour (with this change included) is that despite seeing 67 alerts in the UI and the metric reporting 72 alerts, it was not until the GC cycle ran (every 30 minutes by default) that I could see the accurate measurement:
API (all filters are enabled by default)
Given there's no way to see resolved alerts through the UI, My proposal now is that we actually create a new label of "resolved" for this metric which would allow us to accurately track the value over time in accordance with what we see in the UI.
With that we can do
sum(alertmanager_alerts{state=~"active|supressed"}) - sum(alertmanager_alerts{state="resolved"})
to match what we see in the UI.
Signed-off-by: gotjosh <josue@grafana.com>
The metrics in alertmanager are for pure operational reasons IIRC and are not meant to reflect the UI or the state of alertmanager. "Fixing" them might break users. (Note: I could be wrong) |
My point is that if you want accurate metrics it might be better to implement a new collector. |
Still valid, I'll come back to this soon. |
The garbage collection process within the store is in charge of
determining if an alert is resolved, deleting it, and then communicating
this back to the callback set.
When an alert was explicitly deleted, these were not being communicated
back to the callback and caused the metric to report incorrect results.
Fixes #2619