Add metric for permanently failed notifications #2383
@SuperQ could you please take a look?
notify/notify.go
Outdated
if !retry {
	r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
The metric also needs to be incremented before L701 (when the context is done).
you're right. Fixed, thanks.
For readability it might be better to extract the existing code into a private method and wrap the new instrumentation around:
func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	r.metrics.numNotifications.Inc()
	ctx, alerts, err := r.exec(ctx, l, alerts...)
	if err != nil {
		r.metrics.numFailedNotifications.Inc()
	}
	return ctx, alerts, err
}

func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	...
}
Force-pushed from 94998ff to 4efde55.
notify/notify.go
Outdated
	Namespace: "alertmanager",
	Name:      "notifications_failed_total",
	Help:      "The total number of failed notifications.",
}, []string{"integration"}),
numPermanentlyFailedNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "alertmanager",
	Name:      "notifications_permanently_failed_total",
I'm not totally convinced by the metric name. We should also count the number of tries.
I wonder whether we should use the existing alertmanager_notifications_failed_total to count notifications that permanently couldn't be delivered, and introduce alertmanager_notification_requests_total / alertmanager_notification_requests_failed_total metrics to count notification request attempts and failures.
@simonpasquier thanks for the update. I tried to be non-intrusive and keep the existing meaning of notifications_failed_total. I agree that your suggested metric names express the intent much more clearly.
Should I introduce them and delete the old metric, or just add the new ones?
I think the existing metrics have their own merit and shouldn't be dropped. I'd advise renaming the current alertmanager_notifications_total and alertmanager_notifications_failed_total metrics to alertmanager_notification_requests_total and alertmanager_notification_requests_failed_total, and using alertmanager_notifications_total and alertmanager_notifications_failed_total to count actual notification results.
Signed-off-by: Max Neverov <neverov.max@gmail.com>
Force-pushed from 4efde55 to 0749478.
@simonpasquier could you please have another look?
👍 thanks!
Fixes: #2361
Signed-off-by: Max Neverov <neverov.max@gmail.com>