
Add metric for permanently failed notifications #2383

Merged

Conversation

mneverov (Contributor) commented on Oct 4, 2020:

Fixes: #2361

Signed-off-by: Max Neverov <neverov.max@gmail.com>

mneverov (Contributor, Author) commented on Oct 4, 2020:

@SuperQ could you please take a look?

notify/notify.go (outdated):

```go
if !retry {
	r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
```

Member commented:

the metric needs to be incremented before L701 too (when the context is done).
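A minimal sketch of the change being requested here, assuming the retry loop has a ctx.Done() branch roughly like the one below; only the metric and method names come from this PR's diff, the surrounding control flow and the iErr name are illustrative:

```go
select {
case <-ctx.Done():
	// The context is done: the notification will never be delivered,
	// so it must be counted as permanently failed here as well.
	r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
	if iErr == nil {
		iErr = ctx.Err() // iErr: the last attempt's error (illustrative name)
	}
	return ctx, nil, iErr
	// ... other select cases drive the retry attempts (elided)
}
```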

mneverov (Contributor, Author) replied:

you're right. Fixed, thanks.

Member commented:

For readability, it might be better to extract the existing code into a private method and wrap the new instrumentation around it:

```go
func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	r.metrics.numNotifications.WithLabelValues(r.integration.Name()).Inc()
	ctx, alerts, err := r.exec(ctx, l, alerts...)
	if err != nil {
		r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
	}
	return ctx, alerts, err
}

func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	...
}
```


notify/notify.go (outdated):

```go
	Namespace: "alertmanager",
	Name:      "notifications_failed_total",
	Help:      "The total number of failed notifications.",
}, []string{"integration"}),
numPermanentlyFailedNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "alertmanager",
	Name:      "notifications_permanently_failed_total",
```
simonpasquier (Member) commented:

I'm not totally convinced by the metric name, and we should also count the number of tries.
I'm wondering if we shouldn't use the existing alertmanager_notifications_failed_total to count notifications that permanently failed to be delivered, and introduce alertmanager_notification_requests_total/alertmanager_notification_requests_failed_total metrics to count individual notification request attempts and failures.
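A sketch of what the proposed counters could look like, following the prometheus.NewCounterVec pattern already used in notify/notify.go; the Go type, field, and constructor names are hypothetical, only the metric names come from the comment above:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// requestMetrics is a hypothetical holder for the proposed per-attempt counters.
type requestMetrics struct {
	numNotificationRequestsTotal       *prometheus.CounterVec
	numNotificationRequestsFailedTotal *prometheus.CounterVec
}

func newRequestMetrics(r prometheus.Registerer) *requestMetrics {
	m := &requestMetrics{
		numNotificationRequestsTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Namespace: "alertmanager",
			Name:      "notification_requests_total",
			Help:      "The total number of attempted notification requests.",
		}, []string{"integration"}),
		numNotificationRequestsFailedTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Namespace: "alertmanager",
			Name:      "notification_requests_failed_total",
			Help:      "The total number of failed notification requests.",
		}, []string{"integration"}),
	}
	r.MustRegister(m.numNotificationRequestsTotal, m.numNotificationRequestsFailedTotal)
	return m
}
```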

mneverov (Contributor, Author) replied:

@simonpasquier thanks for the update. I tried to be non-intrusive and keep the existing meaning of notifications_failed_total. I agree that your suggested metrics have much clearer intent given their names.
Should I introduce them and delete the old metric, or just add the new ones?

simonpasquier (Member) replied:

I think the existing metrics have their own merit and shouldn't be dropped. I'd advise renaming the current alertmanager_notifications_total and alertmanager_notifications_failed_total metrics to alertmanager_notification_requests_total and alertmanager_notification_requests_failed_total, and using alertmanager_notifications_total and alertmanager_notifications_failed_total to count actual notification outcomes.
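Under that split, the instrumentation might land roughly like this: a sketch assuming a simplified retry loop with a plain ticker (the metric field names and the ticker are illustrative; the actual code is not reproduced here):

```go
// Illustrative sketch (not the merged diff): a simplified retry loop showing
// where each counter pair would be incremented after the rename.
func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	tick := time.NewTicker(time.Second) // illustrative; real retries would use backoff
	defer tick.Stop()
	for {
		select {
		case <-tick.C:
			// Every delivery attempt counts as one notification request.
			r.metrics.numNotificationRequestsTotal.WithLabelValues(r.integration.Name()).Inc()
			retry, err := r.integration.Notify(ctx, alerts...)
			if err == nil {
				return ctx, alerts, nil
			}
			r.metrics.numNotificationRequestsFailedTotal.WithLabelValues(r.integration.Name()).Inc()
			if !retry {
				// Non-retryable error: the notification has permanently failed.
				r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
				return ctx, nil, err
			}
			// Retryable error: try again on the next tick.
		case <-ctx.Done():
			// Out of time: also a permanent failure.
			r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
			return ctx, nil, ctx.Err()
		}
	}
}
```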

Signed-off-by: Max Neverov <neverov.max@gmail.com>
mneverov (Contributor, Author) commented:
@simonpasquier could you please have another look?

simonpasquier (Member) left a review:

👍 thanks!

@simonpasquier simonpasquier merged commit c39b787 into prometheus:master Nov 6, 2020
Closes: Separate metrics for temporary and permanently failed notifications (#2361)