
Add metric for permanently failed notifications #2383

Merged

Conversation

mneverov (Contributor) commented on Oct 4, 2020:

Fixes: #2361

Signed-off-by: Max Neverov <neverov.max@gmail.com>

mneverov (Contributor, Author) commented on Oct 4, 2020:

@SuperQ could you please take a look?

notify/notify.go (outdated):

```go
if !retry {
	r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
```

Member commented:

the metric needs to be incremented before L701 too (when the context is done).
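A minimal sketch of the change being requested here, assuming the retry loop has a ctx.Done() branch roughly like the one below; only the metric and method names come from this PR's diff, the surrounding control flow and the iErr name are illustrative:

```go
select {
case <-ctx.Done():
	// The context is done: the notification will never be delivered,
	// so it must be counted as permanently failed here as well.
	r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
	if iErr == nil {
		iErr = ctx.Err() // iErr: the last attempt's error (illustrative name)
	}
	return ctx, nil, iErr
	// ... other select cases drive the retry attempts (elided)
}
```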

mneverov (Contributor, Author) replied:

you're right. Fixed, thanks.

Member commented:

For readability, it might be better to extract the existing code into a private method and wrap the new instrumentation around it:

```go
func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	r.metrics.numNotifications.WithLabelValues(r.integration.Name()).Inc()
	ctx, alerts, err := r.exec(ctx, l, alerts...)
	if err != nil {
		r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
	}
	return ctx, alerts, err
}

func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	...
}
```


notify/notify.go (outdated):

```go
	Namespace: "alertmanager",
	Name:      "notifications_failed_total",
	Help:      "The total number of failed notifications.",
}, []string{"integration"}),
numPermanentlyFailedNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
	Namespace: "alertmanager",
	Name:      "notifications_permanently_failed_total",
```
simonpasquier (Member) commented:

I'm not totally convinced by the metric name, and we should also count the number of tries.
I'm wondering if we shouldn't use the existing alertmanager_notifications_failed_total to count notifications that permanently failed to be delivered, and introduce alertmanager_notification_requests_total/alertmanager_notification_requests_failed_total metrics to count individual notification request attempts and failures.
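A sketch of what the proposed counters could look like, following the prometheus.NewCounterVec pattern already used in notify/notify.go; the Go type, field, and constructor names are hypothetical, only the metric names come from the comment above:

```go
package notify

import "github.com/prometheus/client_golang/prometheus"

// requestMetrics is a hypothetical holder for the proposed per-attempt counters.
type requestMetrics struct {
	numNotificationRequestsTotal       *prometheus.CounterVec
	numNotificationRequestsFailedTotal *prometheus.CounterVec
}

func newRequestMetrics(r prometheus.Registerer) *requestMetrics {
	m := &requestMetrics{
		numNotificationRequestsTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Namespace: "alertmanager",
			Name:      "notification_requests_total",
			Help:      "The total number of attempted notification requests.",
		}, []string{"integration"}),
		numNotificationRequestsFailedTotal: prometheus.NewCounterVec(prometheus.CounterOpts{
			Namespace: "alertmanager",
			Name:      "notification_requests_failed_total",
			Help:      "The total number of failed notification requests.",
		}, []string{"integration"}),
	}
	r.MustRegister(m.numNotificationRequestsTotal, m.numNotificationRequestsFailedTotal)
	return m
}
```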

mneverov (Contributor, Author) replied:

@simonpasquier thanks for the update. I tried to be non-intrusive and keep the existing meaning of notifications_failed_total. I agree that your suggested metrics have much clearer intent given their names.
Should I introduce them and delete the old metric, or just add the new ones?

simonpasquier (Member) replied:

I think the existing metrics have their own merit and shouldn't be dropped. I'd advise renaming the current alertmanager_notifications_total and alertmanager_notifications_failed_total metrics to alertmanager_notification_requests_total and alertmanager_notification_requests_failed_total, and using alertmanager_notifications_total and alertmanager_notifications_failed_total to count actual notification outcomes.
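Under that split, the instrumentation might land roughly like this: a sketch assuming a simplified retry loop with a plain ticker (the metric field names and the ticker are illustrative; the actual code is not reproduced here):

```go
// Illustrative sketch (not the merged diff): a simplified retry loop showing
// where each counter pair would be incremented after the rename.
func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	tick := time.NewTicker(time.Second) // illustrative; real retries would use backoff
	defer tick.Stop()
	for {
		select {
		case <-tick.C:
			// Every delivery attempt counts as one notification request.
			r.metrics.numNotificationRequestsTotal.WithLabelValues(r.integration.Name()).Inc()
			retry, err := r.integration.Notify(ctx, alerts...)
			if err == nil {
				return ctx, alerts, nil
			}
			r.metrics.numNotificationRequestsFailedTotal.WithLabelValues(r.integration.Name()).Inc()
			if !retry {
				// Non-retryable error: the notification has permanently failed.
				r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
				return ctx, nil, err
			}
			// Retryable error: try again on the next tick.
		case <-ctx.Done():
			// Out of time: also a permanent failure.
			r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
			return ctx, nil, ctx.Err()
		}
	}
}
```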

Signed-off-by: Max Neverov <neverov.max@gmail.com>
mneverov (Contributor, Author) commented:
@simonpasquier could you please have another look?

simonpasquier (Member) left a review:

👍 thanks!

@simonpasquier simonpasquier merged commit c39b787 into prometheus:master Nov 6, 2020
Closes: Separate metrics for temporary and permanently failed notifications (#2361)