
PagerDuty notifier fails with the error "http: server closed idle connection" #2352

Closed

mapshen opened this issue Aug 13, 2020 · 16 comments

@mapshen

mapshen commented Aug 13, 2020

What did you do?

We have an alert rule like the following, which fires periodically, reporting that Alertmanager is failing to notify PagerDuty.

    - alert: AlertmanagerNotificationFailing
      expr: 'rate(alertmanager_notifications_failed_total[1m]) > 0'
      labels:
        severity: warning
      annotations:
        description: "Alertmanager is failing sending notifications."
        summary: "AlertManager {{ $labels.instance }} - 
          {{ $labels.integration }} notification failing"

After setting the log level to debug, we found messages like the one below:

{"attempt":1,"caller":"notify.go:668","component":"dispatcher","err":"failed to post message to PagerDuty: Post https://events.pagerduty.com/v2/enqueue: http: server closed idle connection","integration":"pagerduty","level":"debug","msg":"Notify attempt failed","receiver":"linf","ts":"2020-08-10T19:01:54.036Z"}

So the PagerDuty notifier keeps connections alive with an idle timeout of 5 minutes (as set here), but PagerDuty clearly closes idle connections sooner than that.

The response from PagerDuty's engineering team was:

It seems like you are trying to use a persistent connection to our Events API in order to send multiple events. I'm afraid this is currently not supported by the Events API, we don't expect to keep connections alive any longer than a single API request.

What did you expect to see?

Based on that, I am wondering if we could update this line from

	client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", false)

to

	client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", true)

so that keep-alive is disabled, users no longer see the above errors, and the alertmanager_notifications_failed_total metric is not incremented because of them.
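For context, passing true there effectively disables keep-alives on the underlying HTTP transport. A minimal sketch of the same knob on a plain net/http client (not the actual prometheus/common code, just the standard-library equivalent):

    package pagerduty

    import "net/http"

    // newOneShotClient returns a client that opens a fresh connection for
    // every request, so PagerDuty can never close an idle connection on us.
    func newOneShotClient() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                DisableKeepAlives: true,
            },
        }
    }

The trade-off is a fresh TCP/TLS handshake per request.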

  • Alertmanager version:
Version Information
Branch: HEAD
BuildDate: 20191211-14:13:14
BuildUser: root@00c3106655f8
GoVersion: go1.13.5
Revision: f74be0400a6243d10bb53812d6fa408ad71ff32d
Version: 0.20.0
  • Alertmanager configuration file:
global:
  resolve_timeout: 5m
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: default
  group_by:
  - '...'
  routes:
  - receiver: default
    group_by:
    - '...'
    match:
      label: value
  group_wait: 0s
  group_interval: 5m
  repeat_interval: 4h
receivers:
- name: default
  pagerduty_configs:
  - send_resolved: true
    routing_key: <secret>
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonAnnotations.summary }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower
      }}{{ else }}critical{{ end }}'
@brian-brazil
Contributor

I'm afraid this is currently not supported by the Events API, we don't expect to keep connections alive any longer than a single API request.

If that's the case, then PagerDuty should be sending back a Connection: Close header, as connection persistence has been the norm for 20+ years. As-is, PagerDuty is violating the HTTP spec.
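For illustration only (PagerDuty's API is obviously not this Go toy server), this is how a server that does not want persistent connections can signal it per response, which is what the comment above is asking for:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        http.HandleFunc("/v2/enqueue", func(w http.ResponseWriter, r *http.Request) {
            // Tell the client (and Go's HTTP server) that this connection
            // must not be reused after the response is written.
            w.Header().Set("Connection", "close")
            fmt.Fprintln(w, `{"status":"success"}`)
        })
        log.Fatal(http.ListenAndServe(":8080", nil)) // hypothetical local stand-in
    }

Compliant clients, including Go's net/http, see the header and stop reusing the connection instead of hitting "server closed idle connection" later.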

@roidelapluie
Member

@shamilpd Can you look at this?

@shamilpd
Contributor

shamilpd commented Aug 15, 2020

Hi @roidelapluie. I am looking at this. We will decide next week whether we will support keep-alive connections going forward or respond with a Connection: close header.

@roidelapluie
Member

Thank you!

Either way, it seems the bug is not on the Alertmanager side. I will rename the issue and leave it open for a week so that other users who experience this do not create a duplicate issue and can find this discussion.

@roidelapluie roidelapluie changed the title Set disableKeepAlives to true for PagerDuty notifier PagerDuty notifier fails with the error "http: server closed idle connection" Aug 15, 2020
@mapshen
Author

mapshen commented Aug 17, 2020

Thanks everyone for chiming in and helping out!

@shamilpd Looking forward to a verdict on this :) It would really save us from being paged past midnight for false alarms.

@mapshen
Author

mapshen commented Aug 24, 2020

@shamilpd just wondering if you have decided on this?

@shamilpd
Contributor

Hey @mapshen, sorry for the late reply! We support keep-alive connections now and use a 75s timeout (the nginx default). If you set IdleConnTimeout to 75s or less, everything should be fine.
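For reference, on a plain Go client the suggested setting would look like the sketch below; as noted in the next comment, Alertmanager users cannot set this themselves today, since the value is hard-coded where the client is built:

    package pagerduty

    import (
        "net/http"
        "time"
    )

    // newClientWithShortIdleTimeout keeps keep-alives enabled but drops idle
    // connections before PagerDuty's 75s server-side timeout closes them first.
    func newClientWithShortIdleTimeout() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                IdleConnTimeout: 75 * time.Second,
            },
        }
    }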

@roidelapluie
Member

The timeout is not configurable by alertmanager users and is set to 5 minutes.

@mapshen
Author

mapshen commented Sep 4, 2020

@shamilpd like @roidelapluie said, it's not configurable. Could PagerDuty maybe just return Connection: Close? We got another false alarm last night...

@brian-brazil
Contributor

We automatically retry, and there's nothing about this situation that should in itself generate alerts. Where exactly are these alerts coming from?

@roidelapluie
Member

We automatically retry, and there's nothing about this situation that should in itself generate alerts. Where exactly are these alerts coming from?

Now that I think about it, could it also hit the HTTP/2 bug?

@shamilpd
Contributor

shamilpd commented Sep 8, 2020

@mapshen PagerDuty won't be able to disable keep-alive connections. We are following the normal convention now and using a default value for the idle connection timeout.

It seems to me that the behavior you are seeing from Alertmanager (timeout error + retry) is by design for handling idle connections, so it's to be expected, especially because the timeout is set to such a high value (5 minutes). We also haven't received any similar complaints from our other Prometheus users.

I would ask you to reconsider whether you really need the alert you made. Is there a different metric you can use? Since this is an expected case, one option would be to raise the alert threshold above the value the metric reaches when these timeouts occur.

@mapshen
Author

mapshen commented Sep 8, 2020

We alert on alertmanager_notifications_failed_total, and we choose to alert whenever this metric is incremented, because not all failed notifications are retried automatically like these are, and we don't want to miss real failures. I'm not sure how we could improve this alert rule, but we are definitely open to better ideas; chances are we are doing it wrong.

One idea: since this line increments alertmanager_notifications_failed_total whenever Alertmanager fails to notify, even when retry is set to true, we could enrich the metric with a retriable label denoting whether the error is retriable. If not, we alert as soon as the value increases; otherwise, we alert only when a certain threshold is hit within a time period.
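A minimal sketch of that idea with client_golang, assuming a hypothetical retriable label (this is not the metric Alertmanager registers today):

    package notify

    import "github.com/prometheus/client_golang/prometheus"

    // numFailedNotifications is a hypothetical variant of
    // alertmanager_notifications_failed_total with a "retriable" label, so
    // alert rules can treat transient errors differently from permanent ones.
    var numFailedNotifications = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "alertmanager",
            Name:      "notifications_failed_total",
            Help:      "The total number of failed notifications.",
        },
        []string{"integration", "retriable"},
    )

    func recordFailure(integration string, retriable bool) {
        label := "false"
        if retriable {
            label = "true"
        }
        numFailedNotifications.WithLabelValues(integration, label).Inc()
    }

With that in place, the rule at the top could alert immediately on any increase with retriable="false" and only on a sustained rate for retriable="true".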

@simonpasquier
Member

This is similar to #2361

@simonpasquier
Member

IIUC, PagerDuty now supports keep-alives. The connection might occasionally need to be re-established due to different time-out settings, hence alertmanager_notifications_failed_total is increased, but the next retry should succeed. Given that we already have #2361 to improve the metrics (there's even an open PR, #2383), I'm closing this issue for now. Feel free to re-open if I'm mistaken.

@mapshen
Author

mapshen commented Oct 7, 2020

I meant to open a PR but never got the time. The solution proposed in #2383 looks alright to me.

Thanks for taking care of this, @simonpasquier !
