
PagerDuty notifier fails with the error "http: server closed idle connection" #2352

Closed

mapshen opened this issue Aug 13, 2020 · 16 comments

@mapshen

mapshen commented Aug 13, 2020

What did you do?

We have an alert rule like the following, which fires periodically, reporting that Alertmanager is failing to notify PagerDuty.

    - alert: AlertmanagerNotificationFailing
      expr: 'rate(alertmanager_notifications_failed_total[1m]) > 0'
      labels:
        severity: warning
      annotations:
        description: "Alertmanager is failing sending notifications."
        summary: "AlertManager {{ $labels.instance }} - 
          {{ $labels.integration }} notification failing"

After setting the log level to debug, we found messages like the one below:

{"attempt":1,"caller":"notify.go:668","component":"dispatcher","err":"failed to post message to PagerDuty: Post https://events.pagerduty.com/v2/enqueue: http: server closed idle connection","integration":"pagerduty","level":"debug","msg":"Notify attempt failed","receiver":"linf","ts":"2020-08-10T19:01:54.036Z"}

So the PagerDuty notifier keeps connections alive with an idle timeout of 5 minutes (as set here), but PagerDuty clearly closes idle connections sooner than that.

The response from PagerDuty's engineering team was:

It seems like you are trying to use a persistent connection to our Events API in order to send multiple events. I'm afraid this is currently not supported by the Events API, we don't expect to keep connections alive any longer than a single API request.

What did you expect to see?

Based on that, I am wondering if we could update this line from

	client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", false)

to

	client, err := commoncfg.NewClientFromConfig(*c.HTTPConfig, "pagerduty", true)

so that keep-alive is disabled, users no longer see the above errors, and the alertmanager_notifications_failed_total metric is not incremented because of them.
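For context, passing true there effectively disables keep-alives on the underlying HTTP transport. A minimal sketch of the same knob on a plain net/http client (not the actual prometheus/common code, just the standard-library equivalent):

    package pagerduty

    import "net/http"

    // newOneShotClient returns a client that opens a fresh connection for
    // every request, so PagerDuty can never close an idle connection on us.
    func newOneShotClient() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                DisableKeepAlives: true,
            },
        }
    }

The trade-off is a fresh TCP/TLS handshake per request.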

  • Alertmanager version:
Version Information
Branch: HEAD
BuildDate: 20191211-14:13:14
BuildUser: root@00c3106655f8
GoVersion: go1.13.5
Revision: f74be0400a6243d10bb53812d6fa408ad71ff32d
Version: 0.20.0
  • Alertmanager configuration file:
global:
  resolve_timeout: 5m
  smtp_hello: localhost
  smtp_require_tls: true
  pagerduty_url: https://events.pagerduty.com/v2/enqueue
  hipchat_api_url: https://api.hipchat.com/
  opsgenie_api_url: https://api.opsgenie.com/
  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/
  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/
route:
  receiver: default
  group_by:
  - '...'
  routes:
  - receiver: default
    group_by:
    - '...'
    match:
      label: value
  group_wait: 0s
  group_interval: 5m
  repeat_interval: 4h
receivers:
- name: default
  pagerduty_configs:
  - send_resolved: true
    routing_key: <secret>
    url: https://events.pagerduty.com/v2/enqueue
    client: '{{ template "pagerduty.default.client" . }}'
    client_url: '{{ template "pagerduty.default.clientURL" . }}'
    description: '{{ .CommonAnnotations.summary }}'
    details:
      firing: '{{ template "pagerduty.default.instances" .Alerts.Firing }}'
      num_firing: '{{ .Alerts.Firing | len }}'
      num_resolved: '{{ .Alerts.Resolved | len }}'
      resolved: '{{ template "pagerduty.default.instances" .Alerts.Resolved }}'
    severity: '{{ if .CommonLabels.severity }}{{ .CommonLabels.severity | toLower
      }}{{ else }}critical{{ end }}'
@brian-brazil
Contributor

I'm afraid this is currently not supported by the Events API, we don't expect to keep connections alive any longer than a single API request.

If that's the case, then PagerDuty should be sending back a Connection: Close header, as connection persistence has been the norm for 20+ years. As-is, PagerDuty is violating the HTTP spec.
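For illustration only (PagerDuty's API is obviously not this Go toy server), this is how a server that does not want persistent connections can signal it per response, which is what the comment above is asking for:

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        http.HandleFunc("/v2/enqueue", func(w http.ResponseWriter, r *http.Request) {
            // Tell the client (and Go's HTTP server) that this connection
            // must not be reused after the response is written.
            w.Header().Set("Connection", "close")
            fmt.Fprintln(w, `{"status":"success"}`)
        })
        log.Fatal(http.ListenAndServe(":8080", nil)) // hypothetical local stand-in
    }

Compliant clients, including Go's net/http, see the header and stop reusing the connection instead of hitting "server closed idle connection" later.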

@roidelapluie
Member

@shamilpd Can you look at this?

@shamilpd
Contributor

shamilpd commented Aug 15, 2020

Hi @roidelapluie. I am looking at this. We will decide next week whether we will support keep-alive connections going forward or respond with a Connection: close header.

@roidelapluie
Member

Thank you!

Either way, it seems the bug is not on the Alertmanager side. I will rename the issue and leave it open for a week so that other users who experience this do not create a duplicate issue and can find this discussion.

@roidelapluie roidelapluie changed the title Set disableKeepAlives to true for PagerDuty notifier PagerDuty notifier fails with the error "http: server closed idle connection" Aug 15, 2020
@mapshen
Author

mapshen commented Aug 17, 2020

Thanks everyone for chiming in and helping out!

@shamilpd Looking forward to a verdict on this :) It would really save us from being paged past midnight for false alarms.

@mapshen
Author

mapshen commented Aug 24, 2020

@shamilpd just wondering if you have decided on this?

@shamilpd
Contributor

Hey @mapshen, sorry for the late reply! We support keep-alive connections now and use a 75s timeout (the nginx default). If you set IdleConnTimeout to 75s or less, everything should be fine.
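For reference, on a plain Go client the suggested setting would look like the sketch below; as noted in the next comment, Alertmanager users cannot set this themselves today, since the value is hard-coded where the client is built:

    package pagerduty

    import (
        "net/http"
        "time"
    )

    // newClientWithShortIdleTimeout keeps keep-alives enabled but drops idle
    // connections before PagerDuty's 75s server-side timeout closes them first.
    func newClientWithShortIdleTimeout() *http.Client {
        return &http.Client{
            Transport: &http.Transport{
                IdleConnTimeout: 75 * time.Second,
            },
        }
    }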

@roidelapluie
Member

The timeout is not configurable by alertmanager users and is set to 5 minutes.

@mapshen
Author

mapshen commented Sep 4, 2020

@shamilpd like @roidelapluie said, it's not configurable. Could PagerDuty maybe just return Connection: Close? We got another false alarm last night...

@brian-brazil
Contributor

We automatically retry, and there's nothing about this situation that should in itself generate alerts. Where exactly are these alerts coming from?

@roidelapluie
Member

We automatically retry, and there's nothing about this situation that should in itself generate alerts. Where exactly are these alerts coming from?

Now that I think about it, could it also hit the HTTP/2 bug?

@shamilpd
Contributor

shamilpd commented Sep 8, 2020

@mapshen PagerDuty won't be able to disable keep-alive connections. We are following the normal convention now and using a default value for the idle connection timeout.

It seems to me that the behavior you are seeing from Alertmanager (timeout error + retry) is by design for handling idle connections, so it's to be expected, especially because the timeout is set to such a high value (5 minutes). We also haven't received any similar complaints from our other Prometheus users.

I would ask you to reconsider whether you really need the alert you made. Is there a different metric you can use? Since this is an expected case, one option would be to raise the alert threshold above the value the metric reaches when these timeouts occur.

@mapshen
Author

mapshen commented Sep 8, 2020

We alert on alertmanager_notifications_failed_total, and we choose to alert whenever this metric is incremented, because not all failed notifications are retried automatically like these are, and we don't want to miss real failures. I'm not sure how we could improve this alert rule, but we are definitely open to better ideas; chances are we are doing it wrong.

One idea: since this line increments alertmanager_notifications_failed_total whenever Alertmanager fails to notify, even when retry is set to true, we could enrich the metric with a retriable label denoting whether the error is retriable. If not, we alert as soon as the value increases; otherwise, we alert only when a certain threshold is hit within a time period.
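A minimal sketch of that idea with client_golang, assuming a hypothetical retriable label (this is not the metric Alertmanager registers today):

    package notify

    import "github.com/prometheus/client_golang/prometheus"

    // numFailedNotifications is a hypothetical variant of
    // alertmanager_notifications_failed_total with a "retriable" label, so
    // alert rules can treat transient errors differently from permanent ones.
    var numFailedNotifications = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Namespace: "alertmanager",
            Name:      "notifications_failed_total",
            Help:      "The total number of failed notifications.",
        },
        []string{"integration", "retriable"},
    )

    func recordFailure(integration string, retriable bool) {
        label := "false"
        if retriable {
            label = "true"
        }
        numFailedNotifications.WithLabelValues(integration, label).Inc()
    }

With that in place, the rule at the top could alert immediately on any increase with retriable="false" and only on a sustained rate for retriable="true".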

@simonpasquier
Member

This is similar to #2361

@simonpasquier
Member

IIUC, PagerDuty now supports keep-alives. The connection might occasionally need to be re-established due to different time-out settings, hence alertmanager_notifications_failed_total is increased, but the next retry should succeed. Given that we already have #2361 to improve the metrics (there's even an open PR, #2383), I'm closing this issue for now. Feel free to re-open if I'm mistaken.

@mapshen
Author

mapshen commented Oct 7, 2020

I meant to open a PR but never got the time. The solution proposed in #2383 looks alright to me.

Thanks for taking care of this, @simonpasquier !
