PagerDuty notifier fails with the error "http: server closed idle connection" #2352
Comments
If that's the case then PagerDuty should be sending back a …
@shamilpd Can you look at this?
Hi @roidelapluie. I am looking at this. We will decide next week whether we will support keep-alive connections going forward or respond with a …
Thank you! Either way it seems the bug is not on the Alertmanager side. I will rename the issue and leave it open for a week so that other users who experience this do not create a duplicate issue and can see this discussion.
Thanks everyone for chiming in and helping out! @shamilpd Looking forward to a verdict on this :) This can really save us from being paged past midnight for false alarms.
@shamilpd just wondering if you have decided on this?
Hey @mapshen sorry for the late reply! We have support for keep-alive connections now and we are using 75s for the timeout (the default for nginx). If you set …
The timeout is not configurable by alertmanager users and is set to 5 minutes. |
@shamilpd like @roidelapluie said, it's not configurable. Maybe we could just return …
We automatically retry, and there's nothing about this situation that should in itself generate alerts. Where exactly are these alerts coming from?
Now that I think about it, could it also hit the HTTP/2 bug?
@mapshen PagerDuty won't be able to disable keep-alive connections. We are following the normal convention now and are using a default value for idle connection timeouts. It seems to me that the behavior you are seeing from Alertmanager (timeout error + retry) is by design for handling idle connections, so it's to be expected - especially because the timeout is set to such a high value (5 minutes). We also haven't received any similar complaints from our other Prometheus users. I would ask you to reconsider whether or not you really need the alert you made. Is there a different metric you can use? Since this is an expected case, one option would be to increase the threshold on the alert to be higher than the value it reaches when it triggers due to these timeouts.
We alert based on … One idea I have is that, since this line increments …
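For reference, alerts of the kind discussed in this thread are typically written against Alertmanager's notification counters. A hypothetical rule for illustration - `alertmanager_notifications_failed_total` and `alertmanager_notifications_total` are real Alertmanager metrics, but the threshold, label matchers, and durations here are made up:

```yaml
groups:
  - name: alertmanager
    rules:
      - alert: AlertmanagerNotificationsFailing
        # Hypothetical threshold: fire when more than 1% of PagerDuty
        # notifications fail over 5 minutes. Note that transient
        # "server closed idle connection" errors still increment the
        # failed counter even when the automatic retry succeeds, which
        # is what makes a rule like this noisy.
        expr: |
          rate(alertmanager_notifications_failed_total{integration="pagerduty"}[5m])
            /
          rate(alertmanager_notifications_total{integration="pagerduty"}[5m])
            > 0.01
        for: 5m
```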
This is similar to #2361 |
IIUC PagerDuty now supports keep-alives. The connection might occasionally need to be re-established due to differing time-out settings, hence …
Meant to do a PR but never got the time. The solution proposed in #2383 looks alright to me. Thanks for taking care of this, @simonpasquier!
What did you do?
We have an alert rule like the following which fires periodically, reporting failures to notify PagerDuty.
After setting the log level to `debug`, we found messages like the following:

So the PagerDuty notifier tries to keep connections alive, and the idle timeout is 5 minutes as per here, but PagerDuty clearly closes idle connections sooner than that.
The response from PagerDuty's engineering team is:
What did you expect to see?
Based on that, I am wondering if we could update this line from

…

to

…

so that keep-alive will be disabled, users will not receive the above errors, and the metric `alertmanager_notifications_total` will not be incremented because of it.