-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for heartbeat #444
Conversation
Interesting. I know @grobie and @matthiasr were thinking of something like this. Maybe they want to chime in. |
We do have a setup for this at SoundCloud. We have one alert which is always firing and use a dead man switch service to detect whether that doesn't reach our notification system (PagerDuty). I think @matthiasr contacted Pagerduty and they plan to support that natively. It's important to know when the alertmanager is down, so more support for that is good I think. Having an independent heartbeat routine won't catch a few other issues (like Prometheus servers can't reach alertmanager, the alertmanager can't send out notifications, etc.) but I guess that's a good start. |
return &OpsGenie{ | ||
conf: conf, | ||
done: make(chan struct{}), | ||
ticker: time.NewTicker(time.Duration(conf.Interval)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will leak ticker routines, as these don't get stopped.
I don't see the need or advantage of making Tick() part of the interface. I'd just expose the interval and keep the ticker local to the Run()
method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed I didn't see this... I made the Tick() method because it is not possible AFAIK to expose attributes on an interface. I could create a HeartbeatProvider interface and a Heartbeat struct to implement this or expose the interval in the HeartbeatRunner on a per implementation basis (map[string]struct{done chan, time.Duration interval})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Well, you could close the ticker in the Stop method, but I'd really keep that local. What about a func Interval() time.Duration
method on the interface?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
- Send an HTTP request to specific hosts so they know the current instance is running - Only implements OpsGenie API Conf sample: heartbeats: - name: 'opsgenie-hb' opsgenie_configs: - api_key: 'f5ec0d1f-5978-11e6-881f-64006a5a8984' interval: '10m' name: 'my-app-heartbeat' Change-Id: Ied9d45ea630362ed6ed17845506db7b5b21919a4
Hey, what's the status on this PR ? Are you considering merging it or do you require some more work from me ? |
Hey, thanks for pinging again about this. I think generally a heartbeat feature is very useful. As grobie said, it does not catch a case where Prometheus servers cannot reach any Alertmanager. But this would only be thoroughly validated if each Prometheus server had its own alert and we checked each of them. But we should discuss this a bit further before adding a new feature. |
This should be covered by general meta-monitoring, this approach has the usual issues with bottom-up push-based monitoring and is likely to trigger false positives in an HA scenario as no single AM is a SPOF. This approach also seems a bit OpsGenie specific. The generic version of this would be normal Prometheus scraping for example. |
Mh, not sure I understand. A heartbeat would allow me to get a page when Alertmanager cannot reach the integration. How is that covered by meta monitoring? My meta monitoring will send an alert to my Alertmanager, which cannot send out the notification... and then what? In an HA setup, you'd have to deduplicate heartbeats along a name I'd suppose and it would only trigger if no AM sends something on that named heartbeat channel anymore. So it wouldn't be coupled to single AM instances. |
What you really need is an end-to-end test, and usually the issue is with one AM so the other AMs are still working. |
I think we are talking about different intends behind this. I can run 10 fully functional AMs in my DC but if it's cut off from the Internet, there's no meta monitoring that can help me. FWIW, most users are small-scale and telling them to run a separate meta-monitoring Prometheus+AM in another DC is not a good story for a fully functional and easy to deploy (one of our core features) monitoring setup. Especially if integrations provide such heartbeat/dead-man's-switch functionality already. |
Meta-monitoring should catch the DC going down, otherwise you don't have full meta-monitoring. Meta-monitoring is everything you need to catch your monitoring going down. I think something more along the lines of @grobie's lightning talk at PromCon is what we should be looking at. |
That was @matthiasr's talk IIRC. Conceptually it wasn't too different – it also was a heartbeat/dead-man's-switch – just more Prometheus native and more end-to-end. But in the end, by your reasoning this seems just as obsolete – with full meta-monitoring you'd just directly send an alert notification that your monitoring is not working rather than going through the entire contraption described. So not sure I understand what you are leaning towards exactly. |
The heartbeat here as proposed only partially tests that one AM can talk to the notification mechanism. The lightning talk tested that the monitoring system as a whole can talk to the notification mechanism. This much more accurately represents what you want to monitor. |
Yes, but the path Prometheus->Alertmanager can be verified from within the data center via meta monitoring. If that is setup, the only missing piece (when not wanting to require users to setup cross-DC meta monitoring) is the Alertmanager->Internet route. This gap could be filled by having Prometheus heartbeat to an integration. I agree that the other approach covers more and works just as nice. More importantly, it doesn't introduce another concept. Of course people can add their own webhook again that sends some JSON to whatever heartbeat API on an alert notification. But for such trivial stuff, maybe it's worth revisiting whether we should allow assembling custom JSON via templates.
And in this case we have named heartbeats. So it does not verify that one AM is working as the entire AM cluster would send heartbeats and they are not differentiated by source IP or similar. |
This is an offline-processing system, attempting to monitor each hop individually will miss things. A end-to-end heartbeat is desirable.
I'm not sure we should have this as a primary goal. Everything we're talking about here should be some form of cross-DC monitoring.
This is actually one of the simplest solutions I've seen to this general class of problem. |
Not sure what you mean. If PagerDuty just had an incident handler that is inverted, i.e. pages you if it doesn't receive alerts, this would be super straightforward. The contraption of using 2 or 3 different SaaS providers after leaving Alertmanager has nothing to do with this being a complex problem class – it's simply a gap in the offerings of PD, OpsGenie, etc. And arguably the OpsGenie heartbeat functionality is exactly that, just with it's tiny API that is unfortunately not equal to the incident API. You skipped the essential part of my last response.
Nothing said of a primary goal, but it's a use case we have to consider. Not every company allows their sysadmins to run something in the cloud. And when on-site is just a single site, there must be a way to do it. I'm not sure what we are talking about anymore, to be honest. Your last few responses are giving contradicting signals. |
It's not possible to do meta-monitoring without use of a provider outside your network, so I don't think this is a reasonable design restriction. This PR presumes a cloud service in OpsGenie for example.
I believe I've been consistent. The OpsGenie heartbeat would offer testing of the AM->OpsGenie path in some cases - but there's more to meta monitoring than testing that one hop of the alerting path and not everyone uses OpsGenie. An end to end test that starts at Prometheus and ends outside your network is what's needed. The alertmanager sending heartbeats itself is not the right way to provide this feature, and may lull users into a false sense of security. |
This was meant to be a general discussion rather than about accepting exactly this PR. Probably should have moved this into its own issue for clarity. Last response here to finish this. You are arguing against what I'm saying when I don't even disagree with you. I've explicitly said that the end-to-end solution is better and proposed that we find a way to integrate our general notifications with such heartbeat APIs, which was ignored. So one final time to hopefully clarify this: We want some sort of heartbeat/dead-man's-switch functionality. You do apparently agree with that. The end-to-end variant is exactly that while validating another part of the pipeline along the way, which is great of course. Please open an issue if you want to discuss this further without spinning in circles. |
I don't think there's enough standardisation in this space to know how to approach this. I've come across about 5 of these from various providers, and I suspect most of them could take a webhook directly as they tend to ignore the body. So for now webhook it is I think. For this particular PR I'm against as it's approaching the problem at the wrong level. Do you think we should accept this PR? |
In the OpsGenie case, the body specifies the name of the heartbeat and cannot be left out. At this early stage adding such an exposed feature to AM is probably unwise. So no to this PR for now. |
I think there was one proposal (from you?) for json templating that didn't look too terrible. I'm mainly concerned on complexity and support grounds, as it'd be far more powerful/complex than what our current templating offers (which already confuses users) and we get to deal with arbitrarily different http response codes/bodies that need special handling. That'd take a good chunk of software engineering to get sane. |
Yes, I mentioned that before and further up in this discussion. There's http://jsonnet.org/ which seems rather sane but is lacking Go support. An alternative would be to specify a request body structure directly in the YAML configuration where you can use templated strings for values. At least this won't require people to generate valid JSON via Go templates, which are also using |
The problem with doing it via YAML is that you're restricting to structure to be completely static, so no lists or nested dicts. At what point are you okay with users being forced to use the webhook? |
Sure, you can just have a field that allows any YAML object. |
Which probably means you need all fields not to be escaped, leaving escaping up to the user. That's not likely to end well. |
Sorry, I missed the original discussion because I was on vacation. In general, I believe there is no way to be really sure your alerts are working that doesn't involve a heartbeat. The more the heartbeat deviates from "normal" alerts, the more potential for false positives and negatives there is. Having a "perfectly normal" alert fire all the time makes sure that alerts work, not just that heartbeats work. The heartbeat interval is configurable already via the normal alertmanager configuration. I understand that this PR aims to support the OpsGenie specific heartbeat functionality. Would this heartbeat work if it is configured like a normal OpsGenie receiver? In that case, this PR doesn't add anything that could not be done already using a few lines of configuration. I'd prefer to just document how to do it over additional code and configuration directives. That would also solve checking the Prometheus->Alertmanager leg, which the proposed implementation here would not. And the general approach is transferable to other integrations without having to implement it separately every time. I agree that using Dead Man's Snitch is a crutch caused by lack of support in PagerDuty. The documentation (blog post?) could also describe that. If someone has done something similar with other integrations we can add it; but given these two examples the general approach should be clear enough for everyone to be able to implement it according to their environment. Of course, a deployment with a single datacenter and a prohibition against relying on anything outside that datacenter will never be able to use any of this, but that seems like a very special case. |
PS: I would be happy to write out the documentation-in-lieu-of-code post. |
Is this one going anywhere? |
Reading back through everything, the consensus is not to accept this PR. Instead it's proposed to document how to do this properly in the form of an end-to-end test, and maybe also allow templating of webhook. |
@brian-brazil Should #679 be reopened in that case - and/or be re-phrased according to the new plan? |
This PR is an implementation of #679. |
Sorry I missed the discussion. From my memories, it wasn't possible to send fake alarms to OpsGenie to implement a heartbeat. Rather, one had to implement a dedicated HTTP call to comply with their implementation. I do agree that this PR may be overkill if webhook are expended to be able to send any kind of formatted JSON. |
Hi, Any news on this topics ? I was not able to find the "end-to-end" documentation regarding opsgenie heartbeat integration. It seems using normal OpsGenie receiver is "almost" able to do the job with the opsgenie api v1 (except that the name is not part of the POST). But for the opsgenie api v2 it seems not possible at all : https://docs.opsgenie.com/docs/heartbeat-api#pingHeartbeat (header are used) Any suggestions ? |
@talset maybe what @Nin-0 did here https://github.com/traum-ferienwohnungen/opsgenie-heartbeat-proxy is of help for you. |
@jayme-github thanks, in fact we did the same but if something is wrong with the pod, we could have an alert than we should not have. Anyway this is the only current workaround. |
@talset and anyone bumping into this thread: you can send heartbeats towards OpsGenie, the following authentication methods are supported by both systems: 1.) You can simply use the apiKey in the target URL: https://api.opsgenie.com/v2/heartbeats/heartbeatname/ping?apiKey=XXXX 2.) You can use the Basic authentication method in AlertManager. OpsGenie will accept the following: Header Key: Authorization Basically, just leave the username empty, and use the apikey as the secret |
What have you got in y our Prometheus.yml for this configuration? |
it is in AM.yml not promethteus.yml it works like this:
|
This is great. Thank you for the missing part of my the puzzle |
Got another issue like this.
this works pretty good until one of our pr changes path to templates to
that was pretty easy to pass mr cause looks very similar to good one. it seems that solution from #444 (comment) will not help with this. any ideas ? |
Thanks for this! One minor change: you have to define |
@yosefy what format is the password in?
but it didn't work for me. |
running
This change is