Dead Man Switches (NON-OpsGenie Heartbeats!) #1542

jnovack · 2018-09-06T16:55:13Z

I've read quite a number of issues in a few of the prom repositories regarding health/heartbeating between alert-manager and the alerting-system (e.g. OpsGenie, Slack, JSON Endpoint). I am NOT looking to monitor my alerting system.

How is this different from #444 or #679?

I'm not looking to monitor alertmanager (the core use-case of #444 and #679). I am looking to monitor a downstream service BY alert-manager.

What did you want it to do?

I'm looking for a Dead-Man Switch for a service. My use-case has no output and generates no metrics (to scrape). I merely wish to call a URL every interval.

What did you expect to see?

If I do NOT call the URL within a defined interval, an alert (as defined under alert-manager) is created. When the URL is called again, the alert is resolved.

Why can't prometheus do it?

I have no metrics, I have no output. I merely need to confirm "something happened". Whatever that something is.

Write an exporter, bring up an endpoint, have prometheus scrape it, then write a rule in alert-manager checking for value=0.

You are kidding, right? I believe this is a valid use-case that warrants a feature, rather than a work-around.

Ok, ok. What is this like? What can I relate this to?

Dead Man's Snitch or StatusCake's PUSH Alert.

When you create a switch, a timer starts. If the URL provided is NOT called before the timer runs out, the alert is generated. When it is finally called, the alert is resolved.
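To make the semantics above concrete, here is a minimal in-memory sketch of such a switch. The class name `DeadManSwitch`, its methods, and the injectable `clock` parameter are all hypothetical and only illustrate the timer behaviour; this is not how Dead Man's Snitch, StatusCake, or alert-manager implement it.

```python
import time


class DeadManSwitch:
    """Hypothetical dead man's switch: alerts once check-ins stop arriving."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self._clock = clock
        # The timer starts as soon as the switch is created.
        self._last_check_in = clock()

    def check_in(self):
        """Record a check-in (the "URL was called" event) and restart the timer."""
        self._last_check_in = self._clock()

    def is_alerting(self):
        """Return True if no check-in arrived within the interval."""
        return self._clock() - self._last_check_in > self.interval_seconds
```

A real service would call `check_in()` from an HTTP handler and poll `is_alerting()` on a schedule; the injectable clock is only there so the behaviour can be exercised without waiting.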

Gimme a use-case...

Any random shell-script runs periodically. It may do something, it may crash. At the end, it calls a URL to check-in that it ran. Successfully or not, that is not your concern. The script reached the callURL() function, and thus has completed.

Clearly your script needs metrics or this has to be re-written. Why would you depend on something that just "ran" without measuring it?

Not every piece of software is as well designed or coded as prometheus is. Some shittier software (the kind made equally by smaller, less-agile developers and larger international companies) does not cater to providing metrics, endpoints or integrations for third-party use.

This is changing the argument from "how can we implement this feature if it's useful" to "why can't you do it differently so it fits within the model of the already established framework."

brian-brazil · 2018-09-06T17:06:33Z

Thanks for your suggestion.

an alert (as defined under alert-manager) is created. When the URL is called again, the alert is resolved.

It's the role of Prometheus to contain alerting logic, not the Alertmanager - the Alertmanager manages alerts which have already been sent to it. All alerting thresholds and the like live in Prometheus.

Any random shell-script runs periodically. It may do something, it may crash. At the end, it calls a URL to check-in that it ran.

If this is a cluster-level batch job you want the Pushgateway combined with an alert on push_time_seconds. If it's a machine-level batch job, then touch a file for the Node Exporter's textfile collector and alert on the node_textfile_mtime_seconds metric.
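As a rough sketch of the machine-level option, the script's final "check-in" step could write a file into the directory the Node Exporter's textfile collector reads, instead of calling a URL. Everything named here is an assumption for illustration: the function names, the metric `my_script_last_run_timestamp_seconds`, the `script` label, and the directory you pass in (it must match whatever directory your Node Exporter is configured to read). The `is_stale` helper is only a local analogue of the comparison an alert on `node_textfile_mtime_seconds` would make; the real alert rule lives in Prometheus.

```python
import os
import tempfile
import time


def write_heartbeat(textfile_dir, script_name):
    """Atomically write <script_name>.prom so the collector sees a fresh mtime."""
    final_path = os.path.join(textfile_dir, f"{script_name}.prom")
    # Write to a temp file in the same directory, then rename it into place,
    # so the collector never reads a half-written file. Only *.prom files
    # are picked up, so the .tmp file is ignored while it exists.
    fd, tmp_path = tempfile.mkstemp(dir=textfile_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        # Hypothetical metric name and label, chosen for this example only.
        handle.write(
            f'my_script_last_run_timestamp_seconds{{script="{script_name}"}} '
            f"{time.time()}\n"
        )
    # mkstemp creates the file as 0600; relax it so a Node Exporter running
    # as a different user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, final_path)
    return final_path


def is_stale(path, max_age_seconds, now=None):
    """Local analogue of the alert condition: now - mtime > threshold."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) > max_age_seconds
```

For a cluster-level job, the same idea applies with a push to the Pushgateway in place of the file write, and the alert compares against `push_time_seconds` instead.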

jnovack · 2018-09-06T17:29:11Z

You are correct! I have not investigated that yet.

Thank you for your quick response, and my apologies for misunderstanding the architecture. There are SO many pieces to Prometheus (not a bad thing) that I didn't have a full grasp of the distributed system.

jnovack closed this as completed Sep 6, 2018