Dead Man Switches (NON-OpsGenie Heartbeats!) #1542

Closed
jnovack opened this issue Sep 6, 2018 · 2 comments

Comments

jnovack commented Sep 6, 2018

I've read quite a number of issues in a few of the prom repositories regarding health/heartbeating between alert-manager and the alerting-system (e.g. OpsGenie, Slack, JSON Endpoint). I am NOT looking to monitor my alerting system.

How is this different from #444 or #679?

I'm not looking to monitor alertmanager (the core use-case of #444 and #679). I am looking to monitor a downstream service BY alert-manager.

What did you want it to do?

I'm looking for a Dead-Man Switch for a service. My use-case has no output, it generates no metrics (to scrape). I merely wish to call a URL every interval.

What did you expect to see?

If I do NOT call the URL within a defined interval, an alert (as defined under alert-manager) is created. When the URL is called again, the alert is resolved.

Why can't prometheus do it?

I have no metrics, I have no output. I merely need to confirm "something happened". Whatever that something is.

Write an exporter, bring up an endpoint, have prometheus scrape it, then write a rule in alert-manager checking for value=0.

You are kidding, right? I believe this is a valid use-case that warrants a feature, rather than a work-around.

Ok, ok. What is this like? What can I relate this to?

Dead Man's Snitch or StatusCake's PUSH Alert.

When you create a switch, a timer starts. If the URL provided is NOT called before the timer runs out, the alert is generated. When it is finally called, the alert is resolved.
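To make the requested behaviour concrete, here is a minimal sketch of that timer logic in Python. The `DeadManSwitch` class, its method names, and the injectable `clock` are all hypothetical and exist only for illustration; nothing like this is part of Alertmanager or Prometheus.

```python
import time


class DeadManSwitch:
    """Hypothetical in-memory dead man's switch (illustration only)."""

    def __init__(self, interval_seconds, clock=time.monotonic):
        self.interval_seconds = interval_seconds
        self._clock = clock
        # The timer starts as soon as the switch is created.
        self._last_check_in = clock()

    def check_in(self):
        """Record a check-in (the URL was called), restarting the timer."""
        self._last_check_in = self._clock()

    def is_firing(self):
        """Return True if no check-in arrived within the interval."""
        return self._clock() - self._last_check_in > self.interval_seconds
```

A check-in after the timer has run out makes `is_firing()` return False again, which corresponds to the alert being resolved.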

Gimme a use-case...

Any random shell-script runs periodically. It may do something, it may crash. At the end, it calls a URL to check-in that it ran. Successfully or not, that is not your concern. The script reached the callURL() function, and thus has completed.

Clearly your script needs metrics or this has to be re-written. Why would you depend on something that just "ran" without measuring it?

Not every piece of software is as well designed or coded as prometheus is. Some shittier software (the kind made equally by smaller, less-agile developers or larger international companies) does not cater to providing metrics, endpoints or integrations for third-party use.

This is changing the argument from "how can we implement this feature if it's useful" to "why can't you do it differently so it fits within the model of the already established framework."

@brian-brazil
Contributor

Thanks for your suggestion.

a alert (as defined under alert-manager is created. When the URL is called again, the alert is resolved.

It's the role of Prometheus to contain alerting logic, not the Alertmanager - the Alertmanager manages alerts which have already been sent to it. All alerting thresholds and the like live in Prometheus.

Any random shell-script runs periodically. It may do something, it may crash. At the end, it calls a URL to check-in that it ran.

If this is a cluster-level batch job you want the Pushgateway combined with an alert on push_time_seconds. If it's a machine-level batch job, then touch a file for the Node Exporter's textfile collector and alert on the node_textfile_mtime_seconds metric.
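As a rough sketch of what the check-in at the end of a script could look like for both options, using only the Python standard library: the function names, the `batch_job_last_run_timestamp_seconds` metric, its `script` label, and the example host and job names are assumptions made for illustration. The first function only builds the Pushgateway request; sending it would be `urllib.request.urlopen(request)`. The alert itself still lives in a Prometheus rule, for example an expression along the lines of `time() - push_time_seconds{job="nightly_backup"} > 3600`, or `time() - node_textfile_mtime_seconds > 3600` narrowed down by its `file` label, with the threshold chosen to fit the job's schedule.

```python
import os
import tempfile
import time
import urllib.parse
import urllib.request


def build_pushgateway_request(base_url, job):
    """Build (but do not send) a push for a cluster-level batch job."""
    # One sample in the text exposition format; the metric name is made up.
    body = f"batch_job_last_run_timestamp_seconds {time.time():.3f}\n"
    url = f"{base_url}/metrics/job/{urllib.parse.quote(job, safe='')}"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def write_textfile_check_in(directory, name):
    """Atomically (re)write <name>.prom for a machine-level batch job.

    Assumes `directory` is the one the Node Exporter's textfile collector
    reads, and that `name` needs no escaping as a label value.
    """
    path = os.path.join(directory, f"{name}.prom")
    # Write to a temporary file without the .prom suffix first, so the
    # collector never sees a half-written file.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(
            f'batch_job_last_run_timestamp_seconds{{script="{name}"}} '
            f"{time.time():.3f}\n"
        )
    # mkstemp creates the file readable by its owner only; relax that so a
    # Node Exporter running as a different user can read it.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
    return path
```

Either way the script only has to reach its last line, which matches the "it calls a URL to check in" use-case: the push or the file write refreshes a timestamp, and Prometheus raises the alert when that timestamp gets too old.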

jnovack commented Sep 6, 2018

You are correct! I have not investigated that yet.

Thank you for your quick response, and my apologies for the misunderstanding of the architecture. There's SO many pieces to Prometheus (not a bad thing), I didn't have a full grasp of the distributed system.

jnovack closed this as completed Sep 6, 2018