[feature request] probe (module scoped) failure reason metric #1077

Open
domcyrus opened this issue Jun 6, 2023 · 8 comments
@domcyrus

domcyrus commented Jun 6, 2023

probe (module scoped) failure reason metric

Current Situation

Currently, the blackbox_exporter provides some general metrics, such as probe_success and probe_duration_seconds, that apply to all modules. In addition, specific modules such as the http module (prober) offer their own metrics, for example probe_http_status_code, which help monitor the availability and performance of HTTP endpoints. However, when a probe fails, it can be hard to pinpoint the exact cause without manually inspecting the blackbox_exporter logs or attempting to reproduce the error, if that is possible at all.

Proposal

To address this, we propose adding a new metric called probe_($MODULE)_failure_reason to the blackbox_exporter. This metric would provide more detailed information about the reasons behind probe failures. It would carry a label named "reason" with descriptive, enumerable values such as "dns-resolution-error", "http-timeout", or "ssl-certificate-validation-failed". Currently, these failure reasons can only be inferred from the logged errors.

Benefits

The introduction of the probe_($MODULE)_failure_reason metric would significantly enhance troubleshooting capabilities. In most cases users would be able to identify the root cause of a probe failure without the need for manual log inspection or additional testing. Moreover, this new metric would facilitate the setup of alerts and notifications tailored to specific failure scenarios.

Contribution

We believe that incorporating the probe_($MODULE)_failure_reason metric would be a valuable enhancement for the blackbox_exporter, improving its usability and effectiveness. We would be happy to contribute to the development of this feature and provide feedback on its implementation.

Thank you for considering our proposal. If this would be OK to go forward with, we’d love to contribute the functionality to blackbox_exporter.

@druanoor
Contributor

Relates to this: #1062

domcyrus pushed a commit to domcyrus/blackbox_exporter that referenced this issue Oct 19, 2023
domcyrus pushed a commit to domcyrus/blackbox_exporter that referenced this issue Oct 23, 2023
Signed-off-by: Marco Cadetg <marco.cadetg@amazee.io>
@slrtbtfs

slrtbtfs commented Dec 3, 2024

Hi, I'm a bit confused as to why this is marked closed as completed, as it doesn't seem to have been merged.

That being said, I'd be very happy to see a feature like this in blackbox exporter and am thankful for the work you put into it!

@beorn7
Member

beorn7 commented Dec 3, 2024

> Hi, I'm a bit confused as to why this is marked closed as completed,

I guess because that is the default if you simply hit the "close" button. :)

Maybe the actual meaning was "I won't have time to work on this anymore" or "I never got feedback from the maintainers, so I gave up."

@domcyrus
Author

domcyrus commented Dec 3, 2024

> > Hi, I'm a bit confused as to why this is marked closed as completed,
>
> I guess because that is the default if you simply hit the "close" button. :)

Yes, sorry, this is what I did.

> Maybe the actual meaning was "I won't have time to work on this anymore" or "I never got feedback from the maintainers, so I gave up."

It was the second, so I assumed the feature might simply not be interesting or needed by anyone. If that is not the case, it is still possible to reopen it.

@SuperQ SuperQ reopened this Dec 3, 2024
@slrtbtfs

slrtbtfs commented Dec 3, 2024

Thanks for reopening!

I think this feature would be great to have and would significantly improve the service my team can provide.

One minor change I would propose is to call the resulting metric just probe_failure_reason instead of probe_{module}_failure_reason. This makes it easier on the Prometheus side to write queries that work across multiple blackbox modules.

@roidelapluie @mem @electron0zero Would you be interested, in principle, in accepting contributions for such a feature?

(I feel a bit bad pinging all the maintainers, but that was suggested on the prometheus-developers mailing list, so I hope it's OK.)

@SuperQ
Member

SuperQ commented Dec 3, 2024

I think this would be a useful metric.

There was a request on Slack for a probe_timeout boolean metric that indicates whether the probe timeout was reached.

This implementation may also be useful

@SuperQ
Member

SuperQ commented Dec 3, 2024

Minor nit, I would probably call it probe_failure_info. Not sure we need a per-module variation.
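For readers unfamiliar with the `_info` naming pattern, the sketch below shows what a single module-independent sample could look like in the Prometheus text exposition format: a constant value of 1 with the detail carried in a label. The helper `failureInfoLine` is hypothetical, and the "reason" label name is borrowed from the original proposal; whether an eventual implementation uses this shape is not settled in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// labelEscaper escapes the three characters that the Prometheus text
// exposition format requires to be escaped inside label values.
var labelEscaper = strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

// failureInfoLine renders one sample of a hypothetical probe_failure_info
// metric. Info-style metrics conventionally have the constant value 1.
func failureInfoLine(reason string) string {
	return fmt.Sprintf(`probe_failure_info{reason="%s"} 1`, labelEscaper.Replace(reason))
}

func main() {
	fmt.Println(failureInfoLine("dns-resolution-error"))
	// prints: probe_failure_info{reason="dns-resolution-error"} 1
}
```

Because the metric name does not include the module, one query or alert expression can match failures from every module and filter on the label instead.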

@slrtbtfs

I did a draft implementation of this in #1334 and would appreciate some feedback on whether you think the general approach taken there is viable.
