If the infrastructure for configured enabled_collectors is missing, keep scraping and don't stop, crash, or prevent start #1748
@Nachtfalkeaw are you open to testing a snapshot build? I would like to get some feedback on whether it meets your expectations. Currently, I have option A in focus. B may be too complicated. It might not help in situations of temporary failure, where users try to fix an issue that prevents metrics collection but might not get any feedback. It also adds complexity that needs additional maintenance. As far as I know, node_exporter works like option A.
This leads to issues like #1743 where
How do you define "working"? It's not as if windows_exporter is calling a separate binary. Once a collector is enabled, it always runs.
Hello @jkroepke
If the collector can parse the config file correctly and start as expected, but the scrape result is "no infrastructure", then we probably do not need a log entry saying that no infrastructure exists, because that would spam the logs on every scrape. However, this may still be relevant for someone to know, so it should be behind a logging option/level like "info" or maybe "debug". But if the collector's config file is correct and it started correctly, yet the scrape failed because the scraped data has invalid syntax, format, etc., we should get a log entry. In other words:
I will keep the logs at warning level. Users that do not want to see them can set the log level to ERROR.
In https://github.com/prometheus-community/windows_exporter/releases/tag/v0.30.0-beta.5 collector failures are no longer logged on each scrape. However, collector failures are logged once as a warning on startup. A few exceptions exist for unexpected errors, which are still logged as a warning on each scrape.
Hello, I think the mssql metric and log do not match. I do not have mssql on my home computer, so a failed collector is fine, but the metrics seem to be wrong.
Maybe the same for textfile:
It looks like the error handling for a wrong (?) config is not working, or I did it wrong:
include: [.*alloy.*] leads to all "windows_process_" metrics for every process being exported. I would expect (a) a wrong-config error, or (b) only the alloy process, or (c) nothing, but I would not expect metrics for all processes to be exported/collected. Same behaviour if I remove the "exclude" part.
Can you tell me what I should focus on for further tests?
Hi, thanks for testing the changes. Regarding textfile, I raised a fix in #1775. Regarding the process collector, the config file is wrong. Omit the brackets on include:

```yaml
collector:
  process:
    include: .*alloy.*
    exclude: "Idle"
```
I would like to know whether the current implementation meets your expectations once the textfile and mssql collectors are fixed. I also run the exporter on a client machine and got notified of
I was sure the syntax was wrong, but what I wanted to say is that with this wrong syntax in "include" and the correct one in "exclude", it listed all available process metrics for all processes on the system. I would expect windows_exporter to complain at start if there is wrong syntax, for mssql and the other collectors as well. I did some more tests, and from what I can see it looks good.
Problem Statement
The windows_exporter and Grafana Alloy prometheus.exporter.windows do not allow enabling collectors if the infrastructure where windows_exporter/Alloy is installed does not support them. The result is that the exporter/Alloy does not start, stops working, or (partially) stops scraping metrics.
From a discussion on the Grafana Alloy Slack, this is expected behaviour / an intentional design decision in windows_exporter. It is meant to prevent someone from installing the exporter on a system and expecting to get metrics for a specific enabled_collector while that collector is not working because the infrastructure is missing.
The problem I am facing is the following:
In our company we have many different IT departments responsible for different tasks. There are people responsible only for monitoring solutions. Then we have people responsible only for Windows Server, others for Linux, others for databases, and so on.
If a customer (employee) wants to deploy their own application, this customer orders e.g. a Linux server and a Windows server from the IT departments responsible for that, and deploys their own applications on top. The customer is responsible for their application; other IT departments are responsible for Linux and Windows, and another one for the monitoring solution.
The department running the monitoring solution does not know which application will be installed on the server.
To deploy Grafana Alloy/windows_exporter we have yet another department who makes sure the deployment package fulfills all requirements, has the correct system user permissions, is compatible with different versions of Windows and Linux, and is added to the company's shop to make it available for users to "order" and install.
For that reason it is important that I can prepare a configuration file which allows all collectors to be enabled, because I do not know which applications will run on the server later. I do not know if it is on VMware or Hyper-V. I do not know if there is a database on it or not.
The goal is that all collectors can be enabled and all metrics are collected where the infrastructure exists. If the infrastructure does not exist, the configured enabled_collector is skipped.
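To make this concrete, a universal configuration file of the kind described here might look like the sketch below, using the windows_exporter YAML config format. The collector names are real windows_exporter collectors, but the exact selection is illustrative, not a recommendation.

```yaml
# Sketch of a "universal" config that enables collectors broadly,
# regardless of which applications end up on the machine.
# Collector names are real; the selection is illustrative.
collectors:
  enabled: cpu,logical_disk,memory,mssql,net,os,process,service,system,textfile
```

Today, a host without e.g. MSSQL breaks under such a config; under this proposal, the mssql collector would simply be reported as failed or skipped while everything else keeps working.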
In the end I only want to be sure that the exporter/Alloy is running; I do not care which metrics are collected or whether the user needs them. If the user wants the metrics, they are there and can be visualized; if not, it does not hurt anyone. If someone uninstalls an "application" and Alloy/windows_exporter cannot scrape it anymore because the infrastructure is no longer available, it should not crash but keep scraping what is possible.
It is impossible to maintain different individual configurations for hundreds and thousands of systems. It is impossible for a monitoring-only team to always know which user is using which application, and the users often do not know how the monitoring solution works, nor do they care. They just want to open the dashboard in Grafana, and that's it.
Proposed Solution
These are my ideas which may help in this situation:
a) The exporter scrapes the configured enabled_collectors selected in the config file. If the infrastructure does not exist, that collector's scrape fails (windows_exporter_collector_success: 0) while the other collectors are scraped correctly (windows_exporter_collector_success: 1).
b) The exporter tries to scrape the selected collectors in the config. If the scrape fails e.g. 10 times (or a configurable number of scrapes), the collector is disabled (i) for a configurable amount of time (e.g. disabled for 3h, then tried 10x again) or (ii) until a restart/reload of the exporter.
This could be put behind an additional flag which allows keeping the existing behaviour or opting into a variant of the new behaviour.
I would prefer option/variant b) (i) because it would reduce the noise in the log if a scrape failed. It would fail 10x for several collectors, generating logs, but then it would be quiet for the specified amount of time or until a restart of the service. A hypothetical configuration sketch for this follows below.
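Purely to make option b) (i) concrete, a configuration for it might look like the following sketch. None of these keys exist in windows_exporter today; they are hypothetical and only illustrate the proposal.

```yaml
# Hypothetical sketch: these keys are NOT part of windows_exporter.
# They only illustrate how option b) (i) could be configured.
collectors:
  failure-handling:
    max-consecutive-failures: 10  # disable a collector after 10 failed scrapes
    retry-after: 3h               # re-enable and retry after 3 hours
    on-restart: reset             # a reload/restart re-enables all collectors
```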
Maybe an option to "disable logging when windows_exporter_collector_success is 0" could help. If the scrape was not successful but the exporter/collector itself was "working", we probably do not need a Windows event log entry, or only one at "info" level. However, if the collector was not able to run for some other reason, it should still generate a log entry.
Additional information
Compared to the prometheus node_exporter, this is the same behaviour: node_exporter scrapes what is available, skips what is not available, and reports scrape success as "0" (node_scrape_collector_success).
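With option a), a failed collector stays visible in the metrics, so teams that do care can still alert on it. Below is a minimal Prometheus alerting-rule sketch, assuming the windows_exporter_collector_success metric keeps its collector label; the duration and labels are illustrative assumptions.

```yaml
# Sketch of an alerting rule for failed collectors; the 30m duration
# and severity label are illustrative assumptions.
groups:
  - name: windows-exporter-collectors
    rules:
      - alert: WindowsCollectorFailing
        expr: windows_exporter_collector_success == 0
        for: 30m
        labels:
          severity: info
        annotations:
          summary: 'Collector {{ $labels.collector }} on {{ $labels.instance }} keeps failing'
```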
This would allow universal configuration files independent of the infrastructure the exporter is installed on.
It would allow the exporter to keep working even if the underlying infrastructure changed because the end user changed something.
This is the link to the discussion.
https://grafana.slack.com/archives/C01050C3D8F/p1732025962482719
Acceptance Criteria