If infrastructure of configured enabled_collectors is missing keep scraping and don't stop,crash or prevent start #1748

Nachtfalkeaw · 2024-11-19T20:46:16Z

Problem Statement

The windows_exporter and Grafana Alloy's prometheus.exporter.windows do not allow collectors to be enabled if the system where windows_exporter/Alloy is installed does not support them. As a result, the exporter/Alloy does not start, stops working, or (partially) stops scraping metrics.

From a discussion on the Grafana Alloy Slack, this is expected behaviour and an intended design decision in windows_exporter. It is meant to prevent someone from installing the exporter on a system and expecting metrics from a specific enabled collector when that collector cannot work because the required infrastructure is missing.

The problem I am facing is the following:
In our company we have many different IT departments responsible for different tasks. There are people only responsible for monitoring solutions. Then we have people only responsible for Windows Server and others for Linux and others for databases, ...

If a customer (an employee) wants to deploy their own application, they order e.g. a Linux server and a Windows server from the IT departments responsible for those. They then deploy their own applications on top. The customer is responsible for the application; however, other IT departments are responsible for Linux and Windows, and yet another for the monitoring solution.

The department running the monitoring solution does not know which application will be installed on the server.

To deploy Grafana Alloy/windows_exporter we have another department that makes sure the deployment package fulfills all requirements, has the correct system user permissions, is compatible with different versions of Windows and Linux, and is added to the company's shop so that users can "order" and install it.

For that reason it is important that I can prepare a configuration file which allows all collectors to be enabled, because I do not know which applications will run on the server later. I do not know whether it runs on VMware or Hyper-V. I do not know whether there is a database on it or not.

The goal is that all collectors can be enabled and all metrics are collected if the infrastructure exists. If the infrastructure does not exist, the configured and enabled collector is skipped.

In the end I only want to be sure the exporter/Alloy is running; I do not care which metrics are collected or whether the user needs them. If the user wants the metrics, they are there and can be visualized; if not, it does not hurt anyone. If someone uninstalls an "application" and Alloy/windows_exporter can no longer scrape it because the infrastructure is gone, it should not crash but keep scraping what is possible.

It is impossible to maintain different individual configurations for hundreds and thousands of systems. It is impossible for a monitoring only team to always know which user is using which application and the users often do not know how the monitoring solution works and they do not care. They just want to open the dashboard in Grafana and that's it.

Proposed Solution

These are my ideas which may help in this situation:

a) The exporter scrapes the enabled collectors selected in the config file. If the infrastructure for a collector does not exist, that collector's scrape fails (windows_exporter_collector_success: 0) while the other collectors are scraped correctly (windows_exporter_collector_success: 1).

b) The exporter tries to scrape the collectors selected in the config. If the scrape fails e.g. 10 times (or a configurable number of scrapes), the collector is disabled until (i) a configurable amount of time has passed (e.g. disabled for 3h, then tried again 10 times) or (ii) the exporter is restarted/reloaded.
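
Option a) can be sketched roughly as follows. This is a minimal Python illustration and not windows_exporter code (the exporter itself is written in Go); `scrape_all` and the two example collectors are hypothetical names, used only to show that one failing collector is reported as 0 while the others keep being scraped.

```python
from typing import Callable, Dict, Tuple

# Hypothetical collector type: a callable returning metric name -> value.
Collector = Callable[[], Dict[str, float]]


def scrape_all(collectors: Dict[str, Collector]) -> Tuple[Dict[str, float], Dict[str, int]]:
    """Run every enabled collector; a failure only affects that collector."""
    metrics: Dict[str, float] = {}
    success: Dict[str, int] = {}
    for name, collect in collectors.items():
        try:
            metrics.update(collect())
            success[name] = 1
        except Exception:
            # Missing infrastructure (or any other error) marks this collector
            # as failed but does not abort the scrape of the others.
            success[name] = 0
    return metrics, success


def cpu() -> Dict[str, float]:
    # Stand-in for a collector whose infrastructure exists.
    return {"cpu_time_total": 42.0}


def mssql() -> Dict[str, float]:
    # Stand-in for a collector whose infrastructure is missing.
    raise RuntimeError("SQL Server performance counters not found")


metrics, success = scrape_all({"cpu": cpu, "mssql": mssql})
```

Here `success` maps each collector name to 1 or 0, mirroring what windows_exporter_collector_success would expose per collector.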

This could sit behind an additional flag that either keeps the existing behaviour or selects a variant of the new behaviour.

I would prefer variant b) (i) because it would reduce the noise in the log when a scrape fails. It would fail 10 times for several collectors, generating logs, but then it would be quiet for the specified amount of time or until a restart of the service.
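
To make variant b) (i) concrete, here is a small Python sketch of the "disable after N failures, retry after a cooldown" idea. The class name `CollectorBackoff`, its parameters, and the injectable clock are all assumptions made for illustration; nothing like this exists in windows_exporter as far as this thread shows.

```python
import time
from typing import Callable


class CollectorBackoff:
    """Disable a collector after repeated failures, retry after a cooldown."""

    def __init__(self, max_failures: int = 10, cooldown_seconds: float = 3 * 3600,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.failures = 0
        self.disabled_until = 0.0

    def should_scrape(self) -> bool:
        # The collector is skipped while the cooldown is still running.
        return self.clock() >= self.disabled_until

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            # Too many consecutive failures: pause, then start counting afresh.
            self.disabled_until = self.clock() + self.cooldown_seconds
            self.failures = 0


# Usage with a fake clock so the cooldown can be observed without waiting.
now = [0.0]
backoff = CollectorBackoff(max_failures=3, cooldown_seconds=60, clock=lambda: now[0])
for _ in range(3):
    backoff.record(ok=False)
paused = backoff.should_scrape()   # still inside the cooldown window
now[0] = 61.0
resumed = backoff.should_scrape()  # cooldown elapsed, collector is tried again
```

Variant (ii) would simply never reset `disabled_until` until the process restarts.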

Maybe an option to "disable logging if windows_exporter_collector_success could be set to 0" could help. If the scrape was not successful but the exporter/collector was "working", we may not need a Windows event log entry, or only one at "info" level. However, if the collector was not able to run for another reason, it should still generate a log.

Additional information

This would be the same behaviour as Prometheus node_exporter: it scrapes what is available, skips what is not, and reports scrape success as "0" (node_scrape_collector_success) for the skipped collectors.

This would allow universal configuration files, independent of the infrastructure on which the exporter is installed.
It would also allow the exporter to keep working even if the underlying infrastructure changed because the end user changed something.

This is the link to the discussion.
https://grafana.slack.com/archives/C01050C3D8F/p1732025962482719

Acceptance Criteria

[] windows_exporter tries to scrape all collectors and does not crash, quit, or fail to start. Instead, it keeps running.
[] If a scrape fails, "windows_exporter_collector_success" is set to 0, indicating the failure.
[] If scrapes fail repeatedly, then after a specified amount of time/retries no further logs are generated, so that the Windows event log is not spammed.

jkroepke · 2024-11-21T22:52:10Z

@Nachtfalkeaw are you open to testing a snapshot build? I would like to get some feedback on whether it meets your expectations.

Currently, I am focusing on option A. B may be too complicated. It might not be helpful in situations of temporary failure, where users try to fix an issue that prevents metrics collection but might not get any feedback. It also adds complexity, which needs additional maintenance.

As far as I know, node_exporter works like option A.

> Maybe an option to "disable logging if windows_exporter_collector_success could be set to 0" could help.

That leads to issues like #1743, where windows_exporter_collector_success is 0 and users ask what the problem is. If there are no logs present, it's impossible to provide any support. Discarding logs isn't an option here, but maybe you can just disable logging in Alloy.

> If the scrape was not successful but the exporter/collector was "working", we may not need a Windows event log entry, or only one at "info" level. However, if the collector was not able to run for another reason, it should still generate a log.

How do you define "working"? It's not as if windows_exporter is calling a separate binary. Once a collector is enabled, it always runs.

Nachtfalkeaw · 2024-11-23T21:51:24Z

Hello @jkroepke
I can try to test this in my home lab. I am not sure if I am able to test it in my business environment. I don't know what "snapshot build" means and how it differs from the general installation/execution.

> If the scrape was not successful but the exporter/collector was "working", we may not need a Windows event log entry, or only one at "info" level. However, if the collector was not able to run for another reason, it should still generate a log.

If the collector could parse the config file correctly and could start scraping as expected, but the result is "no infrastructure", then we probably do not need a log saying that no infrastructure exists, because this would spam the logs with every scrape. However, this may be relevant for someone to know, so it should sit behind a logging option/level like "info" or maybe "debug".

But if the collector's config file is correct and it started correctly, but the scrape failed because the scraped data has invalid syntax, format, ... we should get a log.

In other words:
missing infrastructure should not generate a log with every scrape for each component which has no infrastructure.
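
The distinction described above could look roughly like this. It is only a Python sketch of the idea; the `MissingInfrastructure` exception type and the `log_level_for` helper are hypothetical, and the thread does not say that windows_exporter classifies errors this way.

```python
import logging


class MissingInfrastructure(Exception):
    """Hypothetical error type: the monitored component is not installed."""


def log_level_for(error: Exception) -> int:
    """Pick a log level: quiet for missing infrastructure, loud otherwise."""
    if isinstance(error, MissingInfrastructure):
        return logging.DEBUG
    return logging.WARNING


def report(logger: logging.Logger, collector: str, error: Exception) -> None:
    # Emit the failure at the level chosen for this kind of error.
    logger.log(log_level_for(error), "collector %s failed: %s", collector, error)


quiet = log_level_for(MissingInfrastructure("Hyper-V role not installed"))
loud = log_level_for(ValueError("unexpected counter format"))
```

With the default logger configuration, the "missing infrastructure" case would then stay out of the log unless debug logging is switched on.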

jkroepke · 2024-11-24T12:20:00Z

> However, this may be relevant for someone to know, so it should sit behind a logging option/level like "info" or maybe "debug".

I will keep the logs at warning level. Users who do not want to see them can set the log level to ERROR.

jkroepke · 2024-11-25T22:01:20Z

In https://github.com/prometheus-community/windows_exporter/releases/tag/v0.30.0-beta.5 collector failures are no longer logged. However, collector failures are logged once as a warning on startup.

A few exceptions exist for unexpected errors, which are still logged as a warning on each scrape.

Nachtfalkeaw · 2024-11-26T21:23:57Z

Hello,

I think the mssql metric and log do not match. I do not have mssql on my home computer, so a failed collector is fine, but the metric seems to be wrong.

time=2024-11-26T21:36:01.702+01:00 level=WARN source=main.go:207 msg="couldn't initialize collector" err="error build collector mssql: failed to build accessmethods collector: failed to create AccessMethods collector for instance MSSQLSERVER: failed to initialize collector: failed to add counter \\SQLServer:Access Methods\\AU cleanup batches/sec: Das angegebene Objekt wurde nicht auf dem Computer gefunden.\r\n\n (.......)

(The German error text means: "The specified object was not found on the computer.")

windows_exporter_collector_success{collector="mssql"} 1

Maybe the same for textfile:

time=2024-11-26T21:36:04.934+01:00 level=ERROR source=textfile.go:359 msg="Error reading textfile Collector directory: C:\\Users\\Alexander\\Desktop\\textfile_inputs" collector=textfile err="error reading directory: CreateFile C:\\Users\\Alexander\\Desktop\\textfile_inputs: Das System kann die angegebene Datei nicht finden."

(The German error text means: "The system cannot find the file specified.")

windows_exporter_collector_success{collector="textfile"} 1

It looks like the error handling of wrong (??) config is not working or I did it wrong:

config.yml
collector:
  process:
    include: [.*alloy.*]
    exclude: "Idle"

[.*alloy.*] leads to all "windows_process_" metrics being exported for every process. I would expect (a) a config error, (b) only the alloy process, or (c) nothing, but I would not expect metrics for all processes to be exported/collected. The behaviour is the same if I remove the "exclude" part.

collectors:
  enabled: ad,adcs,adfs,cache,cpu,cpu_info,cs,container,diskdrive,dfsr,dhcp,dns,exchange,filetime,fsrmquota,hyperv,iis,license,logical_disk,logon,memory,mscluster,msmq,mssql,netframework,net,os,pagefile,perfdata,physical_disk,printer,process,remote_fx,scheduled_task,service,smb,smbclient,smtp,system,tcp,terminal_services,textfile,thermalzone,time,udp,update,vmware

# HELP windows_exporter_collector_success windows_exporter: Whether the collector was successful.
# TYPE windows_exporter_collector_success gauge
windows_exporter_collector_success{collector="ad"} 0
windows_exporter_collector_success{collector="adcs"} 0
windows_exporter_collector_success{collector="adfs"} 0
windows_exporter_collector_success{collector="cache"} 1
windows_exporter_collector_success{collector="container"} 0
windows_exporter_collector_success{collector="cpu"} 1
windows_exporter_collector_success{collector="cpu_info"} 1
windows_exporter_collector_success{collector="cs"} 1
windows_exporter_collector_success{collector="dfsr"} 0
windows_exporter_collector_success{collector="dhcp"} 0
windows_exporter_collector_success{collector="diskdrive"} 1
windows_exporter_collector_success{collector="dns"} 0
windows_exporter_collector_success{collector="exchange"} 0
windows_exporter_collector_success{collector="filetime"} 1
windows_exporter_collector_success{collector="fsrmquota"} 0
windows_exporter_collector_success{collector="hyperv"} 0
windows_exporter_collector_success{collector="iis"} 0
windows_exporter_collector_success{collector="license"} 1
windows_exporter_collector_success{collector="logical_disk"} 1
windows_exporter_collector_success{collector="logon"} 1
windows_exporter_collector_success{collector="memory"} 1
windows_exporter_collector_success{collector="mscluster"} 0
windows_exporter_collector_success{collector="msmq"} 0
windows_exporter_collector_success{collector="mssql"} 1
windows_exporter_collector_success{collector="net"} 1
windows_exporter_collector_success{collector="netframework"} 1
windows_exporter_collector_success{collector="os"} 1
windows_exporter_collector_success{collector="pagefile"} 1
windows_exporter_collector_success{collector="perfdata"} 1
windows_exporter_collector_success{collector="physical_disk"} 1
windows_exporter_collector_success{collector="printer"} 1
windows_exporter_collector_success{collector="process"} 1
windows_exporter_collector_success{collector="remote_fx"} 0
windows_exporter_collector_success{collector="scheduled_task"} 1
windows_exporter_collector_success{collector="service"} 1
windows_exporter_collector_success{collector="smb"} 1
windows_exporter_collector_success{collector="smbclient"} 0
windows_exporter_collector_success{collector="smtp"} 0
windows_exporter_collector_success{collector="system"} 1
windows_exporter_collector_success{collector="tcp"} 1
windows_exporter_collector_success{collector="terminal_services"} 0
windows_exporter_collector_success{collector="textfile"} 1
windows_exporter_collector_success{collector="thermalzone"} 0
windows_exporter_collector_success{collector="time"} 0
windows_exporter_collector_success{collector="udp"} 1
windows_exporter_collector_success{collector="update"} 1
windows_exporter_collector_success{collector="vmware"} 0
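
For a dump this long, a few lines of script make it easier to see which collectors report 0. The following Python sketch parses output in the form shown above; the function name `failed_collectors` is made up for this example, and the sample is a shortened copy of the metrics above.

```python
import re
from typing import List

# Matches lines such as: windows_exporter_collector_success{collector="ad"} 0
LINE = re.compile(
    r'^windows_exporter_collector_success\{collector="([^"]+)"\}\s+(\d+(?:\.\d+)?)$'
)


def failed_collectors(exposition: str) -> List[str]:
    """Return the collectors whose success gauge is 0 in a metrics dump."""
    failed = []
    for line in exposition.splitlines():
        match = LINE.match(line.strip())
        if match and float(match.group(2)) == 0:
            failed.append(match.group(1))
    return failed


sample = """\
# TYPE windows_exporter_collector_success gauge
windows_exporter_collector_success{collector="ad"} 0
windows_exporter_collector_success{collector="cpu"} 1
windows_exporter_collector_success{collector="vmware"} 0
"""
result = failed_collectors(sample)
```

Comment lines starting with `#` and gauges with value 1 are ignored, so `result` only contains the collectors that failed.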

Can you tell me what I should focus on for further tests?
I had all collectors enabled and it exported metrics where possible. It did not crash or show any other problem.

jkroepke · 2024-11-26T22:22:40Z

Hi,

thanks for testing the changes.

In context of textfile, I raised a fix in #1775

In context of the process collector, the config file is wrong. Omit the brackets on include.

collector:
  process:
    include: .*alloy.*
    exclude: "Idle"

> Can you tell me what I should focus on for further tests?

I would like to know whether the current implementation meets your expectations, assuming the textfile and mssql collectors are fixed.

I also ran the exporter on a client machine and noticed that smbclient is 0. In that specific case, Windows just returns no data, because I have no active SMB clients. In that case, the success metric should not be 0.
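
The smbclient case boils down to treating "no data" differently from "query failed". A minimal Python sketch of that rule, with hypothetical names (`collector_success`, `no_clients`, `broken`) and `OSError` standing in for whatever error the real query would return:

```python
from typing import Callable, List, Tuple


def collector_success(query: Callable[[], List[dict]]) -> Tuple[int, List[dict]]:
    """Return (1, rows) when the query ran, even with zero rows; (0, []) on error."""
    try:
        rows = query()
    except OSError:
        return 0, []
    # An empty result is still a successful scrape.
    return 1, rows


def no_clients() -> List[dict]:
    # Stand-in for Windows returning no data (no active SMB clients).
    return []


def broken() -> List[dict]:
    # Stand-in for a query that genuinely fails.
    raise OSError("performance counter not available")


empty_result = collector_success(no_clients)
error_result = collector_success(broken)
```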

Nachtfalkeaw · 2024-11-28T20:18:43Z

@jkroepke

> In context of the process collector, the config file is wrong. Omit the brackets on include.

I was sure the syntax was wrong, but what I wanted to say is that with this wrong syntax in "include" and the correct syntax in "exclude", it listed all available process metrics for all processes on the system.

Shouldn't windows_exporter complain at startup if the syntax is wrong?
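
A startup check of the kind I have in mind might look like the Python sketch below. It assumes the config has already been parsed; in YAML, square brackets denote a sequence, so `include: [.*alloy.*]` most likely arrives as a list rather than a string. The helper `validate_regex_option` is hypothetical and not part of windows_exporter.

```python
import re
from typing import Any


def validate_regex_option(name: str, value: Any) -> re.Pattern:
    """Reject non-string or non-compilable values for a regex option."""
    if not isinstance(value, str):
        raise ValueError(f"{name}: expected a string, got {type(value).__name__}")
    try:
        return re.compile(value)
    except re.error as exc:
        raise ValueError(f"{name}: invalid regular expression: {exc}") from exc


# Assumed parse results: a list for `include: [.*alloy.*]`,
# a plain string for `include: .*alloy.*`.
try:
    validate_regex_option("collector.process.include", [".*alloy.*"])
    bracketed_error = ""
except ValueError as exc:
    bracketed_error = str(exc)

pattern = validate_regex_option("collector.process.include", ".*alloy.*")
```

With such a check, the bracketed form would be rejected with a clear message instead of silently exporting every process.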

As for mssql and the other collectors: I did some more tests, and from what I can see it looks good.

Nachtfalkeaw added the ✨ enhancement label Nov 19, 2024

jkroepke mentioned this issue Nov 25, 2024

feat: Tolerate collector failures #1769

Merged

jkroepke closed this as completed in #1769 Nov 25, 2024

jkroepke mentioned this issue Nov 26, 2024

textfile: set windows_exporter_collector_success to 0, if an errors occurs #1775

Merged

This was referenced Nov 26, 2024

collector: don't fail if perf counters are empty. #1776

Merged

mssql: set windows_exporter_collector_success to 0, if errors occurs #1777

Merged

jkroepke mentioned this issue Dec 6, 2024

Performancecounter: fails if one or more following counters fail (failed to collect data: performance counter not initialized) #1807

Closed
