Collector netclass/bonding leads to scrape timeouts #1841
Comments
Are there any errors in the log? We considered timeouts before (#244) but decided against them for the kind of issues described there.
FWIW I see the same long durations for netclass on some of our Kubernetes clusters (GKE). No errors in the node_exporter logs. I had a look in /sys/class/net/ on one of the affected nodes, and there isn't an excessive number of interfaces, nor anything that seems strange when accessing files in there. We don't have any interesting configuration of the node_exporter in either affected or unaffected clusters, no filtering or such.
I tried building a minimal program that uses the same approach.
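In case it helps anyone reproduce the timing outside of node_exporter, here is a minimal sketch of such a program, assuming plain reads of the /sys/class/net attribute files rather than whatever library the comment above actually used:

```go
// timenetclass.go: time how long it takes to read every attribute file
// under /sys/class/net, similar in spirit to what the netclass collector
// does. Illustrative sketch only, not the commenter's actual program.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	start := time.Now()
	ifaces, err := os.ReadDir("/sys/class/net")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var files int
	for _, iface := range ifaces {
		dir := filepath.Join("/sys/class/net", iface.Name())
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue
		}
		for _, e := range entries {
			if e.IsDir() {
				continue // skip subdirectories such as statistics/ for brevity
			}
			// Some attribute files return errors (e.g. on virtual devices);
			// ignore them, we only care about total wall-clock time.
			if _, err := os.ReadFile(filepath.Join(dir, e.Name())); err == nil {
				files++
			}
		}
	}
	fmt.Printf("read %d attribute files from %d interfaces in %s\n",
		files, len(ifaces), time.Since(start))
}
```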
Noticed that our containers were getting CPU throttled on the affected clusters. Removed the CPU limit to tune for actual usage (which seems higher than in previous releases).
Nothing unusual, I'll try to redeploy and check.
We know about CPU limit related issues and do not use CPU limits.
Thank you for providing the information! I've read #244 and unfortunately do not see any good reasons there against adding a timeout. For example, when Prometheus hits a connection timeout for some target, only that target loses its data, and an additional metric indicates the failure. Wouldn't such a timeout improve the overall reliability of node_exporter?
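To make the idea concrete, here is a minimal sketch of what a per-collector timeout could look like, assuming a wrapper around each collector's Update call; the names below only mirror node_exporter's conventions and are not its actual implementation. The timeout value could, for example, be derived from the X-Prometheus-Scrape-Timeout-Seconds header that Prometheus sends with each scrape.

```go
// Sketch of a per-collector timeout; hypothetical code, not the project's.
package collector

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var successDesc = prometheus.NewDesc(
	"node_scrape_collector_success",
	"Whether a collector succeeded (0 on error or timeout).",
	[]string{"collector"}, nil,
)

// Collector is the usual "collect into a channel, return an error" shape.
type Collector interface {
	Update(ch chan<- prometheus.Metric) error
}

// executeWithTimeout runs one collector but gives up after `timeout`, so a
// slow collector (e.g. netclass or bonding) cannot stall the whole scrape;
// only its own metrics go missing, and the success metric records that.
func executeWithTimeout(name string, c Collector, ch chan<- prometheus.Metric, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- c.Update(ch) }()

	success := 0.0
	select {
	case err := <-done:
		if err == nil {
			success = 1
		}
	case <-ctx.Done():
		// Timed out: report the failure instead of hiding it. In a real
		// implementation the abandoned goroutine's late writes to ch would
		// have to be discarded safely, which is part of why this is not
		// trivial to add.
	}

	ch <- prometheus.MustNewConstMetric(successDesc, prometheus.GaugeValue, success, name)
}
```

How to safely drop the late writes from a collector that is still running after its deadline is probably the main design question such a change would have to answer.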
I am having a problem setting up my node_exporter; it shows an error. Did I do something wrong, or is it a problem with my OS? I use "Debian GNU/Linux 10 (buster)". Please help me.
@zoeldjian This looks like an unrelated issue; best ask on the mailing list unless you are sure it's a bug in the node-exporter.
@discordianfish WRT #1841 (comment), do you think we should add timeouts for all collectors? The timeouts would be scoped to any behaviors that would otherwise create latency for the entire metrics endpoint, and would help us curb them. I'm +1 for it, and I can send out a PR for it if this is a welcome change.
Host operating system: output of uname -a
Linux 4.4.207-1.el7.elrepo.x86_64 #1 SMP Sat Dec 21 08:00:19 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
prom/node-exporter:v1.0.1
node_exporter command line flags
Are you running node_exporter in Docker?
Yes, in k8s as a DaemonSet
What did you do that produced an error?
We're using scrape_interval: 15s and scrape_timeout: 15s on the Prometheus side, and noticed that some nodes have holes in their graphs. This turns out to be due to large scrape times from the bonding and netclass collectors, as seen in node_scrape_collector_duration_seconds:
Sometimes even like this:
If we disable these collectors, then the holes disappear (on the graphs above, after 17:30).
What did you expect to see?
Bonding collector metrics are very valuable for us; currently we have to produce the same metrics via the textfile collector and a custom script (see the sketch below). Is it possible to add some configurable timeout to node_exporter, so that at least the metrics which are ready would be returned instead of failing the whole scrape?
In this case the collectors should maybe also set node_scrape_collector_success=0, to not hide the issue. Thank you.
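For anyone working around this the same way, below is a rough sketch of such a textfile-collector script; the metric names and output path are made up for illustration and do not match node_exporter's own bonding metrics:

```go
// bonding_textfile.go: write bonding slave counts for the textfile
// collector by parsing /proc/net/bonding/*. Metric names and the output
// path are illustrative; adjust to match your setup.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	bonds, _ := filepath.Glob("/proc/net/bonding/*")
	var out strings.Builder
	out.WriteString("# TYPE custom_bonding_slaves gauge\n")
	out.WriteString("# TYPE custom_bonding_active gauge\n")

	for _, path := range bonds {
		f, err := os.Open(path)
		if err != nil {
			continue
		}
		slaves, active := 0, 0
		inSlave := false
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			line := sc.Text()
			if strings.HasPrefix(line, "Slave Interface:") {
				slaves++
				inSlave = true
			}
			// Only count MII status lines that belong to a slave section,
			// not the bond-level status at the top of the file.
			if inSlave && strings.HasPrefix(line, "MII Status:") {
				if strings.Contains(line, "up") {
					active++
				}
				inSlave = false
			}
		}
		f.Close()
		bond := filepath.Base(path)
		fmt.Fprintf(&out, "custom_bonding_slaves{master=%q} %d\n", bond, slaves)
		fmt.Fprintf(&out, "custom_bonding_active{master=%q} %d\n", bond, active)
	}

	// Write to a temp file first, then rename, so the textfile collector
	// never reads a partially written .prom file.
	tmp := "/var/lib/node_exporter/textfile/bonding.prom.tmp"
	if err := os.WriteFile(tmp, []byte(out.String()), 0o644); err == nil {
		os.Rename(tmp, strings.TrimSuffix(tmp, ".tmp"))
	}
}
```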