Collector netclass/bonding leads to scrape timeouts #1841

Open
sepich opened this issue Sep 8, 2020 · 10 comments

@sepich

sepich commented Sep 8, 2020

Host operating system: output of uname -a

Linux 4.4.207-1.el7.elrepo.x86_64 #1 SMP Sat Dec 21 08:00:19 EST 2019 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

prom/node-exporter:v1.0.1

node_exporter command line flags

            - --path.procfs=/host/proc
            - --path.sysfs=/host/sys
            - --path.rootfs=/rootfs
            - --collector.netclass.ignored-devices=^(lo|docker[0-9]|kube-ipvs0|dummy0|kube-dummy-if|veth.+|br\-.+|cali\w{11}|tunl0|tun\-.+)$
            - --collector.netdev.device-blacklist=^(lo|docker[0-9]|kube-ipvs0|dummy0|kube-dummy-if|veth.+|br\-.+|cali\w{11}|tunl0|tun\-.+)$
            - --collector.filesystem.ignored-mount-points=^/(dev|sys|proc|host|etc|var/lib/kubelet|var/lib/docker/.+|home/.+|data/local-pv/.+)($|/)
            - --collector.filesystem.ignored-fs-types=^(autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|efivarfs|tmpfs|nsfs|fusectl|hugetlbfs|mqueue|overlay|proc|procfs|pstore|rootfs|rpc_pipefs|securityfs|sysfs|tracefs)$
            - --collector.diskstats.ignored-devices=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p|dm-|sr|nbd)\d+$
            - --collector.netstat.fields=^(.*_(InErrors|InErrs)|Ip_Forwarding|Ip(6|Ext)_(InOctets|OutOctets)|Icmp6?_(InMsgs|OutMsgs)|TcpExt_(Listen.*|Syncookies.*|TCPSynRetrans|TCPRcvCollapsed|PruneCalled|RcvPruned)|Tcp_(ActiveOpens|InSegs|OutSegs|PassiveOpens|RetransSegs|CurrEstab)|Udp6?_(InDatagrams|OutDatagrams|NoPorts|RcvbufErrors|SndbufErrors))$
            - --no-collector.systemd
            - --no-collector.bcache
            - --no-collector.infiniband
            - --no-collector.wifi
            - --no-collector.ipvs

Are you running node_exporter in Docker?

Yes, in k8s as a DaemonSet

What did you do that produced an error?

We're using scrape_interval: 15s and scrape_timeout: 15s on the Prometheus side, and noticed that some nodes have holes in their graphs:
[screenshot]
This turns out to be caused by long scrape times in the bonding and netclass collectors (node_scrape_collector_duration_seconds):
[screenshot]
Sometimes it is even worse:

# time curl -s localhost:9100/metrics >/dev/null

real	0m42.589s
user	0m0.003s
sys	0m0.005s

If we disable these collectors:

            - --no-collector.bonding
            - --no-collector.netclass

Then the holes disappear (visible on the graphs above after 17:30).

What did you expect to see?

Bonding collector metrics are very valuable for us; currently we have to produce the same metrics via the textfile collector and a custom script (a rough sketch of that workaround is below).
Would it be possible to add a configurable timeout to node_exporter, so that at least the metrics that are ready are returned instead of the whole scrape failing? In that case the affected collectors should probably also set node_scrape_collector_success=0, so the issue isn't hidden.
Thank you.
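
For reference, the workaround looks roughly like this (a simplified sketch, not our actual script: it counts a slave as active via its operstate, whereas node_exporter's bonding collector uses MII status, and the textfile directory path is just an example):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// textfileDir is an example path; point it at whatever directory
// --collector.textfile.directory is set to in your deployment.
const textfileDir = "/var/lib/node_exporter/textfile"

func main() {
	tmp := filepath.Join(textfileDir, "bonding.prom.tmp")
	out, err := os.Create(tmp)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Every bonding master lists its slaves in /sys/class/net/<bond>/bonding/slaves.
	masters, _ := filepath.Glob("/sys/class/net/*/bonding/slaves")
	for _, path := range masters {
		master := filepath.Base(filepath.Dir(filepath.Dir(path)))
		data, err := os.ReadFile(path)
		if err != nil {
			continue
		}
		slaves := strings.Fields(string(data))
		active := 0
		for _, s := range slaves {
			// Approximation: count a slave as active when its operstate is "up"
			// (node_exporter's bonding collector checks MII status instead).
			state, err := os.ReadFile(filepath.Join("/sys/class/net", s, "operstate"))
			if err == nil && strings.TrimSpace(string(state)) == "up" {
				active++
			}
		}
		fmt.Fprintf(out, "node_bonding_slaves{master=%q} %d\n", master, len(slaves))
		fmt.Fprintf(out, "node_bonding_active{master=%q} %d\n", master, active)
	}
	out.Close()

	// Rename atomically so node_exporter never reads a half-written file.
	if err := os.Rename(tmp, filepath.Join(textfileDir, "bonding.prom")); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}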

@discordianfish
Member

Are there any errors in the log? We considered timeouts before in #244 but decided against them for the kinds of issues described there.
I'm wondering what is going on here. This is the first report of this I've seen, so I'd like to understand why it takes so long first.

@carlpett
Member

carlpett commented Sep 18, 2020

FWIW, I see the same long durations for netclass on some of our Kubernetes clusters (GKE). No errors in the node_exporter logs.
Affected clusters are running 1.0.0 on Google COS, kernel 4.19.112+.
Unaffected clusters are running 0.18.0 on Google COS, kernel 4.14.174+. I tried bumping one cluster to 1.0.0, and while the average collection time increased from 10 ms to 15 ms, it isn't exactly timing out...

I had a look in /sys/class/net/ on one of the affected nodes, and there isn't an excessive number of interfaces, nor anything that seems strange when accessing the files in there.

We don't have any interesting node_exporter configuration in either the affected or unaffected clusters, no filtering or anything like that.

@carlpett
Member

I tried building a minimal program that uses the same sysfs package to read netclass and time it. It takes a handful of milliseconds to run on one of the affected nodes, so it isn't anything obvious with the actual reading.
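
Something along these lines (a minimal sketch using the github.com/prometheus/procfs/sysfs package the netclass collector builds on; not the exact program):

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/prometheus/procfs/sysfs"
)

func main() {
	fs, err := sysfs.NewFS("/sys")
	if err != nil {
		log.Fatal(err)
	}

	start := time.Now()
	// NetClass walks /sys/class/net and reads every attribute file of every
	// interface, which is what the netclass collector does on each scrape.
	nc, err := fs.NetClass()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("read %d interfaces in %s\n", len(nc), time.Since(start))
}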

@carlpett
Member

We noticed that our containers were getting CPU throttled on the affected clusters, so we removed the CPU limit until we can tune it for actual usage (which seems higher than in previous releases).
@sepich do you have CPU limits set on your DaemonSet?

@sepich
Author

sepich commented Sep 18, 2020

Are there any errors in the log?

Nothing unusual; I'll try to redeploy and check the --log.level=debug logs.

do you have cpu limits set on your daemonset?

We know about the CPU-limit-related issues and do not use CPU limits.
But the point of the original question is more about the generic case of the whole scrape being dropped due to a single slow collector.

but decided against it for the kind of issues described there.

Thank you for the pointer! I've read #244 and unfortunately don't see any good argument there against a context.WithTimeout for each individual collector.
Yes, that would leak goroutines in the case of a stuck NFS mount, but that's not the only issue that can happen, as shown above.
I'm not suggesting we hide a problematic NFS mount by still returning its metrics; in such a case I'd just set node_scrape_collector_success=0 for the collector in question (or something like that), roughly as sketched below.

For example, when Prometheus hits a connection timeout on some target, only that target loses data, and an additional metric (up=0) is set to show the issue. The connection timeout does not bring the whole Prometheus down, nor does it affect scraping of its other targets.

Wouldn't such a timeout improve the overall reliability of node_exporter?
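
A minimal sketch of the idea (not the actual node_exporter code; the Collector interface and the node_scrape_collector_success descriptor here just mirror names from the collector package):

package collector

import (
	"context"

	"github.com/prometheus/client_golang/prometheus"
)

// Collector mirrors the node_exporter collector interface.
type Collector interface {
	Update(ch chan<- prometheus.Metric) error
}

var scrapeSuccessDesc = prometheus.NewDesc(
	"node_scrape_collector_success",
	"Whether a collector succeeded.",
	[]string{"collector"}, nil,
)

// execute runs a single collector and gives up once ctx expires, so one slow
// collector no longer stalls the whole scrape. A collector stuck in the kernel
// (e.g. on a hung NFS mount) still leaks its goroutine, as noted above.
func execute(ctx context.Context, name string, c Collector, ch chan<- prometheus.Metric) {
	done := make(chan error, 1)
	go func() { done <- c.Update(ch) }()

	success := 1.0
	select {
	case err := <-done:
		if err != nil {
			success = 0
		}
	case <-ctx.Done(): // e.g. a ctx created with context.WithTimeout per scrape
		success = 0
	}
	ch <- prometheus.MustNewConstMetric(scrapeSuccessDesc, prometheus.GaugeValue, success, name)
}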

@discordianfish
Member

I'm not against adding timeouts to collectors in general. @SuperQ @pgier wdyt?
Also, it would help if we had more detailed collector duration metrics in these cases...

@zoeldjian

zoeldjian commented Oct 7, 2020

I am having a problem setting up my node_exporter; it says:
Oct 07 09:05:28 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:28Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:33 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:33Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:38 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:38Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:43 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:43Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:48 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:48Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:53 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:53Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:05:58 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:05:58Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:06:03 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:06:03Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:06:08 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:06:08Z" level=error msg="ERROR: diskstats collector failed a
Oct 07 09:06:13 ip-172-26-5-229 node_exporter[10227]: time="2020-10-07T09:06:13Z" level=error msg="ERROR: diskstats collector failed a

Did I do something wrong, or is this a problem with my OS? I am using Debian GNU/Linux 10 (buster).

Please help me.

@discordianfish
Member

@zoeldjian This looks like an unrelated issue; best to ask on the mailing list unless you are sure it's a bug in the node_exporter.

@rexagod
Contributor

rexagod commented Mar 19, 2024

@discordianfish WRT #1841 (comment), do you think we should add timeouts for all collectors? The timeouts would target (and help us curb) any behaviors that would otherwise add latency to the entire metrics backend.

I'm +1 on it, and I can send out a PR for this if it's a welcome change.

@discordianfish
Member

@rexagod Yeah I was just mentioning this in #2960 - sounds like a good idea.
