Collector netclass/bonding leads to scrape timeouts #1841
Comments
Are there any errors in the log? We considered timeouts before (#244) but decided against them for the kind of issues described there.
FWIW I see the same long durations for netclass on some of our Kubernetes clusters (GKE). No errors in the node_exporter logs. I had a look in /sys/class/net/ on one of the affected nodes, and there isn't an excessive number of interfaces, nor anything that seems strange when accessing files in there. We don't have any interesting configuration of the node_exporter in either affected or unaffected clusters, no filtering or such.
I tried building a minimal program that uses the same approach.
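In case it helps anyone reproduce the timing outside of node_exporter, here is a minimal sketch of such a program, assuming plain reads of the /sys/class/net attribute files rather than whatever library the comment above actually used:

```go
// timenetclass.go: time how long it takes to read every attribute file
// under /sys/class/net, similar in spirit to what the netclass collector
// does. Illustrative sketch only, not the commenter's actual program.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

func main() {
	start := time.Now()
	ifaces, err := os.ReadDir("/sys/class/net")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var files int
	for _, iface := range ifaces {
		dir := filepath.Join("/sys/class/net", iface.Name())
		entries, err := os.ReadDir(dir)
		if err != nil {
			continue
		}
		for _, e := range entries {
			if e.IsDir() {
				continue // skip subdirectories such as statistics/ for brevity
			}
			// Some attribute files return errors (e.g. on virtual devices);
			// ignore them, we only care about total wall-clock time.
			if _, err := os.ReadFile(filepath.Join(dir, e.Name())); err == nil {
				files++
			}
		}
	}
	fmt.Printf("read %d attribute files from %d interfaces in %s\n",
		files, len(ifaces), time.Since(start))
}
```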
Noticed that our containers were getting CPU throttled on the affected clusters. Removed the CPU limit to tune for actual usage (which seems higher than in previous releases).
Nothing unusual, I'll try to redeploy and check.
We know about CPU limit related issues and do not use CPU limits.
Thank you for providing the information! I've read #244 and unfortunately do not see any good reasons there against adding a timeout. For example, when Prometheus hits a connection timeout for some target, only that target loses its data, and an additional metric indicates the failure. Wouldn't such a timeout improve the overall reliability of node_exporter?
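To make the idea concrete, here is a minimal sketch of what a per-collector timeout could look like, assuming a wrapper around each collector's Update call; the names below only mirror node_exporter's conventions and are not its actual implementation. The timeout value could, for example, be derived from the X-Prometheus-Scrape-Timeout-Seconds header that Prometheus sends with each scrape.

```go
// Sketch of a per-collector timeout; hypothetical code, not the project's.
package collector

import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var successDesc = prometheus.NewDesc(
	"node_scrape_collector_success",
	"Whether a collector succeeded (0 on error or timeout).",
	[]string{"collector"}, nil,
)

// Collector is the usual "collect into a channel, return an error" shape.
type Collector interface {
	Update(ch chan<- prometheus.Metric) error
}

// executeWithTimeout runs one collector but gives up after `timeout`, so a
// slow collector (e.g. netclass or bonding) cannot stall the whole scrape;
// only its own metrics go missing, and the success metric records that.
func executeWithTimeout(name string, c Collector, ch chan<- prometheus.Metric, timeout time.Duration) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- c.Update(ch) }()

	success := 0.0
	select {
	case err := <-done:
		if err == nil {
			success = 1
		}
	case <-ctx.Done():
		// Timed out: report the failure instead of hiding it. In a real
		// implementation the abandoned goroutine's late writes to ch would
		// have to be discarded safely, which is part of why this is not
		// trivial to add.
	}

	ch <- prometheus.MustNewConstMetric(successDesc, prometheus.GaugeValue, success, name)
}
```

How to safely drop the late writes from a collector that is still running after its deadline is probably the main design question such a change would have to answer.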
I am having a problem setting up my node_exporter; it shows an error. Did I do something wrong, or is it a problem with my OS? I use "Debian GNU/Linux 10 (buster)". Please help me.
@zoeldjian This looks like an unrelated issue; best ask on the mailing list unless you are sure it's a bug in the node-exporter.
@discordianfish WRT #1841 (comment), do you think we should add timeouts for all collectors? The timeouts would be scoped to any behaviors that would otherwise create latency for the entire metrics endpoint, and would help us curb them. I'm +1 for it, and I can send out a PR for it if this is a welcome change.
Host operating system: output of uname -a
Linux 4.4.207-1.el7.elrepo.x86_64 #1 SMP Sat Dec 21 08:00:19 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
prom/node-exporter:v1.0.1
node_exporter command line flags
Are you running node_exporter in Docker?
Yes, in k8s as a DaemonSet
What did you do that produced an error?
We're using scrape_interval: 15s and scrape_timeout: 15s on the Prometheus side, and noticed that some nodes have holes in their graphs. This turns out to be due to large scrape times from the bonding and netclass collectors, as seen in node_scrape_collector_duration_seconds:
Sometimes even like this:
If we disable these collectors, then the holes disappear (on the graphs above, after 17:30).
What did you expect to see?
Bonding collector metrics are very valuable for us; currently we have to produce the same metrics via the textfile collector and a custom script (see the sketch below). Is it possible to add some configurable timeout to node_exporter, so that at least the metrics which are ready would be returned instead of failing the whole scrape?
In this case the collectors should maybe also set node_scrape_collector_success=0, to not hide the issue. Thank you.
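For anyone working around this the same way, below is a rough sketch of such a textfile-collector script; the metric names and output path are made up for illustration and do not match node_exporter's own bonding metrics:

```go
// bonding_textfile.go: write bonding slave counts for the textfile
// collector by parsing /proc/net/bonding/*. Metric names and the output
// path are illustrative; adjust to match your setup.
package main

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	bonds, _ := filepath.Glob("/proc/net/bonding/*")
	var out strings.Builder
	out.WriteString("# TYPE custom_bonding_slaves gauge\n")
	out.WriteString("# TYPE custom_bonding_active gauge\n")

	for _, path := range bonds {
		f, err := os.Open(path)
		if err != nil {
			continue
		}
		slaves, active := 0, 0
		inSlave := false
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			line := sc.Text()
			if strings.HasPrefix(line, "Slave Interface:") {
				slaves++
				inSlave = true
			}
			// Only count MII status lines that belong to a slave section,
			// not the bond-level status at the top of the file.
			if inSlave && strings.HasPrefix(line, "MII Status:") {
				if strings.Contains(line, "up") {
					active++
				}
				inSlave = false
			}
		}
		f.Close()
		bond := filepath.Base(path)
		fmt.Fprintf(&out, "custom_bonding_slaves{master=%q} %d\n", bond, slaves)
		fmt.Fprintf(&out, "custom_bonding_active{master=%q} %d\n", bond, active)
	}

	// Write to a temp file first, then rename, so the textfile collector
	// never reads a partially written .prom file.
	tmp := "/var/lib/node_exporter/textfile/bonding.prom.tmp"
	if err := os.WriteFile(tmp, []byte(out.String()), 0o644); err == nil {
		os.Rename(tmp, strings.TrimSuffix(tmp, ".tmp"))
	}
}
```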