Gracefully handle small "idle" backward counter jumps #1903

bmerry · 2020-12-01T06:41:47Z

Host operating system: output of `uname -a`

Linux imgr15 4.15.0-74-generic Cut version 0.10.0. #83~16.04.1-Ubuntu SMP Wed Dec 18 04:56:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Linux imgr23 5.3.0-62-generic Add TCP state collector #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of `node_exporter --version`

Docker image prom/node-exporter:v1.0.1

node_exporter command line flags

Docker-compose config:

version: '2'
services:
  node-exporter:
    image: prom/node-exporter:v1.0.1
    command:
      - '--collector.textfile.directory=/textfile'
      - '--no-collector.zfs'
      - '--no-collector.wifi'
      - '--no-collector.bcache'
      - '--path.rootfs=/host'
      - '--collector.vmstat.fields=^(oom_kill|pgpg|pswp|nr|pg.*fault).*'
      - '--collector.filesystem.ignored-fs-types=^(tmpfs|autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|fuse.lxcfs|nfs|nsfs|ceph|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$$'
    network_mode: host
    pid: host
    volumes:
      - /var/spool/node-exporter:/textfile:ro
      - /:/host:ro,rslave

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

Ran node-exporter.

What did you expect to see?

No warning messages.

What did you see instead?

Lots of messages like

2020-11-29 06:29:28.518 imgr15.sdp.mkat.karoo.kat.ac.za node-exporter:   CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats
2020-11-29 18:35:02.572 imgr15.sdp.mkat.karoo.kat.ac.za node-exporter:   CPU Iowait counter jumped backwards

This was discussed on the mailing list here, but discussion died down so I'm opening a ticket so that it doesn't get lost. The jumps I see are all -0.01s; there is speculation that this is caused by some sort of race condition or the kernel accounting, and that for such small jumps one shouldn't reset, but rather export the highest value seen.

The text was updated successfully, but these errors were encountered:

SuperQ · 2020-12-01T07:51:19Z

Thanks for reporting another instance of this.

We already handle the jumps gracefully, the message is informing you that it's happening. We've already downgraded this Warning to Debug in the latest code0, so they will go away in the next release.

This is very unusual that Idle would jump backwards. It's only expected in cases where the CPU count is changed. If the delta is small, maybe we need to add some additional checks to see if Idle has changed more than some percent.

I'm interested to know more about your specific deployment (VM? What platform? Hardware?) that is triggering the full Idle counter reset. What is your scrape interval? Prometheus version? Do you run Prometheus in HA?

bmerry · 2020-12-01T08:03:11Z

It's not a new instance - it's the same instance discussed on the mailing list, where I've provided some information about the machines. But in summary:

no VM, Ubuntu 18.04
Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40GHz (doesn't seem to happen on our other servers, but these are our higher core-count servers)
Seems to happen much more with 4.15 kernel, but also happening on 5.3
16s scrape interval
Prometheus run with Docker as described in this bug report, no HA

I hadn't spotted that the message was downgraded to debug in master. I'm happy for this to be closed in that case.

SuperQ · 2020-12-01T08:08:36Z

Can you post some sample values reported in the logs for the Idle jumps?

I think it might be worth improving the Idle backwards jump case, but I need some data.

bmerry · 2020-12-01T08:19:05Z

I don't have the raw logs any more from my experiment with customising the log message to report the jump size, but here's what I reported at the time:

Again, this is just eyeballing logs, but I've looked at maybe 100 instances and they're all jumping back by 0.01 seconds (which is also the resolution). The idle counters are in the range 1-2 million, the others in the range 10k-100k.

ringyear · 2021-06-19T06:20:28Z

I encounter the same problem, The result shows as following image.

we run Prometheus on docker under bare-meta server(kernel 4.4.180-2.el7.elrepo.x86_64)

SuperQ · 2021-06-19T06:54:44Z

@ringyear With iowait this is a known kernel problem because the kernel does not track things atomically. Please upgrade your node_exporter to the current release.

ringyear · 2021-06-19T07:36:04Z

Got it, thks

discordianfish · 2021-07-01T09:27:38Z

I don't think there is much we can do here. Shall we close @SuperQ?

SuperQ · 2021-07-01T10:07:29Z

@discordianfish I was thinking about changing the way we cache and handle counter resets. Right now we reset everything if idle jumps back at all. This was an assumption that idle would only jump backwards in the event of a hotplug CPU event.

But this issue is some evidence that this isn't always the case.

The Linux CPU idle stat can also jump backwards slightly in some cases. Fixes: #1903 Signed-off-by: Ben Kochie <superq@gmail.com>

The Linux CPU idle stat can also jump backwards slightly in some cases. Allow the jump back up to 3 seconds before we attempt to reset the CPU counter cache. Fixes: #1903 Signed-off-by: Ben Kochie <superq@gmail.com>

The Linux CPU idle stat can also jump backwards slightly in some cases. Allow the jump back up to 3 seconds before we attempt to reset the CPU counter cache. Fixes: prometheus#1903 Signed-off-by: Ben Kochie <superq@gmail.com>

SuperQ changed the title ~~Gracefully handle small backward counter jumps~~ Gracefully handle small "idle" backward counter jumps Jul 1, 2021

SuperQ mentioned this issue Jul 4, 2021

Handle small backwards jumps in CPU idle #2067

Merged

SuperQ added a commit that referenced this issue Jul 4, 2021

Handle small backwards jumps in CPU idle

bde48ea

The Linux CPU idle stat can also jump backwards slightly in some cases. Fixes: #1903 Signed-off-by: Ben Kochie <superq@gmail.com>

SuperQ added a commit that referenced this issue Jul 5, 2021

Handle small backwards jumps in CPU idle

922f2ee

The Linux CPU idle stat can also jump backwards slightly in some cases. Fixes: #1903 Signed-off-by: Ben Kochie <superq@gmail.com>

SuperQ closed this as completed in #2067 Jul 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gracefully handle small "idle" backward counter jumps #1903

Gracefully handle small "idle" backward counter jumps #1903

bmerry commented Dec 1, 2020 •

edited

Loading

SuperQ commented Dec 1, 2020

bmerry commented Dec 1, 2020

SuperQ commented Dec 1, 2020

bmerry commented Dec 1, 2020

ringyear commented Jun 19, 2021

SuperQ commented Jun 19, 2021

ringyear commented Jun 19, 2021

discordianfish commented Jul 1, 2021

SuperQ commented Jul 1, 2021

Gracefully handle small "idle" backward counter jumps #1903

Gracefully handle small "idle" backward counter jumps #1903

Comments

bmerry commented Dec 1, 2020 • edited Loading

Host operating system: output of uname -a

node_exporter version: output of node_exporter --version

node_exporter command line flags

Are you running node_exporter in Docker?

What did you do that produced an error?

What did you expect to see?

What did you see instead?

SuperQ commented Dec 1, 2020

bmerry commented Dec 1, 2020

SuperQ commented Dec 1, 2020

bmerry commented Dec 1, 2020

ringyear commented Jun 19, 2021

SuperQ commented Jun 19, 2021

ringyear commented Jun 19, 2021

discordianfish commented Jul 1, 2021

SuperQ commented Jul 1, 2021

bmerry commented Dec 1, 2020 •

edited

Loading

Host operating system: output of `uname -a`

node_exporter version: output of `node_exporter --version`