Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully handle small "idle" backward counter jumps #1903

Closed
bmerry opened this issue Dec 1, 2020 · 9 comments · Fixed by #2067
Closed

Gracefully handle small "idle" backward counter jumps #1903

bmerry opened this issue Dec 1, 2020 · 9 comments · Fixed by #2067

Comments

@bmerry
Copy link

bmerry commented Dec 1, 2020

Host operating system: output of uname -a

  • Linux imgr15 4.15.0-74-generic Cut version 0.10.0. #83~16.04.1-Ubuntu SMP Wed Dec 18 04:56:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Linux imgr23 5.3.0-62-generic Add TCP state collector #56~18.04.1-Ubuntu SMP Wed Jun 24 16:17:03 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

Docker image prom/node-exporter:v1.0.1

node_exporter command line flags

Docker-compose config:

version: '2'
services:
  node-exporter:
    image: prom/node-exporter:v1.0.1
    command:
      - '--collector.textfile.directory=/textfile'
      - '--no-collector.zfs'
      - '--no-collector.wifi'
      - '--no-collector.bcache'
      - '--path.rootfs=/host'
      - '--collector.vmstat.fields=^(oom_kill|pgpg|pswp|nr|pg.*fault).*'
      - '--collector.filesystem.ignored-fs-types=^(tmpfs|autofs|binfmt_misc|cgroup|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|mqueue|fuse.lxcfs|nfs|nsfs|ceph|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|sysfs|tracefs)$$'
    network_mode: host
    pid: host
    volumes:
      - /var/spool/node-exporter:/textfile:ro
      - /:/host:ro,rslave

Are you running node_exporter in Docker?

Yes

What did you do that produced an error?

Ran node-exporter.

What did you expect to see?

No warning messages.

What did you see instead?

Lots of messages like

2020-11-29 06:29:28.518 imgr15.sdp.mkat.karoo.kat.ac.za node-exporter:   CPU Idle counter jumped backwards, possible hotplug event, resetting CPU stats
2020-11-29 18:35:02.572 imgr15.sdp.mkat.karoo.kat.ac.za node-exporter:   CPU Iowait counter jumped backwards

This was discussed on the mailing list here, but discussion died down so I'm opening a ticket so that it doesn't get lost. The jumps I see are all -0.01s; there is speculation that this is caused by some sort of race condition or the kernel accounting, and that for such small jumps one shouldn't reset, but rather export the highest value seen.

@SuperQ
Copy link
Member

SuperQ commented Dec 1, 2020

Thanks for reporting another instance of this.

We already handle the jumps gracefully, the message is informing you that it's happening. We've already downgraded this Warning to Debug in the latest code0, so they will go away in the next release.

This is very unusual that Idle would jump backwards. It's only expected in cases where the CPU count is changed. If the delta is small, maybe we need to add some additional checks to see if Idle has changed more than some percent.

I'm interested to know more about your specific deployment (VM? What platform? Hardware?) that is triggering the full Idle counter reset. What is your scrape interval? Prometheus version? Do you run Prometheus in HA?

@bmerry
Copy link
Author

bmerry commented Dec 1, 2020

It's not a new instance - it's the same instance discussed on the mailing list, where I've provided some information about the machines. But in summary:

  • no VM, Ubuntu 18.04
  • Intel(R) Xeon(R) CPU E5-2643 v4 @ 3.40GHz (doesn't seem to happen on our other servers, but these are our higher core-count servers)
  • Seems to happen much more with 4.15 kernel, but also happening on 5.3
  • 16s scrape interval
  • Prometheus run with Docker as described in this bug report, no HA

I hadn't spotted that the message was downgraded to debug in master. I'm happy for this to be closed in that case.

@SuperQ
Copy link
Member

SuperQ commented Dec 1, 2020

Can you post some sample values reported in the logs for the Idle jumps?

I think it might be worth improving the Idle backwards jump case, but I need some data.

@bmerry
Copy link
Author

bmerry commented Dec 1, 2020

I don't have the raw logs any more from my experiment with customising the log message to report the jump size, but here's what I reported at the time:

Again, this is just eyeballing logs, but I've looked at maybe 100 instances and they're all jumping back by 0.01 seconds (which is also the resolution). The idle counters are in the range 1-2 million, the others in the range 10k-100k.

@ringyear
Copy link

I encounter the same problem, The result shows as following image.
image
we run Prometheus on docker under bare-meta server(kernel 4.4.180-2.el7.elrepo.x86_64)

@SuperQ
Copy link
Member

SuperQ commented Jun 19, 2021

@ringyear With iowait this is a known kernel problem because the kernel does not track things atomically. Please upgrade your node_exporter to the current release.

@ringyear
Copy link

Got it, thks

@discordianfish
Copy link
Member

I don't think there is much we can do here. Shall we close @SuperQ?

@SuperQ
Copy link
Member

SuperQ commented Jul 1, 2021

@discordianfish I was thinking about changing the way we cache and handle counter resets. Right now we reset everything if idle jumps back at all. This was an assumption that idle would only jump backwards in the event of a hotplug CPU event.

But this issue is some evidence that this isn't always the case.

@SuperQ SuperQ changed the title Gracefully handle small backward counter jumps Gracefully handle small "idle" backward counter jumps Jul 1, 2021
SuperQ added a commit that referenced this issue Jul 4, 2021
The Linux CPU idle stat can also jump backwards slightly in some cases.

Fixes: #1903

Signed-off-by: Ben Kochie <superq@gmail.com>
SuperQ added a commit that referenced this issue Jul 5, 2021
The Linux CPU idle stat can also jump backwards slightly in some cases.

Fixes: #1903

Signed-off-by: Ben Kochie <superq@gmail.com>
SuperQ added a commit that referenced this issue Jul 7, 2021
The Linux CPU idle stat can also jump backwards slightly in some cases.
Allow the jump back up to 3 seconds before we attempt to reset the CPU
counter cache.

Fixes: #1903

Signed-off-by: Ben Kochie <superq@gmail.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
The Linux CPU idle stat can also jump backwards slightly in some cases.
Allow the jump back up to 3 seconds before we attempt to reset the CPU
counter cache.

Fixes: prometheus#1903

Signed-off-by: Ben Kochie <superq@gmail.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
The Linux CPU idle stat can also jump backwards slightly in some cases.
Allow the jump back up to 3 seconds before we attempt to reset the CPU
counter cache.

Fixes: prometheus#1903

Signed-off-by: Ben Kochie <superq@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants