Gracefully handle small "idle" backward counter jumps #1903
Comments
Thanks for reporting another instance of this. We already handle the jumps gracefully; the message only informs you that it is happening. We've already downgraded this Warning to Debug in the latest code, so the messages will go away in the next release. It is very unusual for Idle to jump backwards. That is only expected when the CPU count changes. If the delta is small, maybe we need to add some additional checks to see whether Idle has changed by more than some percentage. I'm interested to know more about the specific deployment (VM? What platform? Hardware?) that is triggering the full Idle counter reset. What is your scrape interval? Prometheus version? Do you run Prometheus in HA?
It's not a new instance - it's the same instance discussed on the mailing list, where I've provided some information about the machines. But in summary:
I hadn't spotted that the message was downgraded to debug in master. I'm happy for this to be closed in that case.
Can you post some sample values reported in the logs for the Idle jumps? I think it might be worth improving the Idle backwards jump case, but I need some data.
I don't have the raw logs any more from my experiment with customising the log message to report the jump size, but here's what I reported at the time:
@ringyear With iowait this is a known kernel problem because the kernel does not track things atomically. Please upgrade your node_exporter to the current release.
Got it, thanks.
I don't think there is much we can do here. Shall we close, @SuperQ?
@discordianfish I was thinking about changing the way we cache and handle counter resets. Right now we reset everything if But this issue is some evidence that this isn't always the case.
The Linux CPU idle stat can also jump backwards slightly in some cases. Fixes: #1903 Signed-off-by: Ben Kochie <superq@gmail.com>
The Linux CPU idle stat can also jump backwards slightly in some cases. Allow the jump back up to 3 seconds before we attempt to reset the CPU counter cache. Fixes: #1903 Signed-off-by: Ben Kochie <superq@gmail.com>
Host operating system: output of
uname -a
node_exporter version: output of
node_exporter --version
Docker image prom/node-exporter:v1.0.1
node_exporter command line flags
Docker-compose config:
Are you running node_exporter in Docker?
Yes
What did you do that produced an error?
Ran node-exporter.
What did you expect to see?
No warning messages.
What did you see instead?
Lots of messages like
This was discussed on the mailing list here, but discussion died down so I'm opening a ticket so that it doesn't get lost. The jumps I see are all -0.01s; there is speculation that this is caused by some sort of race condition or the kernel accounting, and that for such small jumps one shouldn't reset, but rather export the highest value seen.
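To make the proposal concrete, here is a minimal Go sketch of the "export the highest value seen" idea combined with the 3-second tolerance mentioned in the commit messages. It is not the node_exporter implementation: the `idleCache` type, the `update` method, and the `jumpBackTolerance` constant are hypothetical names chosen for illustration, and the sketch tracks a single idle value rather than the full per-CPU stat set.

```go
package main

import "fmt"

// jumpBackTolerance is the largest backward jump, in seconds, that is treated
// as kernel accounting noise instead of a real counter reset. The value 3
// follows the commit message in this thread; the name is hypothetical.
const jumpBackTolerance = 3.0

// idleCache remembers the last idle value exported for one CPU.
// Hypothetical type, not taken from node_exporter.
type idleCache struct {
	last float64
}

// update takes a raw idle reading and returns the value to export, plus a
// flag reporting whether the cache was reset.
func (c *idleCache) update(raw float64) (exported float64, reset bool) {
	switch {
	case raw >= c.last:
		// Normal case: the counter moved forward (or stayed put).
		c.last = raw
	case c.last-raw > jumpBackTolerance:
		// Large backward jump: assume a genuine reset and start over.
		c.last = raw
		reset = true
	}
	// Small backward jump: fall through and keep the highest value seen.
	return c.last, reset
}

func main() {
	c := &idleCache{}
	for _, raw := range []float64{100.00, 99.99, 100.05, 10.00} {
		v, reset := c.update(raw)
		fmt.Printf("raw=%.2f exported=%.2f reset=%t\n", raw, v, reset)
	}
}
```

With this approach a -0.01s jump like the ones reported above leaves the exported counter unchanged, so Prometheus never sees a decrease, while a drop of more than 3 seconds is still treated as a reset.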