Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: prevent negative counter from iowait decrease (backport 1.6.x) #18849

Merged
merged 1 commit into from
Oct 24, 2023

Conversation

tgross
Copy link
Member

@tgross tgross commented Oct 24, 2023

The iowait metric obtained from /proc/stat can under some circumstances decrease. The relevant condition is when an interrupt arrives on a different core than the one that gets woken up for the IO, and a particular counter in the kernel for that core gets interrupted. This is documented in the man page for the proc(5) pseudo-filesystem, and considered an unfortunate behavior that can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and compare it to the previous busy time to get the counter incremeent. If the iowait counter decreases and the idle counter increases more than the increase in the total busy time, we can get a negative total. This previously caused a panic in our metrics collection (see #15861) but that is being prevented by reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU stats calculator.

Backport-of: #18835 (the code touched here was moved, so this backport had to be done by hand)

The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Backport-of: #18835
@tgross tgross requested review from shoenig and lgfa29 October 24, 2023 14:02
@tgross tgross changed the title metrics: prevent negative counter from iowait decrease metrics: prevent negative counter from iowait decrease (backport 1.6.x) Oct 24, 2023
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross tgross merged commit b3d099c into release/1.6.x Oct 24, 2023
29 of 30 checks passed
@tgross tgross deleted the negative-metrics-1.6.x branch October 24, 2023 14:37
@tgross tgross added backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line labels Oct 24, 2023
tgross added a commit that referenced this pull request Oct 24, 2023
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Backport-of: #18835
tgross added a commit that referenced this pull request Oct 24, 2023
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Backport-of: #18835
@tgross tgross removed backport/1.4.x backport to 1.4.x release line backport/1.5.x backport to 1.5.x release line labels Oct 24, 2023
@tgross
Copy link
Member Author

tgross commented Oct 24, 2023

Backported manually to 1.5.x and 1.4.x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants