Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value #18804

Closed
lhossack opened this issue Oct 19, 2023 · 5 comments · Fixed by #18835


Nomad version

$ nomad version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

$ uname -a
Linux <hostname> 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux

Hardware:
UCSB-B200-M4
UCSB-B200-M5

Issue

Nomad occasionally reports a negative CPU tick count since the last epoch (see the client log/error messages below).

I believe this is likely benign, but it does generate several log messages over extended periods of time (on average ~24 per day per host).

Reproduction steps

Run the Nomad service, possibly for several hours or days, and check the logs for the error message.

Rarely, I also see output from

nomad node status -self

like this:

Host Resource Utilization
CPU              Memory           Disk
-472/120000 MHz  6.8 GiB/376 GiB  (/dev/mapper/encryptedvol)

Expected Result

Time/ticks should move forward (this number should be positive, or at least zero).

Actual Result

Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -1310.6796

Nomad Client logs (if appropriate)

Oct 19 04:40:31 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -1310.6796
Oct 19 04:40:31 c2007 nomad[60804]: 2023-10-19T04:40:31.905Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -1310.6796
Oct 19 04:40:31 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -414.2857
Oct 19 04:40:31 c2007 nomad[60804]: 2023-10-19T04:40:31.900Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -414.2857
Oct 19 04:23:44 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -15
Oct 19 04:23:44 c2007 nomad[60804]: 2023-10-19T04:23:44.192Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -15
Oct 19 04:00:32 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -990
Oct 19 04:00:32 c2007 nomad[60804]: 2023-10-19T04:00:32.781Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -990

Additional Details

I've traced this in v1.6.1 to these areas.
Host stats are collected here:

func (h *HostStatsCollector) Collect() error {

Ticks are calculated here based on values from /proc/stat:

nomad/client/stats/cpu.go

Lines 132 to 133 in 515895c

idle, user, system, total := percentCalculator.Calculate(cpuStat)
ticks := (total / 100.0) * (float64(stats.CpuTotalTicks()) / float64(len(cpuStats)))

I wrote a quick Python script and was unable to observe any instance of the ticks in /proc/stat decreasing, even when it ran in parallel with Nomad emitting these logs. That test is inconclusive and could use more investigation, so it's currently unclear to me whether this is a bug in Nomad or in our kernel/lower down the stack.
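
For reference, a rough Go equivalent of that kind of check (not the script mentioned above, just a sketch that polls the aggregate "cpu" line of /proc/stat and flags any field that decreases, assuming the standard proc(5) layout):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPUFields returns the numeric fields of the aggregate "cpu" line.
func readCPUFields() ([]uint64, error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 1 && fields[0] == "cpu" { // aggregate line only
			vals := make([]uint64, 0, len(fields)-1)
			for _, s := range fields[1:] {
				v, err := strconv.ParseUint(s, 10, 64)
				if err != nil {
					return nil, err
				}
				vals = append(vals, v)
			}
			return vals, nil
		}
	}
	return nil, fmt.Errorf("no cpu line in /proc/stat")
}

func main() {
	prev, _ := readCPUFields()
	for range time.Tick(500 * time.Millisecond) {
		cur, err := readCPUFields()
		if err != nil || prev == nil {
			prev = cur
			continue
		}
		// Flag any counter that went backwards between two samples.
		for i := range cur {
			if i < len(prev) && cur[i] < prev[i] {
				fmt.Printf("field %d decreased: %d -> %d\n", i, prev[i], cur[i])
			}
		}
		prev = cur
	}
}

Any line it prints corresponds to a /proc/stat counter going backwards between two samples.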

tgross (Member) commented Oct 19, 2023

Hi @lhossack! Thanks for this report! You've caught the logging that @lgfa29 put in place to figure out the cause of the panics discussed in #15861, which means we can finally make progress on hunting that down. The work you've done here to focus in on client/stats is a great start. I'll cross-link to #15861 and see what folks on the team think about where to take this next.

@tgross tgross self-assigned this Oct 23, 2023
@tgross tgross closed this as completed Oct 23, 2023
@tgross tgross reopened this Oct 23, 2023
tgross (Member) commented Oct 23, 2023

(apologies for the misclick that closed the issue. fixed)

tgross (Member) commented Oct 23, 2023

I wrote a quick Python script and was unable to observe any instance of the ticks in /proc/stat decreasing, even when it ran in parallel with Nomad emitting these logs. That test is inconclusive and could use more investigation, so it's currently unclear to me whether this is a bug in Nomad or in our kernel/lower down the stack.

I did some digging on this and it turns out the iowait metric can decrease, particularly on multi-core systems (i.e. basically all of them, these days).

From the man page (emphasis mine):

iowait (since Linux 2.5.41)
(5) Time waiting for I/O to complete. This value is not reliable, for the following reasons:

  • The CPU will not wait for I/O to complete; iowait is the time that a task is waiting for I/O to complete. When a CPU goes into idle state for outstanding task I/O, another task will be scheduled on this CPU.

  • On a multi-core CPU, the task waiting for I/O to complete is not running on any CPU, so the iowait of each CPU is difficult to calculate.

  • The value in this field may decrease in certain conditions.

Apparently the relevant condition here is when an interrupt arrives on a different core than the one that gets woken up for the IO, and a particular counter in the kernel for that core gets interrupted. This is considered an "oh, that's unfortunate, but we can't break ABI now" situation for the kernel, so we need to work around it.

The relevant bits in Nomad are in client/stats/host.go#L275-L288 (which in main has been moved to client/hoststats/host.go#L272-L284).

Here we're getting the "current busy" time, which includes the iowait ticks that may have decremented. This Calculate method returns a negative value for total if (see the simplified sketch after the list):

  • iowait decreases from the previous check, and
  • the increase in idle is more than the increase in busy
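
In simplified form, the failing arithmetic looks like this (a sketch with hypothetical names, not the actual Nomad calculator):

// Sketch only: the calculator keeps the previous sample and works off deltas.
// "busy" is every /proc/stat field except idle, so it includes iowait, and a
// large enough drop in iowait makes deltaBusy negative.
func totalPercent(prevBusy, prevIdle, curBusy, curIdle float64) float64 {
	deltaBusy := curBusy - prevBusy // can go negative when iowait decreases
	deltaIdle := curIdle - prevIdle
	// With deltaBusy < 0 and deltaIdle > -deltaBusy, the denominator stays
	// positive while the numerator is negative, so the result is negative.
	return (deltaBusy / (deltaBusy + deltaIdle)) * 100.0
}

Plugging in the numbers from the test below: busy drops from 29,800 to 29,700 (iowait falls by 100) while idle rises by 10,000, giving roughly (-100 / 9,900) * 100 ≈ -1%.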

The following test demonstrates the bug:

func TestHostCpuStatsCalculator_DecreasedIOWait(t *testing.T) {
	times := cpu.TimesStat{
		CPU:    "cpu0",
		User:   20000,
		Nice:   100,
		System: 9000,
		Idle:   370000,
		Iowait: 700,
	}

	calculator := NewHostCpuStatsCalculator()
	calculator.Calculate(times)

	times = cpu.TimesStat{
		CPU:    "cpu0",
		User:   20000,
		Nice:   100,
		System: 9000,
		Idle:   380000,
		Iowait: 600,
	}

	_, _, _, total := calculator.Calculate(times)
	must.GreaterEq(t, 0.0, total, must.Sprint("total must never be negative"))
}

I'll have a PR up to fix this shortly.

tgross added a commit that referenced this issue Oct 23, 2023
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter increment. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861), but that is now prevented by
reporting an error message instead.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
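
For illustration, the zero floor described above could be as simple as the following sketch (a hypothetical helper, not the actual code from #18835):

// clampNonNegative is a hypothetical helper: applied to each percentage the
// calculator derives (idle, user, system, total) before returning, it keeps a
// transient iowait decrease from ever surfacing as a negative metric value.
func clampNonNegative(v float64) float64 {
	if v < 0 {
		return 0
	}
	return v
}

Applied to each derived value just before the calculator returns, this keeps the tick counts fed into the Prometheus counter non-negative, at the cost of occasionally under-reporting a sample.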
tgross (Member) commented Oct 23, 2023

Fix in #18835

@tgross tgross added this to the 1.7.0 milestone Oct 23, 2023
lhossack (Author) commented

Hi @tgross, awesome!

This aligns with my empirical findings over the weekend. Running an updated, longer test that logged /proc/stat at 2 Hz for 24 hours on one of our 64-core servers, I observed iowait decrease in ~1481 epochs for individual cores; in 512 of those the drop was large enough to decrease the aggregate CPU iowait across all cores.

No other /proc/stat columns decreased. None of these events were significant enough to make the total CPU usage as calculated in Nomad decrease in my samples, but Nomad checks /proc/stat at a much higher frequency, so I expect that is what causes the "negative value" for total ticks.

Great work :) Looking forward to seeing the PR merged, and happy to know I can safely ignore & drop these logs

tgross added a commit that referenced this issue Oct 24, 2023
nvanthao pushed a commit to nvanthao/nomad that referenced this issue Mar 1, 2024