Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value #18804

Closed
lhossack opened this issue Oct 19, 2023 · 5 comments · Fixed by #18835


Nomad version

$ nomad version
Nomad v1.6.1
BuildDate 2023-07-21T13:49:42Z
Revision 515895c7690cdc72278018dc5dc58aca41204ccc

Operating system and Environment details

$ uname -a
Linux <hostname> 4.19.0-20-amd64 #1 SMP Debian 4.19.235-1 (2022-03-17) x86_64 GNU/Linux

Hardware:
UCSB-B200-M4
UCSB-B200-M5

Issue

Nomad occasionally reports a negative CPU tick count since the last epoch (see the client log/error messages below).

I believe this is likely benign, but it does generate several log messages over extended periods of time (on average ~24 per day per host).

Reproduction steps

Run the Nomad service, possibly for several hours or days, and check the logs for the error message.

Rarely, I also see output from

nomad node status -self

like this:

Host Resource Utilization
CPU              Memory           Disk
-472/120000 MHz  6.8 GiB/376 GiB  (/dev/mapper/encryptedvol)

Expected Result

Time/ticks should move forward (this number should be positive, or at least zero).

Actual Result

Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -1310.6796

Nomad Client logs (if appropriate)

Oct 19 04:40:31 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -1310.6796
Oct 19 04:40:31 c2007 nomad[60804]: 2023-10-19T04:40:31.905Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -1310.6796
Oct 19 04:40:31 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -414.2857
Oct 19 04:40:31 c2007 nomad[60804]: 2023-10-19T04:40:31.900Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -414.2857
Oct 19 04:23:44 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -15
Oct 19 04:23:44 c2007 nomad[60804]: 2023-10-19T04:23:44.192Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -15
Oct 19 04:00:32 c2007 nomad[60804]: agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -990
Oct 19 04:00:32 c2007 nomad[60804]: 2023-10-19T04:00:32.781Z [ERROR] agent: Attempting to increment Prometheus counter nomad_client_host_cpu_total_ticks_count with value negative value -990

Additional Details

I've traced this in v1.6.1 to these areas.
Host stats are collected here:

func (h *HostStatsCollector) Collect() error {

Ticks are calculated here based on values from /proc/stat:

nomad/client/stats/cpu.go

Lines 132 to 133 in 515895c

idle, user, system, total := percentCalculator.Calculate(cpuStat)
ticks := (total / 100.0) * (float64(stats.CpuTotalTicks()) / float64(len(cpuStats)))

I wrote a quick Python script and was unable to observe any instance of the ticks in /proc/stat decreasing, even when it ran in parallel with Nomad emitting these logs. That test is inconclusive and could use more investigation, so it's currently unclear to me whether this is a bug in Nomad or in our kernel/lower down the stack.
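
For reference, a rough Go equivalent of that kind of check (not the script mentioned above, just a sketch that polls the aggregate "cpu" line of /proc/stat and flags any field that decreases, assuming the standard proc(5) layout):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readCPUFields returns the numeric fields of the aggregate "cpu" line.
func readCPUFields() ([]uint64, error) {
	f, err := os.Open("/proc/stat")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) > 1 && fields[0] == "cpu" { // aggregate line only
			vals := make([]uint64, 0, len(fields)-1)
			for _, s := range fields[1:] {
				v, err := strconv.ParseUint(s, 10, 64)
				if err != nil {
					return nil, err
				}
				vals = append(vals, v)
			}
			return vals, nil
		}
	}
	return nil, fmt.Errorf("no cpu line in /proc/stat")
}

func main() {
	prev, _ := readCPUFields()
	for range time.Tick(500 * time.Millisecond) {
		cur, err := readCPUFields()
		if err != nil || prev == nil {
			prev = cur
			continue
		}
		// Flag any counter that went backwards between two samples.
		for i := range cur {
			if i < len(prev) && cur[i] < prev[i] {
				fmt.Printf("field %d decreased: %d -> %d\n", i, prev[i], cur[i])
			}
		}
		prev = cur
	}
}

Any line it prints corresponds to a /proc/stat counter going backwards between two samples.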

tgross (Member) commented Oct 19, 2023

Hi @lhossack! Thanks for this report! You've caught the logging that @lgfa29 put in place to figure out the cause of the panics discussed in #15861, which means we can finally make progress on hunting that down. The work you've done here to focus in on client/stats is a great start. I'll cross-link to #15861 and see what folks on the team think about where to take this next.

@tgross tgross self-assigned this Oct 23, 2023
@tgross tgross closed this as completed Oct 23, 2023
@tgross tgross reopened this Oct 23, 2023
tgross (Member) commented Oct 23, 2023

(apologies for the misclick that closed the issue. fixed)

tgross (Member) commented Oct 23, 2023

I wrote a quick Python script and was unable to observe any instance of the ticks in /proc/stat decreasing, even when it ran in parallel with Nomad emitting these logs. That test is inconclusive and could use more investigation, so it's currently unclear to me whether this is a bug in Nomad or in our kernel/lower down the stack.

I did some digging on this and it turns out the iowait metric can decrease, particularly on multi-core systems (i.e. basically all of them, these days).

From the man page (emphasis mine):

iowait (since Linux 2.5.41)
(5) Time waiting for I/O to complete. This value is not reliable, for the following reasons:

  • The CPU will not wait for I/O to complete; iowait is the time that a task is waiting for I/O to complete. When a CPU goes into idle state for outstanding task I/O, another task will be scheduled on this CPU.

  • On a multi-core CPU, the task waiting for I/O to complete is not running on any CPU, so the iowait of each CPU is difficult to calculate.

  • The value in this field may decrease in certain conditions.

Apparently the relevant condition here is when an interrupt arrives on a different core than the one that gets woken up for the IO, and a particular counter in the kernel for that core gets interrupted. This is considered an "oh, that's unfortunate, but we can't break ABI now" situation for the kernel, so we need to work around it.

The relevant bits in Nomad are in client/stats/host.go#L275-L288 (which in main has been moved to client/hoststats/host.go#L272-L284).

Here we're getting the "current busy" time, which includes the iowait ticks that may have decremented. This Calculate method returns a negative value for total if (see the simplified sketch after the list):

  • iowait decreases from the previous check, and
  • the increase in idle is more than the increase in busy
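
In simplified form, the failing arithmetic looks like this (a sketch with hypothetical names, not the actual Nomad calculator):

// Sketch only: the calculator keeps the previous sample and works off deltas.
// "busy" is every /proc/stat field except idle, so it includes iowait, and a
// large enough drop in iowait makes deltaBusy negative.
func totalPercent(prevBusy, prevIdle, curBusy, curIdle float64) float64 {
	deltaBusy := curBusy - prevBusy // can go negative when iowait decreases
	deltaIdle := curIdle - prevIdle
	// With deltaBusy < 0 and deltaIdle > -deltaBusy, the denominator stays
	// positive while the numerator is negative, so the result is negative.
	return (deltaBusy / (deltaBusy + deltaIdle)) * 100.0
}

Plugging in the numbers from the test below: busy drops from 29,800 to 29,700 (iowait falls by 100) while idle rises by 10,000, giving roughly (-100 / 9,900) * 100 ≈ -1%.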

The following test demonstrates the bug:

func TestHostCpuStatsCalculator_DecreasedIOWait(t *testing.T) {
	times := cpu.TimesStat{
		CPU:    "cpu0",
		User:   20000,
		Nice:   100,
		System: 9000,
		Idle:   370000,
		Iowait: 700,
	}

	calculator := NewHostCpuStatsCalculator()
	calculator.Calculate(times)

	times = cpu.TimesStat{
		CPU:    "cpu0",
		User:   20000,
		Nice:   100,
		System: 9000,
		Idle:   380000,
		Iowait: 600,
	}

	_, _, _, total := calculator.Calculate(times)
	must.GreaterEq(t, 0.0, total, must.Sprint("total must never be negative"))
}

I'll have a PR up to fix this shortly.

tgross added a commit that referenced this issue Oct 23, 2023
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter increment. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861), but that is now prevented by
reporting an error message instead.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
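
For illustration, the zero floor described above could be as simple as the following sketch (a hypothetical helper, not the actual code from #18835):

// clampNonNegative is a hypothetical helper: applied to each percentage the
// calculator derives (idle, user, system, total) before returning, it keeps a
// transient iowait decrease from ever surfacing as a negative metric value.
func clampNonNegative(v float64) float64 {
	if v < 0 {
		return 0
	}
	return v
}

Applied to each derived value just before the calculator returns, this keeps the tick counts fed into the Prometheus counter non-negative, at the cost of occasionally under-reporting a sample.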
tgross (Member) commented Oct 23, 2023

Fix in #18835

@tgross tgross added this to the 1.7.0 milestone Oct 23, 2023
lhossack (Author) commented

Hi @tgross, awesome!

This aligns with my empirical findings over the weekend. Running an updated, longer test that logged /proc/stat at 2 Hz for 24 hours on one of our 64-core servers, I observed iowait decrease in ~1481 epochs for individual cores; in 512 of those the drop was large enough to decrease the aggregate CPU iowait across all cores.

No other /proc/stat columns decreased. None of these events were significant enough to make the total CPU usage as calculated in Nomad decrease in my samples, but Nomad checks /proc/stat at a much higher frequency, so I expect that is what causes the "negative value" for total ticks.

Great work :) Looking forward to seeing the PR merged, and happy to know I can safely ignore & drop these logs

tgross added a commit that referenced this issue Oct 24, 2023
nvanthao pushed a commit to nvanthao/nomad that referenced this issue Mar 1, 2024