Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "total_ticks_count" for CPU metrics #17579

Merged
merged 2 commits into from
Jul 5, 2023
Merged

Conversation

TrueBrain
Copy link
Contributor

@TrueBrain TrueBrain commented Jun 17, 2023

This counter tells you the total amount of ticks for that CPU
entry since the start of Nomad.

Let me give you a bit of context why I always get a bit annoyed when there isn't a CPU counter available in metrics.

Once upon a day, when I was a small boy, I had my Nagios telling me all my servers were running at 100% CPU. But whenever I logged in, they never were above 1%. But the alarms kept going, and the frustration got high. So what was the problem?

When Nagios kicks off, it starts to call a lot of scripts on a host. The last one of the series was: measure current CPU percentage. And well .. because a fraction early other measurement scripts kicked in, at that exact moment in time, the CPU was always 100%. But that was not actually what I wanted to know .. I wanted to know the average percentage since last measurement.

Nomad does this pretty well: it calculates the percentage for all drivers I checked based on the last time it looked at that information. So percentages are average, and they avoid the above issue. But, kinda.

When pushing these averages to the metrics, they are pushed as a guage. This means I am now obligated to pull the metrics as often as Nomad pushes them, to make sure I have an accurate view of the CPU usage. But that is unlikely to work out, as timers drift, and anything polling will miss a sample from time to time.

So as I found out back in the days with Nagios, the far better way to represent CPU time, is in ticks since start, as counter. That way, who-ever reads the metric can figure out themselves what the delta was since last poll, and with that the average CPU usage. In ticks in this case, but I can imagine one could also argue the case for milliseconds.

Basically, this brings what Nomad is doing internally also external, so other tools can do the same thing as Nomad is doing internally: measure the ticks whenever, and calculate based on the time-delta what the average was.

I went for the name counter_ticks, although I would rather have changed total_ticks into a Counter .. but changing the meaning of an existing measurement is in general not the nicest thing to do :D

PS: I guess "ticks" is a bit vague, as it means MHz in this case. But it was already called "ticks" everywhere, so I stayed with that name. But counter_mhz might be more descriptive.

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @TrueBrain and thanks for the PR!

I've left some comments to address. Additionally, if you run make cl it'll create a changelog entry.

website/content/docs/operations/metrics-reference.mdx Outdated Show resolved Hide resolved
Comment on lines -3042 to +3043
metrics.SetGaugeWithLabels([]string{"client", "host", "cpu", "total"}, float32(cpu.Total), labels)
// Keep "total" around to remain compatible with older consumers of the metrics
metrics.SetGaugeWithLabels([]string{"client", "host", "cpu", "total"}, float32(cpu.TotalPercent), labels)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do (rarely) drop metrics, but we try to have a deprecation window of 1 major release. Once this is merged I'll add an upgrade note marking the metric deprecated for whatever release it'll ship in. (Likely the upcoming 1.6.0.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is an excellent idea, tnx; I was struggling how to do it properly :)

client/client.go Outdated Show resolved Hide resolved
This counter tells you the total amount of ticks for that CPU
entry since the start of Nomad.
@TrueBrain TrueBrain changed the title Add "counter_ticks" for CPU metrics Add "total_ticks_count" for CPU metrics Jun 20, 2023
@TrueBrain
Copy link
Contributor Author

Just checking in: is there anything else needed to move this PR forward? If so, please do let me know! Tnx :)

Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just going to wait for the docs preview to run and then if that looks good I'll merge this.

@tgross
Copy link
Member

tgross commented Jul 5, 2023

Docs look good: https://nomad-a8xmfz7gg-hashicorp.vercel.app/nomad/docs/operations/metrics-reference#client-metrics

@tgross tgross merged commit ede662a into hashicorp:main Jul 5, 2023
@TrueBrain TrueBrain deleted the cpu-counters branch July 5, 2023 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

2 participants