-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add "total_ticks_count" for CPU metrics #17579
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @TrueBrain and thanks for the PR!
I've left some comments to address. Additionally, if you run make cl
it'll create a changelog entry.
metrics.SetGaugeWithLabels([]string{"client", "host", "cpu", "total"}, float32(cpu.Total), labels) | ||
// Keep "total" around to remain compatible with older consumers of the metrics | ||
metrics.SetGaugeWithLabels([]string{"client", "host", "cpu", "total"}, float32(cpu.TotalPercent), labels) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do (rarely) drop metrics, but we try to have a deprecation window of 1 major release. Once this is merged I'll add an upgrade note marking the metric deprecated for whatever release it'll ship in. (Likely the upcoming 1.6.0.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is an excellent idea, tnx; I was struggling how to do it properly :)
This counter tells you the total amount of ticks for that CPU entry since the start of Nomad.
Just checking in: is there anything else needed to move this PR forward? If so, please do let me know! Tnx :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just going to wait for the docs preview to run and then if that looks good I'll merge this.
This counter tells you the total amount of ticks for that CPU
entry since the start of Nomad.
Let me give you a bit of context why I always get a bit annoyed when there isn't a CPU counter available in metrics.
Once upon a day, when I was a small boy, I had my Nagios telling me all my servers were running at 100% CPU. But whenever I logged in, they never were above 1%. But the alarms kept going, and the frustration got high. So what was the problem?
When Nagios kicks off, it starts to call a lot of scripts on a host. The last one of the series was: measure current CPU percentage. And well .. because a fraction early other measurement scripts kicked in, at that exact moment in time, the CPU was always 100%. But that was not actually what I wanted to know .. I wanted to know the average percentage since last measurement.
Nomad does this pretty well: it calculates the percentage for all drivers I checked based on the last time it looked at that information. So percentages are average, and they avoid the above issue. But, kinda.
When pushing these averages to the metrics, they are pushed as a guage. This means I am now obligated to pull the metrics as often as Nomad pushes them, to make sure I have an accurate view of the CPU usage. But that is unlikely to work out, as timers drift, and anything polling will miss a sample from time to time.
So as I found out back in the days with Nagios, the far better way to represent CPU time, is in ticks since start, as counter. That way, who-ever reads the metric can figure out themselves what the delta was since last poll, and with that the average CPU usage. In ticks in this case, but I can imagine one could also argue the case for milliseconds.
Basically, this brings what Nomad is doing internally also external, so other tools can do the same thing as Nomad is doing internally: measure the ticks whenever, and calculate based on the time-delta what the average was.
I went for the name
counter_ticks
, although I would rather have changedtotal_ticks
into a Counter .. but changing the meaning of an existing measurement is in general not the nicest thing to do :DPS: I guess "ticks" is a bit vague, as it means MHz in this case. But it was already called "ticks" everywhere, so I stayed with that name. But
counter_mhz
might be more descriptive.