[Question] Telemetry documentation inaccuracies #7773
Comments
Hi @m1keil! Thanks for reporting this!
Yeah that seems like a stale documentation item. I'll mark this as a documentation bug to fix.
That's typically the way you'd see it with other monitoring tools like
You should be able to get this via the
Thanks for the response @tgross.
Yes, of course, there are ways to get this information outside of the monitoring system. However, without having this info in the telemetry, there's no way to dynamically find how close the allocation is to the max.
@tgross @m1keil The CPU resources you assign in your jobspec or in quota definitions are measured in Mhz, not percent. %CPU might be interesting for cluster or node operators, but not really for the people running jobs on the cluster, because of ☝️
Now, the UI reports both percent and Mhz (0 Mhz / 100 Mhz reserved). Since PR6784 added a
This would bring CPU telemetry in line with the Memory telemetry, allowing the same procedures to be used for monitoring both.
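For illustration, a minimal PromQL sketch of what such a metric would enable, assuming Prometheus-format names; `nomad_client_allocs_cpu_allocated` is hypothetical (the proposed metric, it does not exist at the time of this thread), and `nomad_client_allocs_cpu_total_ticks` reporting consumption in Mhz is an assumption:

```promql
# CPU utilization of an allocation against its reserved Mhz, mirroring the
# memory ratio discussed later in this issue.
# NOTE: nomad_client_allocs_cpu_allocated is hypothetical; total_ticks being
# the Mhz-denominated usage is an assumption.
100 * nomad_client_allocs_cpu_total_ticks
    / nomad_client_allocs_cpu_allocated
```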
I don't think it would. Unless
Basically, I just want to understand if, using Nomad's telemetry, I can answer the simple question of "is my node hitting the max CPU utilization ceiling?"
@m1keil Well, the metric I am suggesting (☝️) is useful for job operators since it exposes the same unit of measurement (Mhz) that is used elsewhere in Nomad (jobspec, Quota and the CLI/UI) and easily lets you check CPU consumption or if
Now, Nomad also exposes host metrics such as
Your question of
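A minimal sketch of that node-level check, assuming the documented per-core host CPU metrics are exported as `nomad_client_host_cpu_idle` (a percentage) and that a node-identifying label, here `node_id`, is attached; both names are assumptions to adapt to your setup:

```promql
# Rough "is this node maxed out?" check: average busy percentage across the
# node's cores. nomad_client_host_cpu_idle as a per-core idle percentage and
# the node_id label are assumptions.
100 - avg by (node_id) (nomad_client_host_cpu_idle)
```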
@henrikjohansen first of all, thanks for the detailed answers. Much appreciated. Some background about me: I'm all of the above (job and cluster operator).
Regarding
You are totally right about the host metrics. That would answer my question. However, I think I was in a bit of a rush when I was writing that question down. I want to be able to tell what the current CPU consumption of a single allocation is, and I want that to be a percentage, just like you would see it in
The metric you are suggesting (
But isn't it just easier to make a
That depends on how you define 'consumption', I guess. You can easily run into situations where Nomad is unable to schedule more work to a node even though it's essentially idle ... because all available resources are reserved by jobs running on that node (and thus considered consumed).
Personally I don't think that this makes much sense, since literally everything else in Nomad uses Mhz.
If you want to see the current CPU consumption of an allocation, the suggested metric (
If you want to see if an allocation is consuming more CPU resources than you have reserved in your jobspecs,
If you would like to check whether jobs are getting CPU throttled because of CPU contention issues on a node, you should use
Lastly, using
As a side note, if you want to check CPU contention issues on a per-node/OS level, then CPU run-queue statistics are as important as %CPU, since contention can occur without 100% CPU utilization.
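For the throttling check mentioned above, a minimal sketch, assuming the per-allocation throttled-time gauge is exported as `nomad_client_allocs_cpu_throttled_time` (the name is an assumption derived from the documented `cpu.throttled_time` metric):

```promql
# Allocations whose throttled time grew over the last 5 minutes, i.e. they are
# being CPU throttled. delta() is used on the assumption that Nomad publishes
# this value as a gauge rather than a Prometheus counter.
delta(nomad_client_allocs_cpu_throttled_time[5m]) > 0
```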
Indeed, and IIRC these also correctly handle capacity that you might have reserved in your Nomad agent config.
In our case, no. We use different physical hardware and expose those as node classes in Nomad (think high-cpu, gpu, high-mem, etc.). Now, what does 23% utilization mean on an AMD EPYC node with 256 cores compared to a 32-core Intel node? Consumption of 25000 Mhz, however, means the same regardless of ☝️
It's my understanding that
Tbh, I see no reason why both use cases can't be answered. I do understand your point, I just don't see it as "either this or that". But it's less important at the moment. Looks like I need to fall back to cAdvisor in order to monitor Docker directly to get these numbers.
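For that fallback, a minimal PromQL sketch using cAdvisor's standard metric name; it yields the docker-stats-style figure where 100% equals one fully busy core:

```promql
# Per-container CPU usage as a percentage of a single core, from cAdvisor.
# Summing by the container name collapses any per-cpu series cAdvisor exposes.
100 * sum by (name) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))
```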
I'm running Nomad (0.10.2) with the Docker task driver (19.03.8) and Prometheus metrics enabled.
In the telemetry docs it says:
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent: Total CPU resources consumed by the task across all cores

First of all, it seems like the docs are a bit out of date with regard to the labels & metric names, as it's now changed to nomad.client.allocs.cpu.system/user/total_percent plus labels, similar to how the Host Metrics are documented (the post-0.7 change).
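For reference, a sketch of how the renamed series are addressed once labels are in play (the label names and values below are illustrative assumptions, not taken from the docs):

```promql
# Selecting the per-task CPU series by labels instead of a dotted metric-name
# hierarchy; "task" and "task_group" are assumed label names.
nomad_client_allocs_cpu_total_percent{task="web", task_group="api"}
```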
The description of the metric is confusing as well.
The total_percent metric can easily spike above 100%, which means it's not "across all cores" but actually a sum across them. Maybe it's a language barrier on my side. I was expecting a "normalized" value here, i.e. on a 4-core machine, having 1 core at 100% use leads to 25% in this metric. But here we get 100% instead (so the max is 400%).

Because Nomad doesn't expose any "number of cores" metric, this makes it hard to estimate how high the utilization really is. Additionally, there are no metrics that can tell how far off we are from the CPU resource limit, in a similar fashion to what is being graphed on the allocation page in the Nomad UI.
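As a workaround sketch, the summed value can be divided by a core count known out of band (hardcoded below), since Nomad itself exposes no core-count metric here:

```promql
# Normalize the summed-across-cores percentage to a 0-100 range on a node
# known to have 4 cores (the 4 is a hardcoded assumption, not a metric).
nomad_client_allocs_cpu_total_percent / 4
```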
The memory-related metrics for the allocation don't have similar issues, as we can compute utilization by doing nomad_client_allocs_memory_usage / nomad_client_allocs_memory_allocated. (Btw, nomad_client_allocs_memory_usage is not documented either, or maybe it was renamed from nomad_client_allocs_memory_used.)
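Spelled out as a PromQL expression, using the metric names above:

```promql
# Memory utilization of an allocation, as a percentage of its reservation.
100 * nomad_client_allocs_memory_usage
    / nomad_client_allocs_memory_allocated
```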
My questions are:
a) Is this a documentation bug? Or am I just reading this incorrectly?
b) Is there any way we can graph the utilization of an allocation against its CPU resource limit?