Feature Request: Add specification-based metrics #4280

Open
gmichalec-pandora opened this issue May 10, 2018 · 9 comments
@gmichalec-pandora


Nomad version

Nomad v0.7.1 (0b295d3)

Operating system and Environment details

Debian Linux 8.7

Issue

We've been using the prometheus-exported job metrics, which have been really useful. However, it would be even more useful if we could get metrics based on the latest evaluated job specification. There may be other useful metrics, but the metrics that immediately jump to mind are:

  • per-task-group group_count (to validate running allocations == intended allocations)
  • per-task desired resource metrics (cpu, iops, memory, network mbits)
    Memory in particular would be useful for identifying tasks that are close to being OOM-killed.
@mlehner616

I second this request and am quite surprised this isn't already an included metric, especially since Nomad enforces such strict hard memory caps.

For
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage
or
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss

to be actually useful, we need a way to determine what percentage of its memory an allocation has used or has free. Whether I have to calculate that against another metric such as .memory.allocated, or it is exposed directly as a percentage like memory.total_percent, doesn't matter, but it is a really important metric for alleviating the pain of Nomad's strict memory caps.

To stay consistent with the other allocs metrics, I suggest the following. It should simply report the allocated memory as defined in the latest successful deployment's job specification (and that behavior should be documented):
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.allocated
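
Until such a gauge exists, the percentage has to be assembled out-of-band by joining each allocation's measured memory against the MemoryMB declared in its job spec. Below is a minimal sketch of that join over the Nomad HTTP API; the endpoint paths and field names reflect the 0.7/0.8-era API and should be treated as assumptions to check against your version.

```
# Sketch: per-task memory usage as a fraction of the MemoryMB in the job spec.
# Assumes a reachable Nomad agent; the allocation stats endpoint is served by
# the client running the allocation, so point NOMAD_ADDR accordingly.
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"

def task_memory_limits(job_id):
    """Map task name -> MemoryMB from the job specification."""
    job = requests.get(f"{NOMAD_ADDR}/v1/job/{job_id}").json()
    return {
        task["Name"]: task["Resources"]["MemoryMB"]
        for group in job["TaskGroups"]
        for task in group["Tasks"]
    }

def memory_usage_percent(alloc):
    """Yield (task, percent of spec'd memory currently in RSS) for one alloc."""
    limits = task_memory_limits(alloc["JobID"])
    stats = requests.get(
        f"{NOMAD_ADDR}/v1/client/allocation/{alloc['ID']}/stats"
    ).json()
    for task, task_stats in stats.get("Tasks", {}).items():
        rss = task_stats["ResourceUsage"]["MemoryStats"]["RSS"]  # bytes
        yield task, 100.0 * rss / (limits[task] * 1024 * 1024)

for alloc in requests.get(f"{NOMAD_ADDR}/v1/allocations").json():
    if alloc["ClientStatus"] != "running":
        continue
    for task, pct in memory_usage_percent(alloc):
        print(f"{alloc['JobID']}/{alloc['TaskGroup']}/{task}: {pct:.1f}% of limit")
```

An exported memory.allocated gauge would make the job-spec lookup unnecessary, since the same ratio could then be computed directly in the metrics pipeline.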

@nvx
Contributor

nvx commented Mar 29, 2019

Anyone?

Having the ability to compare current alloc resource consumption (especially memory) against the limit for those resources would be incredibly useful.

schmichael added a commit that referenced this issue Mar 29, 2019
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
@schmichael
Member

Sorry for the lack of response. I quickly added an allocated metric in bytes in #5492. There's a Linux binary attached if anyone is willing to test.

@nvx
Contributor

nvx commented Apr 3, 2019

That looks like it'll solve my use case!

schmichael added a commit that referenced this issue Apr 11, 2019
schmichael added a commit that referenced this issue Apr 16, 2019
schmichael added a commit that referenced this issue May 10, 2019
@gmichalec-pandora
Author

Having the memory metric is great, but it would still be nice to get other metrics based on the intent of the submitted job spec. Here are the metrics we are 'backfilling' via a spec-polling process:

  • nomad_job_spec_task_cpu_allocation_mhz (cpu resources requested per task)
  • nomad_job_spec_task_network_allocation_mbits (network resources requested per task)
  • nomad_job_spec_task_group_count (allocation count requested per task group)
  • nomad_job_submit_time (the submitTime of the job)

The submit_time metric is very useful for annotating deploys on dashboards, and group_count is very useful for creating alerts on running vs. expected allocation counts.
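
For reference, here is a minimal sketch of such a spec-polling exporter, assuming a reachable Nomad HTTP API and the prometheus_client library; the metric names simply mirror the list above (they are not metrics Nomad itself ships), and the field names follow the job structure where network resources live under each task's Resources.

```
# Sketch of a spec-polling exporter that backfills job-spec metrics.
import time

import requests
from prometheus_client import Gauge, start_http_server

NOMAD_ADDR = "http://127.0.0.1:4646"

# Prometheus typically rewrites a conflicting target-side "job" label to
# "exported_job" on scrape, which is why queries later in this thread filter
# on exported_job.
GROUP_COUNT = Gauge("nomad_job_spec_task_group_count",
                    "Allocation count requested per task group",
                    ["job", "task_group"])
TASK_CPU = Gauge("nomad_job_spec_task_cpu_allocation_mhz",
                 "CPU requested per task (MHz)", ["job", "task_group", "task"])
TASK_NET = Gauge("nomad_job_spec_task_network_allocation_mbits",
                 "Network bandwidth requested per task (Mbits)",
                 ["job", "task_group", "task"])
SUBMIT_TIME = Gauge("nomad_job_submit_time",
                    "Job submit time (Unix seconds)", ["job"])

def poll_once():
    for stub in requests.get(f"{NOMAD_ADDR}/v1/jobs").json():
        job = requests.get(f"{NOMAD_ADDR}/v1/job/{stub['ID']}").json()
        SUBMIT_TIME.labels(job["ID"]).set(job["SubmitTime"] / 1e9)  # ns -> s
        for group in job["TaskGroups"]:
            GROUP_COUNT.labels(job["ID"], group["Name"]).set(group["Count"])
            for task in group["Tasks"]:
                res = task["Resources"]
                TASK_CPU.labels(job["ID"], group["Name"], task["Name"]).set(res["CPU"])
                mbits = sum(n["MBits"] for n in (res.get("Networks") or []))
                TASK_NET.labels(job["ID"], group["Name"], task["Name"]).set(mbits)

if __name__ == "__main__":
    start_http_server(9200)  # scrape port is arbitrary
    while True:
        poll_once()
        time.sleep(30)
```

Polling every job spec on a fixed interval is simple but chatty on large clusters; the blocking-query index on /v1/jobs could be used to refresh only when a job actually changes.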

@gmichalec-pandora
Author

gmichalec-pandora commented Oct 21, 2019

Just to add a real-world use case for these: here's an example of an alerting query we use to notify when a service has fewer than 60% of its desired allocations reporting as healthy in Consul:
sum(consul_health_service_status{job="consul-exporter", service_name="doppler"}) / max(nomad_job_spec_task_group_count{exported_job="doppler"}) < 0.6

@hobochili
Contributor

Following suit with #5492, which exposed per-task memory allocated metrics, I have opened #6784 to expose per-task CPU allocated metrics.

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@eidam

eidam commented Aug 12, 2021

👍 for the group_count type of metric; we would love to scale nodes up/down based on the "desired" task group count.

@atavakoliyext

Have there been any new developments on this request, particularly the ..._group_count metrics?
