Feature Request: Add specification-based metrics #4280

Open
gmichalec-pandora opened this issue May 10, 2018 · 9 comments
@gmichalec-pandora


Nomad version

Nomad v0.7.1 (0b295d3)

Operating system and Environment details

Debian Linux 8.7

Issue

We've been using the prometheus-exported job metrics, which have been really useful. However, it would be even more useful if we could get metrics based on the latest evaluated job specification. There may be other useful metrics, but the metrics that immediately jump to mind are:

  • per-task-group group_count (to validate running allocations == intended allocations)
  • per-task desired resource metrics (cpu, iops, memory, network mbits)
    Memory in particular would be useful for identifying tasks that are close to being OOM-killed.
@mlehner616

I second this request and am quite surprised this isn't already an included metric, especially since Nomad enforces such strict hard memory caps.

For
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage
or
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss

to be actually useful, we need a way to determine what percentage of its memory an allocation has used or has free. Whether I have to calculate that against another metric such as .memory.allocated, or it is exposed directly as a percentage like memory.total_percent, doesn't matter, but it is a really important metric for alleviating the pain of Nomad's strict memory caps.

To stay consistent with the other allocs metrics, I suggest the following. It should simply report the allocated memory as defined in the latest successful deployment's job specification (and that behavior should be documented):
nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.allocated
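
Until such a gauge exists, the percentage has to be assembled out-of-band by joining each allocation's measured memory against the MemoryMB declared in its job spec. Below is a minimal sketch of that join over the Nomad HTTP API; the endpoint paths and field names reflect the 0.7/0.8-era API and should be treated as assumptions to check against your version.

```
# Sketch: per-task memory usage as a fraction of the MemoryMB in the job spec.
# Assumes a reachable Nomad agent; the allocation stats endpoint is served by
# the client running the allocation, so point NOMAD_ADDR accordingly.
import requests

NOMAD_ADDR = "http://127.0.0.1:4646"

def task_memory_limits(job_id):
    """Map task name -> MemoryMB from the job specification."""
    job = requests.get(f"{NOMAD_ADDR}/v1/job/{job_id}").json()
    return {
        task["Name"]: task["Resources"]["MemoryMB"]
        for group in job["TaskGroups"]
        for task in group["Tasks"]
    }

def memory_usage_percent(alloc):
    """Yield (task, percent of spec'd memory currently in RSS) for one alloc."""
    limits = task_memory_limits(alloc["JobID"])
    stats = requests.get(
        f"{NOMAD_ADDR}/v1/client/allocation/{alloc['ID']}/stats"
    ).json()
    for task, task_stats in stats.get("Tasks", {}).items():
        rss = task_stats["ResourceUsage"]["MemoryStats"]["RSS"]  # bytes
        yield task, 100.0 * rss / (limits[task] * 1024 * 1024)

for alloc in requests.get(f"{NOMAD_ADDR}/v1/allocations").json():
    if alloc["ClientStatus"] != "running":
        continue
    for task, pct in memory_usage_percent(alloc):
        print(f"{alloc['JobID']}/{alloc['TaskGroup']}/{task}: {pct:.1f}% of limit")
```

An exported memory.allocated gauge would make the job-spec lookup unnecessary, since the same ratio could then be computed directly in the metrics pipeline.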

@nvx
Contributor

nvx commented Mar 29, 2019

Anyone?

Having the ability to compare current alloc resource consumption (especially memory) against the limit for those resources would be incredibly useful.

schmichael added a commit that referenced this issue Mar 29, 2019
Related to #4280

This PR adds
`client.allocs.<job>.<group>.<alloc>.<task>.memory.allocated` as a gauge
in bytes to metrics to ease calculating how close a task is to OOMing.

```
'nomad.client.allocs.memory.allocated.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 268435456.000
'nomad.client.allocs.memory.cache.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 5677056.000
'nomad.client.allocs.memory.kernel_max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.kernel_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.max_usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8908800.000
'nomad.client.allocs.memory.rss.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 876544.000
'nomad.client.allocs.memory.swap.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 0.000
'nomad.client.allocs.memory.usage.example.cache.6d98cbaf-d6bc-2a84-c63f-bfff8905a9d8.redis.rusty': 8208384.000
```
@schmichael
Member

Sorry for the lack of response. I quickly added an allocated metric in bytes in #5492. There's a Linux binary attached if anyone is willing to test.

@nvx
Contributor

nvx commented Apr 3, 2019

That looks like it'll solve my use case!

schmichael added a commit that referenced this issue Apr 11, 2019
schmichael added a commit that referenced this issue Apr 16, 2019
schmichael added a commit that referenced this issue May 10, 2019
@gmichalec-pandora
Author

Having the memory metric is great, but it would still be nice to get other metrics based on the intent of the submitted job spec. Here are the metrics we are 'backfilling' via a spec-polling process:

  • nomad_job_spec_task_cpu_allocation_mhz (cpu resources requested per task)
  • nomad_job_spec_task_network_allocation_mbits (network resources requested per task)
  • nomad_job_spec_task_group_count (allocation count requested per task group)
  • nomad_job_submit_time (the submitTime of the job)

The submit_time metric is very useful for annotating deploys on dashboards, and group_count is very useful for creating alerts on running vs. expected allocation counts.
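
For reference, here is a minimal sketch of such a spec-polling exporter, assuming a reachable Nomad HTTP API and the prometheus_client library; the metric names simply mirror the list above (they are not metrics Nomad itself ships), and the field names follow the job structure where network resources live under each task's Resources.

```
# Sketch of a spec-polling exporter that backfills job-spec metrics.
import time

import requests
from prometheus_client import Gauge, start_http_server

NOMAD_ADDR = "http://127.0.0.1:4646"

# Prometheus typically rewrites a conflicting target-side "job" label to
# "exported_job" on scrape, which is why queries later in this thread filter
# on exported_job.
GROUP_COUNT = Gauge("nomad_job_spec_task_group_count",
                    "Allocation count requested per task group",
                    ["job", "task_group"])
TASK_CPU = Gauge("nomad_job_spec_task_cpu_allocation_mhz",
                 "CPU requested per task (MHz)", ["job", "task_group", "task"])
TASK_NET = Gauge("nomad_job_spec_task_network_allocation_mbits",
                 "Network bandwidth requested per task (Mbits)",
                 ["job", "task_group", "task"])
SUBMIT_TIME = Gauge("nomad_job_submit_time",
                    "Job submit time (Unix seconds)", ["job"])

def poll_once():
    for stub in requests.get(f"{NOMAD_ADDR}/v1/jobs").json():
        job = requests.get(f"{NOMAD_ADDR}/v1/job/{stub['ID']}").json()
        SUBMIT_TIME.labels(job["ID"]).set(job["SubmitTime"] / 1e9)  # ns -> s
        for group in job["TaskGroups"]:
            GROUP_COUNT.labels(job["ID"], group["Name"]).set(group["Count"])
            for task in group["Tasks"]:
                res = task["Resources"]
                TASK_CPU.labels(job["ID"], group["Name"], task["Name"]).set(res["CPU"])
                mbits = sum(n["MBits"] for n in (res.get("Networks") or []))
                TASK_NET.labels(job["ID"], group["Name"], task["Name"]).set(mbits)

if __name__ == "__main__":
    start_http_server(9200)  # scrape port is arbitrary
    while True:
        poll_once()
        time.sleep(30)
```

Polling every job spec on a fixed interval is simple but chatty on large clusters; the blocking-query index on /v1/jobs could be used to refresh only when a job actually changes.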

@gmichalec-pandora
Author

gmichalec-pandora commented Oct 21, 2019

Just to add a real-world use case for these: here's an example of an alerting query we use to notify when a service has fewer than 60% of its desired allocations reporting as healthy in Consul:
sum(consul_health_service_status{job="consul-exporter", service_name="doppler"}) / max(nomad_job_spec_task_group_count{exported_job="doppler"}) < 0.6

@hobochili
Contributor

Following suit with #5492, which exposed per-task memory allocated metrics, I have opened #6784 to expose per-task CPU allocated metrics.

@tgross tgross added this to Needs Roadmapping in Nomad - Community Issues Triage Feb 12, 2021
@tgross tgross removed this from Needs Roadmapping in Nomad - Community Issues Triage Mar 4, 2021
@eidam

eidam commented Aug 12, 2021

👍 for the group_count type of metric; we would love to scale nodes up/down based on the "desired" task group count.

@atavakoliyext

Have there been any new developments on this request, particularly the ..._group_count metrics?
