Nomad Metrics don't match Telemetry documentation #4126

Closed
jesusvazquez opened this issue Apr 10, 2018 · 7 comments

Comments

@jesusvazquez
Contributor

Nomad version

Nomad v0.7.1 (0b295d399d00199cfab4621566babd25987ba06e)

Operating system and Environment details

Ubuntu 14.04 LTS
4.4.0-111-generic #134~14.04.1-Ubuntu SMP Mon Jan 15 15:39:56 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Issue

The metrics returned by the Nomad API don't match the official telemetry documentation.

Reproduction steps

Using the official documentation as the reference, let's take nomad.client.allocs.* as an example. When I query the Nomad API for it, I get nothing:

# curl 10.146.157.234:4646/v1/metrics | jq . | grep nomad.client.allocs
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9756    0  9756    0     0  2442k      0 --:--:-- --:--:-- --:--:-- 3175k

However, if I query for nomad.client.allocations instead, I get the following:

# curl 10.146.157.234:4646/v1/metrics | jq . | grep nomad.client.allocations
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9756    0  9756    0     0  2650k      0 --:--:-- --:--:-- --:--:-- 3175k
      "Name": "nomad.client.allocations.blocked",
      "Name": "nomad.client.allocations.migrating",
      "Name": "nomad.client.allocations.pending",
      "Name": "nomad.client.allocations.running",
      "Name": "nomad.client.allocations.terminal",

Another example is nomad.client.uptime, which doesn't exist. Instead, I have nomad.uptime:

# curl 10.146.157.234:4646/v1/metrics | jq . | grep uptime
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9766    0  9766    0     0  2644k      0 --:--:-- --:--:-- --:--:-- 3179k
      "Name": "nomad.uptime",

Here is the whole list of nomad.client metrics:

# curl 10.146.157.234:4646/v1/metrics | jq . | grep nomad.client
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  9760    0  9760    0     0  3111k      0 --:--:-- --:--:-- --:--:-- 4765k
      "Name": "nomad.client.allocated.cpu",
      "Name": "nomad.client.allocated.disk",
      "Name": "nomad.client.allocated.iops",
      "Name": "nomad.client.allocated.memory",
      "Name": "nomad.client.allocations.blocked",
      "Name": "nomad.client.allocations.migrating",
      "Name": "nomad.client.allocations.pending",
      "Name": "nomad.client.allocations.running",
      "Name": "nomad.client.allocations.terminal",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.idle",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.system",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.total",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.cpu.user",
      "Name": "nomad.client.host.disk.available",
      "Name": "nomad.client.host.disk.available",
      "Name": "nomad.client.host.disk.inodes_percent",
      "Name": "nomad.client.host.disk.inodes_percent",
      "Name": "nomad.client.host.disk.size",
      "Name": "nomad.client.host.disk.size",
      "Name": "nomad.client.host.disk.used",
      "Name": "nomad.client.host.disk.used",
      "Name": "nomad.client.host.disk.used_percent",
      "Name": "nomad.client.host.disk.used_percent",
      "Name": "nomad.client.host.memory.available",
      "Name": "nomad.client.host.memory.free",
      "Name": "nomad.client.host.memory.total",
      "Name": "nomad.client.host.memory.used",
      "Name": "nomad.client.unallocated.cpu",
      "Name": "nomad.client.unallocated.disk",
      "Name": "nomad.client.unallocated.iops",
      "Name": "nomad.client.unallocated.memory",
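
Many names in the listing above repeat (nomad.client.host.cpu.idle appears eight times, presumably once per labelled series), which makes it harder to compare against the documentation. A minimal sketch that collapses the output to one line per distinct name, using only standard POSIX tools. The function name `metric_names` is hypothetical, and it assumes the input is the same JSON that `/v1/metrics` returns:

```shell
# Hypothetical helper: read /v1/metrics JSON on stdin and print each
# distinct metric name once, prefixed by how many times it occurred.
metric_names() {
  grep -o '"Name": *"[^"]*"' |        # keep only the "Name": "..." fragments
    sed 's/^"Name": *"//; s/"$//' |   # strip the key and the quotes
    sort | uniq -c |                  # count identical names
    awk '{ print $1, $2 }'            # normalize to "count name"
}
```

It could be used as `curl -s 10.146.157.234:4646/v1/metrics | jq . | metric_names`, so the client metrics read as a short list rather than dozens of repeated lines.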
@dadgar
Contributor

dadgar commented Apr 10, 2018

Hey, the "nomad.client.allocs" metrics are emitted: https://github.com/hashicorp/nomad/blob/master/client/alloc_runner.go#L774

I have also made a PR to fix the uptime metric as that was a mistake. If you see any other mismatched metrics please let us know.

@schmichael
Member

Just to clarify: the nomad.client.allocs metrics only exist once an allocation is running. There's a slight delay between an allocation starting to run and the first metrics being gathered and exposed.
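
Because of that delay, a single query right after starting a job can come back empty. A rough sketch of polling until the metric shows up. The function `wait_for_metric` and its arguments are hypothetical, not part of Nomad; the command that fetches the metrics is passed in, so nothing about the API is assumed beyond the curl call already used in this issue:

```shell
# Hypothetical helper: wait_for_metric NAME TRIES DELAY CMD...
# Runs CMD (which should print the metrics JSON) up to TRIES times,
# sleeping DELAY seconds between attempts, until a metric whose name
# starts with NAME shows up. Returns 0 if found, 1 otherwise.
wait_for_metric() {
  name=$1 tries=$2 delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -F: treat the dots in the name literally; -q: print nothing.
    # The leading quote anchors the match to the start of a JSON string.
    if "$@" | grep -Fq "\"$name"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_for_metric nomad.client.allocs 10 5 curl -s 10.146.157.234:4646/v1/metrics` would retry for roughly 50 seconds before giving up. Unlike a bare `grep nomad.client.allocs`, the fixed-string match does not treat `.` as a wildcard.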

@jesusvazquez
Contributor Author

Thanks for the clarifications, guys.

@schmichael I already have allocations running, but I have a question: if any allocation has ever failed, would Nomad emit nomad.client.allocs.failed?

Thanks for your time!

@schmichael
Member

If any allocation has ever failed, would Nomad emit nomad.client.allocs.failed?

Yes. That metric is a counter, so it will increment with each allocation failure on a client node until that client node's Nomad process is restarted. Even if the failed allocations are garbage collected, the counter will not reset until the process exits.

Hope that helps!
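
Since the counter described above starts again from zero when the Nomad process restarts, anything that totals failures over time has to account for those resets. A minimal sketch, assuming you have already collected successive readings of the counter, one value per line, oldest first. The function name `counter_increase` and the input format are hypothetical, and treating any drop in value as a restart is an assumption of this sketch:

```shell
# Hypothetical helper: read one counter reading per line on stdin (oldest
# first) and print the total increase, tolerating resets to zero.
counter_increase() {
  awk '
    { v = $1 + 0 }                      # force a numeric comparison
    NR == 1 { prev = v; next }          # first reading is the baseline
    v >= prev { total += v - prev }     # normal growth since last reading
    v <  prev { total += v }            # restart: counter began again at 0
    { prev = v }
    END { print total + 0 }             # prints 0 when there is no growth
  '
}
```

For instance, `printf '%s\n' 2 5 5 1 3 | counter_increase` prints `6`: three failures before the restart, then three more counted from zero afterwards.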

@jesusvazquez
Contributor Author

Yeah, that's enough! Thanks for the quick responses, guys. Keep up the good work!

@surajthakur

@alex,

I didn't understand the comment that the "nomad.client.allocs" metrics are emitted.
How can I enable them?

Basically, my objective is to find out whether any of my jobs is restarting or in a failed state.

I am using Prometheus to capture data from Nomad telemetry.

@github-actions

github-actions bot commented Nov 9, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 9, 2022