Missing kernel memory usage metrics #11048

Open

nvx opened this issue Aug 12, 2021 · 4 comments
@nvx
Contributor

nvx commented Aug 12, 2021

Nomad version

Nomad v1.1.0 (2678c36)

Operating system and Environment details

RHEL7 x64 and RancherOS

Issue

Nomad 0.12.3 used to expose the metrics nomad.client.allocs.memory.kernel_max_usage and nomad.client.allocs.memory.kernel_usage (as nomad_client_allocs_memory_kernel_max_usage and nomad_client_allocs_memory_kernel_usage when using Prometheus), as per https://www.nomadproject.io/docs/operations/metrics

However, after upgrading some workers to 1.1.0 a while back, I've noticed these kernel memory metrics are now missing. Other memory metrics (all prefixed with nomad.client.allocs.memory.) still show up; specifically, I can see allocated, cache, max_usage, rss, swap, and usage fine.

Reproduction steps

Scrape Prometheus metrics with at least one running alloc.
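For example (a sketch; this assumes the client agent has prometheus_metrics and publish_allocation_metrics enabled in its telemetry block, and is listening on the default address):

```sh
# With at least one alloc running, fetch the Prometheus-format metrics from
# the agent and filter for the kernel memory series.
curl -s 'http://127.0.0.1:4646/v1/metrics?format=prometheus' | grep memory_kernel
# On v1.1.0 this comes back empty, even though the docs list
# nomad.client.allocs.memory.kernel_usage and kernel_max_usage.
```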

Expected Result

Expected to see kernel memory metrics as per the docs

Actual Result

Kernel memory metrics are missing

@nvx nvx added the type/bug label Aug 12, 2021
@jrasell
Member

jrasell commented Aug 13, 2021

Hi @nvx, I believe this is a result of the change in #10376, which landed in v1.1.0. If you have clients still running v0.12.3, or have the information available, was this data point ever at a non-zero value? If you're able to tell us which driver you're seeing this with as well, that would be great.

@jrasell jrasell self-assigned this Aug 13, 2021
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 13, 2021
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Aug 13, 2021
@nvx
Contributor Author

nvx commented Aug 16, 2021

> Hi @nvx, I believe this is a result of the change in #10376, which landed in v1.1.0. If you have clients still running v0.12.3, or have the information available, was this data point ever at a non-zero value? If you're able to tell us which driver you're seeing this with as well, that would be great.

My workloads all run using the docker driver.

I've still got v0.12.3 clients; I just had a look at the metrics they report, and they all appear to be 0 for the workloads I'm running. I must admit that when I built dashboards I just looked at the docs and built them from there (which, for the metrics I was interested in, meant adding up all the memory usage metrics to compare against the threshold), so I didn't notice at the time that kernel memory usage was always 0. Is there any documentation somewhere indicating which metrics are applicable to which drivers? I couldn't spot anything at a glance.

I must admit, though, that silently dropping metrics with a value of 0 can cause issues under Prometheus: an expression like memory_rss + memory_kernel to get total memory usage will return no value at all if memory_kernel is absent, because the entire series gets dropped from the result. This can make building dashboards difficult when a metric is absent in some cases but present (and useful to show) in others.
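For reference, a workaround along these lines should keep the sum alive when the kernel series is absent (a PromQL sketch using the documented metric names, untested):

```promql
# Total memory usage per alloc. The "or rss * 0" fallback supplies a
# zero-valued series with matching labels whenever the kernel metric is
# absent, so the addition isn't silently dropped.
nomad_client_allocs_memory_rss
  + (nomad_client_allocs_memory_kernel_usage or nomad_client_allocs_memory_rss * 0)
```

It would obviously be nicer not to need the fallback in every dashboard query, hence the point above about always registering the metrics.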

@jrasell
Member

jrasell commented Aug 16, 2021

Thanks for the information @nvx.

The Docker driver doesn't expose the memory kernel_max_usage data point, so the metric is not missing: you are seeing the correct behaviour as a result of the recent fix.

I do agree with both your points regarding better documentation, as well as better support for registering Prometheus metrics so they are always available for scraping and not silently dropped. In order to properly capture these, I would like to close this issue out and raise two new issues, linked to this one, detailing the above items. Does this sound acceptable to you?

@nvx
Contributor Author

nvx commented Aug 18, 2021

> I do agree with both your points regarding better documentation, as well as better support for registering Prometheus metrics so they are always available for scraping and not silently dropped. In order to properly capture these, I would like to close this issue out and raise two new issues, linked to this one, detailing the above items. Does this sound acceptable to you?

Sounds great, cheers!

@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021