Missing kernel memory usage metrics #11048

Open

nvx opened this issue Aug 12, 2021 · 4 comments
@nvx
Contributor

nvx commented Aug 12, 2021

Nomad version

Nomad v1.1.0 (2678c36)

Operating system and Environment details

RHEL7 x64 and RancherOS

Issue

Nomad 0.12.3 used to expose the metrics nomad.client.allocs.memory.kernel_max_usage and nomad.client.allocs.memory.kernel_usage (as nomad_client_allocs_memory_kernel_max_usage and nomad_client_allocs_memory_kernel_usage when using Prometheus), as per https://www.nomadproject.io/docs/operations/metrics

However, after upgrading some workers to 1.1.0 a while back, I've noticed these kernel memory metrics are now missing. Other memory metrics (all prefixed with nomad.client.allocs.memory.) still show up; specifically, I can see allocated, cache, max_usage, rss, swap, and usage fine.

Reproduction steps

Scrape Prometheus metrics with at least one running alloc.
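For example (a sketch; this assumes the client agent has prometheus_metrics and publish_allocation_metrics enabled in its telemetry block, and is listening on the default address):

```sh
# With at least one alloc running, fetch the Prometheus-format metrics from
# the agent and filter for the kernel memory series.
curl -s 'http://127.0.0.1:4646/v1/metrics?format=prometheus' | grep memory_kernel
# On v1.1.0 this comes back empty, even though the docs list
# nomad.client.allocs.memory.kernel_usage and kernel_max_usage.
```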

Expected Result

Expected to see kernel memory metrics as per the docs

Actual Result

Kernel memory metrics are missing

@nvx nvx added the type/bug label Aug 12, 2021
@jrasell
Member

jrasell commented Aug 13, 2021

Hi @nvx, I believe this is a result of the change in #10376, which landed in v1.1.0. If you have clients still running v0.12.3, or have the information available, was this data point ever at a non-zero value? If you're able to tell us which driver you're seeing this with as well, that would be great.

@jrasell jrasell self-assigned this Aug 13, 2021
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 13, 2021
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Aug 13, 2021
@nvx
Contributor Author

nvx commented Aug 16, 2021

> Hi @nvx, I believe this is a result of the change in #10376, which landed in v1.1.0. If you have clients still running v0.12.3, or have the information available, was this data point ever at a non-zero value? If you're able to tell us which driver you're seeing this with as well, that would be great.

My workloads all run using the docker driver.

I've still got v0.12.3 clients; I just had a look at the metrics they report, and they all appear to be 0 for the workloads I'm running. I must admit that when I built dashboards I just looked at the docs and built them from there (which, for the metrics I was interested in, meant adding up all the memory usage metrics to compare against the threshold), so I didn't notice at the time that kernel memory usage was always 0. Is there any documentation somewhere indicating which metrics are applicable to which drivers? I couldn't spot anything at a glance.

I must admit, though, that silently dropping metrics with a value of 0 can cause issues under Prometheus: an expression like memory_rss + memory_kernel to get total memory usage will return no value at all if memory_kernel is absent, because the entire series gets dropped from the result. This can make building dashboards difficult when a metric is absent in some cases but present (and useful to show) in others.
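For reference, a workaround along these lines should keep the sum alive when the kernel series is absent (a PromQL sketch using the documented metric names, untested):

```promql
# Total memory usage per alloc. The "or rss * 0" fallback supplies a
# zero-valued series with matching labels whenever the kernel metric is
# absent, so the addition isn't silently dropped.
nomad_client_allocs_memory_rss
  + (nomad_client_allocs_memory_kernel_usage or nomad_client_allocs_memory_rss * 0)
```

It would obviously be nicer not to need the fallback in every dashboard query, hence the point above about always registering the metrics.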

@jrasell
Member

jrasell commented Aug 16, 2021

Thanks for the information @nvx.

The Docker driver doesn't expose the memory kernel_max_usage data point, so the metric is not missing: you are seeing the correct behaviour as a result of the recent fix.

I do agree with both your points regarding better documentation, as well as better support for registering Prometheus metrics so they are always available for scraping and not silently dropped. In order to properly capture these, I would like to close this issue out and raise two new issues, linked to this one, detailing the above items. Does this sound acceptable to you?

@nvx
Contributor Author

nvx commented Aug 18, 2021

> I do agree with both your points regarding better documentation, as well as better support for registering Prometheus metrics so they are always available for scraping and not silently dropped. In order to properly capture these, I would like to close this issue out and raise two new issues, linked to this one, detailing the above items. Does this sound acceptable to you?

Sounds great, cheers!

@tgross tgross moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Nov 9, 2021