
Metricbeat 7.12.1 kubernetes.container.memory.usage.limit.pct is calculated incorrectly #25657

Closed · F-Potter opened this issue May 11, 2021 · 11 comments · Fixed by #29547
Labels: Team:Integrations, :Windows

@F-Potter

Hi,

Metricbeat provides the kubernetes.container.memory.usage.limit.pct value by taking kubernetes.container.memory.usage.bytes and dividing it by memory.max_usage_in_bytes (which can be defined by setting a resource memory limit in the Kubernetes deployment).

Kubernetes, however, OOM-kills containers based on kubernetes.container.memory.workingset.bytes. That means kubernetes.container.memory.usage.limit.pct is currently not a good value to alert on: kubernetes.container.memory.usage.bytes is higher than kubernetes.container.memory.workingset.bytes, so the metric gives false positives that the container is about to be OOM killed, while in reality it is still fine until kubernetes.container.memory.workingset.bytes reaches the resource memory limit.

Is it possible to adjust the kubernetes.container.memory.usage.limit.pct based on the kubernetes.container.memory.workingset.bytes?
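
For illustration, here is a minimal sketch of the two candidate calculations. This is not Metricbeat's actual code; the variable names and values are made up, only the metric names in the comments come from the discussion above:

```go
package main

import "fmt"

func main() {
	// Hypothetical values for a container with a 768 MiB memory limit.
	const mib = 1024.0 * 1024.0
	limitBytes := 768 * mib      // resources.limits.memory from the deployment spec
	usageBytes := 766 * mib      // kubernetes.container.memory.usage.bytes
	workingSetBytes := 580 * mib // kubernetes.container.memory.workingset.bytes

	// What usage.limit.pct reflects today: total usage relative to the limit.
	usageLimitPct := usageBytes / limitBytes

	// What the reporter wants to alert on: the working set relative to the
	// limit, since the OOM killer acts on the working set.
	workingSetLimitPct := workingSetBytes / limitBytes

	fmt.Printf("usage.limit.pct      = %.3f\n", usageLimitPct)      // ~0.997, looks critical
	fmt.Printf("workingset.limit.pct = %.3f\n", workingSetLimitPct) // ~0.755, still has headroom
}
```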

@jsoriano
Member

Hi @F-Potter,

kubernetes.container.memory.workingset.bytes is only greater than zero on Windows. Are your affected nodes running Windows?

If that is the case, I guess this is similar to the issue with pods, solved by #25428.

@elasticmachine
Collaborator

Pinging @elastic/integrations (Team:Integrations)

@F-Potter
Author

Hi @jsoriano,

No, the affected nodes are all running Ubuntu 18.04.5 LTS.

@F-Potter
Author

[Screenshot attached, 2021-05-11 13:44: the metric values referenced below]

@F-Potter
Author

As you can see here, the working set is 580 MB and the memory limit is 768 MB (it doesn't show in the output), but limit.pct says 0.998, which is based on memory.usage.bytes (766.5 MB).
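
Working those reported numbers out: 766.5 MB / 768 MB ≈ 0.998, which matches the reported limit.pct, while the working set gives 580 MB / 768 MB ≈ 0.755. Alerting on the current metric therefore fires long before the container is actually near being OOM killed.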

@brianharwell
Contributor

@F-Potter At what point does the pod get OOM killed? Do you have an example of the usage.limit.pct going above 100%?

@jsoriano
Member

> As you can see here, the working set is 580 MB and the memory limit is 768 MB (it doesn't show in the output), but limit.pct says 0.998, which is based on memory.usage.bytes (766.5 MB).

Oh yes, you are right, this value is also available on other OSes. This will need further investigation.

@F-Potter
Author

@brianharwell I will look at it; I'm not sure whether it stops at 100% or goes over it. But the issue is more that the wrong value is measured, since Kubernetes looks at kubernetes.container.memory.workingset.bytes. Monitoring a different value results in a different percentage, which in turn results in incorrect alerting.

@F-Potter
Author

The limit.pct stops at 1 (100%), so it won't go higher than that.

@brianharwell
Contributor

I am curious to see how this works on Linux, because on Windows I get memory errors when the working set is at 72% of the memory limit. I can try my test app on Linux and see what happens.

@faec
Contributor

faec commented Aug 26, 2021

This doesn't look strictly incorrect -- usage.limit.pct is still measuring a correct, useful value, and it's the value corresponding to usage.bytes, which is what would be expected from the metric name. I think the confusion here is that "usage" (as I understand it) is the full allocated memory of the container, including pages that may be on disk, idle, etc. So we wouldn't expect to see anything go above 100%, but also, "usage" of 99% isn't necessarily worrying the way a working set of that size would be. Maybe we should also be calculating and providing memory.workingset.limit.pct so users can monitor the appropriate one for their situation?
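
As a rough sketch of that suggestion (purely illustrative, not the actual change that closed this issue; the package and helper names below are made up), such a metric could be derived from the same inputs the module already has:

```go
package memory

// buildMemoryFields is a hypothetical helper (not the actual Metricbeat code)
// showing how both percentages could be reported side by side, so users can
// alert on whichever value matches their situation.
func buildMemoryFields(usageBytes, workingSetBytes, limitBytes float64) map[string]interface{} {
	fields := map[string]interface{}{
		"usage.bytes":      usageBytes,
		"workingset.bytes": workingSetBytes,
	}
	if limitBytes > 0 {
		fields["usage.limit.pct"] = usageBytes / limitBytes
		// The additional metric suggested above: working set relative to the limit.
		fields["workingset.limit.pct"] = workingSetBytes / limitBytes
	}
	return fields
}
```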
