Pod and container usage percentage is not properly calculated #10658
Sorry to report, but I am having the same issue: the limit percentage shows the same value as the node percentage and does not reflect the resource quotas that are set. I am using Metricbeat version 6.6.
Some metrics in the Metricbeat kubernetes module are cached for a period of time; if they are not updated, they are removed. But it is common to have pods or containers that are not updated for longer than the cache expiration. The previous implementation did not renew expiration times for cache entries, so all of them were eventually removed if no updates were received for them. Replace it with the cache implementation available in libbeat, keeping the existing interface. Also, use slashes instead of dashes to generate unique container UIDs: dashes can appear in Kubernetes names, which could lead to ambiguous cache keys. Fix #10658
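For illustration, here is a minimal, self-contained sketch of the two ideas in this fix: renewing the cache expiration on every update, and building container UIDs with slashes. It is not the actual libbeat cache implementation, and all names in it are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type entry struct {
	value    float64
	deadline time.Time
}

// perfCache stores per-container metrics and renews the TTL on every Set,
// so entries only expire when no updates arrive for longer than the timeout.
type perfCache struct {
	mu      sync.Mutex
	timeout time.Duration
	data    map[string]entry
}

func newPerfCache(timeout time.Duration) *perfCache {
	return &perfCache{timeout: timeout, data: map[string]entry{}}
}

func (c *perfCache) Set(key string, value float64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// Renewing the deadline on every update is the behaviour the old
	// implementation lacked.
	c.data[key] = entry{value: value, deadline: time.Now().Add(c.timeout)}
}

func (c *perfCache) Get(key string) (float64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.data[key]
	if !ok || time.Now().After(e.deadline) {
		delete(c.data, key)
		return 0, false
	}
	return e.value, true
}

// containerUID joins the identifiers with "/" rather than "-", since "/"
// cannot appear in Kubernetes names and so cannot produce ambiguous keys.
func containerUID(namespace, pod, container string) string {
	return namespace + "/" + pod + "/" + container
}

func main() {
	cache := newPerfCache(2 * time.Minute)
	uid := containerUID("monitoring", "fluentd-0", "fluentd")
	cache.Set(uid, 1073741824) // e.g. a memory limit in bytes
	if v, ok := cache.Get(uid); ok {
		fmt.Println(uid, v)
	}
}
```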
…#10946) Backport of the fix described above: renew cache expiration times on update and use slashes in container UIDs. Fix elastic#10658 (cherry picked from commit 106df3d)
…#11057) Backport of the same fix. Fix #10658 (cherry picked from commit 106df3d)
…#11058) Backport of the same fix. Fix #10658 (cherry picked from commit 106df3d)
I've been testing with the 6.6.2 stack and I am getting CPU and memory percentages according to limits. Can you make sure you are receiving these fields?
When I filter on kubernetes.container.cpu.usage.limit.pct:exists it returns nothing; the same goes for memory.usage.limit.pct.
I am reopening this as there still seem to be issues around this.
Hi @jsoriano, I think we hit the same issue. We are running 7.2; I am adding some of the details we found in case they are helpful for the investigation. Sorry about the format, but I couldn't copy it properly on GitHub for some reason.
@mingue does this pod have multiple containers?
@jsoriano no, they only run one container per pod.
For the moment I found a workaround. It looks like limit metrics at the container level are calculated correctly, so I switched my dashboards to get data from containers. You can then use kubernetes.container._module.pod.name if you want to group metrics by pod. This might not work for scenarios where you have several containers in one pod, depending on your needs.
Hi, we have given another try to reproducing these issues, but we haven't been able to do so. We think that the documentation of these fields can be improved, and we will do so to point out that, for pod percentage calculations, if any of a pod's containers doesn't have a limit configured, then the resources available in the node are considered the limit. This is why in some cases the percentage appears lower than expected at the pod level, but correct at the container level (see the sketch after this comment). Take into account that Kubernetes doesn't have resource usage metrics at the pod level (for resources such as CPU or memory). To calculate these metrics for pods, Metricbeat aggregates the metrics of their containers. Regarding percentages against limits, there can be three cases:
In any case, the resource usage percentages at the pod level can hide problems in a specific container for pods with multiple containers, so these metrics are good for an overview of the global state of a system, but for more fine-grained details it is better to use the container-level metrics in these scenarios. @mingue, if you are still seeing pod percentages with lower than expected values in single-container pods, please send us a complete pod manifest that can be used to reproduce the issue. @adalga, are you still experiencing issues with these metrics? Have you tried updating Metricbeat? What version of Kubernetes are you using? Thanks all for your reports!
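For illustration, here is a rough sketch of the pod-level calculation as described in the comment above: usage is aggregated across the pod's containers, and the node's allocatable capacity is used as the denominator as soon as any container has no limit configured. The function and field names are hypothetical, not Metricbeat code.

```go
package main

import "fmt"

type container struct {
	usage float64 // e.g. memory bytes or CPU cores in use
	limit float64 // 0 means "no limit configured"
}

// podUsagePct aggregates container usage and picks the denominator:
// the summed limits if every container has one, otherwise the node capacity.
func podUsagePct(containers []container, nodeAllocatable float64) float64 {
	var usage, limits float64
	allLimited := true
	for _, c := range containers {
		usage += c.usage
		if c.limit == 0 {
			allLimited = false
		}
		limits += c.limit
	}
	denominator := nodeAllocatable
	if allLimited {
		denominator = limits
	}
	return usage / denominator
}

func main() {
	node := 33.7e9 // hypothetical node allocatable memory, in bytes

	// One container with a limit, one without: the node capacity is used,
	// so the pod percentage looks much smaller than the container one.
	mixed := []container{
		{usage: 357859328, limit: 1073741824},
		{usage: 10e6, limit: 0},
	}
	fmt.Printf("pod pct (one container unlimited): %.4f\n", podUsagePct(mixed, node))

	// All containers limited: the summed limits are used instead.
	limited := []container{{usage: 357859328, limit: 1073741824}}
	fmt.Printf("pod pct (all containers limited):  %.4f\n", podUsagePct(limited, node))
}
```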
Improvements to the docs have been merged, so I am closing this issue. We will consider reopening it if we find a way to reproduce the problems.
@jsoriano: We are currently still running into this issue with Metricbeat 7.9.2 on an Azure AKS cluster, even though we have limits set for all our containers (at least for memory).

Our Setup

We are using the ECK operator to deploy Metricbeat, but I guess that shouldn't make a difference. We are using a DaemonSet and a Deployment; the configs are below. I also provided some data (the resource quota as well as document extracts for the kubernetes.state_container and kubernetes.container metricsets).
Ideas

At this point I am wondering if this is caused either by:
Any other ideas about what is causing this?

Configs

Daemonset Config

metricbeat:
autodiscover:
providers:
- hints:
default_config: {}
enabled: "true"
node: ${NODE_NAME}
type: kubernetes
# COLLECT PROMETHEUS METRICS
- type: kubernetes
node: ${NODE_NAME}
ssl:
verification_mode: "none"
include_annotations: ["prometheus.io.scrape"]
templates:
- condition:
contains:
kubernetes.annotations.prometheus.io/scrape: "true"
config:
- module: prometheus
metricsets: ["collector"]
hosts: ["${data.host}:${data.kubernetes.annotations.prometheus.io/port}"]
metrics_path: "${data.kubernetes.annotations.prometheus.io/path}"
modules:
# COLLECT NODE SERVER HOST METRICS
- module: system
period: 10s
metricsets:
- cpu
- load
- memory
- network
# COLLECT NODE SERVER STORAGE METRICS (less frequent)
- module: system
period: 1m
metricsets:
- filesystem
- fsstat
processors:
- drop_event:
when:
regexp:
system:
filesystem:
mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib)($|/)
# COLLECT HOST KUBERNETES DATA
- module: kubernetes
period: 10s
node: ${NODE_NAME}
hosts:
- https://${NODE_NAME}:10250
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
metricsets:
- node
- system
- pod
- container
- volume
processors:
- add_cloud_metadata: {}
- add_host_metadata: {}

Deployment Config

metricbeat:
modules:
- module: kubernetes
period: 10s
host: ${NODE_NAME}
metricsets:
- state_node
- state_deployment
- state_replicaset
- state_statefulset
- state_pod
- state_container
- state_cronjob
- state_resourcequota
- state_service
- state_persistentvolume
- state_persistentvolumeclaim
- state_storageclass
- event
hosts:
- "kube-state-metrics.monitoring.svc.cluster.local:8080"
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
processors:
- add_cloud_metadata: {}
- add_host_metadata: {}
- add_kubernetes_metadata:
in_cluster: true
# workaround for https://github.com/elastic/beats/issues/17447 until version 7.10.0 is released
- drop_fields:
when:
equals:
kubernetes.service.cluster_ip: "None"
fields: ["kubernetes.service.cluster_ip"]
ignore_missing: true

Example Data For Pod fluentd-0

Pod Limit

$ kubectl get pod -n monitoring fluentd-0 -o jsonpath='{range .spec.containers[*]}{.resources}{"\n"}{end}'
{"limits":{"memory":"1Gi"},"requests":{"cpu":"300m","memory":"1Gi"}} kubernetes.state_container metricReturned by the Single-Pod-Deployment: "kubernetes": {
"statefulset": {
"name": "fluentd"
},
"labels": { ... },
"container": {
"memory": {
"request": {
"bytes": 1073741824
},
"limit": {
"bytes": 1073741824
}
},
"id": "...",
"image": "...",
"name": "fluentd",
"status": {
"restarts": 0,
"ready": true,
"phase": "running"
},
"cpu": {
"request": {
"cores": 0.3
}
}
},
"namespace": "monitoring",
"pod": {
"name": "fluentd-0",
"uid": "12bc54e3-0200-41c0-ae3f-f39812c36848"
},
"node": { ... }
}

kubernetes.container metric

Returned by the Daemonset:

"kubernetes": {
"pod": {
"name": "fluentd-0"
},
"container": {
"name": "fluentd",
"cpu": {
"usage": {
"core": {
"ns": 3800034327709
},
"node": {
"pct": 0.0018974465
},
"limit": {
"pct": 0.0018974465
},
"nanocores": 7589786
}
},
"memory": {
"usage": {
"node": {
"pct": 0.010616298949767972
},
"limit": {
"pct": 0.010616298949767972
},
"bytes": 357859328
},
"workingset": {
"bytes": 357851136
},
"rss": {
"bytes": 243007488
},
"pagefaults": 82343,
"majorpagefaults": 0,
"available": {
"bytes": 715890688
}
}
}
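A quick check of the figures in the example above, assuming the percentage fields mean usage divided by the corresponding capacity: the reported memory.usage.limit.pct matches a node-sized denominator rather than the configured 1Gi limit.

```go
package main

import "fmt"

func main() {
	usage := 357859328.0                // memory.usage.bytes from the event
	limit := 1073741824.0               // configured memory limit (1Gi)
	reportedPct := 0.010616298949767972 // memory.usage.limit.pct from the event

	fmt.Printf("usage / limit         = %.4f\n", usage/limit)           // ~0.3333, the value limit.pct should have
	fmt.Printf("implied capacity (GB) = %.1f\n", usage/reportedPct/1e9) // ~33.7, i.e. node-sized, not 1 GiB
}
```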
@jsoriano we have tested with 7.17.3 and 8.1.3 and we are hitting this issue. We have between 1 and 3 sidecar containers. Looking at this code https://github.com/elastic/beats/pull/6158/files we initially thought that maybe the limits are not set for some of the containers, but they are set for all of them. There is a 5m period on both metricsets, and Metricbeat is running as a daemonset.

One issue is that the metrics are sometimes more than 2 minutes apart, which seems like a lot since the period is the same, but I guess it depends on the difference in startup time between the leader and the Metricbeat instance monitoring the current pod. The screenshot is for memory, but the same issue occurs for CPU. For the same pod, a few days ago the metrics were computed correctly, so it seems like a race condition when correlating the documents from the different metricsets.

Related question: is there a way to align the collection period to the clock time, for example a period of 15m aligned on a best-effort basis to :00, :15, :30, :45, regardless of when the Metricbeat pod starts?
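For illustration, a minimal Go sketch of the kind of clock-aligned scheduling being asked about here. This is a generic pattern, not a Metricbeat configuration option: sleep until the next wall-clock boundary, then tick at the fixed period.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	period := 15 * time.Minute

	// Truncate works against absolute time, so for a 15m period the next
	// boundary falls on :00, :15, :30 or :45. Sleep until then.
	now := time.Now()
	next := now.Truncate(period).Add(period)
	time.Sleep(time.Until(next))

	// From the boundary on, collect at the fixed period.
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for t := range ticker.C {
		fmt.Println("collect at", t.Format("15:04:05"))
	}
}
```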
…#10946) (elastic#11059) Backport of the same fix: renew cache expiration times on update and use slashes in container UIDs. Fix elastic#10658 (cherry picked from commit db4b4c2)
…#10946) (elastic#11060) Backport of the same fix. Fix elastic#10658 (cherry picked from commit db4b4c2)
It has been reported that in some cases pod and/or container usage percentages are not being correctly calculated. This seems related to the retrieval or calculation of container limits.
For example:
There are also reports in which the percentages seem to be calculated during some periods with the container limits, and some periods with the node limits.
Related issues fixed in the past:
It is not clear how to reproduce these issues.