Kepler does not report metrics on resources outside of some system namespaces #1771
I have updated kepler to the latest version (0.7.12). There was no improvement, i.e. the problem persists. The configuration and values used are:
@BoyanBanev I looked at the cluster B log and metrics. There were processes (i.e. mongodb_exporte) recorded in the log, but they are not showing up in the metrics. For reference, scraping the kepler endpoint only shows the currently active processes; if a process is not active, kepler doesn't report any metrics for it during that sample window. Can you query Prometheus and see if the results match what you have on cluster B? Prometheus keeps historical records of all kepler-reported metrics scraped over time.
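One way to compare a single scrape against what Prometheus has retained is to list which namespaces actually appear in the scraped payload. A minimal sketch (the metric lines and values below are invented for illustration, not taken from the attached logs):

```python
import re

# Hypothetical excerpt of a kepler /metrics scrape (names and values invented)
sample = """\
kepler_container_joules_total{container_namespace="kube-system",pod_name="coredns-abc",mode="dynamic"} 12.5
kepler_container_joules_total{container_namespace="monitoring",pod_name="mongodb-0",mode="dynamic"} 3.1
"""

# Only processes active during the sample window show up in a scrape, so
# comparing the namespaces present in one scrape against Prometheus's
# historical series shows whether a workload was simply idle when sampled.
namespaces = sorted(set(re.findall(r'container_namespace="([^"]+)"', sample)))
print(namespaces)  # -> ['kube-system', 'monitoring']
```

Running the same extraction over the saved ClusterB_metrics.txt would show whether any non-system namespace ever made it into a scrape.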
Thanks @rootfs! That is exactly the problem: I see all my apps in the logs, but no metrics. Unfortunately I don't keep this data. I will re-run a test and make sure the processes are active. How long is the sampling window? I.e., for how long does a process need to be active for it to be picked up by kepler?
I narrowed the problem down to different behavior of the kepler-exporter on the control and worker nodes of my cluster. When trying to access the metrics endpoint on the worker node, I get no response:
However, when using another URL, I do get the forwarding:
which means the web server works and I don't have connectivity or network policy issues. Unfortunately I get nothing in the logs when trying to read the metrics endpoint. Any help is greatly appreciated!
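When probing the endpoint by hand on an IPv6-only node, the IPv6 literal has to be bracketed in the URL. A small probe sketch (both addresses below are hypothetical placeholders, not taken from this cluster):

```python
from urllib.request import urlopen

# Hypothetical addresses for illustration: an IPv6 pod/node IP (cluster B)
# and an IPv4 one (cluster A). IPv6 literals must be wrapped in brackets.
for url in ("http://[fd00:10::5]:9102/metrics",  # hypothetical IPv6 literal
            "http://10.0.0.5:9102/metrics"):     # hypothetical IPv4 address
    try:
        with urlopen(url, timeout=5) as resp:
            print(url, "->", resp.status)
    except OSError as exc:  # covers DNS, route, refused, and timeout errors
        print(url, "->", exc)
```

If the IPv6 probe is refused while an IPv4 probe on the same node answers, that points at the exporter's listening address rather than at network policy.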
What happened?
We have deployed kepler in two clusters. The only difference between them is that cluster A is dual stack and cluster B is IPv6 only.
Kepler reports metrics correctly from cluster A. From cluster B we can only see metrics reported for some system namespaces (e.g. kube-system) and for kepler itself.
ClusterA.log
ClusterA_metrics.txt
ClusterB.log
ClusterB_metrics.txt
What did you expect to happen?
I expect kepler to report metrics for all resources on cluster B.
How can we reproduce it (as minimally and precisely as possible)?
Run kepler in an IPv6-only Kubernetes cluster.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Name: kepler
Selector: app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler
Node-Selector: kubernetes.io/os=linux
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kepler
app.kubernetes.io/version=release-0.7.10
helm.sh/chart=kepler-0.5.7
helm.toolkit.fluxcd.io/name=kepler
helm.toolkit.fluxcd.io/namespace=kepler
Annotations: deprecated.daemonset.template.generation: 1
meta.helm.sh/release-name: kepler
meta.helm.sh/release-namespace: kepler
telegraf.influxdata.com/class: app
telegraf.influxdata.com/env-fieldref-HOSTIP: status.hostIP
telegraf.influxdata.com/env-fieldref-NAMESPACE: metadata.namespace
telegraf.influxdata.com/env-fieldref-PODIP: status.podIP
telegraf.influxdata.com/env-fieldref-PODNAME: metadata.name
telegraf.influxdata.com/volume-mounts: {"cdi-user":"/var/local"}
Desired Number of Nodes Scheduled: 6
Current Number of Nodes Scheduled: 6
Number of Nodes Scheduled with Up-to-date Pods: 6
Number of Nodes Scheduled with Available Pods: 6
Number of Nodes Misscheduled: 0
Pods Status: 6 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler
monitoring=aid
Annotations: telegraf.influxdata.com/class: app
telegraf.influxdata.com/env-fieldref-HOSTIP: status.hostIP
telegraf.influxdata.com/env-fieldref-NAMESPACE: metadata.namespace
telegraf.influxdata.com/env-fieldref-PODIP: status.podIP
telegraf.influxdata.com/env-fieldref-PODNAME: metadata.name
telegraf.influxdata.com/volume-mounts: {"cdi-user":"/var/local"}
Service Account: kepler
Containers:
kepler-exporter:
Image: artifactory.devops.telekom.de/dtt-cbdev-boyanslab-dev-docker/kepler:0.7.10
Port: 9102/TCP
Host Port: 9102/TCP
Args:
-v=$(KEPLER_LOG_LEVEL)
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_IP: (v1:status.hostIP)
NODE_NAME: (v1:spec.nodeName)
METRIC_PATH: /metrics
BIND_ADDRESS: 0.0.0.0:9102
CGROUP_METRICS: *
CPU_ARCH_OVERRIDE:
ENABLE_EBPF_CGROUPID: true
ENABLE_GPU: false
ENABLE_PROCESS_METRICS: false
ENABLE_QAT: true
EXPOSE_CGROUP_METRICS: true
EXPOSE_HW_COUNTER_METRICS: true
EXPOSE_IRQ_COUNTER_METRICS: true
KEPLER_LOG_LEVEL: 6
Mounts:
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/usr/src from usr-src (rw)
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
usr-src:
Type: HostPath (bare host directory volume)
Path: /usr/src
HostPathType: Directory
cdi-user:
Type: Secret (a volume populated by a Secret)
SecretName: cdi-user-appmetrics
Optional: false
Node-Selectors: kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
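The environment above sets BIND_ADDRESS to 0.0.0.0:9102, the IPv4 wildcard. A minimal sketch, unrelated to kepler's actual implementation, of how an IPv4-wildcard bind differs from an IPv6-wildcard bind (port 0 is used so the example never collides with a real service):

```python
import socket

# A socket bound to 0.0.0.0 accepts IPv4 clients only, so on an IPv6-only
# network nothing can reach it. Binding to "::" (the IPv6 wildcard) with
# IPV6_V6ONLY disabled accepts both families on Linux.
s4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s4.bind(("0.0.0.0", 0))   # IPv4 wildcard: IPv4 clients only
print("IPv4-only listener:", s4.getsockname())

if socket.has_ipv6:
    s6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s6.bind(("::", 0))    # IPv6 wildcard: both families when V6ONLY=0
    print("Dual-stack listener:", s6.getsockname()[:2])
    s6.close()
s4.close()
```

Whether kepler's exporter honors an IPv6 literal such as [::]:9102 in BIND_ADDRESS would need to be confirmed against its own documentation; this sketch only illustrates the socket-level difference.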
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)