Kepler does not report metrics on resources outside of some system namespaces #1771
I have updated kepler to the latest version (0.7.12). There was no improvement, i.e. the problem persists. The configuration and values used are:
@BoyanBanev I looked at the cluster B log and metrics. There were processes (i.e. mongodb_exporte) recorded in the log, but they are not showing up in the metrics. For reference, scraping the kepler endpoint only shows the currently active processes; if a process is not active, kepler doesn't report any metrics for it during that sample window. Can you query Prometheus and see if the results match what you have on cluster B? Prometheus keeps historical records of all kepler-reported metrics scraped over time.
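One way to compare a single scrape against what Prometheus has retained is to list which namespaces actually appear in the scraped payload. A minimal sketch (the metric lines and values below are invented for illustration, not taken from the attached logs):

```python
import re

# Hypothetical excerpt of a kepler /metrics scrape (names and values invented)
sample = """\
kepler_container_joules_total{container_namespace="kube-system",pod_name="coredns-abc",mode="dynamic"} 12.5
kepler_container_joules_total{container_namespace="monitoring",pod_name="mongodb-0",mode="dynamic"} 3.1
"""

# Only processes active during the sample window show up in a scrape, so
# comparing the namespaces present in one scrape against Prometheus's
# historical series shows whether a workload was simply idle when sampled.
namespaces = sorted(set(re.findall(r'container_namespace="([^"]+)"', sample)))
print(namespaces)  # -> ['kube-system', 'monitoring']
```

Running the same extraction over the saved ClusterB_metrics.txt would show whether any non-system namespace ever made it into a scrape.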
Thanks @rootfs! That is exactly the problem: I see all my apps in the logs, but no metrics. Unfortunately I don't keep this data. I will re-run a test and make sure the processes are active. How long is the sampling window? I.e., for how long does a process need to be active for it to be picked up by kepler?
I narrowed the problem down to different behavior of the kepler-exporter on the control and worker nodes of my cluster. When trying to access the metrics endpoint on the worker node, I get no response:
However, when using another URL, I do get the forwarding:
which means the web server works and I don't have connectivity or network policy issues. Unfortunately I get nothing in the logs when trying to read the metrics endpoint. Any help is greatly appreciated!
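When probing the endpoint by hand on an IPv6-only node, the IPv6 literal has to be bracketed in the URL. A small probe sketch (both addresses below are hypothetical placeholders, not taken from this cluster):

```python
from urllib.request import urlopen

# Hypothetical addresses for illustration: an IPv6 pod/node IP (cluster B)
# and an IPv4 one (cluster A). IPv6 literals must be wrapped in brackets.
for url in ("http://[fd00:10::5]:9102/metrics",  # hypothetical IPv6 literal
            "http://10.0.0.5:9102/metrics"):     # hypothetical IPv4 address
    try:
        with urlopen(url, timeout=5) as resp:
            print(url, "->", resp.status)
    except OSError as exc:  # covers DNS, route, refused, and timeout errors
        print(url, "->", exc)
```

If the IPv6 probe is refused while an IPv4 probe on the same node answers, that points at the exporter's listening address rather than at network policy.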
What happened?
We have deployed kepler in two clusters. The only difference between them is that cluster A is dual stack and cluster B is IPv6 only.
Kepler reports metrics correctly from cluster A. From cluster B we can only see metrics reported for some system namespaces (e.g. kube-system) and for kepler itself.
ClusterA.log
ClusterA_metrics.txt
ClusterB.log
ClusterB_metrics.txt
What did you expect to happen?
I expect kepler to report metrics for all resources on cluster B.
How can we reproduce it (as minimally and precisely as possible)?
Run kepler in an IPv6-only Kubernetes cluster.
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Name: kepler
Selector: app.kubernetes.io/component=exporter,app.kubernetes.io/name=kepler
Node-Selector: kubernetes.io/os=linux
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=kepler
app.kubernetes.io/version=release-0.7.10
helm.sh/chart=kepler-0.5.7
helm.toolkit.fluxcd.io/name=kepler
helm.toolkit.fluxcd.io/namespace=kepler
Annotations: deprecated.daemonset.template.generation: 1
meta.helm.sh/release-name: kepler
meta.helm.sh/release-namespace: kepler
telegraf.influxdata.com/class: app
telegraf.influxdata.com/env-fieldref-HOSTIP: status.hostIP
telegraf.influxdata.com/env-fieldref-NAMESPACE: metadata.namespace
telegraf.influxdata.com/env-fieldref-PODIP: status.podIP
telegraf.influxdata.com/env-fieldref-PODNAME: metadata.name
telegraf.influxdata.com/volume-mounts: {"cdi-user":"/var/local"}
Desired Number of Nodes Scheduled: 6
Current Number of Nodes Scheduled: 6
Number of Nodes Scheduled with Up-to-date Pods: 6
Number of Nodes Scheduled with Available Pods: 6
Number of Nodes Misscheduled: 0
Pods Status: 6 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: app.kubernetes.io/component=exporter
app.kubernetes.io/name=kepler
monitoring=aid
Annotations: telegraf.influxdata.com/class: app
telegraf.influxdata.com/env-fieldref-HOSTIP: status.hostIP
telegraf.influxdata.com/env-fieldref-NAMESPACE: metadata.namespace
telegraf.influxdata.com/env-fieldref-PODIP: status.podIP
telegraf.influxdata.com/env-fieldref-PODNAME: metadata.name
telegraf.influxdata.com/volume-mounts: {"cdi-user":"/var/local"}
Service Account: kepler
Containers:
kepler-exporter:
Image: artifactory.devops.telekom.de/dtt-cbdev-boyanslab-dev-docker/kepler:0.7.10
Port: 9102/TCP
Host Port: 9102/TCP
Args:
-v=$(KEPLER_LOG_LEVEL)
Liveness: http-get http://:9102/healthz delay=10s timeout=10s period=60s #success=1 #failure=5
Environment:
NODE_IP: (v1:status.hostIP)
NODE_NAME: (v1:spec.nodeName)
METRIC_PATH: /metrics
BIND_ADDRESS: 0.0.0.0:9102
CGROUP_METRICS: *
CPU_ARCH_OVERRIDE:
ENABLE_EBPF_CGROUPID: true
ENABLE_GPU: false
ENABLE_PROCESS_METRICS: false
ENABLE_QAT: true
EXPOSE_CGROUP_METRICS: true
EXPOSE_HW_COUNTER_METRICS: true
EXPOSE_IRQ_COUNTER_METRICS: true
KEPLER_LOG_LEVEL: 6
Mounts:
/lib/modules from lib-modules (rw)
/proc from proc (rw)
/sys from tracing (rw)
/usr/src from usr-src (rw)
Volumes:
lib-modules:
Type: HostPath (bare host directory volume)
Path: /lib/modules
HostPathType: DirectoryOrCreate
tracing:
Type: HostPath (bare host directory volume)
Path: /sys
HostPathType: Directory
proc:
Type: HostPath (bare host directory volume)
Path: /proc
HostPathType: Directory
usr-src:
Type: HostPath (bare host directory volume)
Path: /usr/src
HostPathType: Directory
cdi-user:
Type: Secret (a volume populated by a Secret)
SecretName: cdi-user-appmetrics
Optional: false
Node-Selectors: kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/control-plane:NoSchedule
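The environment above sets BIND_ADDRESS to 0.0.0.0:9102, the IPv4 wildcard. A minimal sketch, unrelated to kepler's actual implementation, of how an IPv4-wildcard bind differs from an IPv6-wildcard bind (port 0 is used so the example never collides with a real service):

```python
import socket

# A socket bound to 0.0.0.0 accepts IPv4 clients only, so on an IPv6-only
# network nothing can reach it. Binding to "::" (the IPv6 wildcard) with
# IPV6_V6ONLY disabled accepts both families on Linux.
s4 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s4.bind(("0.0.0.0", 0))   # IPv4 wildcard: IPv4 clients only
print("IPv4-only listener:", s4.getsockname())

if socket.has_ipv6:
    s6 = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    s6.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 0)
    s6.bind(("::", 0))    # IPv6 wildcard: both families when V6ONLY=0
    print("Dual-stack listener:", s6.getsockname()[:2])
    s6.close()
s4.close()
```

Whether kepler's exporter honors an IPv6 literal such as [::]:9102 in BIND_ADDRESS would need to be confirmed against its own documentation; this sketch only illustrates the socket-level difference.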
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)