vgpu-monitor-metrics does not show in grafana #410

Open
ltm920716 opened this issue Aug 1, 2024 · 12 comments

@ltm920716

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

  • the vGPU monitor metrics do not show in Grafana; dcgm-exporter is OK

  • Prometheus scrapes the metrics like below:
    image

  • I created a vGPU pod as below:
    image

  • the Grafana dashboards are below:
    image
    image

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

There are some things that should be updated in the README:
1. the exporter name for the Prometheus configuration in the current HAMi is hami-device-plugin-monitor, not vgpu-device-plugin-monitor (see the scrape config sketch after this list)
2. a ServiceAccount section should be added for the monitor
3. the grafana JSON should be updated to allow selecting the Prometheus data source
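
For item 1, a minimal Prometheus scrape configuration using the corrected name might look like the sketch below; the job name, namespace, and relabeling are illustrative assumptions, not the exact content the README should carry:

scrape_configs:
  - job_name: hami-device-plugin-monitor     # hypothetical job name
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system
    relabel_configs:
      # keep only endpoints that belong to the hami-device-plugin-monitor Service
      - source_labels: [__meta_kubernetes_service_name]
        regex: hami-device-plugin-monitor
        action: keep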

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
@chaunceyjiang
Contributor

You need to create a Prometheus ServiceMonitor.

@ltm920716
Author

Hi @chaunceyjiang,
I am sorry, maybe I am not getting your point.

I have now started the Prometheus server and got both the dcgm-exporter metrics and the vGPU exporter metrics, following this Grafana guide: https://github.com/Project-HAMi/HAMi/blob/master/docs/dashboard.md

I think all I should do is import the given grafana JSON and select the existing Prometheus data source, is that right?

@chaunceyjiang
Contributor

I think all I should do is import the given grafana JSON and select the existing Prometheus data source, is that right?

You need to create a Prometheus ServiceMonitor.

the vGPU monitor metrics do not show in Grafana; dcgm-exporter is OK

Because the dcgm-exporter includes a ServiceMonitor.
https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/templates/service-monitor.yaml

@ltm920716
Author

Hello,
thanks! I will have a try later.

That is to say, even though I can get the vGPU metrics at http://{scheduler ip}:{monitorPort}/metrics, like:

# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 3.145728e+09
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 1.073741824e+10
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 6.442450944e+09
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 1
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 1
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 1
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 1
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 0.13020833333333334
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 0.4444444444444444
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 1
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 1
# HELP nodeGPUOverview GPU overview on a certain node
.....

and scrape them into the Prometheus server, like:
image

I still must deploy a new ServiceMonitor for Grafana.

thanks again!

@ltm920716
Author

Hi @chaunceyjiang,
I deployed the ServiceMonitor as follows:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
kubectl apply -f servicemonitor.yaml -n kube-system

here is servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-monitor
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: monitorport
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s

verify:

$ kubectl get servicemonitor -n kube-system
NAME                         AGE
hami-device-plugin-monitor   10m

Now Grafana shows only one vGPU panel; the others still have no data.
image

image

@chaunceyjiang
Contributor

vGPU metrics at http://{scheduler ip}:{monitorPort}/metrics like:

If the GPU算力使用率 (GPU compute utilization) panel has no value, you can check whether the above URL is returning the Device_utilization_desc_of_container metric. If it is, you should then check whether the PromQL in the JSON file is written correctly.

@ltm920716
Author

sure, there is!
image

It shows that the utilization is 0, but it is actually 22%
image

@chaunceyjiang
Copy link
Contributor

What is the value set for your 'nvidia.com/gpucores'?

@ltm920716
Author

What is the value set for your 'nvidia.com/gpucores'?

I do not set 'nvidia.com/gpucores', I only set memory.

@chaunceyjiang
Contributor

Could you try setting a value for nvidia.com/gpucores, for example 50?
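
For illustration, a minimal pod spec with an explicit core share might look like the sketch below; the pod name, image, and values are placeholders rather than settings from this issue, and only the resource keys match the ones used later in this thread:

apiVersion: v1
kind: Pod
metadata:
  name: gpucores-test                # hypothetical pod name
spec:
  containers:
    - name: cuda-test
      image: your-cuda-image:latest  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1          # one vGPU
          nvidia.com/gpumem: 3000    # device memory in MB
          nvidia.com/gpucores: 50    # percentage of GPU cores, as suggested above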

@ltm920716
Author

ltm920716 commented Aug 21, 2024

Hi @chaunceyjiang,
I tested it, and the result does not match the request.
I applied a test pod as below:

$ kc describe pod instance-test-5b79c65497-cn4x2 -n t-maas
Name:             instance-test-5b79c65497-cn4x2
Namespace:        t-maas
Priority:         0
Service Account:  default
Node:             k8s-node1/192.168.10.230
Start Time:       Wed, 21 Aug 2024 22:08:51 +0800
Labels:           app=instance-test
                  pod-template-hash=5b79c65497
Annotations:      hami.io/bind-phase: success
                  hami.io/bind-time: 1724249331
                  hami.io/vgpu-devices-allocated: GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76,NVIDIA,500,5:;
                  hami.io/vgpu-devices-to-allocate: ;
                  hami.io/vgpu-node: k8s-node1
                  hami.io/vgpu-time: 1724249331
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "kube-ovn",
                        "ips": [
                            "10.244.3.186"
                        ],
                        "default": true,
                        "dns": {},
                        "gateway": [
                            "10.244.0.1"
                        ]
                    }]
                  nvidia.com/use-gputype: 4090
                  ovn.kubernetes.io/allocated: true
                  ovn.kubernetes.io/cidr: 10.244.0.0/16
                  ovn.kubernetes.io/gateway: 10.244.0.1
                  ovn.kubernetes.io/ip_address: 10.244.3.186
                  ovn.kubernetes.io/logical_router: ovn-cluster
                  ovn.kubernetes.io/logical_switch: ovn-default
                  ovn.kubernetes.io/mac_address: 00:00:00:10:0F:D2
                  ovn.kubernetes.io/pod_nic_type: veth-pair
                  ovn.kubernetes.io/routed: true
Status:           Running
IP:               10.244.3.186
IPs:
  IP:           10.244.3.186
Controlled By:  ReplicaSet/instance-test-5b79c65497
Containers:
  instance-test:
    Container ID:   containerd://717fec0d124e3deb2ea96c2cd10a01726bc4f1df9d325d8a59d3977e23cb58f3
    Image:          torch-cuda:2.0
    Image ID:       sha256:7f4965979f9de78b468b2444b58a83658b9d75929939e928f4393eec404a3181
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 21 Aug 2024 22:08:54 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:       1
      nvidia.com/gpucores:  5
      nvidia.com/gpumem:    500
    Requests:
      nvidia.com/gpu:       1
      nvidia.com/gpucores:  5
      nvidia.com/gpumem:    500
    Environment:
      PARAM_A:  100
      PARAM_B:  20
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjvdl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-bjvdl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From            Message
  ----    ------          ----  ----            -------
  Normal  Scheduled       12m   hami-scheduler  Successfully assigned t-maas/instance-test-5b79c65497-cn4x2 to k8s-node1
  Normal  AddedInterface  12m   multus          Add eth0 [10.244.3.186/16] from kube-ovn
  Normal  Pulled          12m   kubelet         Container image "torch-cuda:2.0" already present on machine
  Normal  Created         12m   kubelet         Created container instance-test
  Normal  Started         12m   kubelet         Started container instance-test

This pod requests 5 gpucores, but the monitor shows it uses more than 5:
image

Inside the pod it shows even more than the vGPU monitor does:
image

so is this a bug or something else?

thanks

@ltm920716
Author

I get that the GPU utilization is not the core percentage, so is there some metric that could show that the pod does use the fixed percentage of cores?
