vgpu-monitor-metrics does not show in grafana #410

Open
ltm920716 opened this issue Aug 1, 2024 · 12 comments

@ltm920716

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

  • the vGPU monitor metrics do not show in Grafana; dcgm-exporter is OK

  • Prometheus scrapes the metrics like below:
    image

  • I created a vGPU pod as below:
    image

  • the Grafana dashboards are below:
    image
    image

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

There are some things that should be updated in the README:
1. the exporter name for the Prometheus configuration in the current HAMi is hami-device-plugin-monitor, not vgpu-device-plugin-monitor (see the scrape config sketch after this list)
2. a ServiceAccount section should be added for the monitor
3. the grafana JSON should be updated to allow selecting the Prometheus data source
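
For item 1, a minimal Prometheus scrape configuration using the corrected name might look like the sketch below; the job name, namespace, and relabeling are illustrative assumptions, not the exact content the README should carry:

scrape_configs:
  - job_name: hami-device-plugin-monitor     # hypothetical job name
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system
    relabel_configs:
      # keep only endpoints that belong to the hami-device-plugin-monitor Service
      - source_labels: [__meta_kubernetes_service_name]
        regex: hami-device-plugin-monitor
        action: keep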

Common error checking:

  • The output of nvidia-smi -a on your host
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
  • The vgpu-device-plugin container logs
  • The vgpu-scheduler container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Docker version from docker version
  • Docker command, image and tag used
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
@chaunceyjiang
Contributor

You need to create a Prometheus ServiceMonitor.

@ltm920716
Author

Hi @chaunceyjiang,
I am sorry, maybe I am not getting your point.

I have now started the Prometheus server and got both the dcgm-exporter metrics and the vGPU exporter metrics, following this Grafana guide: https://github.com/Project-HAMi/HAMi/blob/master/docs/dashboard.md

I think all I should do is import the given grafana JSON and select the existing Prometheus data source, is that right?

@chaunceyjiang
Contributor

I think all I should do is import the given grafana JSON and select the existing Prometheus data source, is that right?

You need to create a Prometheus ServiceMonitor.

the vGPU monitor metrics do not show in Grafana; dcgm-exporter is OK

Because the dcgm-exporter includes a ServiceMonitor.
https://github.com/NVIDIA/dcgm-exporter/blob/main/deployment/templates/service-monitor.yaml

@ltm920716
Author

Hello,
thanks! I will have a try later.

That is to say, even though I can get the vGPU metrics at http://{scheduler ip}:{monitorPort}/metrics, like:

# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 100
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 3.145728e+09
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 1.073741824e+10
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 6.442450944e+09
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 2.415919104e+10
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 2.5757220864e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 1
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 1
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 1
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 1
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-00a68a2a-2396-8081-5f48-df0e5cde5212",nodeid="k8s-node2",zone="vGPU"} 0.13020833333333334
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-67d337dd-8d61-225b-9202-d12e8d593d9f",nodeid="k8s-node2",zone="vGPU"} 0
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-7b167507-2819-1a5d-a53f-645e5f460f63",nodeid="k8s-node2",zone="vGPU"} 0.4444444444444444
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76",nodeid="k8s-node1",zone="vGPU"} 1
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-ae8f8c32-ea81-3d4d-1579-17d21a4ceb60",nodeid="k8s-node2",zone="vGPU"} 0
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-c77bca75-2ff0-544c-820b-14eec9e90350",nodeid="k8s-node2",zone="vGPU"} 0
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-e047ddd9-a446-c1aa-7d3d-963e4b99d4dc",nodeid="k8s-node1",zone="vGPU"} 1
# HELP nodeGPUOverview GPU overview on a certain node
.....

and scrape them into the Prometheus server, like:
image

I still must deploy a new ServiceMonitor for Grafana.

thanks again!

@ltm920716
Author

Hi @chaunceyjiang,
I deployed the ServiceMonitor as follows:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack
kubectl apply -f servicemonitor.yaml -n kube-system

here is servicemonitor.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: hami-device-plugin-monitor
  namespace: kube-system
  labels:
    app.kubernetes.io/component: hami-device-plugin
spec:
  selector:
    matchLabels:
      app.kubernetes.io/component: hami-device-plugin
  namespaceSelector:
    matchNames:
    - kube-system
  endpoints:
  - port: monitorport
    path: /metrics
    interval: 30s
    scrapeTimeout: 10s

verify:

$ kubectl get servicemonitor -n kube-system
NAME                         AGE
hami-device-plugin-monitor   10m

Now Grafana shows only one vGPU panel; the others still have no data.
image

image

@chaunceyjiang
Contributor

vGPU metrics at http://{scheduler ip}:{monitorPort}/metrics like:

If the GPU算力使用率 (GPU compute utilization) panel has no value, you can check whether the above URL is returning the Device_utilization_desc_of_container metric. If it is, you should then check whether the PromQL in the JSON file is written correctly.

@ltm920716
Author

sure, there is!
image

It shows that the utilization is 0, but it is actually 22%
image

@chaunceyjiang
Copy link
Contributor

What is the value set for your 'nvidia.com/gpucores'?

@ltm920716
Author

What is the value set for your 'nvidia.com/gpucores'?

I do not set 'nvidia.com/gpucores', I only set memory.

@chaunceyjiang
Contributor

Could you try setting a value for nvidia.com/gpucores, for example 50?
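
For illustration, a minimal pod spec with an explicit core share might look like the sketch below; the pod name, image, and values are placeholders rather than settings from this issue, and only the resource keys match the ones used later in this thread:

apiVersion: v1
kind: Pod
metadata:
  name: gpucores-test                # hypothetical pod name
spec:
  containers:
    - name: cuda-test
      image: your-cuda-image:latest  # placeholder image
      command: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1          # one vGPU
          nvidia.com/gpumem: 3000    # device memory in MB
          nvidia.com/gpucores: 50    # percentage of GPU cores, as suggested above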

@ltm920716
Author

ltm920716 commented Aug 21, 2024

Hi @chaunceyjiang,
I tested it, and the result does not match the request.
I applied a test pod as below:

$ kc describe pod instance-test-5b79c65497-cn4x2 -n t-maas
Name:             instance-test-5b79c65497-cn4x2
Namespace:        t-maas
Priority:         0
Service Account:  default
Node:             k8s-node1/192.168.10.230
Start Time:       Wed, 21 Aug 2024 22:08:51 +0800
Labels:           app=instance-test
                  pod-template-hash=5b79c65497
Annotations:      hami.io/bind-phase: success
                  hami.io/bind-time: 1724249331
                  hami.io/vgpu-devices-allocated: GPU-903ecef4-bb8d-d7fe-65e2-e327e6258e76,NVIDIA,500,5:;
                  hami.io/vgpu-devices-to-allocate: ;
                  hami.io/vgpu-node: k8s-node1
                  hami.io/vgpu-time: 1724249331
                  k8s.v1.cni.cncf.io/network-status:
                    [{
                        "name": "kube-ovn",
                        "ips": [
                            "10.244.3.186"
                        ],
                        "default": true,
                        "dns": {},
                        "gateway": [
                            "10.244.0.1"
                        ]
                    }]
                  nvidia.com/use-gputype: 4090
                  ovn.kubernetes.io/allocated: true
                  ovn.kubernetes.io/cidr: 10.244.0.0/16
                  ovn.kubernetes.io/gateway: 10.244.0.1
                  ovn.kubernetes.io/ip_address: 10.244.3.186
                  ovn.kubernetes.io/logical_router: ovn-cluster
                  ovn.kubernetes.io/logical_switch: ovn-default
                  ovn.kubernetes.io/mac_address: 00:00:00:10:0F:D2
                  ovn.kubernetes.io/pod_nic_type: veth-pair
                  ovn.kubernetes.io/routed: true
Status:           Running
IP:               10.244.3.186
IPs:
  IP:           10.244.3.186
Controlled By:  ReplicaSet/instance-test-5b79c65497
Containers:
  instance-test:
    Container ID:   containerd://717fec0d124e3deb2ea96c2cd10a01726bc4f1df9d325d8a59d3977e23cb58f3
    Image:          torch-cuda:2.0
    Image ID:       sha256:7f4965979f9de78b468b2444b58a83658b9d75929939e928f4393eec404a3181
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 21 Aug 2024 22:08:54 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      nvidia.com/gpu:       1
      nvidia.com/gpucores:  5
      nvidia.com/gpumem:    500
    Requests:
      nvidia.com/gpu:       1
      nvidia.com/gpucores:  5
      nvidia.com/gpumem:    500
    Environment:
      PARAM_A:  100
      PARAM_B:  20
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-bjvdl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  kube-api-access-bjvdl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From            Message
  ----    ------          ----  ----            -------
  Normal  Scheduled       12m   hami-scheduler  Successfully assigned t-maas/instance-test-5b79c65497-cn4x2 to k8s-node1
  Normal  AddedInterface  12m   multus          Add eth0 [10.244.3.186/16] from kube-ovn
  Normal  Pulled          12m   kubelet         Container image "torch-cuda:2.0" already present on machine
  Normal  Created         12m   kubelet         Created container instance-test
  Normal  Started         12m   kubelet         Started container instance-test

This pod requests 5 gpucores, but the monitor shows it uses more than 5:
image

Inside the pod it shows even more than the vGPU monitor does:
image

so is this a bug or something else?

thanks

@ltm920716
Author

I get that the GPU utilization is not the core percentage, so is there some metric that could show that the pod does use the fixed percentage of cores?
