Add node metrics Grafana dashboard (#190)
* 1) Add node metrics dashboard. 2) Add PCIe throughput in Gaudi dashboard.

* Use dynamic uid in gaudi grafana dashboard

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
joshuayao and pre-commit-ci[bot] authored Nov 6, 2024
1 parent aef0cf5 commit a19f42e
Showing 4 changed files with 24,038 additions and 3 deletions.
20 changes: 17 additions & 3 deletions evals/benchmark/grafana/README.md
@@ -36,9 +36,9 @@ Next, run Prometheus server `nohup ./prometheus --config.file=./prometheus.yml &

You can now open the Prometheus UI at `localhost:9090/targets?search=`.

### 1.1 CPU Metrics (optional)
### 1.1 Node Metrics (optional)

The Prometheus Node Exporter is required for collecting CPU metrics. Deploy the Node Exporter via tarball by the [guide](https://prometheus.io/docs/guides/node-exporter/#installing-and-running-the-node-exporter).
The Prometheus Node Exporter is required for collecting CPU, memory, network, and storage metrics. Deploy the Node Exporter from the tarball by following the [guide](https://prometheus.io/docs/guides/node-exporter/#installing-and-running-the-node-exporter).

Alternatively, install it in a K8S cluster with the following commands:

@@ -47,7 +47,7 @@ Ensure namespace `monitoring` was created in your K8S environment.
```bash
git clone https://github.com/opea-project/GenAIEval.git
cd GenAIEval/evals/benchmark/grafana/
kubectl apply -f prometheus_cpu_exporter.yaml
kubectl apply -f prometheus_node_exporter.yaml
```

Add the following configuration to `prometheus.yml`:
@@ -60,6 +60,13 @@ scrape_configs:
- targets: ["<NODE1_IP>:9100", "<NODE2_IP>:9100", ...]
```

The following Grafana dashboards rely on Prometheus Node Exporter:
- cpu_grafana.json
- node_grafana.json

Tested on the Prometheus Node Exporter `0.16.0`.


### 1.2 Intel® Gaudi® Metrics (optional)

The Intel Gaudi Prometheus Metrics Exporter is required for collecting Intel® Gaudi® AI accelerator metrics.
@@ -87,6 +94,12 @@ scrape_configs:
- targets: ["<NODE1_IP>:41611", "<NODE2_IP>:41611", ...]
```

The following Grafana dashboard relies on the Intel Gaudi Prometheus Metrics Exporter:
- gaudi_grafana.json

Tested on the Intel Gaudi Prometheus Metrics Exporter `1.17.0`.


Restart Prometheus after saving the changes.

## 2. Setup Grafana
@@ -129,3 +142,4 @@ In this folder, we also provides some Grafana dashboard JSON files for your refe
- `redis_grafana.json`: A sample Grafana dashboard JSON file for visualizing the Redis metrics. To import the Redis metrics, add a new connection and the Redis data source in Grafana. Please refer to this [link](https://grafana.com/grafana/plugins/redis-datasource/?tab=installation) for more details.
- `gaudi_grafana.json`: A sample Grafana dashboard JSON file for visualizing the Intel® Gaudi® AI accelerator metrics in a container cluster for compute workload.
- `cpu_grafana.json`: A sample Grafana dashboard JSON file for visualizing the CPU metrics.
- `node_grafana.json`: A sample Grafana dashboard JSON file for visualizing the node metrics.
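
After Prometheus is restarted with the scrape jobs from this README, the Node Exporter (port 9100) and Gaudi exporter (port 41611) targets can also be checked through the Prometheus HTTP API rather than the `/targets` page. The following is a minimal sketch, assuming Prometheus is reachable at `localhost:9090` as in the README. The helper names `unhealthy_targets` and `fetch_targets` are illustrative and are not part of the repository.

```python
import json
from urllib.request import urlopen

# Assumption: Prometheus listens on the address used in the README.
PROMETHEUS_URL = "http://localhost:9090"


def unhealthy_targets(payload):
    """Return (scrapeUrl, lastError) for every active target whose health is not "up"."""
    active = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in active
        if target.get("health") != "up"
    ]


def fetch_targets(base_url=PROMETHEUS_URL):
    """Fetch the target list from the Prometheus /api/v1/targets endpoint."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=10) as resp:
        return json.load(resp)
```

Against a running server, `unhealthy_targets(fetch_targets())` should return an empty list when every configured exporter is being scraped successfully.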
122 changes: 122 additions & 0 deletions evals/benchmark/grafana/gaudi_grafana.json
@@ -315,6 +315,128 @@
],
"type": "gauge"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"description": "PCIe Throughput.",
"fieldConfig": {
"defaults": {
"color": {
"mode": "palette-classic"
},
"custom": {
"axisBorderShow": false,
"axisCenteredZero": false,
"axisColorMode": "text",
"axisLabel": "",
"axisPlacement": "auto",
"barAlignment": 0,
"drawStyle": "line",
"fillOpacity": 10,
"gradientMode": "none",
"hideFrom": {
"legend": false,
"tooltip": false,
"viz": false
},
"insertNulls": false,
"lineInterpolation": "linear",
"lineWidth": 1,
"pointSize": 5,
"scaleDistribution": {
"type": "linear"
},
"showPoints": "never",
"spanNulls": false,
"stacking": {
"group": "A",
"mode": "none"
},
"thresholdsStyle": {
"mode": "line+area"
}
},
"mappings": [],
"min": 0,
"thresholds": {
"mode": "absolute",
"steps": [
{
"color": "transparent",
"value": null
}
]
},
"unit": "binBps"
},
"overrides": []
},
"gridPos": {
"h": 7,
"w": 8,
"x": 14,
"y": 12
},
"id": 41,
"options": {
"legend": {
"calcs": [],
"displayMode": "list",
"placement": "bottom",
"showLegend": false
},
"tooltip": {
"maxHeight": 600,
"mode": "multi",
"sort": "none"
}
},
"pluginVersion": "11.1.0",
"targets": [
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "habanalabs_pcie_receive_throughput{UUID=\"$hpu\", instance=\"$node\"}",
"format": "time_series",
"instant": false,
"interval": "",
"legendFormat": "{{uuid}}",
"range": true,
"refId": "A"
},
{
"datasource": {
"type": "prometheus",
"uid": "${DS_PROMETHEUS}"
},
"editorMode": "code",
"exemplar": false,
"expr": "habanalabs_pcie_transmit_throughput{UUID=\"$hpu\", instance=\"$node\"}",
"format": "time_series",
"hide": false,
"instant": false,
"legendFormat": "__auto",
"range": true,
"refId": "B"
}
],
"title": "PCIe Throughput",
"transformations": [
{
"id": "labelsToFields",
"options": {
"valueLabel": "pod_name"
}
}
],
"type": "timeseries"
},
{
"datasource": {
"type": "prometheus",
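
The two targets of the new "PCIe Throughput" panel can also be evaluated outside Grafana, for example to confirm that the exporter publishes the PCIe series before the dashboard is imported. The sketch below rebuilds the panel's expressions with the `$hpu` and `$node` dashboard variables replaced by values you supply, and forms a URL for the Prometheus instant-query endpoint. It assumes Prometheus at `localhost:9090`; the function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: Prometheus listens on the address used in the README.
PROMETHEUS_URL = "http://localhost:9090"


def pcie_throughput_exprs(hpu_uuid, node):
    """Build the panel's receive/transmit expressions for one HPU on one node."""
    # Same label matchers as the panel: UUID="$hpu", instance="$node".
    selector = f'{{UUID="{hpu_uuid}", instance="{node}"}}'
    return {
        "receive": f"habanalabs_pcie_receive_throughput{selector}",
        "transmit": f"habanalabs_pcie_transmit_throughput{selector}",
    }


def instant_query_url(expr, base_url=PROMETHEUS_URL):
    """Return the /api/v1/query URL that evaluates expr at the current time."""
    query_string = urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{query_string}"
```

The UUID and instance values passed in must match the label values reported by your exporter. The panel's `binBps` unit means Grafana renders the returned values as bytes per second.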