Update observability README + fix typos (#556)
* Update observability README + fix typos

* Give image files reasonable names

Scaling them down and converting them to 8-bit would be a good next step,
to make their sizes more reasonable as well.

Signed-off-by: Eero Tamminen <eero.t.tamminen@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
eero-t and pre-commit-ci[bot] authored Nov 15, 2024
1 parent 8c4a698 commit 1d77b81
Showing 5 changed files with 38 additions and 28 deletions.
66 changes: 38 additions & 28 deletions kubernetes-addons/Observability/README.md
@@ -40,7 +40,7 @@ kubectl port-forward service/grafana 3000:80

Open your browser and navigate to http://localhost:3000. Log in with username "admin" and password "prom-operator".

## 2. Metric for Gaudi Hardware(v1.16.2)
## 2. Metrics for Gaudi Hardware (v1.16.2)

To monitor Gaudi hardware metrics, you can use the following steps:

@@ -64,8 +64,6 @@ kubectl apply -f ./habana/metric-exporter-serviceMonitor.yaml

### Step 4: Verify the metrics

The metric endpoints for habana will be a headless service, so we need to get endpoint to verify

```
# To get the metric endpoints, e.g. to get first endpoint to test
habana_metric_url=`kubectl -n monitoring get ep metric-exporter -o jsonpath="{.subsets[].addresses[0].ip}:{..subsets[].ports[0].port}"`
```
@@ -95,58 +93,70 @@ promhttp_metric_handler_requests_total{code="503"} 0

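To make the verification step scriptable, the scraped output can be checked for actual samples. The following is only a sketch and not part of the repository: `count_metrics` is a hypothetical helper that counts the sample lines in Prometheus text-format output, skipping `#` comment lines and blank lines. The `/metrics` path in the usage comment is the conventional exporter path and is an assumption here.

```shell
# Hypothetical helper (not part of this repository): count metric sample
# lines in Prometheus text exposition format read from stdin.
# Lines starting with '#' (HELP/TYPE comments) and blank lines are skipped.
count_metrics() {
  grep -v -e '^#' -e '^[[:space:]]*$' | wc -l | tr -d '[:space:]'
}

# Usage sketch, assuming $habana_metric_url holds an "ip:port" endpoint
# and that the exporter serves the conventional /metrics path:
#   curl -s "http://${habana_metric_url}/metrics" | count_metrics
```

A result of zero would suggest that the endpoint responded without exporting any samples.
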
### Step 5: Import the dashboard into Grafana

Manually import ./habana/Dashboard-Gaudi-HW.json into Grafana
![alt text](image-1.png)
Manually import the [`Dashboard-Gaudi-HW.json`](./habana/Dashboard-Gaudi-HW.json) file into Grafana
![Gaudi HW dashboard](./assets/habana.png)
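
As an alternative to importing through the web UI, a dashboard file can be pushed through Grafana's HTTP API (`POST /api/dashboards/db`). The sketch below is an illustration under assumptions, not the repository's method: `dashboard_payload` is a hypothetical helper, and the URL and credentials are the ones from the port-forward setup in section 1. Dashboards that carry a numeric `id` or data source input placeholders may need adjusting before this call accepts them.

```shell
# Hypothetical helper: wrap a dashboard JSON file in the request body
# expected by Grafana's create/update dashboard API.
dashboard_payload() {
  printf '{"dashboard": %s, "overwrite": true}\n' "$(cat "$1")"
}

# Usage sketch, assuming Grafana is reachable on localhost:3000 with the
# admin/prom-operator credentials:
#   dashboard_payload ./habana/Dashboard-Gaudi-HW.json |
#     curl -s -u admin:prom-operator -H 'Content-Type: application/json' \
#          -X POST --data-binary @- http://localhost:3000/api/dashboards/db
```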

## 3. Metric for OPEA/chatqna
## 3. Metrics for OPEA applications

To monitor ChatQnA metrics including TGI-gaudi,TEI,TEI-Reranking and other micro services, you can use the following steps:
To monitor OPEA application metrics, including TGI-gaudi, TEI, TEI-Reranking and other microservices, use the following steps:

### Step 1: Install ChatQnA by Helm
### Step 1: Install application with Helm

Install Helm (version >= 3.15) first. Refer to the [Helm Installation Guide](https://helm.sh/docs/intro/install/) for more information.

Refer to the [ChatQnA helm chart](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) for instructions on deploying ChatQnA into Kubernetes on Xeon & Gaudi.
Install the OPEA application as described in the [Helm charts README](../../helm-charts/README.md).

### Step 2: Install all the serviceMonitor
For example, to install ChatQnA, follow the [ChatQnA helm chart](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/chatqna/README.md) instructions for deploying it to Kubernetes.

> NOTE:
> If the chatQnA installed into another instance instead of chatqna(Default instance name),you should modify the
> matchLabels app.kubernetes.io/instance:${instanceName} with proper instanceName
Make sure to enable the [Helm monitoring option](../../helm-charts/monitoring.md).

```
kubectl apply -f chatqna/
```
### Step 2: Install dashboards

Here are a few Grafana dashboards for monitoring different aspects of OPEA applications:

- [`queue_size_embedding_rerank_tgi.json`](./chatqna/dashboard/queue_size_embedding_rerank_tgi.json): queue size of TGI-gaudi, TEI-Embedding, TEI-reranking
- [`tgi_grafana.json`](./chatqna/dashboard/tgi_grafana.json): `tgi-gaudi` text generation inferencing service utilization
- [`opea-scaling.json`](./opea-apps/opea-scaling.json): scaling, request rates and failures for OPEA application megaservice, TEI-reranking, TEI-embedding, and TGI

### Step 3: Install the dashboard
You can either:

- manually import tgi_grafana.json into the Grafana to monitor the tgi-gaudi utilization
- manually import queue_size_embedding_rerank_tgi.json into the Grafana to monitor the queue size of TGI-gaudi,TEI-Embedding,TEI-reranking
- OR you could create dashboard to monitor all the services in ChatQnA by yourself
- Import them manually into Grafana
- Use the [`update-dashboards.sh`](./update-dashboards.sh) script to add them to Kubernetes as Grafana dashboard configMaps (the script assumes Prometheus / Grafana were installed according to the instructions above)
- Or create your own dashboards based on them
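
To illustrate the configMap option, here is a rough sketch of how a dashboard file could be turned into a labeled configMap. It is not a description of what `update-dashboards.sh` does. Both helper names are hypothetical, and the `monitoring` namespace and the `grafana_dashboard` label (the kube-prometheus-stack sidecar default) are assumptions that may not match your installation.

```shell
# Hypothetical helper: derive a valid configMap name from a dashboard file.
# Drops the directory and ".json" suffix, lowercases the name, and replaces
# "_" with "-" because Kubernetes object names may not contain underscores.
dashboard_cm_name() {
  basename "$1" .json | tr '[:upper:]' '[:lower:]' | tr '_' '-'
}

# Hypothetical helper: create or update the configMap, then label it so
# that a Grafana dashboard sidecar watching that label can load it.
# Assumes the "monitoring" namespace and the "grafana_dashboard" label.
apply_dashboard_cm() {
  name=$(dashboard_cm_name "$1")
  kubectl -n monitoring create configmap "$name" --from-file="$1" \
    --dry-run=client -o yaml | kubectl apply -f -
  kubectl -n monitoring label configmap "$name" grafana_dashboard=1 --overwrite
}

# Usage sketch:
#   apply_dashboard_cm ./opea-apps/opea-scaling.json
```

Piping `--dry-run=client -o yaml` into `kubectl apply` keeps the call repeatable: re-running it updates an existing configMap instead of failing.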

![alt text](image-2.png)
Note: when a dashboard is imported into Grafana, you can save changes to it directly, but such dashboards are lost if Grafana is removed or re-installed.

## 4. Metric for PCM(Intel® Performance Counter Monitor)
With dashboard configMaps, on the other hand, Grafana saves changes to a file you select, and you need to re-apply that file to Kubernetes / Grafana for your changes to still be there when the dashboard is reloaded.

![TGI dashboard](./assets/tgi.png)
![Scaling dashboard](./assets/opea-scaling.png)

## 4. Metrics for PCM (Intel® Performance Counter Monitor)

### Step 1: Install PCM

Please refer this repo to install [Intel® PCM](https://github.com/intel/pcm)
Please refer to this repo to install [Intel® PCM](https://github.com/intel/pcm)

### Step 2: Modify & Install pcm-service

modify the pcm/pcm-service.yaml to set the addresses
Modify the `pcm/pcm-service.yaml` file to set the addresses

```
kubectl apply -f pcm/pcm-service.yaml
```

### Step 3: Install pcm serviceMonitor
### Step 3: Install PCM serviceMonitor

```
kubectl apply -f pcm/pcm-serviceMonitor.yaml
```

### Step 4: Install the pcm dashboard
### Step 4: Install the PCM dashboard

Manually import the [`pcm-dashboard.json`](./pcm/pcm-dashboard.json) file into Grafana
![PCM dashboard](./assets/pcm.png)

## More dashboards

manually import the pcm/pcm-dashboard.json into the Grafana
![alt text](image.png)
The GenAIEval repository includes additional [dashboards](https://github.com/opea-project/GenAIEval/tree/main/evals/benchmark/grafana).
File renamed without changes
