Commit

Adds extra paragraph and screenshot for further validating DCGM exporter. Specifically calls out the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric.
thomasvn committed Dec 10, 2024
1 parent 4cabeb8 commit 067e2f1
Showing 2 changed files with 4 additions and 0 deletions.
Binary file added images/gpu-prometheus-query-gr-engine-active.png
4 changes: 4 additions & 0 deletions install-and-configure/advanced-configuration/gpu.md
@@ -349,6 +349,10 @@ Open the Prometheus web interface in your browser by navigating to `http://local

![Prometheus query showing DCGM Exporter metric](/images/gpu-prometheus-query.png)

Additionally, check the `DCGM_FI_PROF_GR_ENGINE_ACTIVE` metric. This is the metric Kubecost uses to determine GPU utilization. GPU efficiency features in the UI are only enabled when there are nonzero values for this metric.

![Prometheus query showing the DCGM_FI_PROF_GR_ENGINE_ACTIVE metric](/images/gpu-prometheus-query-gr-engine-active.png)
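
The same check can be scripted instead of read off the web interface. The following is a minimal sketch, assuming Prometheus is reachable at a base URL you supply and using its standard `/api/v1/query` instant-query endpoint. The helper names `query_prometheus` and `has_nonzero_samples` are hypothetical and are not part of Kubecost or DCGM Exporter.

```python
import json
import math
import urllib.parse
import urllib.request

METRIC = "DCGM_FI_PROF_GR_ENGINE_ACTIVE"


def has_nonzero_samples(response: dict) -> bool:
    """Return True if an instant-query response holds at least one nonzero sample."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    for entry in results:
        # Each instant-vector entry carries "value": [<unix_ts>, "<sample as string>"].
        sample = float(entry["value"][1])
        if not math.isnan(sample) and sample != 0.0:
            return True
    return False


def query_prometheus(base_url: str, query: str = METRIC) -> dict:
    """Run an instant query; base_url is an assumption you must supply."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling `has_nonzero_samples(query_prometheus(base_url))` with the address of your port-forwarded Prometheus returns `False` when every GPU reports zero or when no series exist, which corresponds to the case where the GPU efficiency features stay disabled.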

## Shared GPU Support

Kubecost supports NVIDIA GPU sharing using either the CUDA [time-slicing](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html) or [Multi-Process Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) methods. MIG is currently unsupported but is being evaluated for a future release. When employing either time-slicing or MPS, you must set the `renameByDefault=true` option in the [NVIDIA device plugin's](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#shared-access-to-gpus) configuration stanza. This parameter instructs the device plugin to advertise the resource `nvidia.com/gpu.shared` on nodes where GPU sharing is enabled. Without this option, the device plugin instead advertises `nvidia.com/gpu`, so Kubecost cannot distinguish an "exclusive" GPU access request from a shared GPU access request. As a result, Kubecost's cost information will be inaccurate.
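
To illustrate why the resource name matters, here is a minimal sketch of how a consumer could tell shared and exclusive requests apart from a container's `resources` block. It is an illustration only, not Kubecost's implementation, and the function name `classify_gpu_request` is hypothetical. Only the two resource names come from the text above.

```python
SHARED_GPU_RESOURCE = "nvidia.com/gpu.shared"   # advertised when renameByDefault=true
EXCLUSIVE_GPU_RESOURCE = "nvidia.com/gpu"       # default resource name


def classify_gpu_request(resources: dict) -> str:
    """Classify a container's resources block as 'shared', 'exclusive', or 'none'."""
    requests = resources.get("requests") or {}
    limits = resources.get("limits") or {}
    # GPUs are commonly specified under limits; look at both sections.
    names = set(requests) | set(limits)
    if SHARED_GPU_RESOURCE in names:
        return "shared"
    if EXCLUSIVE_GPU_RESOURCE in names:
        return "exclusive"
    return "none"
```

If the device plugin is not configured to rename the resource, a workload on a shared GPU still requests `nvidia.com/gpu`, so a check like this would classify it as `"exclusive"`. That is the ambiguity the `renameByDefault=true` option removes.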
