Adds extra paragraph and screenshot to further validate DCGM exporter #1159

Merged
merged 2 commits into from
Dec 11, 2024
4 changes: 4 additions & 0 deletions install-and-configure/advanced-configuration/gpu.md
@@ -349,6 +349,10 @@ Open the Prometheus web interface in your browser by navigating to `http://local

![Prometheus query showing DCGM Exporter metric](/images/gpu-prometheus-query.png)

Additionally, check the `DCGM_FI_PROF_GR_ENGINE_ACTIVE` metric. This is the metric Kubecost currently uses to determine GPU utilization. GPU efficiency features in the UI are only enabled when there are non-zero values for this metric.

![Prometheus query showing DCGM Exporter metric](/images/gpu-prometheus-query-gr-engine-active.png)
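
The same check can be scripted against the Prometheus HTTP API rather than done by eye. The following is a minimal sketch, not part of Kubecost: the helper names `query_prometheus` and `has_positive_samples` are hypothetical, and `base_url` is assumed to be whatever address you used to open the Prometheus web interface above. It relies only on the standard instant-query endpoint (`/api/v1/query`), whose vector results carry each sample value as a string.

```python
import json
import urllib.parse
import urllib.request

# Metric named in the documentation above.
METRIC = "DCGM_FI_PROF_GR_ENGINE_ACTIVE"


def has_positive_samples(response: dict) -> bool:
    """Return True if any sample in an instant-query response is greater than zero."""
    if response.get("status") != "success":
        return False
    results = response.get("data", {}).get("result", [])
    # Each vector sample looks like:
    #   {"metric": {...labels...}, "value": [<timestamp>, "<number as string>"]}
    # A NaN sample compares as False here, so it is not counted as activity.
    return any(float(sample["value"][1]) > 0 for sample in results)


def query_prometheus(base_url: str, query: str = METRIC) -> dict:
    """Run an instant query. base_url is an assumption: your Prometheus address."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/api/v1/query?{params}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

Calling `has_positive_samples(query_prometheus(base_url))` while a GPU workload is running should return `True` if the exporter is reporting activity. If it returns `False` on an idle cluster, that alone does not indicate a misconfiguration, because the metric is only non-zero when the GPU is actually in use.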

## Shared GPU Support

Kubecost supports NVIDIA GPU sharing using either the CUDA [time-slicing](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/gpu-sharing.html) or [Multi-Process Service (MPS)](https://docs.nvidia.com/deploy/mps/index.html) methods. MIG is currently unsupported but is being evaluated for a future release. When employing either time-slicing or MPS, you must set the `renameByDefault=true` option in the [NVIDIA device plugin's](https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#shared-access-to-gpus) configuration stanza. This parameter instructs the device plugin to advertise the resource `nvidia.com/gpu.shared` on nodes where GPU sharing is enabled. Without it, the device plugin advertises `nvidia.com/gpu` instead, so Kubecost cannot distinguish an "exclusive" GPU access request from a shared one, and its cost information will be inaccurate.
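
To illustrate why the distinct resource name matters, here is a small sketch of how any consumer of pod specs could tell the two kinds of request apart once the rename is in effect. This is an illustration only, not Kubecost's implementation; the function name `classify_gpu_request` and the plain-dict input shape are assumptions made for the example.

```python
# Resource names taken from the paragraph above.
EXCLUSIVE_RESOURCE = "nvidia.com/gpu"
SHARED_RESOURCE = "nvidia.com/gpu.shared"


def classify_gpu_request(resource_limits: dict) -> str:
    """Classify a container's resource limits as 'shared', 'exclusive', or 'none'.

    resource_limits is assumed to be the mapping found under
    resources.limits in a container spec, e.g. {"nvidia.com/gpu.shared": "1"}.
    """
    if SHARED_RESOURCE in resource_limits:
        return "shared"
    if EXCLUSIVE_RESOURCE in resource_limits:
        return "exclusive"
    return "none"
```

With the rename enabled, `{"nvidia.com/gpu.shared": "1"}` is classified as `shared`. Without it, a shared request would arrive as `{"nvidia.com/gpu": "1"}` and be indistinguishable from an exclusive one, which is the ambiguity described above.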