Document setup with CDI annotations #515

Open · wants to merge 1 commit into `main`
5 changes: 4 additions & 1 deletion README.md
@@ -27,6 +27,8 @@ TODO: We are still in the process of migrating GFD to this repo. Once this is re
-->
* [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package)
- [Building and Running Locally](#building-and-running-locally)
- Advanced Topics
* [Using CDI](docs/cdi.md)
- [Changelog](#changelog)
- [Issues and Contributing](#issues-and-contributing)

@@ -699,7 +701,7 @@ These values are as follows:
(default 'false')
deviceListStrategy:
the desired strategy for passing the device list to the underlying runtime
[envvar | volume-mounts] (default "envvar")
[envvar | volume-mounts | cdi-annotations] (default "envvar")
deviceIDStrategy:
the desired strategy for passing device IDs to the underlying runtime
[uuid | index] (default "uuid")
@@ -871,6 +873,7 @@ $ helm upgrade -i nvdp \
https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.14.4.tgz
```
-->

## Building and Running Locally

The next sections are focused on building the device plugin locally and running it.
34 changes: 34 additions & 0 deletions docs/cdi.md
@@ -0,0 +1,34 @@
## Using CDI with the NVIDIA Device Plugin

The NVIDIA GPU device plugin can be configured to use the [Container Device Interface (CDI)](https://tags.cncf.io/container-device-interface)
to specify which devices should be injected into a container once a device has been
allocated.

This may resolve issues where a container loses access to its allocated GPU devices after a
container update. Such failures typically manifest as:
```
NVML: Unknown error
```
in the container.
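
As a concrete illustration (the pod name `gpu-pod` is a hypothetical placeholder), the failure typically surfaces when running `nvidia-smi` inside an affected container:

```shell
# Hypothetical check: `gpu-pod` is a placeholder for an affected GPU pod.
kubectl exec -it gpu-pod -- nvidia-smi
# A healthy pod prints the usual nvidia-smi device table; an affected pod
# instead reports:
#   NVML: Unknown error
```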

### Prerequisites

1. The host container runtime must be CDI-enabled. This includes `containerd`
1.7 and newer, and CRI-O 1.24 and newer.
2. The `nvidia` runtime should _not_ be the default runtime, but it must still be
installed and configured as an available runtime. See the instructions for:
* [containerd](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd)
* [CRI-O](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o)
3. A [Runtime Class](https://kubernetes.io/docs/concepts/containers/runtime-class/) must be created and associated
with the `nvidia` runtime, as shown in the sketch below.
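
A minimal sketch for creating such a RuntimeClass, assuming the container runtime has been configured with a handler named `nvidia` (per the instructions linked above):

```shell
# Create a RuntimeClass that maps to the `nvidia` runtime handler configured
# in containerd or CRI-O. The `nvidia` name is referenced by the Helm options
# described in the Configuration section below.
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```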

### Configuration

Two things need to be considered here: ensuring that the GPU device plugin container has access to the
NVIDIA GPU driver and devices, and ensuring that CDI specifications are generated and that devices are
requested using CDI. Both can be achieved by including the following arguments in the Helm install command
(a combined example follows the list):
* `--set runtimeClassName=nvidia`: ensures that the device plugin is started with the `nvidia` runtime and has access to the NVIDIA GPU driver and devices.
* `--set nvidiaDriverRoot=/` (or `--set nvidiaDriverRoot=/run/nvidia/driver` if the driver container is used): ensures that the driver files are available to generate the correct CDI specifications.
* `--set deviceListStrategy=cdi-annotations`: configures annotations to be used to request CDI devices from the CDI-enabled container engine instead of the `NVIDIA_VISIBLE_DEVICES` environment variable.
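
A minimal sketch of a combined install command; the chart reference (`nvdp/nvidia-device-plugin`), release name (`nvdp`), and namespace are assumptions based on the standard Helm-based install flow and should be adjusted to your environment:

```shell
# Assumed names: the `nvdp` repository alias and release name, and the
# `nvidia-device-plugin` namespace, are placeholders.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set runtimeClassName=nvidia \
  --set nvidiaDriverRoot=/ \
  --set deviceListStrategy=cdi-annotations
```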

Note that other utility pods, such as the DCGM exporter, must also be configured to use the `nvidia` RuntimeClass instead of relying on the `nvidia` runtime being configured as the default.
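
As one possible sketch (the pod name and container image are placeholders), a utility or workload pod selects the runtime class via the standard `runtimeClassName` field in its pod spec:

```shell
# Hypothetical pod that explicitly requests the `nvidia` RuntimeClass.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # placeholder name
spec:
  runtimeClassName: nvidia      # select the nvidia runtime explicitly
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # example image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```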