Document setup with CDI annotations #515

Open · wants to merge 1 commit into `main`
5 changes: 4 additions & 1 deletion README.md
@@ -27,6 +27,8 @@ TODO: We are still in the process of migrating GFD to this repo. Once this is re
-->
* [Deploying via `helm install` with a direct URL to the `helm` package](#deploying-via-helm-install-with-a-direct-url-to-the-helm-package)
- [Building and Running Locally](#building-and-running-locally)
- Advanced Topics
* [Using CDI](docs/cdi.md)
- [Changelog](#changelog)
- [Issues and Contributing](#issues-and-contributing)

@@ -699,7 +701,7 @@ These values are as follows:
(default 'false')
deviceListStrategy:
the desired strategy for passing the device list to the underlying runtime
[envvar | volume-mounts] (default "envvar")
[envvar | volume-mounts | cdi-annotations] (default "envvar")
deviceIDStrategy:
the desired strategy for passing device IDs to the underlying runtime
[uuid | index] (default "uuid")
@@ -871,6 +873,7 @@ $ helm upgrade -i nvdp \
https://nvidia.github.io/k8s-device-plugin/stable/nvidia-device-plugin-0.14.4.tgz
```
-->

## Building and Running Locally

The next sections are focused on building the device plugin locally and running it.
34 changes: 34 additions & 0 deletions docs/cdi.md
@@ -0,0 +1,34 @@
## Using CDI with the NVIDIA Device Plugin

The NVIDIA GPU device plugin can be configured to use the [Container Device Interface (CDI)](https://tags.cncf.io/container-device-interface)
to specify which devices should be injected into a container once a device has been
allocated.

This may resolve issues where a container loses access to its allocated GPU devices after a
container update. Such failures typically manifest as:
```
NVML: Unknown error
```
in the container.
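
As a concrete illustration (the pod name `gpu-pod` is a hypothetical placeholder), the failure typically surfaces when running `nvidia-smi` inside an affected container:

```shell
# Hypothetical check: `gpu-pod` is a placeholder for an affected GPU pod.
kubectl exec -it gpu-pod -- nvidia-smi
# A healthy pod prints the usual nvidia-smi device table; an affected pod
# instead reports:
#   NVML: Unknown error
```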

### Prerequisites

1. The host container runtime must be CDI-enabled. This includes `containerd`
1.7 and newer, and CRI-O 1.24 and newer.
2. The `nvidia` runtime should _not_ be the default runtime, but it must still be
installed and configured as an available runtime. See the instructions for:
* [containerd](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd)
* [CRI-O](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-cri-o)
3. A [Runtime Class](https://kubernetes.io/docs/concepts/containers/runtime-class/) must be created and associated
with the `nvidia` runtime, as shown in the sketch below.
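
A minimal sketch for creating such a RuntimeClass, assuming the container runtime has been configured with a handler named `nvidia` (per the instructions linked above):

```shell
# Create a RuntimeClass that maps to the `nvidia` runtime handler configured
# in containerd or CRI-O. The `nvidia` name is referenced by the Helm options
# described in the Configuration section below.
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF
```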

### Configuration

Two things need to be considered here: ensuring that the GPU device plugin container has access to the
NVIDIA GPU driver and devices, and ensuring that CDI specifications are generated and that devices are
requested using CDI. Both can be achieved by including the following arguments in the Helm install command
(a combined example follows the list):
* `--set runtimeClassName=nvidia`: ensures that the device plugin is started with the `nvidia` runtime and has access to the NVIDIA GPU driver and devices.
* `--set nvidiaDriverRoot=/` (or `--set nvidiaDriverRoot=/run/nvidia/driver` if the driver container is used): ensures that the driver files are available to generate the correct CDI specifications.
* `--set deviceListStrategy=cdi-annotations`: configures annotations to be used to request CDI devices from the CDI-enabled container engine instead of the `NVIDIA_VISIBLE_DEVICES` environment variable.
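
A minimal sketch of a combined install command; the chart reference (`nvdp/nvidia-device-plugin`), release name (`nvdp`), and namespace are assumptions based on the standard Helm-based install flow and should be adjusted to your environment:

```shell
# Assumed names: the `nvdp` repository alias and release name, and the
# `nvidia-device-plugin` namespace, are placeholders.
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set runtimeClassName=nvidia \
  --set nvidiaDriverRoot=/ \
  --set deviceListStrategy=cdi-annotations
```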

Note that other utility pods, such as the DCGM exporter, must also be configured to use the `nvidia` RuntimeClass instead of relying on the `nvidia` runtime being configured as the default.
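
As one possible sketch (the pod name and container image are placeholders), a utility or workload pod selects the runtime class via the standard `runtimeClassName` field in its pod spec:

```shell
# Hypothetical pod that explicitly requests the `nvidia` RuntimeClass.
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                # placeholder name
spec:
  runtimeClassName: nvidia      # select the nvidia runtime explicitly
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.8.0-base-ubuntu22.04   # example image
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```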