This directory contains the code for a Kubernetes device plugin for NVIDIA GPUs. The device plugin provides `deepomatic.com/shared-gpu` resources: GPU devices that all map to a single, shared NVIDIA GPU.
The daemonset manifest can be used to deploy this device plugin to a cluster (Kubernetes 1.9 onwards):

```shell
kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/cmd/nvidia_gpu/daemonset.yaml
```
This device plugin requires that the NVIDIA drivers and libraries are installed in a particular way. Examples of how the driver installation needs to be done can be found at:

- For COS:
  - Installer code: https://github.com/GoogleCloudPlatform/cos-gpu-installer
  - Installer daemonset: https://github.com/GoogleCloudPlatform/Deepomatic/blob/master/daemonset.yaml
- For Ubuntu (experimental):
In short, this device plugin expects all the NVIDIA libraries needed by the containers to be present under a single directory on the host. You specify the host directory containing the NVIDIA libraries with `-host-path`, and the location where that directory is mounted inside the containers with `-container-path`. For example, if all NVIDIA libraries on the host are under `/var/lib/nvidia/lib64` and you want to make them available to containers under `/usr/local/nvidia/lib64`, you would use `-host-path=/var/lib/nvidia/lib64` and `-container-path=/usr/local/nvidia/lib64`.
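To illustrate, the two flags would show up in the device plugin's container spec roughly as follows. This is a hypothetical excerpt, not taken from the actual daemonset.yaml; the container name and image are placeholders.

```yaml
# Hypothetical excerpt of the device plugin DaemonSet container spec.
# Only the two flag values come from the example above.
containers:
- name: nvidia-gpu-device-plugin    # name assumed
  image: <device-plugin-image>      # placeholder
  args:
  - -host-path=/var/lib/nvidia/lib64
  - -container-path=/usr/local/nvidia/lib64
```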
We rely on beta GKE features, and not everything works perfectly as of 2018-10-25.

In GKE, the upstream GCP NVIDIA device plugin is automatically installed via a Kubernetes addon, together with a node taint. This creates some friction with this device plugin: the node taint / pod toleration mechanism used with device plugins essentially assumes that only one extended resource type is available per node. In GKE with this device plugin we end up with two extended resources: `nvidia.com/gpu` (installed automatically) and the new `deepomatic.com/shared-gpu`.

In short: pods requesting `deepomatic.com/shared-gpu` must explicitly tolerate the `nvidia.com/gpu` taint in order to be scheduled. Another possible workaround is to also request `0` `nvidia.com/gpu`: the `ExtendedResourceToleration` admission controller will then add all the required tolerations.
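As a sketch of the first workaround, a pod requesting a shared GPU could carry the toleration explicitly. The pod name, container name, and image below are hypothetical, not part of this repository:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-example          # hypothetical name
spec:
  containers:
  - name: cuda-app                  # hypothetical container
    image: nvidia/cuda:10.0-base    # any CUDA-capable image
    resources:
      limits:
        deepomatic.com/shared-gpu: 1
  tolerations:
  # Tolerate the taint that GKE applies automatically to GPU nodes.
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```

Alternatively, adding `nvidia.com/gpu: 0` to the resource limits lets the admission controller inject the tolerations instead.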
- Create a shared-GPUs node pool with:
  - one GPU
  - node label: `deepomatic.com/shared-gpu=true`
  - node taint: `deepomatic.com/shared-gpu=present:NoSchedule`
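Such a node pool could be created along these lines. The pool name, cluster name, and accelerator type below are placeholders; check the flags against your `gcloud` version:

```shell
# Sketch only: pool/cluster names and accelerator type are placeholders.
gcloud container node-pools create shared-gpu-pool \
  --cluster my-cluster \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --node-labels deepomatic.com/shared-gpu=true \
  --node-taints deepomatic.com/shared-gpu=present:NoSchedule
```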
- Install the `deepomatic-shared-gpu-gcp-k8s-device-plugin` DaemonSet:

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/cmd/nvidia_gpu/daemonset.yaml
  ```
- Install the NVIDIA driver (the Docker image is preloaded on GKE; we use it with a special DaemonSet):

  ```shell
  kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
  ```
- Verify that the `deepomatic.com/shared-gpu` resources appear:

  ```shell
  kubectl describe nodes -l deepomatic.com/shared-gpu=true | grep deepomatic.com/shared-gpu
  ```