
Kubernetes Device Plugin for NVIDIA GPUs

This directory contains the code for a Kubernetes device plugin for NVIDIA GPUs.

This device plugin provides deepomatic.com/shared-gpu resources: virtual GPU devices that all map to a single, shared NVIDIA GPU.

The DaemonSet manifest can be used to deploy this device plugin to a Kubernetes cluster (version 1.9 onwards):

kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/cmd/nvidia_gpu/daemonset.yaml

This device plugin requires that NVIDIA drivers and libraries are installed in a particular way.

Examples of how driver installation needs to be done can be found at:

In short, this device plugin expects all the NVIDIA libraries needed by the containers to be present under a single directory on the host. You specify that host directory with -host-path, and the location where it is mounted in all the containers with -container-path. For example, if all NVIDIA libraries on the host are under /var/lib/nvidia/lib64 and you want to make them available to containers under /usr/local/nvidia/lib64, you would use -host-path=/var/lib/nvidia/lib64 and -container-path=/usr/local/nvidia/lib64.
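
For illustration, a minimal sketch of how these flags could appear in the device plugin container of the DaemonSet; the container name, image and binary path below are placeholders, not taken from this repository, and only the two flags and paths come from the example above:

    containers:
      - name: shared-gpu-device-plugin              # placeholder name
        image: deepomatic/shared-gpu-device-plugin  # placeholder image
        command: ["/usr/bin/nvidia-gpu-device-plugin"]  # placeholder binary path
        args:
          - "-host-path=/var/lib/nvidia/lib64"
          - "-container-path=/usr/local/nvidia/lib64"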

How to use in GKE

Beta

We rely on beta GKE features, and not everything works perfectly as of 2018-10-25.

In GKE, the upstream GCP NVIDIA device plugin is automatically installed on GPU nodes via a Kubernetes addon, and those nodes also receive an nvidia.com/gpu taint. This creates some issues for this device plugin, because the node taint and pod toleration mechanism used with device plugins essentially assumes that only one extended resource type is available per node. In GKE with this device plugin we end up with two extended resources: nvidia.com/gpu (added automatically) and the new deepomatic.com/shared-gpu.

In short: Pods requesting deepomatic.com/shared-gpu must explicitly tolerate the nvidia.com/gpu taint in order to be scheduled. Another possible workaround is to also request 0 nvidia.com/gpu: the ExtendedResourceToleration admission controller will then add all the required tolerations.
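
For example, the first workaround amounts to adding a toleration like the following to the pod spec (a sketch; the taint key and effect match the taint GKE applies to GPU nodes, while using operator: Exists to match any taint value is an assumption):

    tolerations:
      - key: nvidia.com/gpu      # taint automatically applied to GKE GPU nodes
        operator: Exists         # match the taint regardless of its value
        effect: NoSchedule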

Create the cluster

  • Create a Shared GPUs node-pool (a sketch of a matching gcloud command follows this list) with:

    • one GPU
    • node label: deepomatic.com/shared-gpu=true
    • node taint: deepomatic.com/shared-gpu=present:NoSchedule
  • Install the deepomatic-shared-gpu-gcp-k8s-device-plugin DaemonSet:

    kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/cmd/nvidia_gpu/daemonset.yaml
  • Install the NVIDIA driver (the Docker image is preloaded on GKE; a dedicated DaemonSet uses it):

    kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
  • Verify the deepomatic.com/shared-gpu resources appear:

    kubectl describe nodes -l deepomatic.com/shared-gpu=true | grep deepomatic.com/shared-gpu
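
For the node-pool described in the first bullet, here is a sketch of a possible gcloud command; the pool name, cluster name, zone and accelerator type are placeholders, while the node label and taint come from the list above:

    gcloud container node-pools create shared-gpu-pool \
        --cluster=my-cluster \
        --zone=europe-west1-b \
        --accelerator=type=nvidia-tesla-k80,count=1 \
        --node-labels=deepomatic.com/shared-gpu=true \
        --node-taints=deepomatic.com/shared-gpu=present:NoSchedule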
    

Use the deepomatic.com/shared-gpu Extended Resource

See the example or the demo.
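
For reference, a minimal sketch of a Pod consuming the extended resource, using the second workaround above (requesting 0 nvidia.com/gpu so that the ExtendedResourceToleration admission controller adds the required tolerations); the Pod name, container name and image are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: shared-gpu-demo               # placeholder name
    spec:
      containers:
        - name: cuda-app                  # placeholder container
          image: nvidia/cuda:10.0-base    # placeholder image
          command: ["nvidia-smi"]
          resources:
            limits:
              deepomatic.com/shared-gpu: 1
              nvidia.com/gpu: 0           # workaround: lets the ExtendedResourceToleration
                                          # admission controller add the needed tolerations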