
Kubernetes Device Plugin for NVIDIA GPUs

This directory contains the code for a Kubernetes device plugin for NVIDIA GPUs.

This device plugin provides deepomatic.com/shared-gpu resources: virtual GPU devices that all map to a single, shared NVIDIA GPU.

The DaemonSet manifest can be used to deploy this device plugin to a Kubernetes cluster (version 1.9 onwards):

kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/cmd/nvidia_gpu/daemonset.yaml

This device plugin requires that NVIDIA drivers and libraries are installed in a particular way.

Examples of how driver installation needs to be done can be found at:

In short, this device plugin expects all the NVIDIA libraries needed by the containers to be present under a single directory on the host. You specify that host directory with -host-path, and the location where it is mounted in all the containers with -container-path. For example, if all NVIDIA libraries on the host are under /var/lib/nvidia/lib64 and you want to make them available to containers under /usr/local/nvidia/lib64, you would use -host-path=/var/lib/nvidia/lib64 and -container-path=/usr/local/nvidia/lib64.
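
For illustration, a minimal sketch of how these flags could appear in the device plugin container of the DaemonSet; the container name, image and binary path below are placeholders, not taken from this repository, and only the two flags and paths come from the example above:

    containers:
      - name: shared-gpu-device-plugin              # placeholder name
        image: deepomatic/shared-gpu-device-plugin  # placeholder image
        command: ["/usr/bin/nvidia-gpu-device-plugin"]  # placeholder binary path
        args:
          - "-host-path=/var/lib/nvidia/lib64"
          - "-container-path=/usr/local/nvidia/lib64"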

How to use in GKE

Beta

We rely on beta GKE features, and not everything works perfectly as of 2018-10-25.

In GKE, the upstream GCP NVIDIA device plugin is automatically installed on GPU nodes via a Kubernetes addon, and those nodes also receive an nvidia.com/gpu taint. This creates some issues for this device plugin, because the node taint and pod toleration mechanism used with device plugins essentially assumes that only one extended resource type is available per node. In GKE with this device plugin we end up with two extended resources: nvidia.com/gpu (added automatically) and the new deepomatic.com/shared-gpu.

In short: Pods requesting deepomatic.com/shared-gpu must explicitly tolerate the nvidia.com/gpu taint in order to be scheduled. Another possible workaround is to also request 0 nvidia.com/gpu: the ExtendedResourceToleration admission controller will then add all the required tolerations.
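
For example, the first workaround amounts to adding a toleration like the following to the pod spec (a sketch; the taint key and effect match the taint GKE applies to GPU nodes, while using operator: Exists to match any taint value is an assumption):

    tolerations:
      - key: nvidia.com/gpu      # taint automatically applied to GKE GPU nodes
        operator: Exists         # match the taint regardless of its value
        effect: NoSchedule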

Create the cluster

  • Create a Shared GPUs node-pool (a sketch of a matching gcloud command follows this list) with:

    • one GPU
    • node label: deepomatic.com/shared-gpu=true
    • node taint: deepomatic.com/shared-gpu=present:NoSchedule
  • Install the deepomatic-shared-gpu-gcp-k8s-device-plugin DaemonSet:

    kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/cmd/nvidia_gpu/daemonset.yaml
  • Install the NVIDIA driver (the Docker image is preloaded on GKE; a dedicated DaemonSet uses it):

    kubectl apply -f https://raw.githubusercontent.com/Deepomatic/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
  • Verify the deepomatic.com/shared-gpu resources appear:

    kubectl describe nodes -l deepomatic.com/shared-gpu=true | grep deepomatic.com/shared-gpu
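
For the node-pool described in the first bullet, here is a sketch of a possible gcloud command; the pool name, cluster name, zone and accelerator type are placeholders, while the node label and taint come from the list above:

    gcloud container node-pools create shared-gpu-pool \
        --cluster=my-cluster \
        --zone=europe-west1-b \
        --accelerator=type=nvidia-tesla-k80,count=1 \
        --node-labels=deepomatic.com/shared-gpu=true \
        --node-taints=deepomatic.com/shared-gpu=present:NoSchedule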
    

Use the deepomatic.com/shared-gpu Extended Resource

See the example or the demo.
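
For reference, a minimal sketch of a Pod consuming the extended resource, using the second workaround above (requesting 0 nvidia.com/gpu so that the ExtendedResourceToleration admission controller adds the required tolerations); the Pod name, container name and image are placeholders:

    apiVersion: v1
    kind: Pod
    metadata:
      name: shared-gpu-demo               # placeholder name
    spec:
      containers:
        - name: cuda-app                  # placeholder container
          image: nvidia/cuda:10.0-base    # placeholder image
          command: ["nvidia-smi"]
          resources:
            limits:
              deepomatic.com/shared-gpu: 1
              nvidia.com/gpu: 0           # workaround: lets the ExtendedResourceToleration
                                          # admission controller add the needed tolerations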