Add support for GPU nodes #426

CecileRobertMichon · 2020-03-05T21:12:40Z

/kind feature

Describe the solution you'd like
[A clear and concise description of what you want to happen.]
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

cluster-api-provider-azure version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):


CecileRobertMichon · 2020-05-06T23:18:14Z

We should look into how/if https://github.com/NVIDIA/gpu-operator can be leveraged for this. See https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html.

sozercan · 2020-05-07T04:25:05Z

/assign

sozercan · 2020-05-11T20:47:12Z

It looks like the NVIDIA gpu-operator doesn't support containerd yet (containerd itself supports GPUs, but it needs runtime configuration changes to use them with the device plugin).
NVIDIA/gpu-operator#7

If we don't want to wait for containerd support, we can handle this the way aks-engine does: install the GPU driver on N-series nodes, then deploy the device plugin separately. I'm not sure whether capz has a similar way to run a CSE on specific nodes, though.
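One way to restrict the driver install to GPU nodes would be for the install script itself to check for an NVIDIA PCI device before doing anything. A minimal sketch, assuming `lspci` (from pciutils) is available on the node image; the function name `has_nvidia_gpu` is hypothetical and not part of capz or aks-engine:

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the PCI listing read from stdin
# mentions an NVIDIA device (case-insensitive match).
has_nvidia_gpu() {
  grep -qi 'nvidia'
}

# Usage on a node (assumes pciutils is installed):
#   if lspci | has_nvidia_gpu; then
#     echo "GPU node, installing drivers"
#   fi
```

Reading the listing from stdin keeps the check independent of how the node is inspected, so the same function works with saved `lspci` output.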

sozercan · 2020-05-11T21:12:06Z

If anyone wants to install the drivers and device plugin manually, here are the instructions:

Deploy capz with an N-series SKU (these include NVIDIA GPUs).
Once the cluster is up, SSH into each agent node that has GPUs and run the following.
Install the NVIDIA drivers:

sudo apt update
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers install
nvidia-smi # to verify gpu drivers are installed

Install the NVIDIA container runtime:

curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt update # refresh the package index so the newly added repository is picked up
sudo apt install nvidia-container-runtime -y

Configure containerd:

sudo mkdir -p /etc/containerd
sudo vi /etc/containerd/config.toml # <- add config from: https://gist.github.com/sozercan/51a569cf173ef7e57a375978af8edf26
sudo systemctl restart containerd

Verify that containerd can access the GPUs (you should see the same output as running nvidia-smi on the host):

sudo ctr images pull docker.io/nvidia/cuda:10.0-base
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:10.0-base nvidia-smi nvidia-smi # the first nvidia-smi is the container ID, the second is the command to run

Deploy the NVIDIA device plugin (run once per cluster):

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

You should see the following output in the device plugin logs:

2020/05/09 04:46:53 Loading NVML
2020/05/09 04:46:53 Starting FS watcher.
2020/05/09 04:46:53 Starting OS watcher.
2020/05/09 04:46:53 Retreiving plugins.
2020/05/09 04:46:53 Starting GRPC server for 'nvidia.com/gpu'
2020/05/09 04:46:53 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia.sock
2020/05/09 04:46:53 Registered device plugin for 'nvidia.com/gpu' with Kubelet
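To check for that last log line without reading the logs by eye, something like the following could work. This is a sketch only: `plugin_registered` is a hypothetical helper, and the namespace and label selector in the usage comment are assumptions about the upstream device plugin manifest, so verify them against what is actually deployed.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the device plugin log read from stdin
# contains the registration line shown in the expected output above.
plugin_registered() {
  grep -q "Registered device plugin for 'nvidia.com/gpu' with Kubelet"
}

# Usage against a live cluster (namespace and label are assumptions):
#   kubectl -n kube-system logs -l name=nvidia-device-plugin-ds | plugin_registered \
#     && echo "device plugin registered"
```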

alexeldeib · 2020-07-22T00:26:00Z

Can we possibly pre-provide a kubeadm config template to simplify this?

CecileRobertMichon · 2020-07-22T00:29:30Z

@alexeldeib you mean leverage post kubeadm commands to do the install?

CecileRobertMichon · 2020-07-22T00:29:54Z

/unassign @sozercan

alexeldeib · 2020-07-22T00:36:01Z

yeah, or even just stick it in a file and have the postKubeadmCommands be bash setup.sh.

I'm warming up to the idea of using the templatized types as a way to simplify defaulting / best practices. We could have something like a default GPU kubeadm config template, so users don't need to bring their own.
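As a rough illustration of the setup.sh idea, the manual steps from earlier in this thread could be collected into one script. This is a sketch under several assumptions: it runs as root (so no sudo), it uses apt-get in place of apt because apt's command-line interface is not meant for scripting, the containerd config.toml is delivered separately, and the `DRY_RUN` switch and function names are hypothetical. It defaults to printing the commands, so nothing is installed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of a setup.sh that postKubeadmCommands could invoke.
# The commands mirror the manual steps in this thread; DRY_RUN and the
# function names are hypothetical and exist only for this example.
set -eu

# Print the command instead of running it unless DRY_RUN=0 is set explicitly.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

install_gpu_drivers() {
  run apt-get update
  run apt-get install -y ubuntu-drivers-common
  run ubuntu-drivers install
}

add_nvidia_repo() {
  # Pipelines and redirections are wrapped in sh -c so run() can print them.
  run sh -c 'curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | apt-key add -'
  run sh -c 'curl -s -L "https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release; echo "$ID$VERSION_ID")/nvidia-container-runtime.list" > /etc/apt/sources.list.d/nvidia-container-runtime.list'
  run apt-get update
}

install_container_runtime() {
  run apt-get install -y nvidia-container-runtime
}

main() {
  install_gpu_drivers
  add_nvidia_repo
  install_container_runtime
  # Assumes /etc/containerd/config.toml was already written by other means.
  run systemctl restart containerd
}

main
```

Running it as-is prints the plan; `DRY_RUN=0 sh setup.sh` on a GPU node would execute it.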

CecileRobertMichon · 2020-07-22T00:50:54Z

Yeah I like the idea of having a "reference" flavor template for GPU w/ docs using the bash script Sertac shared above for now, and then maybe open a separate issue for switching the instructions to use the nvidia operator once that works with containerd.

I'm going to mark this as help wanted.

/help

k8s-ci-robot · 2020-07-22T00:50:55Z

@CecileRobertMichon:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

Yeah I like the idea of having a "reference" flavor template for GPU w/ docs using the bash script Sertac shared above for now, and then maybe open a separate issue for switching the instructions to use the nvidia operator once that works with containerd.

I'm going to mark this as help wanted.

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

CecileRobertMichon · 2020-08-12T22:18:58Z

There is also a VM extension available on Azure that might be worth looking into: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux

Not sure if it works with containerd though.

mboersma · 2020-10-15T15:46:46Z

/assign

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 5, 2020

justaugustus added this to the next milestone Mar 30, 2020

k8s-ci-robot assigned sozercan May 7, 2020

jackfrancis mentioned this issue May 15, 2020

Enable nvidia for 18.04-LTS Azure/aks-engine#3274

Closed

CecileRobertMichon added the parity Used to track feature parity with other Azure provisioning tools (AKS, AKS Engine, etc) label Jun 4, 2020

k8s-ci-robot unassigned sozercan Jul 22, 2020

k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 22, 2020

CecileRobertMichon mentioned this issue Oct 8, 2020

Containerd + MicroK8s NVIDIA/gpu-operator#7

Closed

8 tasks

k8s-ci-robot assigned mboersma Oct 15, 2020

mboersma mentioned this issue Oct 18, 2020

✨ Support GPU nodes with "nvidia-gpu" flavor #1002

Merged

3 tasks

k8s-ci-robot closed this as completed in #1002 Oct 29, 2020

mboersma mentioned this issue Oct 29, 2020

Use NVIDIA's gpu-operator for GPU node support #1017

Closed

CecileRobertMichon removed this from the next milestone May 4, 2023
