Add support for GPU nodes #426

Closed
CecileRobertMichon opened this issue Mar 5, 2020 · 12 comments · Fixed by #1002
Labels: help wanted, kind/feature, parity

Comments

@CecileRobertMichon
Contributor

/kind feature

Describe the solution you'd like
[A clear and concise description of what you want to happen.]
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/n-series-driver-setup

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/feature label Mar 5, 2020
@justaugustus added this to the next milestone Mar 30, 2020
@CecileRobertMichon
Contributor Author

We should look into how/if https://github.com/NVIDIA/gpu-operator can be leveraged for this. See https://docs.nvidia.com/datacenter/kubernetes/openshift-on-gpu-install-guide/index.html.

@sozercan
Contributor

sozercan commented May 7, 2020

/assign

@sozercan
Contributor

Looks like NVIDIA gpu-operator doesn't support containerd yet (containerd itself supports GPUs but needs runtime configuration changes to use them with device plugin).
NVIDIA/gpu-operator#7

If we don't want to wait for containerd support, we can resolve this the way aks-engine does: install the GPU driver on N-series nodes and then deploy the device plugin separately. Not sure whether capz has a similar way to run CSE on specific nodes, though.
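
If per-node execution turns out not to be available, one possible workaround is a script that runs on every node but exits early where no GPU is present. This is only a sketch: the function name `has_nvidia_gpu` is hypothetical, and it assumes `lspci` (from pciutils) is installed on the node.

```shell
# Succeeds if lspci-style text on stdin mentions an NVIDIA device.
# The name has_nvidia_gpu is made up for this sketch.
has_nvidia_gpu() {
  grep -qi 'nvidia'
}

# Intended use at the top of a node script (assumes lspci is installed):
#   lspci | has_nvidia_gpu || { echo "no NVIDIA GPU found, skipping"; exit 0; }
```

Matching on the vendor string is a coarse heuristic; it says nothing about whether the installed driver supports the specific card.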

@sozercan
Contributor

sozercan commented May 11, 2020

If anyone wants to install the drivers and device plugin manually, here are the instructions:

  • Deploy capz with an N-series SKU (these include NVIDIA GPUs).

  • Once the cluster is up, SSH into each agent node that has GPUs and run the following:

  • Install NVIDIA drivers:

sudo apt update
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers install
nvidia-smi # to verify gpu drivers are installed
  • Install NVIDIA container runtime:
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$(. /etc/os-release;echo $ID$VERSION_ID)/nvidia-container-runtime.list | sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt update # refresh package lists so the newly added repository is visible
sudo apt install nvidia-container-runtime -y
  • Configure containerd:
sudo mkdir -p /etc/containerd
sudo vi /etc/containerd/config.toml # <- add config from: https://gist.github.com/sozercan/51a569cf173ef7e57a375978af8edf26
sudo systemctl restart containerd
  • Verify containerd can access the GPUs (you should see the same output as nvidia-smi on the host):
sudo ctr images pull docker.io/nvidia/cuda:10.0-base
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:10.0-base nvidia-smi nvidia-smi # first nvidia-smi is the container ID, second is the command
  • Deploy the NVIDIA device plugin (run once per cluster):
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
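
For anyone puzzled by the `$(. /etc/os-release;echo $ID$VERSION_ID)` part of the container runtime step: it sources the os-release file in a subshell and joins the distribution ID and version into the path segment of the repository URL. A small sketch of the same idea as a function follows; the function name and the optional file argument are assumptions added for illustration.

```shell
# Print the nvidia-container-runtime apt list URL for an os-release file.
# The function name and the optional path argument are hypothetical;
# the path defaults to /etc/os-release as in the one-liner above.
# The ( ... ) body runs in a subshell so ID and VERSION_ID do not leak out.
nvidia_runtime_list_url() (
  os_release_file="${1:-/etc/os-release}"
  . "$os_release_file"
  echo "https://nvidia.github.io/nvidia-container-runtime/${ID}${VERSION_ID}/nvidia-container-runtime.list"
)
```

On a node whose os-release has `ID=ubuntu` and `VERSION_ID="18.04"`, the path segment becomes `ubuntu18.04`.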

After the device plugin starts, you should see the following output in its logs:

2020/05/09 04:46:53 Loading NVML
2020/05/09 04:46:53 Starting FS watcher.
2020/05/09 04:46:53 Starting OS watcher.
2020/05/09 04:46:53 Retreiving plugins.
2020/05/09 04:46:53 Starting GRPC server for 'nvidia.com/gpu'
2020/05/09 04:46:53 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia.sock
2020/05/09 04:46:53 Registered device plugin for 'nvidia.com/gpu' with Kubelet
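
To check for that last line in a script rather than by eye, something like the following could work. It is a sketch only: the function name is made up, and the pod name and namespace in the usage comment are placeholders to fill in for your cluster.

```shell
# Succeeds if the log text on stdin contains the registration line shown above.
# The function name is hypothetical; -F makes grep match the text literally.
device_plugin_registered() {
  grep -qF "Registered device plugin for 'nvidia.com/gpu' with Kubelet"
}

# Example use (pod name and namespace are placeholders):
#   kubectl logs -n <namespace> <device-plugin-pod> | device_plugin_registered \
#     && echo "device plugin registered"
```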

@CecileRobertMichon added the parity label Jun 4, 2020
@alexeldeib
Contributor

Can we possibly pre-provide a kubeadm config template to simplify this?

@CecileRobertMichon
Contributor Author

@alexeldeib you mean leverage post kubeadm commands to do the install?

@CecileRobertMichon
Contributor Author

/unassign @sozercan

@alexeldeib
Contributor

alexeldeib commented Jul 22, 2020

yeah, or even just stick it in a file and have the postKubeadmCommands be bash setup.sh.

I'm warming up to the idea of using the templatized types as a way to simplify defaulting / best practices. We could have something like a default GPU kubeadm config template, so users don't need to bring their own
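
A rough idea of what such a setup.sh could look like, covering only the driver install and the containerd restart from the manual steps earlier in this thread. Everything structural here (the function names, the `DRY_RUN` switch and its default) is an assumption for illustration, not something capz provides.

```shell
#!/bin/sh
# Hypothetical setup.sh sketch; function names and the DRY_RUN switch are
# assumptions for illustration, not part of capz.
set -eu

# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Driver install, mirroring the manual steps shared earlier in this thread.
install_drivers() {
  run sudo apt update
  run sudo apt install ubuntu-drivers-common -y
  run sudo ubuntu-drivers install
}

# containerd has to be restarted after its config.toml is updated.
restart_containerd() {
  run sudo systemctl restart containerd
}

main() {
  install_drivers
  restart_containerd
}

main
```

In a kubeadm config template the script could be placed on the node through `files` and invoked from `postKubeadmCommands` with `DRY_RUN=0`; the exact path is up to the template author.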

@CecileRobertMichon
Contributor Author

Yeah I like the idea of having a "reference" flavor template for GPU w/ docs using the bash script Sertac shared above for now, and then maybe open a separate issue for switching the instructions to use the nvidia operator once that works with containerd.

I'm going to mark this as help wanted.

/help

@k8s-ci-robot
Contributor

@CecileRobertMichon:
This request has been marked as needing help from a contributor.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.


@k8s-ci-robot added the help wanted label Jul 22, 2020
@CecileRobertMichon
Contributor Author

There is also a VM extension available on Azure that might be worth looking into: https://docs.microsoft.com/en-us/azure/virtual-machines/extensions/hpccompute-gpu-linux

Not sure if it works with containerd though.

@mboersma
Contributor

/assign
