Use NVIDIA's gpu-operator for GPU node support #1017

Closed
mboersma opened this issue Oct 29, 2020 · 6 comments · Fixed by #1254

@mboersma
Contributor

/kind feature

Describe the solution you'd like

#1002 implemented the "nvidia-gpu" flavor via postKubeadmCommands recommended by NVIDIA, as explained in this comment.

But NVIDIA's gpu-operator seems like a cleaner, more future-proof solution. We should investigate whether it supports containerd now and whether the current implementation could be replaced with gpu-operator.

Anything else you would like to add:

See the discussion in #426 and the current implementation in #1002.

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 29, 2020
@CecileRobertMichon CecileRobertMichon added this to the next milestone Nov 12, 2020
@jessehu

jessehu commented Nov 28, 2020

NVIDIA's gpu-operator doesn't support containerd yet, per the discussion in NVIDIA/gpu-operator#7.

@nader-ziada
Contributor

/assign @mboersma

@rosskukulinski

Containerd 1.4 support is now live in gpu-operator 1.4. +100 for leveraging gpu-operator

@shysank
Contributor

shysank commented Feb 4, 2021

@mboersma are you working on this? If not, can I pick this one up?

@mboersma
Contributor Author

mboersma commented Feb 4, 2021

/assign @shysank

@shysank I am not currently working on this, so please have at it (and thank you).

When I looked at it in December, the issue was that restarting a node made Kubernetes lose track of the GPU device, which didn't seem to be a problem with the existing pre/postKubeadmCommands approach. Hopefully that is fixed now, or you can find a workaround.
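
One way to check for the restart symptom described above is to compare the GPU count each node advertises before and after a reboot. The sketch below is only an illustration, not something from this thread: it assumes the GPUs are exposed under the `nvidia.com/gpu` extended resource and that the input is the parsed output of `kubectl get nodes -o json`. The helper names `gpu_counts` and `lost_gpus` are hypothetical.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumed resource name advertised by the NVIDIA device plugin


def gpu_counts(node_list: dict) -> dict:
    """Map node name -> allocatable GPU count from a parsed `kubectl get nodes -o json` document."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Nodes that advertise no GPU resource are counted as 0.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def lost_gpus(before: dict, after: dict) -> list:
    """Return the sorted names of nodes that advertise fewer GPUs after the restart than before."""
    return sorted(
        name for name, count in before.items()
        if count > 0 and after.get(name, 0) < count
    )
```

To use it, capture `kubectl get nodes -o json` before and after the node restart, load each capture with `json.load`, and pass the two `gpu_counts` results to `lost_gpus`. An empty list means no node dropped its GPUs.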

@shysank
Contributor

shysank commented Feb 11, 2021

There is a compatibility issue between gpu-operator and containerd v1.3.0. TL;DR: containerd expects the default_runtime_name field to be set to nvidia. I have opened a PR that I believe fixes it. We will have to wait for the NVIDIA folks to confirm it and for a timeline on when the fix will be available.
