Use NVIDIA's gpu-operator for GPU node support #1017

mboersma · 2020-10-29T18:14:55Z

/kind feature

Describe the solution you'd like

#1002 implemented the "nvidia-gpu" flavor via postKubeadmCommands recommended by NVIDIA, as explained in this comment.

But NVIDIA's gpu-operator seems like a cleaner, more future-proof solution. We should investigate whether it supports containerd now and whether the current implementation could be replaced with gpu-operator.

Anything else you would like to add:

See the discussion in #426 and the current implementation in #1002.

Environment:

cluster-api-provider-azure version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

jessehu · 2020-11-28T03:17:03Z

NVIDIA's gpu-operator doesn't support containerd yet, per the discussion in NVIDIA/gpu-operator#7.

nader-ziada · 2020-12-10T16:43:52Z

/assign @mboersma

rosskukulinski · 2021-01-22T18:20:01Z

Containerd 1.4 support is now live in gpu-operator 1.4. +100 for leveraging gpu-operator

shysank · 2021-02-04T20:23:17Z

@mboersma are you working on this? If not, can I pick this one up?

mboersma · 2021-02-04T21:13:19Z

/assign @shysank

@shysank I am not currently working on this, so please have at it (and thank you).

When I looked at it in December, restarting a node made Kubernetes lose track of the GPU device, a problem that didn't seem to occur with the existing Pre|PostKubeAdmCommand approach. Hopefully that is fixed now, or you can find a workaround.

shysank · 2021-02-11T22:45:57Z

There is a compatibility issue between gpu-operator and containerd v1.3.0. TL;DR: containerd expects a default_runtime_name field to be set to nvidia. I have opened a PR that I believe is the fix. We will have to wait for the NVIDIA folks to confirm it and to give a timeline for when it will be available.
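For illustration only, here is a minimal sketch of the kind of check implied by the default_runtime_name requirement. It assumes the containerd config has already been parsed from TOML into a nested dict and that the setting lives under the CRI plugin section of a version 2 config. The helper name `default_runtime_is_nvidia`, the sample values, and the runtime binary path are hypothetical and are not taken from the PR mentioned above.

```python
# Key under which containerd's CRI plugin keeps its settings (version 2 config).
CRI_PLUGIN = "io.containerd.grpc.v1.cri"


def default_runtime_is_nvidia(config: dict) -> bool:
    """Return True if the CRI default runtime is "nvidia" and that runtime is defined.

    Hypothetical helper for illustration; not part of gpu-operator or containerd.
    """
    containerd = config.get("plugins", {}).get(CRI_PLUGIN, {}).get("containerd", {})
    runtimes = containerd.get("runtimes", {})
    return containerd.get("default_runtime_name") == "nvidia" and "nvidia" in runtimes


# Assumed shape of a parsed /etc/containerd/config.toml; all values are illustrative.
sample_config = {
    "version": 2,
    "plugins": {
        CRI_PLUGIN: {
            "containerd": {
                "default_runtime_name": "nvidia",
                "runtimes": {
                    "nvidia": {
                        "runtime_type": "io.containerd.runc.v1",
                        "options": {"BinaryName": "/usr/bin/nvidia-container-runtime"},
                    }
                },
            }
        }
    },
}

print(default_runtime_is_nvidia(sample_config))
```

A config that lacks the field, or that names a runtime with no matching entry under `runtimes`, makes the helper return False, which is roughly the situation described for containerd v1.3.0 above.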

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 29, 2020

CecileRobertMichon added this to the next milestone Nov 12, 2020

k8s-ci-robot assigned mboersma Dec 10, 2020

nader-ziada modified the milestones: next, v0.4.11 Dec 10, 2020

k8s-ci-robot assigned shysank Feb 4, 2021

CecileRobertMichon modified the milestones: v0.4.11, v0.5.0 Feb 5, 2021

CecileRobertMichon modified the milestones: v0.5.0, v0.5.x Mar 18, 2021

shysank mentioned this issue Mar 22, 2021: Use nvidia gpu operator for nvidia-gpu flavor #1254 (merged)

k8s-ci-robot closed this as completed in #1254 Apr 1, 2021
