Replies: 12 comments
-
I had this same issue and I was able to fix it by applying the changes from https://github.com/NVIDIA/k8s-device-plugin#configure-containerd in a config.toml.tmpl based on the format here: https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go. That also included removing the default nvidia plugin detection in the template (which could probably be brought back to fit with the correct config). Here's the diff:
I restarted k3s and I also had to delete the nvidia-device-plugin-daemonset pod. After that it stopped showing:
And logged:
One thing to be aware of (that I'm still checking on) is that after a reboot, all of my kube-system pods started to fail with CrashLoopBackOff. I found that other people had an issue linked with the Cgroup line in #5454. I confirmed that removing the nvidia config from the config.toml.tmpl file stops the CrashLoopBackOff condition, but I'm still not entirely sure why. Edit: note that after adding the SystemdCgroup line to the nvidia runtime options section, my containers stopped crashing:
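A minimal sketch of the kind of runtime block being described, assuming the stock paths from the NVIDIA device-plugin README for containerd (the real config.toml.tmpl should start from the full upstream k3s template in templates_linux.go; this shows only the nvidia-specific part):

```sh
# Sketch only: append the nvidia runtime block to a config.toml.tmpl that was
# first copied from k3s's upstream containerd template (templates_linux.go).
cat <<'EOF' >> /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true   # the line that stopped the CrashLoopBackOff described above
EOF

systemctl restart k3s        # use k3s-agent on agent-only nodes
```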
-
It sounds like the main difference here is just that we need to set `SystemdCgroup = true`. Do you know which release of the nvidia container runtime started requiring this?
-
Relevant issue: NVIDIA/k8s-device-plugin#406
-
After trying out all the suggestions from here and other issues, I got it working by following this blog post: https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116
-
That link gives me an HTTP 404. However, I have solved the problem. The reason is that k3s detects the nvidia container runtime, but it does not make it the default one. The Helm chart, or the pod spec, has to request the nvidia runtime explicitly via runtimeClassName.
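A minimal sketch of what that looks like, assuming the nvidia RuntimeClass already exists in the cluster and using an example CUDA image:

```sh
# Sketch: run a throwaway pod explicitly on the nvidia runtime.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test          # hypothetical name, for illustration only
spec:
  runtimeClassName: nvidia       # the runtime k3s detected but does not use by default
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image/tag, adjust as needed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```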
-
It does not work for me.
-
@xinmans Try applying this manifest:
And re-create the nvidia plugin. Relevant: NVIDIA/k8s-device-plugin#406 (comment)
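A minimal sketch of the RuntimeClass shape described in the k3s docs and the linked comment (not necessarily the exact manifest referred to above; the device-plugin pod label is the one used by the upstream static manifest):

```sh
# Sketch: create the nvidia RuntimeClass, then recreate the device-plugin pod
# so it restarts on the nvidia runtime.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds
```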
-
There's a dot at the end of the URL for some reason; that needs to be removed. In any case, the mentioned article uses the GPU Operator, which in turn uses the Operator Framework and automates this whole process. It worked immediately for me, YMMV. https://github.com/NVIDIA/gpu-operator Using helm:
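Roughly like this (a sketch of the standard install; on k3s the operator may additionally need to be pointed at k3s's containerd socket and config paths via chart values, see the GPU Operator docs):

```sh
# Sketch: install the NVIDIA GPU Operator with Helm.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```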
-
@henryford my bad, I updated the Medium article link. Good to see that you got it working.
-
I cannot get k3s to recognize my GPU. I have followed the official docs, and my config.toml lists the nvidia runtime.
But checking for GPU availability on my node I get:
and any pod initialized with a GPU remains in Pending.
Notes/additional questions:
-
I'm going to convert this to a discussion, as it seems like a K8s/NVIDIA-related issue rather than a k3s bug.
-
check this out: #9231 (comment). @jmagoon, thanks for the hint.
-
Environmental Info:
K3s Version: v1.27.4+k3s1
Node(s) CPU architecture, OS, and Version:
169092810522.04~d567a38 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 Server, 1 agent
Describe the bug:
Nvidia device plugin pod is in CrashLoopBackOff and unable to detect the GPU.
The documentation to enable GPU workloads at https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support doesn't work anymore when using the latest nvidia drivers (535) and the NVIDIA container toolkit (1.13.5).
Steps To Reproduce:
Note: I installed both with and without base because I wasn't sure how to proceed regarding CDI support in K3S
Note: I have restarted k3s-agent just in case
Note: there are additional containerd instructions required here, which I didn't follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
Expected behavior:
Expecting kubectl describe node gpu1 to show the GPU specification and the corresponding annotations.
Actual behavior:
The node gpu1 is not showing any GPU-related components. I didn't run the nbody-gpu-benchmark pod to test, given the resource limit specification in nbody-gpu-benchmark.
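For anyone reproducing this, a quick way to check whether the GPU is being advertised on the node (node name taken from this report):

```sh
# Check whether the device plugin has registered the GPU resource on the node.
kubectl describe node gpu1 | grep -i 'nvidia.com/gpu'
kubectl get node gpu1 -o jsonpath='{.status.allocatable}'
```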
Additional context / logs:
The K3S documentation for Nvidia runtime https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support describes a working solution using driver 515.
I used this approach successfully until now (with k3s v1.24, NFD v0.13 and gpu-feature-discovery), but I have recently upgraded my GPU and installed the newer driver version 535 for compatibility. I also reinstalled k3s v1.27.4+k3s1 in the process.
Ideas for resolution:
It could be a regression caused by using the latest nvidia driver 535, but I haven't tested that yet, given how long it would take to downgrade and test.
There are additional instructions for the containerd runtime configuration described in the Nvidia device plugin README, which I didn't follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
Should I define them in config.toml.tmpl?
There is now CDI (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#step-2-generate-a-cdi-specification), but there are no instructions for containerd, let alone for k3s (see the sketch below).
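For what it's worth, the CDI spec itself can be generated with nvidia-ctk as described in that guide; whether and how k3s's bundled containerd then picks it up is exactly the open question here:

```sh
# Generate a CDI specification for the installed GPUs (per the NVIDIA toolkit docs).
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the devices described by the generated spec.
nvidia-ctk cdi list
```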
Not sure if this is on the K3s or the Nvidia side; looking forward to hearing your feedback.
Thank you in advance
Jean-Paul