Replies: 12 comments
-
I had this same issue and I was able to fix it by applying the changes from https://github.com/NVIDIA/k8s-device-plugin#configure-containerd in a config.toml.tmpl based on the format here: https://github.com/k3s-io/k3s/blob/master/pkg/agent/templates/templates_linux.go. That also included removing the default nvidia plugin detection in the template (which could probably be brought back to fit with the correct config). Here's the diff:
I restarted k3s and I also had to delete the nvidia-device-plugin-daemonset pod. After that it stopped showing:
And logged:
One thing to be aware of (that I'm still checking on) is that after a reboot, all of my kube-system pods started to fail with CrashLoopBackOff. I found that other people had an issue linked with the Cgroup line in #5454. I confirmed that removing the nvidia config from the config.toml.tmpl file stops the CrashLoopBackOff condition, but I'm still not entirely sure why. Edit: note that after adding the SystemdCgroup line to the nvidia runtime options section, my containers stopped crashing:
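A minimal sketch of the kind of runtime block being described, assuming the stock paths from the NVIDIA device-plugin README for containerd (the real config.toml.tmpl should start from the full upstream k3s template in templates_linux.go; this shows only the nvidia-specific part):

```sh
# Sketch only: append the nvidia runtime block to a config.toml.tmpl that was
# first copied from k3s's upstream containerd template (templates_linux.go).
cat <<'EOF' >> /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true   # the line that stopped the CrashLoopBackOff described above
EOF

systemctl restart k3s        # use k3s-agent on agent-only nodes
```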
-
It sounds like the main difference here is just that we need to set `SystemdCgroup = true`. Do you know which release of the nvidia container runtime started requiring this?
-
Relevant issue: NVIDIA/k8s-device-plugin#406
-
After trying out all the suggestions from here and other issues, I got it working by following this blog post: https://medium.com/sparque-labs/serving-ai-models-on-the-edge-using-nvidia-gpu-with-k3s-on-aws-part-4-dd48f8699116
-
That link gives me an HTTP 404. However, I have solved the problem. The reason is that k3s detects the nvidia container runtime, but it does not make it the default one. The Helm chart, or the pod spec, has to request the nvidia runtime explicitly via runtimeClassName.
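A minimal sketch of what that looks like, assuming the nvidia RuntimeClass already exists in the cluster and using an example CUDA image:

```sh
# Sketch: run a throwaway pod explicitly on the nvidia runtime.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-test          # hypothetical name, for illustration only
spec:
  runtimeClassName: nvidia       # the runtime k3s detected but does not use by default
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.2.0-base-ubuntu22.04   # example image/tag, adjust as needed
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```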
-
It does not work for me.
-
@xinmans Try applying this manifest:
And re-create the nvidia plugin. Relevant: NVIDIA/k8s-device-plugin#406 (comment)
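A minimal sketch of the RuntimeClass shape described in the k3s docs and the linked comment (not necessarily the exact manifest referred to above; the device-plugin pod label is the one used by the upstream static manifest):

```sh
# Sketch: create the nvidia RuntimeClass, then recreate the device-plugin pod
# so it restarts on the nvidia runtime.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
EOF

kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds
```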
-
There's a dot at the end of the URL for some reason; that needs to be removed. In any case, the mentioned article uses the GPU Operator, which in turn uses the Operator Framework and automates this whole process. It worked immediately for me, YMMV. https://github.com/NVIDIA/gpu-operator Using helm:
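Roughly like this (a sketch of the standard install; on k3s the operator may additionally need to be pointed at k3s's containerd socket and config paths via chart values, see the GPU Operator docs):

```sh
# Sketch: install the NVIDIA GPU Operator with Helm.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace --wait
```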
-
@henryford my bad, I updated the Medium article link. Good to see that you got it working.
-
I cannot get k3s to recognize my GPU. I have followed the official docs, and my config.toml lists the nvidia runtime.
But checking for GPU availability on my node I get:
and any pod initialized with a GPU remains in Pending.
Notes/additional questions:
-
I'm going to convert this to a discussion, as it seems like a K8s/NVIDIA-related issue rather than a k3s bug.
-
check this out: #9231 (comment). @jmagoon, thanks for the hint.
-
Environmental Info:
K3s Version: v1.27.4+k3s1
Node(s) CPU architecture, OS, and Version:
169092810522.04~d567a38 SMP PREEMPT_DYNAMIC Tue A x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
1 Server, 1 agent
Describe the bug:
Nvidia device plugin pod is in CrashLoopBackOff and unable to detect the GPU.
The documentation to enable GPU workloads at https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support doesn't work anymore when using the latest nvidia drivers (535) and the NVIDIA container toolkit (1.13.5).
Steps To Reproduce:
Note: I installed both with and without base because I wasn't sure how to proceed regarding CDI support in K3S
Note: I have restarted k3s-agent just in case
Note: there are additional containerd instructions required here, which I didn't follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
Expected behavior:
Expecting kubectl describe node gpu1 to show the GPU specification and the corresponding annotations.
Actual behavior:
The node gpu1 is not showing any GPU-related components. I didn't run the nbody-gpu-benchmark pod to test, given the resource limit specification in nbody-gpu-benchmark.
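For anyone reproducing this, a quick way to check whether the GPU is being advertised on the node (node name taken from this report):

```sh
# Check whether the device plugin has registered the GPU resource on the node.
kubectl describe node gpu1 | grep -i 'nvidia.com/gpu'
kubectl get node gpu1 -o jsonpath='{.status.allocatable}'
```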
Additional context / logs:
The K3S documentation for Nvidia runtime https://docs.k3s.io/advanced?_highlight=nvidia#nvidia-container-runtime-support describes a working solution using driver 515.
I used this approach successfully until now (with k3s v1.24, NFD v0.13 and gpu-feature-discovery), but I have recently upgraded my GPU and installed the newer driver version 535 for compatibility. I also reinstalled k3s v1.27.4+k3s1 in the process.
Ideas for resolution:
It could be a regression caused by using the latest nvidia driver 535, but I haven't tested that yet, given how long it would take to downgrade and test.
There are additional instructions for the containerd runtime configuration described in the Nvidia device plugin README, which I didn't follow: https://github.com/NVIDIA/k8s-device-plugin#configure-containerd
Should I define them in config.toml.tmpl?
There is now CDI (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#step-2-generate-a-cdi-specification), but there are no instructions for containerd, let alone for k3s (see the sketch below).
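For what it's worth, the CDI spec itself can be generated with nvidia-ctk as described in that guide; whether and how k3s's bundled containerd then picks it up is exactly the open question here:

```sh
# Generate a CDI specification for the installed GPUs (per the NVIDIA toolkit docs).
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the devices described by the generated spec.
nvidia-ctk cdi list
```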
Not sure if this is on the K3s or the Nvidia side; looking forward to hearing your feedback.
Thank you in advance
Jean-Paul