
nvidia-device-plugin-daemonset toolkit validation fails with containerd #143

Closed
3 of 5 tasks
shysank opened this issue Feb 8, 2021 · 27 comments

@shysank

shysank commented Feb 8, 2021

1. Quick Debug Checklist

  • Are you running on an Ubuntu 18.04 node?
  • Are you running Kubernetes v1.13+?
  • Are you running Docker (>= 18.06) or CRIO (>= 1.13+)?
  • Do you have i2c_core and ipmi_msghandler loaded on the nodes?
  • Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)

1. Issue or feature description

I am trying to use the GPU operator in a Kubernetes cluster created with Cluster API for Azure. After installing the operator, the nvidia-device-plugin-daemonset fails to come up: it crashes in the init container that runs a validation pod. On further inspection, I noticed that it was failing with ImageInspectError. The event log:

Events:
  Type     Reason         Age                   From               Message
  ----     ------         ----                  ----               -------
  Normal   Scheduled      10m                   default-scheduler  Successfully assigned gpu-operator-resources/nvidia-device-plugin-daemonset-f99md to cl-gpu-md-0-f4gm6
  Warning  InspectFailed  10m (x3 over 10m)     kubelet            Failed to inspect image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
  Warning  Failed         10m (x3 over 10m)     kubelet            Error: ImageInspectError
  Normal   Pulling        9m57s                 kubelet            Pulling image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
  Normal   Pulled         9m53s                 kubelet            Successfully pulled image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"
  Normal   Created        9m8s (x4 over 9m53s)  kubelet            Created container toolkit-validation
  Normal   Started        9m8s (x4 over 9m53s)  kubelet            Started container toolkit-validation
  Normal   Pulled         9m8s (x3 over 9m52s)  kubelet            Container image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" already present on machine
  Warning  BackOff        10s (x45 over 9m51s)  kubelet            Back-off restarting failed container

PS: I'm using containerd for container management.

The VM type is Azure NCv3 series.

2. Steps to reproduce the issue

  1. Create k8s cluster with one worker node with nvidia gpu
  2. Once the nodes are ready, install the nvidia gpu operator using helm install --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd
  3. Observe kubectl -n gpu-operator-resources get pods
@shysank
Author

shysank commented Feb 8, 2021

kubelet logs:

Feb 08 22:08:41 cl-gpu-md-0-f4gm6 kubelet[2382]: I0208 22:08:41.658431    2382 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "nvidia-device-plugin-token-m8t9x" (UniqueName: "kubernetes.io/secret/49229dd6-e95e-4238-aa84-132b5de04b1b-nvidia-device-plugin-token-m8t9x") pod "nvidia-device-plugin-daemonset-f99md" (UID: "49229dd6-e95e-4238-aa84-132b5de04b1b")
Feb 08 22:08:41 cl-gpu-md-0-f4gm6 kubelet[2382]: I0208 22:08:41.658477    2382 reconciler.go:207] operationExecutor.VerifyControllerAttachedVolume started for volume "device-plugin" (UniqueName: "kubernetes.io/host-path/49229dd6-e95e-4238-aa84-132b5de04b1b-device-plugin") pod "nvidia-device-plugin-daemonset-f99md" (UID: "49229dd6-e95e-4238-aa84-132b5de04b1b")
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.095523    2382 remote_image.go:87] ImageStatus "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" from image service failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.095565    2382 kuberuntime_image.go:85] ImageStatus for image {"nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"} failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.095602    2382 kuberuntime_manager.go:783] init container start failed: ImageInspectError: Failed to inspect image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.095650    2382 pod_workers.go:191] Error syncing pod 49229dd6-e95e-4238-aa84-132b5de04b1b ("nvidia-device-plugin-daemonset-f99md_gpu-operator-resources(49229dd6-e95e-4238-aa84-132b5de04b1b)"), skipping: failed to "StartContainer" for "toolkit-validation" with ImageInspectError: "Failed to inspect image \"nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2\": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.369027    2382 remote_image.go:87] ImageStatus "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" from image service failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.369064    2382 kuberuntime_image.go:85] ImageStatus for image {"nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2"} failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.369097    2382 kuberuntime_manager.go:783] init container start failed: ImageInspectError: Failed to inspect image "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"
Feb 08 22:08:42 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:42.369153    2382 pod_workers.go:191] Error syncing pod 49229dd6-e95e-4238-aa84-132b5de04b1b ("nvidia-device-plugin-daemonset-f99md_gpu-operator-resources(49229dd6-e95e-4238-aa84-132b5de04b1b)"), skipping: failed to "StartContainer" for "toolkit-validation" with ImageInspectError: "Failed to inspect image \"nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2\": rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused\""
Feb 08 22:08:43 cl-gpu-md-0-f4gm6 kubelet[2382]: E0208 22:08:43.369978    2382 remote_image.go:87] ImageStatus "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2" from image service failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused"

@shysank
Author

shysank commented Feb 8, 2021

kubectl get pods -A

default                  gpu-operator-1612821988-node-feature-discovery-master-664dnsmww   1/1     Running                 0          107m
default                  gpu-operator-1612821988-node-feature-discovery-worker-64mcz       1/1     Running                 0          107m
default                  gpu-operator-1612821988-node-feature-discovery-worker-h5rws       1/1     Running                 0          107m
default                  gpu-operator-7d6d75f67c-jsrbx                                     1/1     Running                 0          107m
gpu-operator-resources   gpu-feature-discovery-zclf4                                       1/1     Running                 0          106m
gpu-operator-resources   nvidia-container-toolkit-daemonset-6n6q6                          1/1     Running                 0          106m
gpu-operator-resources   nvidia-device-plugin-daemonset-cgc57                              0/1     Init:CrashLoopBackOff   18         69m
gpu-operator-resources   nvidia-driver-daemonset-7b755                                     1/1     Running                 0          106m
kube-system              calico-kube-controllers-7b4f58f565-ps2fl                          1/1     Running                 0          137m
kube-system              calico-node-9mp6v                                                 1/1     Running                 0          137m
kube-system              calico-node-v7mct                                                 1/1     Running                 0          137m
kube-system              coredns-5644d7b6d9-nq7tz                                          1/1     Running                 0          141m
kube-system              coredns-5644d7b6d9-txjfx                                          1/1     Running                 0          141m
kube-system              etcd-cl-gpu-control-plane-462xk                                   1/1     Running                 0          141m
kube-system              kube-apiserver-cl-gpu-control-plane-462xk                         1/1     Running                 0          141m
kube-system              kube-controller-manager-cl-gpu-control-plane-462xk                1/1     Running                 0          140m
kube-system              kube-proxy-k78cd                                                  1/1     Running                 0          139m
kube-system              kube-proxy-rrl7v                                                  1/1     Running                 0          141m
kube-system              kube-scheduler-cl-gpu-control-plane-462xk                         1/1     Running                 0          140m

@shivamerla
Contributor

@shysank Did you follow the documentation to install with the correct config for containerd (i.e., --set operator.defaultRuntime=containerd)? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-the-gpu-operator

Also, which version of the operator are you trying to install? Can you try with the latest 1.5.2 operator and confirm?

@shysank
Author

shysank commented Feb 9, 2021

@shysank Did you follow the documentation to install with the correct config for containerd (i.e., --set operator.defaultRuntime=containerd)? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-the-gpu-operator

Yes, I ran helm install --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd

Also, which version of the operator are you trying to install? Can you try with the latest 1.5.2 operator and confirm?

I tried with 1.5.1 and 1.5.2, and got the same error.

@shysank
Author

shysank commented Feb 9, 2021

This is what I see in /etc/containerd/config.toml

[plugins]

  [plugins.cri]

    [plugins.cri.containerd]

      [plugins.cri.containerd.default_runtime]
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runtime.v1.linux"

        [plugins.cri.containerd.default_runtime.options]
          Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

      [plugins.cri.containerd.runtimes]

        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runtime.v1.linux"

          [plugins.cri.containerd.runtimes.nvidia.options]
            Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

@shivamerla
Contributor

Adding @klueska for more input regarding the error: Error while dialing dial unix /run/containerd/containerd.sock: connect: connection refused

@klueska
Contributor

klueska commented Feb 10, 2021

Is containerd running?
Can you launch other containers while it is in this state?
What is the output of systemctl status containerd?

@shysank
Author

shysank commented Feb 10, 2021

Is containerd running?
Can you launch other containers while it is in this state?

Yes

capi@cl-gpu-md-0-f4gm6:~$ sudo ctr -n k8s.io c ls
CONTAINER                                                           IMAGE                                                                                                                          RUNTIME
09c9eeef67725285a8a7ef99d83454cec01511ee725e1c2fb25fccbbe70cbf1c    nvcr.io/nvidia/cuda@sha256:ed723a1339cddd75eb9f2be2f3476edf497a1b189c10c9bf9eb8da4a16a51a59                                    io.containerd.runc.v1
0b2131ab528b94f15d6daf1101f86cffd53e53802916d1d8b99ef303676779fb    k8s.gcr.io/coredns:1.6.2                                                                                                       io.containerd.runc.v1
0f5296c773af2f276ddc333cf57f9b7a1c78f87da9843e3657dc935a8fdce1c0    sha256:fe49caa20c30177786bc077a9ea90cffbea8dee29241aedeb40b93d5b27f27df                                                        io.containerd.runc.v1
16e9759d9860e71aa8e62d4cc591106f8b3ae644b7b02f4ea2766dd9386789a4    sha256:fe49caa20c30177786bc077a9ea90cffbea8dee29241aedeb40b93d5b27f27df                                                        io.containerd.runc.v1
1a320ec36a3497558557101449eabf0b84413ba9f23609b265a5188720976df2    docker.io/calico/pod2daemon-flexvol@sha256:79fcc69371ae1c44d9e14bd6fea0effb5a9b025e23d2b2aed1483d325657a8ce                    io.containerd.runc.v1
2af96a3a99810f54b1fbf2ac464a4f155026bb3b499e18ac025aa3c14ec9330c    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
2c962e34b5a7640e5ff6493590f304c137e38b2ffe315ae3f4d4f71eaf8285d1    docker.io/calico/node:v3.16.3                                                                                                  io.containerd.runc.v1
421a02d18432e9717744914b4a5bbe2275cf170a4c93fd60f3fc40ca9da664c0    nvcr.io/nvidia/k8s/cuda-sample@sha256:4593078cdb8e786d35566faa2b84da1123acea42f0d4099e84e2af0448724af1                         io.containerd.runc.v1
423b9983540ae40d5ccd11194d9da003717dd6a449c64bb511b0652b6de70c36    nvcr.io/nvidia/k8s/container-toolkit:1.4.4-ubuntu18.04                                                                         io.containerd.runc.v1
501874f86dddf1f2a3900c2c80d14e5ce437102bc33bd865a56ab997a6165a2a    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
55f38f18be83d331158b572e0035466c9f55b5cea76b74b6029ea8fa04d2bb90    quay.io/kubernetes_incubator/node-feature-discovery@sha256:a1e72dbc35a16cbdcf0007fc4fb207bce723ff67c61853d2d8d8051558ce6de7    io.containerd.runc.v1
571dabfbebfb21ddc06c6a4bbbf8c43646b9ecb3237e33e059f939fd5ffb4205    nvcr.io/nvidia/driver:450.80.02-ubuntu18.04                                                                                    io.containerd.runc.v1
621e71ef58084673083e627afe09e583dfa0df7c332b06eb6106bb97c19677fb    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
69026289a603e5e695d90e8787d93f4e3df2bb73615a95a948cf731c96aa92d5    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
7cea830b8de57b3a485226f34b00da3f962d17cc9bd58ea633cdae458cb90a6e    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
b8b546f9ce55fc75a692821cf214c01c8e147d075d6dfbcc98d842b2c1606a83    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
cdf608f6717db60f8be71f2f158598da6b9b51293268164eb809f611f447c672    k8s.gcr.io/kube-proxy:v1.16.3                                                                                                  io.containerd.runc.v1
d4500f6b59934103aa08d6af8fcb68403d9070f95d77f2e249f3e32d63558e1b    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
e274dd79eab5dcea9fb5c95495c38dad990313ad05f80bd3b77466be41b38258    k8s.gcr.io/pause:3.1                                                                                                           io.containerd.runc.v1
fbfee37fb3585d3b25f482347187cb03d5f36c420be498b201aaecd73c1520f9    docker.io/calico/kube-controllers:v3.16.3                                                                                      io.containerd.runc.v1

What is the output of systemctl status containerd?

Yes, it's running:

● containerd.service - containerd container runtime
   Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: enabled)
   Active: active (running) since Tue 2021-02-09 00:19:08 UTC; 23h ago
     Docs: https://containerd.io
  Process: 29726 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
 Main PID: 29727 (containerd)
    Tasks: 207 (limit: 4915)
   CGroup: /system.slice/containerd.service
           ├─ 2800 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id e274dd79eab5dcea9fb5c95495c38dad990313ad05f80bd3b77466be41b38258 -address /run/containerd/containerd.sock
           ├─ 2932 /pause
           ├─ 2983 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id cdf608f6717db60f8be71f2f158598da6b9b51293268164eb809f611f447c672 -address /run/containerd/containerd.sock
           ├─ 3006 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=cl-gpu-md-0-f4gm6
           ├─ 3513 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 7cea830b8de57b3a485226f34b00da3f962d17cc9bd58ea633cdae458cb90a6e -address /run/containerd/containerd.sock
           ├─ 3534 /pause
           ├─ 4072 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 2c962e34b5a7640e5ff6493590f304c137e38b2ffe315ae3f4d4f71eaf8285d1 -address /run/containerd/containerd.sock
           ├─ 4092 /usr/local/bin/runsvdir -P /etc/service/enabled
           ├─ 4188 runsv allocate-tunnel-addrs
           ├─ 4189 runsv monitor-addresses
           ├─ 4190 runsv felix
           ├─ 4191 calico-node -allocate-tunnel-addrs
           ├─ 4192 calico-node -monitor-addresses
           ├─ 4193 calico-node -felix
           ├─ 4488 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 501874f86dddf1f2a3900c2c80d14e5ce437102bc33bd865a56ab997a6165a2a -address /run/containerd/containerd.sock
           ├─ 4516 /pause
           ├─ 4546 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id d4500f6b59934103aa08d6af8fcb68403d9070f95d77f2e249f3e32d63558e1b -address /run/containerd/containerd.sock
           ├─ 4579 /pause
           ├─ 4632 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 0b2131ab528b94f15d6daf1101f86cffd53e53802916d1d8b99ef303676779fb -address /run/containerd/containerd.sock
           ├─ 4652 /coredns -conf /etc/coredns/Corefile
           ├─ 4721 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id fbfee37fb3585d3b25f482347187cb03d5f36c420be498b201aaecd73c1520f9 -address /run/containerd/containerd.sock
           ├─ 4742 /usr/bin/kube-controllers
           ├─21665 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id b8b546f9ce55fc75a692821cf214c01c8e147d075d6dfbcc98d842b2c1606a83 -address /run/containerd/containerd.sock
           ├─21688 /pause
           ├─21731 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 55f38f18be83d331158b572e0035466c9f55b5cea76b74b6029ea8fa04d2bb90 -address /run/containerd/containerd.sock
           ├─21758 nfd-worker --sleep-interval=60s --server=gpu-operator-1612829878-node-feature-discovery:8080
           ├─21949 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 621e71ef58084673083e627afe09e583dfa0df7c332b06eb6106bb97c19677fb -address /run/containerd/containerd.sock
           ├─21974 /pause
           ├─22024 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 571dabfbebfb21ddc06c6a4bbbf8c43646b9ecb3237e33e059f939fd5ffb4205 -address /run/containerd/containerd.sock
           ├─22047 /bin/bash /usr/local/bin/nvidia-driver init
           ├─22508 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 2af96a3a99810f54b1fbf2ac464a4f155026bb3b499e18ac025aa3c14ec9330c -address /run/containerd/containerd.sock
           ├─22532 /pause
           ├─29365 nvidia-persistenced --persistence-mode
           ├─29429 sleep infinity
           ├─29606 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 423b9983540ae40d5ccd11194d9da003717dd6a449c64bb511b0652b6de70c36 -address /run/containerd/containerd.sock
           ├─29627 nvidia-toolkit /usr/local/nvidia
           ├─29727 /usr/local/bin/containerd
           ├─29967 /usr/local/bin/containerd-shim-runc-v1 -namespace k8s.io -id 69026289a603e5e695d90e8787d93f4e3df2bb73615a95a948cf731c96aa92d5 -address /run/containerd/containerd.sock
           └─29989 /pause

@klueska
Contributor

klueska commented Feb 10, 2021

Looking at your initial logs again, it seems that the ImageInspectError was just a transient error as the container was started. The real issue is that the container seems to be in a crash loop once the image is successfully pulled and the container is running.

What do the container logs say:

kubectl logs -n gpu-operator-resources   nvidia-device-plugin-daemonset-cgc57 -c toolkit-validation

@shysank
Author

shysank commented Feb 10, 2021

Ah, I see this error: Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)! [Vector addition of 50000 elements]

The driver version in ClusterPolicy is 450.80.02, and the toolkit version is 1.4.4-ubuntu18.04. Not sure how to map the toolkit to a runtime version.

@shysank
Author

shysank commented Feb 10, 2021

@shivamerla Am I missing something here? Or is there a workaround for this?

@shysank
Author

shysank commented Feb 11, 2021

I hacked around it by removing the toolkit-validation init container, tried deploying the device plugin daemonset again, and got the following error:

2021/02/11 01:32:29 Loading NVML
2021/02/11 01:32:29 Failed to initialize NVML: could not load NVML library.
2021/02/11 01:32:29 If this is a GPU node, did you set the docker default runtime to `nvidia`?
2021/02/11 01:32:29 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2021/02/11 01:32:29 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2021/02/11 01:32:29 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes

I was looking into NVIDIA/k8s-device-plugin to see if there is any configuration for containerd, but couldn't find anything.

/cc @klueska

@shysank changed the title from "ImageInspectError while running nvidia-device-plugin-daemonset init container" to "nvidia-device-plugin-daemonset toolkit validation fails with containerd" on Feb 11, 2021
@shivamerla
Contributor

@shysank somehow the libnvidia libraries are not getting injected into the device-plugin pod. With the nvidia-container-runtime set up, this should work if the drivers are successfully loaded. We didn't see this in internal testing with containerd. Can you try to validate on a different system to confirm whether it's persistent? You can also try un-installing the chart, ensuring the container runtime is set back to runc, restarting containerd, and installing the gpu-operator again. We will try to reproduce this internally to debug better.

@shysank
Author

shysank commented Feb 11, 2021

Can you try to validate on a different system to confirm if its persistent?

Sure, I'll try it on a different kind of VM to confirm.

You can also try, un-installing the chart, ensure container-runtime is set back to runc, restart containerd, and install gpu-operator again.

Do you mean setting --set operator.defaultRuntime=runc?

@shivamerla
Contributor

Do you mean setting --set operator.defaultRuntime=runc?

No, after the operator is un-installed, the toolkit will reset it back to runc in config.toml. You can confirm this happened and restart containerd just to be sure it takes effect.
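
A quick sanity check along those lines (a sketch, using the config path quoted earlier in this thread):

# confirm the default runtime section no longer points at the nvidia runtime
grep -n -A3 default_runtime /etc/containerd/config.toml
# restart containerd so the reverted config takes effect
sudo systemctl restart containerd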

@shysank
Author

shysank commented Feb 11, 2021

Got it, Thanks!

@shysank
Author

shysank commented Feb 11, 2021

The issue is because of a new property called default_runtime_name added in containerd 1.3. Since this is not set to nvidia, it defaults to runc. I tried manually changing it to nvidia (see the sketch below), and it worked. I've opened a PR for the same. wdyt?
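
For reference, the manual change that worked for me looked roughly like this in /etc/containerd/config.toml (a sketch against the v1-style config quoted earlier; the PR may express it differently):

[plugins.cri.containerd]
  # point the CRI plugin's default runtime at the nvidia runtime already defined below
  default_runtime_name = "nvidia"

  [plugins.cri.containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runtime.v1.linux"

    [plugins.cri.containerd.runtimes.nvidia.options]
      Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"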

/cc @shivamerla @klueska

@klueska
Contributor

klueska commented Feb 12, 2021

The operator supports both v1 and v2 configs for containerd. The v1 config is only technically required for containerd 1.2, while either v1 or v2 should be usable by containerd 1.3+.

As some background, the default_runtime_name parameter is not technically part of the v1 config spec, and (if set) definitely causes issues when using the operator with containerd version 1.2. Interestingly, however, it seems to be OK when used in a v1 config with containerd 1.3+, even though it's not technically part of the v1 spec.

For the v1 spec, you are supposed to use the default_runtime section to specify what the default runtime should be, which is what we do here: https://github.com/NVIDIA/container-config/blob/master/src/containerd.go#L320

For the v2 spec, the default_runtime_name parameter you mention was introduced to avoid the need for a separate default_runtime section (which for v1 often just ended up duplicating some other runtime section). We set the default_runtime_name for v2 configs here: https://github.com/NVIDIA/container-config/blob/master/src/containerd.go#L481

In your particular situation, you seem to be running a version of containerd which supports the v1 spec, but somehow requires the setting of default_runtime_name in the v1 spec (which technically shouldn't be necessary or allowed).

In your PR you seem to add support for default_runtime_name to our handling of the v1 spec (which can't be done, because it will break support for containerd 1.2), and you duplicate the logic we already have for default_runtime_name in the v2 spec (making the change unnecessary).

In general, the reason you are seeing the v1 spec at all (even though you are on a newer version of containerd which supports the v2 spec) is that we fall back to writing a config from the v1 spec in situations where no previous config existed. In theory, all supported versions of containerd (i.e. 1.2, 1.3, and 1.4) should be able to parse and apply a v1 config even if they also support v2. It seems that your version doesn't do this as expected, though.

As such, I think the right fix is to add logic in the toolkit container to actually inspect the containerd version and explicitly only apply the v1 config for containerd 1.2, and the v2 config for 1.3 and 1.4 (unless of course a v1 config is already in place which I expect to be rare).
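
To make the two shapes concrete, the default runtime ends up being expressed roughly as follows in each spec (a sketch based on the handling linked above; the toolkit container may write additional fields):

# v1 spec (containerd 1.2): a dedicated default_runtime section
[plugins.cri.containerd.default_runtime]
  runtime_type = "io.containerd.runtime.v1.linux"
  [plugins.cri.containerd.default_runtime.options]
    Runtime = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

# v2 spec (containerd 1.3+): name a runtime from the runtimes table as the default
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"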

@shysank
Author

shysank commented Feb 16, 2021

Thanks @klueska for the explanation!

As such, I think the right fix is to add logic in the toolkit container to actually inspect the containerd version and explicitly only apply the v1 config for containerd 1.2, and the v2 config for 1.3 and 1.4 (unless of course a v1 config is already in place which I expect to be rare).

I'll try to submit a PR for this.

@shysank
Author

shysank commented Feb 22, 2021

@shivamerla is there an approximate timeline for when we'll get a new version of the toolkit?

@shivamerla
Contributor

@shysank will double check with @klueska and confirm on this.

@klueska
Contributor

klueska commented Feb 25, 2021

@shivamerla I believe it is published now in v1.4.5 of the toolkit-container (and included in the new 1.6.0 version of the operator)

@elezar
Member

elezar commented Feb 25, 2021

@klueska @shysank @shivamerla yes, these changes should be available in v1.4.5 of the toolkit container through v1.6.0 of the operator.

@shysank please check this when you get a chance and close the ticket accordingly.

@shysank
Author

shysank commented Feb 25, 2021

I tested it with 1.6.0, and it still fails. The containerd config is now written as a v2 config, but we forgot to set version = 2 when the config file is empty, which means the config will not be parsed correctly. Submitted a fix for this, PTAL.

@shivamerla
Contributor

Fixed it with 1.6.1

@shysank
Author

shysank commented Mar 8, 2021

@shivamerla I just tested it with 1.6.1, and ran into another issue related to the fix:
containerd[32689]: containerd: toml: cannot load TOML value of type string into a Go integer

I think version = "2" should be version = 2. Do you want me to create a new issue for this one?
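
Concretely, my understanding is that the top of the generated v2 config needs an integer version, something like this (a sketch, not the toolkit's exact output):

# the version key must be a TOML integer, not a string
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"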

/cc @klueska @elezar

@gocpplua

gocpplua commented Dec 6, 2022

This issue has been resolved (in my case, with Docker as the container runtime):

  1. Edit /etc/docker/daemon.json and add "default-runtime": "nvidia" (see the sketch below)
  2. systemctl restart docker

Then verify:
kubectl describe node -A | grep nvidia
nvidia.com/gpu: 4
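
For anyone applying this Docker-based workaround, the resulting /etc/docker/daemon.json would look roughly like this (a sketch; it assumes nvidia-container-runtime is already installed and on the PATH):

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}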
