pods with runtimeClassName: nvidia get killed after a few seconds #7130

Closed
maaft opened this issue Mar 21, 2023 · 23 comments
@maaft

maaft commented Mar 21, 2023

Environmental Info:
Tested K3s Versions: v1.25.6 and v1.25.7

k3s version v1.25.6+k3s1 (9176e03c)
go version go1.19.5

Node(s) CPU architecture, OS, and Version:
Linux agent 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:
3 servers, 1 agent, connected via wireguard mesh

kubectl get nodes -o wide

NAME              STATUS     ROLES                       AGE   VERSION        INTERNAL-IP       EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control-plane-1   Ready      control-plane,etcd,master   38m   v1.25.6+k3s1   100.121.166.220   100.121.166.220   Ubuntu 22.04.2 LTS   5.15.0-56-generic   containerd://1.6.15-k3s1
control-plane-2   Ready      control-plane,etcd,master   37m   v1.25.6+k3s1   100.121.227.93    100.121.227.93    Ubuntu 22.04.2 LTS   5.15.0-56-generic   containerd://1.6.15-k3s1
control-plane-3   Ready      control-plane,etcd,master   36m   v1.25.6+k3s1   100.121.146.219   100.121.146.219   Ubuntu 22.04.2 LTS   5.15.0-56-generic   containerd://1.6.15-k3s1
agent             Ready      <none>                      32m   v1.25.6+k3s1   100.121.42.124    100.121.42.124    Ubuntu 22.04.2 LTS   5.15.0-67-generic   containerd://1.6.15-k3s1

Describe the bug:
I've installed k3s-agent on a machine that has GPUs built-in. I followed all the steps outlined here.

I also installed nvidia-device-plugin with:

---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

kubectl -n kube-system logs nvidia-device-plugin-daemonset-xtt9q

2023/03/21 11:52:11 Starting FS watcher.
2023/03/21 11:52:11 Starting OS watcher.
2023/03/21 11:52:11 Starting Plugins.
2023/03/21 11:52:11 Loading configuration.
2023/03/21 11:52:11 Updating config with default resource matching patterns.
2023/03/21 11:52:11 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "uuid"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2023/03/21 11:52:11 Retreiving plugins.
2023/03/21 11:52:11 Detected NVML platform: found NVML library
2023/03/21 11:52:11 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2023/03/21 11:52:17 Starting GRPC server for 'nvidia.com/gpu'
2023/03/21 11:52:17 Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
2023/03/21 11:52:17 Registered device plugin for 'nvidia.com/gpu' with Kubelet

The nvidia-device-plugin is running correctly without any restarts.
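
For reference, a quick way to double-check that the device plugin registered the GPU and that the node advertises it as allocatable (a minimal sketch; the node name agent comes from the table above and the exact grep pattern is only illustrative):

kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset
kubectl describe node agent | grep -i "nvidia.com/gpu"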

I'm able to create nvidia-enabled pods that correctly report the GPUs:

apiVersion: v1
kind: Pod
metadata:
 name: nvidia-smi
spec:
 runtimeClassName: "nvidia"
 containers:
    - name: nvidia-smi
      image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
      command:
        - "nvidia-smi"
      resources:
        requests:
          nvidia.com/gpu: "1"
        limits:
          nvidia.com/gpu: "1"

The output is as expected (kubectl logs nvidia-smi):

Tue Mar 21 12:03:05 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:40:00.0 Off |                  N/A |
| 27%   29C    P8     8W / 250W |      0MiB / 11264MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here comes the weird part:

Let's now run e.g. nginx with runtimeClassName: nvidia:

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  namespace: denkjobs
spec:
  restartPolicy: "Never"
  containers:
  - name: nginx
    image: "nginx"
    imagePullPolicy: IfNotPresent
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
  runtimeClassName: nvidia

After 0-30 seconds the pod gets killed:

NAME         READY   STATUS      RESTARTS   AGE
nginx        0/1     Error       0          13m

or

NAME         READY   STATUS      RESTARTS   AGE
nginx        0/1     Completed       0          13m
Name:                nginx
Namespace:           denkjobs
Priority:            0
Runtime Class Name:  nvidia
Service Account:     default
Node:                agent/100.121.42.124
Start Time:          Tue, 21 Mar 2023 12:56:25 +0100
Labels:              <none>
Annotations:         <none>
Status:              Failed
IP:                  10.42.3.7
IPs:
  IP:  10.42.3.7
Containers:
  nginx:
    Container ID:  containerd://084bd9cb01e8de9c2e9e193f6c511be492de489a2a51ae7dcecc8f634219f672
    Image:         nginx
    Image ID:      docker.io/library/nginx@sha256:aa0afebbb3cfa473099a62c4b32e9b3fb73ed23f2a75a65ce1d4b4f55a5c2ef2
    Port:          <none>
    Host Port:     <none>
    Command:
      sleep
    Args:
      infinity
    State:          Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Tue, 21 Mar 2023 12:56:32 +0100
      Finished:     Tue, 21 Mar 2023 12:57:03 +0100
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        500m
      memory:     512Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-rdwhv (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-rdwhv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  13m   default-scheduler  Successfully assigned denkjobs/nginx to denkmaschine3
  Normal  Pulling    13m   kubelet            Pulling image "nginx"
  Normal  Pulled     13m   kubelet            Successfully pulled image "nginx" in 6.850461587s (6.850472207s including waiting)
  Normal  Created    13m   kubelet            Created container nginx
  Normal  Started    13m   kubelet            Started container nginx
  Normal  Killing    13m   kubelet            Stopping container nginx

The same pod without runtimeClassName: nvidia keeps running:

NAME    READY   STATUS    RESTARTS   AGE
nginx   1/1     Running   0          2m

Additional Info

nvidia-enabled containers run directly with ctr run --gpus 0 ... are not getting killed
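
For context, a hedged sketch of that kind of direct containerd test (reusing the CUDA image from the pod specs above; the container ID gpu-test is illustrative, and on k3s the bundled client is usually invoked as k3s ctr):

sudo k3s ctr image pull docker.io/nvidia/cuda:12.1.0-base-ubuntu18.04
sudo k3s ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.1.0-base-ubuntu18.04 gpu-test nvidia-smi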

journalctl -fu k3s-agent during the relevant lifetime of the nginx pod:

Mar 21 12:13:31 agent k3s[185245]: I0321 12:13:31.877137  185245 topology_manager.go:205] "Topology Admit Handler"
Mar 21 12:13:31 agent k3s[185245]: E0321 12:13:31.877228  185245 cpu_manager.go:394] "RemoveStaleState: removing container" podUID="6e7c7c7f-2901-4bdc-a9dc-25efa296db46" containerName="nvidia-smi"
Mar 21 12:13:31 agent k3s[185245]: I0321 12:13:31.877275  185245 memory_manager.go:345] "RemoveStaleState removing state" podUID="6e7c7c7f-2901-4bdc-a9dc-25efa296db46" containerName="nvidia-smi"
Mar 21 12:13:31 agent k3s[185245]: I0321 12:13:31.964635  185245 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-r8bqx\" (UniqueName: \"kubernetes.io/projected/b6968b99-f11d-40f6-b483-8132f90748e4-kube-api-access-r8bqx\") pod \"nginx\" (UID: \"b6968b99-f11d-40f6-b483-8132f90748e4\") " pod="denkjobs/nginx"
Mar 21 12:13:34 agent k3s[185245]: I0321 12:13:34.399208  185245 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="205f3c0abcfcfd53bf62da65ac80ed05acf882ca77f11147d4134e50cfaa17c0"
Mar 21 12:13:35 agent k3s[185245]: I0321 12:13:35.489268  185245 reconciler.go:211] "operationExecutor.UnmountVolume started for volume \"kube-api-access-r8bqx\" (UniqueName: \"kubernetes.io/projected/b6968b99-f11d-40f6-b483-8132f90748e4-kube-api-access-r8bqx\") pod \"b6968b99-f11d-40f6-b483-8132f90748e4\" (UID: \"b6968b99-f11d-40f6-b483-8132f90748e4\") "
Mar 21 12:13:35 agent k3s[185245]: I0321 12:13:35.491411  185245 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/b6968b99-f11d-40f6-b483-8132f90748e4-kube-api-access-r8bqx" (OuterVolumeSpecName: "kube-api-access-r8bqx") pod "b6968b99-f11d-40f6-b483-8132f90748e4" (UID: "b6968b99-f11d-40f6-b483-8132f90748e4"). InnerVolumeSpecName "kube-api-access-r8bqx". PluginName "kubernetes.io/projected", VolumeGidValue ""
Mar 21 12:13:35 agent k3s[185245]: I0321 12:13:35.589435  185245 reconciler.go:399] "Volume detached for volume \"kube-api-access-r8bqx\" (UniqueName: \"kubernetes.io/projected/b6968b99-f11d-40f6-b483-8132f90748e4-kube-api-access-r8bqx\") on node \"agent\" DevicePath \"\""
Mar 21 12:13:40 agent k3s[185245]: I0321 12:13:40.591952  185245 scope.go:115] "RemoveContainer" containerID="084bd9cb01e8de9c2e9e193f6c511be492de489a2a51ae7dcecc8f634219f672"

cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

version = 2

[plugins."io.containerd.internal.v1.opt"]
  path = "/var/lib/rancher/k3s/agent/containerd"
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"
  enable_selinux = false
  enable_unprivileged_ports = true
  enable_unprivileged_icmp = true
  sandbox_image = "rancher/mirrored-pause:3.6"

[plugins."io.containerd.grpc.v1.cri".containerd]
  snapshotter = "overlayfs"
  disable_snapshot_annotations = true


[plugins."io.containerd.grpc.v1.cri".cni]
  bin_dir = "/var/lib/rancher/k3s/data/630c40ff866a3db218a952ebd4fd2a5cfe1543a1a467e738cb46a2ad4012d6f1/bin"
  conf_dir = "/var/lib/rancher/k3s/agent/etc/cni/net.d"


[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"

/var/log/syslog after creating the nginx pod with nvidia runtime class

  • Here you can see that the pod is killed immediately after startup:
Mar 21 13:23:23 agent k3s[1541]: I0321 13:23:23.125548    1541 topology_manager.go:205] "Topology Admit Handler"
Mar 21 13:23:23 agent k3s[1541]: E0321 13:23:23.125647    1541 cpu_manager.go:394] "RemoveStaleState: removing container" podUID="669c3b19-2aa4-4ee0-b9b6-0f1de6214ddd" containerName="nginx"
Mar 21 13:23:23 agent k3s[1541]: I0321 13:23:23.125712    1541 memory_manager.go:345] "RemoveStaleState removing state" podUID="669c3b19-2aa4-4ee0-b9b6-0f1de6214ddd" containerName="nginx"
Mar 21 13:23:23 agent systemd[1]: Created slice libcontainer container kubepods-burstable-pod6972d491_5270_4f99_a28a_691f0d6f90bf.slice.
Mar 21 13:23:23 agent k3s[1541]: I0321 13:23:23.177399    1541 reconciler.go:357] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kube-api-access-mcf9r\" (UniqueName: \"kubernetes.io/projected/6972d491-5270-4f99-a28a-691f0d6f90bf-kube-api-access-mcf9r\") pod \"nginx\" (UID: \"6972d491-5270-4f99-a28a-691f0d6f90bf\") " pod="denkjobs/nginx"
Mar 21 13:23:23 agent systemd-udevd[8856]: Using default interface naming scheme 'v249'.
Mar 21 13:23:23 agent systemd-networkd[1364]: vethb9798904: Link UP
Mar 21 13:23:23 agent networkd-dispatcher[1392]: WARNING:Unknown index 17 seen, reloading interface list
Mar 21 13:23:23 agent kernel: [ 3513.105032] cni0: port 3(vethb9798904) entered blocking state
Mar 21 13:23:23 agent kernel: [ 3513.105037] cni0: port 3(vethb9798904) entered disabled state
Mar 21 13:23:23 agent kernel: [ 3513.105107] device vethb9798904 entered promiscuous mode
Mar 21 13:23:23 agent systemd-networkd[1364]: vethb9798904: Gained carrier
Mar 21 13:23:23 agent kernel: [ 3513.109075] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Mar 21 13:23:23 agent kernel: [ 3513.109089] IPv6: ADDRCONF(NETDEV_CHANGE): vethb9798904: link becomes ready
Mar 21 13:23:23 agent kernel: [ 3513.109092] cni0: port 3(vethb9798904) entered blocking state
Mar 21 13:23:23 agent kernel: [ 3513.109095] cni0: port 3(vethb9798904) entered forwarding state
Mar 21 13:23:24 agent systemd-networkd[1364]: vethb9798904: Link DOWN
Mar 21 13:23:24 agent systemd-networkd[1364]: vethb9798904: Lost carrier
Mar 21 13:23:24 agent kernel: [ 3513.834318] cni0: port 3(vethb9798904) entered disabled state
Mar 21 13:23:24 agent kernel: [ 3513.835028] device vethb9798904 left promiscuous mode
Mar 21 13:23:24 agent kernel: [ 3513.835032] cni0: port 3(vethb9798904) entered disabled state
Mar 21 13:23:24 agent systemd[1]: run-k3s-containerd-io.containerd.runtime.v2.task-k8s.io-04265b8965b24c6b11467e691d58b531f3c7d42a4f7aaeef51726464cda1e822-rootfs.mount: Deactivated successfully.
Mar 21 13:23:24 agent systemd[1]: run-k3s-containerd-io.containerd.grpc.v1.cri-sandboxes-04265b8965b24c6b11467e691d58b531f3c7d42a4f7aaeef51726464cda1e822-shm.mount: Deactivated successfully.
Mar 21 13:23:24 agent systemd[1]: run-netns-cni\x2dff2367b4\x2dd457\x2d349f\x2dc460\x2dbbdac8f453d9.mount: Deactivated successfully.
Mar 21 13:23:24 agent k3s[1541]: I0321 13:23:24.863403    1541 pod_container_deletor.go:79] "Container not found in pod's containers" containerID="04265b8965b24c6b11467e691d58b531f3c7d42a4f7aaeef51726464cda1e822"
Mar 21 13:23:25 agent k3s[1541]: I0321 13:23:25.998077    1541 reconciler.go:211] "operationExecutor.UnmountVolume started for volume \"kube-api-access-mcf9r\" (UniqueName: \"kubernetes.io/projected/6972d491-5270-4f99-a28a-691f0d6f90bf-kube-api-access-mcf9r\") pod \"6972d491-5270-4f99-a28a-691f0d6f90bf\" (UID: \"6972d491-5270-4f99-a28a-691f0d6f90bf\") "
Mar 21 13:23:26 agent k3s[1541]: I0321 13:23:26.000564    1541 operation_generator.go:890] UnmountVolume.TearDown succeeded for volume "kubernetes.io/projected/6972d491-5270-4f99-a28a-691f0d6f90bf-kube-api-access-mcf9r" (OuterVolumeSpecName: "kube-api-access-mcf9r") pod "6972d491-5270-4f99-a28a-691f0d6f90bf" (UID: "6972d491-5270-4f99-a28a-691f0d6f90bf"). InnerVolumeSpecName "kube-api-access-mcf9r". PluginName "kubernetes.io/projected", VolumeGidValue ""
Mar 21 13:23:26 agent systemd[1]: var-lib-kubelet-pods-6972d491\x2d5270\x2d4f99\x2da28a\x2d691f0d6f90bf-volumes-kubernetes.io\x7eprojected-kube\x2dapi\x2daccess\x2dmcf9r.mount: Deactivated successfully.
Mar 21 13:23:26 agent k3s[1541]: I0321 13:23:26.099100    1541 reconciler.go:399] "Volume detached for volume \"kube-api-access-mcf9r\" (UniqueName: \"kubernetes.io/projected/6972d491-5270-4f99-a28a-691f0d6f90bf-kube-api-access-mcf9r\") on node \"agent\" DevicePath \"\""
Mar 21 13:23:26 agent systemd[1]: Removed slice libcontainer container kubepods-burstable-pod6972d491_5270_4f99_a28a_691f0d6f90bf.slice.

Does anyone have any ideas on how to debug this further?

What is killing my pods?

I'm completely lost...

@brandond
Member

brandond commented Mar 21, 2023

Something is killing it, but I'm not sure what. You might want to ask in the upstream nvidia device plugin projects.

Just out of curiosity, why are you trying to run things like nginx that don't need GPUs with the nvidia container runtime?

@maaft
Author

maaft commented Mar 21, 2023

nginx is just a small dummy container, for easy reproduction of this issue. It could be any other image.

@brandond
Member

Does it still get killed if you put GPU resources in the pod spec? I suspect one of the nvidia operators is doing something.

@maaft
Author

maaft commented Mar 22, 2023

Yes, I never started an nvidia-runtime pod without GPU requests, but that might also be interesting to try. I'll also repost in the nvidia-device-plugin repo.

@maaft
Author

maaft commented Mar 22, 2023

I tried again with a single server / single node setup on my GPU machine:

NAME              STATUS   ROLES                       AGE     VERSION        INTERNAL-IP       EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
agent             Ready    control-plane,etcd,master   4m24s   v1.25.7+k3s1   100.121.42.124    100.121.42.124    Ubuntu 22.04.2 LTS   5.15.0-67-generic   containerd://1.6.15-k3s1

Now the test pod is not being killed.

Then I tried with 1 control-plane node and 1 agent node:

NAME              STATUS   ROLES                       AGE     VERSION        INTERNAL-IP       EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control-plane-1   Ready    control-plane,etcd,master   7m34s   v1.25.7+k3s1   100.121.166.220   100.121.166.220   Ubuntu 22.04.2 LTS   5.15.0-56-generic   containerd://1.6.15-k3s1
agent             Ready    <none>                      4m24s   v1.25.7+k3s1   100.121.42.124    100.121.42.124    Ubuntu 22.04.2 LTS   5.15.0-67-generic   containerd://1.6.15-k3s1

With an additional control-plane node, the test pod is getting killed again.

Might this be some network issue? Both machines see each other perfectly through the wireguard mesh.

Any ideas on how to debug networking/etcd issues further?

@maaft
Author

maaft commented Mar 22, 2023

Okay, I think I found the bug(?). I'm not sure who's responsible (CRI, k3s, or nvidia-device-plugin), but I think it's on k3s's side, given the dependency on the cluster setup.

tl;dr: The error occurs whenever resource requests and limits differ, e.g. requests.cpu = 1 and limits.cpu = 2 -> the pod gets killed. It doesn't matter whether an nvidia GPU is also requested.

Steps to reproduce

  1. nodes needed: 1 node with GPU, 1 node without
  2. install k3s server on the node without GPU
  3. install k3s agent on the node with GPU
  4. install nvidia-device-plugin
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
  5. verify that the GPU is allocatable with kubectl describe node agent
  6. create the following test pod and observe that it is killed after roughly 5 seconds (the grace period)
apiVersion: v1
kind: Pod
metadata:
 name: test
spec:
  restartPolicy: "Never"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 5
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
    command:
      - "sleep"
    args:
      - "infinity"
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
  7. create the following pod and observe that it is not killed and runs indefinitely
apiVersion: v1
kind: Pod
metadata:
 name: test2
spec:
  restartPolicy: "Never"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 5
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
    command:
      - "sleep"
    args:
      - "infinity"
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi

@brandond can you try to reproduce? That would be great!

@brandond
Member

brandond commented Mar 22, 2023

The error is caused when any resource limits and requests differ. E.g. requests.cpu = 1 and limits.cpu = 2 -> pod gets killed.

Are you perhaps low on resources? It sounds like the pod is getting preempted. Do you have more cores requested than your node actually has available?

@maaft
Author

maaft commented Mar 22, 2023

No, I'm certainly not. My node has 40 cores, 256 GiB RAM, and 8 RTX 2080Ti GPUs. The mentioned test pods are the only "workload" running.

Also keep in mind that this only happens with the nvidia runtime; the normal runtime is fine.

@brandond
Member

This isn't something that k3s or containerd would do on its own; I suspect something in the nvidia stack is finding and terminating the pod based on the requested resources. I think you'll need to track it down with those projects; I don't have any idea where to start with that.

@caroline-suse-rancher
Contributor

Closing, since it appears that this is an issue that should first be investigated with NVIDIA.

@sidewinder12s

sidewinder12s commented Sep 27, 2023

@brandond I think this might be a k3s issue; I was also running into the same thing. Adding SystemdCgroup = true to the nvidia runtime's runc options seemed to resolve it, though we were similarly unable to find any logs showing what was sending the containers SIGTERMs or why. (At least in theory we could see why we'd get really strange behavior if the kubelet and parts of the containerd/systemd units were running in one cgroup hierarchy while the containers were not.)

NVIDIA/nvidia-container-toolkit#28 (comment)

(I will add that we're still on 1.23, Ubuntu 22.04, so potentially this is not an issue on a newer release, but not having that set on the runtime seems to run counter to upstream Kubernetes guidance.)
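
For reference, a minimal sketch of that change applied to the nvidia runtime block from the config.toml shown earlier in this issue (in k3s this would normally go through a config.toml.tmpl override rather than editing the generated file directly):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true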

@brandond brandond reopened this Sep 27, 2023
@brandond brandond moved this from Done Issue to Peer Review in K3s Development Sep 27, 2023
@brandond brandond self-assigned this Sep 27, 2023
@brandond brandond added this to the v1.28.3+k3s1 milestone Sep 27, 2023
@sidewinder12s

sidewinder12s commented Sep 27, 2023

And to add further mystery, this is the containerd configuration that the nvidia-container-toolkit produces according to their configuration instructions for containerd:

disabled_plugins = ["cri"]
version = 1

[plugins]

  [plugins.cri]

    [plugins.cri.containerd]

      [plugins.cri.containerd.runtimes]

        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"

          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "/usr/bin/nvidia-container-runtime"

ref: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#configuring-containerd

@brandond
Member

brandond commented Sep 27, 2023

Yeah, that's the very-deprecated version 1 containerd config schema. We're on version 2, which is also deprecated.

Also, that has the CRI plugin disabled, which won't work with Kubernetes since it requires CRI. I suspect those docs are very out of date.
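
For comparison, a hedged sketch of roughly how that same runtime entry would look in the version 2 schema (mirroring the k3s-generated config shown earlier; this is not an official NVIDIA example):

version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"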

@sidewinder12s

Excellent

Otherwise, I'm having trouble finding a good reference for how nvidia-container-toolkit actually expects containerd to be configured beyond that command.

@brandond
Member

might need to see what the operator does to the containerd config on a kubeadm-based cluster?

@sidewinder12s

sidewinder12s commented Sep 27, 2023

might need to see what the operator does to the containerd config on a kubeadm-based cluster?

I at least do not have that available, we have manual nodes with multi-GPU/MIG and nvidia-device-plugin.

This is one of the concerns I'd found around enabling SystemdCgroup. (It turns out we had previously enabled the udev workaround in this environment for docker.)

NVIDIA/nvidia-docker#1730

@brandond
Member

Hmm. Well, if you're up for it, can you try installing k3s with INSTALL_K3S_COMMIT=5fea74a78b7ce172fd6e6e760f63c57add7c479f and see if nvidia pods break when you run systemctl daemon-reload?

I'm not sure that would prevent me from merging this change though: breaking when systemd is reloaded, vs. being not usable at all because pods are constantly killed, seems like a net improvement.
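
A minimal sketch of that test sequence, assuming an agent installed via the standard get.k3s.io script (the commit hash is the one given above; the final watch command is only illustrative):

curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=5fea74a78b7ce172fd6e6e760f63c57add7c479f sh -
sudo systemctl daemon-reload
kubectl get pods -A -w   # watch whether nvidia-runtime pods get killed after the reload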

@sidewinder12s

We manually overrode the containerd config on 1.23 with the param from your commit (since I think you didn't backport it all the way to 1.23), but we planned on testing the systemctl daemon-reload as well.

@sidewinder12s

With 1.23, the latest nvidia-container-toolkit, and the udev workaround mentioned in that issue, we don't appear to have issues running daemon-reload with that nvidia-smi loop sample pod while SystemdCgroup is enabled.

@maaft
Author

maaft commented Sep 28, 2023

@brandond I think this might be a k3s issue; I was also running into the same thing. Adding SystemdCgroup = true to the nvidia runtime's runc options seemed to resolve it, though we were similarly unable to find any logs showing what was sending the containers SIGTERMs or why. (At least in theory we could see why we'd get really strange behavior if the kubelet and parts of the containerd/systemd units were running in one cgroup hierarchy while the containers were not.)

NVIDIA/nvidia-container-toolkit#28 (comment)

(I will add that we're still on 1.23, Ubuntu 22.04, so potentially this is not an issue on a newer release, but not having that set on the runtime seems to run counter to upstream Kubernetes guidance.)

Hi @sidewinder12s

After I had these issues, I found out that GPU pods were not killed when I used Ubuntu 20.04 as the OS. As soon as 22.04 was used (same config otherwise), pods were killed again.

@brandond I hope that this helps a bit in resolving the issue.

@VestigeJ VestigeJ self-assigned this Oct 4, 2023
@VestigeJ

VestigeJ commented Oct 6, 2023

Environment Details
I am unable to reproduce this currently 😢

My main option for testing this is AWS:
ami-09a839986e2dd68cd
p3.2xlarge, which uses a V100 GPU - there seems to be missing support on Nvidia's side now for the K80? This might be wrong info.

VERSION=v1.25.7+k3s1

Infrastructure

  • Cloud

Node(s) CPU architecture, OS, and version:

Linux 5.15.0-1019-aws x86_64 GNU/Linux
PRETTY_NAME="Ubuntu 22.04.1 LTS"

Cluster Configuration:

NAME STATUS ROLES AGE VERSION
ip-1-1-1-64 Ready 38m v1.25.7+k3s1
ip-1-1-2-54 Ready control-plane,etcd,master 40m v1.25.7+k3s1

Config.yaml:

write-kubeconfig-mode: 644
debug: true
token: YOUR_TOKEN_HERE
profile: cis-1.23
selinux: true
cluster-init: true

Reproduction steps:

$ curl https://get.k3s.io --output install-"k3s".sh
$ sudo chmod +x install-"k3s".sh
$ sudo groupadd --system etcd && sudo useradd -s /sbin/nologin --system -g etcd etcd
$ sudo modprobe ip_vs_rr
$ sudo modprobe ip_vs_wrr
$ sudo modprobe ip_vs_sh
$ sudo printf "vm.panic_on_oom=0 \nvm.overcommit_memory=1 \nkernel.panic=10 \nkernel.panic_on_oops=1 \n" > ~/90-kubelet.conf
$ sudo cp 90-kubelet.conf /etc/sysctl.d/ 
$ sudo systemctl restart systemd-sysctl
$ sudo INSTALL_K3S_VERSION=v1.25.7+k3s1 INSTALL_K3S_EXEC=server ./install-k3s.sh
$ get_helm //downloads and sets up helm
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia    && helm repo update
$ helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=false
$ watch -n 2 kg no,po -A
$ kgn //kubectl get nodes
$ kd node ip-1-1-1-64 //observe nvidia detected on the agent node
$ vim killpod.yaml //pod expected to die didn't die
$ vim no-killpod.yaml //pod expected to live continued to live without the runtime class declared
$ k create -f killpod.yaml
$ k create -f no-killpod.yaml
$ watch -n 2 kgp -A
$ vim resnet_k8s.yaml //tried deploying a resnet model for inference 
$ k create -f resnet_k8s.yaml
$ vim gpu-pod.yaml //ended up using the kubernetes test image which runs some vector calculations on the gpu; it ran fine
$ k create -f gpu-pod.yaml
$ watch -n 2 kgp -A
$ k logs pod/gpu-pod
$ vim cuda-pod.yaml
$ k create -f cuda-pod.yaml
$ k logs vec-add-pod
$ get_report //generates this comment template

Results:

$ kgn

NAME              STATUS   ROLES                       AGE   VERSION
ip-1-1-1-64       Ready    <none>                      68m   v1.25.7+k3s1
ip-1-1-2-54       Ready    control-plane,etcd,master   72m   v1.25.7+k3s1

$ kgp -A

NAMESPACE      NAME                                                              READY   STATUS      RESTARTS   AGE
default        gpu-pod                                                           0/1     Completed   0          40m
default        resnet-deployment-5545df5796-2htkv                                1/1     Running     0          48m
default        resnet-deployment-5545df5796-2kznh                                1/1     Running     0          48m
default        resnet-deployment-5545df5796-h8sdm                                1/1     Running     0          48m
default        test                                                              1/1     Running     0          61m
default        test2                                                             1/1     Running     0          60m
default        vec-add-pod                                                       0/1     Completed   0          34m
gpu-operator   gpu-feature-discovery-75qlx                                       1/1     Running     0          68m
gpu-operator   gpu-operator-1696617688-node-feature-discovery-master-5557wjbj5   1/1     Running     0          68m
gpu-operator   gpu-operator-1696617688-node-feature-discovery-worker-7s5qq       1/1     Running     0          68m
gpu-operator   gpu-operator-1696617688-node-feature-discovery-worker-tt8wr       1/1     Running     0          68m
gpu-operator   gpu-operator-54f566b547-fl62v                                     1/1     Running     0          68m
gpu-operator   nvidia-cuda-validator-m9tsl                                       0/1     Completed   0          68m
gpu-operator   nvidia-dcgm-exporter-sr5v6                                        1/1     Running     0          68m
gpu-operator   nvidia-device-plugin-daemonset-t4lw6                              1/1     Running     0          68m
gpu-operator   nvidia-operator-validator-6cmgb                                   1/1     Running     0          68m
kube-system    coredns-597584b69b-2pl5t                                          1/1     Running     0          72m
kube-system    helm-install-traefik-crd-b4blg                                    0/1     Completed   0          72m
kube-system    helm-install-traefik-v678w                                        0/1     Completed   1          72m
kube-system    local-path-provisioner-79f67d76f8-mxzvt                           1/1     Running     0          72m
kube-system    metrics-server-5f9f776df5-27bcd                                   1/1     Running     0          72m
kube-system    svclb-resnet-service-f21f6046-kvrcz                               1/1     Running     0          48m
kube-system    svclb-resnet-service-f21f6046-mjgqd                               1/1     Running     0          48m
kube-system    svclb-traefik-a4ae30a7-nzl92                                      2/2     Running     0          70m
kube-system    svclb-traefik-a4ae30a7-rrjgm                                      2/2     Running     0          72m
kube-system    traefik-66c46d954f-dcxtt                                          1/1     Running     0          72m

$ k logs vec-add-pod

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

This blip does appear very briefly in the agent node's nvidia-smi output, but I wasn't able to capture it in the split second it takes to run.

@rancher-max
Contributor

@maaft @sidewinder12s We were unable to reproduce this in our test environments (see above). However, there was some work done in #8470 that will hopefully fix the issue. If you want to check whether it works for you, you can install k3s from the most recent commit: curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=ba750e28b73ad7c69bc05e01fb30f87fa539762e sh -

We can leave this issue open until you're able to take a look, but we're hoping to resolve and close this by the upcoming October releases. Thank you!

@brandond brandond moved this from Peer Review to To Test in K3s Development Oct 13, 2023
@VestigeJ VestigeJ moved this from To Test to Needs Additional in K3s Development Oct 13, 2023
@caroline-suse-rancher caroline-suse-rancher moved this from Needs Additional to To Test in K3s Development Oct 16, 2023
@caroline-suse-rancher caroline-suse-rancher moved this from To Test to Needs Additional in K3s Development Oct 16, 2023
@caroline-suse-rancher caroline-suse-rancher removed this from the v1.28.3+k3s1 milestone Nov 14, 2023
@rancher-max
Contributor

I am going to close this out, as no news is good news in this case, I take it. Feel free to open a new issue if there are still problems here, but I believe it was fixed.
