tl;dr: The error occurs whenever a pod's resource requests and limits differ, e.g. requests.cpu = 1 and limits.cpu = 2 -> the pod gets killed. It doesn't matter whether the nvidia-gpu resource is actually requested or not.
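For illustration, a minimal sketch of the container resources stanza described above (values are arbitrary examples; note that a pod whose requests and limits differ can never fall into the Guaranteed QoS class):

```yaml
# Triggers the kill: requests and limits differ
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"

# Does not trigger it (per the tl;dr): requests equal limits
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "1"
```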
Steps to reproduce
Nodes needed: one node with a GPU, one node without.

0. Install Ubuntu 22.04 on the GPU node.
1. Install the k3s server on the node without a GPU.
2. Install the k3s agent on the node with a GPU.
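For reference, a minimal sketch of steps 1 and 2 using the k3s quick-start script (server address and token are placeholders to fill in):

```sh
# On the node without a GPU: install the k3s server
curl -sfL https://get.k3s.io | sh -

# On the GPU node: install the k3s agent and join it to the server
# (the join token can be read from /var/lib/rancher/k3s/server/node-token on the server)
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -
```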
3. Install the nvidia-device-plugin:
```yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
4. Verify that the GPU is allocatable with `kubectl describe node agent`.
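For example, the nvidia.com/gpu resource advertised by the device plugin should show up under the node's Capacity/Allocatable:

```sh
# Look for an nvidia.com/gpu entry with a non-zero count
kubectl describe node agent | grep -A 10 Allocatable
```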
5. Create the following test pod and observe that it gets killed after roughly 5 seconds (the grace period).
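The actual test pod manifest isn't included in this report; purely as an illustration of the condition from the tl;dr (differing requests and limits, no GPU requested), a hypothetical reproducer might look like this — pod name, image, and the use of runtimeClassName are assumptions:

```yaml
# Hypothetical reproducer, not the original test pod from this report.
apiVersion: v1
kind: Pod
metadata:
  name: resource-test
spec:
  # Assumption: run on the GPU node with the nvidia runtime class;
  # per the tl;dr, no nvidia.com/gpu resource needs to be requested.
  runtimeClassName: nvidia
  containers:
    - name: test
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "2"    # differs from the request -> pod reportedly gets killed
```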
Not sure who's responsible here (the CRI, k3s, or the nvidia-device-plugin?).
Related issue for k8s-device-plugin: https://gitlab.com/nvidia/kubernetes/device-plugin/-/issues/7
The bug is not present on Ubuntu 20.04!
Would be awesome if anyone could reproduce this!