tl;dr: The error occurs whenever a pod's resource requests and limits differ, e.g. requests.cpu = 1 and limits.cpu = 2 -> the pod gets killed. It doesn't matter whether the nvidia-gpu resource is actually requested or not.
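For illustration, a minimal sketch of the container resources stanza described above (values are arbitrary examples; note that a pod whose requests and limits differ can never fall into the Guaranteed QoS class):

```yaml
# Triggers the kill: requests and limits differ
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "2"

# Does not trigger it (per the tl;dr): requests equal limits
resources:
  requests:
    cpu: "1"
  limits:
    cpu: "1"
```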
Steps to reproduce
Nodes needed: one node with a GPU, one node without.

0. Install Ubuntu 22.04 on the GPU node.
1. Install the k3s server on the node without a GPU.
2. Install the k3s agent on the node with a GPU.
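For reference, a minimal sketch of steps 1 and 2 using the k3s quick-start script (server address and token are placeholders to fill in):

```sh
# On the node without a GPU: install the k3s server
curl -sfL https://get.k3s.io | sh -

# On the GPU node: install the k3s agent and join it to the server
# (the join token can be read from /var/lib/rancher/k3s/server/node-token on the server)
curl -sfL https://get.k3s.io | K3S_URL=https://<server-ip>:6443 K3S_TOKEN=<node-token> sh -
```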
3. Install the nvidia-device-plugin:
```yaml
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```
4. Verify that the GPU is allocatable with `kubectl describe node agent`.
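For example, the nvidia.com/gpu resource advertised by the device plugin should show up under the node's Capacity/Allocatable:

```sh
# Look for an nvidia.com/gpu entry with a non-zero count
kubectl describe node agent | grep -A 10 Allocatable
```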
5. Create the following test pod and observe that it gets killed after roughly 5 seconds (the grace period).
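The actual test pod manifest isn't included in this report; purely as an illustration of the condition from the tl;dr (differing requests and limits, no GPU requested), a hypothetical reproducer might look like this — pod name, image, and the use of runtimeClassName are assumptions:

```yaml
# Hypothetical reproducer, not the original test pod from this report.
apiVersion: v1
kind: Pod
metadata:
  name: resource-test
spec:
  # Assumption: run on the GPU node with the nvidia runtime class;
  # per the tl;dr, no nvidia.com/gpu resource needs to be requested.
  runtimeClassName: nvidia
  containers:
    - name: test
      image: ubuntu:22.04
      command: ["sleep", "infinity"]
      resources:
        requests:
          cpu: "1"
        limits:
          cpu: "2"    # differs from the request -> pod reportedly gets killed
```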
Not sure who's responsible here (the CRI, k3s, or the nvidia-device-plugin?).
Related issue for k8s-device-plugin: https://gitlab.com/nvidia/kubernetes/device-plugin/-/issues/7
The bug is not present on Ubuntu 20.04!
Would be awesome if anyone could reproduce this!