pods with runtimeClassName: nvidia get killed after a few seconds #7130
Comments
Something is killing it, but I'm not sure what. You might ask in the upstream nvidia device plugin projects? Just out of curiosity, why are you trying to run things like nginx that don't need GPUs with the nvidia container runtime? |
nginx is just a small dummy container, for easy reproduction of this issue. It could be any other image. |
Does it still get killed if you put GPU resources in the pod spec? I suspect one of the nvidia operators is doing something. |
Yes, I never started an nvidia-runtime pod without GPU requests, but that might also be interesting to try. I'll also repost in the nvidia-device-plugin repo. |
I tried again with a single-server / single-node setup on my GPU machine:

NAME    STATUS   ROLES                       AGE     VERSION        INTERNAL-IP      EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
agent   Ready    control-plane,etcd,master   4m24s   v1.25.7+k3s1   100.121.42.124   100.121.42.124   Ubuntu 22.04.2 LTS   5.15.0-67-generic   containerd://1.6.15-k3s1

Now the test pod is not being killed. Then I tried with one control-plane node and one agent node:

NAME              STATUS   ROLES                       AGE     VERSION        INTERNAL-IP       EXTERNAL-IP       OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
control-plane-1   Ready    control-plane,etcd,master   7m34s   v1.25.7+k3s1   100.121.166.220   100.121.166.220   Ubuntu 22.04.2 LTS   5.15.0-56-generic   containerd://1.6.15-k3s1
agent             Ready    <none>                      4m24s   v1.25.7+k3s1   100.121.42.124    100.121.42.124    Ubuntu 22.04.2 LTS   5.15.0-67-generic   containerd://1.6.15-k3s1

With the additional control-plane node, the test pod is getting killed again. Might this be a network issue? Both machines see each other perfectly through the wireguard mesh. Any ideas on how to debug networking/etcd issues further? |
Okay, I think I found the bug(?). I'm not sure who's responsible (CRI, k3s, nvidia-device-plugin), but I think it's on k3s's side due to the dependency on the cluster setup. tl;dr: the error occurs whenever resource limits and requests differ, e.g. requests.cpu = 1 and limits.cpu = 2 -> pod gets killed. It doesn't matter whether nvidia.com/gpu is also requested or not. Steps to reproduce:
---
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: "nvidia"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.13.0
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
---
apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  restartPolicy: "Never"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 5
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
    command:
    - "sleep"
    args:
    - "infinity"
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "2"
        memory: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: test2
spec:
  restartPolicy: "Never"
  runtimeClassName: "nvidia"
  terminationGracePeriodSeconds: 5
  containers:
  - name: nvidia-smi
    image: "nvidia/cuda:12.1.0-base-ubuntu18.04"
    command:
    - "sleep"
    args:
    - "infinity"
    resources:
      requests:
        cpu: "1"
        memory: 1Gi
      limits:
        cpu: "1"
        memory: 1Gi

@brandond can you try to reproduce? That would be great! |
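For reference, a minimal way to exercise this reproduction (assuming the manifests above are saved to local files; the file names below are placeholders):

# Apply the RuntimeClass, the device plugin DaemonSet, and the two test pods
kubectl apply -f runtimeclass.yaml -f nvidia-device-plugin.yaml -f test-pods.yaml

# Watch both pods: per the report, "test" (requests != limits) is killed within
# ~30 seconds, while "test2" (requests == limits) keeps running
kubectl get pods test test2 -w

# Check the events recorded for the killed pod
kubectl describe pod test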
Are you perhaps low on resources? It sounds like the pod is getting preempted. Do you have more cores requested than your node actually has available? |
No, I'm certainly not. My node has 40 cores, 256 Gi RAM, and 8 RTX 2080Ti GPUs. The mentioned test pods are the only "workload" running. Also keep in mind that this only happens with the nvidia runtime; the normal runtime is fine. |
This isn't something that k3s or containerd would do on its own, I suspect something in the nvidia stack is finding and terminating the pod based on the requested resources. I think you'll need to track it down with those projects, I don't have any idea where to start with that. |
Closing since it appears that this is an issue that should first be investigated with nvidia |
@brandond I think this might be a k3s issue; I was also running into the same thing. Adding the setting from NVIDIA/nvidia-container-toolkit#28 (comment) resolved it for us. (I will add that we're still on 1.23, Ubuntu 22.04, so potentially this is not an issue on a newer release, but not having that set on the runtime seems to run counter to upstream Kubernetes guidance.) |
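For anyone checking their own nodes, a quick way to see whether the containerd config k3s generated actually sets SystemdCgroup for the nvidia runtime (the config path is the one referenced later in this issue; the grep patterns are only illustrative):

# List the runtime sections of the generated containerd config on the agent
grep -n 'runtimes' /var/lib/rancher/k3s/agent/etc/containerd/config.toml

# Check whether the systemd cgroup driver is enabled anywhere in that config
grep -n 'SystemdCgroup' /var/lib/rancher/k3s/agent/etc/containerd/config.toml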
And to add further mystery, this is the containerd configuration that the nvidia-container-toolkit produces, according to its configuration instructions for containerd:

disabled_plugins = ["cri"]
version = 1

[plugins]
  [plugins.cri]
    [plugins.cri.containerd]
      [plugins.cri.containerd.runtimes]
        [plugins.cri.containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins.cri.containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "/usr/bin/nvidia-container-runtime" |
Yeah, that's the very-deprecated version 1 containerd config schema. We're on version 2 which is also deprecated. Also, that has the CRI plugin disabled, which won't work with Kubernetes since it requires CRI. I suspect those docs are very out of date. |
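For comparison, a rough sketch of that same runtime entry in the version 2 schema, with the CRI plugin left enabled and SystemdCgroup added (this is an illustrative translation, not the exact config k3s generates):

version = 2

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  privileged_without_host_devices = false
  runtime_type = "io.containerd.runc.v2"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
  SystemdCgroup = true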
Excellent. Otherwise, I'm having trouble finding a good reference for how nvidia-container-toolkit actually expects containerd to be configured, beyond that command. |
might need to see what the operator does to the containerd config on a kubeadm-based cluster? |
I at least do not have that available; we have manual nodes with multi-GPU/MIG and nvidia-device-plugin. This is one of the concerns I'd found around enabling SystemdCgroup. (It turns out we had previously enabled the udev workaround in this environment for docker.) |
Hmm. Well, if you're up for it, can you try installing k3s with that change? I'm not sure that would prevent me from merging this change though: breaking when systemd is reloaded, vs. being not usable at all because pods are constantly killed, seems like a net improvement. |
We manually overrode the containerd config on 1.23 with the param from your commit (since I think you didn't backport it all the way to 1.23); we planned on testing the systemctl daemon-reload, though. |
With 1.23, the latest nvidia-container-toolkit, and the udev workaround mentioned in that issue, it doesn't appear we have issues when running daemon-reload and running that nvidia-smi loop sample pod, while having SystemdCgroup enabled. |
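For anyone repeating that check, the daemon-reload test amounts to roughly the following (assuming a GPU test pod such as the nvidia-smi loop pod is already Running):

# Reload systemd units on the node that runs the GPU pod
sudo systemctl daemon-reload

# The pod should stay Running; watch for unexpected terminations
kubectl get pods -w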
After I had these issues, I found out that GPU pods were not killed when I used Ubuntu 20.04 as the OS. As soon as 22.04 was used (same config otherwise), pods were killed again. @brandond I hope that this helps a bit in resolving the issue. |
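One difference between those two releases that may be relevant here is the default cgroup hierarchy: Ubuntu 22.04 boots with cgroup v2 by default, while 20.04 defaults to v1, which is where the SystemdCgroup discussion above comes in. A quick check on a node:

# Prints "cgroup2fs" on a cgroup v2 host (Ubuntu 22.04 default)
# and "tmpfs" on a cgroup v1 host (Ubuntu 20.04 default)
stat -fc %T /sys/fs/cgroup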
Environment Details

My main option for testing this is using AWS.

VERSION=v1.25.7+k3s1

Infrastructure:
Node(s) CPU architecture, OS, and version: Linux 5.15.0-1019-aws x86_64 GNU/Linux

Cluster Configuration:
NAME STATUS ROLES AGE VERSION

Config.yaml:
YOUR_REPRODUCED_RESULTS_HERE

Results:
$ kgn
$ kgp -A
$ k logs vec-add-pod
This blip does appear very briefly in the agent node's output from nvidia-smi, but I wasn't able to capture it in the split second it takes to run. |
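One way to catch that brief blip is to poll nvidia-smi in a loop instead of running it once (the interval and log file name are arbitrary):

# Refresh GPU status every 500 ms until interrupted
nvidia-smi --loop-ms=500

# Or log the compute processes over time for later inspection
nvidia-smi --loop-ms=500 --query-compute-apps=pid,process_name,used_memory --format=csv >> gpu-procs.log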
@maaft @sidewinder12s We were unable to reproduce this in our test environments (see above). However, there was some work done in #8470 that will hopefully fix the issue. If you want to see whether it works for you, you can install k3s off the most recent commit. We can leave this issue open until you're able to take a look, but we're hoping to resolve and close this by the upcoming October releases. Thank you! |
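For reference, the k3s install script supports installing from a specific commit rather than a tagged release; something along these lines should work (the commit hash is a placeholder for the one referenced above):

# Install a k3s build from a specific commit instead of a released version
curl -sfL https://get.k3s.io | INSTALL_K3S_COMMIT=<commit-sha> sh -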
I am going to close this out, as no news is good news in this case, I take it. Feel free to open a new issue if there are still problems here, but I believe it was fixed. |
Environmental Info:
Tested K3s Versions: v1.25.6 and v1.25.7
Node(s) CPU architecture, OS, and Version:
Linux agent 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Cluster Configuration:
3 servers, 1 agent, connected via wireguard mesh
kubectl get nodes -o wide
Describe the bug:
I've installed k3s-agent on a machine that has GPUs built-in. I followed all the steps outlined here.
I also installed nvidia-device-plugin with:

kubectl -n kube-system logs nvidia-device-plugin-daemonset-xtt9q

The nvidia-device-plugin is running correctly without any restarts. I'm able to create nvidia-enabled pods that correctly output the GPUs. The output is as expected (kubectl logs nvidia-smi):

Here comes the weird part

Let's now run e.g. nginx with runtimeClassName: nvidia:

After 0-30 seconds the pod gets killed:

or

The same pod without runtimeClassName: nvidia keeps running:

Additional Info

nvidia-enabled containers run directly with ctr run --gpus 0 ... are not getting killed.

journalctl -fu k3s-agent during the relevant lifetime of the nginx pod:

cat /var/lib/rancher/k3s/agent/etc/containerd/config.toml

/var/log/syslog after creating the nginx pod with the nvidia runtime class:

Anyone have any ideas how to debug this further?
What is killing my pods?
I'm completely lost...
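For completeness, the kind of pod described here is an ordinary nginx pod whose only GPU-related property is the runtime class; a minimal equivalent (the original manifest is not reproduced above, so the name here is assumed) would be:

apiVersion: v1
kind: Pod
metadata:
  name: nginx-nvidia   # name assumed for illustration
spec:
  runtimeClassName: nvidia   # removing this line keeps the pod running, per the report
  containers:
  - name: nginx
    image: nginx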