[QUESTION/HELP] Installing NVIDIA GPU operator on a k3d cluster #1458

saicharithpasula · 2024-07-01T03:21:13Z

saicharithpasula
Jul 1, 2024

Hello,

I am trying to install NVIDIA GPU operator on a K3d cluster. I have a GPU cluster setup according to the docs(https://k3d.io/v5.6.3/usage/advanced/cuda/) and am able to access the GPU using any pods created in the cluster.

But when I try to install NVIDIA GPU operator as described in the install guide (https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#) the driver validator is stuck in a error loop. Here are the logs from the pod

sudo kubectl logs nvidia-operator-validator-8ft7q -n gpu-operator -c driver-validation
time="2024-07-01T03:00:09Z" level=info msg="version: 0fe1e8db, commit: 0fe1e8d"
time="2024-07-01T03:00:09Z" level=info msg="Detected pre-installed driver on the host"
running command chroot with args [/host nvidia-smi]
Mon Jul  1 03:00:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000001:00:00.0 Off |                    0 |
| N/A   32C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
time="2024-07-01T03:00:09Z" level=info msg="creating symlinks under /dev/char that correspond to NVIDIA character devices"
time="2024-07-01T03:00:09Z" level=info msg="Error: error validating driver installation: error creating symlink creator: failed to load NVIDIA kernel modules: failed to load module nvidia: exit status 127; output=chroot: failed to run command '/sbin/modprobe': No such file or directory\n\n\nFailed to create symlinks under /dev/char that point to all possible NVIDIA character devices.\nThe existence of these symlinks is required to address the following bug:\n\n    https://github.com/NVIDIA/gpu-operator/issues/430\n\nThis bug impacts container runtimes configured with systemd cgroup management enabled.\nTo disable the symlink creation, set the following envvar in ClusterPolicy:\n\n    validator:\n      driver:\n        env:\n        - name: DISABLE_DEV_CHAR_SYMLINK_CREATION\n          value: \"true\""

Can you help me solve this issue?

iwilltry42 · 2024-07-04T05:46:26Z

iwilltry42
Jul 4, 2024
Maintainer

Hey!
I personally never used this, but maybe some of the original authors of the guide have also used the operator and may be able to help:
@markrexwinkel
@vainkop
@dbreyfogle

0 replies

saicharithpasula · 2024-07-07T21:04:51Z

saicharithpasula
Jul 7, 2024
Author

Hello @iwilltry42 , thank you for your reply. I modified the helm chart of gpu operator to disable the symllink creator and was able to avoid the above issue. But now I am running into a new error Error: failed to generate container "7ef3f1c73ce130dc46badbdef38af202bd8e52e495b0e285f823dfd320f4288e" spec: failed to generate spec: path "/run/nvidia/driver" is mounted on "/run" but it is not a shared or slave mount

Can you tell me if this is an error with my underlying docker infrastructure? Specifically I am not sure how to mount the /run folder as a shared mount

1 reply

saicharithpasula Jul 7, 2024
Author

I looked at this issue #1063 and exec into the container and running the mount --make-rshared / command inside the container solved this issue but isnt there a permanent solution? I thought this error was fixed in this PR #1268?
Does using custom docker image to create the cluster cause this issue?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[QUESTION/HELP] Installing NVIDIA GPU operator on a k3d cluster #1458

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Select a reply

[QUESTION/HELP] Installing NVIDIA GPU operator on a k3d cluster #1458

saicharithpasula Jul 1, 2024

Replies: 2 comments · 1 reply

iwilltry42 Jul 4, 2024 Maintainer

saicharithpasula Jul 7, 2024 Author

saicharithpasula Jul 7, 2024 Author

saicharithpasula
Jul 1, 2024

Replies: 2 comments 1 reply

iwilltry42
Jul 4, 2024
Maintainer

saicharithpasula
Jul 7, 2024
Author

saicharithpasula Jul 7, 2024
Author