Newer runc versions break support for NVIDIA GPUs #3708

Closed
klueska opened this issue Jan 23, 2023 · 17 comments

klueska commented Jan 23, 2023

The following logic, originally introduced in libcontainer/cgroups/systemd/common.go and later moved / extended into libcontainer/cgroups/devices/systemd.go, will never match NVIDIA GPU devices.

                } else {
                        // "_ n:m _" rules are just a path in /dev/{block,char}/.
                        switch rule.Type {
                        case devices.BlockDevice:
                                entry.Path = fmt.Sprintf("/dev/block/%d:%d", rule.Major, rule.Minor)
                        case devices.CharDevice:
                                entry.Path = fmt.Sprintf("/dev/char/%d:%d", rule.Major, rule.Minor)
                        }
                        // systemd will issue a warning if the path we give here doesn't exist.
                        // Since all of this logic is best-effort anyway (we manually set these
                        // rules separately to systemd) we can safely skip entries that don't
                        // have a corresponding path.
                        if _, err := os.Stat(entry.Path); err != nil {
                                // Also check /sys/dev so that we don't depend on /dev/{block,char}
                                // being populated. (/dev/{block,char} is populated by udev, which
                                // isn't strictly required for systemd). Ironically, this happens most
                                // easily when starting containerd within a runc created container
                                // itself.

                                // We don't bother with securejoin here because we create entry.Path
                                // right above here, so we know it's safe.
                                if _, err := os.Stat("/sys" + entry.Path); err != nil {
                                        logrus.Warnf("skipping device %s for systemd: %s", entry.Path, err)
                                        continue
                                }
                        }
                }
                deviceAllowList = append(deviceAllowList, entry)

The end result is that containers have access to GPUs when they first come online, but lose that access as soon as systemd re-evaluates the cgroups it manages (triggered by something as simple as a systemctl daemon-reload).

The reason is that the NVIDIA driver does not register any of its devices under /sys/dev/char. The NVIDIA driver is closed source / non-GPL, and is therefore unable to call the standard Linux helpers that would register the devices it manages there. In fact, the device nodes that ultimately get created under /dev for all NVIDIA devices are created from user-space code.
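
To make the mismatch concrete, here is a minimal stand-alone sketch (not runc code) of the same lookup the snippet above performs. It assumes the typical 195:0 major:minor for /dev/nvidia0; check ls -l /dev/nvidia0 on your system, since these numbers are not guaranteed:

package main

import (
	"fmt"
	"os"
)

func main() {
	// The NVIDIA node itself exists (created from user space), but neither of
	// the paths the runc code above derives from its major:minor pair does.
	for _, p := range []string{
		"/dev/nvidia0",        // present: created by the NVIDIA user-space tools
		"/dev/char/195:0",     // what the code above tries first
		"/sys/dev/char/195:0", // the /sys fallback it tries next
	} {
		if _, err := os.Stat(p); err != nil {
			fmt.Printf("%-21s missing (%v)\n", p, err)
		} else {
			fmt.Printf("%-21s present\n", p)
		}
	}
}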

I have filed an issue internally at NVIDIA to investigate what can be done on the driver front, but until something is put in place, GPU support on newer runcs (with systemd cgroup integration) remains broken.

We are also investigating workarounds to manually create the necessary symlinks under /dev/char, but so far nothing we have come up with is "fool-proof".

Would it be possible to use the device node under /dev for the DeviceAllowList here rather than relying on the existence of a node under /dev/char or /sys/dev/char?

Here is a simple reproducer using docker on Ubuntu 22.04 (with a single K80 GPU):

$ docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"
bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

ubuntu@ip-172-31-57-160:~$ sudo systemctl daemon-reload

ubuntu@ip-172-31-57-160:~$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error

This is then fixed if I (manually) do the following:

$ sudo ln -s ../nvidia0 /dev/char/195:0
$ sudo ln -s ../nvidiactl /dev/char/195:255
$ sudo ln -s ../nvidia-modeset /dev/char/195:254
$ sudo ln -s ../nvidia-uvm /dev/char/510:0
$ sudo ln -s ../nvidia-uvm-tools /dev/char/510:1
$ sudo ln -s ../nvidia-caps/nvidia-cap1 /dev/char/235:1
$ sudo ln -s ../nvidia-caps/nvidia-cap2 /dev/char/235:2
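
For reference, the same symlinks could be created programmatically along these lines. This is a rough, best-effort sketch: it reads each node's real major:minor with stat(2) rather than hardcoding the numbers, and assumes the nodes live directly under /dev and /dev/nvidia-caps and that /dev/char exists:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"golang.org/x/sys/unix"
)

func main() {
	// Candidate NVIDIA device nodes on the host; adjust the patterns for the local setup.
	nodes, _ := filepath.Glob("/dev/nvidia*")
	caps, _ := filepath.Glob("/dev/nvidia-caps/*")
	for _, node := range append(nodes, caps...) {
		var st unix.Stat_t
		if err := unix.Stat(node, &st); err != nil || st.Mode&unix.S_IFMT != unix.S_IFCHR {
			continue // skip anything that is not a character device (e.g. the nvidia-caps directory itself)
		}
		// /dev/char/<major>:<minor> is the path the runc/systemd code above looks for.
		link := fmt.Sprintf("/dev/char/%d:%d", unix.Major(uint64(st.Rdev)), unix.Minor(uint64(st.Rdev)))
		target, _ := filepath.Rel("/dev/char", node) // e.g. ../nvidia0, ../nvidia-caps/nvidia-cap1
		if err := os.Symlink(target, link); err != nil && !os.IsExist(err) {
			fmt.Fprintf(os.Stderr, "failed to link %s -> %s: %v\n", link, target, err)
		}
	}
}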

Rerunning the above, the issue is now resolved:

$ docker run -d --rm --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"
cb1a10d2467b4974006bd8cec66c7528cbf244c4b724b2325390bd0d9ae4e605

$ docker logs cb1a10d2467b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

$ sudo systemctl daemon-reload

$ docker logs cb1a10d2467b
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

klueska commented Jan 23, 2023

/cc @kolyshkin

@kolyshkin

Well, all I can say is that this is complicated; I was looking into it while reviewing PR #3568 (see its comments).

The problem is that the runtime spec (https://github.com/opencontainers/runtime-spec/blob/main/config-linux.md, to be precise) defines a host device by its major:minor pair (the path is only used for the in-container device node). Similarly, both the cgroup v1 and cgroup v2 device controllers use the major:minor pair.

OTOH, systemd uses a path to refer to a device. Since we don't have a way to specify a path (in the runtime spec, that is), we need a way to convert a major:minor pair into a device name (to supply to systemd). Fortunately, we have the /dev/{block,char}/MM:mm nodes, so we can translate easily.

Unfortunately, this does not work for NVIDIA devices. A workaround is proposed in #3568, but in essence it is a hardcoded kludge, and unfortunately I don't see a clean way to handle this (other than introducing a host path in the runtime spec, that is).

The non-hardcoded kludge would be to read and parse the whole of /dev, trying to find the node with the needed major:minor. This is time-consuming (we need to stat(2) every file) and may have security implications.
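
For illustration, that scan would look roughly like the following. This is a sketch only, and it stat(2)s every entry under /dev, which is exactly the cost concern above; the 195:0 pair in main is just an example:

package main

import (
	"fmt"
	"io/fs"
	"path/filepath"

	"golang.org/x/sys/unix"
)

// findCharDev walks /dev and returns the first character device node
// whose rdev matches the given major:minor pair, or "" if none is found.
func findCharDev(major, minor uint32) string {
	var found string
	filepath.WalkDir("/dev", func(path string, d fs.DirEntry, walkErr error) error {
		if found != "" {
			return fs.SkipAll // stop the walk once we have a match (Go 1.20+)
		}
		if walkErr != nil {
			return nil // skip unreadable entries instead of aborting
		}
		var st unix.Stat_t
		if unix.Lstat(path, &st) != nil || st.Mode&unix.S_IFMT != unix.S_IFCHR {
			return nil // not a character device node
		}
		if unix.Major(uint64(st.Rdev)) == major && unix.Minor(uint64(st.Rdev)) == minor {
			found = path
		}
		return nil
	})
	return found
}

func main() {
	// 195:0 is the usual major:minor for /dev/nvidia0, but this is system-dependent.
	fmt.Println(findCharDev(195, 0))
}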

Now, all of this is happening because NVIDIA doesn't follow the standard practice of device registration. So, for me, the best solution seems to be creating those /dev/char symlinks you demonstrated in your workaround above from some post-module-loading script.

If you have any ideas @klueska please let us know.


klueska commented Jan 24, 2023

Yeah, the hardcoded kludge in #3568 is definitely not the answer. It's not even guaranteed to work, since the device majors (for some of the devices) can change, and the nvidia-caps device major is not considered at all.

Two options I see that could help are:

  1. Re-evaluate the set of allowed-devices after the prestart/createRuntime hooks have run
  2. Introduce a new hook before the initial cgroup setup happens

This would allow us to insert a hook that creates the necessary /dev/char files "just-in-time" before the deviceAllowList is populated.

Unfortunately, this won't work for the rootless case, but it would at least provide a well-defined "out" for most of our users.

Regarding:

Now, all this is happening because Nvidia doesn't follow the standard practice of device registration.

I agree the best solution is for NVIDIA to start registering its devices under /sys/dev. This problem would simply go away if that were done properly. Unfortunately, there is internal resistance to doing so because of GPL compatibility issues. It could be done in the new open-source version of the NVIDIA kernel module (https://github.com/NVIDIA/open-gpu-kernel-modules), but that is not yet production-ready on all systems, and will likely take years to become so.

Regarding:

So, for me, the best solution seems to be doing those /dev/char symlinks you demonstrated above in your workaround in some post module loading scripts.

Unfortunately, simply running a post-module-loading script is not sufficient, for two reasons:

  1. Some device nodes only get created long after the module has loaded (in support of a feature called MIG, which allows you to dynamically divide a GPU into a set of smaller mini-GPUs).
  2. We often load the NVIDIA kernel module from within a container, meaning that this script would end up creating symlinks under /dev/char relative to the container's root, not the host root. All of the NVIDIA device nodes that we create in this scenario are also relative to the container's /dev, so the script would somehow need to know the actual host path to these devices to create the symlinks in /dev/char anyway.

Workarounds for both (1) and (2) are feasible (e.g. a long-running script on the host that uses inotifywait to watch for the creation of NVIDIA device nodes under /dev, or under a well-defined location where our "driver container" has mapped its /dev directory back onto the host).
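
To illustrate the idea, here is a rough sketch of such a watcher in Go, using the third-party github.com/fsnotify/fsnotify package in place of inotifywait. The watched path and the symlink-creation step are placeholders for whatever the local setup needs; this only logs the events:

package main

import (
	"log"
	"path/filepath"
	"strings"

	"github.com/fsnotify/fsnotify"
)

func main() {
	w, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer w.Close()

	// Watch the host /dev; a "driver container" setup would also watch the
	// directory where that container's /dev is mapped back onto the host.
	if err := w.Add("/dev"); err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case ev := <-w.Events:
			if ev.Op&fsnotify.Create == fsnotify.Create &&
				strings.HasPrefix(filepath.Base(ev.Name), "nvidia") {
				log.Printf("new NVIDIA device node: %s (create the /dev/char symlink here)", ev.Name)
			}
		case err := <-w.Errors:
			log.Println("watch error:", err)
		}
	}
}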

However, we'd much prefer a "just-in-time" solution (if possible) until the NVIDIA driver team is able to come up with a proper solution at the driver level.


klueska commented Jan 24, 2023

Thinking about it a bit more....

Is there any particular reason you can't just use the absolute path to the device node inside the container? From what I understood in your response, device.Path contains this information (modulo the container root, which I assume shouldn't be too hard to make available here). Presumably you create this device node inside the container with the correct major / minor number, so why not just point systemd at this instead of its "original" reference on the host at /dev/char?

@kolyshkin

I might finally have a fix for that: #3842

@kolyshkin

@klueska can you please check if the patch from #3842 / #3845 fixes this issue?


klueska commented Apr 25, 2023

@elezar can you take a look while I am away?


elezar commented Apr 25, 2023

Sure. Will have a look tomorrow.

@kolyshkin

With the fix, I was able to give access to a device that does not have an appropriate /dev/{char,block}/MM:mm entry using the systemd cgroup manager, so I guess this is now fixed, but I just want to confirm.


elezar commented Apr 26, 2023

I can confirm that the failure mode is no longer triggered with the fixes in place.

I first started to reproduce the behaviour that @klueska described using:

$ runc --version
runc version 1.1.5
commit: v1.1.5-0-gf19387a
spec: 1.0.2-dev
go: go1.19.7
libseccomp: 2.5.3

that was installed on my test system.

I then cloned the runc repo and set up runc-dev as a runtime in my docker daemon config:

{
    "runtimes": {
        "runc-dev": {
            "args": [],
            "path": "/home/ubuntu/src/runc/runc"
        }
    }
}

I then built runc from source at the tips of the main and release-1.1 branches and repeated the test using the runc-dev runtime, for the following versions:

$ ./runc --version
runc version 1.1.0+dev
commit: v1.1.0-515-g30f9f808
spec: 1.1.0-rc.1
go: go1.20.3
libseccomp: 2.5.3
$ ./runc --version
runc version 1.1.6+dev
commit: v1.1.6-4-gf72cd0a6
spec: 1.0.2-dev
go: go1.20.3
libseccomp: 2.5.3

This was from a clean boot, with no nvidia-specific symlinks present in /dev/char:

$ ls -al /dev/char | grep nvidia | wc -l
0


cyphar commented Apr 28, 2023

Can we close this now that #3842 / #3845 have been merged (and 1.1.7 is out)? The fix is only going to work with systemd 240+ but there's not much we can do about that.


elezar commented Apr 28, 2023

I will leave it to @klueska to make that call. I do believe he's out for a couple of weeks, so it won't be immediate.

@kolyshkin

No response from @klueska, so I consider this fixed. A comment can still be added to a closed bug, so feel free to share your findings.


vsxen commented Jul 9, 2024

runc 1.1.12, RedHat 9, systemd 252 still has issues.

@kolyshkin

runc 1.1.12, RedHat 9, systemd 252 still has issues.

Please file a separate bug with a detailed reproducer.


vsxen commented Jul 9, 2024

# install nvidia driver & nvidia-container-toolkit

nvidia-smi
Tue Jul  9 16:30:21 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:2F:00.0 Off |                    0 |
| N/A   50C    P8             16W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:86:00.0 Off |                    0 |
| N/A   47C    P8             16W /   70W |       1MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+


wget https://github.com/containerd/containerd/releases/download/v1.7.16/cri-containerd-cni-1.7.16-linux-amd64.tar.gz

#  run a container

ctr -n k8s.io run --net-host --cgroup "k8s.slice:demo:nvidia" --runc-systemd-cgroup --rm -t --env NVIDIA_VISIBLE_DEVICES=all --runc-binary /usr/bin/nvidia-container-runtime nvidia/k8s-device-plugin  nvidia-smi bash

# daemon  reload
systemctl daemon-reload


nvidia-smi
Failed to initialize NVML: Unknown Error


nvidia-container-runtime -v
runc version 1.1.12
commit: v1.1.12-0-g51d5e946
spec: 1.0.2-dev
go: go1.20.13
libseccomp: 2.5.2

#systemctl --version
systemd 252 (252-32.el9_4)

Another thing I am curious about is that DeviceAllow never contains any NVIDIA configuration:

# /run/systemd/transient/demo-nvidia.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

@kolyshkin

@vsxen I don't see you explicitly granting access to any NVIDIA devices.

Also, as I said earlier, please file a separate bug report if you feel something is broken.
