Update to runc 1.1.7 - Fixes issue with broken NVidia device support #1461
Comments
@Morph-Ed Working on it. I will update here once I have submitted the change. |
Changes merged 772d14d |
@tapakund @ntsbtz Unfortunately something is missing. Even after updating runc and containerd, there is still an issue with rootless Docker.
|
Do |
Hi @sshedi, Many thanks for the weekend support !!
#!/bin/bash
# For Photon OS Docker rootless support, see
#
# - https://vmware.github.io/photon/docs-v5/administration-guide/containers/docker-rootless-support/
# - https://github.com/vmware/photon/issues/1461
#
# Tested on Workstation17 with provisioned vm from photon-hw15-5.0-dde71ec57.x86_64.ova
# 1) login as root
# 2) save the following content e.g to a file `/tmp/photon-docker-rootless-install.sh`
# 3) run `chmod +x /tmp/photon-docker-rootless-install.sh`
# 4) run `/tmp/photon-docker-rootless-install.sh`
# 5) If the script finished successfully, log in as the user set in ROOTLESS_USER and rerun the script.
#
ROOTLESS_USER="test_user"
if [ "$(id -un)" = "root" ]; then
if [ ! -f "/usr/bin/dockerd-rootless-setuptool.sh" ]; then
# Update runc and containerd with respect to NVIDIA/nvidia-docker#1461
tdnf update -y runc containerd
tdnf install -y shadow fuse slirp4netns libslirp
tdnf install -y docker-rootless
useradd -m $ROOTLESS_USER
echo Set a password for $ROOTLESS_USER.
passwd $ROOTLESS_USER
echo "$ROOTLESS_USER:100000:65536" >> /etc/subuid
echo "$ROOTLESS_USER:100000:65536" >> /etc/subgid
echo "kernel.unprivileged_userns_clone = 1" >> /etc/sysctl.d/50-rootless.conf
chmod 644 /etc/subuid /etc/subgid /etc/sysctl.d/50-rootless.conf
sysctl --system
modprobe ip_tables
echo "Now log in as $ROOTLESS_USER in a new session (for example, a new PuTTY window) and rerun the script."
fi
fi
# login as $ROOTLESS_USER and rerun the script
if [ "$(id -un)" = "$ROOTLESS_USER" ]; then
if test -f "/usr/bin/dockerd-rootless-setuptool.sh"; then
systemctl --user restart dbus
if dockerd-rootless-setuptool.sh check | grep -q "Requirements are satisfied"; then
dockerd-rootless-setuptool.sh install
# Append the environment settings only once; \$PATH is escaped so it is expanded at login, not now.
if ! grep -qF "export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock" "$HOME/.bashrc" 2>/dev/null; then
cat << EOF_bashrc >> "$HOME/.bashrc"
export PATH=/usr/bin:\$PATH
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
EOF_bashrc
fi
echo The installation has finished. Check the output of hello-world.
if docker run --rm hello-world | grep -q "Hello from Docker!"; then
echo The installation finished successfully.
fi
fi
fi
fi
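One possible refinement, offered only as a sketch: the script above appends to `/etc/subuid` and `/etc/subgid` unconditionally, so a duplicate line could appear if `useradd` has already created an entry or if that part is run twice. The helper name `ensure_subid_entry` below is hypothetical, not something shipped with Photon OS or Docker.

```shell
#!/bin/sh
# Sketch: append a subordinate ID range only if the user has no entry yet.
# ensure_subid_entry is a hypothetical helper name (an assumption).
ensure_subid_entry() {
    file=$1 user=$2 start=$3 count=$4
    # Entries have the form "user:start:count"; match on the "user:" prefix.
    if ! grep -q "^${user}:" "$file" 2>/dev/null; then
        echo "${user}:${start}:${count}" >> "$file"
    fi
}

# Example (as root), using the same values as the script above:
# ensure_subid_entry /etc/subuid test_user 100000 65536
# ensure_subid_entry /etc/subgid test_user 100000 65536
```

Calling it a second time with the same user leaves the file unchanged.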
Try this script and check: (run the script as root)
|
Thanks @sshedi! Your pipeline test script made it clear what was missing. Hopefully pipeline scripts will become a standard; they make life so much easier. Every enthusiast should contribute pipeline scripts. I wasn't aware of @iwaseyusuke's dbus comment. The docs should be updated. @Morph-Ed Can you retest the NVidia container? |
@dcasota Happy to, is there a build process output somewhere with an ISO/OVA built with the latest code changes, so I can test? Or do I need to build it myself, or is there another way? Apologies for the naive question, first time I've been involved in Photon dev. |
Hi @Morph-Ed, Yes sure, here is a suggestion:
Both subjects might find a way into the docs. There is no 4.0 Rev3 or 5.0 Rev1 ISO so far, but building an ISO from the 5.0 GA bits, including the latest updates, works. Hope this helps. Please confirm as soon as the issue can be closed. I was, and still am, learning as well, thanks to the Photon OS team and community. |
Running the CUDA docker image gives me the following error
My VM has 4GB RAM (all reserved) with the P620 GFX card passed through |
@Morph-Ed Could you generate and attach the debug logs? See https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html#generating-debugging-logs |
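For anyone following along, a rough sketch of what generating those logs can look like. It assumes the toolkit config lives at `/etc/nvidia-container-runtime/config.toml` and ships with commented-out `#debug = ...` lines; check the linked guide for your toolkit version. The helper name `enable_nvidia_debug` is made up for this example.

```shell
#!/bin/sh
# Sketch: uncomment the "#debug = ..." lines in the NVIDIA container runtime config.
# enable_nvidia_debug is a hypothetical helper name; the default path is an assumption.
enable_nvidia_debug() {
    config=${1:-/etc/nvidia-container-runtime/config.toml}
    # -i.bak edits in place and keeps a backup copy next to the file.
    sed -i.bak 's/^#debug/debug/' "$config"
}

# Example (as root):
# enable_nvidia_debug
# Then rerun the failing container and collect the log files named on the debug lines.
```

Only lines that start with `#debug` are touched; other commented settings stay as they are.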
Searching the error text gives me this issue. I'll see if there is anything useful to be had from that thread |
Downgrading to Docker 20.10.23-1 is a potential fix, since I have 23.0.2-1.ph5 installed. What is the best way of doing that? My repos don't have any other versions available. Is there an older repo I can try to use? |
@sshedi Does the Photon OS team recommend an NVidia-related combination of vhw + Photon OS release + NVidia drivers/CUDA/container toolkit? The "testing with latest bits" approach does not work on Photon OS 5.0 because NVidia does not yet support the latest Docker. @Morph-Ed My last tested setup was on Photon OS 4.0 Rev2 with updates until August 2022 and NVidia drivers 470.141 (CUDA 11.4). Over the next weeks, I might restart research into valid combinations. |
@dcasota Thanks - A colleague tells me "Photon 4 (version unknown but upgraded from 3) with NVidia 525.89.02 and CUDA 12.0 works" I will start investigating older drivers |
TLDR: A Working combo is:
Starting from the beginning... This tests a deployment of an ESXi VM | 4 CPU | 4GB (reserved) | 16GB thin provisioned | Hypervisor.CPUID.v0 FALSE | NVidia P620. All commands are run as root.
Install NVidia Drivers
SMI Test
Prep for NVidia Container Toolkit
Install specific version of the NVidia container toolkit
|
Cannot get Photon 5 to work (updated to use runc 1.1.7) Tried with both
The NVidia driver here is from https://us.download.nvidia.com/XFree86/Linux-x86_64/525.116.04/NVIDIA-Linux-x86_64-525.116.04.run, which is a different driver family(?) than the one that worked in my previous experiment with Photon 3.0 (that used the Tesla driver from https://us.download.nvidia.com/tesla/470.141.03/NVIDIA-Linux-x86_64-470.141.03.run).
dmesg
|
@Morph-Ed double check the setup result with a newer nvidia-container-toolkit, see release notes https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/release-notes.html with respect to rpc related fixes. |
Failed with both a newer and an older version of Docker
|
Also works with Photon 3 Rev 3 Update 1
I'm unable to update nvidia-container-toolkit to 1.13.1. It complains about a conflict with nvidia-container-toolkit-base. Updating from here to Photon 4 using the update script causes the CUDA docker command to fail. |
For this error:
Try enabling full support of memory overcommit: sysctl -w vm.overcommit_memory=1 |
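Note that `sysctl -w` only changes the running kernel, so the value is lost on reboot. A minimal sketch for keeping it, assuming a dedicated drop-in file under `/etc/sysctl.d/`; the file name `60-overcommit.conf` and the helper name `persist_sysctl` are made up for this example.

```shell
#!/bin/sh
# Sketch: write a single sysctl setting to its own drop-in file.
# persist_sysctl and the file name used in the example are hypothetical.
persist_sysctl() {
    key=$1 value=$2 conf=$3
    # Overwrite rather than append, so rerunning leaves exactly one line.
    printf '%s = %s\n' "$key" "$value" > "$conf"
}

# Example (as root):
# persist_sysctl vm.overcommit_memory 1 /etc/sysctl.d/60-overcommit.conf
# sysctl --system   # reload all sysctl configuration files
```

Using one file per setting makes it easy to undo: delete the file and run `sysctl --system` again.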
Is your feature request related to a problem? Please describe.
Photon 5 RC ships runc 1.1.4, which has a bug in NVidia device registration; as a result, Docker containers cannot get NVidia CUDA support.
See opencontainers/runc#3708
Describe the solution you'd like
I would like runc version 1.1.7 included
https://github.com/opencontainers/runc/releases/tag/v1.1.7
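To check whether a given system already has a new enough runc, something like the following sketch could be used. It assumes GNU `sort -V` is available and that the first line of `runc --version` looks like `runc version 1.1.7`; the helper names `version_ge` and `runc_version` are made up for this example.

```shell
#!/bin/sh
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Relies on `sort -V` (version sort); helper names are hypothetical.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# runc_version: print the third field of the first line of `runc --version`,
# assuming that line has the form "runc version X.Y.Z".
runc_version() {
    runc --version | awk 'NR == 1 { print $3 }'
}

# Example:
# if version_ge "$(runc_version)" 1.1.7; then
#     echo "runc is at least 1.1.7"
# else
#     echo "runc is older than 1.1.7"
# fi
```

Version sort matters here because a plain string comparison would rank 1.1.10 below 1.1.7.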
Describe alternatives you've considered
There might be previous versions of NVidia drivers that work, but this seems to be the most comprehensive and simple fix.
Additional context
No response