This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

system has unsupported display driver / cuda driver combination #1256

Closed
5 of 8 tasks
ChaiBapchya opened this issue Apr 24, 2020 · 3 comments · Fixed by apache/mxnet#18186

Comments

@ChaiBapchya

ChaiBapchya commented Apr 24, 2020

1. Issue or feature description

CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination

2. Steps to reproduce the issue

Too lengthy/not possible to share.

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
 WARNING, the following logs are for debugging purposes only --

I0424 07:05:10.961294 3010 nvc.c:281] initializing library context (version=1.0.7, build=b71f87c04b8eca8a16bf60995506c35c937347d9)
I0424 07:05:10.961331 3010 nvc.c:255] using root /
I0424 07:05:10.961341 3010 nvc.c:256] using ldcache /etc/ld.so.cache
I0424 07:05:10.961348 3010 nvc.c:257] using unprivileged user 1000:1000
W0424 07:05:10.962501 3011 nvc.c:186] failed to set inheritable capabilities
W0424 07:05:10.962538 3011 nvc.c:187] skipping kernel modules load due to failure
I0424 07:05:10.962720 3012 driver.c:133] starting driver service
I0424 07:05:10.987696 3010 nvc_info.c:438] requesting driver information with ''
I0424 07:05:10.987894 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.440.33.01
I0424 07:05:10.987949 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.440.33.01
I0424 07:05:10.987987 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.440.33.01 over /usr/lib/x86_64-linux-gnu/tls/libnvidia-tls.so.440.33.01
I0424 07:05:10.988024 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.440.33.01
I0424 07:05:10.988073 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.440.33.01
I0424 07:05:10.988125 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.440.33.01
I0424 07:05:10.988180 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.440.33.01
I0424 07:05:10.988216 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.440.33.01
I0424 07:05:10.988268 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.440.33.01
I0424 07:05:10.988319 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.440.33.01
I0424 07:05:10.988354 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.440.33.01
I0424 07:05:10.988390 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.440.33.01
I0424 07:05:10.988430 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.440.33.01
I0424 07:05:10.988479 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.33.01
I0424 07:05:10.988521 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.440.33.01
I0424 07:05:10.988570 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.440.33.01
I0424 07:05:10.988606 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.440.33.01
I0424 07:05:10.988643 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.440.33.01
I0424 07:05:10.988694 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.440.33.01
I0424 07:05:10.988730 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.440.33.01
I0424 07:05:10.988853 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.440.33.01
I0424 07:05:10.988939 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.440.33.01
I0424 07:05:10.988977 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.440.33.01
I0424 07:05:10.989014 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.440.33.01
I0424 07:05:10.989050 3010 nvc_info.c:152] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.440.33.01
W0424 07:05:10.989072 3010 nvc_info.c:303] missing library libvdpau_nvidia.so
W0424 07:05:10.989079 3010 nvc_info.c:307] missing compat32 library libnvidia-ml.so
W0424 07:05:10.989087 3010 nvc_info.c:307] missing compat32 library libnvidia-cfg.so
W0424 07:05:10.989094 3010 nvc_info.c:307] missing compat32 library libcuda.so
W0424 07:05:10.989099 3010 nvc_info.c:307] missing compat32 library libnvidia-opencl.so
W0424 07:05:10.989104 3010 nvc_info.c:307] missing compat32 library libnvidia-ptxjitcompiler.so
W0424 07:05:10.989108 3010 nvc_info.c:307] missing compat32 library libnvidia-fatbinaryloader.so
W0424 07:05:10.989119 3010 nvc_info.c:307] missing compat32 library libnvidia-compiler.so
W0424 07:05:10.989125 3010 nvc_info.c:307] missing compat32 library libvdpau_nvidia.so
W0424 07:05:10.989132 3010 nvc_info.c:307] missing compat32 library libnvidia-encode.so
W0424 07:05:10.989138 3010 nvc_info.c:307] missing compat32 library libnvidia-opticalflow.so
W0424 07:05:10.989147 3010 nvc_info.c:307] missing compat32 library libnvcuvid.so
W0424 07:05:10.989158 3010 nvc_info.c:307] missing compat32 library libnvidia-eglcore.so
W0424 07:05:10.989165 3010 nvc_info.c:307] missing compat32 library libnvidia-glcore.so
W0424 07:05:10.989175 3010 nvc_info.c:307] missing compat32 library libnvidia-tls.so
W0424 07:05:10.989180 3010 nvc_info.c:307] missing compat32 library libnvidia-glsi.so
W0424 07:05:10.989188 3010 nvc_info.c:307] missing compat32 library libnvidia-fbc.so
W0424 07:05:10.989198 3010 nvc_info.c:307] missing compat32 library libnvidia-ifr.so
W0424 07:05:10.989204 3010 nvc_info.c:307] missing compat32 library libnvidia-rtcore.so
W0424 07:05:10.989211 3010 nvc_info.c:307] missing compat32 library libnvoptix.so
W0424 07:05:10.989215 3010 nvc_info.c:307] missing compat32 library libGLX_nvidia.so
W0424 07:05:10.989222 3010 nvc_info.c:307] missing compat32 library libEGL_nvidia.so
W0424 07:05:10.989230 3010 nvc_info.c:307] missing compat32 library libGLESv2_nvidia.so
W0424 07:05:10.989240 3010 nvc_info.c:307] missing compat32 library libGLESv1_CM_nvidia.so
W0424 07:05:10.989250 3010 nvc_info.c:307] missing compat32 library libnvidia-glvkspirv.so
W0424 07:05:10.989256 3010 nvc_info.c:307] missing compat32 library libnvidia-cbl.so
I0424 07:05:10.989472 3010 nvc_info.c:233] selecting /usr/bin/nvidia-smi
I0424 07:05:10.989496 3010 nvc_info.c:233] selecting /usr/bin/nvidia-debugdump
I0424 07:05:10.989515 3010 nvc_info.c:233] selecting /usr/bin/nvidia-persistenced
I0424 07:05:10.989540 3010 nvc_info.c:233] selecting /usr/bin/nvidia-cuda-mps-control
I0424 07:05:10.989561 3010 nvc_info.c:233] selecting /usr/bin/nvidia-cuda-mps-server
I0424 07:05:10.989589 3010 nvc_info.c:370] listing device /dev/nvidiactl
I0424 07:05:10.989598 3010 nvc_info.c:370] listing device /dev/nvidia-uvm
I0424 07:05:10.989607 3010 nvc_info.c:370] listing device /dev/nvidia-uvm-tools
I0424 07:05:10.989617 3010 nvc_info.c:370] listing device /dev/nvidia-modeset
I0424 07:05:10.989650 3010 nvc_info.c:274] listing ipc /run/nvidia-persistenced/socket
W0424 07:05:10.989668 3010 nvc_info.c:278] missing ipc /tmp/nvidia-mps
I0424 07:05:10.989675 3010 nvc_info.c:494] requesting device information with ''
I0424 07:05:10.995334 3010 nvc_info.c:524] listing device /dev/nvidia0 (GPU-4cfe4f25-9d56-b1f1-edb8-dfa13fc461ae at 00000000:00:1e.0)
NVRM version:   440.33.01
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Tesla
GPU UUID:       GPU-4cfe4f25-9d56-b1f1-edb8-dfa13fc461ae
Bus Location:   00000000:00:1e.0
Architecture:   7.5
  • Kernel version from uname -a
Linux ip-172-31-32-87 4.15.0-1057-aws #59-Ubuntu SMP Wed Dec 4 10:02:00 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
==============NVSMI LOG==============

Timestamp                           : Fri Apr 24 07:06:35 2020
Driver Version                      : 440.33.01
CUDA Version                        : 10.2

Attached GPUs                       : 1
GPU 00000000:00:1E.0
    Product Name                    : Tesla T4
    Product Brand                   : Tesla
    Display Mode                    : Enabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 4000
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 1561719002810
    GPU UUID                        : GPU-4cfe4f25-9d56-b1f1-edb8-dfa13fc461ae
    Minor Number                    : 0
    VBIOS Version                   : 90.04.84.00.06
    MultiGPU Board                  : No
    Board ID                        : 0x1e
    GPU Part Number                 : 900-2G183-0000-001
    Inforom Version
        Image Version               : G183.0200.00.02
        OEM Object                  : 1.1
        ECC Object                  : 5.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    GPU Virtualization Mode
        Virtualization Mode         : Pass-Through
        Host VGPU Mode              : N/A
    IBMNPU
        Relaxed Ordering Mode       : N/A
    PCI
        Bus                         : 0x00
        Device                      : 0x1E
        Domain                      : 0x0000
        Device Id                   : 0x1EB810DE
        Bus Id                      : 00000000:00:1E.0
        Sub System Id               : 0x12A210DE
        GPU Link Info
            PCIe Generation

  • Docker version from docker version
 docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b7f0
 Built:             Wed Mar 11 01:25:46 2020
 OS/Arch:           linux/amd64
 Experimental:      false
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                        Version            Architecture       Description
+++-===========================-==================-==================-============================================================
un  libgldispatch0-nvidia       <none>             <none>             (no description available)
ii  libnvidia-cfg1-440:amd64    440.33.01-0ubuntu1 amd64              NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any          <none>             <none>             (no description available)
un  libnvidia-common            <none>             <none>             (no description available)
ii  libnvidia-common-440        440.33.01-0ubuntu1 all                Shared files used by the NVIDIA libraries
ii  libnvidia-compute-440:amd64 440.33.01-0ubuntu1 amd64              NVIDIA libcompute package
ii  libnvidia-container-tools   1.0.7-1            amd64              NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64  1.0.7-1            amd64              NVIDIA container runtime library
un  libnvidia-decode            <none>             <none>             (no description available)
ii  libnvidia-decode-440:amd64  440.33.01-0ubuntu1 amd64              NVIDIA Video Decoding runtime libraries
un  libnvidia-encode            <none>             <none>             (no description available)
ii  libnvidia-encode-440:amd64  440.33.01-0ubuntu1 amd64              NVENC Video Encoding runtime library
un  libnvidia-fbc1              <none>             <none>             (no description available)
ii  libnvidia-fbc1-440:amd64    440.33.01-0ubuntu1 amd64              NVIDIA OpenGL-based Framebuffer Capture runtime library
un  libnvidia-gl                <none>             <none>             (no description available)
ii  libnvidia-gl-440:amd64      440.33.01-0ubuntu1 amd64              NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
un  libnvidia-ifr1              <none>             <none>             (no description available)
ii  libnvidia-ifr1-440:amd64    440.33.01-0ubuntu1 amd64              NVIDIA OpenGL-based Inband Frame Readback runtime library
un  libnvidia-ml1               <none>             <none>             (no description available)
un  nvidia-304                  <none>             <none>             (no description available)
un  nvidia-340                  <none>             <none>             (no description available)
un  nvidia-384                  <none>             <none>             (no description available)
un  nvidia-390                  <none>             <none>             (no description available)
ii  nvidia-compute-utils-440    440.33.01-0ubuntu1 amd64              NVIDIA compute utilities
un  nvidia-container-runtime    <none>             <none>             (no description available)
un  nvidia-container-runtime-ho <none>             <none>             (no description available)
ii  nvidia-container-toolkit    1.0.5-1            amd64              NVIDIA container runtime hook
ii  nvidia-dkms-440             440.33.01-0ubuntu1 amd64              NVIDIA DKMS package
un  nvidia-dkms-kernel          <none>             <none>             (no description available)
ii  nvidia-driver-440           440.33.01-0ubuntu1 amd64              NVIDIA driver metapackage
un  nvidia-driver-binary        <none>             <none>             (no description available)
un  nvidia-kernel-common        <none>             <none>             (no description available)
ii  nvidia-kernel-common-440    440.33.01-0ubuntu1 amd64              Shared files used with the kernel module
un  nvidia-kernel-source        <none>             <none>             (no description available)
ii  nvidia-kernel-source-440    440.33.01-0ubuntu1 amd64              NVIDIA kernel source package
un  nvidia-legacy-340xx-vdpau-d <none>             <none>             (no description available)
ii  nvidia-modprobe             440.33.01-0ubuntu1 amd64              Load the NVIDIA kernel driver and create device files
un  nvidia-opencl-icd           <none>             <none>             (no description available)
un  nvidia-persistenced         <none>             <none>             (no description available)
ii  nvidia-prime                0.8.8.2            all                Tools to enable NVIDIA's Prime
ii  nvidia-settings             440.33.01-0ubuntu1 amd64              Tool for configuring the NVIDIA graphics driver
un  nvidia-settings-binary      <none>             <none>             (no description available)
un  nvidia-smi                  <none>             <none>             (no description available)
un  nvidia-utils                <none>             <none>             (no description available)
ii  nvidia-utils-440            440.33.01-0ubuntu1 amd64              NVIDIA driver support binaries
un  nvidia-vdpau-driver         <none>             <none>             (no description available)
ii  xserver-xorg-video-nvidia-4 440.33.01-0ubuntu1 amd64              NVIDIA binary Xorg driver
  • NVIDIA container library version from nvidia-container-cli -V
version: 1.0.7
build date: 2020-01-21T18:59+00:00
build revision: b71f87c04b8eca8a16bf60995506c35c937347d9
build compiler: x86_64-linux-gnu-gcc-7 7.4.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • Docker command, image and tag used
    The incubator-mxnet repo uses the nvidia/cuda-10-1 Docker image as its base image.
    The command that fails for me:
sudo docker run --gpus all -v /home/ubuntu/incubator-mxnet:/work/mxnet -v /home/ubuntu/incubator-mxnet/build:/work/build  mxnetci/build.ubuntu_gpu_cu101         /work/runtime_functions.sh         integrationtest_ubuntu_gpu_python
@ChaiBapchya
Author

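Root cause of this issue:

  1. CUDA driver mismatch between the host and the Docker container: the kernel driver is the same (440.33.01), but the reported CUDA version differs (10.2 on the host, 10.1 in the container)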

Host machine

    Fri Apr 24 21:45:38 2020
    +-----------------------------------------------------------------------------+
    NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2

Docker container

    nvidia-smi
    Fri Apr 24 21:46:58 2020
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.1
  2. The compat libcuda.so is on the library search path
    https://github.com/apache/incubator-mxnet/blob/v1.7.x/ci/docker/Dockerfile.build.ubuntu_gpu_cu101#L82
    ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/cuda/compat
As a result, libmxnet.so resolves libcuda.so.1 to the compat copy rather than the host driver's library:
    $ ldd incubator-mxnet/build/libmxnet.so
    ...
    libcuda.so.1 => /usr/local/cuda/compat/libcuda.so.1 (0x00007f819bb25000)

Solution:

Remove the compat libcuda.so

Reasoning:

If the host has a new enough driver, there is no need to use the compat lib.
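
The check described above can be sketched in Python. This is a minimal sketch, not part of the MXNet CI: the function names `libcuda_resolution`, `uses_compat_libcuda` and `strip_compat_dirs` are hypothetical, and it assumes the compat directory is the only search-path entry whose last component is `compat`. It parses `ldd` output to see where libcuda.so.1 resolves, and shows what the library search path looks like with the compat entry dropped.

```python
import re


def libcuda_resolution(ldd_output):
    """Return the absolute path libcuda.so.1 resolves to in `ldd` output, or None."""
    for line in ldd_output.splitlines():
        # Only accept an absolute path, so "=> not found" is reported as None.
        match = re.match(r"\s*libcuda\.so\.1\s*=>\s*(/\S+)", line)
        if match:
            return match.group(1)
    return None


def uses_compat_libcuda(ldd_output):
    """True if libcuda.so.1 resolves to a file inside a .../compat/ directory."""
    path = libcuda_resolution(ldd_output)
    return path is not None and "/compat/" in path


def strip_compat_dirs(ld_library_path):
    """Drop empty entries and entries whose last path component is 'compat'."""
    kept = [
        entry
        for entry in ld_library_path.split(":")
        if entry and not entry.rstrip("/").endswith("/compat")
    ]
    return ":".join(kept)


# The ldd line quoted in this comment.
sample = "    libcuda.so.1 => /usr/local/cuda/compat/libcuda.so.1 (0x00007f819bb25000)"
print(uses_compat_libcuda(sample))
print(strip_compat_dirs("/usr/local/nvidia/lib64:/usr/local/cuda/compat"))
```

In practice the input would come from running `ldd` on libmxnet.so inside the container; the sample string here is the line from the output above.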

@Thunder003

I ran into the same problem. I was installing CUDA 11.1 on a server machine whose Docker image had a driver version compatible with CUDA 10. Switching to a different Docker image fixed it.
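
Whether a driver is "new enough" for a given CUDA toolkit comes down to comparing dotted version strings. The sketch below shows one way to do that. The helper names are hypothetical, and the minimum version in the example is an illustrative placeholder, not an NVIDIA figure: the real minimum for each toolkit should be taken from NVIDIA's CUDA release notes.

```python
def parse_version(version):
    """Turn a dotted version such as '440.33.01' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def driver_is_new_enough(installed, required_minimum):
    """Compare two dotted versions numerically, padding the shorter with zeros."""
    a = parse_version(installed)
    b = parse_version(required_minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Placeholder threshold for illustration only; look up the real value.
HYPOTHETICAL_MINIMUM = "450.0"

# 440.33.01 is the host driver version reported earlier in this thread.
print(driver_is_new_enough("440.33.01", HYPOTHETICAL_MINIMUM))
```

Comparing integer tuples rather than raw strings avoids lexical surprises such as "99.0" sorting above "440.0".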

@barzan-hayati

barzan-hayati commented Jul 12, 2023

1. Issue or feature description

CUDA: Check failed: e == cudaSuccess (803 vs. 0) : system has unsupported display driver / cuda driver combination

I hit the same error with the mxnet/python:1.9.1_gpu_cu112_py3 Docker container; removing the compat libcuda.so inside the container resolved it.

Solution:

Remove the compat libcuda.so

After doing that, nvidia-smi inside the container reports:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   35C    P8     6W /  75W |    277MiB /  4096MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

and on the host:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
| 26%   35C    P8     6W /  75W |    263MiB /  4096MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1085      G   /usr/lib/xorg/Xorg                 92MiB |
|    0   N/A  N/A      1313      G   /usr/bin/gnome-shell               26MiB |
|    0   N/A  N/A      1874      G   /usr/lib/firefox/firefox          141MiB |
+-----------------------------------------------------------------------------+

As you can see, the host and the container now report the same CUDA version. Before applying this fix, the container reported CUDA Version 11.2.
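
The host-versus-container comparison above can be automated by parsing the nvidia-smi header line. This is a minimal sketch with hypothetical helper names (`parse_smi_header`, `same_cuda_stack`); it assumes the header has the layout shown in this thread, and other nvidia-smi versions may format it differently.

```python
import re

# Matches the banner line, e.g.
# "| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |"
HEADER_RE = re.compile(
    r"NVIDIA-SMI\s+[\d.]+\s+"
    r"Driver Version:\s+(?P<driver>[\d.]+)\s+"
    r"CUDA Version:\s+(?P<cuda>[\d.]+)"
)


def parse_smi_header(text):
    """Return (driver_version, cuda_version) from nvidia-smi banner text."""
    match = HEADER_RE.search(text)
    if match is None:
        raise ValueError("no nvidia-smi header line found")
    return match.group("driver"), match.group("cuda")


def same_cuda_stack(host_text, container_text):
    """True if host and container report the same driver and CUDA versions."""
    return parse_smi_header(host_text) == parse_smi_header(container_text)


host = "| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |"
container = "| NVIDIA-SMI 510.108.03   Driver Version: 510.108.03   CUDA Version: 11.6     |"
print(same_cuda_stack(host, container))
```

A differing CUDA version with an identical driver version, as in the original report (10.2 on the host, 10.1 in the container), is the pattern that pointed to the compat libcuda.so there.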
