
CUDA_Runtime_Discovery Did not find cupti on Arm system with nvhpc #2433

Closed
williamfgc opened this issue Jul 2, 2024 · 9 comments · Fixed by JuliaGPU/CUDA_Runtime_Discovery.jl#13
Labels: bug (Something isn't working), needs information (Further information is requested)

Comments


williamfgc commented Jul 2, 2024

Describe the bug

CUDA.jl can't find cupti, even though the nvhpc 24.5 search location for the extras/CUPTI path is supported by CUDA_Runtime_Discovery, as in this line.
This is happening on an Arm cluster at OLCF (Wombat). Any help would be appreciated.

To reproduce

$ JULIA_DEBUG=CUDA_Runtime_Discovery julia

using CUDA

┌ Debug: Did not find cupti
└ @ CUDA_Runtime_Discovery ~/.julia/packages/CUDA_Runtime_Discovery/lnYIW/src/CUDA_Runtime_Discovery.jl:139
┌ Debug: Looking for library cupti, version 9.0.0 or 9.1.0 or 9.2.0 or 10.0.0 or 10.1.0 or ... in /sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda
│   all_names =
│    259-element Vector{String}:
│     "libcupti.so"
│       ....
│     "libcupti.so.2025.5.3"
│   all_locations =
│    5-element Vector{String}:
│     "/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda"
│     "/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/lib"
│     "/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/lib64"
│     "/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/libx64"
│     "/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/targets/sbsa-linux/lib"
└ @ CUDA_Runtime_Discovery ~/.julia/packages/CUDA_Runtime_Discovery/lnYIW/src/CUDA_Runt

This is reproducible on the CUDA.jl master branch, and the libraries do exist:

$ find /sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda -name "libcupti.so"
/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/11.8/extras/CUPTI/lib64/libcupti.so
/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/12.4/extras/CUPTI/lib64/libcupti.so

Expected behavior

CUDA_Runtime_Discovery should be able to find the existing nvhpc libraries.

Version info

Details on Julia: v1.10.4 for Arm

julia> versioninfo()
Julia Version 1.10.4
Commit 48d4fd48430 (2024-06-04 10:41 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (aarch64-linux-gnu)
  CPU: 80 × Neoverse-N1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, neoverse-n1)
Threads: 1 default, 0 interactive, 1 GC (on 80 virtual cores)
Environment:
  LD_LIBRARY_PATH = /sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/comm_libs/nvshmem/lib:/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/comm_libs/nccl/lib:/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/math_libs/lib64:/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/compilers/lib:/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/compilers/extras/qd/lib:/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/extras/CUPTI/lib64:/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/lib64

Details on CUDA:

julia> CUDA.versioninfo()
ERROR: CUDA runtime not found
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] functional
   @ ~/.julia/packages/CUDA/ZBbOx/src/initialization.jl:24 [inlined]
 [3] versioninfo(io::Base.TTY)
   @ CUDA ~/.julia/packages/CUDA/ZBbOx/src/utilities.jl:42
 [4] top-level scope
   @ REPL[2]:1

Additional context

Also tried pkg> add CUDA_Runtime_Discovery#master, but with the same outcome as above.
Setting CUDA.set_runtime_version!(v"12.4"; local_toolkit=true) did not help either.
CC @cwinogrodzki
nvidia-smi can see the GPUs:

nvidia-smi 
Tue Jul  2 18:06:56 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 0000000C:01:00.0 Off |                    0 |
| N/A   23C    P0              31W / 250W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 0000000D:01:00.0 Off |                    0 |
| N/A   23C    P0              33W / 250W |      0MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
williamfgc added the bug label on Jul 2, 2024

maleadt commented Jul 3, 2024

Please post the full debug output.

From the partial output, it looks like CUDA_Runtime_Discovery.jl has discovered /sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda as your CUDA toolkit, while CUPTI seems to be stored under /sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/11.8, which is not a valid CUDA toolkit layout. So I'm left wondering whether CUDA_Runtime_Discovery picked up that directory itself, or whether it was configured through the environment.

maleadt added the needs information label on Jul 3, 2024
williamfgc (Author) commented

@maleadt thanks for the quick response and guidance. We can confirm that cupti is found when setting the proper CUDA_HOME path. As you pointed out, the layout is non-standard because different CUDA versions need to coexist on this platform. I will go ahead and close this issue. Thanks a lot.
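
For anyone hitting the same layout, here is a minimal sketch of the workaround, assuming the versioned 12.4 toolkit under the NVHPC root (the path is specific to this Wombat install, and it assumes CUDA.set_runtime_version!(v"12.4"; local_toolkit=true) was run beforehand):

# Point discovery at one versioned toolkit inside the NVHPC root.
# Must run in a fresh session, before CUDA.jl initializes; equivalent
# to exporting CUDA_HOME in the shell.
ENV["CUDA_HOME"] = "/sw/wombat/Nvidia_HPC_SDK/Linux_aarch64/24.5/cuda/12.4"

using CUDA
CUDA.versioninfo()  # should now report the local 12.4 runtime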


carstenbauer commented Aug 7, 2024

FWIW, I ran into the same thing on DelftBlue (not an ARM system). Fixed by setting export CUDA_HOME=$NVHPC_ROOT/cuda/12.1.


maleadt commented Aug 7, 2024

Does $NVHPC_ROOT/cuda/12.1 contain a full CUDA SDK? If so, is there a way CUDA_Runtime_Discovery.jl should have picked that up? We normally look for e.g. ptxas discoverable on PATH, or of course at CUDA_HOME, but it would be nice if this happened automatically.

carstenbauer commented

> Does $NVHPC_ROOT/cuda/12.1 contain a full CUDA SDK?

Looks like it: tree_nvhpc_cuda_121.txt

> If so, is there a way CUDA_Runtime_Discovery.jl should have picked that up?

Good question. Maybe CUDA_Runtime_Discovery.jl can also look for NVHPC_ROOT (perhaps with a lower precedence than CUDA_HOME and co)? Interestingly, CUDA_HOME and co are not set by the nvhpc module.
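
Something roughly like this, perhaps (just a sketch; find_toolkit_from_nvhpc is a made-up name, not the package's actual API, and how to pick among versions is a policy question):

# Hypothetical NVHPC_ROOT fallback, consulted only after CUDA_HOME and co.
function find_toolkit_from_nvhpc()
    haskey(ENV, "NVHPC_ROOT") || return nothing
    cuda_dir = joinpath(ENV["NVHPC_ROOT"], "cuda")
    isdir(cuda_dir) || return nothing
    # keep only subdirectories whose names parse as versions, e.g. "12.1"
    candidates = [(tryparse(VersionNumber, d), d) for d in readdir(cuda_dir)
                  if isdir(joinpath(cuda_dir, d))]
    filter!(c -> first(c) !== nothing, candidates)
    isempty(candidates) && return nothing
    _, dir = last(sort(candidates; by = first))  # prefer the newest toolkit
    return joinpath(cuda_dir, dir)
end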


maleadt commented Aug 7, 2024

NVHPC_ROOT looks like an "official" NVIDIA env var, so yeah, this is probably something CUDA_Runtime_Discovery.jl should pick up. One complication, however, is the cuda/12.1 suffix: which specific version should be used? Isn't there anything in the environment indicating that v12.1 was selected here?

Since I don't have access to a system with NVHPC set up like that, could you maybe create a PR to CUDA_Runtime_Discovery.jl that works on your cluster?


carstenbauer commented Aug 7, 2024

> Isn't there anything in the environment indicating that v12.1 was selected here?

In my case I had used CUDA.set_runtime_version!(v"12.1"; local_toolkit=true), so the information is available as a preference, yes. (Otherwise, there is only a single subfolder that has a version-like name. But parsing this is pretty ugly, of course.)
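
For illustration, the preference-driven variant could look roughly like this (nvhpc_toolkit_for is a hypothetical helper, not existing API; it assumes the pinned version is already known, e.g. from the local runtime preference):

# Sketch: with the runtime version pinned via
# CUDA.set_runtime_version!(v"12.1"; local_toolkit=true),
# the NVHPC subfolder can be selected directly instead of guessed.
function nvhpc_toolkit_for(cuda_version::VersionNumber)
    root = get(ENV, "NVHPC_ROOT", nothing)
    root === nothing && return nothing
    dir = joinpath(root, "cuda", "$(cuda_version.major).$(cuda_version.minor)")
    return isdir(dir) ? dir : nothing
end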

carstenbauer commented

> Since I don't have access to a system with NVHPC set up like that, could you maybe create a PR to CUDA_Runtime_Discovery.jl that works on your cluster?

I'm busy with other stuff in the next couple of days, but I'll try to find some time for it.

carstenbauer commented

For future readers, we now try to deduce the CUDA path from the NVHPC_ROOT environment variable in CUDA_Runtime_Discovery.jl (JuliaGPU/CUDA_Runtime_Discovery.jl#13).
