Allow Nvidia driver script to set recommendations for LD_PRELOAD #754

Open
wants to merge 33 commits into
base: 2023.06-software.eessi.io

Conversation

ocaisa
Member

@ocaisa ocaisa commented Sep 27, 2024

No description provided.

eessi-bot bot commented Sep 27, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software

Instance boegel-bot-deucalion is configured to build for:

  • architectures: aarch64/a64fx
  • repositories: eessi.io-2023.06-software

eessi-bot bot commented Sep 27, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

@ocaisa ocaisa marked this pull request as ready for review September 27, 2024 16:02
@ocaisa
Member Author

ocaisa commented Sep 27, 2024

Example output:

[rocky@ip-172-31-27-81 software-layer]$  ./scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh --ld-preload --no-download
Found NVIDIA GPU driver version 545.23.08
Found host CUDA version 12.3
Using default list of libraries
Matched 48 CUDA Libraries

When attempting to use LD_PRELOAD we exclude anything related to graphics
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libGL.so.1.
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libGL.so.
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libGLX_nvidia.so.0.
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libGLX.so.0.
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libGLX.so.
libwayland-server.so.0 is NOT in the provided  preload list, filtering /lib64/libnvidia-egl-wayland.so.1.
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libnvidia-fbc.so.1.
libXext.so.6 is NOT in the provided  preload list, filtering /lib64/libnvidia-fbc.so.
libXNVCtrl.so.0 is NOT in the provided  preload list, filtering /lib64/libnvidia-gtk3.so.545.23.08.

The recommended way to use LD_PRELOAD is to only use it when you need to:

export EESSI_GPU_LD_PRELOAD="/lib64/libcuda.so.1:/lib64/libcuda.so:/lib64/libcudadebugger.so.1:/lib64/libnvcuvid.so.1:/lib64/libnvcuvid.so:/lib64/libnvidia-cfg.so.1:/lib64/libnvidia-cfg.so:/lib64/libnvidia-eglcore.so.545.23.08:/lib64/libnvidia-encode.so.1:/lib64/libnvidia-encode.so:/lib64/libnvidia-glcore.so.545.23.08:/lib64/libnvidia-glsi.so.545.23.08:/lib64/libnvidia-glvkspirv.so.545.23.08:/lib64/libnvidia-gpucomp.so.545.23.08:/lib64/libnvidia-ml.so.1:/lib64/libnvidia-ml.so:/lib64/libnvidia-nvvm.so.4:/lib64/libnvidia-nvvm.so:/lib64/libnvidia-opencl.so.1:/lib64/libnvidia-opticalflow.so.1:/lib64/libnvidia-ptxjitcompiler.so.1:/lib64/libnvidia-ptxjitcompiler.so:/lib64/libnvidia-rtcore.so.545.23.08:/lib64/libnvidia-tls.so.545.23.08:/lib64/libnvoptix.so.1:/lib64/libOpenCL.so.1"
export EESSI_OVERRIDE_GPU_CHECK="1"

Then you can set LD_PRELOAD only when you want to run a GPU application, e.g.,
    LD_PRELOAD="$EESSI_GPU_LD_PRELOAD" device_query

@boegel boegel added the 2023.06-software.eessi.io and accel:nvidia labels Oct 9, 2024
@boegel
Contributor

boegel commented Oct 9, 2024

@ocaisa There are duplicate entries here: libcuda.so is a symlink to libcuda.so.1, so only one of them is needed.

# Filter out all symlinks and libraries that have missing library dependencies under EESSI
filtered_libraries=()
for library in "${matched_libraries[@]}"; do
if [ ! -L "$library" ]; then
Member Author

This is too aggressive. Instead, we should just resolve each symlink and remove duplicate entries.
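
A minimal sketch of the approach suggested here, assuming GNU `readlink -f` is available on the host: resolve every matched path to its real file, then keep only the first occurrence of each target. The function name `resolve_and_dedupe` is hypothetical and is not taken from the script in this PR.

```shell
# Hypothetical helper (not part of link_nvidia_host_libraries.sh):
# print the real file behind each argument, one per line, without duplicates.
resolve_and_dedupe() {
    for library in "$@"; do
        # readlink -f follows every symlink in the path (GNU coreutils)
        resolved=$(readlink -f -- "$library") || continue
        # Skip dangling symlinks and paths that do not exist
        [ -e "$resolved" ] && printf '%s\n' "$resolved"
    done | awk '!seen[$0]++'   # keep the first occurrence, preserve order
}
```

With this, passing both `/lib64/libcuda.so` and `/lib64/libcuda.so.1` would print a single line whenever both resolve to the same file. Paths containing newlines are not handled by this sketch.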

@ocaisa
Member Author

ocaisa commented Oct 9, 2024

This results in about 400MB of preloaded libraries:

{EESSI 2023.06} [rocky@ip-172-31-20-85 software-layer]$ IFS=':'; for path in $EESSI_GPU_LD_PRELOAD; do ls -lh $path; done; unset IFS
-rwxr-xr-x 1 root root 29M Nov  6  2023 /usr/lib64/libcuda.so.545.23.08
-rwxr-xr-x 1 root root 11M Nov  6  2023 /usr/lib64/libcudadebugger.so.545.23.08
-rwxr-xr-x 1 root root 9.6M Nov  6  2023 /usr/lib64/libnvcuvid.so.545.23.08
-rwxr-xr-x 1 root root 269K Nov  6  2023 /usr/lib64/libnvidia-cfg.so.545.23.08
-rwxr-xr-x 1 root root 566K Nov  6  2023 /usr/lib64/libnvidia-glsi.so.545.23.08
-rwxr-xr-x 1 root root 8.7M Nov  6  2023 /usr/lib64/libnvidia-glvkspirv.so.545.23.08
-rwxr-xr-x 1 root root 42M Nov  7  2023 /usr/lib64/libnvidia-gpucomp.so.545.23.08
-rwxr-xr-x 1 root root 1.9M Nov  6  2023 /usr/lib64/libnvidia-ml.so.545.23.08
-rwxr-xr-x 1 root root 83M Nov  7  2023 /usr/lib64/libnvidia-nvvm.so.545.23.08
-rwxr-xr-x 1 root root 24M Nov  6  2023 /usr/lib64/libnvidia-opencl.so.545.23.08
-rwxr-xr-x 1 root root 26M Nov  6  2023 /usr/lib64/libnvidia-ptxjitcompiler.so.545.23.08
-rwxr-xr-x 1 root root 103M Nov  7  2023 /usr/lib64/libnvidia-rtcore.so.545.23.08
-rwxr-xr-x 1 root root 19K Nov  6  2023 /usr/lib64/libnvidia-tls.so.545.23.08
-rwxr-xr-x 1 root root 58M Nov  7  2023 /usr/lib64/libnvoptix.so.545.23.08
-rwxr-xr-x 1 root root 131K Apr 12  2021 /usr/lib64/libOpenCL.so.1.0.0
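
The sizes listed above add up to roughly 397M, which matches the 400MB estimate. The total can also be computed directly, as in the sketch below. It assumes GNU `stat` (for `-c %s`), and `preload_total_bytes` is a hypothetical helper name. Unlike the one-liner above, it splits the list without changing `IFS` in the calling shell.

```shell
# Hypothetical helper: sum the on-disk size, in bytes, of every file in a
# colon-separated preload list. Entries that cannot be read are skipped.
preload_total_bytes() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r path; do
        [ -n "$path" ] || continue
        # -L follows symlinks so the real library is measured (GNU stat)
        stat -L -c %s -- "$path" 2>/dev/null
    done | awk '{ total += $1 } END { printf "%.0f\n", total }'
}
```

Calling `preload_total_bytes "$EESSI_GPU_LD_PRELOAD"` prints the total in bytes.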

@ocaisa
Member Author

ocaisa commented Oct 10, 2024

@boegel I've played with this a lot today and I'm happy with the functionality now:

{EESSI 2023.06} [rocky@ip-172-31-20-85 software-layer]$ ./scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh --no-download --ld-preload
Found host CUDA version 7.5
Found NVIDIA GPU driver version 545.23.08
Using default list of libraries
Matched 48 CUDA Libraries

When attempting to use LD_PRELOAD we exclude anything related to graphics
Match found for libcuda.so for CUDA compat libraries
Match found for libcudadebugger.so for CUDA compat libraries
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libEGL.so.1
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libEGL.so
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv1_CM.so.1
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv1_CM.so
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv2.so.2
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libGLESv2.so
libGLX.so.0 is NOT in the provided preload list, filtering /lib64/libGL.so.1
libGLX.so.0 is NOT in the provided preload list, filtering /lib64/libGL.so
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX_nvidia.so.0
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so.0
libXext.so.6 is NOT in the provided preload list, filtering /lib64/libGLX.so
libwayland-server.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-egl-wayland.so.1
libnvcuvid.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-encode.so.1
libnvcuvid.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-encode.so
libGL.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so.1
libGL.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-fbc.so
libXNVCtrl.so.0 is NOT in the provided preload list, filtering /lib64/libnvidia-gtk3.so.545.23.08
Match found for libnvidia-nvvm.so for CUDA compat libraries
libnvcuvid.so.1 is NOT in the provided preload list, filtering /lib64/libnvidia-opticalflow.so.1
Match found for libnvidia-ptxjitcompiler.so for CUDA compat libraries
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libOpenGL.so.0
libGLdispatch.so.0 is NOT in the provided preload list, filtering /lib64/libOpenGL.so

The recommended way to use LD_PRELOAD is to only use it when you need to.

A minimal preload which should work in most cases:
export EESSI_GPU_COMPAT_LD_PRELOAD="/usr/lib64/libcuda.so.545.23.08:/usr/lib64/libcudadebugger.so.545.23.08:/usr/lib64/libnvidia-nvvm.so.545.23.08:/usr/lib64/libnvidia-ptxjitcompiler.so.545.23.08"

A corner-case full preload (which is hard on memory) for exceptional use:
export EESSI_GPU_LD_PRELOAD="/usr/lib64/libcuda.so.545.23.08:/usr/lib64/libcudadebugger.so.545.23.08:/usr/lib64/libEGL_nvidia.so.545.23.08:/usr/lib64/libGLdispatch.so.0.0.0:/usr/lib64/libGLESv1_CM_nvidia.so.545.23.08:/usr/lib64/libGLESv2_nvidia.so.545.23.08:/usr/lib64/libnvcuvid.so.545.23.08:/usr/lib64/libnvidia-cfg.so.545.23.08:/usr/lib64/libnvidia-eglcore.so.545.23.08:/usr/lib64/libnvidia-glcore.so.545.23.08:/usr/lib64/libnvidia-glsi.so.545.23.08:/usr/lib64/libnvidia-glvkspirv.so.545.23.08:/usr/lib64/libnvidia-gpucomp.so.545.23.08:/usr/lib64/libnvidia-ml.so.545.23.08:/usr/lib64/libnvidia-nvvm.so.545.23.08:/usr/lib64/libnvidia-opencl.so.545.23.08:/usr/lib64/libnvidia-ptxjitcompiler.so.545.23.08:/usr/lib64/libnvidia-rtcore.so.545.23.08:/usr/lib64/libnvidia-tls.so.545.23.08:/usr/lib64/libnvoptix.so.545.23.08:/usr/lib64/libOpenCL.so.1.0.0"
export EESSI_OVERRIDE_GPU_CHECK="1"

Then you can set LD_PRELOAD only when you want to run a GPU application, e.g.,
    LD_PRELOAD="$EESSI_GPU_COMPAT_LD_PRELOAD" device_query
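
To make the "only when you need to" advice harder to get wrong, the per-command pattern above could be wrapped in a small shell function. This is a sketch only: `run_with_gpu_preload` is a hypothetical name and is not provided by the script or by EESSI.

```shell
# Hypothetical wrapper: run one command with the minimal preload list,
# leaving LD_PRELOAD untouched for everything else in the session.
run_with_gpu_preload() {
    if [ -z "${EESSI_GPU_COMPAT_LD_PRELOAD:-}" ]; then
        echo "EESSI_GPU_COMPAT_LD_PRELOAD is not set" >&2
        return 1
    fi
    # For an external command, the assignment applies to that command only
    LD_PRELOAD="$EESSI_GPU_COMPAT_LD_PRELOAD" "$@"
}
```

Usage would then be `run_with_gpu_preload device_query`, with the variable exported beforehand from the output of the linking script.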

@ocaisa
Member Author

ocaisa commented Oct 17, 2024

bot: build repo:eessi.io-2023.06-software arch:x86_64/generic

eessi-bot bot commented Oct 17, 2024

Updates by the bot instance eessi-bot-mc-aws
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/generic from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/generic
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/generic resulted in:

Updates by the bot instance boegel-bot-deucalion
  • account ocaisa has NO permission to send commands to the bot

eessi-bot bot commented Oct 17, 2024

Updates by the bot instance eessi-bot-mc-azure
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/generic from ocaisa

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/generic
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/generic resulted in:

    • no jobs were submitted

eessi-bot bot commented Oct 17, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-generic for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.10/pr_754/23806

date | job status | comment
Oct 17 12:33:27 UTC 2024 | submitted | job id 23806 awaits release by job manager
Oct 17 12:33:30 UTC 2024 | released | job awaits launch by Slurm scheduler
Oct 17 12:34:35 UTC 2024 | running | job 23806 is running
Oct 17 12:40:51 UTC 2024 | finished |
😁 SUCCESS
Details
✅ job output file slurm-23806.out
✅ no message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
✅ found message(s) matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-generic-1729168457.tar.gz
size: 0 MiB (4682 bytes)
entries: 1
modules under 2023.06/software/linux/x86_64/generic/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/generic/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/generic
2023.06/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
Oct 17 12:40:51 UTC 2024 test result
😁 SUCCESS
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %scale=1_node %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos /aeb2d9df @BotBuildTests:x86-64-generic-node+default
P: perf: 484.052 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %scale=1_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /04ff9ece @BotBuildTests:x86-64-generic-node+default
P: perf: 507.606 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_allreduce %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /31ac6ab9 @BotBuildTests:x86-64-generic-node+default
P: latency: 5.5 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_allreduce %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /f3be40a2 @BotBuildTests:x86-64-generic-node+default
P: latency: 5.3 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_alltoall %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /10e66fba @BotBuildTests:x86-64-generic-node+default
P: latency: 7.98 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_alltoall %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /5be57ae7 @BotBuildTests:x86-64-generic-node+default
P: latency: 7.91 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_latency %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /c8c9aff5 @BotBuildTests:x86-64-generic-node+default
P: latency: 0.62 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_latency %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /9795e491 @BotBuildTests:x86-64-generic-node+default
P: latency: 0.64 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_bw %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /48da21c5 @BotBuildTests:x86-64-generic-node+default
P: bandwidth: 10600.45 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_bw %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /1b8c1ca2 @BotBuildTests:x86-64-generic-node+default
P: bandwidth: 10212.21 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-23806.out
✅ no message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@ocaisa ocaisa added the ready-to-deploy label Oct 17, 2024
@ocaisa ocaisa changed the title from "Allow Nvidia driver script to set LD_PRELOAD" to "Allow Nvidia driver script to set recommendations for LD_PRELOAD" Oct 17, 2024
scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh (10 review threads, all outdated and resolved)
@TopRichard
Collaborator

I also tested the script within eessi_container:

Found host CUDA version 9.0
Found NVIDIA GPU driver version 535.129.03
Using downloaded list of libraries
Matched 41 CUDA Libraries
The host GPU driver libraries (v535.129.03) have already been linked! (based on /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/host/driver_version.txt)
Successfully created symlink between /cvmfs/software.eessi.io/host_injections/nvidia/aarch64/latest and lib in /cvmfs/software.eessi.io/host_injections/2023.06/compat/linux/aarch64
Host NVIDIA GPU drivers linked successfully for EESSI

Accepted all except one

Co-authored-by: TopRichard <121792457+TopRichard@users.noreply.github.com>
@ocaisa
Member Author

ocaisa commented Nov 7, 2024

@TopRichard This will need to be re-tested now to make sure the changes haven't had an unintended impact.

Labels
2023.06-software.eessi.io, accel:nvidia, ready-to-deploy
3 participants