Static CUDA build failure #567

Closed
v-dobrev opened this issue Oct 28, 2022 · 12 comments · Fixed by #568

Comments

@v-dobrev
Member

I was trying to build HiOp through Spack and noticed that the command

./bin/spack install -j 128 --fresh hiop+cuda cuda_arch=70

fails on LLNL's Lassen machine. The errors look like this:

/usr/tce/packages/cuda/cuda-11.5.0/lib64/libcurand_static.a(curand.o): In function `curandCreateGenerator':
curand.compute_86.cudafe1.cpp:(.text+0xdee4): undefined reference to `culibosEnterCriticalSection'
...

It looks like HiOp needs to link to libculibos.a in this case -- probably by adding it in this list:

if(HIOP_BUILD_STATIC)
  target_link_libraries(hiop_cuda INTERFACE
    CUDA::cusolver_static
    CUDA::cusparse_static
    CUDA::cudart_static
    CUDA::cublasLt_static
    CUDA::curand_static
  )
endif()

As a workaround I was able to build HiOp by adding the +shared variant in Spack. In that case both libcurand.so and libcurand_static.a are present at the command line and linking works fine.
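
A minimal sketch of the suggestion above, assuming that the CUDA::culibos imported target defined by CMake's FindCUDAToolkit module is available and that hiop_cuda is the interface target from the snippet. It is not taken from the HiOp sources. Whether listing the target last actually places libculibos.a after libcurand_static.a on the final link line is an assumption that should be checked with make VERBOSE=1.

```cmake
# Sketch only: assumes find_package(CUDAToolkit) has already run and that
# hiop_cuda is the INTERFACE target from the snippet above.
if(HIOP_BUILD_STATIC)
  target_link_libraries(hiop_cuda INTERFACE
    CUDA::cusolver_static
    CUDA::cusparse_static
    CUDA::cudart_static
    CUDA::cublasLt_static
    CUDA::curand_static
    # Assumption: naming culibos explicitly, after the archives that
    # reference its symbols, keeps libculibos.a behind them at link time.
    CUDA::culibos
  )
endif()
```

The placement matters because GNU ld scans static archives from left to right, so an archive that defines a symbol generally has to appear after the archive that references it.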

@cnpetra
Collaborator

cnpetra commented Oct 29, 2022

Thanks for reporting and for the suggestion, Veselin. Interesting that it fails on Lassen. We run the CI on Lassen and everything looks fine. We'll look into it.

@nychiang : can you replicate this on Lassen?

@cameronrutherford any chance this is spack-related?

@cameronrutherford
Collaborator

cameronrutherford commented Oct 29, 2022

https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html#culibos

The documentation suggests that this is only a static library, and that it shouldn't be touched by consumers.

I'm not sure where this dependency is being pulled in, but we could link to the target directly, or link against some of the static libraries mentioned when building shared.

EDIT: Don't think this is a spack issue FWIW

@pelesh
Collaborator

pelesh commented Oct 31, 2022

@v-dobrev, could you try to build HiOp with CUDA without Spack? If so, could you provide the output of make VERBOSE=1? I just built afresh from the develop branch with CUDA enabled on a similar Power9/V100 machine and I couldn't reproduce this issue.

@pelesh
Collaborator

pelesh commented Oct 31, 2022

The error undefined reference to culibosEnterCriticalSection looks like a bug in CMake. On my compile line I get libculibos.a correctly recognized by CMake. Below is a snippet from my make output.

... /.../cuda/11.5.2/lib64/libculibos.a /.../cuda/11.5.2/lib64/libcurand_static.a -lcudadevrt -lcudart_static -lrt

@v-dobrev, what is the CMake version you are using?

@cameronrutherford, are you sure we have minimum CMake version correctly specified in Spack and HiOp? I built HiOp with v3.21.3.

@cameronrutherford
Collaborator

@pelesh CMake version is 3.18 and consistent across CMake and Spack configuration. It's possible we need to bump to 3.20. We should probably always test with minimum versions in CI, so perhaps we should also start enforcing that.
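
Since the thread compares several CMake and CUDA combinations, a configure-time report along the following lines may make such comparisons easier. This is only a sketch: it assumes CMake 3.18 or newer (the minimum mentioned above), the FindCUDAToolkit module, and placement after the project() call. The variable name _culibos_location is made up for illustration.

```cmake
# Sketch only: prints the versions and paths that differ between the
# environments discussed in this thread.
cmake_minimum_required(VERSION 3.18)

find_package(CUDAToolkit REQUIRED)

message(STATUS "CMake version:        ${CMAKE_VERSION}")
message(STATUS "CUDA toolkit version: ${CUDAToolkit_VERSION}")
message(STATUS "CUDA library dir:     ${CUDAToolkit_LIBRARY_DIR}")

# CUDA::culibos exists only if FindCUDAToolkit located libculibos.a.
if(TARGET CUDA::culibos)
  # _culibos_location is a hypothetical helper variable.
  get_target_property(_culibos_location CUDA::culibos IMPORTED_LOCATION)
  message(STATUS "culibos found at:     ${_culibos_location}")
else()
  message(WARNING "CUDA::culibos target not found; static CUDA links may fail")
endif()
```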

@nychiang
Collaborator

I have no problem building HiOp on Lassen with cmake/3.20.2, gcc/8.3.1, and cuda/11.7.

@pelesh
Collaborator

pelesh commented Nov 1, 2022

I rebuilt from scratch on Power9/V100 using CMake 3.18 this time and everything still works just fine. I agree with @cameronrutherford, it is unlikely this is a Spack issue -- we use spack builds on all CI pipelines.

@v-dobrev
Member Author

v-dobrev commented Nov 4, 2022

> @v-dobrev, what is the CMake version you are using?

In Spack, I have cmake@3.20.2 configured as an external package.

I'm also using gcc@8.3.1 and cuda@11.5.0.

Note that the Spack spec I used results in a pure static build, i.e. the cmake config line has (among other flags): -DHIOP_BUILD_STATIC:BOOL=ON, -DHIOP_BUILD_SHARED:BOOL=OFF. MAGMA is also enabled because I enable +cuda.

I'll try a similar build outside of Spack and I'll report back.

@v-dobrev
Member Author

v-dobrev commented Nov 4, 2022

I was able to reproduce the issue outside of Spack on Lassen using the following steps:

Load the following modules:

ml gcc/8.3.1
ml cuda/11.5.0
ml cmake/3.20.2

resulting in the following loaded modules:

Currently Loaded Modules:
  1) StdEnv (S)   2) git/2.29.1   3) gcc/8.3.1   4) spectrum-mpi/rolling-release   5) cuda/11.5.0   6) cmake/3.20.2

Then clone HiOp and run:

cd hiop
mkdir build
cd build
cmake .. \
  -DCMAKE_INSTALL_PREFIX=../install \
  -DCMAKE_BUILD_TYPE=RelWithDebInfo \
  -DBUILD_TESTING=OFF \
  -DCMAKE_VERBOSE_MAKEFILE=ON \
  -DHIOP_USE_GPU=ON \
  -DHIOP_USE_MAGMA=OFF \
  -DHIOP_BUILD_STATIC=ON \
  -DHIOP_BUILD_SHARED=OFF \
  -DHIOP_USE_MPI=ON \
  -DHIOP_DEEPCHECKS=OFF \
  -DHIOP_USE_CUDA=ON \
  -DHIOP_USE_HIP=OFF \
  -DHIOP_USE_RAJA=OFF \
  -DHIOP_USE_UMPIRE=OFF \
  -DHIOP_WITH_KRON_REDUCTION=OFF \
  -DHIOP_SPARSE=OFF \
  -DHIOP_USE_COINHSL=OFF \
  -DHIOP_TEST_WITH_BSUB=OFF \
  -DHIOP_USE_GINKGO=OFF \
  -DHIOP_USE_CUSOLVER=OFF \
  -DMPI_C_COMPILER=mpicc \
  -DMPI_CXX_COMPILER=mpicxx \
  -DMPI_Fortran_COMPILER=mpif90 \
  -DCMAKE_CUDA_ARCHITECTURES=70 \
  -DHIOP_USE_STRUMPACK=OFF

And finally run make:

make

The error happens when linking the first test:

...
[ 47%] Linking CXX executable testMatrixSymSparse
...
/usr/tce/packages/gcc/gcc-8.3.1/bin/c++ -O2 -g -DNDEBUG -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-2020.08.19/lib -pthread CMakeFiles/testMatrixSymSparse.dir/testMatrixSymSparse.cpp.o CMakeFiles/testMatrixSymSparse.dir/LinAlg/matrixTestsSymSparseTriplet.cpp.o CMakeFiles/testMatrixSymSparse.dir/cmake_device_link.o -o testMatrixSymSparse   -L/usr/tce/packages/cuda/cuda-11.5.0/nvidia/targets/ppc64le-linux/lib/stubs  -L/usr/tce/packages/cuda/cuda-11.5.0/nvidia/targets/ppc64le-linux/lib  -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib ../src/libhiop.a /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpiprofilesupport.so /usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib/libmpi_ibm.so /usr/lib64/libessl.so /usr/lib64/libblas.so /usr/lib64/libessl.so /usr/lib64/libblas.so -lm -ldl /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcusolver_static.a /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcublas_static.a /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcusparse_static.a /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcudart_static.a -lpthread -ldl /usr/lib64/librt.so /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcublasLt_static.a /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libculibos.a /usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcurand_static.a -lcudadevrt -lcudart_static -lrt -lpthread -ldl 
/usr/tce/packages/cuda/cuda-11.5.0/nvidia/lib64/libcurand_static.a(curand.o): In function `curandCreateGenerator':
curand.compute_86.cudafe1.cpp:(.text+0xdee4): undefined reference to `culibosEnterCriticalSection'
curand.compute_86.cudafe1.cpp:(.text+0xdf28): undefined reference to `culibosLeaveCriticalSection'
curand.compute_86.cudafe1.cpp:(.text+0xe078): undefined reference to `culibosEnterCriticalSection'
curand.compute_86.cudafe1.cpp:(.text+0xe0fc): undefined reference to `culibosInitializeCriticalSection'
...

@jwang125
Collaborator

jwang125 commented Nov 4, 2022

I can reproduce this on develop. Changing two options, though, makes the build succeed. Set
-DHIOP_BUILD_SHARED=ON
and add
-DLAPACK_LIBRARIES='-llapack -lblas'.

@nychiang
Collaborator

nychiang commented Nov 4, 2022

@v-dobrev
There are two issues here:

  1. The culibos problem appears to be a legacy issue that should have been fixed as of CMake v3.17.
    See [1] [2] [3]. I have added culibos to our CMake file for the case when a static build is requested. The fix is in branch cuda-static-fix.

  2. Some LAPACK functions are missing from ESSL (see here). Without a path to LAPACK, find_package(LAPACK) on Lassen finds only ESSL and BLAS, and consequently the link fails with an undefined reference to `dposvx_'. To avoid this problem, add the CMake option -DLAPACK_LIBRARIES="-lessl -llapack -lblas" to your command.
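
The second point could be caught at configure time rather than at link time. The following is a hedged sketch, not part of HiOp: it assumes the C language is enabled in the project and that LAPACK_LIBRARIES has already been set, either by find_package(LAPACK) or on the command line. HIOP_LAPACK_HAS_DPOSVX is a hypothetical variable name.

```cmake
# Sketch only: probes LAPACK_LIBRARIES for the Fortran-mangled symbol
# named in the error message. Requires the C language to be enabled.
include(CheckFunctionExists)

# Link the probe program against whatever LAPACK/BLAS set was selected.
set(CMAKE_REQUIRED_LIBRARIES ${LAPACK_LIBRARIES})
# HIOP_LAPACK_HAS_DPOSVX is a hypothetical result variable.
check_function_exists(dposvx_ HIOP_LAPACK_HAS_DPOSVX)
unset(CMAKE_REQUIRED_LIBRARIES)

if(NOT HIOP_LAPACK_HAS_DPOSVX)
  message(WARNING
    "LAPACK_LIBRARIES='${LAPACK_LIBRARIES}' does not seem to provide dposvx_; "
    "consider -DLAPACK_LIBRARIES=\"-lessl -llapack -lblas\"")
endif()
```

A warning is used instead of a fatal error because the probe itself can fail for unrelated reasons, for example a missing Fortran runtime on the probe's link line.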

@cnpetra
Collaborator

cnpetra commented Nov 4, 2022

I also think this issue is related to Lassen's CUDA libs. But I will merge the PR; it seems to be harmless (time will tell).
