Can't build ginkgo 1.8.0 with HIP #1693

Closed
lahwaacz opened this issue Oct 14, 2024 · 2 comments · Fixed by #1695

@lahwaacz
Contributor

Hi! I'm working on upgrading ginkgo-hpc for Arch Linux. So far I have the following build commands (the parts for the base and cuda packages are omitted):

  local common_cmake_flags=(
    -S $_pkgname-$pkgver -G Ninja
    -DCMAKE_BUILD_TYPE=None
    -DCMAKE_INSTALL_PREFIX=/usr
    -DGINKGO_BUILD_REFERENCE=ON
    -DGINKGO_BUILD_OMP=ON
    -DGINKGO_BUILD_MPI=ON
    -DGINKGO_HAVE_GPU_AWARE_MPI=ON
    -DGINKGO_BUILD_BENCHMARKS=ON
    -DGINKGO_BUILD_EXAMPLES=ON
    -DGINKGO_BUILD_DOC=ON
    -DGINKGO_BUILD_TESTS=ON
  )
  local _amdgpu_archs="gfx906"

  # -hip package
  # ginkgo has insufficient auto-detection for HIP_PATH https://github.com/ginkgo-project/ginkgo/issues/1624
  export ROCM_PATH=/opt/rocm
  export HIP_PATH="$ROCM_PATH"
  # Compile source code for supported GPU archs in parallel.
  # Use the gcc 13 toolchain as ROCm is not compatible with gcc 14.
  export HIPFLAGS="-parallel-jobs=$(nproc) --gcc-install-dir=/usr/lib/gcc/x86_64-pc-linux-gnu/13.3.0/"
  cmake -B build-hip "${common_cmake_flags[@]}" \
    -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
    -DCMAKE_CXX_FLAGS="${CXXFLAGS} -fcf-protection=none" \
    -DCMAKE_HIP_ARCHITECTURES="$_amdgpu_archs" \
    -DGINKGO_BUILD_CUDA=OFF \
    -DGINKGO_BUILD_HIP=ON \
    -DGINKGO_BUILD_SYCL=OFF
  cmake --build build-hip --verbose

I've backported eb97b49, but I still get this error, which does not seem to be fixed in math.hpp on develop:

/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:704:12: error: no matching function for call to 'zero'
  704 |     return zero<T>();
      |            ^~~~~~~
/build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:82:48: note: in instantiation of function template specialization 'gko::zero<std::complex<float>>' requested here
   82 |                     i < row_end ? in_vals[i] : zero(values[out_idx]);
      |                                                ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/omp/base/kernel_launch.hpp:34:17: note: in instantiation of function template specialization 'gko::kernels::omp::sellp::fill_in_matrix_data(std::shared_ptr<const DefaultExecutor>, const device_matrix_data<complex<float>, int> &, const int64 *, matrix::Sellp<complex<float>, int> *)::(anonymous class)::operator()<long, const int *, const std::complex<float> *, const long *, unsigned long, const unsigned long *, int *, std::complex<float> *>' requested here
   34 |         [&]() { fn(i, args...); }();
      |                 ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/omp/base/kernel_launch.hpp:34:15: note: while substituting into a lambda expression here
   34 |         [&]() { fn(i, args...); }();
      |               ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/omp/base/kernel_launch.hpp:111:5: note: in instantiation of function template specialization 'gko::kernels::omp::(anonymous namespace)::run_kernel_impl<(lambda at /build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:67:9), const int *, const std::complex<float> *, const long *, unsigned long, const unsigned long *, int *, std::complex<float> *>' requested here
  111 |     run_kernel_impl(exec, fn, size, map_to_device(args)...);
      |     ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:65:5: note: in instantiation of function template specialization 'gko::kernels::omp::run_kernel<(lambda at /build/ginkgo-hpc/src/ginkgo-1.8.0/common/unified/matrix/sellp_kernels.cpp:67:9), const int *, const std::complex<float> *, const long *&, unsigned long, const unsigned long *, int *, std::complex<float> *>' requested here
   65 |     run_kernel(
      |     ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:686:1: note: candidate template ignored: requirement '!std::is_same<std::complex<float>, std::complex<float>>::value' was not satisfied [with T = std::complex<float>]
  686 | zero()
      | ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:628:33: note: candidate function not viable: call to __host__ function from __device__ function
  628 | GKO_INLINE __host__ constexpr T zero()
      |                                 ^
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:702:35: note: candidate function template not viable: requires 1 argument, but 0 were provided
  702 | GKO_INLINE __device__ constexpr T zero(const T&)
      |                                   ^    ~~~~~~~~
/build/ginkgo-hpc/src/ginkgo-1.8.0/include/ginkgo/core/base/math.hpp:644:33: note: candidate function template not viable: requires 1 argument, but 0 were provided
  644 | GKO_INLINE __host__ constexpr T zero(const T&)
      |                                 ^    ~~~~~~~~

Note that this is part of a ROCm 6.2.2 rebuild; we were not able to build ginkgo 1.8.0 with ROCm 6.0. I'm also not sure about the -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc flag, which was not needed before, but without it I get errors like this (maybe an ABI error when C++ code compiled by GCC is linked with HIP code?):

[1116/1485] : && /usr/bin/c++ -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection         -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/ginkgo-hpc/src=/usr/src/debug/ginkgo-hpc -flto=auto -Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now          -Wl,-z,pack-relative-relocs -flto=auto     -Wl,-rpath -Wl,/usr/lib -Wl,--enable-new-dtags examples/ir-ilu-preconditioned-solver/CMakeFiles/ir-ilu-preconditioned-solver.dir/ir-ilu-preconditioned-solver.cpp.o -o examples/ir-ilu-preconditioned-solver/ir-ilu-preconditioned-solver  -Wl,-rpath,/build/ginkgo-hpc/src/build-hip/lib:/opt/rocm/lib  lib/libginkgo.so.1.8.0  lib/libginkgo_omp.so.1.8.0  lib/libginkgo_cuda.so.1.8.0  lib/libginkgo_reference.so.1.8.0  lib/libginkgo_hip.so.1.8.0  lib/libginkgo_dpcpp.so.1.8.0  lib/libginkgo_device.so.1.8.0  /usr/lib/libmpi.so  -Wl,-rpath-link,/opt/rocm/lib && :
FAILED: examples/ir-ilu-preconditioned-solver/ir-ilu-preconditioned-solver
: && /usr/bin/c++ -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions         -Wp,-D_FORTIFY_SOURCE=3 -Wformat -Werror=format-security         -fstack-clash-protection -fcf-protection         -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -Wp,-D_GLIBCXX_ASSERTIONS -g -ffile-prefix-map=/build/ginkgo-hpc/src=/usr/src/debug/ginkgo-hpc -flto=auto -Wl,-O1 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now          -Wl,-z,pack-relative-relocs -flto=auto     -Wl,-rpath -Wl,/usr/lib -Wl,--enable-new-dtags examples/ir-ilu-preconditioned-solver/CMakeFiles/ir-ilu-preconditioned-solver.dir/ir-ilu-preconditioned-solver.cpp.o -o examples/ir-ilu-preconditioned-solver/ir-ilu-preconditioned-solver  -Wl,-rpath,/build/ginkgo-hpc/src/build-hip/lib:/opt/rocm/lib  lib/libginkgo.so.1.8.0  lib/libginkgo_omp.so.1.8.0  lib/libginkgo_cuda.so.1.8.0  lib/libginkgo_reference.so.1.8.0  lib/libginkgo_hip.so.1.8.0  lib/libginkgo_dpcpp.so.1.8.0  lib/libginkgo_device.so.1.8.0  /usr/lib/libmpi.so  -Wl,-rpath-link,/opt/rocm/lib && :
/usr/bin/ld: /tmp/ccqgcF5r.ltrans2.ltrans.o: in function `gko::EnableDefaultFactory<gko::preconditioner::Jacobi<double, int>::Factory, gko::preconditioner::Jacobi<double, int>, gko::preconditioner::Jacobi<double, int>::parameters_type, gko::LinOpFactory>::generate_impl(std::shared_ptr<gko::LinOp const>) const':
/usr/include/c++/14.2.1/bits/shared_ptr.h:720:(.text+0x9452): undefined reference to `typeinfo for gko::HipExecutor'
/usr/bin/ld: /tmp/ccqgcF5r.ltrans2.ltrans.o:/usr/include/c++/14.2.1/bits/shared_ptr.h:720:(.text+0x99ee): undefined reference to `typeinfo for gko::HipExecutor'
/usr/bin/ld: lib/libginkgo_hip.so.1.8.0: undefined reference to `vtable for gko::HipExecutor'
collect2: error: ld returned 1 exit status
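
One way to narrow down the link failure above is to check which symbols libginkgo_hip.so leaves undefined. The following is only a diagnostic sketch: the helper name `undefined_hip_executor_symbols` is made up for illustration, and the library path is assumed from the build directory used in the commands above.

```shell
# Hypothetical helper: from `nm -DC` output on stdin, print the undefined ("U")
# dynamic symbols that mention gko::HipExecutor.
undefined_hip_executor_symbols() {
  awk '$1 == "U" && /gko::HipExecutor/'
}

# Assumed location of the HIP backend library inside the build tree.
lib=build-hip/lib/libginkgo_hip.so.1.8.0
if [ -f "$lib" ]; then
  # -D lists dynamic symbols, -C demangles C++ names.
  nm -DC "$lib" | undefined_hip_executor_symbols
fi
```

If the vtable or typeinfo for gko::HipExecutor shows up here, the next step would be to run the same check with `nm -DC --defined-only` on the other Ginkgo libraries to see whether any of them provides it.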
@MarcelKoch MarcelKoch self-assigned this Oct 15, 2024
@MarcelKoch
Member

I could reproduce the build issue with math.hpp using your settings and the ROCm 6.2 image. I think I also have a fix, which I'm currently testing.

One comment on your CMake flags: by setting -DGINKGO_HAVE_GPU_AWARE_MPI=ON, Ginkgo will assume that it is linked against an MPI library that supports device memory. If that is not the case, MPI applications will just crash without any indication as to why. So my suggestion would be to disable it and leave it to users to enable it explicitly, only if they know that their MPI supports device memory.
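
For Open MPI specifically, a packaging script could make this decision at build time. The sketch below assumes that a ROCm-enabled Open MPI lists `rocm` in the "MPI extensions" line of `ompi_info`, which is how the Open MPI 5 documentation describes checking for ROCm support. The helper name `has_rocm_extension` is hypothetical, and the check only reflects how the library was built, not whether device buffers work at run time.

```shell
# Hypothetical helper: succeeds if the `ompi_info` output on stdin lists
# "rocm" among the MPI extensions.
has_rocm_extension() {
  grep -Eiq 'MPI extensions:.*[ ,]rocm(,|[[:space:]]|$)'
}

# Default to OFF and only switch on when the installed Open MPI reports ROCm support.
gpu_aware=OFF
if command -v ompi_info >/dev/null 2>&1 && ompi_info | has_rocm_extension; then
  gpu_aware=ON
fi

# Print the resulting flag so it can be appended to the cmake invocation.
echo "-DGINKGO_HAVE_GPU_AWARE_MPI=$gpu_aware"
```

Other MPI implementations would need their own check, so defaulting to OFF remains the safe choice when nothing is detected.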

@lahwaacz
Contributor Author

lahwaacz commented Oct 15, 2024

On second thought, -DCMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc might not be the intended way to compile Ginkgo, since hipcc treats all source files as HIP language source files. When I used -DCMAKE_CXX_COMPILER=/opt/rocm/lib/llvm/bin/amdclang++ instead, the build actually passed.
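
For reference, the change amounts to swapping a single flag in the HIP configure step. This is a minimal sketch: `hip_flags_for` is a hypothetical helper, /opt/rocm is assumed as the ROCm prefix as in the original build commands, and the script only prints the flags rather than running cmake.

```shell
# Assumption: ROCm lives under /opt/rocm unless ROCM_PATH says otherwise.
rocm_path="${ROCM_PATH:-/opt/rocm}"

# Hypothetical helper: print the HIP-specific configure flags, one per line,
# for the C++ compiler given as the first argument.
hip_flags_for() {
  printf '%s\n' \
    "-DCMAKE_CXX_COMPILER=$1" \
    "-DGINKGO_BUILD_CUDA=OFF" \
    "-DGINKGO_BUILD_HIP=ON" \
    "-DGINKGO_BUILD_SYCL=OFF"
}

# Use amdclang++ as the C++ compiler instead of hipcc, which treats every
# source file as a HIP source file.
hip_flags_for "$rocm_path/lib/llvm/bin/amdclang++"
```

The printed flags replace the corresponding ones in the `cmake -B build-hip` call; the other flags from the original command stay unchanged.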

I have no idea why using g++ as the CXX compiler does not work anymore, though 🤷

As for MPI, in Arch Linux we specifically have a GPU-aware OpenMPI package and don't support switching to another MPI library.
