Dense matrix needs RAJA when GPU is used #676

Merged: 4 commits, Feb 5, 2024

Conversation

nychiang
Collaborator

Update the cmake file and add an assertion, to prevent compiling hiop with GPU support but without RAJA.

CLOSE #675

@cameronrutherford
Collaborator

I didn't have enough time to go through why Incline is failing and to fix the issue, but I am quite confused.

I had done some initial debugging to add more print statements and to work around the obvious errors, but I couldn't even push to the repo anymore...

ERROR: Permission to LLNL/hiop.git denied to cameronrutherford.
fatal: Could not read from remote repository.

I gave up on getting ROCm 4.5.1 working again and tried 5.3.0 instead. 4.5.1 seemed unfixable, while 5.3.0 seems more promising. ROCm 5.6.0+ and ROCm 6+ are both available now, so perhaps it would be easiest to re-vamp HiOp CI a little to be more like Re::Solve, which builds its own ROCm/HIP and clang from source.

cc @jaelynlitz if you have any idea what changed on incline recently

@nychiang or @cnpetra I think I might need to be re-added to the repo in order to push again?

https://github.com/ORNL/ReSolve/blob/develop/buildsystem/spack/incline/spack.yaml - here is the ReSolve YAML for comparison / as a reference

@jaelynlitz
Collaborator

I have a couple thoughts:

  • PNNL CI for hiop is still on the exasgd allocation, which will need to change or be dropped as that account is no longer funded.
  • there are some weird errors in the log. They could be the /tmp issues I was running into on the dl partition from the install of cuda12.3; we might have to ping Tim about cleaning up /tmp on the dmi nodes
/tmp/slurmd/job3687796/slurm_script: line 107: module: command not found
ModuleCmd_Use.c(231):ERROR:64: Directory '/share/apps/modules/tools' not found
ModuleCmd_Use.c(231):ERROR:64: Directory '/share/apps/modules/compilers' not found
ModuleCmd_Use.c(231):ERROR:64: Directory '/share/apps/modules/mpi' not found
Using ./scripts/inclineVariables.sh
Currently Loaded Modulefiles:
  1) gcc/8.4.0         4) raja/0.14.0       7) magma/2.6.1
  2) rocm/4.5.1        5) openblas/0.3.18
  3) umpire/6.0.0      6) cmake/3.19.6
Build step
~/gitlab/142740/build ~/gitlab/142740
CMake Warning:
  No source or binary directory provided.  Both will be assumed to be the
  same as the current working directory, but note that this warning will
  become a fatal error in future CMake releases.
loading initial cache file /people/svcexasgd/gitlab/142740/scripts/clang-hip.cmake
CMake Error: The source directory "/people/svcexasgd/gitlab/142740/build" does not appear to contain CMakeLists.txt.
Specify --help for usage, or press the help button on the CMake GUI.
./scripts/defaultBuild.sh: line 16: -Wl,-rpath,/share/apps/gcc/8.4.0/lib64: No such file or directory

@nychiang
Collaborator Author

@cameronrutherford I just checked, and you have always had read access to the hiop repos (same as @jaelynlitz). Not sure what the problem is.

@cameronrutherford
Collaborator

@nychiang the error is misleading, and I meant to clarify: I can read all day long, but I cannot push changes.

@cameronrutherford
Collaborator

I worked through some of these issues myself but was unable to push changes. A few small fixes get past some of the errors, but Incline has clearly changed enough to break the builds completely with the old compiler. I maintain my position that we just need to build a la Re::Solve and have the compiler / ROCm built from source by Spack to avoid these issues.

@jaelynlitz let's disable the exasgd account on HPC sooner rather than later :)

nychiang force-pushed the cuda-needs-raja-fix branch from 77efbfc to 2400342 on February 2, 2024 16:37
cnpetra self-requested a review on February 2, 2024 21:18
@cnpetra
Collaborator

cnpetra commented Feb 2, 2024

I would prefer to have an error message when GPU is on but RAJA is off.

@nychiang
Collaborator Author

nychiang commented Feb 3, 2024

@cameronrutherford I think now you can push to the repository. I added an error message if GPU is ON and RAJA is off. We'd like to merge this one first, and release v1.0.3. Would you mind filing your changes in another PR?
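
For readers following along, a configure-time check of the kind discussed here might look roughly like the sketch below. This is not the code merged in this PR. The option names HIOP_USE_GPU and HIOP_USE_RAJA, the project name, and the message wording are assumptions made for illustration.

```cmake
# Minimal standalone sketch: stop configuration when GPU support is
# requested without RAJA. Option names and message text are assumed,
# not copied from HiOp's CMakeLists.txt.
cmake_minimum_required(VERSION 3.18)
project(gpu_requires_raja_sketch LANGUAGES NONE)

option(HIOP_USE_GPU  "Build with GPU support"            OFF)
option(HIOP_USE_RAJA "Build with the RAJA portability layer" OFF)

# FATAL_ERROR aborts the configure step, so an unsupported combination
# is rejected before anything is compiled.
if(HIOP_USE_GPU AND NOT HIOP_USE_RAJA)
  message(FATAL_ERROR
    "HIOP_USE_GPU is ON but HIOP_USE_RAJA is OFF. "
    "GPU builds need RAJA; reconfigure with -DHIOP_USE_RAJA=ON.")
endif()

message(STATUS "GPU=${HIOP_USE_GPU}, RAJA=${HIOP_USE_RAJA}")
```

With this file saved as CMakeLists.txt, running `cmake -S . -B build -DHIOP_USE_GPU=ON -DHIOP_USE_RAJA=OFF` should stop with the error above, while adding `-DHIOP_USE_RAJA=ON` should let configuration proceed. Failing at configure time reports the problem earlier than a compile-time assertion would.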

@nychiang
Collaborator Author

nychiang commented Feb 5, 2024

@cnpetra Please release v1.0.3 after merging this PR. I will try to update the Spack formula.

@cameronrutherford
Collaborator

I made #680 to track fixing Incline CI and other associated issues that could be resolved by a CI re-vamp. This can be merged from my perspective, and the Spack PR was merged this morning.

cnpetra merged commit ae5602c into develop on Feb 5, 2024
7 checks passed
Successfully merging this pull request may close these issues.

RAJA must be used when CUDA is selected.
4 participants