
Amd direct solver #521

Merged
merged 11 commits into from
Aug 8, 2022

Conversation

fritzgoebel

@fritzgoebel fritzgoebel commented Jul 26, 2022

Adjusts the ginkgo solver to use our new LU implementation.

It is based on a ginkgo branch derived from the former GLU integration. In there, the numerical factorization is replaced with our new custom kernels that work on NVIDIA and AMD GPUs. Preprocessing and symbolic factorization are for now still CPU resident.

Generally, the executor choice in lines 263 and 349 should enable the numerical factorization to be executed on a:

  • CPU (gko::ReferenceExecutor::create())
  • NVIDIA GPU (gko::CudaExecutor::create(0, gko::ReferenceExecutor::create()))
  • AMD GPU (gko::HipExecutor::create(0, gko::ReferenceExecutor::create()))

By changing the executor I was able to successfully run the hiop tests both on MI100 GPUs on our own AMD system and on V100 GPUs on Summit.

A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?

As before, this is based on an experimental state of the work, so please expect things to change.

@fritzgoebel fritzgoebel self-assigned this Jul 26, 2022
@pelesh
Collaborator

pelesh commented Aug 2, 2022

A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?

I think HiOp options would be the place to select the hardware backend.

@pelesh pelesh requested review from pelesh, kswirydo, nychiang and cameronrutherford and removed request for kswirydo August 2, 2022 16:20
@nychiang
Collaborator

nychiang commented Aug 2, 2022

A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?

I think HiOp options would be the place to select the hardware backend.

Can we use preprocessor directives? See here

@pelesh
Collaborator

pelesh commented Aug 2, 2022

Can we use preprocessor directives? See here

Preprocessor directives are probably not the best way to go about this (and I take full responsibility for the example you pointed to :), see e.g. #241). If you have CPU and GPU backends built, Ginkgo allows you to select at runtime which device runs the computation. Using directives would just limit the flexibility provided by Ginkgo.

@cameronrutherford
Collaborator

A command line option seems best, I assume? We could just re-use hiop_compute_mode and use CPU/GPU, with the AMD/CUDA distinction chosen at compile time.

I was able to build this branch with the glu branch specified @pelesh. #522 might want to be delayed and merged into this with a working version of ginkgo. I also think we should start moving spack.yaml files into the repositories, along with the variables file for CI on each respective platform.

Hopefully we can get PNNL CI re-enabled on at least one platform in either this or #522...

@nychiang
Collaborator

nychiang commented Aug 3, 2022

Can we use preprocessor directives? See here

Preprocessor directives are probably not the best way to go about this (and I take full responsibility for the example you pointed to :), see e.g. #241). If you have CPU and GPU backends built, Ginkgo allows you to select at runtime which device runs the computation. Using directives would just limit the flexibility provided by Ginkgo.

I get your point! It's a good idea to have it as an option (or command line option), and I can use it for Strumpack as well.

@pelesh
Collaborator

pelesh commented Aug 3, 2022

I was able to build this branch with the glu branch specified @pelesh. #522 might want to be delayed and merged into this with a working version of ginkgo. I also think we should start moving spack.yaml files into the repositories, along with the variables file for CI on each respective platform.

All great suggestions, @cameronrutherford! In this case, I think we can simply rebase amd_direct_solver onto the newell8 branch, so as not to hold up #522. imho, #522 is ready to merge, so we can rebase onto develop after #522 is merged. You can add support for Ginkgo from glu_experimental here.

@cameronrutherford
Collaborator

cameronrutherford commented Aug 3, 2022

I was able to build this branch with the glu branch specified @pelesh. #522 might want to be delayed and merged into this with a working version of ginkgo. I also think we should start moving spack.yaml files into the repositories, along with the variables file for CI on each respective platform.

All great suggestions, @cameronrutherford! In this case, I think we can simply rebase amd_direct_solver onto the newell8 branch, so as not to hold up #522. imho, #522 is ready to merge, so we can rebase onto develop after #522 is merged. You can add support for Ginkgo from glu_experimental here.

SGTM. I will add the Marianas variables to get this branch working, along with some more documentation/information about the spack config used to generate the variables.

Hopefully, #522 can just be merged, and we can start seeing tests pass/fail at PNNL. I think your suggestion of just merging and rebasing should be fine.

@pelesh
Collaborator

pelesh commented Aug 3, 2022

Thanks, @nkoukpaizan, for enabling ascent pipeline for this PR.

Ginkgo tests are failing because the current implementation seems to be hard-wired to the HIP backend. We need an option for choosing the hardware to run on.

25: Test command: /opt/ibm/jsm/bin/jsrun "-a" "1" "-g" "2" "-n" "1" "/gpfs/wolf/proj-shared/csc359/ci/266631/build/src/Drivers/Sparse/NlpSparseEx2.exe" "500" "-ginkgo" "-inertiafree" "-selfcheck"
25: Test timeout computed to be: 1800
25: terminate called after throwing an instance of 'gko::NotCompiled'
25:   what():  /gpfs/wolf/proj-shared/csc359/src/ginkgo/core/device_hooks/hip_hooks.cpp:82: feature raw_alloc is part of the hip module, which is not compiled on this system

@fritzgoebel
Author

I added an option ginkgo_exec to the hiopOptions. This lets you choose between reference, CUDA and HIP executors, with reference being the default.
For the tests I stuck with the default for now, as we test all our functionality for equivalence against the reference executor. If you prefer, we can make the executor in the tests depend on the setup, or test all available executors; just let me know what you prefer.

@pelesh
Collaborator

pelesh commented Aug 4, 2022

If you prefer, we can make the executor in the tests depend on the setup, or test all available executors; just let me know what you prefer.

My suggestion would be to add a command line option to sparse examples for selecting Ginkgo backend, and then add tests for reference, CUDA and HIP backends, for sparse examples 1 and 2. We plan to refactor CLI for examples soon, so a minimalistic solution would be the best.

@pelesh
Collaborator

pelesh commented Aug 4, 2022

By the way, @fritzgoebel, does Ginkgo export CMake targets identifying what hardware backends are built? I just noticed we were assuming Ginkgo is providing CUDA backend only. It would be good to enable Ginkgo tests based on what is available in linked Ginkgo library.

@pelesh
Collaborator

pelesh commented Aug 4, 2022

I was thinking of something like this:

if(HIOP_USE_GINKGO)
  add_test(NAME NlpSparse2_4 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo" "-inertiafree" "-selfcheck")

  if(HIOP_USE_CUDA AND TARGET Ginkgo::CUDA)
    add_test(NAME NlpSparse2_5 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck")
  endif()

  if(HIOP_USE_HIP AND TARGET Ginkgo::HIP)
    ...
  endif()
endif(HIOP_USE_GINKGO)

We would have to add flags -ginkgo_cuda and -ginkgo_hip to the example's CLI.

@fritzgoebel
Author

I was thinking of something like this:

if(HIOP_USE_GINKGO)
  add_test(NAME NlpSparse2_4 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo" "-inertiafree" "-selfcheck")

  if(HIOP_USE_CUDA AND TARGET Ginkgo::CUDA)
    add_test(NAME NlpSparse2_5 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck")
  endif()

  if(HIOP_USE_HIP AND TARGET Ginkgo::HIP)
    ...
  endif()
endif(HIOP_USE_GINKGO)

We would have to add flags -ginkgo_cuda and -ginkgo_hip to the example's CLI.

The targets are always available for all backends; if the respective hardware backend is not built, it is replaced with a stub. Trying to use a backend which is not built leads to an exception.

Fritz Goebel and others added 3 commits August 5, 2022 11:31
* Update marianas variables and add spack.yaml.

* Add debugging lines for failing spack build.

* Fix syntax error.

* Update Newell variables and re-enable CI.

* Fix bugs in newell variables.

Fixup.
@pelesh
Collaborator

pelesh commented Aug 5, 2022

Ginkgo with CUDA backend fails on Marianas pipeline, but passes on Newell pipeline. Is the correct build of Ginkgo available on Marianas?

The error message is here:

27: Setting up Ginkgo solver ... 
27: terminate called after throwing an instance of 'gko::CudaError'
27:   what():  /tmp/ruth521/spack-stage/spack-stage-ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/spack-src/cuda/base/executor.cpp:192: raw_copy_to: cudaErrorInvalidValue: invalid argument
27: [dlt03:33417] *** Process received signal ***

Comment on lines +26 to +27
# ginkgo@glu_experimental%gcc@10.2.0+cuda~develtools~full_optimizations~hwloc~ipo~oneapi+openmp~rocm+shared build_type=Release cuda_arch=60 arch=linux-centos7-zen2
module load ginkgo-glu_experimental-gcc-10.2.0-dbmokiq
@pelesh this is the version of ginkgo being used. If it's necessary, I can add more debugging information, or perhaps print the configuration used at the start of each pipeline?


Can you run it manually on Marianas? The same test passes on other pipelines, so it looks as if this is Marianas specific issue.


This is the line throwing the exception.


Yes, this is confirmed failing on both P100 and V100 platforms on Marianas. I got the same error as seen in the CI pipelines.

@cameronrutherford
Collaborator

The latest commit should disable the tests that were failing on Marianas. Marianas is CentOS 7 with a max compute capability of 60, and so the current assumption is that there is a bug with that specific build combination.

@cameronrutherford
Collaborator

PNNL CI is now passing with the failing tests disabled. This means we are not testing ginkgo+cuda on that platform; however, tests pass on Ascent and Newell.

@pelesh pelesh left a comment


Looks great! Thank you @fritzgoebel and @cameronrutherford. The Marianas issue seems to be beyond the scope of this PR.

@cnpetra cnpetra merged commit 039f6e3 into develop Aug 8, 2022
nychiang pushed a commit that referenced this pull request Sep 30, 2022
* Working Ginkgo direct solver on AMD

* fix build failure without ma57

* minor corrections

* Update Ascent CI script to use ginkgo@ea106a945a390a1580baee4648c19ca2b665acdf

* Add ginkgo_exec option to choose the hardware architecture the Ginkgo solver is run on.

* Add tests for CUDA and HIP backends

* Fix typo

* Fix PNNL CI (#526)

* Update marianas variables and add spack.yaml.

* Add debugging lines for failing spack build.

* Fix syntax error.

* Update Newell variables and re-enable CI.

* Fix bugs in newell variables.

Fixup.

* Disable ginkgo+cuda test on Marianas.

* Bugfix test config on marianas.

* Final attempt at disabling specific tests.

Co-authored-by: Nicholson Koukpaizan <koukpaizannk@ornl.gov>
Co-authored-by: Cameron Rutherford <robert.rutherford@pnnl.gov>