
Amd direct solver #521

Merged
merged 11 commits into from
Aug 8, 2022

Conversation

fritzgoebel

@fritzgoebel fritzgoebel commented Jul 26, 2022

Adjusts the ginkgo solver to use our new LU implementation.

It is based on a ginkgo branch derived from the former GLU integration. In there, the numerical factorization is replaced with our new custom kernels that work on NVIDIA and AMD GPUs. Preprocessing and symbolic factorization are for now still CPU resident.

Generally, the executor choice in lines 263 and 349 should enable the numerical factorization to be executed on a:

  • CPU (gko::ReferenceExecutor::create())
  • NVIDIA GPU (gko::CudaExecutor::create(0, gko::ReferenceExecutor::create()))
  • AMD GPU (gko::HipExecutor::create(0, gko::ReferenceExecutor::create()))

By changing the executor I was able to successfully run the hiop tests both on MI100 GPUs on our own AMD system and on V100 GPUs on Summit.

A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?

As before, this is based on an experimental state of the work, so please expect things to change.

@fritzgoebel fritzgoebel self-assigned this Jul 26, 2022
@pelesh
Collaborator

pelesh commented Aug 2, 2022

A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?

I think HiOp options would be the place to select the hardware backend.

@pelesh pelesh requested review from pelesh, kswirydo, nychiang and cameronrutherford and removed request for kswirydo August 2, 2022 16:20
@nychiang
Collaborator

nychiang commented Aug 2, 2022

A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?

I think HiOp options would be the place to select the hardware backend.

Can we use preprocessor directives? See here

@pelesh
Collaborator

pelesh commented Aug 2, 2022

Can we use preprocessor directives? See here

Preprocessor directives are probably not the best way to go about this (and I take full responsibility for the example you pointed to :), see e.g. #241). If you have CPU and GPU backends built, Ginkgo allows you to select at runtime which device runs the computation. Using directives would just limit the flexibility provided by Ginkgo.

@cameronrutherford
Collaborator

A command line option seems best, I assume? We could just re-use hiop_compute_mode and use CPU/GPU, with the AMD/CUDA distinction chosen at compile time.

I was able to build this branch with the glu branch specified @pelesh. #522 might want to be delayed and merged into this with a working version of ginkgo. I also think we should start moving spack.yaml files into the repositories, along with the variables file for CI on each respective platform.

Hopefully we can get PNNL CI re-enabled on at least one platform in either this or #522...

@nychiang
Collaborator

nychiang commented Aug 3, 2022

Can we use preprocessor directives? See here

Preprocessor directives are probably not the best way to go about this (and I take full responsibility for the example you pointed to :), see e.g. #241). If you have CPU and GPU backends built, Ginkgo allows you to select at runtime which device runs the computation. Using directives would just limit the flexibility provided by Ginkgo.

I get your point! It's a good idea to have it as an option (or command line option), and I can use it for Strumpack as well.

@pelesh
Collaborator

pelesh commented Aug 3, 2022

I was able to build this branch with the glu branch specified @pelesh. #522 might want to be delayed and merged into this with a working version of ginkgo. I also think we should start moving spack.yaml files into the repositories, along with the variables file for CI on each respective platform.

All great suggestions, @cameronrutherford! In this case, I think we can simply rebase amd_direct_solver onto the newell8 branch, so as not to hold up #522. imho, #522 is ready to merge, so we can rebase onto develop after #522 is merged. You can add support for Ginkgo from glu_experimental here.

@cameronrutherford
Collaborator

cameronrutherford commented Aug 3, 2022

I was able to build this branch with the glu branch specified @pelesh. #522 might want to be delayed and merged into this with a working version of ginkgo. I also think we should start moving spack.yaml files into the repositories, along with the variables file for CI on each respective platform.

All great suggestions, @cameronrutherford! In this case, I think we can simply rebase amd_direct_solver onto the newell8 branch, so as not to hold up #522. imho, #522 is ready to merge, so we can rebase onto develop after #522 is merged. You can add support for Ginkgo from glu_experimental here.

SGTM. I will add the Marianas variables to get this branch working, along with some more documentation/information about the spack config used to generate the variables.

Hopefully, #522 can just be merged, and we can start seeing tests pass/fail at PNNL. I think your suggestion of just merging and rebasing should be fine.

@pelesh
Collaborator

pelesh commented Aug 3, 2022

Thanks, @nkoukpaizan, for enabling ascent pipeline for this PR.

Ginkgo tests are failing because the current implementation seems to be hard-wired to the HIP backend. We need an option for choosing the hardware to run on.

25: Test command: /opt/ibm/jsm/bin/jsrun "-a" "1" "-g" "2" "-n" "1" "/gpfs/wolf/proj-shared/csc359/ci/266631/build/src/Drivers/Sparse/NlpSparseEx2.exe" "500" "-ginkgo" "-inertiafree" "-selfcheck"
25: Test timeout computed to be: 1800
25: terminate called after throwing an instance of 'gko::NotCompiled'
25:   what():  /gpfs/wolf/proj-shared/csc359/src/ginkgo/core/device_hooks/hip_hooks.cpp:82: feature raw_alloc is part of the hip module, which is not compiled on this system

@fritzgoebel
Author

I added an option ginkgo_exec to the hiopOptions. This lets you choose between reference, CUDA and HIP executors, with reference being the default.
For the tests I stuck with the default for now, as we test all our functionality for equivalence against the reference executor. If you prefer, we can make the executor in the tests depend on the setup, or test all available executors; just let me know what you prefer.

@pelesh
Collaborator

pelesh commented Aug 4, 2022

If you prefer, we can make the executor in the tests depend on the setup, or test all available executors; just let me know what you prefer.

My suggestion would be to add a command line option to sparse examples for selecting Ginkgo backend, and then add tests for reference, CUDA and HIP backends, for sparse examples 1 and 2. We plan to refactor CLI for examples soon, so a minimalistic solution would be the best.

@pelesh
Collaborator

pelesh commented Aug 4, 2022

By the way, @fritzgoebel, does Ginkgo export CMake targets identifying what hardware backends are built? I just noticed we were assuming Ginkgo is providing CUDA backend only. It would be good to enable Ginkgo tests based on what is available in linked Ginkgo library.

@pelesh
Collaborator

pelesh commented Aug 4, 2022

I was thinking of something like this:

if(HIOP_USE_GINKGO)
  add_test(NAME NlpSparse2_4 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo" "-inertiafree" "-selfcheck")

  if(HIOP_USE_CUDA AND TARGET Ginkgo::CUDA)
    add_test(NAME NlpSparse2_5 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck")
  endif()

  if(HIOP_USE_HIP AND TARGET Ginkgo::HIP)
    ...
  endif()
endif(HIOP_USE_GINKGO)

We would have to add flags -ginkgo_cuda and -ginkgo_hip to the example's CLI.

@fritzgoebel
Author

I was thinking of something like this:

if(HIOP_USE_GINKGO)
  add_test(NAME NlpSparse2_4 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo" "-inertiafree" "-selfcheck")

  if(HIOP_USE_CUDA AND TARGET Ginkgo::CUDA)
    add_test(NAME NlpSparse2_5 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck")
  endif()

  if(HIOP_USE_HIP AND TARGET Ginkgo::HIP)
    ...
  endif()
endif(HIOP_USE_GINKGO)

We would have to add flags -ginkgo_cuda and -ginkgo_hip to the example's CLI.

The targets are always available for all backends; if the respective hardware backend is not built, it is replaced with a stub. Trying to use a backend which is not built leads to an exception.

Fritz Goebel and others added 3 commits August 5, 2022 11:31
* Update marianas variables and add spack.yaml.

* Add debugging lines for failing spack build.

* Fix syntax error.

* Update Newell variables and re-enable CI.

* Fix bugs in newell variables.

Fixup.
@pelesh
Collaborator

pelesh commented Aug 5, 2022

Ginkgo with CUDA backend fails on Marianas pipeline, but passes on Newell pipeline. Is the correct build of Ginkgo available on Marianas?

The error message is here:

27: Setting up Ginkgo solver ... 
27: terminate called after throwing an instance of 'gko::CudaError'
27:   what():  /tmp/ruth521/spack-stage/spack-stage-ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/spack-src/cuda/base/executor.cpp:192: raw_copy_to: cudaErrorInvalidValue: invalid argument
27: [dlt03:33417] *** Process received signal ***

Comment on lines +26 to +27
# ginkgo@glu_experimental%gcc@10.2.0+cuda~develtools~full_optimizations~hwloc~ipo~oneapi+openmp~rocm+shared build_type=Release cuda_arch=60 arch=linux-centos7-zen2
module load ginkgo-glu_experimental-gcc-10.2.0-dbmokiq
@pelesh this is the version of ginkgo being used. If it's necessary, I can add more debugging information, or perhaps print the configuration used at the start of each pipeline?


Can you run it manually on Marianas? The same test passes on other pipelines, so it looks as if this is Marianas specific issue.


This is the line throwing the exception.


Yes, this is confirmed failing on both P100 and V100 platforms on Marianas. I got the same error as seen in the CI pipelines.

@cameronrutherford
Collaborator

The latest commit should disable the tests that were failing on Marianas. Marianas is CentOS 7 with a max compute capability of 60, and so the current assumption is that there is a bug with that specific build combination.

@cameronrutherford
Collaborator

PNNL CI is now passing with the failing tests disabled. This means we are not testing ginkgo+cuda on that platform; however, tests pass on Ascent and Newell.

@pelesh pelesh left a comment


Looks great! Thank you @fritzgoebel and @cameronrutherford. The Marianas issue seems to be beyond the scope of this PR.

@cnpetra cnpetra merged commit 039f6e3 into develop Aug 8, 2022
nychiang pushed a commit that referenced this pull request Sep 30, 2022
* Working Ginkgo direct solver on AMD

* fix build failure without ma57

* minor corrections

* Update Ascent CI script to use ginkgo@ea106a945a390a1580baee4648c19ca2b665acdf

* Add ginkgo_exec option to choose the hardware architecture the Ginkgo solver is run on.

* Add tests for CUDA and HIP backends

* Fix typo

* Fix PNNL CI (#526)

* Update marianas variables and add spack.yaml.

* Add debugging lines for failing spack build.

* Fix syntax error.

* Update Newell variables and re-enable CI.

* Fix bugs in newell variables.

Fixup.

* Disable ginkgo+cuda test on Marianas.

* Bugfix test config on marianas.

* Final attempt at disabling specific tests.

Co-authored-by: Nicholson Koukpaizan <koukpaizannk@ornl.gov>
Co-authored-by: Cameron Rutherford <robert.rutherford@pnnl.gov>