Amd direct solver #521
Conversation
I think using HiOp options would be the place to select the hardware backend.
Can we use preprocessor directives? See here.
Preprocessor directives are probably not the best way to go about this (and I take full responsibility for the example you pointed to :), see e.g. #241). If you have CPU and GPU backends built, Ginkgo allows you to select at runtime on which device you want to run the computation. Using directives would just limit the flexibility provided by Ginkgo.
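As a rough illustration of that runtime choice (a minimal sketch using Ginkgo's public executor API; the helper name and the use_gpu flag are hypothetical, not HiOp or Ginkgo code):

#include <ginkgo/ginkgo.hpp>
#include <memory>

// Hypothetical helper: the device is picked at runtime, not via preprocessor
// directives, so a single binary can serve both CPU and GPU runs.
std::shared_ptr<gko::Executor> choose_executor(bool use_gpu)
{
    auto host = gko::ReferenceExecutor::create();
    if (use_gpu) {
        // Device 0; the host executor is used for staging data.
        return gko::CudaExecutor::create(0, host);
    }
    return host;
}

The same solver setup code can then be handed whichever executor comes back from a helper like this.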
A command line option seems best, I assume? We could just re-use an existing option. I was able to build this branch with the glu branch specified, @pelesh. #522 might want to be delayed and merged into this with a working version of Ginkgo. I also think we should start moving in that direction. Hopefully we can get PNNL CI re-enabled on at least one platform in either this or #522...
I get your point! It's a good idea to have it as an option (or command line option), and I can use it for Strumpack as well.
All great suggestions, @cameronrutherford! In this case, I think we can simply rebase.
SGTM. I will add the Marianas variables to get this branch working, along with some more documentation/information about the spack config used to generate the variables. Hopefully, #522 can just be merged, and we can start seeing tests pass/fail at PNNL. I think your suggestion of just merging and rebasing should be fine.
Thanks, @nkoukpaizan, for enabling the ascent pipeline for this PR. Ginkgo tests are failing because the current implementation seems to be hard-wired to use the HIP backend. We need to set options for choosing which hardware to run on.

25: Test command: /opt/ibm/jsm/bin/jsrun "-a" "1" "-g" "2" "-n" "1" "/gpfs/wolf/proj-shared/csc359/ci/266631/build/src/Drivers/Sparse/NlpSparseEx2.exe" "500" "-ginkgo" "-inertiafree" "-selfcheck"
25: Test timeout computed to be: 1800
25: terminate called after throwing an instance of 'gko::NotCompiled'
25: what(): /gpfs/wolf/proj-shared/csc359/src/ginkgo/core/device_hooks/hip_hooks.cpp:82: feature raw_alloc is part of the hip module, which is not compiled on this system
… solver is run on.
I added a ginkgo_exec option for this.
My suggestion would be to add a command line option to the sparse examples for selecting the Ginkgo backend, and then add tests for the reference, CUDA, and HIP backends for sparse examples 1 and 2. We plan to refactor the CLI for the examples soon, so a minimalistic solution would be best.
By the way, @fritzgoebel, does Ginkgo export CMake targets identifying which hardware backends are built? I just noticed we were assuming Ginkgo provides only a CUDA backend. It would be good to enable Ginkgo tests based on what is available in the linked Ginkgo library.
I was thinking of something like this:

if(HIOP_USE_GINKGO)
  add_test(NAME NlpSparse2_4 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo" "-inertiafree" "-selfcheck")
  if(HIOP_USE_CUDA AND TARGET Ginkgo::CUDA)
    add_test(NAME NlpSparse2_5 COMMAND ${RUNCMD} "$<TARGET_FILE:NlpSparseEx2.exe>" "500" "-ginkgo_cuda" "-inertiafree" "-selfcheck")
  endif()
  if(HIOP_USE_HIP AND TARGET Ginkgo::HIP)
    ...
  endif()
endif(HIOP_USE_GINKGO)

We would have to add -ginkgo_cuda and -ginkgo_hip flags to the examples.
The targets are always available for all backends; if the respective hardware backend is not built, it is replaced with a stub. Trying to use a backend which is not built leads to an exception.
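A minimal sketch of what that looks like from the calling side (the fallback to the reference executor is an assumption made here for illustration, not something this PR necessarily does):

#include <ginkgo/ginkgo.hpp>
#include <iostream>
#include <memory>

int main()
{
    auto host = gko::ReferenceExecutor::create();
    std::shared_ptr<gko::Executor> exec = host;
    try {
        // The stub target links fine; the gko::NotCompiled exception surfaces
        // once the backend is actually exercised (cf. the raw_alloc failure in
        // the CI log above), so both creation and first use are guarded here.
        exec = gko::HipExecutor::create(0, host);
        exec->synchronize();
    } catch (const gko::NotCompiled& e) {
        std::cerr << "HIP backend not compiled, using reference executor: "
                  << e.what() << "\n";
        exec = host;
    }
    return 0;
}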
* Update marianas variables and add spack.yaml.
* Add debugging lines for failing spack build.
* Fix syntax error.
* Update Newell variables and re-enable CI.
* Fix bugs in newell variables. Fixup.
Ginkgo with the CUDA backend fails on the Marianas pipeline, but passes on the Newell pipeline. Is the correct build of Ginkgo available on Marianas? The error message is here:

27: Setting up Ginkgo solver ...
27: terminate called after throwing an instance of 'gko::CudaError'
27: what(): /tmp/ruth521/spack-stage/spack-stage-ginkgo-glu_experimental-dbmokiqc3tlyvnwehe546lb25lrnuaod/spack-src/cuda/base/executor.cpp:192: raw_copy_to: cudaErrorInvalidValue: invalid argument
27: [dlt03:33417] *** Process received signal ***
# ginkgo@glu_experimental%gcc@10.2.0+cuda~develtools~full_optimizations~hwloc~ipo~oneapi+openmp~rocm+shared build_type=Release cuda_arch=60 arch=linux-centos7-zen2
module load ginkgo-glu_experimental-gcc-10.2.0-dbmokiq
@pelesh, this is the version of Ginkgo being used. If necessary, I can add more debugging information, or perhaps print the configuration used at the start of each pipeline?
This is the spack commit that added this branch: spack/spack@68a1d55#diff-541117f56f263caf881b5f12084c0f41eda7f049a954c4590fd8282fa3b525e8
Can you run it manually on Marianas? The same test passes on other pipelines, so it looks as if this is a Marianas-specific issue.
This is the line throwing the exception.
Yes, this is confirmed failing on both P100 and V100 platforms on Marianas. I got the same error as seen in the CI pipelines.
The latest commit should disable the tests that were failing on Marianas. Marianas is CentOS 7 with a max compute capability of 60, so the current assumption is that there is a bug with that specific build combination.
PNNL CI is now passing with the failing tests disabled. This means we are not testing with ginkgo+cuda on that platform; however, tests pass on Ascent and Newell.
Looks great! Thank you @fritzgoebel and @cameronrutherford. The Marianas issue seems to be beyond the scope of this PR.
* Working Ginkgo direct solver on AMD
* fix build failure without ma57
* minor corrections
* Update Ascent CI script to use ginkgo@ea106a945a390a1580baee4648c19ca2b665acdf
* Add ginkgo_exec option to choose the hardware architecture the Ginkgo solver is run on.
* Add tests for CUDA and HIP backends
* Fix typo
* Fix PNNL CI (#526)
* Update marianas variables and add spack.yaml.
* Add debugging lines for failing spack build.
* Fix syntax error.
* Update Newell variables and re-enable CI.
* Fix bugs in newell variables. Fixup.
* Disable ginkgo+cuda test on Marianas.
* Bugfix test config on marianas.
* Final attempt at disabling specific tests.

Co-authored-by: Nicholson Koukpaizan <koukpaizannk@ornl.gov>
Co-authored-by: Cameron Rutherford <robert.rutherford@pnnl.gov>
Adjusts the ginkgo solver to use our new LU implementation.
It is based on a Ginkgo branch derived from the former GLU integration, in which the numerical factorization is replaced with our new custom kernels that work on NVIDIA and AMD GPUs. Preprocessing and symbolic factorization are, for now, still CPU-resident.
Generally, the executor choice in lines 263 and 349 should enable the numerical factorization to be executed on a:
* CPU: gko::ReferenceExecutor::create()
* NVIDIA GPU: gko::CudaExecutor::create(0, gko::ReferenceExecutor::create())
* AMD GPU: gko::HipExecutor::create(0, gko::ReferenceExecutor::create())

By changing the executor I was able to successfully run the HiOp tests both on MI100 GPUs on our own AMD system and on V100 GPUs on Summit.
A fairly easy addition would be a command line option for choosing the hardware backend; where would I best put this?
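For what it's worth, one possible shape for that choice (a sketch only; the commit log above mentions a ginkgo_exec option, and the argument handling here is purely illustrative, not HiOp's actual CLI parsing):

#include <ginkgo/ginkgo.hpp>
#include <iostream>
#include <memory>
#include <string>

int main(int argc, char* argv[])
{
    // Map a backend name, e.g. taken from a command line flag or a
    // ginkgo_exec-style option, onto one of the three executors listed above.
    const std::string backend = (argc > 1) ? argv[1] : "reference";
    auto host = gko::ReferenceExecutor::create();
    std::shared_ptr<gko::Executor> exec = host;
    if (backend == "cuda") {
        exec = gko::CudaExecutor::create(0, host);
    } else if (backend == "hip") {
        exec = gko::HipExecutor::create(0, host);
    }
    std::cout << "Ginkgo factorization will run on the " << backend
              << " executor\n";
    // ... pass `exec` to the linear solver setup ...
    return 0;
}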
As before, this is based on an experimental state of the work, so please expect things to still be worked on and hence to change.