
Releases: xiaoyeli/superlu_dist

v9.1.0

11 Nov 01:53
ef8a7c1

v9.1.0 Release Note

This includes the following updates:

  1. Improved batched interface to solve many independent systems at the same time.
    Internally it uses C++ templates to support multiple data types, e.g., complex.
    Please cite this IJHPCA paper when you use the batched functions.

  2. "SolveOnly" interface: you can input your own LU (or ILU) factored matrices,
    but use our parallel, multi-GPU capable sparse triangular solve routine.
    This is achieved by setting: options->SolveOnly = YES;
    The user still inputs matrix A. Internally, we will treat the lower triangle
    of A as the L factor, and upper triangle (including diagonal) of A as the U factor.
    See an example program EXAMPLE/pddrive3d.c

  3. Python interface, currently supporting only double precision.
    See PYTHON/README

  4. Fix memory leaks in the 3D multi-GPU routines in SRC/CplusplusFactor/
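
The following is a minimal sketch of selecting the SolveOnly mode, following the option-setting pattern of EXAMPLE/pddrive3d.c; the grid setup, the matrix A packing your L and U triangles, and the driver call itself are omitted, and the declarations shown are assumptions drawn from the double-precision interface.

    #include "superlu_ddefs.h"          /* double-precision distributed interface */

    superlu_dist_options_t options;
    set_default_options_dist(&options); /* start from the library defaults */
    options.SolveOnly = YES;            /* skip factorization: the lower triangle of A
                                           is treated as L, the upper triangle
                                           (including the diagonal) as U */
    /* ... then call the usual 3D driver, e.g. pdgssvx3d(), with a matrix A that
       packs your pre-factored L and U; only the parallel, multi-GPU-capable
       sparse triangular solves are performed ... */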

What's Changed

  • Fix the sizeof and add casting to trf3d partition structs by @abagusetty in #162
  • Fix memory error when using parallel symbolic factorization (ParMETIS) by @sebastiangrimberg in #164
  • Avoid cuda device compiling step when linking against the library. by @eromero-vlc in #170

Full Changelog: v9.0.0...v9.1.0

v9.0.0 release

08 May 20:25

v9.0.0 Release Note

An example program is EXAMPLE/pddrive3d.c, calling the driver routine: SRC/double/pgssvx3d.c (or pdgssvx3d_csc_batch.c)
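
For orientation, below is a rough sketch of the call sequence used by EXAMPLE/pddrive3d.c; nprow, npcol, npdep, m, n, b, ldb, nrhs, and berr are placeholder names assumed to be defined elsewhere, the distributed-matrix setup is omitted, and the exact argument lists should be checked against superlu_ddefs.h.

    #include "superlu_ddefs.h"

    superlu_dist_options_t options;
    gridinfo3d_t grid;                       /* 3D process grid */
    dScalePermstruct_t ScalePermstruct;
    dLUstruct_t LUstruct;
    dSOLVEstruct_t SOLVEstruct;
    SuperLUStat_t stat;
    SuperMatrix A;
    int info;

    /* Build the nprow x npcol x npdep process grid on MPI_COMM_WORLD. */
    superlu_gridinit3d(MPI_COMM_WORLD, nprow, npcol, npdep, &grid);

    set_default_options_dist(&options);      /* override individual fields as needed */
    /* ... create the distributed matrix A and the right-hand side b here ... */

    dScalePermstructInit(m, n, &ScalePermstruct);
    dLUstructInit(n, &LUstruct);
    PStatInit(&stat);

    /* Factor A and solve; on exit b holds the solution, berr the backward errors. */
    pdgssvx3d(&options, &A, &ScalePermstruct, b, ldb, nrhs,
              &grid, &LUstruct, &SOLVEstruct, berr, &stat, &info);

    PStatFree(&stat);
    superlu_gridexit3d(&grid);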

Please cite this ACM TOMS paper when you use these new features.

OpenMP performance hit:
On many systems, the default OMP_NUM_THREADS is set to be the total number of CPU cores on a node; for example, it is 128 on Perlmutter at NERSC. This is too high, because most of the algorithms are not efficient in pure threading mode. We recommend that users experiment with mixed MPI and OpenMP mode, starting with a smaller thread count, by setting:
export OMP_NUM_THREADS=1, or 2, or 3, ....

The new features include the following:

  1. LU factorization: diagonal factorization, panel factorization, & Schur-complement update
    can all be offloaded to the GPU.
    Environment variables:

    • export SUPERLU_ACC_OFFLOAD=1 (default setting: enable GPU)
      - export GPU3DVERSION=1 (default setting; use code in CplusplusFactor/ for all offload )
      - export GPU3DVERSION=0 (only Schur-complement updates are offloaded)
  2. Triangular solve: new 3D communication-avoiding code
    Environment variable:
    export SUPERLU_ACC_SOLVE=0 (default setting; only on CPU)
    export SUPERLU_ACC_SOLVE=1 (offload to GPU)

    ** NOTE: when using multiple NVIDIA GPUs per 2D grid for GPU triangular solve, we use NVSHMEM for fast
    inter-GPU communication. You need to configure NVSHMEM properly.
    For example, on Perlmutter at NERSC, we need the following setup:
    module load nvshmem/2.11.0
    export NVSHMEM_HOME=/global/common/software/nersc9/nvshmem/2.11.0
    export NVSHMEM_USE_GDRCOPY=1
    export NVSHMEM_MPI_SUPPORT=1
    export MPI_HOME=${MPICH_DIR}
    export NVSHMEM_LIBFABRIC_SUPPORT=1
    export LIBFABRIC_HOME=/opt/cray/libfabric/1.15.2.0
    export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
    export NVSHMEM_DISABLE_CUDA_VMM=1
    export FI_CXI_OPTIMIZED_MRS=false
    export NVSHMEM_BOOTSTRAP_TWO_STAGE=1
    export NVSHMEM_BOOTSTRAP=MPI
    export NVSHMEM_REMOTE_TRANSPORT=libfabric
    
  3. Batched interface to solve many independent systems at the same time
    Driver routine: p[d,s,z]gssvx3d_csc_batch.c
    Example program: p[d,s,z]drive3d.c [ -b batchCount ]

  4. Julia interface
    https://github.com/JuliaSparse/SuperLUDIST.jl

Dependencies: the following shows what needs to be defined in the CMake build script.

  1. Highly recommended:
  • BLAS:
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF
    -DTPL_BLAS_LIBRARIES="path to your BLAS library file"
  • ParMETIS:
    -DTPL_ENABLE_PARMETISLIB=ON
    -DTPL_PARMETIS_INCLUDE_DIRS="path to metis and parmetis header files"
    -DTPL_PARMETIS_LIBRARIES="path to metis and parmetis library files"
  2. If you use the GPU triangular solve, the following are needed:
  • LAPACK:
    -DTPL_ENABLE_LAPACKLIB=ON
    -DTPL_LAPACK_LIBRARIES="path to lapack library file"
  • NVSHMEM (needed when using multiple GPUs):
    -DTPL_ENABLE_NVSHMEM=ON
    -DTPL_NVSHMEM_LIBRARIES="path to nvshmem files"
  3. If you use the batched interface, MAGMA is needed:
    -DTPL_ENABLE_MAGMALIB=ON
    -DTPL_MAGMA_INCLUDE_DIRS="path to magma header files"
    -DTPL_MAGMA_LIBRARIES="path to magma library file"

What's Changed

  • Add create large array for broadcast by @SidShi in #157

Full Changelog: v8.2.1...v9.0.0

v8.2.1

18 Nov 03:29

A patch release to correct the version string, now 8.2.1.

Full Changelog: v8.2.0...v8.2.1

v8.2.0

10 Nov 06:47
  • more accurate memory counting for parallel symbolic and distribution routines
  • improved NERSC Perlmutter build and run scripts
  • added OLCF Frontier build and run scripts
  • updated superlu_enum_consts.h, compatible with serial SuperLU
  • added routine PStatClear() (see the usage sketch after this list)
  • fixes for taskloop in triangular solve
  • CMake: add target_compile_features() to specify C standard lower bound
  • a number of bug fixes
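
Below is a small, hypothetical usage sketch for PStatClear(); its signature is assumed to mirror PStatInit(), the options/grid setup and the solver call are elided, and nsolves is a placeholder.

    #include "superlu_ddefs.h"

    superlu_dist_options_t options;   /* assumed already initialized */
    gridinfo_t grid;                  /* assumed already initialized */
    SuperLUStat_t stat;

    PStatInit(&stat);
    for (int i = 0; i < nsolves; ++i) {
        /* ... factor/solve the i-th system here ... */
        PStatPrint(&options, &stat, &grid);   /* report statistics for this solve */
        PStatClear(&stat);                    /* reset the counters for the next solve */
    }
    PStatFree(&stat);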

Updated version strings in several files.

Full Changelog: v8.1.2...v8.2.0

v8.1.2

12 Nov 21:22
  • In SRC/
    ** add an environment variable COMM_TREE_MPI_WAIT in comm_tree.c
    ** replace a taskloop with a parallel for in pxgstrs_lsum.c
  • In EXAMPLE/
    ** drivers: only initialize cuBLAS if GPU offloading is enabled at runtime (James Trott)
    ** global interface drivers: P0 generates random Xtrue and RHS
  • Support 64-bit indexing for the input matrix A

version 8.1.1

02 Oct 03:40
  • bug fix for CPU trisolve:
    ** fix an omp taskloop bug for certain Intel compilers
    ** change mpi_test to mpi_wait in the broadcast tree
  • fix an error related to MPI communicator reordering
  • correct memory allocation for the GEMM buffer
  • disable the internal copy of the COLAMD code; link with an external library
  • add single-precision HWPM option
  • fixes for the Fortran parallel build and CMakeLists.txt
  • add automatic CUDA architecture detection in CMake

What's Changed

  • changed mpi_test to mpi_wait in forwardMessageSimple functions by @jjepson2 in #117

Full Changelog: v8.1.0...v8.1.1

version 8.1.0

06 Jul 01:28
  • Improved GPU U-solve performance
    ** A compile-time CPP flag "-DGPU_SOLVE" is needed to use this function
    ** Currently GPU trisolve works with 1 MPI rank
  • Updated FORTRAN/CMakeLists.txt:
    ** parallel build
    ** use Fortran linker
    ** allow disabling the Fortran/ build when not needed
  • Added single precision interface to HWPM pivoting code
  • Temporary bug work-around for GPU trisolve
  • Updated a number of scripts in example_scripts/

Full Changelog: v8.0.0...v8.1.0

v8.0.0

23 May 00:44
  • Include support for AMD GPUs with HIP programming.
  • Allow setting SUPERLU_ACC_OFFLOAD=0 at runtime to disable GPU offload for both
    the 2D and 3D codes.
  • Include mixed-precision routines: 'psdrive' (single working precision)
    can take double-precision iterative refinement as an option.
  • Add fields to the 'options' input structure corresponding to the
    parameters that can be changed by environment variables.
  • Add GPU stats variables to SuperLUStat_t{}, printed the same way as the CPU stats.

v7.2.0

17 Dec 02:29
Changed the installation directories of the example drivers to /EXAMPLE.

v7.1.1

18 Oct 20:43

Bug fix: in "dReDistributre_A", dereference several uninitialized/unallocated arrays even though their sizes are zero.
The seg. fault only shows up in Windows, not Linux.