
Releases: xiaoyeli/superlu_dist

v9.1.0

11 Nov 01:53
ef8a7c1

v9.1.0 Release Note

This includes the following updates:

  1. Improved batched interface to solve many independent systems at the same time.
    Internally it uses C++ templates to support multiple data types, e.g., complex.
    Please cite this IJHPCA paper when you use the batched functions.

  2. "SolveOnly" interface: you can input your own LU (or ILU) factored matrices,
    but use our parallel, multi-GPU capable sparse triangular solve routine.
    This is achieved by setting: options->SolveOnly = YES;
    The user still inputs matrix A. Internally, we will treat the lower triangle
    of A as the L factor, and upper triangle (including diagonal) of A as the U factor.
    See an example program EXAMPLE/pddrive3d.c

  3. Python interface, currently supporting only double precision.
    See PYTHON/README

  4. Fix memory leaks in the 3D multi-GPU routines in SRC/CplusplusFactor/
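
The following is a minimal sketch of selecting the SolveOnly mode, following the option-setting pattern of EXAMPLE/pddrive3d.c; the grid setup, the matrix A packing your L and U triangles, and the driver call itself are omitted, and the declarations shown are assumptions drawn from the double-precision interface.

    #include "superlu_ddefs.h"          /* double-precision distributed interface */

    superlu_dist_options_t options;
    set_default_options_dist(&options); /* start from the library defaults */
    options.SolveOnly = YES;            /* skip factorization: the lower triangle of A
                                           is treated as L, the upper triangle
                                           (including the diagonal) as U */
    /* ... then call the usual 3D driver, e.g. pdgssvx3d(), with a matrix A that
       packs your pre-factored L and U; only the parallel, multi-GPU-capable
       sparse triangular solves are performed ... */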

What's Changed

  • Fix the sizeof and add casting to trf3d partition structs by @abagusetty in #162
  • Fix memory error when using parallel symbolic factorization (ParMETIS) by @sebastiangrimberg in #164
  • Avoid cuda device compiling step when linking against the library. by @eromero-vlc in #170

Full Changelog: v9.0.0...v9.1.0

v9.0.0 release

08 May 20:25

v9.0.0 Release Note

An example program is EXAMPLE/pddrive3d.c, calling the driver routine: SRC/double/pgssvx3d.c (or pdgssvx3d_csc_batch.c)
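
For orientation, below is a rough sketch of the call sequence used by EXAMPLE/pddrive3d.c; nprow, npcol, npdep, m, n, b, ldb, nrhs, and berr are placeholder names assumed to be defined elsewhere, the distributed-matrix setup is omitted, and the exact argument lists should be checked against superlu_ddefs.h.

    #include "superlu_ddefs.h"

    superlu_dist_options_t options;
    gridinfo3d_t grid;                       /* 3D process grid */
    dScalePermstruct_t ScalePermstruct;
    dLUstruct_t LUstruct;
    dSOLVEstruct_t SOLVEstruct;
    SuperLUStat_t stat;
    SuperMatrix A;
    int info;

    /* Build the nprow x npcol x npdep process grid on MPI_COMM_WORLD. */
    superlu_gridinit3d(MPI_COMM_WORLD, nprow, npcol, npdep, &grid);

    set_default_options_dist(&options);      /* override individual fields as needed */
    /* ... create the distributed matrix A and the right-hand side b here ... */

    dScalePermstructInit(m, n, &ScalePermstruct);
    dLUstructInit(n, &LUstruct);
    PStatInit(&stat);

    /* Factor A and solve; on exit b holds the solution, berr the backward errors. */
    pdgssvx3d(&options, &A, &ScalePermstruct, b, ldb, nrhs,
              &grid, &LUstruct, &SOLVEstruct, berr, &stat, &info);

    PStatFree(&stat);
    superlu_gridexit3d(&grid);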

Please cite this ACM TOMS paper when you use these new features.

OpenMP performance hit:
On many systems, the default OMP_NUM_THREADS is set to be the total number of CPU cores on a node; for example, it is 128 on Perlmutter at NERSC. This is too high, because most of the algorithms are not efficient in pure threading mode. We recommend that users experiment with mixed MPI and OpenMP mode, starting with a smaller thread count, by setting:
export OMP_NUM_THREADS=1, or 2, or 3, ....

The new features include the following:

  1. LU factorization: diagonal factorization, panel factorization, & Schur-complement update
    can all be offloaded to the GPU.
    Environment variables:

    • export SUPERLU_ACC_OFFLOAD=1 (default setting: enable GPU)
      - export GPU3DVERSION=1 (default setting; use code in CplusplusFactor/ for all offload )
      - export GPU3DVERSION=0 (only Schur-complement updates are offloaded)
  2. Triangular solve: new 3D communication-avoiding code
    Environment variable:
    export SUPERLU_ACC_SOLVE=0 (default setting; only on CPU)
    export SUPERLU_ACC_SOLVE=1 (offload to GPU)

    ** NOTE: when using multiple NVIDIA GPUs per 2D grid for GPU triangular solve, we use NVSHMEM for fast
    inter-GPU communication. You need to configure NVSHMEM properly.
    For example, on Perlmutter at NERSC, we need the following setup:
    module load nvshmem/2.11.0
    export NVSHMEM_HOME=/global/common/software/nersc9/nvshmem/2.11.0
    export NVSHMEM_USE_GDRCOPY=1
    export NVSHMEM_MPI_SUPPORT=1
    export MPI_HOME=${MPICH_DIR}
    export NVSHMEM_LIBFABRIC_SUPPORT=1
    export LIBFABRIC_HOME=/opt/cray/libfabric/1.15.2.0
    export LD_LIBRARY_PATH=$NVSHMEM_HOME/lib:$LD_LIBRARY_PATH
    export NVSHMEM_DISABLE_CUDA_VMM=1
    export FI_CXI_OPTIMIZED_MRS=false
    export NVSHMEM_BOOTSTRAP_TWO_STAGE=1
    export NVSHMEM_BOOTSTRAP=MPI
    export NVSHMEM_REMOTE_TRANSPORT=libfabric
    
  3. Batched interface to solve many independent systems at the same time
    Driver routine: p[d,s,z]gssvx3d_csc_batch.c
    Example program: p[d,s,z]drive3d.c [ -b batchCount ]

  4. Julia interface
    https://github.com/JuliaSparse/SuperLUDIST.jl

Dependencies: the following shows what needs to be defined in the CMake build script.

  1. Highly recommended:
  • BLAS:
    -DTPL_ENABLE_INTERNAL_BLASLIB=OFF
    -DTPL_BLAS_LIBRARIES="path to your BLAS library file"
  • ParMETIS:
    -DTPL_ENABLE_PARMETISLIB=ON
    -DTPL_PARMETIS_INCLUDE_DIRS="path to metis and parmetis header files"
    -DTPL_PARMETIS_LIBRARIES="path to metis and parmetis library files"
  2. If you use the GPU triangular solve, the following are needed:
  • LAPACK:
    -DTPL_ENABLE_LAPACKLIB=ON
    -DTPL_LAPACK_LIBRARIES="path to lapack library file"
  • NVSHMEM (needed when using multiple GPUs):
    -DTPL_ENABLE_NVSHMEM=ON
    -DTPL_NVSHMEM_LIBRARIES="path to nvshmem files"
  3. If you use the batched interface, MAGMA is needed:
    -DTPL_ENABLE_MAGMALIB=ON
    -DTPL_MAGMA_INCLUDE_DIRS="path to magma header files"
    -DTPL_MAGMA_LIBRARIES="path to magma library file"

What's Changed

  • Add create large array for broadcast by @SidShi in #157

Full Changelog: v8.2.1...v9.0.0

v8.2.1

18 Nov 03:29

A patch release to correct the version string, now 8.2.1.

Full Changelog: v8.2.0...v8.2.1

v8.2.0

10 Nov 06:47
  • more accurate memory counting for parallel symbolic and distribution routines
  • improved NERSC Perlmutter build and run scripts
  • added OLCF Frontier build and run scripts
  • updated superlu_enum_consts.h, compatible with serial SuperLU
  • added routine PStatClear() (see the usage sketch after this list)
  • fixes for taskloop in triangular solve
  • CMake: add target_compile_features() to specify C standard lower bound
  • a number of bug fixes
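
Below is a small, hypothetical usage sketch for PStatClear(); its signature is assumed to mirror PStatInit(), the options/grid setup and the solver call are elided, and nsolves is a placeholder.

    #include "superlu_ddefs.h"

    superlu_dist_options_t options;   /* assumed already initialized */
    gridinfo_t grid;                  /* assumed already initialized */
    SuperLUStat_t stat;

    PStatInit(&stat);
    for (int i = 0; i < nsolves; ++i) {
        /* ... factor/solve the i-th system here ... */
        PStatPrint(&options, &stat, &grid);   /* report statistics for this solve */
        PStatClear(&stat);                    /* reset the counters for the next solve */
    }
    PStatFree(&stat);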

Updated version strings in several files.

Full Changelog: v8.1.2...v8.2.0

v8.1.2

12 Nov 21:22
  • In SRC/
    ** add an environment variable COMM_TREE_MPI_WAIT in comm_tree.c
    ** replace a taskloop with a parallel for in pxgstrs_lsum.c
  • In EXAMPLE/
    ** drivers: only initialize cuBLAS if GPU offloading is enabled at runtime (James Trott)
    ** global interface drivers: P0 generates random Xtrue and RHS
  • Support 64-bit indexing for the input matrix A

version 8.1.1

02 Oct 03:40
  • bug fix for CPU trisolve:
    ** fix an omp taskloop bug for certain Intel compilers
    ** change mpi_test to mpi_wait in the broadcast tree
  • fix an error related to MPI communicator reordering
  • correct memory allocation for the GEMM buffer
  • disable the internal copy of the COLAMD code; link with an external library
  • add single-precision HWPM option
  • fixes for the Fortran parallel build and CMakeLists.txt
  • add automatic CUDA architecture detection in CMake

What's Changed

  • changed mpi_test to mpi_wait in forwardMessageSimple functions by @jjepson2 in #117

Full Changelog: v8.1.0...v8.1.1

version 8.1.0

06 Jul 01:28
  • Improved GPU U-solve performance
    ** A compile-time CPP flag "-DGPU_SOLVE" is needed to use this function
    ** Currently GPU trisolve works with 1 MPI rank
  • Updated FORTRAN/CMakeLists.txt:
    ** parallel build
    ** use Fortran linker
    ** allow disabling the Fortran/ build when not needed
  • Added single precision interface to HWPM pivoting code
  • Temporary bug work-around for GPU trisolve
  • Updated a number of scripts in example_scripts/

Full Changelog: v8.0.0...v8.1.0

v8.0.0

23 May 00:44
  • Include support for AMD GPUs with HIP programming.
  • Allow setting SUPERLU_ACC_OFFLOAD=0 at runtime to disable GPU offload for both
    the 2D and 3D codes.
  • Include mixed-precision routines: 'psdrive' (single working precision)
    can take double-precision iterative refinement as an option.
  • Add fields to the 'options' input structure corresponding to the
    parameters that can be changed by environment variables.
  • Add GPU stats variables to SuperLUStat_t{}, printed the same way as the CPU stats.

v7.2.0

17 Dec 02:29
Changed the installation directories of the example drivers to /EXAMPLE.

v7.1.1

18 Oct 20:43

Bug fix: in "dReDistributre_A", dereference several uninitialized/unallocated arrays even though their sizes are zero.
The seg. fault only shows up in Windows, not Linux.