MPI CUDA Research

This repository contains the source code and the report for research on the performance gains of MPI alone and MPI combined with CUDA for large-scale matrix multiplication. The research was conducted on Rensselaer Polytechnic Institute's Artificial Intelligence Multiprocessing Optimized System supercomputer, which is managed by the university's Center for Computational Innovation. Both a strong scaling study (the matrix size stays fixed as the number of CPU cores/GPU threads increases) and a weak scaling study (the matrix size grows with the number of CPU cores/GPU threads) were performed. The report contains the research findings (the abstract is below). A key finding is that, for a fixed matrix size, MPI alone reduced matrix multiplication times by up to 58x and MPI/CUDA by up to 175x (Table 2, page 6).
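
The repository holds the actual implementations; as an illustration only, a minimal row-block MPI decomposition of C = A × B for the non-CUDA case might look like the sketch below. The variable names, row-major layout, and the assumption that the matrix dimension divides evenly across ranks are ours for illustration, not necessarily what the repository's code does.

```c
/* Illustrative row-block decomposition of C = A * B with MPI.
 * Assumes n is divisible by the number of ranks; names and layout
 * are hypothetical, not the repository's exact implementation. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = (argc > 1) ? atoi(argv[1]) : 1024;    /* square matrix dimension */
    int rows = n / nprocs;                        /* rows of A owned by each rank */

    double *A = NULL, *C = NULL;
    double *B       = malloc((size_t)n * n * sizeof(double));
    double *A_local = malloc((size_t)rows * n * sizeof(double));
    double *C_local = malloc((size_t)rows * n * sizeof(double));

    if (rank == 0) {                              /* root initializes the full matrices */
        A = malloc((size_t)n * n * sizeof(double));
        C = malloc((size_t)n * n * sizeof(double));
        for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    double t0 = MPI_Wtime();
    /* Distribute row blocks of A; every rank needs all of B. */
    MPI_Scatter(A, rows * n, MPI_DOUBLE, A_local, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank multiplies its own row block serially. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A_local[i * n + k] * B[k * n + j];
            C_local[i * n + j] = sum;
        }

    /* Collect the row blocks of C on the root rank. */
    MPI_Gather(C_local, rows * n, MPI_DOUBLE, C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("n=%d ranks=%d time=%.3fs\n", n, nprocs, t1 - t0);

    free(B); free(A_local); free(C_local);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}
```

In a strong scaling run the same n would be used while the number of ranks grows; in a weak scaling run n would be increased along with the rank count, consistent with the definitions above.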

Abstract

The study of parallel message passing, CUDA, and their combination is important for understanding how common but computationally expensive operations can be parallelized. Matrix multiplication is one such operation; as the dimensions of the matrices increase, the computation becomes expensive. In this paper, we research and compare three implementations of matrix multiplication with square matrices: a serial CPU-only version and two parallelized versions, a CPU version using MPI and a hybrid CPU/GPU version using MPI with CUDA. MPI, provided through OpenMPI, was used to parallelize chunks of the multiplication across processes; GPU parallelization was implemented with the CUDA library. Benchmark data on how efficiently each version performs matrix multiplication in strong and weak scaling setups is provided. The performance of MPI I/O in the MPI (non-CUDA) cases was also investigated. Strong and weak scaling experiments were conducted on the Artificial Intelligence Multiprocessing Optimized System (AiMOS) supercomputer with the system's built-in MPI and CUDA modules. The benchmark includes smaller and larger test cases, which are analyzed based on the matrix multiplication time and, for the parallelized versions, the MPI message-passing overhead. MPI I/O was executed on the system's NVMe storage.
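
For the MPI I/O portion mentioned above, a minimal, hedged sketch of one common pattern is shown below: each rank writes its contiguous row block of the result matrix to a shared binary file with collective MPI I/O. The file name, row-major layout, and helper function are assumptions for illustration, not the repository's exact code.

```c
/* Hedged sketch: each rank writes its rows-by-n block of the result
 * matrix to a shared binary file using collective MPI I/O.
 * File name and layout are illustrative assumptions. */
#include <mpi.h>

void write_result(const double *C_local, int n, int rows, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, "matmul_result.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's block starts at its global row offset, in bytes. */
    MPI_Offset offset = (MPI_Offset)rank * rows * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, C_local, rows * n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```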
