MPI CUDA Research

This repository contains the source code and the report for research on the performance gains of MPI alone and MPI combined with CUDA for large-scale matrix multiplication. The research was conducted on Rensselaer Polytechnic Institute's Artificial Intelligence Multiprocessing Optimized System supercomputer, which is managed by the university's Center for Computational Innovation. Both a strong scaling study (the matrix size stays fixed as the number of CPU cores/GPU threads increases) and a weak scaling study (the matrix size grows with the number of CPU cores/GPU threads) were performed. The report contains the research findings (the abstract is below). A key finding is that, for a fixed matrix size, MPI alone reduced matrix multiplication times by up to 58x and MPI/CUDA by up to 175x (Table 2, page 6).
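
The repository holds the actual implementations; as an illustration only, a minimal row-block MPI decomposition of C = A × B for the non-CUDA case might look like the sketch below. The variable names, row-major layout, and the assumption that the matrix dimension divides evenly across ranks are ours for illustration, not necessarily what the repository's code does.

```c
/* Illustrative row-block decomposition of C = A * B with MPI.
 * Assumes n is divisible by the number of ranks; names and layout
 * are hypothetical, not the repository's exact implementation. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int n = (argc > 1) ? atoi(argv[1]) : 1024;    /* square matrix dimension */
    int rows = n / nprocs;                        /* rows of A owned by each rank */

    double *A = NULL, *C = NULL;
    double *B       = malloc((size_t)n * n * sizeof(double));
    double *A_local = malloc((size_t)rows * n * sizeof(double));
    double *C_local = malloc((size_t)rows * n * sizeof(double));

    if (rank == 0) {                              /* root initializes the full matrices */
        A = malloc((size_t)n * n * sizeof(double));
        C = malloc((size_t)n * n * sizeof(double));
        for (size_t i = 0; i < (size_t)n * n; i++) { A[i] = 1.0; B[i] = 2.0; }
    }

    double t0 = MPI_Wtime();
    /* Distribute row blocks of A; every rank needs all of B. */
    MPI_Scatter(A, rows * n, MPI_DOUBLE, A_local, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank multiplies its own row block serially. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A_local[i * n + k] * B[k * n + j];
            C_local[i * n + j] = sum;
        }

    /* Collect the row blocks of C on the root rank. */
    MPI_Gather(C_local, rows * n, MPI_DOUBLE, C, rows * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("n=%d ranks=%d time=%.3fs\n", n, nprocs, t1 - t0);

    free(B); free(A_local); free(C_local);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}
```

In a strong scaling run the same n would be used while the number of ranks grows; in a weak scaling run n would be increased along with the rank count, consistent with the definitions above.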

Abstract

The study of parallel message passing, CUDA, and their combination is important for understanding how common but computationally expensive operations can be parallelized. Matrix multiplication is one such operation; as the dimensions of the matrices increase, the computation becomes expensive. In this paper, we research and compare three implementations of matrix multiplication with square matrices: a serial CPU-only version and two parallelized versions, a CPU version using MPI and a hybrid CPU/GPU version using MPI with CUDA. MPI, provided through OpenMPI, was used to parallelize chunks of the multiplication across processes; GPU parallelization was implemented with the CUDA library. Benchmark data on how efficiently each version performs matrix multiplication in strong and weak scaling setups is provided. The performance of MPI I/O in the MPI (non-CUDA) cases was also investigated. Strong and weak scaling experiments were conducted on the Artificial Intelligence Multiprocessing Optimized System (AiMOS) supercomputer with the system's built-in MPI and CUDA modules. The benchmark includes smaller and larger test cases, which are analyzed based on the matrix multiplication time and, for the parallelized versions, the MPI message-passing overhead. MPI I/O was executed on the system's NVMe storage.
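
For the MPI I/O portion mentioned above, a minimal, hedged sketch of one common pattern is shown below: each rank writes its contiguous row block of the result matrix to a shared binary file with collective MPI I/O. The file name, row-major layout, and helper function are assumptions for illustration, not the repository's exact code.

```c
/* Hedged sketch: each rank writes its rows-by-n block of the result
 * matrix to a shared binary file using collective MPI I/O.
 * File name and layout are illustrative assumptions. */
#include <mpi.h>

void write_result(const double *C_local, int n, int rows, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);

    MPI_File fh;
    MPI_File_open(comm, "matmul_result.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank's block starts at its global row offset, in bytes. */
    MPI_Offset offset = (MPI_Offset)rank * rows * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, C_local, rows * n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
}
```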
