This program provides benchmarking tools for data movement on heterogeneous architectures, including:
- Inter-CPU data movement
- Inter-GPU data movement
- CUDA memcpy operations
- Injection bandwidth limitations
This codebase uses CMake. To compile the code:
mkdir build
cd build
cmake ..
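After configuring, build with the generated build system. On most Linux systems CMake produces Makefiles by default, so the following (an assumption about your generator) completes the build:
make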
If your system does not have CUDA-Aware MPI, you can benchmark and model communication routed through the CPU. To compile the code without CUDA-Aware MPI:
mkdir build
cd build
cmake -DCUDA_AWARE=OFF ..
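If you are unsure whether your MPI installation is CUDA-aware and it is built on Open MPI (as Spectrum MPI is), one way to check is the following; this assumes ompi_info is available, and other MPI stacks report this differently:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value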
Example runscripts for each benchmark on Summit are available in the folder 'benchmarks/summit'. These runscripts all use Spectrum MPI. Figures for benchmarks on Summit are available in the folder 'figures/summit'. A subset of these figures is published in [Modeling Data Movement Performance on Heterogeneous Architectures](https://arxiv.org/pdf/2010.10378.pdf).
Example runscripts for each benchmark on Lassen are available in the folder 'benchmarks/lassen'. Spectrum MPI results are in the subfolder 'spectrum', while benchmarks with MVAPICH2-GDR are in the subfolder 'mvapich'. Corresponding figures are in the folder 'figures/lassen'. A subset of these figures is published in [Modeling Data Movement Performance on Heterogeneous Architectures](https://arxiv.org/pdf/2010.10378.pdf).
Each of the existing benchmarks is explained below. For each benchmark, run the corresponding program from the 'examples' folder; you can then plot measurements and models with the scripts in the 'plots' folder.
The memcpy benchmark measures the cost of the cudaMemcpyAsync operation. This benchmark compares the cost of the following transfers:
- host to device
- device to host
- device to device

All data remains within a single NUMA node: for host-to-device and device-to-host copies, the host buffer and the device are on the same NUMA node, and for device-to-device transfers, both devices, along with the calling CPU core, are located on a single NUMA node.
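For reference, the following is a minimal sketch of how a single host-to-device cudaMemcpyAsync can be timed with CUDA events. It is not the code in 'examples/time_memcpy'; the buffer size, iteration count, and stream handling are illustrative assumptions.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t n_bytes = 1 << 24;  // 16 MiB per copy (illustrative size)
    const int n_iter = 100;          // number of timed copies (illustrative)

    // Pinned host buffer and device buffer
    char *h_buf, *d_buf;
    cudaMallocHost(&h_buf, n_bytes);
    cudaMalloc(&d_buf, n_bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up copy before timing
    cudaMemcpyAsync(d_buf, h_buf, n_bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    // Time n_iter host-to-device copies with CUDA events
    cudaEventRecord(start, stream);
    for (int i = 0; i < n_iter; i++)
        cudaMemcpyAsync(d_buf, h_buf, n_bytes, cudaMemcpyHostToDevice, stream);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to device: %f ms per %zu-byte copy\n", ms / n_iter, n_bytes);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}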
Create a folder within 'benchmarks' named after the computer on which you are running this benchmark, and create a runscript within this folder if necessary. All output from the benchmark should be saved in a file titled 'memcpy.<job_id>.out', where 'job_id' is a unique identifier for the individual run. Run the file 'examples/time_memcpy' on a single node, with one CPU core available per GPU. For best performance, the CPU core controlling each GPU should be located on the same NUMA node as that GPU. For example, this benchmark can be run on Lassen with the following:
jsrun -a4 -c4 -g4 -r1 -n1 -M "-gpu" --latency_priority=gpu-cpu --launch_distribution=packed ./time_memcpy
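The mapping of GPUs and CPU cores to NUMA domains varies by system; one way to inspect it on nodes with NVIDIA GPUs is the following, which prints the GPU/CPU affinity matrix:
nvidia-smi topo -m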
The memcpy benchmark can be plotted using the scripts within the 'plots' folder. To run these scripts, make sure the 'plots' folder, which contains the benchpress module, is added to your PYTHONPATH. For each of the plots, you can pass display_plot=True to display the plot rather than saving it to a file. The memcpy benchmarks can be plotted with the following:
from benchpress.memcpy import memcpy_plots
# Plot Host to Device and Device To Host Copies
memcpy_plots.plot_memcpy()
# Plot Device to Device Copies
memcpy_plots.plot_memcpy_d2d()
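As noted above, each plotting function accepts display_plot; for example:

# Display the host/device copy plot instead of saving it to a file
memcpy_plots.plot_memcpy(display_plot=True)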
# License
This code is distributed under the BSD 2-Clause license: http://opensource.org/licenses/BSD-2-Clause
Please see `LICENSE.txt` for more information.