Benchmarking Suite for Heterogeneous Architectures

This program provides benchmarking tools for data movement on heterogeneous architectures, including:

  • Inter-CPU data movement
  • Inter-GPU data movement (see the sketch after this list)
  • CUDA memcpy operations
  • Injection bandwidth limitations
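
The inter-GPU benchmarks boil down to timing message exchanges between GPU buffers on different ranks. The ping-pong below is a minimal sketch of that kind of measurement under CUDA-aware MPI, not code from this suite; the rank-to-device mapping and message size are chosen purely for illustration.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Illustrative mapping: rank i drives GPU i (assumes >= 2 ranks, 2 GPUs)
    cudaSetDevice(rank);

    int n = 1 << 20;  // message size: 1M floats (arbitrary)
    float* d_buf;
    cudaMalloc((void**)&d_buf, n * sizeof(float));

    // With CUDA-aware MPI, device pointers are passed directly to MPI
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    if (rank == 0)
    {
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    else if (rank == 1)
    {
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);
    }
    double t = (MPI_Wtime() - t0) / 2;  // approximate one-way time
    if (rank == 0) printf("Inter-GPU one-way time: %e seconds\n", t);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}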

Compiling

This codebase uses CMake. To compile the code:

mkdir build
cd build
cmake ..
make

CUDA-Aware MPI

If your system does not have CUDA-Aware MPI, you can benchmark and model communication routed through the CPU (a sketch of this staging follows the build commands). To compile the code without CUDA-Aware MPI:

mkdir build
cd build
cmake -DCUDA_AWARE=OFF ..
make
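
In that configuration, each message is staged through a host buffer around an ordinary MPI call. The helpers below are a rough sketch of what such CPU-routed communication looks like; the function names are hypothetical and not part of this codebase.

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical helpers: stage a device buffer through a host buffer
// so that MPI only ever sees host memory
void send_through_cpu(const float* d_buf, float* h_buf, int n,
                      int dest, int tag, MPI_Comm comm)
{
    // Device -> host copy, then a host-to-host MPI send
    cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
    MPI_Send(h_buf, n, MPI_FLOAT, dest, tag, comm);
}

void recv_through_cpu(float* d_buf, float* h_buf, int n,
                      int src, int tag, MPI_Comm comm)
{
    // Host receive, then a host -> device copy
    MPI_Recv(h_buf, n, MPI_FLOAT, src, tag, comm, MPI_STATUS_IGNORE);
    cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
}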

Summit Supercomputer

Example runscripts for each example on Summit are available in the folder 'benchmarks/summit'. These runscripts all use Spectrum MPI. Figures for benchmarks on Summit are available in the folder 'figures/summit'. A subset of these figures are published in [Modeling Data Movement Performance on Heterogeneous Architectures](https://arxiv.org/pdf/2010.10378.pdf).

Lassen Supercomputer

Example runscripts for each example on Lassen are available in the folder 'benchmarks/lassen'. Spectrum MPI results are in the subfolder 'spectrum', while benchmarks with MVAPICH2-GDR are in the subfolder 'mvapich'. Corresponding figures are in the folder 'figures/lassen'. A subset of these figures are published in [Modeling Data Movement Performance on Heterogeneous Architectures](https://arxiv.org/pdf/2010.10378.pdf).

Benchmarks

Each of the existing benchmarks is explained below. For each benchmark, you will need to run code from the 'examples' folder. Then, you will be able to plot measurements and models with scripts in the 'plots' folder.

Memcpy Benchmark

The memcpy benchmark measures the cost of the cudaMemcpyAsync operation. This benchmark compares the cost of the following transfers (a timing sketch appears below):

  • host to device
  • device to host
  • device to device

All data remains on a single NUMA node: for host-to-device and device-to-host copies, the host core and the device are on the same NUMA node, and for device-to-device transfers, both devices along with the calling CPU core are located on a single NUMA node.
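
As a rough illustration of the kind of measurement involved (a sketch, not this suite's actual code), a single host-to-device cudaMemcpyAsync can be timed with CUDA events as follows; the transfer size is arbitrary.

#include <cuda_runtime.h>
#include <stdio.h>

int main()
{
    size_t bytes = 1 << 24;  // 16 MB transfer (arbitrary size)
    float *h_buf, *d_buf;
    cudaMallocHost((void**)&h_buf, bytes);  // pinned host memory
    cudaMalloc((void**)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time one host-to-device copy; device-to-host and device-to-device
    // follow the same pattern with a different cudaMemcpyKind
    cudaEventRecord(start);
    cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, 0);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Host to device: %.3f ms (%.2f GB/s)\n", ms, bytes / (ms * 1e6));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}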

Running the Memcpy Benchmark

Create a folder within 'benchmarks' named after the computer on which you are running this benchmark, and create a runscript within this folder if necessary. All output from the benchmark should be saved in a file titled 'memcpy.<job_id>.out', where 'job_id' is a unique identifier for the individual run. Run 'examples/time_memcpy' on a single node with one CPU core available per GPU. For best performance, the CPU core controlling each GPU should be located on the same NUMA node as that GPU; a sketch of one way to bind ranks to GPUs follows the example command. For example, this benchmark can be run on Lassen with the following:

jsrun -a4 -c4 -g4 -r1 -n1 -M "-gpu" --latency_priority=gpu-cpu --launch_distribution=packed ./time_memcpy  
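
The jsrun line above launches four ranks alongside four GPUs on a node with a packed distribution, so each rank can select its GPU from its node-local rank. The snippet below is one common way to make that binding, shown as a sketch rather than code from this suite.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    // Split the communicator by shared-memory node to get a node-local rank
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);
    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);

    int num_gpus;
    cudaGetDeviceCount(&num_gpus);
    cudaSetDevice(local_rank % num_gpus);  // one rank per GPU

    // ... run the benchmark ...

    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}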

Plotting the Memcpy Benchmark

The memcpy benchmark can be plotted using the scripts within the 'plots' folder. To run these scripts, make sure the benchpress module in 'plots' is on your PYTHONPATH. For each of the plots, you can pass display_plot=True to display the plot rather than saving it to a file. For example:

from benchpress.memcpy import memcpy_plots
# Plot Host to Device and Device To Host Copies
memcpy_plots.plot_memcpy()
# Plot Device to Device Copies
memcpy_plots.plot_memcpy_d2d()


License

This code is distributed under the BSD 2-Clause license: http://opensource.org/licenses/BSD-2-Clause

Please see `LICENSE.txt` for more information.
