This repository contains the source code for benchmarking the loop fission algorithm presented in "Distributing and Parallelizing Non-canonical Loops".
This is a curated benchmark suite. The included benchmarks were selected specifically because they offer opportunities for the loop fission transformation. For each benchmark, we measure the difference in clock time after loop fission and parallelization with OpenMP.
We compare the original baseline programs to those generated by our loop fission technique. We also compare our technique to an alternative loop fission technique implemented in the ROSE compiler's LoopProcessor tool.
- For a guided walk-through and reproduction of the paper's experiments, use the artifact published on Zenodo and follow its readme instructions.
- To run experiments on native hardware or extend this source code, clone this repository.
Directory | Description |
---|---|
`original` | baseline, unmodified programs |
`fission` | transformed and parallelized programs, using our method |
`alt` | transformed and parallelized programs, using the alternative method |
A more detailed explanation of benchmark transformations

In this repository, the benchmark transformations have already been applied in the appropriate directories, `fission` and `alt`, and we measure the differences between those transformations and `original`. This section briefly explains how the transformations are generated in the first place.
The fission benchmarks have been transformed manually, following our loop fission algorithm; these transformations are not mechanically reproducible. The programs have also been parallelized by hand. We use a manual approach here because automatic parallelization of `while` loops is not supported by any tool we are aware of. Programs with `for` loops could be parallelized automatically, as shown in the next case.
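To illustrate the general idea, here is a minimal sketch with hypothetical names; it is not one of the actual benchmarks, nor the output of our algorithm. Loop fission splits a loop whose statements are independent into separate loops, each of which can then be parallelized on its own:

```c
#define N 1000

/* Before fission: one loop body carries two independent updates. */
void fused(double a[N], double b[N], const double c[N]) {
  for (int i = 0; i < N; i++) {
    a[i] = 2.0 * c[i]; /* statement group 1 */
    b[i] = c[i] + 1.0; /* statement group 2 */
  }
}

/* After fission: each group gets its own loop and its own OpenMP pragma. */
void fissioned(double a[N], double b[N], const double c[N]) {
#pragma omp parallel for
  for (int i = 0; i < N; i++)
    a[i] = 2.0 * c[i];

#pragma omp parallel for
  for (int i = 0; i < N; i++)
    b[i] = c[i] + 1.0;
}
```

A canonical `for` loop like this one is what existing tools can already fission automatically; the algorithm in the paper additionally handles non-canonical `while` loops, which this sketch does not show.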
The alt benchmarks are the result of applying an automatic transformation using the ROSE compiler; they are also parallelized automatically by the same tool. The benchmark `remap` fails during transformation, so we measure no difference from the original version. Because the tool transforms all code, including the timing code of the benchmark template, as a last step we restore the original benchmark template. Detailed steps for re-generating the alt benchmarks are here.
All programs are written in the C language. We include programs with both `for` and `while` loops.
Benchmark | Description | for | while | Origin |
---|---|---|---|---|
`3mm` | 3D matrix multiplication | ✔ | | PolyBench/C |
`bicg` | BiCG sub kernel of BiCGStab linear solver | ✔ | | PolyBench/C |
`colormap` | TIFF image conversion of photometric palette | | ✔ | MiBench |
`conjgrad` | Conjugate gradient routine | ✔ | | NAS-CG |
`cp50` | Ghostscript/CP50 color print routine | ✔ | ✔ | MiBench |
`deriche` | Edge detection filter | ✔ | | PolyBench/C |
`fdtd-2d` | 2-D finite difference time domain kernel | ✔ | | PolyBench/C |
`gemm` | Matrix multiply C = alpha.A.B + beta.C | ✔ | | PolyBench/C |
`gesummv` | Scalar, vector and matrix multiplication | ✔ | | PolyBench/C |
`mvt` | Matrix vector product and transpose | ✔ | | PolyBench/C |
`remap` | 4D matrix memory remapping | ✔ | | NAS-UA |
`tblshift` | TIFF PixarLog compression main table bit shift | ✔ | ✔ | MiBench |
- `headers/` : header files for the benchmark programs.
- `utilities/` : e.g. the timing script, obtained from the PolyBench/C benchmark suite, version 4.2.
- `plot.py` : used for generating tables and plots from results.
- `run.sh` : a wrapper for the timing script; it enables benchmarking directories.
- `ref_eval` : referential results; the evaluation on which we base the results of the paper.
System requirements: a Linux/OSX host, a C compiler with OpenMP support, Make, and Python 3.8+ for plotting. The system should have multiple cores for parallelism; 4 cores are sufficient for this experiment.
Benchmarking proceeds in two phases: first capture the timing results, then generate plots and tables from those results. Details of these steps follow.
✳️ The system should include a C compiler that supports OpenMP pragmas (the default is GCC). To specify an alternative compiler, append `CC=[compiler_name]` to the make commands.
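For example, `make small CC=clang` would run the small evaluation below using Clang instead of GCC (assuming your Clang installation supports OpenMP).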
Small evaluation — time partial benchmarks — ⏲️ ~ 10 min.
make small
Run all benchmarks — time execution of all benchmarks — ⏲️ ~2-4 h, depending on OS.
make all
Run specific benchmark — ⏲️ 1-60 min, varies by benchmark and OS.
make [benchmark_name]
Custom execution — call the `run.sh` script directly:
./run.sh
Available arguments for timing
ARGUMENT | DESCRIPTION: options | DEFAULT |
---|---|---|
`-c` | system compiler to use | `gcc` |
`-d` | which directory: `original`, `fission`, `alt` | `original` |
`-o` | optimization level: `O0`, `O1`, `O2`, `O3`, ... | `O0` |
`-v` | max. variance (%) when timing results: > 0.0 | 5.0 |
`-s` | data size: `MINI`, `SMALL`, `MEDIUM`, `LARGE`, `EXTRALARGE`, `STANDARD` | `STANDARD` |
`-p` | only specific benchmark: `3mm`, `bicg`, `deriche`, ... | not set |
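For example, combining the flags above, the following invocation would time only the `3mm` benchmark from the `fission` directory, at optimization level `O2`, with the `LARGE` data size:

./run.sh -d fission -o O2 -s LARGE -p 3mm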
If necessary, change permissions: `chmod u+r+x ./run.sh`.
The results can be found in the `eval/results` directory, categorized by source directory name, optimization level, and data size. Two files will be generated for each category, for each run:

- `[args]_model.txt` : machine + processor snapshot, metadata
- `[args].txt` : actual results of timing
Data labels, in order:

- `program` : name of the evaluated program
- `variance (%)` : variance of the recorded execution times
- `time (s)` : average runtime (clock time), in seconds
- `UTC timestamp` : time at completion, in seconds since epoch

The data labels are always the same; they are not included in the timing results file.
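As a purely hypothetical illustration of the format (not real data), a results line might read `3mm 0.8 1.234567 1650000000`: the program name, its variance, its average time, and the completion timestamp.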
The timing options are the same as the defaults:

- perform 5 executions per program
- take the average of 3 runs (the min and max times are excluded)
- variance controls the %-difference allowed between the 3 median runs
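Read concretely, a sketch of our understanding of that procedure (not the actual PolyBench timing script) looks like this:

```c
#include <stdlib.h>

/* Comparator for qsort: ascending order of doubles. */
static int cmp_double(const void *x, const void *y) {
  double a = *(const double *)x, b = *(const double *)y;
  return (a > b) - (a < b);
}

/* Sort 5 measured times, drop the min and max, average the 3 median
 * runs, and accept the result only if their spread stays within
 * max_variance percent. Returns -1.0 when the run should be repeated. */
double accept_time(double t[5], double max_variance) {
  qsort(t, 5, sizeof t[0], cmp_double);
  double avg = (t[1] + t[2] + t[3]) / 3.0;
  double spread = (t[3] - t[1]) / avg * 100.0;
  return (spread <= max_variance) ? avg : -1.0;
}
```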
After capturing results, use the plotting script to generate tables or graphs. This step requires Python version 3.8 or higher.
- Install dependencies: `python3 -m pip install -q -r requirements.txt`
- Generate tables or plots: `make plots`
To customize plot options, call `plot.py` directly with the selected arguments.
Available arguments for plotting
python3 plot.py --help
ARGUMENT | DESCRIPTION: options | DEFAULT |
---|---|---|
`--data` | data choice: `time`, `speedup` | `time` |
`--input` | path to results (input) directory | `eval/results` |
`--out` | path to output directory | `eval/plots` |
`--fmt` | output format: `tex`, `md`, `plot` | `md` |
`--digits` | number of digits for tabular values: 0...15 | 6 |
`--ss` | source directory for calculating speedup | `original` |
`--st` | target directory for calculating speedup (all when not set) | not set |
`--millis` | display table of times in milliseconds (otherwise in seconds) | not set |
`--show` | show generated plot or table | not set |
`--dir_filter` | include directories (comma-separated list): `original`, `fission`, `alt` | not set |
`--prog_filter` | include benchmarks (comma-separated list): `3mm`, `bicg`, `deriche`, ... | not set |
`--help` | show help message and exit | not set |
For example, to generate and display a text-based speedup table, run:
python3 plot.py -d speedup -f md --digits 2 --show