These instructions assume that you have followed the Getting Started Guide to set up GraphIt. The following sections give step-by-step instructions for reproducing Figure 6 (PageRankDelta with different schedules) and Table 8 (GraphIt performance on our 2-socket machine) in the paper. NOTE: the schedules we used here are almost certainly NOT the fastest schedules for your machine. Please use these instructions only as examples for writing and compiling different schedules, and tune the schedules to best fit your machine's features, such as cache size, number of sockets, and number of cores.
Figure 6 in the paper shows the different C++ code generated by applying different schedules to PageRankDelta. We have built a script that generates the code for PageRankDelta with different schedules and verifies that the generated C++ code compiles successfully.
#start from graphit root directory
cd graphit_eval/
cd pagerankdelta_example
python compile_pagerankdelta_fig6.py
The program should output information on each schedule, print the generated C++ file to stdout, and save the generated code as .cpp files in the directory. The schedules we used are stored in pagerankdelta_example/schedules. We added a cache-optimized schedule that was not included in the paper due to space constraints. This experiment demonstrates GraphIt's ability to compose cache, direction, parallelization, and data structure optimizations.
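For reference, a schedule that composes these optimizations looks roughly like the following (an illustrative sketch in the same style as the files under pagerankdelta_example/schedules; the label "s1" refers to the labeled edgeset apply operator in the algorithm, and the segment count 10 passed to configApplyNumSSG is an arbitrary example value that has to be tuned to your cache size):
schedule:
program->configApplyDirection("s1", "SparsePush-DensePull")->configApplyParallelization("s1", "dynamic-vertex-parallel")->configApplyDenseVertexSet("s1", "bitvector", "src-vertexset", "DensePull");
program->configApplyNumSSG("s1", "fixed-vertex-count", 10, "DensePull");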
Table 8 in the paper shows the performance numbers of GraphIt for a few applications. Here we provide a script that can reproduce GraphIt's performance for PageRank, PageRankDelta, Breadth-First Search, Single Source Shortest Paths, and Connected Components. Collaborative Filtering currently only works with the Intel icpc compiler. NOTE: the schedules we used here are almost certainly NOT the fastest schedules for your machine. Please use the instructions here only as examples for writing and compiling different schedules, and tune the schedules to best fit your machine's features, such as cache size, number of sockets, and number of cores.
The algorithms we used for benchmarking, such as PageRank, PageRankDelta, BFS, Connected Components, Single Source Shortest Paths, and Collaborative Filtering, are in the apps directory. These files include ONLY the algorithm and NO schedule. You need to use appropriate schedules for the specific algorithm and input graph to get the best performance.
In the arXiv paper (Table 8), we described the schedules used for each algorithm on each graph on a dual-socket system with Intel Xeon E5-2695 v3 CPUs with 12 cores each, for a total of 24 cores and 48 hyper-threads. The system has 128 GB of DDR3-1600 memory and 30 MB of last-level cache on each socket, and runs with Transparent Huge Pages (THP) enabled. The best schedule can be different on a different machine; you might need to try a few different sets of schedules to get the best performance.
In the schedules shown in Table 8, the keyword 'Program' and the continuation symbol '->' are omitted, and 'ca' is an abbreviation for 'configApply'. Note that configApplyNumSSG takes an integer parameter (X) that depends on the graph size and the cache size of the system. For example, the complete schedule used for CC on the Twitter graph is the following (X is tuned based on the cache size):
schedule:
program->configApplyDirection("s1", "SparsePush-DensePull")->configApplyParallelization("s1", "dynamic-vertex-parallel")->configApplyDenseVertexSet("s1","bitvector", "src-vertexset", "DensePull");
program->configApplyNumSSG("s1", "fixed-vertex-count", X, "DensePull");
The test/input and test/input_with_schedules directories contain many examples of algorithm and schedule files. Use them as references when writing your own schedules and generating C++ implementations.
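For example, the following command (run from the GraphIt root directory, assuming the build directory is named build) pairs the PageRankDelta algorithm with one of the provided schedule files; the -a flag gives the algorithm file, -f the schedule file, and -o the output C++ file:
python build/bin/graphitc.py -a apps/pagerankdelta.gt -f test/input_with_schedules/pagerank_delta_hybrid_dense.gt -o pagerankdelta_hybrid_dense.cpp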
Here we provide a script that compiles (displaying the commands used) the GraphIt programs (with the .gt extension) into C++ programs using the schedules shown in the paper.
#start from graphit root directory
cd graphit_eval/eval/table7
#automatically compile the graphit files (.gt) into C++ files with schedules used in the paper (from graphit/test/input_with_schedules directory)
make graphit_files
Here we show the abbreviated output of the script below. These are essentially the commands we used to compile the GraphIt files using the schedules in the test directory. The output cpp files are stored in graphit/graphit_eval/eval/table7/cpps. You can look at the schedule files here to figure out the schedules we used to get high performance on our machine.
yunming:table7$ make graphit_files
python ../../../build/bin/graphitc.py -a ../../../apps/cf.gt -f ../../../test/input_with_schedules/cf_pull_parallel_load_balance_segment_argv.gt -o cpps/cf_pull_load_balance_segment.cpp
python ../../../build/bin/graphitc.py -a ../../../apps/pagerankdelta.gt -f ../../../test/input_with_schedules/pagerank_delta_hybrid_dense.gt -o cpps/pagerankdelta_hybrid_dense.cpp
python ../../../build/bin/graphitc.py -a ../../../apps/pagerankdelta.gt -f ../../../test/input_with_schedules/pagerank_delta_benchmark_cache.gt -o cpps/pagerankdelta_hybrid_dense_bitvec_segment.cpp
python ../../../build/bin/graphitc.py -a ../../../apps/pagerankdelta.gt -f ../../../test/input_with_schedules/pagerank_delta_sparse_push_parallel.gt -o cpps/pagerankdelta_sparse_push.cpp
python ../../../build/bin/graphitc.py -a ../../../apps/cc.gt -f ../../../test/input_with_schedules/cc_benchmark_cache.gt -o cpps/cc_hybrid_dense_bitvec_segment.cpp
...
The following commands run the serial version of GraphIt on a small test graph (both the unweighted and weighted versions are in graphit/graphit_eval/data/testGraph) that is included in the repository. We assume that you have already generated optimized C++ files for our dual-socket machine in the graphit_eval/eval/table7/cpps directory following the last step. (The names of the generated files need to match those used in the last step.) Please use the Makefile here to figure out the commands we used to compile the serial C++ files.
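For reference, compiling one of the generated files serially looks roughly like the following (a sketch only; the include path and flags are assumptions based on GraphIt's getting-started instructions, so consult the Makefile for the exact commands):
#compile a generated C++ file against the GraphIt runtime library (paths and flags are assumptions)
g++ -std=c++11 -I ../../../src/runtime_lib/ -O3 cpps/pagerankdelta_hybrid_dense.cpp -o bin/pagerankdelta_hybrid_dense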
#start from graphit root directory
cd graphit_eval/eval/table7
#first compile the generated cpp files
make cpps
#run and benchmark the performance
python table7_graphit.py
The script table7_graphit.py first runs the benchmarks and then saves the outputs to the graphit_eval/eval/table7/outputs/ directory. The benchmark script chooses the binary to run based on the graph. A separate script then parses the outputs to generate the final performance table in the following form: the application and graph information are shown in the leftmost column, and the running times (in seconds) are shown in the second column.
{'graphit': {'testGraph': {'bfs': 3.3333333333333335e-07, 'pr': 1e-06, 'sssp': 5e-07, 'cc': 1e-06, 'prd': 3e-06}}}
bfs
testGraph, 3.33333333333e-07
sssp
testGraph, 5e-07
pr
testGraph, 1e-06
cc
testGraph, 1e-06
prd
testGraph, 3e-06
Done parsing the run outputs
These runs should complete very quickly.
We have provided a few slightly larger graphs for testing in the Dropbox folder additional_graphit_graphs. The folder contains socLive.sg (unweighted binary Live Journal graph), socLive.wsg (weighted binary Live Journal graph), the road graph (binary, unweighted and weighted), and the Twitter graph (binary, unweighted and weighted). The weights are currently just set to 1. Running the experiments on the Twitter graph can take a significant amount of time if your machine does not have 100 GB of memory and many cores. Running these larger graphs with the serial C++ implementations is very slow; please use the parallel implementations if possible (instructions are given in later sections).
Below we first show the instructions for running the socLive (Live Journal) graph.
#copy the files to the data directories.
#The directory names have to be socLive, road-usad, twitter as we used hard-coded names in the scripts.
mkdir graphit/graphit_eval/eval/data/socLive
cp socLive.sg graphit/graphit_eval/eval/data/socLive
cp socLive.wsg graphit/graphit_eval/eval/data/socLive
#start from graphit root directory
cd graphit_eval/eval/table7
#first compile the graphit files and the generated cpp files
make graphit_files
make cpps
#run and benchmark the performance
python table7_graphit.py --graph socLive
The road graph and Twitter graph need to be named road-usad and twitter, respectively. We have included some instructions below.
#copy the files to the data directories.
#The directory names have to be socLive, road-usad, twitter as we used hard-coded names in the scripts.
mkdir graphit/graphit_eval/eval/data/road-usad
cp road-usad.sg graphit/graphit_eval/eval/data/road-usad
cp road-usad.wsg graphit/graphit_eval/eval/data/road-usad
mkdir graphit/graphit_eval/eval/data/twitter
cp twitter.sg graphit/graphit_eval/eval/data/twitter
cp twitter.wsg graphit/graphit_eval/eval/data/twitter
#start from graphit root directory
cd graphit_eval/eval/table7
#first compile the generated cpp files
#see the section below for info on parallel builds
make
#run and benchmark the performance on both road and Twitter graphs
python table7_graphit.py --graph road-usad twitter
Here we list the instructions for compiling the generated C++ files using icpc or gcc with Cilk and OpenMP. The user mostly needs to define a few variables for the Makefile. We used Cilk for most of the files because its work-stealing scheduler performs better than an untuned OpenMP dynamic schedule. For sssp_push_slq.cpp and bfs_push_slq.cpp, we had to compile with OpenMP because they need OpenMP-specific features. You can look at the Makefile, or its output, to figure out the exact commands used to compile each individual cpp file.
#start from graphit root directory
cd graphit_eval/eval/table7
#remove previously compiled binaries
make clean
#compile with icpc if you installed the intel compiler
make ICPC_PAR=1 cpps
#compile with gcc with Cilk and OpenMP
make GCC_PAR=1 cpps
#run and benchmark the performance
python table7_graphit.py --graph socLive
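If you want to compile an individual generated file by hand instead of through the Makefile, the commands are roughly of the following form (a sketch only; the -DCILK/-DOPENMP macros, flags, and paths are assumptions based on GraphIt's getting-started instructions and may not match the Makefile exactly):
#gcc with Cilk (work-stealing) for most generated files
g++ -std=c++11 -I ../../../src/runtime_lib/ -DCILK -fcilkplus -lcilkrts -O3 cpps/cc_hybrid_dense_bitvec_segment.cpp -o bin/cc_hybrid_dense_bitvec_segment
#gcc with OpenMP for files that need OpenMP-specific features (e.g., sssp_push_slq.cpp)
g++ -std=c++11 -I ../../../src/runtime_lib/ -DOPENMP -fopenmp -O3 cpps/sssp_push_slq.cpp -o bin/sssp_push_slq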
As we mentioned earlier, to replicate the performance numbers you will need to 1) use the parallel versions of the generated C++ programs and 2) run them on a machine with a configuration similar to ours. We used Intel Xeon E5-2695 v3 CPUs with 12 cores each, for a total of 24 cores and 48 hyper-threads. The system has 128 GB of DDR3-1600 memory and 30 MB of last-level cache on each socket, and runs with Transparent Huge Pages (THP) enabled. The generated C++ files are also not optimized for single-socket or 4-socket machines.
If you are trying to evaluate GraphIt on a different machine, you need to tune the schedules to best fit your machine's features, such as cache size, number of sockets, and number of cores. These schedules and scripts are only meant to be examples of the compilation and running commands for you to look at.
GraphIt reuses GAPBS input formats. Specifically, we have tested with edge list file (.el), weighted edge list file (.wel), binary edge list (.sg), and weighted binary edge list (.wsg) formats. Users can use the converter in GAPBS (GAPBS/src/converter.cc) to convert other graph formats into the supported formats, or convert weighted and unweighted edge list files into their respective binary formats.
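For example, a typical conversion with the GAPBS converter looks roughly like the following (an illustrative sketch; the -w flag for weighted output is our assumption, so check the converter's help text for the exact flags in your GAPBS version):
#convert a plain edge list into the binary edge list format
./converter -f mygraph.el -b mygraph.sg
#convert a weighted edge list into the weighted binary format (-w is an assumption)
./converter -f mygraph.wel -w -b mygraph.wsg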
For the additional graphs, you can use the compiled binaries in the graphit_eval/eval/table7/bin directory to evaluate the performance. We normally use numactl -i all on graphs with 10M+ vertices. graphit_eval/eval/table7/benchmark.py has examples of the commands used for running the generated binaries.
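For example, a run on a large graph would look like the following (a hypothetical invocation; the binary and graph names are placeholders, and benchmark.py shows the exact commands):
#interleave memory allocation across sockets for graphs with 10M+ vertices
numactl -i all ./bin/<benchmark_binary> <path_to_graph>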
To use the script for additional graphs, follow the socLive example for creating directories in graphit/graphit_eval/eval/data/. However, certain graphs have to be named in a specific way in order to use our provided script; for example, the road graph and Twitter graph need to be named road-usad and twitter. Please take a look at graphit_eval/eval/table7/benchmark.py for more details.