
OpenMP-based parallel graph implementation.

I have been trying to parallelize the graph data structure from the bottom up.

- Each vertex has a list of edges, stored as a vector of (edge id, weight) pairs sorted by id.
- A sorted list gives better locality, lookup time, and update time.
- To amortize the cost of edge deletions and insertions, they are simply queued up and lazily applied.
- To use this lazy behaviour, a batch must consist only of deletions or only of insertions (not a mix).
- The lazy update is performed manually by calling update() with a temporary per-thread buffer.
- The buffer is needed for the in-place set_union_last_unique() / set_difference_unique().
- Different threads must work on different vertices.
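A minimal sketch of this design follows. The names here (LazyEdges, addEdge, update) are illustrative, not the repository's actual API, and deletions are omitted for brevity.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch of one vertex's edge list with lazy insertions. Edges are kept
// as (id, weight) pairs sorted by id; new edges are queued in `staged`
// and merged in a single pass by update(), using a caller-supplied
// per-thread buffer for the out-of-place merge.
struct LazyEdges {
  std::vector<std::pair<uint32_t, float>> edges;   // sorted by edge id
  std::vector<std::pair<uint32_t, float>> staged;  // pending insertions

  void addEdge(uint32_t id, float w) { staged.emplace_back(id, w); }

  // Merge staged edges into the sorted list, keeping only the last
  // entry for each duplicate id (later updates win).
  void update(std::vector<std::pair<uint32_t, float>>& buf) {
    if (staged.empty()) return;
    std::stable_sort(staged.begin(), staged.end(),
        [](const auto& a, const auto& b) { return a.first < b.first; });
    buf.clear();
    std::merge(edges.begin(), edges.end(), staged.begin(), staged.end(),
               std::back_inserter(buf),
               [](const auto& a, const auto& b) { return a.first < b.first; });
    edges.clear();
    for (size_t i = 0; i < buf.size(); ++i)
      if (i + 1 == buf.size() || buf[i + 1].first != buf[i].first)
        edges.push_back(buf[i]);  // keep last occurrence of each id
    staged.clear();
  }
};
```

Because each thread owns its own vertices and its own buffer, update() needs no synchronization.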

The last time I tried parallelizing graph reading with 12 threads, performance was about half that of the sequential version, with converting each line to numbers taking most of the time. It seems that the default stream extraction operator (>>) takes a lock in order to support locales, and is not recommended in parallel code.
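For contrast, a line can be tokenized without any locale lock using the C library conversion functions. This is a minimal sketch (the helper name and struct are hypothetical, and error handling is omitted):

```cpp
#include <cstdlib>

// Parse "u v w" from a line with strtoull()/strtod(), which take no
// locale lock, unlike istream's operator>>. Each call resumes where
// the previous conversion stopped (via `end`).
struct EdgeLine { unsigned long long u, v; double w; };

inline EdgeLine parseEdgeLine(const char* s) {
  char* end = nullptr;
  EdgeLine e;
  e.u = std::strtoull(s, &end, 10);    // source vertex id
  e.v = std::strtoull(end, &end, 10);  // target vertex id
  e.w = std::strtod(end, &end);        // edge weight (0 if absent)
  return e;
}
```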


I switched to strtoull() and strtod() for the conversion, switched from stringstream to istringstream, and parallelized edge addition into the graph (along with a parallel update). On the web-Google graph we now get a speedup of ~6x for graph loading. I also log the system date and time to help detect any remaining bottlenecks. The parallel edge update works as follows: each of the 12 threads scans the parsed edges, but only the thread whose chunk contains the source vertex adds the edge to the graph.
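A sketch of this owner-computes insertion, under assumed names (addEdgesChunked and the addEdge parameter are illustrative, not the repository's API), with a sequential fallback when OpenMP is disabled:

```cpp
#include <algorithm>
#include <utility>
#include <vector>
#if defined(_OPENMP)
#include <omp.h>
#else
static inline int omp_get_thread_num()  { return 0; }
static inline int omp_get_num_threads() { return 1; }
#endif

// Every thread scans the same batch of parsed edges, but inserts an
// edge only when its source vertex falls inside the thread's contiguous
// chunk of vertex ids. Writes therefore never collide, so no locks are
// needed. `addEdge` stands in for the graph's single-threaded insert.
template <class G, class F>
void addEdgesChunked(G& graph, const std::vector<std::pair<int, int>>& edges,
                     int numVertices, F addEdge) {
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    int T = omp_get_num_threads();
    int chunk = (numVertices + T - 1) / T;
    int lo = t * chunk, hi = std::min(numVertices, lo + chunk);
    for (const auto& e : edges)
      if (e.first >= lo && e.first < hi) addEdge(graph, e.first, e.second);
  }
}
```

Each vertex has exactly one owning thread, so insertion order per vertex stays deterministic even with many threads.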


Graph functions such as symmetricize() are now similarly parallelized.
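A possible shape for a parallel symmetricize() under the same owner-computes rule is sketched below (a simplified stand-in, not the repository's implementation): thread t appends the reverse edge (v, u) only when v lies in its chunk, so no two threads write the same adjacency list.

```cpp
#include <algorithm>
#include <vector>
#if defined(_OPENMP)
#include <omp.h>
#else
static inline int omp_get_thread_num()  { return 0; }
static inline int omp_get_num_threads() { return 1; }
#endif

// Add the reverse of every edge. The original adjacency is snapshotted
// first, because push_back() may reallocate a list that another thread
// is still scanning; a real implementation would avoid the full copy.
void symmetricizeOmp(std::vector<std::vector<int>>& g) {
  int N = (int) g.size();
  std::vector<std::vector<int>> orig = g;  // read-only snapshot
  #pragma omp parallel
  {
    int t = omp_get_thread_num();
    int T = omp_get_num_threads();
    int chunk = (N + T - 1) / T;
    int lo = t * chunk, hi = std::min(N, lo + chunk);
    for (int u = 0; u < N; ++u)
      for (int v : orig[u])
        if (v >= lo && v < hi) g[v].push_back(u);
  }
}
```

Edges that were already bidirectional end up duplicated here; in the scheme above, a later update() pass that sorts each vertex's edges and keeps only the last unique entry would remove them.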


We are able to load graphs with more than 4 billion edges in under 10 minutes. Most of the loading time goes into reading the file and adding edges to the graph data structure, for a net loading rate of about 10 million edges per second. Updating the graph, which involves sorting the edges of each vertex while keeping only the last unique entry, is about 50 times faster.


All outputs are saved in a gist and a small part of the output is listed here. Some charts are also included below, generated from sheets. The input data used for this experiment is available from the SuiteSparse Matrix Collection. This experiment was done with guidance from Prof. Kishore Kothapalli and Prof. Dip Sankar Banerjee.


```console
$ g++ -std=c++17 -O3 main.cxx
$ ./a.out ~/Data/GAP-road.mtx
$ ./a.out ~/Data/GAP-twitter.mtx
$ ...

# 2022-12-17 21:04:36 Loading graph /home/subhajit/Data/GAP-road.mtx ...
# 2022-12-17 21:04:36 OMP_NUM_THREADS=24
# 2022-12-17 21:04:42 readMtxOmpW(): vertices=269.8ms, read=3173.7ms, parse=218.2ms, edges=2290.6ms, update=193.3ms
# 2022-12-17 21:04:42 order: 23947347 size: 57708624 [directed] {}
# 2022-12-17 21:04:42 [06160.472 ms] readMtxOmpW
#
# 2022-12-17 21:04:43 Loading graph /home/subhajit/Data/GAP-twitter.mtx ...
# 2022-12-17 21:04:43 OMP_NUM_THREADS=24
# 2022-12-17 21:08:07 readMtxOmpW(): vertices=711.1ms, read=138848.7ms, parse=10906.0ms, edges=50429.3ms, update=3494.1ms
# 2022-12-17 21:08:07 order: 61578415 size: 1468364884 [directed] {}
# 2022-12-17 21:08:07 [204464.219 ms] readMtxOmpW
#
# ...
```

