
Parallelize betweenness_centrality() #428

Merged
mtreinish merged 11 commits into Qiskit:main from parallel-betweeness on Sep 9, 2021

Conversation

@mtreinish (Member) commented Aug 29, 2021

This commit updates the internals of the betweenness_centrality()
function to leverage rayon parallel iterators to parallelize the loop
over all nodes in the graph that computes the shortest paths used for
the centrality calculation.

TODO:

  • Benchmark function vs existing serial version
  • Try to eliminate memory overhead
  • Add parallel threshold kwarg

This commit updates the internals of the betweenness_centrality()
function to leverage rayon parallel iterators to parallelize the loop
over all nodes in the graph that computes the shortest paths used for
the centrality calculation. There is no data dependency between nodes
for this calculation; only the code that consumes the returned results
has one. This changes the implementation to pre-compute the list of
ShortestPathData objects in parallel first and then iterate over that,
instead of iterating over the nodes and computing as we go. The tradeoff
is that we'll use more memory to store the intermediate results in
exchange for being able to execute in parallel. Benchmarking will need
to be done to see if this is worthwhile and if there is a threshold
where the tradeoff makes sense.
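
A minimal sketch of the shape described above, with assumed names (ShortestPathData's contents and shortest_paths_from are stand-ins, not the actual retworkx internals): the per-source shortest-path pass runs under a rayon parallel iterator, and the accumulation over the collected results stays serial.

use rayon::prelude::*;

// Stand-in for the per-source data the real code would collect
// (predecessors, path counts, visit order, ...).
struct ShortestPathData {
    source: usize,
}

// Placeholder for the shortest-path (forward) pass from one source node.
fn shortest_paths_from(source: usize) -> ShortestPathData {
    ShortestPathData { source }
}

fn main() {
    let node_count: usize = 8;

    // Parallel phase: no data dependency between source nodes.
    let path_data: Vec<ShortestPathData> = (0..node_count)
        .into_par_iter()
        .map(shortest_paths_from)
        .collect();

    // Serial phase: accumulate the stored intermediate results.
    let mut betweenness = vec![0.0_f64; node_count];
    for data in &path_data {
        betweenness[data.source] += 1.0; // placeholder for the real accumulation step
    }
    println!("{:?}", betweenness);
}

Collecting the full Vec of intermediate results is where the extra memory use described above comes from.
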
@mtreinish (Member Author)

@jlapeyre I'm not sure if you had any scripts or benchmarks you were using on #220, but if you did it would be good to test them against this branch. I just wrote the code and pushed it up and haven't had a chance to test it yet.

@coveralls commented Aug 29, 2021

Pull Request Test Coverage Report for Build 1214679179

  • 91 of 91 (100.0%) changed or added relevant lines in 1 file are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.07%) to 97.437%

Files with coverage reduction:

  • src/shortest_path/astar.rs — 9 new missed lines, 91.74% covered

Totals:

  • Change from base Build 1214678187: -0.07%
  • Covered Lines: 10074
  • Relevant Lines: 10339

💛 - Coveralls

@georgios-ts (Collaborator)

After skimming through the code, I think we can parallelize it a bit more if each thread computes and returns the delta vector that appears in _accumulate_* instead of returning ShortestPathData. However, this might need more changes in the code.
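
A rough sketch of what this suggestion could look like (delta_for_source is an assumed stand-in, not actual retworkx code): each worker returns only its per-source delta vector and rayon's reduce sums them element-wise, so no ShortestPathData has to be stored.

use rayon::prelude::*;

// Placeholder for the dependency-accumulation (delta) pass for a single source.
fn delta_for_source(node_count: usize, _source: usize) -> Vec<f64> {
    vec![0.0; node_count]
}

fn main() {
    let node_count: usize = 8;
    let betweenness: Vec<f64> = (0..node_count)
        .into_par_iter()
        .map(|source| delta_for_source(node_count, source))
        .reduce(
            || vec![0.0; node_count],
            |mut acc, delta| {
                // Element-wise sum of the per-source contributions.
                for (a, d) in acc.iter_mut().zip(delta) {
                    *a += d;
                }
                acc
            },
        );
    println!("{:?}", betweenness);
}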

@jlapeyre (Collaborator) commented Sep 2, 2021

I'm not sure if you had any scripts or benchmarks you were using on #220

I didn't find a convenient workflow for scripts. I had to install (in my venv) every time I wanted to run the script, which takes a minimum of three minutes, so it's not very interactive. But I did use the following:

import retworkx
import retworkx.generators

n = 50
g = retworkx.generators.grid_graph(n, n)
btw = retworkx.betweenness_centrality(g, normalized=False);
print(max(btw.values()))

This was to test correctness, speed, and overflow for various data types. The same test was used in an issue for betweenness centrality in the Julia graph package referenced above. The result, max(btw.values()), was nearly the same in the two implementations (I didn't count zeros, but the difference was maybe 10^(-14)). The current retworkx version (without threads) was something like 30% faster than Julia for these tests. I have not yet tried this on the branch for this PR.

@mtreinish (Member Author)

I just ran this and compared the serial and parallel versions. At a 50x50 grid the multithreaded version was comparable in performance but a bit slower (probably from thread pool overhead). But for grids >= 100x100 there is a noticeable improvement with this PR (on my desktop with 32 cores/64 threads, a 100x100 grid took ~86.6 sec with this PR and ~104.1 sec with 0.10.1). However, the intermediate storage uses a ton of memory, especially for larger graphs; I hit the OOM killer running a 200x200 grid (with 128GB of RAM). So to make this approach viable I think I need to do two things: add a configurable threshold for parallelization, and try to eliminate (or significantly reduce, per @georgios-ts's suggestion) the intermediate storage when running in parallel.

This commit reworks the parallelization to run the full betweenness
calculation loop in parallel. This wasn't done in the initial commit
because the betweenness Vec is mutated by each thread and we can't write
to it from multiple threads at once. To overcome this a RwLock is used
to ensure only one thread is writing to the betweenness Vec at a time.
While this does limit the overall throughput of the function, it's not
any slower than doing it serially, and it eliminates the use of
intermediate storage between the main loop and updating the output
betweenness Vec.
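
A minimal sketch of this locking approach with the per-source work elided (the placeholder write stands in for the real shortest-path pass and accumulation): the whole source loop runs under par_iter and a RwLock around the betweenness Vec ensures only one thread writes at a time.

use rayon::prelude::*;
use std::sync::RwLock;

fn main() {
    let node_count: usize = 8;
    let betweenness = RwLock::new(vec![0.0_f64; node_count]);

    (0..node_count).into_par_iter().for_each(|source| {
        // Hold the write lock for the per-source body so the shared Vec is
        // never mutated from two threads at once.
        let mut guard = betweenness.write().unwrap();
        // ... shortest-path pass and accumulation for `source` go here ...
        guard[source] += 1.0; // placeholder write
    });

    println!("{:?}", betweenness.into_inner().unwrap());
}

Because the lock here spans the whole per-source body, threads serialize on it; the follow-up commit below narrows the lock scope.
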
@mtreinish (Member Author) commented Sep 2, 2021

The most recent commit, mtreinish@a9a4d12, removes the intermediate storage, makes everything multithreaded, and uses a lock around betweenness to get around the shared mutability. This actually improved performance pretty noticeably over the earlier version. I ran some scaling runs comparing this to the single-threaded released version:

[two plots: runtime vs. grid size for this PR and the 0.10.1 release; same data, shown on linear and log-log axes]

I ran a linear sweep from a 2x2 to a 200x200 grid graph on my laptop and my desktop. My laptop is more modest with a 6 core/12 thread CPU and 16GB of RAM (and I think I started hitting thermal issues on the larger graphs), while my desktop has a 32 core/64 thread CPU and 128GB of RAM. The thing still left to decide is how many nodes to make the default parallel threshold (which is why I also included a log-log graph). Based on the log-log graph, I'm thinking somewhere between 36 and 100 nodes is the point where the multithreaded version is faster.

I also have an idea to decrease the lock wait time a bit which should improve performance more. If I get that working I'll push up new graphs.

The previous commit, which introduced the use of a RwLock around the
betweenness Vec, was a bit too aggressive about which code required the
lock. It was acquiring the lock too early, which left each thread
waiting a long time for the lock despite there being work it could do
before it needed it. We only need the lock when betweenness is being
used in the parallel portion of the code. This commit updates the use of
the lock to do just that, which further improves the runtime of the
function.

Before adopting this approach I did try to go a step further than this
commit and acquire the lock only for each individual write to
betweenness, but this significantly degraded performance. The overhead
of locking in that case became quite high, and at least from a quick
inspection each thread was spending ~50% of its time dealing with
locking. The approach used in this commit resulted in much better
performance.
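
A minimal sketch of the narrower lock scope described above (delta_for_source is a stand-in for the real per-source work): each thread does all of its expensive work without the lock and only takes the write lock for the short accumulation into the shared Vec.

use rayon::prelude::*;
use std::sync::RwLock;

// Placeholder for the lock-free portion: shortest paths and delta for one source.
fn delta_for_source(node_count: usize, _source: usize) -> Vec<f64> {
    vec![0.0; node_count]
}

fn main() {
    let node_count: usize = 8;
    let betweenness = RwLock::new(vec![0.0_f64; node_count]);

    (0..node_count).into_par_iter().for_each(|source| {
        // All of the expensive work happens before the lock is acquired.
        let delta = delta_for_source(node_count, source);
        // Short critical section: just the writes into the shared Vec.
        let mut guard = betweenness.write().unwrap();
        for (b, d) in guard.iter_mut().zip(delta) {
            *b += d;
        }
    });

    println!("{:?}", betweenness.read().unwrap());
}

Going even finer and locking around each individual element write is the variant that showed the ~50% locking overhead mentioned above.
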
@mtreinish (Member Author) commented Sep 2, 2021

The updated locking strategy I tried worked extremely well and I pushed it in: d96408f

The updated graphs with this new commit are below (the 0.10.1 release data is unchanged but still included in the graphs):

[updated plots: runtime vs. grid size on linear and log-log axes]

The scaling on my desktop was quite an improvement; the 200x200 grid only took ~36 sec.

This commit adds a new kwarg, parallel_threshold, which sets the number
of nodes at which the function starts to run multithreaded. For any
graph with fewer nodes than the value of this kwarg the execution will
be single threaded.
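
A minimal sketch of the parallel_threshold dispatch (process_source is a stand-in for the per-source betweenness work): below the threshold the loop over sources uses a plain iterator, otherwise rayon's par_iter.

use rayon::prelude::*;

fn main() {
    let parallel_threshold: usize = 50;
    let node_indices: Vec<usize> = (0..60).collect();

    // Stand-in for the per-source betweenness work.
    let process_source = |source: &usize| {
        let _ = source;
    };

    if node_indices.len() < parallel_threshold {
        node_indices.iter().for_each(process_source);
    } else {
        node_indices.par_iter().for_each(process_source);
    }
}

The duplicated serial/parallel blocks implied by this branch are what the iter vs. par_iter discussion later in the thread tries to clean up.
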
@mtreinish changed the title from "[WIP] Parallelize shortest path calculation in betweenness_centrality()" to "Parallelize shortest path calculation in betweenness_centrality()" Sep 2, 2021
@mtreinish removed the on hold label Sep 2, 2021
@mtreinish added this to the 0.11.0 milestone Sep 2, 2021
@mtreinish (Member Author)

Ok, this is ready for review now. I added a parallel_threshold kwarg and went with a default of 50 nodes, which in my scale testing seemed to be the point at which the multithreaded version was faster than a single thread.

@mtreinish changed the title from "Parallelize shortest path calculation in betweenness_centrality()" to "Parallelize betweenness_centrality()" Sep 2, 2021
@georgios-ts (Collaborator) left a comment

This looks great!

@mtreinish merged commit 0a58b85 into Qiskit:main Sep 9, 2021
@mtreinish deleted the parallel-betweeness branch September 9, 2021 12:43
@mtreinish restored the parallel-betweeness branch September 9, 2021 19:46
@jlapeyre (Collaborator)

I'm sort of late to the party (!) But I wonder if you can avoid some code duplication with something like the following. I'm sure this is not correct Rust, but just to get the idea: let iter_method = node_count.iter or let iter_method = node_count.par_iter depending on the threshold count. Then you only need one code block with

        node_indices
            .iter_method()
            .map(|node_s| {
            ...

@mtreinish (Member Author)

I've tried something like this in the past for other functions with a parallel threshold and ran into a typing issue: the parallel iterator and the serial iterator are different types, so when you try to use the same variable for either, the compiler complains that the types don't match. There might be a shared trait we could use as a dyn type, but I'll need to do some digging to see if there is a pattern that works. I agree it would be way nicer if we could just use a single iterator chain and switch between serial and parallel (it would clean up the code in a bunch of places).

@georgios-ts (Collaborator)

A macro could do the trick:

macro_rules! do_work {
        ($iter_method:ident) => {
            node_indices
                .$iter_method()
                .map(|node_s| {
                ...

and then

if graph.node_count() < parallel_threshold {
    do_work!(iter)
} else {
    do_work!(par_iter)
}

@jlapeyre (Collaborator) commented Sep 10, 2021

Using a macro is not bad. But of course, plain code is preferable if possible. Maybe refactoring the work into a function is the way to go. This might look very much like the macro above, but do_work! would be a function, not a macro.

EDIT: An anonymous function, i.e. a closure, to which you pass only iter_method, might work nicely. I'll try this.

@jlapeyre (Collaborator) commented Sep 13, 2021

After trying several things, I've discovered that Rust is not Python (or Julia). The compiler insists on knowing the type of an argument (even by inference) at compile time. Looking online, it seems several people accept that factoring out the code is not possible (without a macro), although this particular problem is not uncommon. I found two other solutions that don't involve factoring the code into a function:

  1. https://github.com/cuviper/rayon-cond (rayon-cond is a published crate; see the sketch after this list)
  2. https://users.rust-lang.org/t/dynamically-using-or-not-using-rayon/18327/2
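
A sketch of the first option, assuming the rayon-cond crate's CondIterator::new(iterable, parallel) constructor and with the per-source work stubbed out: the iterator chain is written once and runs in parallel only when the flag is true.

// Assumes the rayon-cond crate is added as a dependency.
use rayon_cond::CondIterator;

fn main() {
    let parallel_threshold: usize = 50;
    let node_indices: Vec<usize> = (0..200).collect();
    let run_parallel = node_indices.len() >= parallel_threshold;

    // One iterator chain; CondIterator dispatches to a serial or parallel
    // iterator at runtime based on the flag.
    CondIterator::new(node_indices, run_parallel).for_each(|node_s| {
        // Stand-in for the per-source betweenness work.
        let _ = node_s;
    });
}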
