
Parallelize betweenness_centrality() #428

Merged
mtreinish merged 11 commits into Qiskit:main from parallel-betweeness on Sep 9, 2021

Conversation

@mtreinish (Member) commented Aug 29, 2021

This commit updates the internals of the betweenness_centrality()
function to leverage rayon parallel iterators to parallelize the loop
over all nodes in the graph that computes the shortest paths used for
the centrality calculation.

TODO:

  • Benchmark function vs existing serial version
  • Try to eliminate memory overhead
  • Add parallel threshold kwarg

This commit updates the internals of the betweenness_centrality()
function to leverage rayon parallel iterators to parallelize the loop
over all nodes in the graph that computes the shortest paths used for
the centrality calculation. There is no data dependency between nodes
for this calculation; only the code that consumes the returned results
has one. This changes the implementation to pre-compute the list of
ShortestPathData objects in parallel first and then iterate over that,
instead of iterating over the nodes and computing as we go. The tradeoff
is that we'll use more memory to store the intermediate results in
exchange for being able to execute in parallel. Benchmarking will need
to be done to see if this is worthwhile and if there is a threshold
where the tradeoff makes sense.
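
A minimal sketch of the shape described above, with assumed names (ShortestPathData's contents and shortest_paths_from are stand-ins, not the actual retworkx internals): the per-source shortest-path pass runs under a rayon parallel iterator, and the accumulation over the collected results stays serial.

use rayon::prelude::*;

// Stand-in for the per-source data the real code would collect
// (predecessors, path counts, visit order, ...).
struct ShortestPathData {
    source: usize,
}

// Placeholder for the shortest-path (forward) pass from one source node.
fn shortest_paths_from(source: usize) -> ShortestPathData {
    ShortestPathData { source }
}

fn main() {
    let node_count: usize = 8;

    // Parallel phase: no data dependency between source nodes.
    let path_data: Vec<ShortestPathData> = (0..node_count)
        .into_par_iter()
        .map(shortest_paths_from)
        .collect();

    // Serial phase: accumulate the stored intermediate results.
    let mut betweenness = vec![0.0_f64; node_count];
    for data in &path_data {
        betweenness[data.source] += 1.0; // placeholder for the real accumulation step
    }
    println!("{:?}", betweenness);
}

Collecting the full Vec of intermediate results is where the extra memory use described above comes from.
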
@mtreinish (Member Author)

@jlapeyre I'm not sure if you had any scripts or benchmarks you were using on #220, but if you did it would be good to test them against this branch. I just wrote the code and pushed it up and haven't had a chance to test it yet.

@coveralls commented Aug 29, 2021

Pull Request Test Coverage Report for Build 1214679179

  • 91 of 91 (100.0%) changed or added relevant lines in 1 file are covered.
  • 9 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.07%) to 97.437%

Files with coverage reduction:

  • src/shortest_path/astar.rs — 9 new missed lines, 91.74% covered

Totals:

  • Change from base Build 1214678187: -0.07%
  • Covered Lines: 10074
  • Relevant Lines: 10339

💛 - Coveralls

@georgios-ts (Collaborator)

After skimming through the code, I think we can parallelize it a bit more if each thread computes and returns the delta vector that appears in _accumulate_* instead of returning ShortestPathData. However, this might need more changes in the code.
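
A rough sketch of what this suggestion could look like (delta_for_source is an assumed stand-in, not actual retworkx code): each worker returns only its per-source delta vector and rayon's reduce sums them element-wise, so no ShortestPathData has to be stored.

use rayon::prelude::*;

// Placeholder for the dependency-accumulation (delta) pass for a single source.
fn delta_for_source(node_count: usize, _source: usize) -> Vec<f64> {
    vec![0.0; node_count]
}

fn main() {
    let node_count: usize = 8;
    let betweenness: Vec<f64> = (0..node_count)
        .into_par_iter()
        .map(|source| delta_for_source(node_count, source))
        .reduce(
            || vec![0.0; node_count],
            |mut acc, delta| {
                // Element-wise sum of the per-source contributions.
                for (a, d) in acc.iter_mut().zip(delta) {
                    *a += d;
                }
                acc
            },
        );
    println!("{:?}", betweenness);
}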

@jlapeyre (Collaborator) commented Sep 2, 2021

I'm not sure if you had any scripts or benchmarks you were using on #220

I didn't find a convenient workflow for scripts. I had to install (in my venv) every time I wanted to run the script, which takes a minimum of three minutes, so it's not very interactive. But I did use the following:

import retworkx
import retworkx.generators

n = 50
g = retworkx.generators.grid_graph(n, n)
btw = retworkx.betweenness_centrality(g, normalized=False);
print(max(btw.values()))

This was to test correctness, speed, and overflow for various data types. The same test was used in an issue for betweenness centrality in the Julia graph package referenced above. The result, max(btw.values()), was nearly the same in the two implementations (I didn't count zeros, but the difference was maybe 10^(-14)). The current retworkx version (without threads) was something like 30% faster than Julia for these tests. I have not yet tried this on the branch for this PR.

@mtreinish (Member Author)

I just ran this and compared the serial and parallel versions. At a 50x50 grid the multithreaded version was comparable in performance but a bit slower (probably from thread pool overhead). But for grids >= 100x100 there is a noticeable improvement with this PR (on my desktop with 32 cores/64 threads, a 100x100 grid took ~86.6 sec with this PR and ~104.1 sec with 0.10.1). However, the intermediate storage uses a ton of memory, especially for larger graphs; I hit the OOM killer running a 200x200 grid (with 128GB of RAM). So to make this approach viable I think I need to do two things: add a configurable threshold for parallelization, and try to eliminate (or significantly reduce, per @georgios-ts's suggestion) the intermediate storage when running in parallel.

This commit reworks the parallelization to run the full betweenness
calculation loop in parallel. This wasn't done in the initial commit
because the betweenness Vec is mutated by each thread and we can't write
to it from multiple threads at once. To overcome this a RwLock is used
to ensure only one thread is writing to the betweenness Vec at a time.
While this does limit the overall throughput of the function, it's not
any slower than doing it serially, and it eliminates the use of
intermediate storage between the main loop and updating the output
betweenness Vec.
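
A minimal sketch of this locking approach with the per-source work elided (the placeholder write stands in for the real shortest-path pass and accumulation): the whole source loop runs under par_iter and a RwLock around the betweenness Vec ensures only one thread writes at a time.

use rayon::prelude::*;
use std::sync::RwLock;

fn main() {
    let node_count: usize = 8;
    let betweenness = RwLock::new(vec![0.0_f64; node_count]);

    (0..node_count).into_par_iter().for_each(|source| {
        // Hold the write lock for the per-source body so the shared Vec is
        // never mutated from two threads at once.
        let mut guard = betweenness.write().unwrap();
        // ... shortest-path pass and accumulation for `source` go here ...
        guard[source] += 1.0; // placeholder write
    });

    println!("{:?}", betweenness.into_inner().unwrap());
}

Because the lock here spans the whole per-source body, threads serialize on it; the follow-up commit below narrows the lock scope.
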
@mtreinish (Member Author) commented Sep 2, 2021

The most recent commit, mtreinish@a9a4d12, removes the intermediate storage, makes everything multithreaded, and uses a lock around betweenness to get around the shared mutability. This actually improved performance pretty noticeably over the earlier version. I ran some scaling runs comparing this to the single-threaded released version:

[two plots: runtime vs. grid size for this PR and the 0.10.1 release; same data, shown on linear and log-log axes]

I ran a linear sweep from a 2x2 to a 200x200 grid graph on my laptop and my desktop. My laptop is more modest with a 6 core/12 thread CPU and 16GB of RAM (and I think I started hitting thermal issues on the larger graphs), while my desktop has a 32 core/64 thread CPU and 128GB of RAM. The thing still left to decide is how many nodes to make the default parallel threshold (which is why I also included a log-log graph). Based on the log-log graph, I'm thinking somewhere between 36 and 100 nodes is the point where the multithreaded version is faster.

I also have an idea to decrease the lock wait time a bit which should improve performance more. If I get that working I'll push up new graphs.

The previous commit, which introduced the use of a RwLock around the
betweenness Vec, was a bit too aggressive about which code required the
lock. It was acquiring the lock too early, which left each thread
waiting a long time for the lock despite there being work it could do
before it needed it. We only need the lock when betweenness is being
used in the parallel portion of the code. This commit updates the use of
the lock to do just that, which further improves the runtime of the
function.

Before adopting this approach I did try to go a step further than this
commit and acquire the lock only for each individual write to
betweenness, but this significantly degraded performance. The overhead
of locking in that case became quite high, and at least from a quick
inspection each thread was spending ~50% of its time dealing with
locking. The approach used in this commit resulted in much better
performance.
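
A minimal sketch of the narrower lock scope described above (delta_for_source is a stand-in for the real per-source work): each thread does all of its expensive work without the lock and only takes the write lock for the short accumulation into the shared Vec.

use rayon::prelude::*;
use std::sync::RwLock;

// Placeholder for the lock-free portion: shortest paths and delta for one source.
fn delta_for_source(node_count: usize, _source: usize) -> Vec<f64> {
    vec![0.0; node_count]
}

fn main() {
    let node_count: usize = 8;
    let betweenness = RwLock::new(vec![0.0_f64; node_count]);

    (0..node_count).into_par_iter().for_each(|source| {
        // All of the expensive work happens before the lock is acquired.
        let delta = delta_for_source(node_count, source);
        // Short critical section: just the writes into the shared Vec.
        let mut guard = betweenness.write().unwrap();
        for (b, d) in guard.iter_mut().zip(delta) {
            *b += d;
        }
    });

    println!("{:?}", betweenness.read().unwrap());
}

Going even finer and locking around each individual element write is the variant that showed the ~50% locking overhead mentioned above.
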
@mtreinish (Member Author) commented Sep 2, 2021

The updated locking strategy I tried worked extremely well and I pushed it in: d96408f

The updated graphs with this new commit are below (the 0.10.1 release data is unchanged but still included in the graphs):

[updated plots: runtime vs. grid size on linear and log-log axes]

The scaling on my desktop was quite an improvement; the 200x200 grid only took ~36 sec.

This commit adds a new kwarg, parallel_threshold, which sets the number
of nodes at which the function starts to run multithreaded. For any
graph with fewer nodes than the value of this kwarg the execution will
be single threaded.
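
A minimal sketch of the parallel_threshold dispatch (process_source is a stand-in for the per-source betweenness work): below the threshold the loop over sources uses a plain iterator, otherwise rayon's par_iter.

use rayon::prelude::*;

fn main() {
    let parallel_threshold: usize = 50;
    let node_indices: Vec<usize> = (0..60).collect();

    // Stand-in for the per-source betweenness work.
    let process_source = |source: &usize| {
        let _ = source;
    };

    if node_indices.len() < parallel_threshold {
        node_indices.iter().for_each(process_source);
    } else {
        node_indices.par_iter().for_each(process_source);
    }
}

The duplicated serial/parallel blocks implied by this branch are what the iter vs. par_iter discussion later in the thread tries to clean up.
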
@mtreinish changed the title from "[WIP] Parallelize shortest path calculation in betweenness_centrality()" to "Parallelize shortest path calculation in betweenness_centrality()" Sep 2, 2021
@mtreinish removed the on hold label Sep 2, 2021
@mtreinish added this to the 0.11.0 milestone Sep 2, 2021
@mtreinish (Member Author)

Ok, this is ready for review now. I added a parallel_threshold kwarg and went with a default of 50 nodes, which in my scale testing seemed to be the point at which the multithreaded version was faster than a single thread.

@mtreinish changed the title from "Parallelize shortest path calculation in betweenness_centrality()" to "Parallelize betweenness_centrality()" Sep 2, 2021
@georgios-ts (Collaborator) left a comment

This looks great!

@mtreinish merged commit 0a58b85 into Qiskit:main Sep 9, 2021
@mtreinish deleted the parallel-betweeness branch September 9, 2021 12:43
@mtreinish restored the parallel-betweeness branch September 9, 2021 19:46
@jlapeyre (Collaborator)

I'm sort of late to the party (!) But I wonder if you can avoid some code duplication with something like the following. I'm sure this is not correct Rust, but just to get the idea: let iter_method = node_count.iter or let iter_method = node_count.par_iter depending on the threshold count. Then you only need one code block with

        node_indices
            .iter_method()
            .map(|node_s| {
            ...

@mtreinish (Member Author)

I've tried something like this in the past for other functions with a parallel threshold and ran into a typing issue: the parallel iterator and the serial iterator are different types, so when you try to use the same variable for either, the compiler complains that the types don't match. There might be a shared trait we could use as a dyn type, but I'll need to do some digging to see if there is a pattern that works. I agree it would be way nicer if we could just use a single iterator chain and switch between serial and parallel (it would clean up the code in a bunch of places).

@georgios-ts (Collaborator)

A macro could do the trick:

macro_rules! do_work {
        ($iter_method:ident) => {
            node_indices
                .$iter_method()
                .map(|node_s| {
                ...

and then

if graph.node_count() < parallel_threshold {
    do_work!(iter)
} else {
    do_work!(par_iter)
}

@jlapeyre (Collaborator) commented Sep 10, 2021

Using a macro is not bad. But of course, plain code is preferable if possible. Maybe refactoring the work into a function is the way to go. This might look very much like the macro above, but do_work! would be a function, not a macro.

EDIT: An anonymous function, i.e. a closure, to which you pass only iter_method, might work nicely. I'll try this.

@jlapeyre (Collaborator) commented Sep 13, 2021

After trying several things, I've discovered that Rust is not Python (or Julia). The compiler insists on knowing the type of an argument (even by inference) at compile time. Looking online, it seems several people accept that factoring out the code is not possible (without a macro), although this particular problem is not uncommon. I found two other solutions that don't involve factoring the code into a function:

  1. https://github.com/cuviper/rayon-cond (rayon-cond is a published crate; see the sketch after this list)
  2. https://users.rust-lang.org/t/dynamically-using-or-not-using-rayon/18327/2
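
A sketch of the first option, assuming the rayon-cond crate's CondIterator::new(iterable, parallel) constructor and with the per-source work stubbed out: the iterator chain is written once and runs in parallel only when the flag is true.

// Assumes the rayon-cond crate is added as a dependency.
use rayon_cond::CondIterator;

fn main() {
    let parallel_threshold: usize = 50;
    let node_indices: Vec<usize> = (0..200).collect();
    let run_parallel = node_indices.len() >= parallel_threshold;

    // One iterator chain; CondIterator dispatches to a serial or parallel
    // iterator at runtime based on the flag.
    CondIterator::new(node_indices, run_parallel).for_each(|node_s| {
        // Stand-in for the per-source betweenness work.
        let _ = node_s;
    });
}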
