Parallelize betweenness_centrality() #428
Conversation
This commit updates the internals of the betweenness_centrality() function to use rayon parallel iterators to parallelize the loop over all nodes in the graph that computes the shortest paths needed for centrality. There is no data dependency between the per-node shortest-path calculations; only the accumulation of their results into the output depends on shared state. The implementation now pre-computes the list of ShortestPathData objects in parallel first and then iterates over that list, instead of iterating over the nodes and computing the paths as it goes. The tradeoff is that we use more memory to store the intermediate results in exchange for being able to execute in parallel. Benchmarking will need to be done to see whether this is worthwhile and whether there is a threshold where the tradeoff makes sense.
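As a rough illustration of the two-stage pattern described above (not the actual retworkx code: the ShortestPathData fields, shortest_paths_from, and the accumulation step are hypothetical stand-ins), a minimal sketch assuming rayon might look like this:

use rayon::prelude::*;

// Hypothetical stand-in for retworkx's per-source shortest-path results.
struct ShortestPathData {
    source: usize,
    // ... path counts, distances, predecessor lists, etc.
}

// Hypothetical per-source shortest-path computation (the expensive part).
fn shortest_paths_from(node: usize) -> ShortestPathData {
    ShortestPathData { source: node }
}

fn betweenness(node_count: usize) -> Vec<f64> {
    // Stage 1: no data dependency between source nodes, so compute the
    // per-node results in parallel and collect them all (this is where
    // the extra memory for intermediate results goes).
    let path_data: Vec<ShortestPathData> = (0..node_count)
        .into_par_iter()
        .map(shortest_paths_from)
        .collect();

    // Stage 2: serial accumulation into the output, which is the only
    // part that depends on shared state.
    let mut betweenness = vec![0.0; node_count];
    for data in &path_data {
        // Placeholder update; the real accumulation touches many entries.
        betweenness[data.source] += 1.0;
    }
    betweenness
}

fn main() {
    println!("{:?}", betweenness(4));
}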
After skimming through the code, I think we can parallelize it a bit more if each thread computes and returns the
I didn't find a convenient workflow for scripts. I mean I had to install (in my venv) every time I wanted to run the script. This takes a minimum of three minutes, so not very interactive. But I did use the following:

import retworkx
import retworkx.generators

n = 50
g = retworkx.generators.grid_graph(n, n)
btw = retworkx.betweenness_centrality(g, normalized=False)
print(max(btw.values()))

This was to test correctness, speed, and overflow for various data types. The same test was used in an issue for betweenness centrality in the Julia graph package referenced above. The result,
I just ran this and compared the serial and parallel versions. At a 50x50 grid the multithreaded version was comparable in performance but a bit slower (probably from thread pool overhead). But for a grid >= 100x100 there is a noticeable improvement with this PR (on my desktop with 32 cores/64 threads a 100x100 grid took ~86.6 sec with this PR and ~104.1 sec with 0.10.1). However, the intermediate storage uses a ton of memory, especially for larger graphs. I hit the OOM killer running a 200x200 grid (with 128 GB of RAM), so to make this approach viable I think I need to do two things: add a configurable threshold for parallelization, and eliminate (or significantly reduce, as per @georgios-ts's suggestion) the intermediate storage when running in parallel.
This commit reworks the parallelization to run the full betweenness calculation loop in parallel. This wasn't done in the initial commit because the betweenness Vec is mutated by each thread and we can't write to it from multiple threads at once. To overcome this, a RwLock is used to ensure only one thread is writing to the betweenness Vec at a time. While this does limit the overall throughput of the function, it's not any slower than doing it serially and it eliminates the intermediate storage between the main loop and updating the output betweenness Vec.
The most recent commit, mtreinish@a9a4d12, removes the intermediate storage, makes everything multithreaded, and uses a lock around betweenness to get around the shared mutability. This actually improved performance pretty noticeably over the earlier version. I ran some scaling runs comparing this to the single-threaded released version (the two graphs show the same data, the second just on a log-log scale). I ran a linear sweep from a 2x2 to a 200x200 grid graph on my laptop and my desktop. My laptop is more modest with a 6-core/12-thread CPU and 16 GB of RAM (and I think I started hitting thermal issues on the larger graphs), and my desktop has a 32-core/64-thread CPU and 128 GB of RAM. The thing still left to decide is how many nodes to make the default parallel threshold (which is why I also included a log-log graph). Based on the log-log graph I'm thinking somewhere between 36 and 100 nodes is the point where the multithreaded version becomes faster. I also have an idea to decrease the lock wait time a bit, which should improve performance more. If I get that working I'll push up new graphs.
The previous commit, which introduced the use of a RwLock around the betweenness Vec, was a bit too aggressive about which code required the lock. It was acquiring the lock too early, which resulted in each thread waiting a long time for the lock despite there being work it could do before it needed it. We only need the lock when betweenness is actually being used in the parallel portion of the code. This commit updates the use of the lock to do just this, which further improves the runtime of the function. Before adopting this approach I did try to go a step further and acquire the lock only for each individual write to betweenness, but this significantly degraded performance. The overhead of locking in that case became quite high, and at least from a quick inspection each thread was spending ~50% of its time dealing with locking. The approach used in this commit resulted in much better performance.
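A rough sketch of the locking granularity this describes (hypothetical names, not the actual retworkx code): the per-node work runs with no lock held, and the write lock is taken once per node rather than once per individual write:

use rayon::prelude::*;
use std::sync::RwLock;

// Hypothetical stand-in for the per-source shortest-path/dependency work;
// returns the (index, increment) updates this source contributes.
fn compute_updates(node: usize, n: usize) -> Vec<(usize, f64)> {
    vec![(node % n, 1.0)]
}

fn betweenness(node_count: usize) -> Vec<f64> {
    // Shared output; only one rayon worker can hold the write lock at a time.
    let betweenness = RwLock::new(vec![0.0; node_count]);

    (0..node_count).into_par_iter().for_each(|node| {
        // All of the expensive work happens before touching the lock.
        let updates = compute_updates(node, node_count);

        // Acquire the write lock once per node and apply every update
        // under it; locking per individual write adds far more overhead.
        let mut out = betweenness.write().unwrap();
        for (idx, val) in updates {
            out[idx] += val;
        }
    });

    betweenness.into_inner().unwrap()
}

fn main() {
    println!("{:?}", betweenness(4));
}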
The updated locking strategy I tried worked extremely well and I pushed it in d96408f. The updated graphs with this new commit are below (the 0.10.1 release data is unchanged but still included in the graphs). The scaling on my desktop was quite an improvement: the 200x200 grid only took ~36 sec.
This commit adds a new kwarg, parallel_threshold, which sets the number of nodes at which the function starts to run multithreaded. For any graph with fewer nodes than the value of this kwarg the execution will be single-threaded.
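A hedged sketch of how such a threshold check might look in the Rust internals (the function and helper names here are illustrative, not the actual retworkx code). Written this way the two branches duplicate the per-node loop body, which is what the review discussion below is about:

use rayon::prelude::*;

// Illustrative stand-in for the per-node shortest-path work.
fn node_score(node: usize) -> f64 {
    node as f64
}

fn scores(node_indices: &[usize], parallel_threshold: usize) -> Vec<f64> {
    if node_indices.len() < parallel_threshold {
        // Below the threshold: plain serial iterator.
        node_indices.iter().map(|&n| node_score(n)).collect()
    } else {
        // At or above the threshold: rayon parallel iterator.
        // In the real code this repeats the full loop body of the branch above.
        node_indices.par_iter().map(|&n| node_score(n)).collect()
    }
}

fn main() {
    let nodes: Vec<usize> = (0..200).collect();
    println!("{}", scores(&nodes, 50).len());
}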
OK, this is ready for review now. I added a
This looks great!
I'm sort of late to the party (!) But I wonder if you can avoid some code duplication with something like the following. I'm sure this is not correct Rust, but just to get the idea:

node_indices
    .iter_method()
    .map(|node_s| {
        ...
I've tried something like this in the past for other functions with a parallel threshold like this and ran into a typing issue: the parallel iterator and the serial iterator are of different types, so when you try to use the same variable for either, the compiler complains that the types aren't the same. There might be a shared trait we can use a
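To make that type issue concrete, here is a tiny sketch (not from the retworkx code) of why the two iterators can't share one variable:

use rayon::prelude::*;

fn main() {
    let v = vec![1_usize, 2, 3];

    // The serial and parallel iterators are different concrete types:
    let _serial = v.iter();       // std::slice::Iter<'_, usize>
    let _parallel = v.par_iter(); // rayon::slice::Iter<'_, usize>

    // so picking one with a plain if/else is rejected by the compiler
    // with a mismatched-types error between the two branches:
    //
    //     let it = if v.len() > 100 { v.par_iter() } else { v.iter() };
}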
A macro could do the trick:

macro_rules! do_work {
    ($iter_method:ident) => {
        node_indices
            .$iter_method()
            .map(|node_s| {
                ...

and then:

if graph.node_count() < parallel_threshold {
    do_work!(iter)
} else {
    do_work!(par_iter)
}
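For reference, a rough, self-contained version of this idea (the per-node work and the function name are illustrative stand-ins, not retworkx code). Defining the macro inside the function lets its body refer to node_indices directly, and only the iterator method name differs between the two branches:

use rayon::prelude::*;

fn scores(node_indices: &[usize], parallel_threshold: usize) -> Vec<f64> {
    // The same body is expanded for both the serial and the parallel
    // iterator; only the iterator method name changes.
    macro_rules! do_work {
        ($iter_method:ident) => {
            node_indices
                .$iter_method()
                .map(|&node| {
                    // Placeholder for the per-node shortest-path work.
                    node as f64
                })
                .collect()
        };
    }

    if node_indices.len() < parallel_threshold {
        do_work!(iter)
    } else {
        do_work!(par_iter)
    }
}

fn main() {
    let nodes: Vec<usize> = (0..200).collect();
    println!("{}", scores(&nodes, 50).len());
}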
Using a macro is not bad. But of course, plain code is preferable if possible. Maybe refactoring the work into a function is the way to go. This might look very much like the macro above, but

EDIT: An anonymous function, i.e. a closure, so you only pass
After trying several things, I've discovered that Rust is not Python (or Julia). The compiler insists on knowing the type of an argument (even by inference) at compile time. Looking online, it seems several people accept that factoring out the code is not possible (without a macro), although this particular problem is not uncommon. I found two other solutions that don't involve factoring the code into a function: