Extract BFS paths SG implementation #1838
Conversation
Review part 1 (I am still reading extract_bfs_paths_impl.cuh).
vertex_t const* distances_;
vertex_t const* destinations_;

size_t __device__ operator()(vertex_t idx)
Shouldn't this be size_t idx?
Fixed in next push
@@ -310,67 +310,73 @@ ConfigureTest(WCC_TEST components/wcc_test.cpp)
 ConfigureTest(HUNGARIAN_TEST linear_assignment/hungarian_test.cu)

 ###################################################################################################
-# - MST tests ----------------------------------------------------------------------------
+# - MST tests -------------------------------------------------------------------------------------
Thanks for fixing these :-)
vertex_t const* distances_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)
What's the third tuple element (type int)?
The original rank. Each rank has to retrieve results from another rank. The current implementation is:
- Construct the tuple (current_vertex, offset_in_current_frontier_array, my_rank)
- Shuffle the tuples based on which rank contains the predecessor information for current_vertex
- Replace current_vertex with previous_vertex
- Shuffle the tuples back to the original rank where the data is required
- Update current_frontier in the proper location with the predecessor value
I could achieve the same result without the int parameter by writing a more sophisticated communication pattern (sort the vertices by which rank they are assigned to, do block transfers keeping track of the offsets for each rank, etc). I suspect this would be more time efficient as well as memory efficient. But I was trying to reuse existing functions as much as possible to get something working. We can optimize this more later.
Ah... OK, so I guess this code can go away if we implement collect_values_for_vertices.
So, is this functor still used? Can't find a place this function is actually used.
vertex_t const* predecessors_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)
What's the third tuple element (type int)?
So, is this functor still used? Can't find a place this function is actually used.
rmm::device_uvector<vertex_t> const& d_predecessors,
size_t num_paths_to_check,
uint64_t seed)
{
So, if I am not mistaken, this randomly selects num_paths_to_check vertices (that are not sources).
I think we can achieve this by
thrust::copy_if();
thrust::shuffle();
And as this is only for test, another option is to do this in CPU, so no need to add this additional file and keep our codebase size in check.
I can look into shuffle, that's one I haven't used before.
I can certainly look at a CPU implementation. I was imagining that we might have other tests that might benefit from a function to randomly select vertices - although I started with it here because it was easier.
Yeah... so, for personalized PageRank, we're randomly selecting personalization vertices as well (it's currently implemented as host code).
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/link_analysis/pagerank_test.cpp#L178
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/link_analysis/mg_pagerank_test.cpp#L104
We may better combine these and add something under tests/utilities?
Review Part 2
 * @param graph_view Graph view object.
 * @param distances Pointer to the distance array constructed by bfs.
 * @param predecessors Pointer to the predecessor array constructed by bfs.
 * @param destinations Destination vertices, extract path from source to each of these destinations
So, in multi-GPU, destinations should be local to this GPU, right? Shouldn't we explain this here?
And I think this should be mentioned in
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/include/cugraph/algorithms.hpp#L1106
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/include/cugraph/algorithms.hpp#L1185
as well.
Initially I started with this assumption. However, once I started implementing I realized I had to handle off-GPU vertices anyway, so I dropped that assumption. If you specify on GPU 0 a destination that resides on GPU 1, the implementation will just go to GPU 1 to get the predecessor of that destination.
I agree that should be mentioned in the other places. I will add those in this PR.
I think we may still enforce this for consistency with other APIs (and we can avoid communication for the first step).
Changed this back to this assumption.
- Added a check to validate that destinations are on the correct gpu
- Eliminated the initial shuffle logic (outside the loop in the implementation)
- Updated the documentation (including the other locations)
3. Updated the documentation (including the other locations)
I can see updates in other locations, but I can't find documentation updates for extract_bfs_paths? Could you point out the exact line?
Somehow missed this one, updated in next push.
size_t count =
  multi_gpu
    ? cugraph::host_scalar_allreduce(
        handle.get_comms(), current_frontier.size(), raft::comms::op_t::MAX, handle.get_stream())
Shouldn't this better be SUM reduction? (to minimize the difference in SG & MG logics)
I think MAX is semantically what I want, although SUM would suffice.
I'm actually thinking I'll change this loop to a for loop over the range [0, max_path_length), since we already know how many iterations are required.
I'm actually thinking I'll change this loop to a for loop over the range [0, max_path_length), since we already know how many iterations are required.
+1 for this, this will be clearer.
Changed to a for loop, definitely much clearer and saves the allreduce cost.
count = multi_gpu ? cugraph::host_scalar_allreduce(handle.get_comms(),
                                                   current_frontier.size(),
                                                   raft::comms::op_t::MAX,
Shouldn't this better be SUM reduction? (to minimize the difference in SG & MG logics)
Same as above.
Same fix as above
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #1838 +/- ##
===============================================
Coverage ? 70.11%
===============================================
Files ? 143
Lines ? 8827
Branches ? 0
===============================================
Hits ? 6189
Misses ? 2638
Partials ? 0

Continue to review full report at Codecov.
Pushing out to 21.12 to incorporate more of the discussed changes.
@@ -259,4 +259,74 @@ collect_values_for_unique_keys(raft::comms::comms_t const& comm,
   return value_buffer;
 }

+template <typename VertexIterator,
+          typename ValueIterator,
+          typename KeyToGPUIdOp,
I think we'd better rename this to VertexToGPUIdOp to clarify that this function behavior is different from collect_values_for_keys.
Fixed in next push
VertexIterator map_vertex_first,
VertexIterator map_vertex_last,
ValueIterator map_value_first,
KeyToGPUIdOp key_to_gpu_id_op,
Better be VertexToGPUIdOp vertex_to_gpu_id_op?
Fixed in next push
Changed to vertex_partition_lasts
raft::update_host(h_gpu_counts.data(), gpu_counts.data(), gpu_counts.size(), stream_view);

std::tie(shuffled_vertices, shuffled_counts) =
auto [shuffled_vertices, shuffled_counts] =
(so, no need to explicitly define shuffled_vertices & shuffled_counts)
And actually, input_vertices here will not be used after this line, so better to invent a name that covers both "shuffled_vertices" and "input_vertices" and do something like std::tie(new_name, ...) = shuffle_values(..., std::move(new_name), ...);
(so if this function becomes a memory bottleneck, we can free the memory held by input_vertices).
input_vertices is used at the end of the function as the keys for a lookup.
using vertex_t = typename std::iterator_traits<VertexIterator>::value_type;
using value_t  = typename std::iterator_traits<ValueIterator>::value_type;

size_t input_size = thrust::distance(map_vertex_first, map_vertex_last);
If I am not mistaken, [map_vertex_first, map_vertex_last) are vertices we're querying values (not the vertices in local GPU).
In the collect_values_for_keys function,
VertexIterator0 map_key_first,
VertexIterator0 map_key_last,
ValueIterator map_value_first,
store (key, value) pairs local to this GPU and
VertexIterator1 collect_key_first,
VertexIterator1 collect_key_last,
are keys we want to query values for. So, I think map_key_first & last here are inconsistent (and misnomers).
collect_values_for_vertices(raft::comms::comms_t const& comm,
VertexPartitionDeviceViewType vertex_partition_device_view (or local_vertex_first, local_vertex_last to better mirror the above collect_values_for_keys function),
ValueIterator map_value_first,
VertexToGPUIdOp vertex_to_gpu_id_op,
VertexIterator1 collect_unique_vertex_first,
VertexIterator1 collect_unique_vertex_last,
...
Or even better as vertex to GPU ID mapping is pretty much pre-determined by partitioning, we may take std::vector<vertex_t> const& vertex_partition_lasts instead of vertex_to_gpu_id_op (then no need to take vertex_partition_device_view).
collect_values_for_vertices(raft::comms::comms_t const& comm,
ValueIterator local_value_first,
VertexIterator1 collect_unique_vertex_first, (or sorted_unique_vertex_first, if we want to ask caller to sort if not already sorted)
VertexIterator1 collect_unique_vertex_last,
std::vector<vertex_t> const& vertex_partition_lasts, (size == # GPUs),
...
Restructured in the push I just made considering some of these ideas.
thrust::sort(rmm::exec_policy(stream_view), input_vertices.begin(), input_vertices.end());

// 1: Shuffle vertices to the correct node
auto gpu_counts = groupby_and_count(
  input_vertices.begin(), input_vertices.end(), key_to_gpu_id_op, comm.get_size(), stream_view);
So, this is an optimization issue, and we may not need to address this right now (having at least a FIXME statement will be nice), but there are a few issues we need to think about.
In some cases, users may pass an already sorted vertex list to query values for, so this sort can be redundant. And for sorted input_vertices, we can just run thrust::lower_bound for # GPUs values instead of visiting and groupbying every element in input_vertices (and if we're running groupby_and_count, I think the sort above is unnecessary; please double check). See code around https://github.com/rapidsai/cugraph/blob/branch-21.12/cpp/src/structure/renumber_utils_impl.cuh#L258 for reference.
Restructured the code in the next push to accommodate this concern and some of the other comments above.
The sort here is necessary to use the keys for the lookup at the end of this function.
I added a call to thrust::unique, since there's no point in a particular GPU asking for the same key more than once in a single call.
vertex_t const* distances_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)
So, is this functor still used? Can't find a place this function is actually used.
vertex_t const* predecessors_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)
So, is this functor still used? Can't find a place this function is actually used.
rmm::device_uvector<vertex_t> d_vertex_partition_offsets(0, handle.get_stream());

if constexpr (multi_gpu) {
  // FIXME: There should be a better way to do this
Yeah... I think we'd better add a member function that returns std::vector<vertex_t> type (size == # GPUs) vertex_partition_lasts to graph_view_t.
Added the function to graph_view_t in next push
vertex_t number_of_vertices,
vertex_t local_vertex_first,
rmm::device_uvector<vertex_t> const& d_predecessors,
size_t num_paths_to_check)
Just an issue to think about.
So, for testing, should we use CUDA code for this kind of input preparation function (and add additional files, as we cannot use thrust from .cpp files), or should we just use a slower host function (as this is only for testing, performance for this part may not matter much unless it becomes a bottleneck in large-scale benchmarking, and it might help keep our test code simpler)?
Definitely worth considering. Not sure how frequent this pattern will be.
…values_for_vertices, some code cleanup
Looks good to me besides a minor complaint.
cpp/include/cugraph/algorithms.hpp (outdated)
 * @param distances Pointer to the distance array constructed by bfs.
 * @param predecessors Pointer to the predecessor array constructed by bfs.
 * @param destinations Destination vertices, extract path from source to each of these destinations
 * In a multi-gpu context the source vertex should be local to this GPU.
In a multi-gpu context the source vertex should be local to this GPU.
=>
In a multi-gpu context the destination vertex should be local to this GPU.
vertex_t num_vertices,
ValueIterator local_value_first,
std::vector<vertex_t> const& vertex_partition_lasts,
VertexPartitionDeviceViewType vertex_partition_device_view,
Just a minor point, I feel like vertex_partition_device_view is a bit redundant, as we can easily compute local_vertex_first from vertex_partition_lasts (0 if comm_rank == 0 and vertex_partition_lasts[comm_rank - 1] otherwise).
I guess since collect_comm.cuh is part of the primitive functionality, it's reasonable to have it know the internal workings of the data structures. I have been trying to avoid having things outside the primitives (most of what I work on) rely on understanding the inner workings of the data structures.
I'll make these two changes, hopefully we can merge later today.
detail::update_paths<vertex_t>{paths.data(), invalid_vertex});
}

return std::make_tuple(std::move(paths), max_path_length);
Just want to say that this code looks beautiful :-) The MG part is just slightly more complex than the SG path.
It does look nice. Creating the collect_values_for_vertices function really cleaned this up well.
@gpucibot merge
Partially addresses #1753
Creates a new C++ function to extract BFS paths.
This PR includes the SG implementation and the SG unit tests. A separate PR will provide the MG unit test.