
Extract BFS paths SG implementation #1838

Merged
merged 17 commits into branch-21.12 on Oct 14, 2021

Conversation

@ChuckHastings (Collaborator) commented Sep 23, 2021:

Partially addresses #1753

Creates a new C++ function to extract BFS paths.

This PR includes the SG implementation and the SG unit tests. A separate PR will provide the MG unit test.

@ChuckHastings ChuckHastings requested review from a team as code owners September 23, 2021 01:59
@ChuckHastings ChuckHastings self-assigned this Sep 23, 2021
@ChuckHastings ChuckHastings added 2 - In Progress improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Sep 23, 2021
@ChuckHastings ChuckHastings added this to the 21.10 milestone Sep 23, 2021
@ChuckHastings ChuckHastings changed the title from "[WIP] [skip-ci] Extract BFS paths implementation" to "Extract BFS paths implementation" Sep 28, 2021
@ChuckHastings ChuckHastings changed the title from "Extract BFS paths implementation" to "Extract BFS paths SG implementation" Sep 28, 2021
Contributor @seunghwak left a comment:

Review part 1 (I am still reading extract_bfs_paths_impl.cuh).

cpp/include/cugraph/algorithms.hpp (outdated; resolved)
vertex_t const* distances_;
vertex_t const* destinations_;

size_t __device__ operator()(vertex_t idx)
Contributor:

shouldn't this be size_t idx?

Collaborator Author:

Fixed in next push

@@ -310,67 +310,73 @@ ConfigureTest(WCC_TEST components/wcc_test.cpp)
ConfigureTest(HUNGARIAN_TEST linear_assignment/hungarian_test.cu)

###################################################################################################
# - MST tests ----------------------------------------------------------------------------
# - MST tests -------------------------------------------------------------------------------------
Contributor:

Thanks for fixing these :-)

vertex_t const* distances_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)
Contributor:

What's the third tuple element (type int)?

Collaborator Author:

The original rank. Each rank has to retrieve results from another rank. The current implementation is:

  1. Construct the tuple (current_vertex, offset_in_current_frontier_array, my_rank)
  2. Shuffle the tuples based on which rank contains the predecessor information for current_vertex
  3. Replace current_vertex with previous_vertex
  4. Shuffle the tuples back to the original rank where the data is required
  5. Update current_frontier in the proper location with the predecessor value

I could achieve the same result without the int element by writing a more sophisticated communication pattern (sort the vertices by the rank they are assigned to, do block transfers while keeping track of the offsets for each rank, etc.). I suspect that would be more time- and memory-efficient, but I was trying to reuse existing functions as much as possible to get something working. We can optimize this later.
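The round trip above can be sketched host-side. This is an illustrative simulation only: a toy modulo partitioning, with std::vector standing in for device buffers and the shuffle primitives; names are hypothetical, not the actual cugraph code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using vertex_t = int;

// One in-flight message: the vertex being resolved, plus where it came from.
struct Msg { vertex_t v; std::size_t offset; int rank; };

// Toy partition: vertex v lives on rank v % num_ranks,
// at local index v / num_ranks in that rank's predecessor array.
int owner_rank(vertex_t v, int num_ranks) { return v % num_ranks; }

// One step of the 5-step pattern: replace every frontier vertex with its
// predecessor, fetching the predecessor from whichever rank owns the vertex.
std::vector<std::vector<vertex_t>> step_frontiers(
    std::vector<std::vector<vertex_t>> frontier,           // per-rank frontiers
    std::vector<std::vector<vertex_t>> const& predecessor, // per-rank predecessor arrays
    int num_ranks)
{
  // 1+2: build (current_vertex, offset, my_rank) tuples and "shuffle" each
  // to the rank that owns current_vertex's predecessor entry.
  std::vector<std::vector<Msg>> inbox(num_ranks);
  for (int r = 0; r < num_ranks; ++r)
    for (std::size_t i = 0; i < frontier[r].size(); ++i) {
      vertex_t v = frontier[r][i];
      inbox[owner_rank(v, num_ranks)].push_back({v, i, r});
    }
  // 3: replace current_vertex with its predecessor on the owning rank.
  for (int r = 0; r < num_ranks; ++r)
    for (auto& m : inbox[r]) m.v = predecessor[r][m.v / num_ranks];
  // 4+5: shuffle back using the stored rank, write into the stored offset.
  for (int r = 0; r < num_ranks; ++r)
    for (auto const& m : inbox[r]) frontier[m.rank][m.offset] = m.v;
  return frontier;
}
```

The stored (offset, rank) pair is exactly the int element under discussion: without it, step 4 would not know where each resolved predecessor has to be delivered.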

Contributor:

Ah... OK, so I guess this code can go away if we implement collect_values_for_vertices.

Contributor:

So, is this functor still used? Can't find a place this function is actually used.

vertex_t const* predecessors_;

thrust::tuple<vertex_t, vertex_t, int> __device__
operator()(thrust::tuple<vertex_t, vertex_t, int> const& tuple)
Contributor:

What's the third tuple element (type int)?

Contributor:

So, is this functor still used? Can't find a place this function is actually used.

rmm::device_uvector<vertex_t> const& d_predecessors,
size_t num_paths_to_check,
uint64_t seed)
{
Contributor:

So, if I am not mistaken, this randomly selects num_paths_to_check vertices (that are not sources).

I think we can achieve this by

thrust::copy_if();
thrust::shuffle();
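A host-side sketch of that suggestion, using the std:: equivalents of the thrust calls; the function name and the distance-based "not a source" filter are illustrative assumptions:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <random>
#include <vector>

// Randomly select up to num_paths_to_check non-source vertices:
// copy_if keeps vertices with BFS distance > 0 (sources have distance 0),
// shuffle randomizes the order, and a resize takes the prefix.
std::vector<int> pick_destinations(std::vector<int> const& vertices,
                                   std::vector<int> const& distances,  // indexed by vertex id
                                   std::size_t num_paths_to_check,
                                   unsigned seed)
{
  std::vector<int> candidates;
  std::copy_if(vertices.begin(), vertices.end(), std::back_inserter(candidates),
               [&](int v) { return distances[v] > 0; });  // drop the source(s)
  std::mt19937 gen(seed);
  std::shuffle(candidates.begin(), candidates.end(), gen);
  candidates.resize(std::min(num_paths_to_check, candidates.size()));
  return candidates;
}
```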

Contributor:

And as this is only for tests, another option is to do this on the CPU; then there is no need to add this additional file, and it keeps our codebase size in check.

Collaborator Author:

I can look into shuffle, that's one I haven't used before.

I can certainly look at a CPU implementation. I was imagining that other tests might also benefit from a function to randomly select vertices, although I started with it here because it was easier.

Contributor:

Yeah... so, for personalized PageRank, we're randomly selecting personalization vertices as well (it's currently implemented as host code).

https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/link_analysis/pagerank_test.cpp#L178
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/tests/link_analysis/mg_pagerank_test.cpp#L104

We may better combine these and add something under tests/utilities?

Contributor @seunghwak left a comment:

Review Part 2

cpp/src/traversal/extract_bfs_paths_impl.cuh (outdated; resolved)
cpp/src/traversal/extract_bfs_paths_impl.cuh (outdated; resolved)
* @param graph_view Graph view object.
* @param distances Pointer to the distance array constructed by bfs.
* @param predecessors Pointer to the predecessor array constructed by bfs.
* @param destinations Destination vertices, extract path from source to each of these destinations
Contributor:

So, in multi-GPU, destinations should be local to this GPU, right? Shouldn't we explain this here?

And I think this should be mentioned in

https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/include/cugraph/algorithms.hpp#L1106
https://github.com/rapidsai/cugraph/blob/branch-21.10/cpp/include/cugraph/algorithms.hpp#L1185

as well.

Collaborator Author:

Initially I started with this assumption. However, once I started implementing I realized I had to handle off-GPU vertices anyway, so I dropped that assumption. If you specify on GPU 0 a destination that resides on GPU 1, the implementation will just go to GPU 1 to get the predecessor of that destination.

I agree that should be mentioned in the other places. I will add those in this PR.

Contributor:

I think we may still enforce this for consistency with other APIs (and we can avoid communication for the first step).

Collaborator Author:

Changed this back to this assumption.

  1. Added a check to validate that destinations are on the correct GPU
  2. Eliminated the initial shuffle logic (outside the loop in the implementation)
  3. Updated the documentation (including the other locations)
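The validation in step 1 might look roughly like this on the host (hypothetical names; the actual check would run on the device against the local vertex partition):

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Verify that every destination lies in this GPU's local vertex range
// [local_vertex_first, local_vertex_last); otherwise the caller violated
// the "destinations are local to this GPU" contract.
void expect_local(std::vector<int> const& destinations,
                  int local_vertex_first, int local_vertex_last)
{
  for (int d : destinations)
    if (d < local_vertex_first || d >= local_vertex_last)
      throw std::invalid_argument("destination not local to this GPU");
}
```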

Contributor:

3. Updated the documentation (including the other locations)

I can see updates in other locations, but I can't find documentation updates for extract_bfs_paths? Could you point out the exact line?

Collaborator Author:

Somehow missed this one, updated in next push.

cpp/src/traversal/extract_bfs_paths_impl.cuh (outdated; resolved)
cpp/src/traversal/extract_bfs_paths_impl.cuh (outdated; resolved)
size_t count =
multi_gpu
? cugraph::host_scalar_allreduce(
handle.get_comms(), current_frontier.size(), raft::comms::op_t::MAX, handle.get_stream())
Contributor:

Shouldn't this better be SUM reduction? (to minimize the difference in SG & MG logics)

Collaborator Author:

I think MAX is semantically what I want, although SUM would suffice.

I'm actually thinking I'll change this loop to a for loop over the range [0, max_path_length), since we already know how many iterations are required.

Contributor:

I'm actually thinking I'll change this loop to a for loop over the range [0, max_path_length), since we already know how many iterations are required.

+1 for this, this will be clearer.

Collaborator Author:

Changed to a for loop, definitely much clearer and saves the allreduce cost.
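For illustration, a host-side SG sketch of the restructured loop. The row-major paths layout, the front-padding with invalid_vertex, and all names here are assumptions for the sketch, not the actual implementation:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Walk predecessors backward from each destination. max_path_length is known
// up front (max BFS distance among destinations, plus one for the vertex
// itself), so a plain for loop suffices: no while condition, no allreduce.
std::vector<int> extract_paths(std::vector<int> const& predecessors,
                               std::vector<int> const& distances,
                               std::vector<int> const& destinations,
                               int invalid_vertex)
{
  int max_path_length = 0;
  for (int d : destinations)
    max_path_length = std::max(max_path_length, distances[d] + 1);

  // Row-major: one row of length max_path_length per destination,
  // front-padded with invalid_vertex for shorter paths.
  std::vector<int> paths(destinations.size() * max_path_length, invalid_vertex);
  std::vector<int> frontier = destinations;

  for (int step = 0; step < max_path_length; ++step) {  // known iteration count
    for (std::size_t i = 0; i < frontier.size(); ++i) {
      int v = frontier[i];
      if (v == invalid_vertex) continue;  // this path already reached its source
      paths[i * max_path_length + (max_path_length - 1 - step)] = v;
      frontier[i] = (distances[v] == 0) ? invalid_vertex : predecessors[v];
    }
  }
  return paths;
}
```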

cpp/src/traversal/extract_bfs_paths_impl.cuh (outdated; resolved)

count = multi_gpu ? cugraph::host_scalar_allreduce(handle.get_comms(),
current_frontier.size(),
raft::comms::op_t::MAX,
Contributor:

Shouldn't this better be SUM reduction? (to minimize the difference in SG & MG logics)

Collaborator Author:

Same as above.

Collaborator Author:

Same fix as above

@codecov-commenter commented Sep 28, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.12@50158c6).
The diff coverage is n/a.


@@               Coverage Diff               @@
##             branch-21.12    #1838   +/-   ##
===============================================
  Coverage                ?   70.11%           
===============================================
  Files                   ?      143           
  Lines                   ?     8827           
  Branches                ?        0           
===============================================
  Hits                    ?     6189           
  Misses                  ?     2638           
  Partials                ?        0           

Legend: ? = missing data. Last update 50158c6...ce12426.

@ChuckHastings ChuckHastings modified the milestones: 21.10, 21.12 Sep 28, 2021
@ChuckHastings ChuckHastings changed the base branch from branch-21.10 to branch-21.12 September 28, 2021 22:31
@ChuckHastings (Collaborator Author):

Pushing out to 21.12 to incorporate more of the discussed changes.

@BradReesWork BradReesWork linked an issue Sep 29, 2021 that may be closed by this pull request

@@ -259,4 +259,74 @@ collect_values_for_unique_keys(raft::comms::comms_t const& comm,
return value_buffer;
}

template <typename VertexIterator,
typename ValueIterator,
typename KeyToGPUIdOp,
Contributor:

I think we'd better rename this to VertexToGPUIdOp to clarify that this function behavior is different from collect_values_for_keys.

Collaborator Author:

Fixed in next push

VertexIterator map_vertex_first,
VertexIterator map_vertex_last,
ValueIterator map_value_first,
KeyToGPUIdOp key_to_gpu_id_op,
Contributor:

Better be VertexToGPUIdOp vertex_to_gpu_id_op?

Collaborator Author:

Fixed in next push

Collaborator Author:

Changed to vertex_partition_lasts


raft::update_host(h_gpu_counts.data(), gpu_counts.data(), gpu_counts.size(), stream_view);

std::tie(shuffled_vertices, shuffled_counts) =
Contributor:

auto [shuffled_vertices, shuffled_counts] = ... (so, no need to explicitly define shuffled_vertices & shuffled_counts beforehand)

Contributor:

And actually, input_vertices here will not be used after this line, so better to invent a name that covers both "shuffled_vertices" and "input_vertices" and do something like std::tie(new_name, ...) = shuffle_values(..., std::move(new_name), ...); (so if this function becomes a memory bottleneck, we can free the memory held by input_vertices).

Collaborator Author:

input_vertices is used at the end of the function as the keys for a lookup.

using vertex_t = typename std::iterator_traits<VertexIterator>::value_type;
using value_t = typename std::iterator_traits<ValueIterator>::value_type;

size_t input_size = thrust::distance(map_vertex_first, map_vertex_last);
Contributor:

If I am not mistaken, [map_vertex_first, map_vertex_last) are vertices we're querying values (not the vertices in local GPU).

In the collect_values_for_keys function,

                        VertexIterator0 map_key_first,
                        VertexIterator0 map_key_last,
                        ValueIterator map_value_first,

store (key, value) pairs local to this GPU and

                       VertexIterator1 collect_key_first,
                       VertexIterator1 collect_key_last,

are keys we want to query values for. So, I think map_key_first & last here are inconsistent (and misnomers).

collect_values_for_vertices(raft::comms::comms_t const& comm,
                            VertexPartitionDeviceViewType vertex_partition_device_view (or local_vertex_first, local_vertex_last to better mirror the above collect_values_for_keys function),
                            ValueIterator map_value_first,
                            VertexToGPUIdOp vertex_to_gpu_id_op,
                            VertexIterator1 collect_unique_vertex_first,
                            VertexIterator1 collect_unique_vertex_last,
                            ...

Or even better as vertex to GPU ID mapping is pretty much pre-determined by partitioning, we may take std::vector<vertex_t> const& vertex_partition_lasts instead of vertex_to_gpu_id_op (then no need to take vertex_partition_device_view).

collect_values_for_vertices(raft::comms::comms_t const& comm,
                            ValueIterator local_value_first,
                            VertexIterator1 collect_unique_vertex_first, (or sorted_unique_vertex_first, if we want to ask caller to sort if not already sorted)
                            VertexIterator1 collect_unique_vertex_last,
                            std::vector<vertex_t> const& vertex_partition_lasts, (size == # GPUs),
                            ...

Collaborator Author:

Restructured in the push I just made considering some of these ideas.

Comment on lines 286 to 290
thrust::sort(rmm::exec_policy(stream_view), input_vertices.begin(), input_vertices.end());

// 1: Shuffle vertices to the correct node
auto gpu_counts = groupby_and_count(
input_vertices.begin(), input_vertices.end(), key_to_gpu_id_op, comm.get_size(), stream_view);
Contributor:

So, this is an optimization issue, and we may not need to address it right now (having at least a FIXME statement would be nice), but there are a few issues to think about.

In some cases, users may pass an already sorted vertex list to query values for, so this sort can be redundant. And for sorted input_vertices, we can just run thrust::lower_bound for # GPUs values instead of visiting and group-by-ing every element in input_vertices (and if we're running groupby_and_count, I think the sort above is unnecessary; please double-check). See the code around https://github.com/rapidsai/cugraph/blob/branch-21.12/cpp/src/structure/renumber_utils_impl.cuh#L258 for reference.
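The suggested lower_bound approach can be sketched on the host, with std::lower_bound standing in for thrust::lower_bound (names illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// For a sorted vertex list, per-GPU send counts fall out of # GPUs binary
// searches on the partition boundaries, instead of a groupby pass that
// visits every element. vertex_partition_lasts[g] is one past the largest
// vertex id owned by GPU g.
std::vector<std::size_t> gpu_counts(std::vector<int> const& sorted_vertices,
                                    std::vector<int> const& vertex_partition_lasts)
{
  std::vector<std::size_t> counts(vertex_partition_lasts.size());
  auto first = sorted_vertices.begin();
  for (std::size_t g = 0; g < vertex_partition_lasts.size(); ++g) {
    auto last = std::lower_bound(first, sorted_vertices.end(), vertex_partition_lasts[g]);
    counts[g] = static_cast<std::size_t>(last - first);  // vertices bound for GPU g
    first = last;
  }
  return counts;
}
```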

Collaborator Author:

Restructured the code in the next push to accommodate this concern and some of the other comments above.

Collaborator Author:

The sort here is necessary to use the keys for the lookup at the end of this function.

I added a call to thrust::unique, since there's no point in a particular GPU asking for the same key more than once in a single call.
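Host sketch of that sort + unique step, with std:: algorithms standing in for the thrust versions:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sort the query keys (required for the binary-search lookup at the end of
// the function) and deduplicate them, so a GPU never requests the same key
// more than once in a single call.
void sort_and_dedupe(std::vector<int>& vertices)
{
  std::sort(vertices.begin(), vertices.end());
  vertices.erase(std::unique(vertices.begin(), vertices.end()), vertices.end());
}
```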



rmm::device_uvector<vertex_t> d_vertex_partition_offsets(0, handle.get_stream());

if constexpr (multi_gpu) {
// FIXME: There should be a better way to do this
Contributor:

Yeah... I think we'd better add a member function to graph_view_t that returns vertex_partition_lasts as a std::vector<vertex_t> (size == # GPUs).

Collaborator Author:

Added the function to graph_view_t in next push

vertex_t number_of_vertices,
vertex_t local_vertex_first,
rmm::device_uvector<vertex_t> const& d_predecessors,
size_t num_paths_to_check)
Contributor:

Just an issue to think about.

So, for testing, should we use CUDA code for this kind of input-preparation function (adding additional files, since we cannot use thrust from .cpp files), or should we just use a slower host function? As this is only for testing, performance for this part may not matter much unless it becomes a bottleneck in large-scale benchmarking, and the host version might help keep our test code simpler.

Collaborator Author:

Definitely worth considering. Not sure how frequent this pattern will be.

Contributor @seunghwak left a comment:

Looks good to me besides a minor complaint.

* @param distances Pointer to the distance array constructed by bfs.
* @param predecessors Pointer to the predecessor array constructed by bfs.
* @param destinations Destination vertices, extract path from source to each of these destinations
* In a multi-gpu context the source vertex should be local to this GPU.
Contributor:

In a multi-gpu context the source vertex should be local to this GPU.
=>
In a multi-gpu context the destination vertex should be local to this GPU.

vertex_t num_vertices,
ValueIterator local_value_first,
std::vector<vertex_t> const& vertex_partition_lasts,
VertexPartitionDeviceViewType vertex_partition_device_view,
Contributor:

Just a minor point, I feel like vertex_partition_device_view is a bit redundant as we can easily compute local_vertex_first from vertex_partition_lasts (0 if comm_rank == 0 and vertex_partition_lasts[comm_rank - 1] otherwise).
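That computation is a one-liner; a minimal host sketch, assuming vertex_partition_lasts holds each rank's one-past-the-end vertex id:

```cpp
#include <cassert>
#include <vector>

// local_vertex_first for a rank is recoverable from vertex_partition_lasts
// alone: 0 for rank 0, otherwise the previous rank's "last" boundary.
int local_vertex_first(std::vector<int> const& vertex_partition_lasts, int comm_rank)
{
  return comm_rank == 0 ? 0 : vertex_partition_lasts[comm_rank - 1];
}
```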

Collaborator Author:

I guess since collect_comm.cuh is part of the primitive functionality it's reasonable to have it know the internal workings of the data structures. I have been trying to avoid having things outside the primitives (most of what I work on) rely on understanding the inner workings of the data structures.

I'll make these two changes, hopefully we can merge later today.

detail::update_paths<vertex_t>{paths.data(), invalid_vertex});
}

return std::make_tuple(std::move(paths), max_path_length);
Contributor:

Just want to say that this code looks beautiful :-) The MG part is just slightly more complex than the SG path.

Collaborator Author:

It does look nice. Creating the collect_values_for_vertices function really cleaned this up well.

@ChuckHastings (Collaborator Author):

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 707fcf7 into rapidsai:branch-21.12 Oct 14, 2021
@ChuckHastings ChuckHastings deleted the fea_1753 branch February 1, 2022 16:30
Labels
improvement Improvement / enhancement to an existing function non-breaking Non-breaking change

Successfully merging this pull request may close these issues.

[ENH] Create function to extract paths from BFS output
4 participants