
Define and Implement C++ API for negative sampling #4523

Merged: 25 commits merged into rapidsai:branch-24.10 on Aug 21, 2024

Conversation


@ChuckHastings ChuckHastings commented Jul 5, 2024

Defines and implements the PLC/C/C++ APIs for negative sampling.

Closes #4497

@ChuckHastings ChuckHastings self-assigned this Jul 5, 2024
@ChuckHastings ChuckHastings added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jul 5, 2024
@github-actions github-actions bot added the CMake label Jul 17, 2024
@ChuckHastings ChuckHastings changed the title Define C++ API for negative sampling Define and Implement C++ API for negative sampling Jul 17, 2024
@ChuckHastings ChuckHastings marked this pull request as ready for review July 23, 2024 21:30
@ChuckHastings ChuckHastings requested review from a team as code owners July 23, 2024 21:30
alexbarghi-nv (Member) left a comment:

👍

@ChuckHastings ChuckHastings modified the milestones: 24.08, 24.10 Jul 30, 2024
@ChuckHastings ChuckHastings requested review from a team as code owners August 5, 2024 20:52
@ChuckHastings ChuckHastings changed the base branch from branch-24.08 to branch-24.10 August 5, 2024 20:53
@ChuckHastings ChuckHastings removed request for a team and jameslamb August 5, 2024 20:54
Comment on lines 73 to 74:

```cpp
raft::device_span<value_t> output,
raft::device_span<bias_t const> biases);
```
Contributor:
Something very minor, but should we place input vectors (e.g. biases) before output vectors (e.g. output)? AFAIK, that is a convention in the C++ API.

Collaborator Author:

I may delete this function; I implemented the functionality a different way. After I've refactored the implementation I'll revisit whether we need this function or not. If we keep it, I'll make that change.

```cpp
 * Sampling occurs by creating a list of source vertex ids from biased sampling
 * of the source vertex space, and destination vertex ids from biased sampling of the
 * destination vertex space, and using this as the putative list of edges. We
 * then can optionally remove duplicates and remove false negatives to generate
```
Contributor:
Is false negative the right terminology here? AFAIK, a false negative is something that should be reported but is missing. Here, isn't it the opposite (edges that shouldn't appear actually appear)?

Collaborator Author:

Changed to remove_existing_edges

Contributor:

Did you commit the change? I still see false negatives.

Collaborator Author:

Missed the change in the documentation... I'll search the documentation and push a fix once I resolve the testing issue on one of the build configurations.

```cpp
 * vertices will be selected uniformly
 * @param remove_duplicates If true, remove duplicate samples
 * @param remove_false_negatives If true, remove false negatives (samples that are actually edges in
 * the graph)
```
Contributor:

Are these false negatives? Perhaps we should say something like @param remove_positive_samples If true, remove positive samples (edges that exist in the input graph).

False negatives can be misinterpreted as something that should be reported but is missing. I guess here "false negatives" means samples reported as negative that should not be reported (since they are actually positive samples, they are really false positives among the negative samples).

Collaborator Author:

Changed to remove_existing_edges

Comment on lines 795 to 796:

```cpp
std::optional<raft::device_span<weight_t const>> src_bias,
std::optional<raft::device_span<weight_t const>> dst_bias,
```
Contributor:

Sorry for nitpicking, but shouldn't these be src_biases and dst_biases to be consistent with the rest of the C++ API (plural forms for vectors with multiple elements)?

Collaborator Author:

Changed to be plural.

Comment on lines 772 to 775:

```cpp
 * @param src_bias Optional bias for randomly selecting source vertices. If std::nullopt vertices
 * will be selected uniformly
 * @param dst_bias Optional bias for randomly selecting destination vertices. If std::nullopt
 * vertices will be selected uniformly
```
Contributor:

What are the ranges for multi-GPU?

Collaborator Author:

Added a comment on that.

```cpp
graph_view.has_edge(handle,
                    raft::device_span<vertex_t const>{batch_src.data(), batch_src.size()},
                    raft::device_span<vertex_t const>{batch_dst.data(), batch_dst.size()},
                    // do_expensive_check);
```
Contributor:

Disable expensive_check once validated.

```cpp
auto new_end = thrust::remove_if(handle.get_thrust_policy(),
                                 begin_iter,
                                 begin_iter + batch_src.size(),
                                 [] __device__(auto tuple) { return thrust::get<2>(tuple); });
```
Contributor:

You can use a stencil:

https://nvidia.github.io/cccl/thrust/api/function_group__stream__compaction_1ga557e8dd3130229b1a6193b3acc82ee5e.html

```cpp
auto edge_first = thrust::make_zip_iterator(batch_src.begin(), batch_dst.begin());
auto new_end    = thrust::remove_if(handle.get_thrust_policy(),
                                    edge_first,
                                    edge_first + batch_src.size(),
                                    has_edge_flags.begin(),
                                    [] __device__(auto flag) { return flag; });
```

Collaborator Author:

Change made

Comment on lines 482 to 488:

```cpp
thrust::copy(handle.get_thrust_policy(),
             thrust::make_zip_iterator(batch_src.begin(), batch_dst.begin()),
             thrust::make_zip_iterator(batch_src.end(), batch_dst.end()),
             thrust::make_zip_iterator(src.begin(), dst.begin()) + current_end);

auto begin_iter = thrust::make_zip_iterator(src.begin(), dst.begin());
thrust::sort(handle.get_thrust_policy(), begin_iter, begin_iter + src.size());
```
Contributor:

We can use thrust::merge if remove_duplicates is true. If I am not mistaken, both (src, dst) and (batch_src, batch_dst) are sorted.

If remove_duplicates is false, we don't need to sort, right?

Collaborator Author:

Switched to merge.

Comment on lines 493 to 496
if (!remove_duplicates) {
auto begin_iter = thrust::make_zip_iterator(src.begin(), dst.begin());
thrust::sort(handle.get_thrust_policy(), begin_iter, begin_iter + src.size());
}
Contributor:

Why should we sort here?

Collaborator Author:

Deleted

```cpp
handle.get_thrust_policy(),
dst_bias_cache_->view().value_first(),
dst_bias_cache_->view().value_first() + graph_view.local_edge_partition_dst_range_size(),
weight_t{0});
```
seunghwak (Contributor) commented Aug 8, 2024:

This won't work.

edge_src|dst_property_t stores edge src|dst property values either in a linear array or as (key, value) pairs.

This works only if the property values are stored in a linear array, but with that approach the memory footprint scales as O(V/sqrt(P)), so it won't scale for large graphs with a relatively low average vertex degree. Using (key, value) pairs keeps the memory footprint at min(O(V/sqrt(P)), O(E/P)).

I am also uncomfortable exploiting these implementation details outside the primitives. If we are only visiting edges that exist in the input graph, we can store (key, value) pairs, but if we need to cover the entire src/dst range, edge_property_t is not the right data structure.

One robust (but somewhat communication-wasteful) approach is this.

We locally sum src biases and dst biases in each rank, then call host_scalar_allgather.

We then generate the edge list in three steps.

  1. In each rank, find how many sources to generate in each V/P segment (from the host_scalar_allgather output of the local src bias sums). Then call host_scalar_all_to_all (host_scalar_multicast). Based on this, each rank generates sources for its local vertex partition range. Then, send the source values back.
  2. Do the same for destinations.
  3. Shuffle edges.

A key drawback is that we need to send sources & destinations back and shuffle the edges as well. More communication than necessary, but no memory footprint issue, and it only minimally exploits cugraph implementation details.

A more efficient approach is to create a P by sqrt(P) (or minor_comm_size, to be exact) array (this requires understanding how the adjacency matrix is partitioned). Just by collecting local bias sums, we can compute how many edges should be generated in each box and send these numbers to the ranks owning the boxes.

When we generate sources, we can do so in minor_comm_size iterations; in each iteration we hold source biases only for the V/P vertices. When we generate destination vertices, we first compute how many to generate per V/P range in each box, sum them, send them to the owning GPUs (or, in loops, bring over the V/P dst bias values, whichever incurs less communication), and get the destination vertices back. This incurs less communication but is more involved.

I am inclined to implement the first approach, and if that turns out to be a performance bottleneck, we may consider implementing the second approach, maybe in the primitive space.... Let me know what you think.

Collaborator Author:

I'll revisit this.

My first attempt at refactoring did something akin to your first suggestion. I ran into some problems in debugging and then the current approach occurred to me. My second attempt was something akin to your second approach, but then I ran into the issue that the source edge partitions are separate. This was my third attempt and works [as long as the internal implementation is an array and not a k-v implementation... I guess I didn't have a test that reached that point].

I agree that doing this properly and efficiently might need some work in primitive space.

Collaborator Author:

Refactored in latest push to use option 1. Note that's only required for biased... if either src or dst is uniform that can be done directly.

I suspect the original problem I had with that approach was actually the raft bug that I later worked around. Code is much simpler this way, and I think the performance should be fine - at least for now.

```cpp
std::optional<rmm::device_uvector<weight_t>>>
normalize_biases(raft::handle_t const& handle,
                 graph_view_t<vertex_t, edge_t, store_transposed, multi_gpu> const& graph_view,
                 std::optional<raft::device_span<weight_t const>> biases)
```
Contributor:

Something minor, but should we call this function only when biases is valid (instead of taking an optional biases parameter)?

Collaborator Author:

I will adjust. This is residual from when we were doing the 2D partitioning... in that case we needed to do some of the setup (what became this function) even if it was uniform biases. That's no longer required... so I will fix.

Comment on lines 86 to 87:

```cpp
gpu_biases = cugraph::device_allgatherv(
  handle, handle.get_comms(), raft::device_span<weight_t const>{d_sum.data(), d_sum.size()});
```
Contributor:

I assume we don't need both the above host_scalar_allreduce and device_allgatherv here.

If we just call allgatherv on every GPU's local sum, we can compute aggregate_sum from there. With every GPU's local sum and the aggregate sum, we can compute everything without the additional host_scalar_allreduce.

Collaborator Author:

Cleaned up in next push.

```cpp
thrust::inclusive_scan(
  handle.get_thrust_policy(), gpu_biases->begin(), gpu_biases->end(), gpu_biases->begin());

weight_t force_to_one{1.1};
```
Contributor:

1.1 => 1?


```cpp
weight_t force_to_one{1.1};
raft::update_device(
  gpu_biases->data() + gpu_biases->size() - 1, &force_to_one, 1, handle.get_stream());
```
Contributor:

```cpp
gpu_biases->set_element_async(gpu_biases->size() - 1, force_to_one, handle.get_stream());
handle.sync_stream();  // necessary, as force_to_one can go out of scope before this update finishes
```

Contributor:

Or if we really want to ensure that no value is larger than 1, we may call thrust::transform on the entire gpu_biases and set each value to cuda::std::min(org_value, 1.0).

I am not sure floating-point arithmetic guarantees that an inclusive scan over local_sum / aggregate_sum values never exceeds 1 (and if the local_sum on the last GPU is zero, a non-last element may have a value larger than 1 as well).

Collaborator Author:

The problem is the bug in raft's float random number generator, which can generate a random value of exactly 1.0. When I do the thrust::upper_bound below, if the random number is 1.0 then the upper_bound call goes out of bounds on the gpu_counts array. Forcing this to a value larger than 1.0 lets us handle that condition.

But you are correct: what I want is to set all values at the end of the array (where the counts are 0) to this value. I will ponder a bit.

Collaborator Author:

Refactored to support the case where some GPUs might have no edges.

Comment on lines 283 to 284:

```cpp
size_t comm_size = handle.get_comms().get_size();
size_t comm_rank = handle.get_comms().get_rank();
```
Contributor:

size_t => auto const (comm_size & comm_rank are actually int values following MPI convention; I guess we won't need to consider more than 2^31-1 GPUs in our lifetime).

Collaborator Author:

Fixed

Comment on lines 126 to 129
auto& major_comm = handle.get_subcomm(cugraph::partition_manager::major_comm_name());
auto const major_comm_size = major_comm.get_size();
auto& minor_comm = handle.get_subcomm(cugraph::partition_manager::minor_comm_name());
auto const minor_comm_size = minor_comm.get_size();
Contributor:

Unused?

Collaborator Author:

Removed

```cpp
auto all_gpu_counts = cugraph::device_allgatherv(
  handle,
  handle.get_comms(),
  raft::device_span<size_t const>{gpu_counts.data(), gpu_counts.size()});
```
Contributor:

You only need to collect sample counts for comm_size GPUs (no need to collect the entire comm_size * comm_size matrix). Use device_multicast_sendrecv to send to comm_size GPUs and receive from comm_size GPUs.

Contributor:

Then, the code below will also become simpler.

Collaborator Author:

Refactored to use shuffle_values, which is essentially this.

jnke2016 (Contributor):

I re-reviewed the refactored PR and approve it.

ChuckHastings (Collaborator Author) left a comment:

Latest push addresses the latest batch of requests. Please review again.


seunghwak (Contributor) left a comment:

LGTM besides minor cosmetic issues and the need to test with edge-masked graphs (to make sure every algorithm works properly when edges are masked).

```cpp
raft::handle_t const& handle,
raft::random::RngState& rng_state,
graph_view_t<vertex_t, edge_t, store_transposed, multi_gpu> const& graph_view,
size_t num_samples,
```
Contributor:

Our convention, more or less, is to list scalar parameters at the end. We should probably move num_samples after dst_biases.

Collaborator Author:

Moved in next push

```cpp
normalized_biases->begin());

if constexpr (multi_gpu) {
  // rmm::device_scalar<weight_t> d_sum((sum / aggregate_sum), handle.get_stream());
```
Contributor:

Delete commented-out code.

Collaborator Author:

Deleted in next push

```cpp
bool use_dst_bias{false};
bool remove_duplicates{false};
bool remove_existing_edges{false};
bool exact_number_of_samples{false};
```
Contributor:

We should test with edge-masked graphs as well.

Collaborator Author:

What would the behavior be for an edge-masked graph? That if an edge is masked out, it would be allowed to appear in the negative samples even if remove_existing_edges is set to true? I can add tests for that.

Contributor:

Yes. If you call has_edge() on masked-out edges, you will get false, so those edges can be included in the negative sampling output.

Collaborator Author:

Done in next push

```cpp
bool remove_duplicates{false};
bool remove_existing_edges{false};
bool exact_number_of_samples{false};
bool check_correctness{true};
```
Contributor:

We should test with edge-masked graphs as well.

Collaborator Author:

Done in next push

ChuckHastings (Collaborator Author):

/merge

@rapids-bot rapids-bot bot merged commit 97d1641 into rapidsai:branch-24.10 Aug 21, 2024
109 checks passed
Labels: CMake, cuGraph, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python

Successfully merging this pull request may close these issues: Implement negative sampling in cugraph-core

4 participants