
Define and Implement C++ API for negative sampling #4523

Merged: 25 commits merged into rapidsai:branch-24.10 on Aug 21, 2024

Conversation


@ChuckHastings ChuckHastings commented Jul 5, 2024

Defines and implements the PLC/C/C++ APIs for negative sampling.

Closes #4497

@ChuckHastings ChuckHastings self-assigned this Jul 5, 2024
@ChuckHastings ChuckHastings added the improvement (Improvement / enhancement to an existing function) and non-breaking (Non-breaking change) labels Jul 5, 2024
@github-actions github-actions bot added the CMake label Jul 17, 2024
@ChuckHastings ChuckHastings changed the title Define C++ API for negative sampling Define and Implement C++ API for negative sampling Jul 17, 2024
@ChuckHastings ChuckHastings marked this pull request as ready for review July 23, 2024 21:30
@ChuckHastings ChuckHastings requested review from a team as code owners July 23, 2024 21:30
alexbarghi-nv (Member) left a comment:

👍

@ChuckHastings ChuckHastings modified the milestones: 24.08, 24.10 Jul 30, 2024
@ChuckHastings ChuckHastings requested review from a team as code owners August 5, 2024 20:52
@ChuckHastings ChuckHastings changed the base branch from branch-24.08 to branch-24.10 August 5, 2024 20:53
@ChuckHastings ChuckHastings removed request for a team and jameslamb August 5, 2024 20:54
Comment on lines 73 to 74:

```cpp
raft::device_span<value_t> output,
raft::device_span<bias_t const> biases);
```
Contributor:
Something very minor, but should we place input vectors (e.g. biases) before output vectors (e.g. output)? AFAIK, that is a convention in the C++ API.

Collaborator Author:

I may delete this function; I implemented the functionality a different way. After I've refactored the implementation I'll revisit whether we need this function or not. If we keep it, I'll make that change.

```cpp
 * Sampling occurs by creating a list of source vertex ids from biased sampling
 * of the source vertex space, and destination vertex ids from biased sampling of the
 * destination vertex space, and using this as the putative list of edges. We
 * then can optionally remove duplicates and remove false negatives to generate
```
Contributor:
Is false negative the right terminology here? AFAIK, a false negative is something that should be reported but is missing. Here, isn't it the opposite (edges that shouldn't appear actually appear)?

Collaborator Author:

Changed to remove_existing_edges

Contributor:

Did you commit the change? I still see false negatives.

Collaborator Author:

Missed the change in the documentation... I'll search the documentation and push a fix once I resolve the testing issue on one of the build configurations.

```cpp
 * vertices will be selected uniformly
 * @param remove_duplicates If true, remove duplicate samples
 * @param remove_false_negatives If true, remove false negatives (samples that are actually edges in
 * the graph)
```
Contributor:

Are these false negatives? Perhaps we should say something like @param remove_positive_samples If true, remove positive samples (edges that exist in the input graph).

False negatives can be misinterpreted as something that should be reported but is missing. I guess here "false negatives" means samples reported as negative that should not be reported (since they are actually positive samples, they are really false positives among the negative samples).

Collaborator Author:

Changed to remove_existing_edges

Comment on lines 795 to 796:

```cpp
std::optional<raft::device_span<weight_t const>> src_bias,
std::optional<raft::device_span<weight_t const>> dst_bias,
```
Contributor:

Sorry for nitpicking, but shouldn't these be src_biases and dst_biases to be consistent with the rest of the C++ API (plural forms for vectors with multiple elements)?

Collaborator Author:

Changed to be plural.

Comment on lines 772 to 775:

```cpp
 * @param src_bias Optional bias for randomly selecting source vertices. If std::nullopt vertices
 * will be selected uniformly
 * @param dst_bias Optional bias for randomly selecting destination vertices. If std::nullopt
 * vertices will be selected uniformly
```
Contributor:

What are the ranges for multi-GPU?

Collaborator Author:

Added a comment on that.

```cpp
graph_view.has_edge(handle,
                    raft::device_span<vertex_t const>{batch_src.data(), batch_src.size()},
                    raft::device_span<vertex_t const>{batch_dst.data(), batch_dst.size()},
                    // do_expensive_check);
```
Contributor:

Disable expensive_check once validated.

```cpp
auto new_end = thrust::remove_if(handle.get_thrust_policy(),
                                 begin_iter,
                                 begin_iter + batch_src.size(),
                                 [] __device__(auto tuple) { return thrust::get<2>(tuple); });
```
Contributor:

You can use a stencil:

https://nvidia.github.io/cccl/thrust/api/function_group__stream__compaction_1ga557e8dd3130229b1a6193b3acc82ee5e.html

```cpp
auto edge_first = thrust::make_zip_iterator(batch_src.begin(), batch_dst.begin());
auto new_end    = thrust::remove_if(handle.get_thrust_policy(),
                                    edge_first,
                                    edge_first + batch_src.size(),
                                    has_edge_flags.begin(),
                                    [] __device__(auto flag) { return flag; });
```

Collaborator Author:

Change made

Comment on lines 482 to 488:

```cpp
thrust::copy(handle.get_thrust_policy(),
             thrust::make_zip_iterator(batch_src.begin(), batch_dst.begin()),
             thrust::make_zip_iterator(batch_src.end(), batch_dst.end()),
             thrust::make_zip_iterator(src.begin(), dst.begin()) + current_end);

auto begin_iter = thrust::make_zip_iterator(src.begin(), dst.begin());
thrust::sort(handle.get_thrust_policy(), begin_iter, begin_iter + src.size());
```
Contributor:

We can use thrust::merge if remove_duplicates is true. If I am not mistaken, both (src, dst) and (batch_src, batch_dst) are sorted.

If remove_duplicates is false, we don't need to sort, right?

Collaborator Author:

Switched to merge.

Comment on lines 493 to 496
if (!remove_duplicates) {
auto begin_iter = thrust::make_zip_iterator(src.begin(), dst.begin());
thrust::sort(handle.get_thrust_policy(), begin_iter, begin_iter + src.size());
}
Contributor:

Why should we sort here?

Collaborator Author:

Deleted

```cpp
handle.get_thrust_policy(),
dst_bias_cache_->view().value_first(),
dst_bias_cache_->view().value_first() + graph_view.local_edge_partition_dst_range_size(),
weight_t{0});
```
seunghwak (Contributor) commented Aug 8, 2024:

This won't work.

edge_src|dst_property_t stores edge src|dst property values either in a linear array or as (key, value) pairs.

This works only if the property values are stored in a linear array, but with that approach the memory footprint scales as O(V/sqrt(P)), so it won't scale for large graphs with a relatively low average vertex degree. Using (key, value) pairs keeps the memory footprint at min(O(V/sqrt(P)), O(E/P)).

I am also uncomfortable exploiting these implementation details outside the primitives. If we are only visiting edges that exist in the input graph, we can store (key, value) pairs, but if we need to cover the entire src/dst range, edge_property_t is not the right data structure.

One robust (but somewhat communication-wasteful) approach is this.

We locally sum src biases and dst biases in each rank, then call host_scalar_allgather.

We then generate the edge list in three steps.

  1. In each rank, find how many sources to generate in each V/P segment (from the host_scalar_allgather output of the local src bias sums). Then call host_scalar_all_to_all (host_scalar_multicast). Based on this, each rank generates sources for its local vertex partition range. Then, send the source values back.
  2. Do the same for destinations.
  3. Shuffle edges.

A key drawback is that we need to send sources & destinations back and shuffle the edges as well. More communication than necessary, but no memory footprint issue, and it only minimally exploits cugraph implementation details.

A more efficient approach is to create a P by sqrt(P) (or minor_comm_size, to be exact) array (this requires understanding how the adjacency matrix is partitioned). Just by collecting local bias sums, we can compute how many edges should be generated in each box and send these numbers to the ranks owning the boxes.

When we generate sources, we can do so in minor_comm_size iterations; in each iteration we hold source biases only for the V/P vertices. When we generate destination vertices, we first compute how many to generate per V/P range in each box, sum them, send them to the owning GPUs (or, in loops, bring over the V/P dst bias values, whichever incurs less communication), and get the destination vertices back. This incurs less communication but is more involved.

I am inclined to implement the first approach, and if that turns out to be a performance bottleneck, we may consider implementing the second approach, maybe in the primitive space.... Let me know what you think.

Collaborator Author:

I'll revisit this.

My first attempt at refactoring did something akin to your first suggestion. I ran into some problems in debugging and then the current approach occurred to me. My second attempt was something akin to your second approach, but then I ran into the issue that the source edge partitions are separate. This was my third attempt and works [as long as the internal implementation is an array and not a k-v implementation... I guess I didn't have a test that reached that point].

I agree that doing this properly and efficiently might need some work in primitive space.

Collaborator Author:

Refactored in latest push to use option 1. Note that's only required for biased... if either src or dst is uniform that can be done directly.

I suspect the original problem I had with that approach was actually the raft bug that I later worked around. Code is much simpler this way, and I think the performance should be fine - at least for now.

```cpp
std::optional<rmm::device_uvector<weight_t>>>
normalize_biases(raft::handle_t const& handle,
                 graph_view_t<vertex_t, edge_t, store_transposed, multi_gpu> const& graph_view,
                 std::optional<raft::device_span<weight_t const>> biases)
```
Contributor:

Something minor, but should we call this function only when biases is valid (instead of taking an optional biases parameter)?

Collaborator Author:

I will adjust. This is residual from when we were doing the 2D partitioning... in that case we needed to do some of the setup (what became this function) even if it was uniform biases. That's no longer required... so I will fix.

Comment on lines 86 to 87:

```cpp
gpu_biases = cugraph::device_allgatherv(
  handle, handle.get_comms(), raft::device_span<weight_t const>{d_sum.data(), d_sum.size()});
```
Contributor:

I assume we don't need both the above host_scalar_allreduce and device_allgatherv here.

If we just call allgatherv on every GPU's local sum, we can compute aggregate_sum from there. With every GPU's local sum and the aggregate sum, we can compute everything without the additional host_scalar_allreduce.

Collaborator Author:

Cleaned up in next push.

```cpp
thrust::inclusive_scan(
  handle.get_thrust_policy(), gpu_biases->begin(), gpu_biases->end(), gpu_biases->begin());

weight_t force_to_one{1.1};
```
Contributor:

1.1 => 1?


```cpp
weight_t force_to_one{1.1};
raft::update_device(
  gpu_biases->data() + gpu_biases->size() - 1, &force_to_one, 1, handle.get_stream());
```
Contributor:

```cpp
gpu_biases->set_element_async(gpu_biases->size() - 1, force_to_one, handle.get_stream());
handle.sync_stream();  // necessary, as force_to_one can go out of scope before this update finishes
```

Contributor:

Or if we really want to ensure that no value is larger than 1, we may call thrust::transform on the entire gpu_biases and set each value to cuda::std::min(org_value, 1.0).

I am not sure floating-point arithmetic guarantees that an inclusive scan over local_sum / aggregate_sum values never exceeds 1 (and if the local_sum on the last GPU is zero, a non-last element may have a value larger than 1 as well).

Collaborator Author:

The problem is the bug in raft's float random number generator, which can generate a random value of exactly 1.0. When I do the thrust::upper_bound below, if the random number is 1.0 then the upper_bound call goes out of bounds on the gpu_counts array. Forcing this to a value larger than 1.0 lets us handle that condition.

But you are correct: what I want is to set all values at the end of the array (where the counts are 0) to this value. I will ponder a bit.

Collaborator Author:

Refactored to support the case where some GPUs might have no edges.

Comment on lines 283 to 284:

```cpp
size_t comm_size = handle.get_comms().get_size();
size_t comm_rank = handle.get_comms().get_rank();
```
Contributor:

size_t => auto const (comm_size & comm_rank are actually int values following MPI convention; I guess we won't need to consider more than 2^31-1 GPUs in our lifetime).

Collaborator Author:

Fixed

Comment on lines 126 to 129
auto& major_comm = handle.get_subcomm(cugraph::partition_manager::major_comm_name());
auto const major_comm_size = major_comm.get_size();
auto& minor_comm = handle.get_subcomm(cugraph::partition_manager::minor_comm_name());
auto const minor_comm_size = minor_comm.get_size();
Contributor:

Unused?

Collaborator Author:

Removed

```cpp
auto all_gpu_counts = cugraph::device_allgatherv(
  handle,
  handle.get_comms(),
  raft::device_span<size_t const>{gpu_counts.data(), gpu_counts.size()});
```
Contributor:

You only need to collect sample counts for comm_size GPUs (no need to collect the entire comm_size * comm_size matrix). Use device_multicast_sendrecv to send to comm_size GPUs and receive from comm_size GPUs.

Contributor:

Then, the code below will also become simpler.

Collaborator Author:

Refactored to use shuffle_values, which is essentially this.

jnke2016 (Contributor):

I re-reviewed the refactored PR and approve it.

ChuckHastings (Collaborator Author) left a comment:

Latest push addresses the latest batch of requests. Please review again.


seunghwak (Contributor) left a comment:

LGTM besides minor cosmetic issues and the need to test with edge-masked graphs (to make sure every algorithm works properly when edges are masked).

```cpp
raft::handle_t const& handle,
raft::random::RngState& rng_state,
graph_view_t<vertex_t, edge_t, store_transposed, multi_gpu> const& graph_view,
size_t num_samples,
```
Contributor:

Our convention, more or less, is to list scalar parameters at the end. We should probably move num_samples after dst_biases.

Collaborator Author:

Moved in next push

```cpp
normalized_biases->begin());

if constexpr (multi_gpu) {
  // rmm::device_scalar<weight_t> d_sum((sum / aggregate_sum), handle.get_stream());
```
Contributor:

Delete commented-out code.

Collaborator Author:

Deleted in next push

```cpp
bool use_dst_bias{false};
bool remove_duplicates{false};
bool remove_existing_edges{false};
bool exact_number_of_samples{false};
```
Contributor:

We should test with edge-masked graphs as well.

Collaborator Author:

What would the behavior be for an edge-masked graph? That if an edge is masked out, it would be allowed to appear in the negative samples even if remove_existing_edges is set to true? I can add tests for that.

Contributor:

Yes. If you call has_edge() on masked-out edges, you will get false, so those edges can be included in the negative sampling output.

Collaborator Author:

Done in next push

```cpp
bool remove_duplicates{false};
bool remove_existing_edges{false};
bool exact_number_of_samples{false};
bool check_correctness{true};
```
Contributor:

We should test with edge-masked graphs as well.

Collaborator Author:

Done in next push

ChuckHastings (Collaborator Author):

/merge

@rapids-bot rapids-bot bot merged commit 97d1641 into rapidsai:branch-24.10 Aug 21, 2024
109 checks passed
Labels: CMake, cuGraph, improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change), python

Successfully merging this pull request may close these issues: Implement negative sampling in cugraph-core

4 participants