
Priority Queue #105

Open · wants to merge 56 commits into dev

Conversation

andrewbriand

Adds a GPU-accelerated priority queue

Allows for multiple concurrent insertions as well as multiple concurrent
deletions.

The implementation of the priority queue is based on https://arxiv.org/pdf/1906.06504.pdf.

The queue supports two operations:

  • push: Add elements into the queue
  • pop: Remove the element(s) with the lowest (when Max == false) or highest (when Max == true) keys

The priority queue supports bulk host-side operations and more fine-grained
device-side operations.

The host-side bulk operations push and pop allow an arbitrary number of
elements to be pushed to or popped from the queue.
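For illustration, host-side use might look like the sketch below. The push and pop names come from this description; the element type, constructor signature, and capacity argument are assumptions made only to give a complete example.

// Hedged sketch of the host-side bulk API; constructor and element type are assumed.
#include <cuco/priority_queue.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main()
{
  thrust::device_vector<int> keys(1000);
  thrust::sequence(keys.begin(), keys.end());  // keys 0..999

  cuco::priority_queue<int> pq{keys.size()};   // capacity is fixed; the queue does not resize

  pq.push(keys.begin(), keys.end());           // bulk insert an arbitrary number of elements

  thrust::device_vector<int> lowest(10);
  pq.pop(lowest.begin(), lowest.end());        // remove the 10 lowest keys (Max == false)
}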

The device-side operations allow a cooperative group to push or pop up to node_size elements at a time. These device-side operations are invoked through a trivially copyable device view, device_mutable_view, which can be obtained with the host function get_mutable_device_view and passed to the device.
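For illustration, device-side use might look like the sketch below. The names device_mutable_view, get_mutable_device_view, and get_shmem_size come from this PR; the exact push signature (cooperative group, iterator range, scratch pointer) is an assumption.

#include <cooperative_groups.h>

// Hypothetical user kernel: a single block cooperatively pushes n <= node_size elements.
template <typename View, typename InputIt>
__global__ void device_side_insert(View view, InputIt first, std::size_t n)
{
  extern __shared__ char temp[];  // scratch space for the view's push operation
  auto block = cooperative_groups::this_thread_block();
  view.push(block, first, first + n, temp);  // assumed member signature
}

// Host side (names follow the description above; shared-memory sizing mirrors the
// kernel launches quoted later in this PR):
//   auto view = pq.get_mutable_device_view();
//   device_side_insert<<<1, 256, pq.get_shmem_size(256), stream>>>(view, d_keys, n);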

Current limitations:

  • Only supports trivially comparable key types
  • Does not support insertion and deletion at the same time
  • Capacity is fixed and the queue does not automatically resize
  • Deletion from the queue is much slower than insertion into the queue due to congestion at the underlying heap's root node

TODO: Port tests to Catch2 and benchmarks to Google Benchmark

@GPUtester

Can one of the admins verify this patch?

namespace cuco {

/*
* @brief A GPU-accelerated priority queue of key-value pairs
Collaborator

Is there a reason for this to be hardcoded for key-value pairs? Can't it be for any trivially copyable type T? e.g., with std::priority_queue I could have a std::priority_queue<int> or a std::priority_queue<std::pair<int,int>>.

Collaborator

Update docs now that this has been updated.

@jrhemstad
Collaborator

I reviewed the top level header at this point and gave some thoughts/questions on how to make this a little more generic.

@PointKernel added labels on Dec 3, 2021: topic: build (CMake build issue), type: feature request (New feature request), topic: performance (Performance related issue)
@PointKernel
Member

ok to test

@andrewbriand
Author

@PointKernel Thanks for your comments! I believe that I have addressed or responded to them all. Please let me know what you think and what other comments you might have.

@PointKernel (Member) left a comment

Another round of review.

Thanks @andrewbriand for your effort and persistence on this PR! We are almost there.

~priority_queue();

class device_mutable_view {
public:
Member

Suggested change
- public:
+ public:
+   using value_type = T;

Author

Should I also replace references to T with value_type in device_mutable_view?

Member

That will be great!

Comment on lines +79 to +88
detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
  first,
  last - first,
  d_heap_,
  d_size_,
  node_size_,
  d_locks_,
  d_p_buffer_size_,
  lowest_level_start_,
  compare_);
Member

Suggested change
- detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
-   first,
-   last - first,
-   d_heap_,
-   d_size_,
-   node_size_,
-   d_locks_,
-   d_p_buffer_size_,
-   lowest_level_start_,
-   compare_);
+ auto view = get_device_mutable_view();
+ detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
+   first, num_elements, view);

This is a great example showing the power of "view". Accordingly, the push_kernel would look like:

template <typename OutputIt, typename viewT>
__global__ void push_kernel(OutputIt elements,
                            std::size_t const num_elements,
                            viewT view)
{
  using T = typename viewT::value_type;
  ...
}

If you want, push_n_kernel instead of push_kernel would be a more descriptive name in this case.
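For reference, one possible completion of that kernel body, assuming the node-wise helpers become view members as suggested below (view.node_size(), push_single_node, and push_partial_node are assumed names and signatures):

#include <cooperative_groups.h>
#include <cstddef>

template <typename InputIt, typename viewT>
__global__ void push_n_kernel(InputIt elements, std::size_t const num_elements, viewT view)
{
  extern __shared__ char shmem[];  // sized on the host with get_shmem_size(block_size)
  auto const g         = cooperative_groups::this_thread_block();
  auto const node_size = view.node_size();

  // Each block cooperatively pushes whole nodes; block 0 also inserts any remainder.
  for (std::size_t i = blockIdx.x; i < num_elements / node_size; i += gridDim.x) {
    view.push_single_node(g, elements + i * node_size, shmem);
  }
  if (blockIdx.x == 0 && num_elements % node_size != 0) {
    view.push_partial_node(
      g, elements + (num_elements / node_size) * node_size, num_elements % node_size, shmem);
  }
}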

Comment on lines +130 to +138
detail::push_single_node(g,
  first + i * node_size_,
  d_heap_,
  d_size_,
  node_size_,
  d_locks_,
  lowest_level_start_,
  shmem,
  compare_);
Member

push_single_node, push_partial_node, and related utilities should be member functions of device_mutable_view.

Member

The same applies to pop_single_node and pop_partial_node.
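For reference, a rough sketch of the shape this could take (the member list and parameters are assumptions inferred from the call sites quoted above; the raw pointers currently passed around would become data members of the view):

#include <cstddef>

template <typename T, typename Compare>
class device_mutable_view {
 public:
  using value_type = T;

  // Cooperatively push one full node of node_size elements starting at first.
  template <typename CG, typename InputIt>
  __device__ void push_single_node(CG const& g, InputIt first, void* temp_storage);

  // Cooperatively push fewer than node_size elements into the partial buffer.
  template <typename CG, typename InputIt>
  __device__ void push_partial_node(CG const& g, InputIt first, std::size_t n, void* temp_storage);

  // pop_single_node and pop_partial_node would follow the same pattern with an OutputIt.

 private:
  T* d_heap_;
  int* d_size_;
  std::size_t node_size_;
  int* d_locks_;
  std::size_t* d_p_buffer_size_;
  int lowest_level_start_;
  Compare compare_;
};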

Comment on lines +162 to +174
/*
* @brief Return the amount of temporary storage required for operations
* on the queue with a cooperative group size of block_size
*
* @param block_size Size of the cooperative groups to calculate storage for
* @return The amount of temporary storage required in bytes
*/
__device__ int get_shmem_size(int block_size) const
{
  int intersection_bytes = 2 * (block_size + 1) * sizeof(int);
  int node_bytes         = node_size_ * sizeof(T);
  return intersection_bytes + 2 * node_bytes;
}
Member

This seems never used

* @param shmem The shared memory layout for this cooperative group
* @param compare Comparison operator ordering the elements in the heap
*/
template <typename InputIt, typename T, typename Compare, typename CG>
Member

OutputIt instead of InputIt

* @param lowest_level_start The first index of the heaps lowest layer
* @param compare Comparison operator ordering the elements in the heap
*/
template <typename OutputIt, typename T, typename Compare>
Member

Suggested change
- template <typename OutputIt, typename T, typename Compare>
+ template <typename InputIt, typename viewT>

Comment on lines +1120 to +1126
T* heap,
int* size,
std::size_t node_size,
int* locks,
std::size_t* p_buffer_size,
int lowest_level_start,
Compare compare)
Member

Suggested change
- T* heap,
- int* size,
- std::size_t node_size,
- int* locks,
- std::size_t* p_buffer_size,
- int lowest_level_start,
- Compare compare)
+ viewT view)

The kernel implementation can also be simplified with view.
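For example, mirroring the push sketch above, the kernel might then reduce to something like the following (the view member functions and node_size() accessor are assumed here, not part of the current PR):

#include <cooperative_groups.h>
#include <cstddef>

template <typename OutputIt, typename viewT>
__global__ void pop_kernel(OutputIt out, std::size_t const num_elements, viewT view)
{
  extern __shared__ char shmem[];  // sized on the host with get_shmem_size(block_size)
  auto const g         = cooperative_groups::this_thread_block();
  auto const node_size = view.node_size();

  // Each block cooperatively pops whole nodes; block 0 also drains any remainder.
  for (std::size_t i = blockIdx.x; i < num_elements / node_size; i += gridDim.x) {
    view.pop_single_node(g, out + i * node_size, shmem);
  }
  if (blockIdx.x == 0 && num_elements % node_size != 0) {
    view.pop_partial_node(
      g, out + (num_elements / node_size) * node_size, num_elements % node_size, shmem);
  }
}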

andrewbriand and others added 4 commits June 19, 2022 12:29
Co-authored-by: Yunsong Wang <wangyunsong89@gmail.com>
@PointKernel
Member

@andrewbriand Can you please also merge with the latest dev branch and fix any build warnings?
