Priority Queue #105

Open · wants to merge 56 commits into base: dev

Commits (56)
5ab856e  Initial priority queue commit (Sep 9, 2021)
1f2092c  Add priority queue benchmark (Sep 9, 2021)
6a9dc99  Class comment (Sep 9, 2021)
6b263e3  Improve comments and switch to cuco style (Sep 9, 2021)
0eaaedf  Iterators (Sep 17, 2021)
249165c  Test for iterators with thrust device_vector (Sep 17, 2021)
c28a5ad  Add allocator template parameter (Oct 19, 2021)
e8a9c1e  Allocator (andrewbriand, Oct 20, 2021)
012ebde  Accept arbitrary comparison (andrewbriand, Oct 20, 2021)
8cf681a  Accept arbitrary types instead of just pairs (andrewbriand, Oct 24, 2021)
8485bec  Remove pq_pair.h (andrewbriand, Nov 2, 2021)
da608cc  Start porting priority queue benchmark to gbenchmark (andrewbriand, Nov 2, 2021)
8a11b7f  Finish porting priority queue benchmark to gbenchmark (andrewbriand, Nov 3, 2021)
d1392b9  Add multiple node sizes to benchmark (andrewbriand, Dec 18, 2021)
9ee6c8b  Start porting tests to Catch2 (andrewbriand, Dec 18, 2021)
e223598  Prevent block size from being larger than node size (andrewbriand, Dec 18, 2021)
dd8c6b7  Continue porting tests to Catch2 (andrewbriand, Dec 19, 2021)
d031519  Make generate_element for KVPair generic (andrewbriand, Dec 19, 2021)
ba3a6fd  Finish Catch2 tests (andrewbriand, Dec 26, 2021)
16db085  Hide kernel launch details (andrewbriand, Dec 26, 2021)
052cec0  Clean up partial deletion code (andrewbriand, Dec 27, 2021)
a11bea5  Correct test comparisons (andrewbriand, Dec 27, 2021)
e3c4a27  Commenting and cleanup (andrewbriand, Dec 27, 2021)
f6fa484  Commenting for Compare (andrewbriand, Dec 27, 2021)
599067f  Cleanup, arbitrary number of elements for device API functions (andrewbriand, Dec 27, 2021)
44db340  Formatting (andrewbriand, Dec 27, 2021)
acfdf7e  Add missing syncs (andrewbriand, Apr 12, 2022)
d870e29  Merge NVIDIA:dev into andrewbriand:dev (andrewbriand, Apr 14, 2022)
71775b6  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], Apr 14, 2022)
9838569  Add copyright to priority_queue_bench.cu (andrewbriand, May 31, 2022)
aab4ba0  Add copyright to priority queue files (andrewbriand, May 31, 2022)
0196bde  Order headers from near to far in priority queue files (andrewbriand, May 31, 2022)
4af61ca  Bug fix in priority queue test code (andrewbriand, May 31, 2022)
a1d074a  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], May 31, 2022)
bf930dd  Remove unnecessary allocator (andrewbriand, May 31, 2022)
2d9bda9  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], May 31, 2022)
54dc9f3  Add missing member docs in priority_queue.cuh (andrewbriand, Jun 11, 2022)
a5c169d  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], Jun 11, 2022)
4269e9c  Add stream parameter to priority queue ctor (andrewbriand, Jun 11, 2022)
30cbf83  Snake case in priority queue files (andrewbriand, Jun 12, 2022)
bec63f3  Put priority queue kernels in detail namespace (andrewbriand, Jun 12, 2022)
aa12404  generate_keys_uniform -> generate_kv_pairs_uniform (andrewbriand, Jun 13, 2022)
55cf2e6  Remove FavorInsertionPerformance template parameter (andrewbriand, Jun 13, 2022)
f4814db  Default node size 64 -> 1024 (andrewbriand, Jun 15, 2022)
89eea18  Avoid c-style expressions in priority queue files (andrewbriand, Jun 15, 2022)
7d47200  Remove FavorInsertionPerformance in priority queue benchmark (andrewbriand, Jun 15, 2022)
007316a  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], Jun 15, 2022)
192e263  Snake case in priority_queue_test.cu (andrewbriand, Jun 17, 2022)
66dd359  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], Jun 17, 2022)
9da822f  kPBufferIdx -> p_buffer_idx and kRootIdx -> root_idx (andrewbriand, Jun 17, 2022)
0cfdd94  Use const and constexpr wherever possible in priority queue files (andrewbriand, Jun 19, 2022)
828b00b  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], Jun 19, 2022)
1932418  Add missing const in priority queue (andrewbriand, Jun 19, 2022)
7c4b1f6  Add docs for stream parameter to priority queue ctor (andrewbriand, Jun 19, 2022)
838e4ea  Add value_type to priority_queue::device_mutable_view (andrewbriand, Jun 19, 2022)
d58dd9f  [pre-commit.ci] auto code formatting (pre-commit-ci[bot], Jun 19, 2022)
4 changes: 4 additions & 0 deletions benchmarks/CMakeLists.txt
@@ -100,3 +100,7 @@ ConfigureNVBench(RETRIEVE_BENCH
# - reduce_by_key benchmarks ----------------------------------------------------------------------
set(RBK_BENCH_SRC "${CMAKE_CURRENT_SOURCE_DIR}/reduce_by_key/reduce_by_key.cu")
ConfigureBench(RBK_BENCH "${RBK_BENCH_SRC}")

###################################################################################################
set(PRIORITY_QUEUE_BENCH_SRC "${CMAKE_CURRENT_SOURCE_DIR}/priority_queue/priority_queue_bench.cu")
ConfigureBench(PRIORITY_QUEUE_BENCH "${PRIORITY_QUEUE_BENCH_SRC}")
93 changes: 93 additions & 0 deletions benchmarks/priority_queue/priority_queue_bench.cu
@@ -0,0 +1,93 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#include <cuco/detail/pair.cuh>
#include <cuco/priority_queue.cuh>

#include <thrust/device_vector.h>

#include <benchmark/benchmark.h>

#include <cstdint>
#include <random>
#include <vector>

using namespace cuco;

template <typename T>
struct pair_less {
  // Orders pairs by key alone: a precedes b when a.first < b.first
  __host__ __device__ bool operator()(const T& a, const T& b) const { return a.first < b.first; }
};

template <typename Key, typename Value, typename OutputIt>
static void generate_kv_pairs_uniform(OutputIt output_begin, OutputIt output_end)
{
  std::random_device rd;
  std::mt19937 gen{rd()};

  const auto num_keys = std::distance(output_begin, output_end);

  for (auto i = 0; i < num_keys; ++i) {
    output_begin[i] = {static_cast<Key>(gen()), static_cast<Value>(gen())};
  }
}

template <typename Key, typename Value, int NumKeys>
static void BM_insert(::benchmark::State& state)
{
  for (auto _ : state) {
    state.PauseTiming();

    priority_queue<pair<Key, Value>, pair_less<pair<Key, Value>>> pq(NumKeys);

    std::vector<pair<Key, Value>> h_pairs(NumKeys);
    generate_kv_pairs_uniform<Key, Value>(h_pairs.begin(), h_pairs.end());
    const thrust::device_vector<pair<Key, Value>> d_pairs(h_pairs);

    state.ResumeTiming();
    pq.push(d_pairs.begin(), d_pairs.end());
    cudaDeviceSynchronize();
  }
}

template <typename Key, typename Value, int NumKeys>
static void BM_delete(::benchmark::State& state)
{
  for (auto _ : state) {
    state.PauseTiming();

    priority_queue<pair<Key, Value>, pair_less<pair<Key, Value>>> pq(NumKeys);

    std::vector<pair<Key, Value>> h_pairs(NumKeys);
    generate_kv_pairs_uniform<Key, Value>(h_pairs.begin(), h_pairs.end());
    thrust::device_vector<pair<Key, Value>> d_pairs(h_pairs);

    pq.push(d_pairs.begin(), d_pairs.end());
    cudaDeviceSynchronize();

    state.ResumeTiming();
    pq.pop(d_pairs.begin(), d_pairs.end());
    cudaDeviceSynchronize();
  }
}

BENCHMARK_TEMPLATE(BM_insert, int, int, 128'000'000)->Unit(benchmark::kMillisecond);

BENCHMARK_TEMPLATE(BM_delete, int, int, 128'000'000)->Unit(benchmark::kMillisecond);

BENCHMARK_TEMPLATE(BM_insert, int, int, 256'000'000)->Unit(benchmark::kMillisecond);

BENCHMARK_TEMPLATE(BM_delete, int, int, 256'000'000)->Unit(benchmark::kMillisecond);
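
To run a subset of these cases, the standard Google Benchmark name filter can be used (a sketch: PRIORITY_QUEUE_BENCH is the target configured in benchmarks/CMakeLists.txt above; the binary's path depends on the build-directory layout):

  ./PRIORITY_QUEUE_BENCH --benchmark_filter=BM_insert
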
197 changes: 197 additions & 0 deletions include/cuco/detail/priority_queue.inl
@@ -0,0 +1,197 @@
/*
* Copyright (c) 2021-2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

#pragma once

#include <cuco/detail/error.hpp>
#include <cuco/detail/priority_queue_kernels.cuh>

#include <cmath>

namespace cuco {

template <typename T, typename Compare, typename Allocator>
priority_queue<T, Compare, Allocator>::priority_queue(std::size_t initial_capacity,
                                                      Allocator const& allocator,
                                                      cudaStream_t stream)
  : int_allocator_{allocator}, t_allocator_{allocator}, size_t_allocator_{allocator}
{
  node_size_ = 1024;

  // Round up to the nearest multiple of node size
  const int nodes = ((initial_capacity + node_size_ - 1) / node_size_);

  node_capacity_      = nodes;
  lowest_level_start_ = 1 << static_cast<int>(std::log2(nodes));

  // Allocate device variables

  d_size_ = std::allocator_traits<int_allocator_type>::allocate(int_allocator_, 1);

  CUCO_CUDA_TRY(cudaMemsetAsync(d_size_, 0, sizeof(int), stream));

  d_p_buffer_size_ = std::allocator_traits<size_t_allocator_type>::allocate(size_t_allocator_, 1);

  CUCO_CUDA_TRY(cudaMemsetAsync(d_p_buffer_size_, 0, sizeof(std::size_t), stream));

  d_heap_ = std::allocator_traits<t_allocator_type>::allocate(
    t_allocator_, node_capacity_ * node_size_ + node_size_);

  d_locks_ =
    std::allocator_traits<int_allocator_type>::allocate(int_allocator_, node_capacity_ + 1);

  CUCO_CUDA_TRY(cudaMemsetAsync(d_locks_, 0, sizeof(int) * (node_capacity_ + 1), stream));
}
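
As a worked example of the sizing logic above (illustrative numbers, not from the diff): initial_capacity = 2000 with node_size_ = 1024 rounds up to nodes = (2000 + 1023) / 1024 = 2, so node_capacity_ = 2 and lowest_level_start_ = 1 << floor(log2(2)) = 2. The heap buffer is then sized for node_capacity_ * node_size_ + node_size_ = 3072 elements, the extra node's worth of slots presumably backing the partial buffer whose fill count lives in d_p_buffer_size_.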

template <typename T, typename Compare, typename Allocator>
priority_queue<T, Compare, Allocator>::~priority_queue()
{
  std::allocator_traits<int_allocator_type>::deallocate(int_allocator_, d_size_, 1);
  std::allocator_traits<size_t_allocator_type>::deallocate(size_t_allocator_, d_p_buffer_size_, 1);
  std::allocator_traits<t_allocator_type>::deallocate(
    t_allocator_, d_heap_, node_capacity_ * node_size_ + node_size_);
  std::allocator_traits<int_allocator_type>::deallocate(
    int_allocator_, d_locks_, node_capacity_ + 1);
}

template <typename T, typename Compare, typename Allocator>
template <typename InputIt>
void priority_queue<T, Compare, Allocator>::push(InputIt first, InputIt last, cudaStream_t stream)
{
  constexpr int block_size = 256;

  const int num_nodes = static_cast<int>((last - first) / node_size_) + 1;
Review comment (Member), suggested change:

-  const int num_nodes = static_cast<int>((last - first) / node_size_) + 1;
+  auto const num_elements = std::distance(first, last);
+  const int num_nodes     = static_cast<int>(num_elements / node_size_) + 1;

last - first is not always safe for iterators.

Review comment (Member): Please note the empty input should be properly handled wrt #151.

  const int num_blocks = std::min(64000, num_nodes);
Review comment (Member): Can we avoid using the magic number 64000?
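
One possible fix (an illustrative sketch, not part of the diff; the name max_grid_size is hypothetical) is to hoist the bound into a named constant shared by push and pop:

  // Cap on the number of blocks launched for a bulk operation
  static constexpr int max_grid_size = 64'000;
  const int num_blocks = std::min(max_grid_size, num_nodes);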


  detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
    first,
    last - first,
    d_heap_,
    d_size_,
    node_size_,
    d_locks_,
    d_p_buffer_size_,
    lowest_level_start_,
    compare_);
Review comment (Member) on lines +79 to +88, suggested change:

-  detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
-    first, last - first, d_heap_, d_size_, node_size_, d_locks_, d_p_buffer_size_,
-    lowest_level_start_, compare_);
+  auto view = get_device_mutable_view();
+  detail::push_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
+    first, num_elements, view);

This is a great example showing the power of "view". Accordingly, the push_kernel would look like:

template <typename OutputIt, typename viewT>
__global__ void push_kernel(OutputIt elements, std::size_t const num_elements, viewT view)
{
  using T = typename viewT::value_type;
  ...
}

If you want, push_n_kernel instead of push_kernel would be a more descriptive name in this case.


  CUCO_CUDA_TRY(cudaGetLastError());
}

template <typename T, typename Compare, typename Allocator>
template <typename OutputIt>
void priority_queue<T, Compare, Allocator>::pop(OutputIt first, OutputIt last, cudaStream_t stream)
{
  constexpr int block_size = 256;
  const int pop_size = last - first;
Review comment (Member): Use std::distance here as well.


  const int num_nodes  = static_cast<int>(pop_size / node_size_) + 1;
  const int num_blocks = std::min(64000, num_nodes);

  detail::pop_kernel<<<num_blocks, block_size, get_shmem_size(block_size), stream>>>(
    first,
    pop_size,
    d_heap_,
    d_size_,
    node_size_,
    d_locks_,
    d_p_buffer_size_,
    lowest_level_start_,
    node_capacity_,
    compare_);
Review comment (Member) on lines +106 to +113: Use view instead.


  CUCO_CUDA_TRY(cudaGetLastError());
}
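
For reference, a minimal host-side usage sketch of the bulk push/pop above (assumptions: pair_less as defined in the benchmark earlier in this PR; default stream; error handling omitted):

  #include <cuco/priority_queue.cuh>

  #include <thrust/device_vector.h>

  void host_api_example()
  {
    using pair_type = cuco::pair<int, int>;

    // Capacity is rounded up to a whole number of nodes by the constructor
    cuco::priority_queue<pair_type, pair_less<pair_type>> pq{1'000'000};

    thrust::device_vector<pair_type> d_pairs(1'000'000);
    // ... fill d_pairs on the device ...

    pq.push(d_pairs.begin(), d_pairs.end());        // bulk insert (asynchronous)
    pq.pop(d_pairs.begin(), d_pairs.begin() + 100); // remove the top 100 elements under Compare
    cudaDeviceSynchronize();                        // both calls are stream-ordered
  }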

template <typename T, typename Compare, typename Allocator>
template <typename CG, typename InputIt>
__device__ void priority_queue<T, Compare, Allocator>::device_mutable_view::push(
  CG const& g, InputIt first, InputIt last, void* temp_storage)
{
  const detail::shared_memory_layout<T> shmem =
    detail::get_shared_memory_layout<T>(static_cast<int*>(temp_storage), g.size(), node_size_);

  const auto push_size = last - first;
  for (std::size_t i = 0; i < push_size / node_size_; i++) {
    detail::push_single_node(g,
                             first + i * node_size_,
                             d_heap_,
                             d_size_,
                             node_size_,
                             d_locks_,
                             lowest_level_start_,
                             shmem,
                             compare_);
Review comment (Member) on lines +130 to +138: push_single_node, push_partial_node, and related utilities should be member functions of device_mutable_view.

Review comment (Member): The same goes for pop_single_node and pop_partial_node.

  }

  if (push_size % node_size_ != 0) {
    detail::push_partial_node(g,
                              first + (push_size / node_size_) * node_size_,
                              push_size % node_size_,
                              d_heap_,
                              d_size_,
                              node_size_,
                              d_locks_,
                              d_p_buffer_size_,
                              lowest_level_start_,
                              shmem,
                              compare_);
  }
}

template <typename T, typename Compare, typename Allocator>
template <typename CG, typename OutputIt>
__device__ void priority_queue<T, Compare, Allocator>::device_mutable_view::pop(
  CG const& g, OutputIt first, OutputIt last, void* temp_storage)
{
  const detail::shared_memory_layout<T> shmem =
    detail::get_shared_memory_layout<T>(static_cast<int*>(temp_storage), g.size(), node_size_);

  const auto pop_size = last - first;
  for (std::size_t i = 0; i < pop_size / node_size_; i++) {
    detail::pop_single_node(g,
                            first + i * node_size_,
                            d_heap_,
                            d_size_,
                            node_size_,
                            d_locks_,
                            d_p_buffer_size_,
                            lowest_level_start_,
                            node_capacity_,
                            shmem,
                            compare_);
  }

  if (pop_size % node_size_ != 0) {
    // Pop only the remaining pop_size % node_size_ elements, mirroring push above
    detail::pop_partial_node(g,
                             first + (pop_size / node_size_) * node_size_,
                             pop_size % node_size_,
                             d_heap_,
                             d_size_,
                             node_size_,
                             d_locks_,
                             d_p_buffer_size_,
                             lowest_level_start_,
                             node_capacity_,
                             shmem,
                             compare_);
  }
}
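
A minimal sketch of calling the device API above from a user kernel (assumptions: the view comes from the get_device_mutable_view() accessor mentioned in the review comments; the whole thread block cooperates on one range; the kernel is launched with dynamic shared memory of at least get_shmem_size(block_size) bytes, as the host-side push/pop launches do):

  #include <cooperative_groups.h>

  template <typename View, typename InputIt>
  __global__ void push_from_kernel(View view, InputIt first, std::size_t n)
  {
    // Scratch space for the node-merge steps, provided via dynamic shared memory
    extern __shared__ int temp_storage[];

    // Any cooperative group is accepted; here the entire thread block pushes one range
    auto g = cooperative_groups::this_thread_block();
    view.push(g, first, first + n, temp_storage);
  }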

} // namespace cuco