Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StreamHPC 2023-10-18 #480

Merged
merged 107 commits into from
Nov 14, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
107 commits
Select commit Hold shift + click to select a range
98994c5
Excessive shared memory usage in block_shuffle fix
Melirius Jun 9, 2023
0fbf43e
Merge branch '632-excessive-shared-memory-usage-in-block_shuffle-algo…
Melirius Jun 9, 2023
6c3d3c0
remove block_sort_algorithm template param from block_sort_kernel_imp…
parbenc Jun 27, 2023
ab274c6
fixed compile errors
parbenc Jun 28, 2023
2329a5f
Updated ChangeLog.md
parbenc Jun 28, 2023
69fbd4a
remove unnecessary code
parbenc Jun 29, 2023
f68110f
fixed CHANGELOG.md to not be so verbose about non public api changes
parbenc Jun 29, 2023
6343c66
Merge branch '573-merge-sort-block-sort-remove-specializations-for-bl…
parbenc Jul 3, 2023
22e731b
Add dynamic dispatch and autotuning to device_adjacent_difference
sbalint98 May 23, 2023
18ede8e
Fix device_adjacent_difference storage type
sbalint98 Jun 26, 2023
ad93f78
ci: remove autotune dependency from build:benchmark
Jul 5, 2023
2fa35ae
Merge branch '607-tuning-adjdiff' into 'develop_stream'
Jul 6, 2023
8783207
test: fix indexing error test_type_helper<custom_16aligned>::get_rand…
Jul 7, 2023
ac746ed
Merge branch 'fixup_autotune' into 'develop_stream'
Jul 7, 2023
343519e
fixes for compilation in debug for radix_sort
Jul 4, 2023
9cac25f
Merge branch 'fix_custom_16aligned_indexing' into 'develop_stream'
Jul 10, 2023
931280c
fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros
Jul 11, 2023
c46a941
refactor: Use __GFX<GENERATION>__ to detect NAVI cards
Jul 11, 2023
027d2ef
Merge branch 'debug_fixes' into 'develop_stream'
Jul 10, 2023
0a71e6e
docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes
Jul 11, 2023
730664a
Merge branch 'dpp_broadcast' into 'develop_stream'
Jul 12, 2023
c7fea30
remove deprecated structs and functions
parbenc Jul 6, 2023
fc38c7b
rename scan_by_key_config_v2 to scan_by_key_config
parbenc Jul 6, 2023
668f5d3
remove the option to use custom implemented config for histogram
parbenc Jul 7, 2023
c5195d9
update config compile time check to a different pattern
parbenc Jul 7, 2023
ad58f6e
update documentation comments for configs
parbenc Jul 7, 2023
1173bc5
change documentation comments
parbenc Jul 7, 2023
c28b2b2
change documentation comments on device_radix_sort
parbenc Jul 7, 2023
97f48ee
change documentation comment
parbenc Jul 7, 2023
268a37c
update documentation comments
parbenc Jul 7, 2023
752aaea
update documentation comments
parbenc Jul 10, 2023
31769d9
update documentation comments
parbenc Jul 10, 2023
9711471
remove wrap_adjacent_difference_config function
parbenc Jul 11, 2023
733aec5
add missing transform_config ctor
parbenc Jul 11, 2023
2aaaf87
fix binary search still using wrap_transform_config
parbenc Jul 11, 2023
f753f37
implement static_asset to make binary_search only use binary search c…
parbenc Jul 12, 2023
987f04e
update changelog
parbenc Jul 13, 2023
6c5ad41
remove some *_v2s that went under the radar
parbenc Jul 17, 2023
103b892
remove unnecessary default values
parbenc Jul 17, 2023
34c2ced
Merge branch '617-rocm-6-0' into 'develop_stream'
parbenc Jul 17, 2023
9cb4ab2
Add binary search, lower_bound and upper_bound documentation
Beanavil Jul 14, 2023
be698b5
host_warp_size() is replaced with two different versions with paramet…
parbenc Jul 17, 2023
95a9991
comment out unused param names
parbenc Jul 17, 2023
ed788aa
fix typos in the documentation
parbenc Jul 20, 2023
cfa1bdb
Merge branch '145-documentation-binary-search' into 'develop_stream'
Beanavil Jul 19, 2023
fe56926
move host_warp_size to config_type.hpp
parbenc Jul 20, 2023
b524867
add error checks to host_warp_size calls in tests and benchmarks
parbenc Jul 21, 2023
6063e8b
fix format
parbenc Jul 21, 2023
e902223
add missing comment
parbenc Jul 25, 2023
3cc1b26
fix error handling in lookback_scan_state.hpp
parbenc Jul 25, 2023
b1df7a7
fix compilation error
parbenc Jul 25, 2023
1b41ea4
Merge branch '624-remove-internal-references-to-host_warp_size' into …
parbenc Jul 26, 2023
9707407
change block_radix_rank_match and block_histogram_atomic to use rocpr…
parbenc Jun 30, 2023
7c33d35
change radix_digit_count_helper to use rocprim::match_any instead of …
parbenc Jun 30, 2023
b360815
add elect function to warp intrinsics
parbenc Jul 3, 2023
1c0dcaa
update match_any to return 0 when predicate is false
parbenc Jul 4, 2023
5856531
fix the bit check in elect function
parbenc Jul 4, 2023
84f33e1
update changelog.md
parbenc Jul 4, 2023
9c39575
fix hard coded warps per block value to come from param in kernel
parbenc Jul 5, 2023
4f31872
remove unused variables
parbenc Jul 5, 2023
b7c5363
fix review comments
parbenc Jul 6, 2023
624777a
update group_elect test
parbenc Jul 6, 2023
eff7a21
remove unnecessary comments
parbenc Jul 6, 2023
4730dfc
remove expected from group_elect test
parbenc Jul 6, 2023
a6e9d41
fix overindexing
parbenc Jul 7, 2023
64abd1c
fix review comments
parbenc Jul 10, 2023
d62c807
format
parbenc Jul 10, 2023
2520342
fix review comments
parbenc Jul 11, 2023
1c461d7
fix perf regression
parbenc Jul 12, 2023
f6fc6b8
undo group_elect in block_histogram_atomic.hpp, because of perf impact
parbenc Jul 18, 2023
f1f8fe7
fix bad func name in CHANGELOG.md
Jul 25, 2023
ee63ca4
fix merge errors
parbenc Jul 26, 2023
be98ebf
Fix reduce_by_key algorithm so keys[0] is not flagged as a new run wh…
Beanavil Jul 28, 2023
46e0296
Merge branch '623-merge-the-many-copies-of-rocprim-match_any-code' in…
parbenc Jul 28, 2023
759a6f4
Merge branch 'fix-RLE-for-nan' into 'develop_stream'
Beanavil Aug 1, 2023
428f078
make device_radix_sort compatible with compiler provided __int128_t a…
parbenc Aug 3, 2023
8902238
add ifdefs to only compile int128 parts on clang/gcc
parbenc Aug 4, 2023
d793f43
update changelog
parbenc Aug 4, 2023
814d9c5
fix for int128 to_string labdas
parbenc Aug 8, 2023
8461645
add test for block_radix_sort int128 support
parbenc Aug 11, 2023
7580bdd
Implement block run length decode
Naraenda Aug 16, 2023
38ba018
Merge branch '184-add-support-for-u-int128_t-in-deviceradixsort' into…
parbenc Aug 15, 2023
f7083e0
Merge branch 'block-run-length-decode' into 'develop_stream'
Naraenda Aug 16, 2023
53a3164
Fix reduce_by_key algorithm so out of bounds items are not flagged as…
Beanavil Jul 28, 2023
43dc2f8
Add reduce_by_key test to check that flagging is correct when keys ar…
Beanavil Aug 17, 2023
417cea1
Fix performance regression observed during tuning for gfx1030 and gfx…
sbalint98 Aug 30, 2023
8584dbf
Merge branch 'fix-RLE-for-nan' into 'develop_stream'
Beanavil Aug 28, 2023
b028d2e
Block Runlength Decode: Fix incorrect offsets and improve test
Naraenda Sep 8, 2023
e21a158
Merge branch '639-fix-regression-tuning' into 'develop_stream'
sbalint98 Aug 30, 2023
e91e745
Remove duplicate key from .clang-format
sbalint98 Sep 19, 2023
c39ee8c
Merge branch 'improve-block-rld-tests' into 'develop_stream'
Naraenda Sep 8, 2023
f6dda0a
Remove additional duplicates from clang-format
sbalint98 Sep 20, 2023
b3b8028
Merge branch 'fix-clang-format' into 'develop_stream'
sbalint98 Sep 20, 2023
a4c53f2
Merge branch 'fix-clang-format-2' into 'develop_stream'
matyas-streamhpc Sep 26, 2023
15adf4d
Fix binary_search upper/lower_bound config tuning
sbalint98 Sep 19, 2023
04e4afb
unify language around config params in documentation
parbenc Sep 21, 2023
15e1a67
Merge branch 'fix-binary-search-tuning' into 'develop_stream'
matyas-streamhpc Sep 28, 2023
3a35351
Merge branch 'fix_documentation' into 'develop_stream'
parbenc Sep 29, 2023
7075051
Make the autotune build job run nightly
sbalint98 Oct 6, 2023
56fe65d
remove radix_sort_onesweep autotuning workaround
parbenc Oct 9, 2023
9487932
Merge branch 'add-tuning-build-to-pipeline' into 'develop_stream'
Oct 6, 2023
f760858
Merge branch '626-check-skipped-device_radix_sort_onesweep-autotune-c…
parbenc Oct 16, 2023
d07c302
Resolve doxygen warnings for upstream PR
Naraenda Oct 20, 2023
31f12a4
Merge branch 'fix-doxygen-warnings' into 'develop_stream_2023_10_18'
Naraenda Oct 20, 2023
8c66863
Enable get_device_from_stream for Windows
sbalint98 Nov 1, 2023
8c9c64c
Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py
Naraenda Nov 2, 2023
466a946
Bump unreleased ROCm version
Naraenda Nov 2, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 0 additions & 2 deletions .clang-format
Original file line number Diff line number Diff line change
Expand Up @@ -58,8 +58,6 @@ BraceWrapping:
AfterNamespace: true
AfterStruct: true
AfterUnion: true
BeforeCatch: true
BeforeElse: true
AfterExternBlock: false
BeforeCatch: true
BeforeElse: true
Expand Down
11 changes: 10 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
@@ -1,5 +1,14 @@
### Build dirs ###
build/
build*/

### clangd. ###
/.cache

### Docs dirs ###
doc/html/
doc/xml/
doc/latex/
doc/*.tag

# Created by https://www.gitignore.io/api/c++,cmake

Expand Down
13 changes: 4 additions & 9 deletions .gitlab-ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -230,9 +230,7 @@ build:package:

build:benchmark:
stage: build
needs:
- job: "autotune:generate-config"
optional: true
needs: []
tags:
- rocm-build
extends:
Expand Down Expand Up @@ -270,7 +268,7 @@ autotune:build:
extends:
- .cmake-minimum
- .gpus:rocm-gpus
- .rules:manual
- .rules:benchmark
variables:
BENCHMARK_TARGETS: benchmark_config_tuning
script:
Expand All @@ -282,6 +280,7 @@ autotune:build:
-S $CI_PROJECT_DIR
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_CXX_FLAGS="-Wno-#pragma-messages"
-D CMAKE_BUILD_TYPE=Release
-D BUILD_TEST=OFF
-D BUILD_EXAMPLE=OFF
Expand Down Expand Up @@ -472,11 +471,7 @@ autotune:execute-tuning:
# Exclude benchmark that is known to fail on gfx906
# On ROCm 5.7 or later, check if this can be removed - the presumption is that the failure is caused by a compiler issue.
- >
if [[ "${GPU_TARGET}" == "gfx906" ]] && [[ "${AUTOTUNE_ALGORITHM_REGEX}" == "" ]]; then
export AUTOTUNE_ALGORITHM_REGEX="-\{\"lvl\":\"device\",\"algo\":\"radix_sort_onesweep\",\"key_type\":\"short\",\"value_type\":\"short\",\"cfg\":\{\"histogram\":\{\"bs\":1024,\"ipt\":22},\"sort\":\{\"bs\":1024,\"ipt\":22},\"bits_per_place\":5,\"algorithm\":\"block_radix_rank_algorithm::match\"}}"
fi
- 'printf "CI Variables used in benchmarks:\nAUTOTUNE_RESULT_DIR: %s\nAUTOTUNE_FILENAME_REGEX: %s\nAUTOTUNE_ALGORITHM_REGEX: %s \nAUTOTUNE_SIZE: %s \nAUTOTUNE_TRIALS: %s\n" "$AUTOTUNE_RESULT_DIR" "$AUTOTUNE_FILENAME_REGEX" "$AUTOTUNE_ALGORITHM_REGEX" "$AUTOTUNE_SIZE" "$AUTOTUNE_TRIALS"'
- cd "${CI_PROJECT_DIR}"
cd "${CI_PROJECT_DIR}"
- mkdir -p "${AUTOTUNE_RESULT_DIR}"
- python3
.gitlab/run_benchmarks.py
Expand Down
20 changes: 20 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,19 +2,39 @@

Full documentation for rocPRIM is available at [https://rocprim.readthedocs.io/en/latest/](https://rocprim.readthedocs.io/en/latest/)

## [Unreleased rocPRIM-3.0.0 for ROCm 6.1.0]
### Added
- Added new primitive: `block_run_length_decode`.
### Changed
- Removed deprecated functionality: `reduce_by_key_config`, `MatchAny`, `scan_config`, `scan_by_key_config` and `radix_sort_config`.
- Renamed `scan_config_v2` to `scan_config`, `scan_by_key_config_v2` to `scan_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`, `reduce_by_key_config_v2` to `reduce_by_key_config`, `radix_sort_config_v2` to `radix_sort_config`.
- Removed support for custom config types for device algorithms.
- `host_warp_size()` was moved into `rocprim/device/config_types.hpp`, and now uses either a `device_id` or a `stream` parameter to query the proper device and a `device_id` out parameter. The return type is `hipError_t`.
- Added support for __int128_t in `device_radix_sort` and `block_radix_sort`.
### Fixed
- Fixed build issues with `rmake.py` on Windows when using VS 2017 15.8 or later due to a breaking fix with extended aligned storage.

## [Unreleased rocPRIM-2.13.1 for ROCm 5.7.0]
### Added
- `block_sort::sort()` overload for keys and values with a dynamic size, for all block sort algorithms. Additionally, all `block_sort::sort()` overloads with a dynamic size are now supported for `block_sort_algorithm::merge_sort` and `block_sort_algorithm::bitonic_sort`.
- New two-way partition primitive `partition_two_way` which can write to two separate iterators.
- Added config tuning and dynamic dispatch to `device_adjacent_difference` algorithm
- New `rocprim::group_elect` warp intrinsic, which chooses one lane from the lanes enabled by a mask.
### Changed
- Deprecated configuration `radix_sort_config` for device-level radix sort as it no longer matches the algorithm's parameters. New configuration `radix_sort_config_v2` is preferred instead.
- Removed erroneous implementation of device-level `inclusive_scan` and `exclusive_scan`. The prior default implementation using lookback-scan now is the only available implementation.
- The benchmark metric indicating the bytes processed for `exclusive_scan_by_key` and `inclusive_scan_by_key` has been changed to incorporate the key type. Furthermore, the benchmark log has been changed such that these algorithms are reported as `scan` and `scan_by_key` instead of `scan_exclusive` and `scan_inclusive`.
- Deprecated configurations `scan_config` and `scan_by_key_config` for device-level scans, as they no longer match the algorithm's parameters. New configurations `scan_config_v2` and `scan_by_key_config_v2` are preferred instead.
- Improved the performance of `partition`.
- `merge_sort_block_sort` will always use stable merge sort as it is faster than the fallback implementation.
- The `rocprim::match_any` interface has a new parameter, `valid` to enalble/disable lanes. The default value is true, so it doesn't change the previous behaviour.
### Fixed
- Fixed build issue caused by missing header in `thread/thread_search.hpp`.
- Fixed `rocprim::MatchAny` for devices with 64-bit warp size. The function `rocprim::MatchAny` is deprecated and `rocprim::match_any` is preferred instead.
- Fixed `device_adjacent_difference` using more shared memory than required.
- Fixed a compilation error when `ROCPRIM_DISABLE_DPP` is defined.
- rocPRIM should be more robust for detecting GPU architecture features. Explicitly listing each architecture is no longer required by developers, fixing compilation failures when
targeting devices not known by rocPRIM.

## [rocPRIM-2.13.0 for ROCm 5.5.0]
### Added
Expand Down
1 change: 1 addition & 0 deletions benchmark/CMakeLists.txt
Original file line number Diff line number Diff line change
Expand Up @@ -123,6 +123,7 @@ add_rocprim_benchmark(benchmark_block_histogram.cpp)
add_rocprim_benchmark(benchmark_block_radix_sort.cpp)
add_rocprim_benchmark(benchmark_block_radix_rank.cpp)
add_rocprim_benchmark(benchmark_block_reduce.cpp)
add_rocprim_benchmark(benchmark_block_run_length_decode.cpp)
add_rocprim_benchmark(benchmark_block_scan.cpp)
add_rocprim_benchmark(benchmark_block_sort.cpp)
add_rocprim_benchmark(benchmark_config_dispatch.cpp)
Expand Down
6 changes: 3 additions & 3 deletions benchmark/ConfigAutotuneSettings.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -29,10 +29,10 @@ set(LIMITED_TUNING_TYPES "int64_t int short int8_t")

function(read_config_autotune_settings file list_across_names list_across output_pattern_suffix)
if(file STREQUAL "benchmark_device_adjacent_difference")
set(list_across_names "DataType;Left;InPlace;BlockSize;ItemsPerThread" PARENT_SCOPE)
set(list_across_names "DataType;Left;InPlace;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};\
true false;true false;64 128;1 2 4 8 16" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@_@ItemsPerThread@" PARENT_SCOPE)
true;false true;32 64 128 256 512 1024" PARENT_SCOPE)
set(output_pattern_suffix "@DataType@_@Left@_@InPlace@_@BlockSize@" PARENT_SCOPE)
elseif(file STREQUAL "benchmark_device_histogram")
set(list_across_names "DataType;BlockSize" PARENT_SCOPE)
set(list_across "${TUNING_TYPES};64 128 256" PARENT_SCOPE)
Expand Down
242 changes: 242 additions & 0 deletions benchmark/benchmark_block_run_length_decode.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
// MIT License
//
// Copyright (c) 2021-2023 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
// in the Software without restriction, including without limitation the rights
// to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
// copies of the Software, and to permit persons to whom the Software is
// furnished to do so, subject to the following conditions:
//
// The above copyright notice and this permission notice shall be included in all
// copies or substantial portions of the Software.
//
// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
// IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
// FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
// AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
// LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
// OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
// SOFTWARE.

#include "benchmark/benchmark.h"
#include "benchmark_utils.hpp"
#include "cmdparser.hpp"

#include "rocprim/block/block_load.hpp"
#include "rocprim/block/block_run_length_decode.hpp"
#include "rocprim/block/block_store.hpp"

#include <random>
#include <vector>

#ifndef DEFAULT_N
const size_t DEFAULT_N = 1024 * 1024 * 32;
#endif

template<class ItemT,
class OffsetT,
unsigned BlockSize,
unsigned RunsPerThread,
unsigned DecodedItemsPerThread,
unsigned Trials>
__global__
__launch_bounds__(BlockSize) void block_run_length_decode_kernel(const ItemT* d_run_items,
const OffsetT* d_run_offsets,
ItemT* d_decoded_items,
bool enable_store = false)
{
using BlockRunLengthDecodeT
= rocprim::block_run_length_decode<ItemT, BlockSize, RunsPerThread, DecodedItemsPerThread>;

ItemT run_items[RunsPerThread];
OffsetT run_offsets[RunsPerThread];

const unsigned global_thread_idx = BlockSize * hipBlockIdx_x + hipThreadIdx_x;
rocprim::block_load_direct_blocked(global_thread_idx, d_run_items, run_items);
rocprim::block_load_direct_blocked(global_thread_idx, d_run_offsets, run_offsets);

ROCPRIM_SHARED_MEMORY typename BlockRunLengthDecodeT::storage_type temp_storage;
BlockRunLengthDecodeT block_run_length_decode(run_items, run_offsets);

const OffsetT total_decoded_size
= d_run_offsets[(hipBlockIdx_x + 1) * BlockSize * RunsPerThread]
- d_run_offsets[hipBlockIdx_x * BlockSize * RunsPerThread];

#pragma nounroll
for(unsigned i = 0; i < Trials; ++i)
{
OffsetT decoded_window_offset = 0;
while(decoded_window_offset < total_decoded_size)
{
ItemT decoded_items[DecodedItemsPerThread];
block_run_length_decode.run_length_decode(decoded_items, decoded_window_offset);

if(enable_store)
{
rocprim::block_store_direct_blocked(global_thread_idx,
d_decoded_items + decoded_window_offset,
decoded_items);
}

decoded_window_offset += BlockSize * DecodedItemsPerThread;
}
}
}

template<class ItemT,
class OffsetT,
unsigned MinRunLength,
unsigned MaxRunLength,
unsigned BlockSize,
unsigned RunsPerThread,
unsigned DecodedItemsPerThread,
unsigned Trials = 100>
void run_benchmark(benchmark::State& state, hipStream_t stream, size_t N)
{
constexpr auto runs_per_block = BlockSize * RunsPerThread;
const auto target_num_runs = 2 * N / (MinRunLength + MaxRunLength);
const auto num_runs
= runs_per_block * ((target_num_runs + runs_per_block - 1) / runs_per_block);

std::vector<ItemT> run_items(num_runs);
std::vector<OffsetT> run_offsets(num_runs + 1);

std::default_random_engine prng(std::random_device{}());
using ItemDistribution = std::conditional_t<std::is_integral<ItemT>::value,
std::uniform_int_distribution<ItemT>,
std::uniform_real_distribution<ItemT>>;
ItemDistribution run_item_dist(0, 100);
std::uniform_int_distribution<OffsetT> run_length_dist(MinRunLength, MaxRunLength);

for(size_t i = 0; i < num_runs; ++i)
{
run_items[i] = run_item_dist(prng);
}
for(size_t i = 1; i < num_runs + 1; ++i)
{
const OffsetT next_run_length = run_length_dist(prng);
run_offsets[i] = run_offsets[i - 1] + next_run_length;
}
const OffsetT output_length = run_offsets.back();

ItemT* d_run_items{};
HIP_CHECK(hipMalloc(&d_run_items, run_items.size() * sizeof(ItemT)));
HIP_CHECK(hipMemcpy(d_run_items,
run_items.data(),
run_items.size() * sizeof(ItemT),
hipMemcpyHostToDevice));

OffsetT* d_run_offsets{};
HIP_CHECK(hipMalloc(&d_run_offsets, run_offsets.size() * sizeof(OffsetT)));
HIP_CHECK(hipMemcpy(d_run_offsets,
run_offsets.data(),
run_offsets.size() * sizeof(OffsetT),
hipMemcpyHostToDevice));

ItemT* d_output{};
HIP_CHECK(hipMalloc(&d_output, output_length * sizeof(ItemT)));

for(auto _ : state)
{
auto start = std::chrono::high_resolution_clock::now();
hipLaunchKernelGGL(HIP_KERNEL_NAME(block_run_length_decode_kernel<ItemT,
OffsetT,
BlockSize,
RunsPerThread,
DecodedItemsPerThread,
Trials>),
dim3(num_runs / runs_per_block),
dim3(BlockSize),
0,
stream,
d_run_items,
d_run_offsets,
d_output);
HIP_CHECK(hipPeekAtLastError());
HIP_CHECK(hipDeviceSynchronize());

auto end = std::chrono::high_resolution_clock::now();
auto elapsed_seconds
= std::chrono::duration_cast<std::chrono::duration<double>>(end - start);

state.SetIterationTime(elapsed_seconds.count());
}
state.SetBytesProcessed(state.iterations() * output_length * sizeof(ItemT) * Trials);
state.SetItemsProcessed(state.iterations() * output_length * Trials);

HIP_CHECK(hipFree(d_run_items));
HIP_CHECK(hipFree(d_run_offsets));
HIP_CHECK(hipFree(d_output));
}

#define CREATE_BENCHMARK(IT, OT, MINRL, MAXRL, BS, RPT, DIPT) \
benchmark::RegisterBenchmark("block_run_length_decode<Item Type:" #IT ",Offset Type:" #OT \
",Min RunLength:" #MINRL ",Max RunLength:" #MAXRL \
",BlockSize: " #BS ",Runs Per Thread:" #RPT \
",Decoded Items Per Thread:" #DIPT ">", \
&run_benchmark<IT, OT, MINRL, MAXRL, BS, RPT, DIPT>, \
stream, \
size)

int main(int argc, char* argv[])
{
cli::Parser parser(argc, argv);
parser.set_optional<size_t>("size", "size", DEFAULT_N, "number of values");
parser.set_optional<int>("trials", "trials", -1, "number of iterations");
parser.run_and_exit_if_error();

// Parse argv
benchmark::Initialize(&argc, argv);
const size_t size = parser.get<size_t>("size");
const int trials = parser.get<int>("trials");

std::cout << "benchmark_block_run_length_decode" << std::endl;

// HIP
hipStream_t stream = 0; // default
hipDeviceProp_t devProp;
int device_id = 0;
HIP_CHECK(hipGetDevice(&device_id));
HIP_CHECK(hipGetDeviceProperties(&devProp, device_id));
std::cout << "[HIP] Device name: " << devProp.name << std::endl;

// Add benchmarks
std::vector<benchmark::internal::Benchmark*> benchmarks{
CREATE_BENCHMARK(int, int, 1, 5, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 10, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 50, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 100, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 500, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 1000, 128, 2, 4),
CREATE_BENCHMARK(int, int, 1, 5000, 128, 2, 4),

CREATE_BENCHMARK(double, long long, 1, 5, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 10, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 50, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 100, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 500, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 1000, 128, 2, 4),
CREATE_BENCHMARK(double, long long, 1, 5000, 128, 2, 4)};

// Use manual timing
for(auto& b : benchmarks)
{
b->UseManualTime();
b->Unit(benchmark::kMillisecond);
}

// Force number of iterations
if(trials > 0)
{
for(auto& b : benchmarks)
{
b->Iterations(trials);
}
}

// Run benchmarks
benchmark::RunSpecifiedBenchmarks();
return 0;
}
10 changes: 6 additions & 4 deletions benchmark/benchmark_device_adjacent_difference.parallel.cpp.in
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
// MIT License
//
// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
//
// Permission is hereby granted, free of charge, to any person obtaining a copy
// of this software and associated documentation files (the "Software"), to deal
Expand All @@ -26,9 +26,11 @@
#include "benchmark_device_adjacent_difference.parallel.hpp"

namespace {
auto benchmarks = config_autotune_register::create<device_adjacent_difference_benchmark<
auto benchmarks = config_autotune_register::create_bulk(
device_adjacent_difference_benchmark_generator<
@DataType@,
@BlockSize@,
@Left@,
@InPlace@,
rocprim::adjacent_difference_config<@BlockSize@, @ItemsPerThread@>>>();
@InPlace@>::create);

}
Loading