Upstream staging 2024 03 22 (#541)

* Modify adjacent_difference_kernel_impl so if ouput_type is void the input_type is used instead * fix(match_result_type.hpp): deprecate 'match_result_type.hpp' instead of removing it * change device_batch_memcpy to have an IsMemCpy template param started implementing test for device_copy * separate batch_memcpy and batch_cpy into different API calls and header files create test for batch_copy * add benchmark for batch_copy add separate config for batch_copy * update docs * removed unused variable * fix review comments * use warp_size properly instead of hardcoded number * merge batch_copy and batch_memcpy tests * merge batch_copy and batch_memcpy benchmarks * fix unused parameters * add event destroys to the benchmark * fix(device_scan.hpp): throw compiler warning on gfx 11 (navi 3x) for incorrect results due to compiler bug This change should be reverted once the compiler bug is fixed. * Resolve "tests for `block_adjacent_difference` and `block_discontinuity` don't compile with `rocprim::half`" * Update device_adjacent_difference fix for void output type to match CUB * Update CHANGELOG * revert: fix(device_scan.hpp): throw compiler warning on gfx 11 (navi 3x) for incorrect results due to compiler bug This reverts commit 3f375e458e741d10e46c61119ea93ea1012b7946. * fix(lookback_scan_state.hpp): fix sporadic failure in device scan algorithm on navi 3 gpus The compiler does not emit a 'buffer_gl0_inv'-instruction that is required on gfx11. We emit it manually only on this architecture to ensure correct results. * add tests for supported data types * clarified block_histogram documentation fixed block_histogram test, to work with int8_t data * fix bfloat16 tests * update changelog * add comment explainers for commented test cases * Abstracted bit_cast away * Implemented in-place radix sort * Added decomposer arguments and decomposer checking * Sliced compilation of test_block_radix_sort * Implemented sort_comparator for custom test type * Testing custom test types * Moved out identity_decomposer from detail namespace * Added custom type benchmarks * Updated changelog * Fixed typo in changelog * Fixed formatting * Separately testing radix key codec * Fixed documentation * Fixed test utils comparator for custom_float_type * Instantiate test block_radix_sort custom_test_tye<double> only when int128 is supported * Fixed compilation on Windows * Fixed tests for signed integral keys * Added extra radix_key_codec test cases * Fixed device_adjacent_difference kernel config selection and shared memory usage * Documentation fix after rebase * Added the memset to the graph for DeviceAdjacentDifferenceLargeTests * feat(predicate_iterator.hpp): added rocprim::predicate_iterator * fix(predicate_iterator.hpp): fix msvc build error due to implicit deletion of copy assignment * docs(predicate_iterator.hpp): fix doxygen build and improve consistency * refactor(predicate_iterator.hpp): improve naming consistency with other rocprim iterators and algorithms * fix(predicate_iterator.hpp): add missing out-of-class definition for operator+ * fix(predicate_iterator.hpp): fix predicate_iterator not reading value when underlying iterator dereferences to a non-reference type * test(benchmark_predicate_iterator.cpp): add benchmark for predicate iterator * refactor(predicate_iterator.hpp): remove unneeded << operator for std::ostream * docs(predicate_iterator.hpp): fix spelling * fix(predicate_iterator.hpp): drop 'const' from 'predicate_' as a class with const members cannot have an implicit copy assign * refactor(predicate_iterator.hpp): clean up * Replaced in `warp_sort_shuffle` where possible the `warp_shuffle_xor` for `warp_swizzle` * Replaced in `block_sort_bitonic` where possible the `warp_shuffle_xor` for `warp_swizzle` * Updated changelog * Generalize warp_swizzle_shuffle function for both block_sort_bitonic and warp_sort_shuffle * Fix docs warp_swizzle_shuffle * fix(predicate_iterator.hpp): drop constness of derference operator and deriviates This is required to relax the requirement of functions passed as predicate. * fix(predicate_iterator.hpp): derive proxy capture type from dereference operator instead of relying on iterator trait * test(test_predicate_iterator.cpp): extend predicate iterator type tests * remove old workaround comment * Fixed descending device_radix_sort for bool keys * Bool tests in test_device_radix_sort * Bool tests in test_block_radix_sort * Replaced std::getenv for Windows with _dupenv_s to prevent MSVC deprication warning * Removed malloc from linux version of the __get_env * Mark << operator as deprecated for iterators * device_radix_sort uses identity decomposer * Testing device_radix_sort with custom type [WIP] * Sorted overloads and updated docs for device_radix_sort * device_radix_sort public decomposer APIs * Fixed device_radix_sort with custom decomposer and added tests * Updated docs and changelog * Fixed building device_segmented_radix_sort * Added and tested additional device_radix_sort decomposers overloads * Enforce begin/end bit being default for floating point radix sort * Compile time dispatch and cleanup for radix_merge_compare * Added custom_key benchmarks for device_radix_sort * Fixed comparator dispatch in device_segmented_radix_sort * Removed duplicate test_device_radix_sort case * Iterating from MSB to LSB in decomposed radix_merge_compare * add decomposer argument * Added warp_exchange optimization with template recursion * Added tests to warp_exchange that can use the optimization * Warp exchange optimization using integer_sequence * Changed the shuffle warp_exchange to make use of swizzle and use a temp array. * Added extra benchmarks for warp_exchange where warpsize equals items per thread * Added non in place tests for warp exchange * Added documentation for optimized blocked_to_striped_shuffle and striped_to_blocked_shuffle functions of warp_exchange * Added some comments to warp_exchange * Use primes for tuning device adjacent difference * Removed accidental restriction of build targets in build:benchmark job * remove workaround for old compiler bug * Removed hotfix for double to double to __half conversion bug * style(test_utils_hipgraphs.hpp): add used includes * test: remove superfluous graph calls from tests All graph capturing and launching of host only rocprim calls are unneeded as they don't invoke device code. * fix(test_device_adjacent_difference.cpp): re-enable use of identity_operator on device_adjacent_difference tests * feat(test_device_adjacent_difference.cpp): add tests for void value_type in device_adjacent_difference * Update algorithms descriptions with non-bit-wise reproducibility * Added RadixBitsPerPass as parameter for the block_radix_sort * Add static_assert for RadixBitsPerPass from block_radix_sort * Added tests for block_radix_sort with different number of radix_bits_per_pass * Update CHANGELOG * Added some benchmarks with different radix bits per pass * Remove benchmarks with RadixBitsPerPass equal to 1 * Added check to partition kernel if size is smaller than items_per_block * Fixed benchmark_device_adjacent_difference formatting * Remove unnessary rocprim headers for more specific includes * Ordering includes of includes of benchmark files * Added detail includes only for direct usage of detail functions in benchmarks * declare shared memory at kernel level as workaround for non-optimized builds taking too long * increase build parallelism * add debug build and run to ci * fix leftover instance * fix copyright dates * debug benchmark builds * Deprecate TwiddleIn/TwiddleOut * Match radix_key_codec with radix_key_codec_inplace * Remove radix_key_codec_inplace * Make radix_key_codec part of the public API * Add radix_key_codec to sphinx docs * Add radix_key_codec tests for encode/decode/extract_digit consistency * Add static assert to ensure non-fundamental typed keys do not get an identity_decomposer * Add ROCPIM_PRAGMA_MESSAGE to warn about radix_sort.hpp functionality migration * Add test cases for block_histogram * Add test cases for block_exchange * Fix test_block_exchange to avoid UB * fix(thread_load.hpp): combine asm statements to fix broken behavior in debug builds 'rocprim::thread_load' used two consecutive asm declarations (a load and a wait) which allowed the compiler to insert code between the two instructions. This bug was only observed when compiling with '-O0'. By joining the two asm declarations, the compiler can no longer insert instructions between the load and wait, which would cause incomplete data to be used when it was dependent on the data being loaded. --------- Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com> Co-authored-by: Nara Prasetya <nara@streamhpc.com> Co-authored-by: Lőrinc Serfőző <lorinc@streamhpc.com> Co-authored-by: Nick Breed <nick@streamhpc.com> Co-authored-by: Nol Moonen <nol@streamhpc.com> Co-authored-by: Balint Soproni <balint@streamhpc.com>
ROCm · Apr 3, 2024 · 14779f7 · 14779f7
1 parent 7e343a6
commit 14779f7
Show file tree

Hide file tree

Showing 165 changed files with 11,654 additions and 5,295 deletions.
diff --git a/.gitlab-ci.yml b/.gitlab-ci.yml
@@ -169,25 +169,26 @@ build:cmake-minimum-apt:
       -G Ninja
       -D CMAKE_CXX_COMPILER="$AMDCLANG"
       -D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
-      -D CMAKE_BUILD_TYPE=Release
-      -D BUILD_TEST=ON
+      -D CMAKE_BUILD_TYPE="$BUILD_TYPE"
+      -D BUILD_$BUILD_TARGET=ON 
       -D BUILD_EXAMPLE=ON
-      -D BUILD_BENCHMARK=OFF
       -D GPU_TARGETS=$GPU_TARGETS
       -D AMDGPU_TEST_TARGETS=$GPU_TARGETS
       -S $CI_PROJECT_DIR
       -B $BUILD_DIR
     - cmake --build $BUILD_DIR
   artifacts:
     paths:
-      - $BUILD_DIR/test/test_*
-      - $BUILD_DIR/test/rocprim/test_*
-      - $BUILD_DIR/test/CTestTestfile.cmake
-      - $BUILD_DIR/test/rocprim/CTestTestfile.cmake
-      - $BUILD_DIR/gtest/
-      - $BUILD_DIR/CMakeCache.txt
       - $BUILD_DIR/.ninja_log
+      - $BUILD_DIR/benchmark/*
+      - $BUILD_DIR/CMakeCache.txt
       - $BUILD_DIR/CTestTestfile.cmake
+      - $BUILD_DIR/deps/googlebenchmark/
+      - $BUILD_DIR/gtest/
+      - $BUILD_DIR/test/CTestTestfile.cmake
+      - $BUILD_DIR/test/rocprim/CTestTestfile.cmake
+      - $BUILD_DIR/test/rocprim/test_*
+      - $BUILD_DIR/test/test_*
     expire_in: 2 weeks
 
 build:cmake-latest:
@@ -196,13 +197,19 @@ build:cmake-latest:
   extends:
     - .cmake-latest
     - .build:common
+  variables:
+    BUILD_TYPE: Release
+    BUILD_TARGET: TEST
 
 build:cmake-minimum:
-  stage: build
   needs: []
   extends:
     - .cmake-minimum
     - .build:common
+  parallel:
+    matrix:
+      - BUILD_TYPE: [Debug, Release]
+        BUILD_TARGET: [BENCHMARK, TEST]
 
 build:package:
   stage: build
@@ -229,39 +236,6 @@ build:package:
       - $PACKAGE_DIR/rocprim*.zip
     expire_in: 2 weeks
 
-build:benchmark:
-  stage: build
-  needs: []
-  tags:
-    - build
-  extends:
-    - .cmake-minimum
-    - .gpus:rocm-gpus
-    - .rules:build
-  script:
-# If we have a custom config created by autotune:create-config
-    - "[ -d ${AUTOTUNE_CONFIG_DIR} ] && cp -r -f ${AUTOTUNE_CONFIG_DIR}/* ${CI_PROJECT_DIR}/"
-    - mkdir -p $BUILD_DIR
-    - cd $BUILD_DIR
-    - cmake
-      -B $BUILD_DIR
-      -S $CI_PROJECT_DIR
-      -G Ninja
-      -D CMAKE_CXX_COMPILER="$AMDCLANG"
-      -D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror -Wno-#pragma-messages"
-      -D CMAKE_BUILD_TYPE=Release
-      -D BUILD_TEST=OFF
-      -D BUILD_EXAMPLE=OFF
-      -D BUILD_BENCHMARK=ON
-      -D GPU_TARGETS=$GPU_TARGETS
-    - cmake --build .
-  artifacts:
-    paths:
-      - $BUILD_DIR/benchmark/*
-      - $BUILD_DIR/.ninja_log
-      - $BUILD_DIR/deps/googlebenchmark/
-    expire_in: 2 weeks
-
 build:windows:
   stage: build
   needs: []
@@ -272,14 +246,9 @@ build:windows:
     - .deps:visual-studio-devshell
   parallel:
     matrix:
-      - BUILD_TYPE:
-        # Disabled due to extensive link times. 
-        # This is tracked in issue 679
-        #- Debug
-        - Release
-        BUILD_TARGET:
-        - BENCHMARK
-        - TEST
+      # Debug is disabled due to extensive link times, tracked in issue 679.
+      - BUILD_TYPE: [Release]
+        BUILD_TARGET: [BENCHMARK, TEST]
   script:
     - mkdir -p $CI_PROJECT_DIR/build
     - cmake -G Ninja 
@@ -360,7 +329,11 @@ test:
     - .rules:test
     - .gpus:rocm
   needs:
-    - build:cmake-minimum
+    - job: build:cmake-minimum
+      parallel:
+        matrix:
+          - BUILD_TYPE: Release
+            BUILD_TARGET: TEST
   script:
     - cd $BUILD_DIR
     - cmake
@@ -475,7 +448,11 @@ test:docs:
 
 benchmark:
   needs:
-    - build:benchmark
+    - job: build:cmake-minimum
+      parallel:
+        matrix:
+          - BUILD_TYPE: Release
+            BUILD_TARGET: BENCHMARK
   extends:
     - .cmake-minimum
     - .gpus:rocm
@@ -560,8 +537,6 @@ autotune:execute-tuning:
     paths:
       - ${AUTOTUNE_RESULT_DIR}/*.json
   script:
-    # Exclude benchmark that is known to fail on gfx906
-    # On ROCm 5.7 or later, check if this can be removed - the presumption is that the failure is caused by a compiler issue.
     - >
       cd "${CI_PROJECT_DIR}"
     - |

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -12,12 +12,45 @@ Documentation for rocPRIM is available at
   * The default accumulator type is still the value type of the input iterator (inclusive scan) or the initial value's type (exclusive scan).
     This is the same behaviour as before this change.
 * New overload for `device_adjacent_difference_inplace` that allows separate input and output iterators, but allows them to point to the same element.
+* New public API for deriving resulting type on device-only functions:
+  * `rocprim::invoke_result`
+  * `rocprim::invoke_result_t`
+  * `rocprim::invoke_result_binary_op`
+  * `rocprim::invoke_result_binary_op_t`
+* New `rocprim::batch_copy` function added. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
+* Added more test cases, to better cover supported data types.
+* Updated some tests to work with supported data types.
+* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`. 
+  To sort keys of an user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
+  of references to arithmetic types from the key.
+* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate. 
+  It iterates over proxies that holds the references to the underlying values, but only allow reading and writing if the predicate is `true`.
+  It can be instantiated with:
+  * `rocprim::make_predicate_iterator`
+  * `rocprim::make_mask_iterator`
+* Added custom radix sizes as the last parameter for `block_radix_sort`. The default value is 4, it can be a number between 0 and 32.
+* New `rocprim::radix_key_codec`, which allows the encoding/decoding of keys for radix-based sorts. For user-defined key types, a decomposer functor should be passed.
+
+### Optimizations
+
+* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
+* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warpsize is equal to the items per thread.
 
 ### Fixes
 
 * Fixed incorrect results of `warp_exchange::blocked_to_striped_shuffle` and `warp_exchange::striped_to_blocked_shuffle` when the block size is
   larger than the logical warp size. The test suite has been updated with such cases.
 * Fixed incorrect results returned when calling device `unique_by_key` with overlapping `values_input` and `values_output`.
+* Fixed incorrect output type used in `device_adjacent_difference`.
+* Hotfix for incorrect results on the GFX10 (Navi 10/RDNA1, Navi 20/RDNA2) ISA and GFX11 ISA (Navi 30 GPUs) on device scan algorithms `rocprim::inclusive_scan(_by_key)` and `rocprim::exclusive_scan(_by_key)` with large input types.
+* `device_adjacent_difference` now considers both the input and the output type for selecting the appropriate kernel launch config. Previously only the input type was considered, which could result in compilation errors due to excessive shared memory usage.
+* Fixed incorrect data being loaded with `rocprim::thread_load` when compiling with `-O0`.
+
+### Deprecations
+
+* The internal header `detail/match_result_type.hpp` has been deprecated.
+* `TwiddleIn` and `TwiddleOut` have been deprecated in favor of `radix_key_codec`.
+* The internal `::rocprim::detail::radix_key_codec` has been deprecated in favor of the new public utility with the same name.
 
 ## Unreleased rocPRIM-3.1.0 for ROCm 6.1.0
 

diff --git a/benchmark/CMakeLists.txt b/benchmark/CMakeLists.txt
@@ -151,6 +151,7 @@ add_rocprim_benchmark(benchmark_device_segmented_radix_sort_keys.cpp)
 add_rocprim_benchmark(benchmark_device_segmented_radix_sort_pairs.cpp)
 add_rocprim_benchmark(benchmark_device_segmented_reduce.cpp)
 add_rocprim_benchmark(benchmark_device_transform.cpp)
+add_rocprim_benchmark(benchmark_predicate_iterator.cpp)
 add_rocprim_benchmark(benchmark_warp_exchange.cpp)
 add_rocprim_benchmark(benchmark_warp_reduce.cpp)
 add_rocprim_benchmark(benchmark_warp_scan.cpp)

diff --git a/benchmark/benchmark_block_adjacent_difference.cpp b/benchmark/benchmark_block_adjacent_difference.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -20,19 +20,22 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.
 
-#include <algorithm>
-#include <rocprim/block/block_adjacent_difference.hpp>
-
-// Google Benchmark
-#include "benchmark/benchmark.h"
-
+#include "benchmark_utils.hpp"
 // CmdParser
 #include "cmdparser.hpp"
-#include "benchmark_utils.hpp"
+
+// Google Benchmark
+#include <benchmark/benchmark.h>
 
 // HIP API
 #include <hip/hip_runtime.h>
 
+// rocPRIM
+#include <rocprim/block/block_adjacent_difference.hpp>
+#include <rocprim/block/block_load_func.hpp>
+#include <rocprim/block/block_store_func.hpp>
+
+#include <algorithm>
 #include <iostream>
 #include <limits>
 #include <string>

diff --git a/benchmark/benchmark_block_discontinuity.cpp b/benchmark/benchmark_block_discontinuity.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -20,24 +20,28 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.
 
-#include <cstdio>
-#include <cstdlib>
-#include <iostream>
-#include <limits>
-#include <string>
-#include <vector>
-
-// Google Benchmark
-#include "benchmark/benchmark.h"
 // CmdParser
-#include "cmdparser.hpp"
 #include "benchmark_utils.hpp"
+#include "cmdparser.hpp"
+
+// Google Benchmark
+#include <benchmark/benchmark.h>
 
 // HIP API
 #include <hip/hip_runtime.h>
 
 // rocPRIM
-#include <rocprim/rocprim.hpp>
+#include <rocprim/block/block_discontinuity.hpp>
+#include <rocprim/block/block_load_func.hpp>
+#include <rocprim/block/block_store_func.hpp>
+
+#include <iostream>
+#include <limits>
+#include <string>
+#include <vector>
+
+#include <cstdio>
+#include <cstdlib>
 
 #ifndef DEFAULT_N
 const size_t DEFAULT_N = 1024 * 1024 * 128;

diff --git a/benchmark/benchmark_block_exchange.cpp b/benchmark/benchmark_block_exchange.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2022 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -20,24 +20,28 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.
 
-#include <cstdio>
-#include <cstdlib>
-#include <iostream>
-#include <limits>
-#include <string>
-#include <vector>
-
-// Google Benchmark
-#include "benchmark/benchmark.h"
 // CmdParser
 #include "cmdparser.hpp"
 #include "benchmark_utils.hpp"
 
+// Google Benchmark
+#include <benchmark/benchmark.h>
+
 // HIP API
 #include <hip/hip_runtime.h>
 
 // rocPRIM
-#include <rocprim/rocprim.hpp>
+#include <rocprim/block/block_exchange.hpp>
+#include <rocprim/block/block_load_func.hpp>
+#include <rocprim/block/block_store_func.hpp>
+
+#include <iostream>
+#include <limits>
+#include <string>
+#include <vector>
+
+#include <cstdio>
+#include <cstdlib>
 
 #ifndef DEFAULT_N
 const size_t DEFAULT_N = 1024 * 1024 * 32;

diff --git a/benchmark/benchmark_block_histogram.cpp b/benchmark/benchmark_block_histogram.cpp
@@ -1,6 +1,6 @@
 // MIT License
 //
-// Copyright (c) 2017-2023 Advanced Micro Devices, Inc. All rights reserved.
+// Copyright (c) 2017-2024 Advanced Micro Devices, Inc. All rights reserved.
 //
 // Permission is hereby granted, free of charge, to any person obtaining a copy
 // of this software and associated documentation files (the "Software"), to deal
@@ -20,24 +20,28 @@
 // OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 // SOFTWARE.
 
-#include <cstdio>
-#include <cstdlib>
-#include <iostream>
-#include <limits>
-#include <string>
-#include <vector>
-
-// Google Benchmark
-#include "benchmark/benchmark.h"
+#include "benchmark_utils.hpp"
 // CmdParser
 #include "cmdparser.hpp"
-#include "benchmark_utils.hpp"
+
+// Google Benchmark
+#include <benchmark/benchmark.h>
 
 // HIP API
 #include <hip/hip_runtime.h>
 
 // rocPRIM
-#include <rocprim/rocprim.hpp>
+#include <rocprim/block/block_histogram.hpp>
+#include <rocprim/block/block_load_func.hpp>
+#include <rocprim/block/block_store_func.hpp>
+
+#include <iostream>
+#include <limits>
+#include <string>
+#include <vector>
+
+#include <cstdio>
+#include <cstdlib>
 
 #ifndef DEFAULT_N
 const size_t DEFAULT_N = 1024 * 1024 * 128;