StreamHPC 2023-10-18 #480

Naraenda · 2023-10-18T12:10:19Z

Summary of changes from old to new, with notable changes bolded:

f760858 retuned device_radix_sort_onesweep configuration for gfx906
9487932 add auto tune build to pipeline
3a35351 minor docs fixes
15e1a67 fix binary search tuning
a4c53f2, b3b8028 clang format fixes
c39ee8c run_length_decode test improvements
e21a158 retuned various algorithms after performance regression
8584dbf, 759a6f4 fix run_length_encode when handling NaN
f7083e0 add block_run_length_decode algorithm
38ba018 add support for (u)int128_t in device_radix_sort
46e0296 removed duplicate instances of rocprim::match_any
1b41ea4 ⚠ removed internal references to host_warp_size. This changes the API of lookback scan!
cfa1bdb ~~add~~ changes to documentation of binary search
34c2ced ⚠ removal of various deprecated structs (ROCm 6.0)
730664a dpp_broadcast fixes (resolves rocPRIM selects wrong code paths for warp reductions and scans on gfx1036 #452)
027d2ef fixes for debug mode in radix_sort
9cac25f fix an out of bound access in internal test for custom_16aligned-type
ac746ed autotune improvements
2fa35ae autotune for adjacent difference
6343c66 merge sort block sort remove specializations for block_sort_algorithm bitonic
0fbf43e fix excessive shared memory usage in block_shuffle algorithm

31f12a4 resolves doxygen errors after merging
8c66863, 8c9c64c windows build fixes

…' into 'develop_stream' Excessive shared memory usage in block_shuffle fix Closes ROCm#632 See merge request amd/libraries/rocPRIM!530

…l and block_sort_impl

…ock_sort_algorithm-bitonic-and' into 'develop_stream' remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl Closes ROCm#573 See merge request amd/libraries/rocPRIM!535

The workaround needed to make this work has major disadvantages, and our current workflow does not make use of this dependency anyway (currently the generated configs are checked into the repository, so the CI would run the benchmarks on them on the next push to the merge-request). When we improve automation around autotuning this could be implemented with conditional jobs, but let's just drop the dependency for now.

Resolve "Config tuning and dynamic dispatch for device adjacent difference" Closes ROCm#607 See merge request amd/libraries/rocPRIM!529

…om_data Indexing was 4 based when the type has 3 variables, therefore it was overflowing. Caught with address sanitizer.
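For illustration, the class of bug described here can be sketched as below. The `triple` type and `pack_triples` helper are hypothetical stand-ins, not the actual `test_type_helper<custom_16aligned>` code: a type with three fields is filled from a flat array, so the per-item stride has to be 3.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical three-field type standing in for the custom test type.
struct triple
{
    int a;
    int b;
    int c;
};

// Number of flat values consumed per item; this must match the field count.
constexpr std::size_t fields_per_item = 3;

// Build items from a flat array holding fields_per_item values per item.
std::vector<triple> pack_triples(const std::vector<int>& flat)
{
    assert(flat.size() % fields_per_item == 0);
    std::vector<triple> out(flat.size() / fields_per_item);
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        // A stride of 4 here would read past the end of `flat`
        // for the last items, which is what a sanitizer reports.
        const std::size_t base = i * fields_per_item;
        out[i] = triple{flat[base], flat[base + 1], flat[base + 2]};
    }
    return out;
}
```

Deriving the stride from a single named constant keeps the indexing and the allocation size from drifting apart.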

ci: remove autotune dependency from build:benchmark See merge request amd/libraries/rocPRIM!540

- Add force inline to onesweep kernel, to avoid too much shared memory errors
- Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors

test: fix indexing error test_type_helper<custom_16aligned>::get_random_data See merge request amd/libraries/rocPRIM!541

The amdgpu target in clang now provides the GFX generation as a predefined macro, so we no longer need to explicitly list all targets, which was bad for maintenance. Also replace the use of the generic `ROCPRIM_NAVI`, which signals Navi support, with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly states what we're after. Also make sure that `ROCPRIM_DETAIL_USE_DPP` is always defined (to 0 when DPP is disabled); previously it was undefined when `ROCPRIM_DISABLE_DPP` was set.
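A minimal sketch of the "always defined to 0 or 1" pattern, assuming hypothetical `EXAMPLE_*` macro names rather than the real rocPRIM ones. The set of `__GFX<N>__` generations listed is also an assumption made for the example, not a statement about which hardware supports DPP.

```cpp
// All EXAMPLE_* names are hypothetical; this mirrors the pattern only.
// The generation list below is an assumption made for this sketch.
#if defined(EXAMPLE_DISABLE_DPP)
    #define EXAMPLE_USE_DPP 0
#elif defined(__GFX8__) || defined(__GFX9__) || defined(__GFX10__) || defined(__GFX11__)
    #define EXAMPLE_USE_DPP 1
#else
    #define EXAMPLE_USE_DPP 0 // host compilation pass or unknown target
#endif

// Because the macro is always defined, it can be tested with `#if`
// rather than `#ifdef`, and a misspelled name can be caught by -Wundef
// instead of silently evaluating to 0.
constexpr bool example_uses_dpp()
{
#if EXAMPLE_USE_DPP
    return true;
#else
    return false;
#endif
}
```

On a host-only build none of the generation macros are defined, so the sketch falls through to the non-DPP path.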

fixes for compilation in debug for radix_sort See merge request amd/libraries/rocPRIM!538

Use __GFX<GENERATION>__ macros for DPP & NAVI detection Closes ROCm#637 See merge request amd/libraries/rocPRIM!543

remove the option to use custom implemented config for scan_by_key
update tests to not use custom implemented config for scan_by_key

update tests to not use custom implemented config for histogram

rename radix_sort_config_v2 to radix_sort_config

add static_assert to check type for reduce_config

remove wrap_scan_config function
add static_assert to disallow custom scan_config type
rename scan_config_v2 to scan_config
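The `static_assert` checks mentioned in these commits could look roughly like the sketch below. Every name here (`default_config`, `scan_config`, `is_valid_scan_config`, `scan_entry_point`) is a simplified, hypothetical stand-in and not the actual rocPRIM declaration; the point is only that a user-defined config type is rejected at compile time with a readable message.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the library's config types.
struct default_config {};

template<unsigned BlockSize, unsigned ItemsPerThread>
struct scan_config {};

// Trait: true only for default_config or an instantiation of scan_config.
template<class T>
struct is_valid_scan_config : std::false_type {};

template<>
struct is_valid_scan_config<default_config> : std::true_type {};

template<unsigned BlockSize, unsigned ItemsPerThread>
struct is_valid_scan_config<scan_config<BlockSize, ItemsPerThread>> : std::true_type {};

// Hypothetical entry point: the check runs when the template is instantiated.
template<class Config = default_config>
constexpr bool scan_entry_point()
{
    static_assert(is_valid_scan_config<Config>::value,
                  "Config must be default_config or an instance of scan_config");
    return true;
}

// A custom type that merely looks like a config no longer compiles
// when passed to scan_entry_point.
struct my_custom_config
{
    static constexpr unsigned block_size = 256;
};
```

Instantiating `scan_entry_point<my_custom_config>()` would stop compilation at the `static_assert`, whereas `scan_entry_point<scan_config<256, 4>>()` is accepted.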

Fix performance regression observed during tuning for gfx1030 and gfx1102 Closes ROCm#639 See merge request amd/libraries/rocPRIM!548

Block Runlength Decode: Fix incorrect offsets and improve test See merge request amd/libraries/rocPRIM!553

Remove duplicate key from .clang-format See merge request amd/libraries/rocPRIM!556

Remove additional duplicates from clang-format See merge request amd/libraries/rocPRIM!558

Use specialized configurations for upper, lower, and binary search algorithms when performing tuning

Fix binary_search upper/lower_bound config tuning See merge request amd/libraries/rocPRIM!555

unify language around config params in documentation See merge request amd/libraries/rocPRIM!559

Add config tuning build to pipeline See merge request amd/libraries/rocPRIM!557

…onfiguration-in-rocm-5-7-gfx906' into 'develop_stream' Check skipped device_radix_sort_onesweep autotune configuration in ROCm 5.7 (gfx906) Closes ROCm#626 See merge request amd/libraries/rocPRIM!560

Resolve doxygen warnings for upstream PR See merge request amd/libraries/rocPRIM!561

rocprim/include/rocprim/device/config_types.hpp

CHANGELOG.md

* StreamHPC 2023-10-18 (#480)
* Excessive shared memory usage in block_shuffle fix
* remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl
* fixed compile errors
* Updated ChangeLog.md
* remove unnecessary code
* fixed CHANGELOG.md to not be so verbose about non public api changes
* Add dynamic dispatch and autotuning to device_adjacent_difference
* Fix device_adjacent_difference storage type
* ci: remove autotune dependency from build:benchmark
  The workaround needed to make this work has major disadvantages, and our current workflow does not make use of this dependency anyway (currently the generated configs are checked into the repository, so the CI would run the benchmarks on them on the next push to the merge-request). When we improve automation around autotuning this could be implemented with conditional jobs, but let's just drop the dependency for now.
* test: fix indexing error test_type_helper<custom_16aligned>::get_random_data
  Indexing was 4 based when the type has 3 variables, therefore it was overflowing. Caught with address sanitizer.
* fixes for compilation in debug for radix_sort
  - Add force inline to onesweep kernel, to avoid too much shared memory errors
  - Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors
* fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros
  The amdgpu target in clang now provides the GFX generation as a predefined macro, so we no longer need to explicitly list all targets, which was bad for maintenance. Also replace the use of the generic `ROCPRIM_NAVI`, which signals Navi support, with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly states what we're after. Also make sure that `ROCPRIM_DETAIL_USE_DPP` is always defined (to 0 when DPP is disabled); previously it was undefined when `ROCPRIM_DISABLE_DPP` was set.
* refactor: Use __GFX<GENERATION>__ to detect NAVI cards
* docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes
* remove deprecated structs and functions
* rename scan_by_key_config_v2 to scan_by_key_config
  remove the option to use custom implemented config for scan_by_key
  update tests to not use custom implemented config for scan_by_key
* remove the option to use custom implemented config for histogram
  update tests to not use custom implemented config for histogram
* update config compile time check to a different pattern
* update documentation comments for configs
* change documentation comments
* change documentation comments on device_radix_sort
  rename radix_sort_config_v2 to radix_sort_config
* change documentation comment
  add static_assert to check type for reduce_config
* update documentation comments
  remove wrap_scan_config function
  add static_assert to disallow custom scan_config type
  rename scan_config_v2 to scan_config
* update documentation comments
* update documentation comments
  make transform_config inherit from detail::transform_config_params
  remove wrap_transform_config
  add static assert to test for Config type in device_transform
* remove wrap_adjacent_difference_config function
  add static_assert to test config type
  create default ctor for adjacent_difference_config
* add missing transform_config ctor
  rewrite adjacent_difference_config ctor to match other config structs
* fix binary search still using wrap_transform_config
* implement static_assert to make binary_search only use binary search configs, but also work with the underlying transform
* update changelog
* remove some *_v2s that went under the radar
* remove unnecessary default values
* Add binary search, lower_bound and upper_bound documentation
* host_warp_size() is replaced with two different versions with parameters.
  the new versions use either a device id or a stream to figure out the warp size of the device
* comment out unused param names
* fix typos in the documentation
* move host_warp_size to config_types.hpp
  changed host_warp_size signatures to fit other similar functions
* add error checks to host_warp_size calls in tests and benchmarks
* fix format
* add missing comment
* fix error handling in lookback_scan_state.hpp
* fix compilation error
* change block_radix_rank_match and block_histogram_atomic to use rocprim::match_any instead of implementing same functionality
* change radix_digit_count_helper to use rocprim::match_any instead of implementing same functionality
  added predicate param to rocprim::match_any to set invalid lanes and added tests for this functionality
* add elect function to warp intrinsics
  add test for elect
  change block_histogram_atomic, block_radix_rank_match, device_histogram, device_radix_sort to use elect instead of copy-paste code
* update match_any to return 0 when predicate is false
* fix the bit check in elect function
* update changelog.md
* fix hard coded warps per block value to come from param in kernel
* remove unused variables
* fix review comments
  minor name changes
  update test
  update comments
* update group_elect test
  tests multiple groups per warp
  doesn't check which exact thread is elected in a group, only that one is elected
* remove unnecessary comments
* remove expected from group_elect test
  fix compile error
* fix overindexing
* fix review comments
  update group_elect_test to have better coverage
* format
* fix review comments
* fix perf regression
* undo group_elect in block_histogram_atomic.hpp, because of perf impact
* fix bad func name in CHANGELOG.md
* fix merge errors
* Fix reduce_by_key algorithm so keys[0] is not flagged as a new run when it is NaN
* make device_radix_sort compatible with compiler provided __int128_t and __uint128_t
* add ifdefs to only compile int128 parts on clang/gcc
* update changelog
* fix for int128 to_string lambdas
* add test for block_radix_sort int128 support
* Implement block run length decode
* Fix reduce_by_key algorithm so out of bounds items are not flagged as new runs for NaNs
* Add reduce_by_key test to check that flagging is correct when keys are all different
* Fix performance regression observed during tuning for gfx1030 and gfx1102
* Block Runlength Decode: Fix incorrect offsets and improve test
* Remove duplicate key from .clang-format
* Remove additional duplicates from clang-format
* Fix binary_search upper/lower_bound config tuning
  Use specialized configurations for upper, lower, and binary search algorithms when performing tuning
* unify language around config params in documentation
* Make the autotune build job run nightly
* remove radix_sort_onesweep autotuning workaround
* Resolve doxygen warnings for upstream PR
* Enable get_device_from_stream for Windows
* Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py
* Bump unreleased ROCm version

---------

Co-authored-by: Ivan Siutsou <ivan@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: Bálint Soproni <balint@streamhpc.com>
Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Mátyás Aradi <matyas@streamhpc.com>

* StreamHPC 2023-11-17 (batch memcpy) (#485)
* Implemented batch memcpy algorithm and relevant tests and benchmarks
* Optimize match_any by using arithmetic shifts
  The compiler seems to see through these much better than the conditional, generating bit-field extract instructions, and recognizing that the loop is a reduction.
* Pedantic / consistency changes for batch memcpy
* Improve interface and implementation of align_(up|down)
  - Use the alignment of the destination type instead of its size
  - Rename to emphasize that this does a form of reinterpret_cast
  - Use the same type as the return type and template parameter, to match the interface of built-in casts
  - Pedantic: use uintptr_t instead of size_t for the numerical value of a pointer
  - Use clang's __builtin_align_(up|down) when available
* Take parameters as explicit const-ref in test_utils::bit_equal
  Because these are templates this already works for non-copyable types (as `T` will be deduced to `Type&`), but it's confusing, and wouldn't work for r-values. Because we are comparing object representations, taking a copy isn't okay as that only guarantees that the value representation is copied. (I.e. padding bytes are not required to be copied when taking a parameter by copy.)
* Actually make custom_non(copyable|moveable)_type non (copy|move)-able
* Allow passing rocprim::default_config to batch_memcpy
  As all the other device functions do too.
* Fix typo in cast_align_down documentation
* Fixup accidentally deleted constructor of custom_non_moveable_type
  This was accidentally deleted, it was meant to be defaulted. Currently no test calls this as batch-memcpy tests only use this type at the device side.
* Improve error message of test_rocprim_package
  The error message of the package test wasn't very nice, improve it for easier debugging in the future.
  Before:
  ```console
  ❯ ./a.out
  98
  ```
  After:
  ```console
  ❯ ./a.out
  Error hipErrorInvalidDeviceFunction(98): invalid device function in main at test_rocprim_package.cpp:90
  ```
* Refactor test_utils::get_random_data into generate_random_data_n
  - Writes the output into an output iterator instead of creating & returning a vector. This allows greater flexibility for users, e.g. writing random values with differing options into the same container.
  - Accepts a generator instead of a seed. This is more efficient, because creating an instance of an rng engine might be costly. It's also more consistent with how the standard library operates.
  - The naming and interface tries to mirror the stl (e.g. `std::generate_n`)
  - Backwards compatibility is maintained by adding test_utils::get_random_data that uses `generate_random_data_n` internally.
* Refactor get_random_data into generate_random_data_n in benchmark_utils
  This mirrors the test changes in the previous commit
* Unify segmented generation from test generate_random_data_n overloads
* Add missing include for iterator traits to benchmark_utils
* ci: use build instead of rocm-build tag
  This allows the build job to be performed by any runner configured for building, instead of the ROCm-specialized builder. As the target architectures are specified ahead of time, the GPU is not needed during the build process, and the build may be performed by any builder.
* fix: Fixed doxygen warning in device_memcpy_config.hpp
* Speed up / Improve data-generation in test_device_batch_memcpy
  Do bulk data-generation instead of individual calls, especially of individual bytes for the data to copy. Also changes the verification to do bulk memcmp instead of item-wise test_utils::bit_equals for each buffer. Overall this reduces the time it takes to run the test to ~1s from around 3s.
* Refactor & Speedup benchmark_device_batch_memcpy
  - Share the data generation between the naive and uut benchmarks
  - Make the data-generation be bulk using a fast random number engine (mt19937) to significantly speed it up.
  The overall runtime of the benchmark decreased from 14 minutes (!) to around 2 minutes.
* Fix explanation comment in batch_memcpy test/benchmark
* fix include order in benchmark_device_batch_memcpy
* doc: add batch memcpy to changelog

---------

Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>

* Add unit testing to verify that algorithms work with hipGraphs (#478)
* Basic hipGraph tests
  * Add basic tests for graph creation, instantiation, and execution using:
    * stream capture
    * manual construction
* hipGraph test for device_reduce algorithms
  * Added new unit tests for device_reduce, device_reduce_by_key algorithms to verify basic support for hipGraphs (no synchronous API functions are called within the algorithms).
  * Fixed up CMakeLists compile issue for tests in the test/hipgraph folder
  * Updated code documentation
* Add hipGraph unit tests for device level algorithms
  * Added unit tests that run the following algorithms inside of a graph (in isolation):
    - device_adjacent_difference
    - device_binary_search
    - device_histogram
    - device_merge
    - device_merge_sort
    - device_partition
    - device_radix_sort
    - device_scan
    - device_segmented_reduce
    - device_segmented_scan
    - device_select
    - device_transform
  * Updated existing tests for:
    - device_reduce
    - device_reduce_by_key
  * Moved graph test helper functions to a separate file
* Add hipGraph unit tests
  * Added remaining device level hipGraph unit tests
  * Note: currently, there are two device level algorithms that do not work with hipGraphs because they contain synchronization barriers. No hipGraph unit tests have been added for these algorithms:
    * device_run_length_encode
    * device_segmented_radix_sort
  * Added a functional integration test for hipGraphs, which runs several algorithms back-to-back within a graph.
  * Refactored test helper code to remove unnecessary parameter
* Set hipgraph test pointers to nullptr
  * Set key_type device pointers to nullptr when they are declared, for safety.
* Several minor fixes for hipGraph tests
  * Fixed up spelling error in comments
  * Moved call to hipGetLastError to a more appropriate position
  * Removed old commented test code
* Minor fixes for hipgraph unit tests
  * Moved several synchronization barriers so they are now outside of graph capture blocks in the test_device_partition source
  * Changed several loop counters to unsigned type
  * Updated hipgraph cmake files - removed test/hipgraph directory's CMakeLists.txt
* Additional test and bugfix for hipgraph tests
  * Removed synchronization barrier in test_device_scan
  * Added basic test to exercise atomic function within a hipgraph
  * Rebased and resolved merge conflicts
* readme and changelog updates (#486)
* Skip device_adjacent_difference hipGraph test on Windows for Navi3x (#490)
  * Currently, the LargeIndices hipGraphs test for gfx1030 on Windows is skipped
  * This change causes this test case to also get skipped on gfx1100, gfx1101, gfx1102 on Windows
  * The reason this test fails on Navi on Windows appears to be related to the check_output class (used by OutputIterator in the test).
    * this may be related to using atomics inside of graphs, but further investigation is needed
* Bump cryptography from 41.0.4 to 41.0.6 in /docs/.sphinx (#488)
  Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6.
  - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
  - [Commits](pyca/cryptography@41.0.4...41.0.6)

  ---
  updated-dependencies:
  - dependency-name: cryptography
    dependency-type: indirect
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Bump rocm-docs-core from 0.27.0 to 0.30.0 in /docs/.sphinx (#489)
  Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.30.0.
  - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
  - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
  - [Commits](ROCm/rocm-docs-core@v0.27.0...v0.30.0)

  ---
  updated-dependencies:
  - dependency-name: rocm-docs-core
    dependency-type: direct:production
    update-type: version-update:semver-minor
  ...

  Signed-off-by: dependabot[bot] <support@github.com>
  Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Lookback state fixes (#491)
* Do not call fence in the wait loop
* Use __hip_atomic_load/store instead of atomicExch/atomicAdd
  atomicExch is compiled to global_atomic_swap even when the result is not used.
* Use faster fences in lookback algorithms on gfx94*
  This version is specific for devices with slow __threadfence ("agent" fence which does L2 cache flushing and invalidation). Fences with "workgroup" scope are used instead to ensure ordering only but not coherence; they do not flush and invalidate cache. Global coherence of prefixes_*_values is ensured by atomic_load/atomic_store that bypass cache.
* Rename ROCPRIM_LOOKBACK_WITHOUT_SLOW_FENCES to ROCPRIM_DETAIL_LOOKBACK_SCAN_STATE_WITHOUT_SLOW_FENCES.
  This is more verbose to communicate that it is an implementation detail. It uses 0 and 1 instead of the presence of the macro now, and won't be overridden if set by a developer on the command line.
* Add WITHOUT_SLOW_FENCES version to lookback_scan_state::get_complete_value
* refactor: lookback_scan_state WITHOUT_SLOW_FENCES misc changes
  - use sizeof(variable)
  - use auto* and const auto* instead of just auto
  - use void* instead of char* to avoid yet another cast
  - make the atomic order fence a separate function and add docs & warning
* fix: Restore removed interfaces of lookback_scan_state
  Even though these are in the detail namespace and as such explicitly not meant for usage by users, some projects did start depending on them. The interfaces for these are slightly broken and rocPRIM developers discourage any users from using them (or the newer interfaces for that matter) because they are implementation details. No further guarantees are provided for these APIs. In the future a public interface is planned for lookback_scan_state as we have recognized that this is a useful primitive, and it's unreasonable to expect users to implement it for themselves.
* refactor: rename __builtin_amdgcn_fence as atomic_fence_acquire_order_only

---------

Co-authored-by: Anton Gorenko <anton@streamhpc.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Nara <nara@streamhpc.com>
Co-authored-by: Ivan Siutsou <ivan@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: Bálint Soproni <balint@streamhpc.com>
Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Mátyás Aradi <matyas@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Wayne Franz <wayfranz@amd.com>
Co-authored-by: Lisa <lisajdelaney@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
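As a rough illustration of the align_(up|down) interface described in the batch memcpy commits above (alignment taken from the destination type, `uintptr_t` for the numeric pointer value, the template parameter equal to the return type), a host-only sketch might look like this. All names and signatures are hypothetical and the code is not copied from rocPRIM. The real helpers may also use clang's `__builtin_align_up` / `__builtin_align_down`, which this sketch leaves out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Round an address down to a multiple of `alignment` (a power of two).
constexpr std::uintptr_t align_down_value(std::uintptr_t address, std::size_t alignment)
{
    return address & ~(static_cast<std::uintptr_t>(alignment) - 1);
}

// Round an address up to a multiple of `alignment` (a power of two).
constexpr std::uintptr_t align_up_value(std::uintptr_t address, std::size_t alignment)
{
    return align_down_value(address + alignment - 1, alignment);
}

// Like a built-in cast: the template argument is the resulting pointer type,
// and the alignment comes from the pointee type, not from its size.
template<class Pointer>
Pointer cast_align_down(void* pointer)
{
    static_assert(std::is_pointer<Pointer>::value, "Pointer must be a pointer type");
    using target = typename std::remove_pointer<Pointer>::type;
    return reinterpret_cast<Pointer>(
        align_down_value(reinterpret_cast<std::uintptr_t>(pointer), alignof(target)));
}

template<class Pointer>
Pointer cast_align_up(void* pointer)
{
    static_assert(std::is_pointer<Pointer>::value, "Pointer must be a pointer type");
    using target = typename std::remove_pointer<Pointer>::type;
    return reinterpret_cast<Pointer>(
        align_up_value(reinterpret_cast<std::uintptr_t>(pointer), alignof(target)));
}

// Usage sketch: find the first suitably aligned 32-bit word in a byte buffer.
inline std::uint32_t* first_aligned_word(unsigned char* bytes)
{
    std::uint32_t* word = cast_align_up<std::uint32_t*>(bytes);
    assert(reinterpret_cast<std::uintptr_t>(word) % alignof(std::uint32_t) == 0);
    return word;
}
```

Using `alignof` rather than `sizeof` matters for types whose size is larger than their alignment requirement, such as a struct of three `int`s.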

* StreamHPC 2023-10-18 (#480) * Excessive shared memory usage in block_shuffle fix * remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl * fixed compile errors * Updated ChangeLog.md * remove unnecessary code * fixed CHANGELOG.md to not be so verbose about non public api changes * Add dynamic dispatch and autotuning to device_adjacent_difference * Fix device_adjacent_difference storage type * ci: remove autotune dependency from build:benchmark The workaround needed to make this work is has major disadvantages, and our current workflow does not make use of this dependency anyway (Currently the generated configs are checked into the repository, so the CI would run the benchmarks on them on the next push to the merge-request). When we improve automation around autotuning this could be implemented with conditional jobs, but lets just drop the dependency for now. * test: fix indexing error test_type_helper<custom_16aligned>::get_random_data Indexing was 4 based when the type has 3 variables, therefore it was overflowing. Caught with address sanitizer. * fixes for compilation in debug for radix_sort - Add force inline to onesweep kernel, to avoid too much shared memory errors - Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors * fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros The amdgpu target in clang now provides the GFX generation as a predefined macro, so we no longer need to explicitly list all targets, which was bad for maintenance. Also replace the use of the generic `ROCPRIM_NAVI` which signals navi support, with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly states what we're after. Also also makes sure that `ROCPRIM_DETAIL_USE_DPP` is always defined (to 0 when DPP is disabled), previously it was undefined when `ROCPRIM_DISABLE_DPP` was set. 
* refactor: Use __GFX<GENERATION>__ to detect NAVI cards * docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes * remove deprecated structs and functions * rename scan_by_key_config_v2 to scan_by_key_config remove the option to use custom implemented config for scan_by_key update tests to not use custom implemented config for scan_by_key * remove the option to use custom implemented config for histogram update tests to not use custom implemented config for histogram * update config compile time check to a different pattern * update documentation comments for configs * change documentation comments * change documentation comments on device_radix_sort rename radix_sort_config_v2 to radix_sort_config * change documentation comment add static_assert to check type for reduce_config * update documentation comments remove wrap_scan_config function add static_assert to disallow custom scan_config type rename scan_config_v2 to scan_config * update documentation comments * update documentation comments make transform_config inherit from detail::transfomr_config_params remove wrap_transform_config add static assert to test for Config type in device_transform * remove wrap_adjacent_difference_config function add static_assert to test config type create default ctor for adjacent_difference_config * add missing transform_config ctor rewrite adjacent_difference_config ctor to match other config structs * fix binary search still using wrap_transform_config * implement static_asset to make binary_search only use binary search configs, but also work with the underlying transform * update changelog * remove some *_v2s that went under the radar * remove unnecessary default values * Add binary search, lower_bound and upper_bound documentation * host_warp_size() is replaced with two different versions with parameters. 
the new versions use either a device id or a stream to figure out the warp size of the device * comment out unused param names * fix typos in the documentation * move host_warp_size to config_type.hpp changed host_warp_size signatures to fit other similar functions * add error checks to host_warp_size calls in tests and benchmarks * fix format * add missing comment * fix error handling in lookback_scan_state.hpp * fix compilation error * change block_radix_rank_match and block_histogram_atomic to use rocprim::match_any instead of implementing same functionality * change radix_digit_count_helper to use rocprim::match_any instead of implementing same functionality added predicate param to rocprim::match_any to set invalid lanes and added tests for this functionality * add elect function to warp intrinsics add test for elect change block_histogram_atomic, block_radix_rank_match, device_histogram, device_radix_sort to use elect instead of copy-paste code * update match_any to return 0 when predicate is false * fix the bit check in elect function * update changelog.md * fix hard coded warps per block value to come from param in kernel * remove unused variables * fix review comments minor name changes update test update comments * update group_elect test tests multiple groups per warp doesn't check which exact thread is elected in a group, only that one is elected * remove unnecessary comments * remove expected from group_elect test fix compile error * fix overindexing * fix review comments update group_elect_test to have better coverage * format * fix review comments * fix perf regression * undo group_elect in block_histogram_atomic.hpp, because of perf impact * fix bad func name in CHANGELOG.md * fix merge errors * Fix reduce_by_key algorithm so keys[0] is not flagged as a new run when is nan * make device_radix_sort compatible with compiler provided __int128_t and __uint128_t * add ifdefs to only compile int128 parts on clang/gcc * update changelog * fix for int128 
to_string labdas * add test for block_radix_sort int128 support * Implement block run length decode * Fix reduce_by_key algorithm so out of bounds items are not flagged as new runs for NaNs * Add reduce_by_key test to check that flagging is correct when keys are all different * Fix performance regression observed during tuning for gfx1030 and gfx1102 * Block Runlength Decode: Fix incorrect offsets and improve test * Remove duplicate key from .clang-format * Remove additional duplicates from clang-format * Fix binary_search upper/lower_bound config tuning Use specialized configurations for upper, lower, and binary search algorithms when preforming tuning * unify language around config params in documentation * Make the autotune build job run nightly * remove radix_sort_onesweep autotuning workaround * Resolve doxygen warnings for upstream PR * Enable get_device_from_stream for Windows * Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py * Bump unreleased ROCm version --------- Co-authored-by: Ivan Siutsou <ivan@streamhpc.com> Co-authored-by: Bence Parajdi <bence@streamhpc.com> Co-authored-by: Bálint Soproni <balint@streamhpc.com> Co-authored-by: Gergely Meszaros <gergely@streamhpc.com> Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com> Co-authored-by: Mátyás Aradi <matyas@streamhpc.com> * StreamHPC 2023-11-17 (batch memcpy) (#485) * Implemented batch memcpy algorithm and relevant tests and benchmarks * Optimize match_any by using arithmetic shifts The compiler seems to see through these much better than the conditional, generating bit-field extract instructions, and recognizing that the loop is a reduction. 
* Pedantic / consistency changes for batch memcpy * Improve interface and implementation of align_(up|down) - Use the alignment of the destination type instead of its size - Rename to emphasize that this does a form of reinterpret_cast - Use the same type as the return type and template parameter, to match the interface of built-in casts - Pedantic: use uintptr_t instead of size_t for the numerical value of a pointer - Use clangs __builtin_align_(up|down) when available * Take parameters as explicit const-ref in test_utils::bit_equal Because these are templates this already works for non-copyable types, (as `T` will be deduced to `Type&`) but its confusing, and wouldn't work for r-values. Because we are comparing object representations taking a copy isn't okay as that only guarantees that the value representation is copied. (I.e. padding bytes are not required to be copied when taking a parameter by copy) * Actually make custom_non(copyable|moveable)_type non (copy|move)-able * Allow passing rocprim::default_config to batch_memcpy As all the other device functions do too. * Fix typo in cast_align_down documentation * Fixup accidentally deleted constructor of custom_non_moveable_type This was accidentally deleted, it was meant to be defaulted. Currently no test calls this as batch-memcpy tests only use this type at the device side. * Improve error message of test_rocprim_package The error message of the package test wasn't very nice, improve it for easier debugging in the future. Before: ```console ❯ ./a.out 98 ``` After: ```console ❯ ./a.out Error hipErrorInvalidDeviceFunction(98): invalid device function in main at test_rocprim_package.cpp:90 ``` * Refactor test_utils::get_random_data into generate_random_data_n - Writes the output into an output iterator instead of creating & returning a vector. This allows greater flexibility for users i.e. writing random values with differing options into the same container. - Accepts a generator instead of a seed. 
This is more efficient, because creating an instance of an rng engine might be costly. It's also more consistent with how the standard library operates. - The naming and interface try to mirror the stl (e.g. `std::generate_n`) - Backwards compatibility is maintained by adding test_utils::get_random_data that uses `generate_random_data_n` internally. * Refactor get_random_data into generate_random_data_n in benchmark_utils. This mirrors the test changes in the previous commit * Unify segmented generation from test generate_random_data_n overloads * Add missing include for iterator traits to benchmark_utils * ci: use build instead of rocm-build tag. This allows the build job to be performed by any runner configured for building, instead of the ROCm-specialized builder. As the target architectures are specified ahead of time, the GPU is not needed during the build process, so the build may be performed by any builder. * fix: Fixed doxygen warning in device_memcpy_config.hpp * Speed up / Improve data-generation in test_device_batch_memcpy. Do bulk data-generation instead of individual calls, especially of individual bytes for the data to copy. Also changes the verification to do bulk memcmp instead of item-wise test_utils::bit_equals for each buffer. Overall this reduces the time it takes to run the test to ~1s from around 3s. * Refactor & Speedup benchmark_device_batch_memcpy - Share the data generation between the naive and uut benchmarks - Make the data-generation be bulk using a fast random number engine (mt19937) to significantly speed it up. The overall runtime of the benchmark decreased from 14 minutes (!) to around 2 minutes.
* Fix explanation comment in batch_memcpy test/benchmark * fix include order in benchmark_device_batch_memcpy * doc: add batch memcpy to changelog --------- Co-authored-by: Gergely Meszaros <gergely@streamhpc.com> Co-authored-by: Robin Voetter <robin@streamhpc.com> * Add unit testing to verify that algorithms work with hipGraphs (#478) * Basic hipGraph tests * Add basic tests for graph creation, instantiation, and execution using: * stream capture * manual construction * hipGraph test for device_reduce algorithms * Added new unit tests for device_reduce, device_reduce_by_key algorithms to verify basic support for hipGraphs (no synchronous API functions are called within the algorithms). * Fixed up CMakeLists compile issue for tests in the test/hipgraph folder * Updated code documentation * Add hipGraph unit tests for device level algorithms * Added unit tests that run the following algorithms inside of a graph (in isolation): - device_adjacent_difference - device_binary_search - device_histogram - device_merge - device_merge_sort - device_partition - device_radix_sort - device_scan - device_segmented_reduce - device_segmented_scan - device_select - device_transform * Updated existing tests for: - device_reduce - device_reduce_by_key * Moved graph test helper functions to a separate file * Add hipGraph unit tests * Added remaining device level hipGraph unit tests * Note: currently, there are two device level algorithms that do no work with hipGraphs because they contain synchronization barriers. No hipGraph unit tests have been added for these algorithms: * device_run_length_encode * device_segmented_radix_sort * Added a functional integration test for hipGraphs, which runs several algorithms back-to-back within a graph. * Refactored test helper code to remove unnecessary parameter * Set hipgraph test pointers to nullptr * Set key_type device pointers to nullptr when they are declared, for safety. 
* Several minor fixes for hipGraph tests * Fixed up spelling error in comments * Moved call to hipGetLastError to a more appropriate position * Removed old commented test code * Minor fixes for hipgraph unit tests * Moved several synchronization barriers so they are now outside of graph capture blocks in the test_device_partition source * Changed several loop counters to unsigned type * Updated hipgraph cmake files - removed test/hipgraph directory's CMakeLists.txt * Additional test and bugfix for hipgraph tests * Removed synchronization barrier in test_device_scan * Added basic test to exercise atomic function within a hipgraph * Rebased and resolved merge conflicts * readme and changelog updates (#486) * Skip device_adjacent_difference hipGraph test on Windows for Navi3x (#490) * Currently, the LargeIndices hipGraphs test for gfx1030 on Windows is skipped * This change causes this test case to also get skipped on gfx1100, gfx1101, gfx1102 on Windows * The reason this test fails on Navi on Windows appears to be related to the check_output class (used by OutputIterator in the test). * this may be related to using atomics inside of graphs, but further investigation is needed * Bump cryptography from 41.0.4 to 41.0.6 in /docs/.sphinx (#488) Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](pyca/cryptography@41.0.4...41.0.6) --- updated-dependencies: - dependency-name: cryptography dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump rocm-docs-core from 0.27.0 to 0.30.0 in /docs/.sphinx (#489) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.30.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.27.0...v0.30.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Lookback state fixes (#491) * Do not call fence in the wait loop * Use __hip_atomic_load/store instead of atomicExch/atomicAdd atomicExch is compiled to global_atomic_swap even when the results is not used. * Use faster fences in lookback algorithms on gfx94* This version is specific for devices with slow __threadfence ("agent" fence which does L2 cache flushing and invalidation). Fences with "workgroup" scope are used instead to ensure ordering only but not coherence, they do not flush and invalidate cache. Global coherence of prefixes_*_values is ensured by atomic_load/atomic_store that bypass cache. * Rename ROCPRIM_DETAIL_LOOKBACK_SCAN_STATE_WITHOUT_SLOW_FENCES from ROCPRIM_LOOKBACK_WITHOUT_SLOW_FENCES. This is more verbose to communicates that it is implementation detail It uses 0 and 1 instead of the presence of the macro now, and won't be overriden if set by a developer on the command line. * Add WITHOUT_SLOW_FENCES version to lookback_scan_state::get_complete_value * refactor: lookback_scan_state WITHOUT_SLOW_FENCES misc changes - use sizeof(variable) - use auto* and const auto* instead of just auto - use void* instead of char* to avoid yet another cast - make the atomic order fence a separate function and add docs & warning * fix: Restore removed interfaces of lookback_scan_state Even though these are in the detail namespace and as such explicitly not meant for usage by users, some projects did start depending on them. 
The interfaces for these are slightly broken and rocPRIM developers discourage any users from using them (or the newer interfaces for that matter) because they are implementation details. No further guarantees are provided for these APIs. In the future a public interface is planned for lookback_scan_state as we have recognized that this is a useful primitive, and it's unreasonable to expect users to implement for themselves. * refactor: rename __builtin_amdgcn_fence as atomic_fence_acquire_order_only --------- Co-authored-by: Anton Gorenko <anton@streamhpc.com> * Bump rocm-docs-core from 0.30.0 to 0.30.3 in /docs/.sphinx (#496) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.0 to 0.30.3. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.0...v0.30.3) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * 6.0 final mergeback to develop (#498) * Fix cpp-check reported issues Fixed a number of issues that static the analysis tool picked up: - Made some functions const since they don't modify member state - Made some parameters const, since they're never modified - Made some functions static (for performance), since they don't require access to the class instance - Fixes for several benchmark/test functions - Removed unused variable declarations - Added missing input data transfer from host to device - Added default return value for one overlooked execution path - Added some member variables to constructor initializer list - Added override keyword in several places - Fixed up item placeholders in some printf statements * Separate gfx942 specific code (#468) Co-authored-by: Stanley Tsang <stanley.tsang@amd.com> * Fix cpp-check reported issues * Removed host to data transfer from memcpy benchmark. Since this benchmark only tests memcpy performance between device buffers, we don't really need to copy data into these from the host. * Remove Unnecessary Newline & Re-trigger Performance Checks * Update comment for extra clarification * Updated comment in memcpy benchmark to make the purpose of the code a little clearer. * Update googlebenchmark version (#477) * 6.0 cherry pick for changelog and version update (#483) * Fix changelog for 6.0 * Fix version * Fix up changelog --------- Co-authored-by: Wayne Franz <wayfranz@amd.com> Co-authored-by: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com> Co-authored-by: Lauren Wrubleski <Lauren.Wrubleski@amd.com> * Add CODEOWNERS file (#504) * Standardize documentation for ReadtheDocs (#497) * Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#506) Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3. 
- [Release notes](https://github.com/pallets/jinja/releases) - [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst) - [Commits](pallets/jinja@3.1.2...3.1.3) --- updated-dependencies: - dependency-name: jinja2 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx (#508) Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.37 to 3.1.41. - [Release notes](https://github.com/gitpython-developers/GitPython/releases) - [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES) - [Commits](gitpython-developers/GitPython@3.1.37...3.1.41) --- updated-dependencies: - dependency-name: gitpython dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#507) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.3...v0.31.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Update links in README.md - Update links to other ROCm repositories. 
* Update package version --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: Nara <nara@streamhpc.com> Co-authored-by: Ivan Siutsou <ivan@streamhpc.com> Co-authored-by: Bence Parajdi <bence@streamhpc.com> Co-authored-by: Bálint Soproni <balint@streamhpc.com> Co-authored-by: Gergely Meszaros <gergely@streamhpc.com> Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com> Co-authored-by: Mátyás Aradi <matyas@streamhpc.com> Co-authored-by: Robin Voetter <robin@streamhpc.com> Co-authored-by: Wayne Franz <wayfranz@amd.com> Co-authored-by: Lisa <lisajdelaney@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Anton Gorenko <anton@streamhpc.com> Co-authored-by: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com> Co-authored-by: Lauren Wrubleski <Lauren.Wrubleski@amd.com> Co-authored-by: Sam Wu <sam.wu2@amd.com> Co-authored-by: David Galiffi <dgaliffi@amd.com>
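
The `generate_random_data_n` refactor described in the batch memcpy notes above (write into an output iterator, accept a generator instead of a seed, keep `get_random_data` as a wrapper) can be sketched roughly as follows. This is a minimal illustration in plain C++, not the rocPRIM `test_utils` code: the parameter order, the `min`/`max` arguments and the restriction to integer types are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <random>
#include <type_traits>
#include <vector>

// Hypothetical sketch (signature assumed, not taken from rocPRIM).
// Writes `count` uniformly distributed integers in [min, max] to `out`,
// drawing randomness from a caller-supplied engine, like std::generate_n.
template<class OutputIt, class T, class Generator>
OutputIt generate_random_data_n(OutputIt out, std::size_t count, T min, T max, Generator& gen)
{
    // std::uniform_int_distribution is only defined for (non-char) integer types.
    static_assert(std::is_integral<T>::value, "this sketch only handles integer types");
    std::uniform_int_distribution<T> dist(min, max);
    for(std::size_t i = 0; i < count; ++i)
    {
        *out++ = dist(gen);
    }
    return out;
}

// Backwards-compatible wrapper: takes a seed and returns a vector,
// implemented on top of generate_random_data_n.
template<class T>
std::vector<T> get_random_data(std::size_t count, T min, T max, unsigned int seed)
{
    std::mt19937   gen(seed);
    std::vector<T> data(count);
    generate_random_data_n(data.begin(), count, min, max, gen);
    return data;
}

// Usage: two differently parameterized runs share one engine and one container.
inline std::vector<int> example_mixed_ranges()
{
    std::mt19937     gen(1234);
    std::vector<int> values;
    generate_random_data_n(std::back_inserter(values), 4, 0, 9, gen);
    generate_random_data_n(std::back_inserter(values), 4, 100, 109, gen);
    assert(values.size() == 8);
    return values;
}
```

Passing the engine by reference means it is constructed once and its state advances across calls, which is the efficiency argument made in the commit message.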

* 6.1 bulk update from develop branch 2024-1-16 (#510) * StreamHPC 2023-10-18 (#480) * Excessive shared memory usage in block_shuffle fix * remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl * fixed compile errors * Updated ChangeLog.md * remove unnecessary code * fixed CHANGELOG.md to not be so verbose about non public api changes * Add dynamic dispatch and autotuning to device_adjacent_difference * Fix device_adjacent_difference storage type * ci: remove autotune dependency from build:benchmark. The workaround needed to make this work has major disadvantages, and our current workflow does not make use of this dependency anyway (currently the generated configs are checked into the repository, so the CI would run the benchmarks on them on the next push to the merge-request). When we improve automation around autotuning this could be implemented with conditional jobs, but let's just drop the dependency for now. * test: fix indexing error in test_type_helper<custom_16aligned>::get_random_data. Indexing was 4 based when the type has 3 variables, therefore it was overflowing. Caught with address sanitizer. * fixes for compilation in debug for radix_sort - Add force inline to onesweep kernel, to avoid "too much shared memory" errors - Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors * fix: Detect DPP & DPP broadcast support with __GFX<GENERATION>__ macros. The amdgpu target in clang now provides the GFX generation as a predefined macro, so we no longer need to explicitly list all targets, which was bad for maintenance. Also replace the use of the generic `ROCPRIM_NAVI`, which signals navi support, with `ROCPRIM_DETAIL_HAS_DPP_BROADCAST`, a macro that explicitly states what we're after. Also make sure that `ROCPRIM_DETAIL_USE_DPP` is always defined (to 0 when DPP is disabled); previously it was undefined when `ROCPRIM_DISABLE_DPP` was set.
* refactor: Use __GFX<GENERATION>__ to detect NAVI cards * docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes * remove deprecated structs and functions * rename scan_by_key_config_v2 to scan_by_key_config remove the option to use custom implemented config for scan_by_key update tests to not use custom implemented config for scan_by_key * remove the option to use custom implemented config for histogram update tests to not use custom implemented config for histogram * update config compile time check to a different pattern * update documentation comments for configs * change documentation comments * change documentation comments on device_radix_sort rename radix_sort_config_v2 to radix_sort_config * change documentation comment add static_assert to check type for reduce_config * update documentation comments remove wrap_scan_config function add static_assert to disallow custom scan_config type rename scan_config_v2 to scan_config * update documentation comments * update documentation comments make transform_config inherit from detail::transfomr_config_params remove wrap_transform_config add static assert to test for Config type in device_transform * remove wrap_adjacent_difference_config function add static_assert to test config type create default ctor for adjacent_difference_config * add missing transform_config ctor rewrite adjacent_difference_config ctor to match other config structs * fix binary search still using wrap_transform_config * implement static_asset to make binary_search only use binary search configs, but also work with the underlying transform * update changelog * remove some *_v2s that went under the radar * remove unnecessary default values * Add binary search, lower_bound and upper_bound documentation * host_warp_size() is replaced with two different versions with parameters. 
the new versions use either a device id or a stream to figure out the warp size of the device * comment out unused param names * fix typos in the documentation * move host_warp_size to config_type.hpp changed host_warp_size signatures to fit other similar functions * add error checks to host_warp_size calls in tests and benchmarks * fix format * add missing comment * fix error handling in lookback_scan_state.hpp * fix compilation error * change block_radix_rank_match and block_histogram_atomic to use rocprim::match_any instead of implementing same functionality * change radix_digit_count_helper to use rocprim::match_any instead of implementing same functionality added predicate param to rocprim::match_any to set invalid lanes and added tests for this functionality * add elect function to warp intrinsics add test for elect change block_histogram_atomic, block_radix_rank_match, device_histogram, device_radix_sort to use elect instead of copy-paste code * update match_any to return 0 when predicate is false * fix the bit check in elect function * update changelog.md * fix hard coded warps per block value to come from param in kernel * remove unused variables * fix review comments minor name changes update test update comments * update group_elect test tests multiple groups per warp doesn't check which exact thread is elected in a group, only that one is elected * remove unnecessary comments * remove expected from group_elect test fix compile error * fix overindexing * fix review comments update group_elect_test to have better coverage * format * fix review comments * fix perf regression * undo group_elect in block_histogram_atomic.hpp, because of perf impact * fix bad func name in CHANGELOG.md * fix merge errors * Fix reduce_by_key algorithm so keys[0] is not flagged as a new run when is nan * make device_radix_sort compatible with compiler provided __int128_t and __uint128_t * add ifdefs to only compile int128 parts on clang/gcc * update changelog * fix for int128 
to_string lambdas * add test for block_radix_sort int128 support * Implement block run length decode * Fix reduce_by_key algorithm so out of bounds items are not flagged as new runs for NaNs * Add reduce_by_key test to check that flagging is correct when keys are all different * Fix performance regression observed during tuning for gfx1030 and gfx1102 * Block Runlength Decode: Fix incorrect offsets and improve test * Remove duplicate key from .clang-format * Remove additional duplicates from clang-format * Fix binary_search upper/lower_bound config tuning: use specialized configurations for upper, lower, and binary search algorithms when performing tuning * unify language around config params in documentation * Make the autotune build job run nightly * remove radix_sort_onesweep autotuning workaround * Resolve doxygen warnings for upstream PR * Enable get_device_from_stream for Windows * Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py * Bump unreleased ROCm version --------- Co-authored-by: Ivan Siutsou <ivan@streamhpc.com> Co-authored-by: Bence Parajdi <bence@streamhpc.com> Co-authored-by: Bálint Soproni <balint@streamhpc.com> Co-authored-by: Gergely Meszaros <gergely@streamhpc.com> Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com> Co-authored-by: Mátyás Aradi <matyas@streamhpc.com> * StreamHPC 2023-11-17 (batch memcpy) (#485) * Implemented batch memcpy algorithm and relevant tests and benchmarks * Optimize match_any by using arithmetic shifts. The compiler seems to see through these much better than the conditional, generating bit-field extract instructions, and recognizing that the loop is a reduction.
* Pedantic / consistency changes for batch memcpy * Improve interface and implementation of align_(up|down) - Use the alignment of the destination type instead of its size - Rename to emphasize that this does a form of reinterpret_cast - Use the same type as the return type and template parameter, to match the interface of built-in casts - Pedantic: use uintptr_t instead of size_t for the numerical value of a pointer - Use clangs __builtin_align_(up|down) when available * Take parameters as explicit const-ref in test_utils::bit_equal Because these are templates this already works for non-copyable types, (as `T` will be deduced to `Type&`) but its confusing, and wouldn't work for r-values. Because we are comparing object representations taking a copy isn't okay as that only guarantees that the value representation is copied. (I.e. padding bytes are not required to be copied when taking a parameter by copy) * Actually make custom_non(copyable|moveable)_type non (copy|move)-able * Allow passing rocprim::default_config to batch_memcpy As all the other device functions do too. * Fix typo in cast_align_down documentation * Fixup accidentally deleted constructor of custom_non_moveable_type This was accidentally deleted, it was meant to be defaulted. Currently no test calls this as batch-memcpy tests only use this type at the device side. * Improve error message of test_rocprim_package The error message of the package test wasn't very nice, improve it for easier debugging in the future. Before: ```console ❯ ./a.out 98 ``` After: ```console ❯ ./a.out Error hipErrorInvalidDeviceFunction(98): invalid device function in main at test_rocprim_package.cpp:90 ``` * Refactor test_utils::get_random_data into generate_random_data_n - Writes the output into an output iterator instead of creating & returning a vector. This allows greater flexibility for users i.e. writing random values with differing options into the same container. - Accepts a generator instead of a seed. 
This is more efficient, because creating an instance of an rng engine might be costly. It's also more consistent with how the standard library operates. - The naming and interface try to mirror the stl (e.g. `std::generate_n`) - Backwards compatibility is maintained by adding test_utils::get_random_data that uses `generate_random_data_n` internally. * Refactor get_random_data into generate_random_data_n in benchmark_utils. This mirrors the test changes in the previous commit * Unify segmented generation from test generate_random_data_n overloads * Add missing include for iterator traits to benchmark_utils * ci: use build instead of rocm-build tag. This allows the build job to be performed by any runner configured for building, instead of the ROCm-specialized builder. As the target architectures are specified ahead of time, the GPU is not needed during the build process, so the build may be performed by any builder. * fix: Fixed doxygen warning in device_memcpy_config.hpp * Speed up / Improve data-generation in test_device_batch_memcpy. Do bulk data-generation instead of individual calls, especially of individual bytes for the data to copy. Also changes the verification to do bulk memcmp instead of item-wise test_utils::bit_equals for each buffer. Overall this reduces the time it takes to run the test to ~1s from around 3s. * Refactor & Speedup benchmark_device_batch_memcpy - Share the data generation between the naive and uut benchmarks - Make the data-generation be bulk using a fast random number engine (mt19937) to significantly speed it up. The overall runtime of the benchmark decreased from 14 minutes (!) to around 2 minutes.
* Fix explanation comment in batch_memcpy test/benchmark * fix include order in benchmark_device_batch_memcpy * doc: add batch memcpy to changelog --------- Co-authored-by: Gergely Meszaros <gergely@streamhpc.com> Co-authored-by: Robin Voetter <robin@streamhpc.com> * Add unit testing to verify that algorithms work with hipGraphs (#478) * Basic hipGraph tests * Add basic tests for graph creation, instantiation, and execution using: * stream capture * manual construction * hipGraph test for device_reduce algorithms * Added new unit tests for device_reduce, device_reduce_by_key algorithms to verify basic support for hipGraphs (no synchronous API functions are called within the algorithms). * Fixed up CMakeLists compile issue for tests in the test/hipgraph folder * Updated code documentation * Add hipGraph unit tests for device level algorithms * Added unit tests that run the following algorithms inside of a graph (in isolation): - device_adjacent_difference - device_binary_search - device_histogram - device_merge - device_merge_sort - device_partition - device_radix_sort - device_scan - device_segmented_reduce - device_segmented_scan - device_select - device_transform * Updated existing tests for: - device_reduce - device_reduce_by_key * Moved graph test helper functions to a separate file * Add hipGraph unit tests * Added remaining device level hipGraph unit tests * Note: currently, there are two device level algorithms that do no work with hipGraphs because they contain synchronization barriers. No hipGraph unit tests have been added for these algorithms: * device_run_length_encode * device_segmented_radix_sort * Added a functional integration test for hipGraphs, which runs several algorithms back-to-back within a graph. * Refactored test helper code to remove unnecessary parameter * Set hipgraph test pointers to nullptr * Set key_type device pointers to nullptr when they are declared, for safety. 
* Several minor fixes for hipGraph tests * Fixed up spelling error in comments * Moved call to hipGetLastError to a more appropriate position * Removed old commented test code * Minor fixes for hipgraph unit tests * Moved several synchronization barriers so they are now outside of graph capture blocks in the test_device_partition source * Changed several loop counters to unsigned type * Updated hipgraph cmake files - removed test/hipgraph directory's CMakeLists.txt * Additional test and bugfix for hipgraph tests * Removed synchronization barrier in test_device_scan * Added basic test to exercise atomic function within a hipgraph * Rebased and resolved merge conflicts * readme and changelog updates (#486) * Skip device_adjacent_difference hipGraph test on Windows for Navi3x (#490) * Currently, the LargeIndices hipGraphs test for gfx1030 on Windows is skipped * This change causes this test case to also get skipped on gfx1100, gfx1101, gfx1102 on Windows * The reason this test fails on Navi on Windows appears to be related to the check_output class (used by OutputIterator in the test). * this may be related to using atomics inside of graphs, but further investigation is needed * Bump cryptography from 41.0.4 to 41.0.6 in /docs/.sphinx (#488) Bumps [cryptography](https://github.com/pyca/cryptography) from 41.0.4 to 41.0.6. - [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst) - [Commits](pyca/cryptography@41.0.4...41.0.6) --- updated-dependencies: - dependency-name: cryptography dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump rocm-docs-core from 0.27.0 to 0.30.0 in /docs/.sphinx (#489) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.27.0 to 0.30.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.27.0...v0.30.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Lookback state fixes (#491) * Do not call fence in the wait loop * Use __hip_atomic_load/store instead of atomicExch/atomicAdd atomicExch is compiled to global_atomic_swap even when the results is not used. * Use faster fences in lookback algorithms on gfx94* This version is specific for devices with slow __threadfence ("agent" fence which does L2 cache flushing and invalidation). Fences with "workgroup" scope are used instead to ensure ordering only but not coherence, they do not flush and invalidate cache. Global coherence of prefixes_*_values is ensured by atomic_load/atomic_store that bypass cache. * Rename ROCPRIM_DETAIL_LOOKBACK_SCAN_STATE_WITHOUT_SLOW_FENCES from ROCPRIM_LOOKBACK_WITHOUT_SLOW_FENCES. This is more verbose to communicates that it is implementation detail It uses 0 and 1 instead of the presence of the macro now, and won't be overriden if set by a developer on the command line. * Add WITHOUT_SLOW_FENCES version to lookback_scan_state::get_complete_value * refactor: lookback_scan_state WITHOUT_SLOW_FENCES misc changes - use sizeof(variable) - use auto* and const auto* instead of just auto - use void* instead of char* to avoid yet another cast - make the atomic order fence a separate function and add docs & warning * fix: Restore removed interfaces of lookback_scan_state Even though these are in the detail namespace and as such explicitly not meant for usage by users, some projects did start depending on them. 
The interfaces for these are slightly broken and rocPRIM developers discourage any users from using them (or the newer interfaces for that matter) because they are implementation details. No further guarantees are provided for these APIs. In the future a public interface is planned for lookback_scan_state as we have recognized that this is a useful primitive, and it's unreasonable to expect users to implement for themselves. * refactor: rename __builtin_amdgcn_fence as atomic_fence_acquire_order_only --------- Co-authored-by: Anton Gorenko <anton@streamhpc.com> * Bump rocm-docs-core from 0.30.0 to 0.30.3 in /docs/.sphinx (#496) Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.0 to 0.30.3. - [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases) - [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](ROCm/rocm-docs-core@v0.30.0...v0.30.3) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-type: direct:production update-type: version-update:semver-patch ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * 6.0 final mergeback to develop (#498) * Fix cpp-check reported issues Fixed a number of issues that static the analysis tool picked up: - Made some functions const since they don't modify member state - Made some parameters const, since they're never modified - Made some functions static (for performance), since they don't require access to the class instance - Fixes for several benchmark/test functions - Removed unused variable declarations - Added missing input data transfer from host to device - Added default return value for one overlooked execution path - Added some member variables to constructor initializer list - Added override keyword in several places - Fixed up item placeholders in some printf statements * Separate gfx942 specific code (#468) Co-authored-by: Stanley Tsang <stanley.tsang@amd.com> * Fix cpp-check reported issues * Removed host to data transfer from memcpy benchmark. Since this benchmark only tests memcpy performance between device buffers, we don't really need to copy data into these from the host. * Remove Unnecessary Newline & Re-trigger Performance Checks * Update comment for extra clarification * Updated comment in memcpy benchmark to make the purpose of the code a little clearer. * Update googlebenchmark version (#477) * 6.0 cherry pick for changelog and version update (#483) * Fix changelog for 6.0 * Fix version * Fix up changelog --------- Co-authored-by: Wayne Franz <wayfranz@amd.com> Co-authored-by: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com> Co-authored-by: Lauren Wrubleski <Lauren.Wrubleski@amd.com> * Add CODEOWNERS file (#504) * Standardize documentation for ReadtheDocs (#497) * Bump jinja2 from 3.1.2 to 3.1.3 in /docs/sphinx (#506) Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.2 to 3.1.3. 
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](pallets/jinja@3.1.2...3.1.3)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump gitpython from 3.1.37 to 3.1.41 in /docs/sphinx (#508)

Bumps [gitpython](https://github.com/gitpython-developers/GitPython) from 3.1.37 to 3.1.41.
- [Release notes](https://github.com/gitpython-developers/GitPython/releases)
- [Changelog](https://github.com/gitpython-developers/GitPython/blob/main/CHANGES)
- [Commits](gitpython-developers/GitPython@3.1.37...3.1.41)

---
updated-dependencies:
- dependency-name: gitpython
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump rocm-docs-core from 0.30.3 to 0.31.0 in /docs/sphinx (#507)

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.30.3 to 0.31.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](ROCm/rocm-docs-core@v0.30.3...v0.31.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Update links in README.md
  - Update links to other ROCm repositories.
* Update package version

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Nara <nara@streamhpc.com>
Co-authored-by: Ivan Siutsou <ivan@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: Bálint Soproni <balint@streamhpc.com>
Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Mátyás Aradi <matyas@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Wayne Franz <wayfranz@amd.com>
Co-authored-by: Lisa <lisajdelaney@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Co-authored-by: Lauren Wrubleski <Lauren.Wrubleski@amd.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: David Galiffi <dgaliffi@amd.com>

* Temporarily skip hipGraph version of LargeIndices test

Currently there appear to be some problems calling the atomicAdd function within hipGraphs on some architectures. The atomicAdd function is called within the device_adjacent_difference algorithm's LargeIndices test (check_output class) to increment a counter value. As a result, the hipGraph version of the test fails when called on affected architectures.

This change temporarily skips the hipGraph version of the test. We can re-enable it when the root cause has been addressed.
---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Nara <nara@streamhpc.com>
Co-authored-by: Ivan Siutsou <ivan@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: Bálint Soproni <balint@streamhpc.com>
Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Mátyás Aradi <matyas@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Wayne Franz <wayfranz@amd.com>
Co-authored-by: Lisa <lisajdelaney@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Eiden Yoshida <47196116+eidenyoshida@users.noreply.github.com>
Co-authored-by: Lauren Wrubleski <Lauren.Wrubleski@amd.com>
Co-authored-by: Sam Wu <sam.wu2@amd.com>
Co-authored-by: David Galiffi <dgaliffi@amd.com>

Melirius and others added 30 commits October 18, 2023 11:26

Excessive shared memory usage in block_shuffle fix

98994c5

Merge branch '632-excessive-shared-memory-usage-in-block_shuffle-algo' into 'develop_stream'

0fbf43e

Excessive shared memory usage in block_shuffle fix Closes ROCm#632 See merge request amd/libraries/rocPRIM!530

remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl

6c3d3c0

fixed compile errors

ab274c6

Updated ChangeLog.md

2329a5f

remove unnecessary code

69fbd4a

fixed CHANGELOG.md to not be so verbose about non public api changes

f68110f

Merge branch '573-merge-sort-block-sort-remove-specializations-for-block_sort_algorithm-bitonic-and' into 'develop_stream'

6343c66

remove block_sort_algorithm template param from block_sort_kernel_impl and block_sort_impl Closes ROCm#573 See merge request amd/libraries/rocPRIM!535

Add dynamic dispatch and autotuning to device_adjacent_difference

22e731b

Fix device_adjacent_difference storage type

18ede8e

Merge branch '607-tuning-adjdiff' into 'develop_stream'

2fa35ae

Resolve "Config tuning and dynamic dispatch for device adjacent difference" Closes ROCm#607 See merge request amd/libraries/rocPRIM!529

test: fix indexing error test_type_helper<custom_16aligned>::get_random_data

8783207

Indexing was 4-based although the type has 3 variables, so it was overflowing. Caught with address sanitizer.
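One plausible reading of this fix is that elements were filled from a flat buffer using a stride of 4 (the padded size of the type in 32-bit words) instead of 3 (the number of members). The sketch below illustrates that reading only; `custom_16aligned_sketch`, `last_index_read`, and `from_flat` are hypothetical names, the member layout is an assumption, and this is not the rocPRIM test code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the test type: three 4-byte members, padded to 16 bytes.
struct alignas(16) custom_16aligned_sketch
{
    int          i;
    unsigned int u;
    float        f;
};

// Number of members per element. Note that this is 3, not sizeof(T) / sizeof(int) == 4.
constexpr std::size_t members_per_element = 3;

// Largest flat index touched when reading `count` elements with the given stride.
constexpr std::size_t last_index_read(std::size_t count, std::size_t stride)
{
    return (count - 1) * stride + (members_per_element - 1);
}

// Builds elements from a flat buffer that holds members_per_element values per element.
inline std::vector<custom_16aligned_sketch> from_flat(const std::vector<int>& flat)
{
    const std::size_t count = flat.size() / members_per_element;
    std::vector<custom_16aligned_sketch> out(count);
    for(std::size_t n = 0; n < count; ++n)
    {
        const std::size_t base = n * members_per_element; // stride 3; a stride of 4 would overrun
        out[n].i = flat[base];
        out[n].u = static_cast<unsigned int>(flat[base + 1]);
        out[n].f = static_cast<float>(flat[base + 2]);
    }
    return out;
}
```

With 4 elements the flat buffer holds 12 values (valid indices 0 to 11). A stride of 3 stops at index 11, while a stride of 4 would reach index 14, which is the kind of out-of-bounds read an address sanitizer reports.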

Merge branch 'fixup_autotune' into 'develop_stream'

ac746ed

ci: remove autotune dependency from build:benchmark See merge request amd/libraries/rocPRIM!540

fixes for compilation in debug for radix_sort

343519e

- Add force inline to the onesweep kernel to avoid "too much shared memory" errors
- Declare `block_radix_sort::radix_bits_per_pass` to fix linker errors
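The linker error in the second item is consistent with a well-known C++ pitfall: before C++17, a `static constexpr` data member that is ODR-used (for example, bound to a reference) needs a definition outside the class, and unoptimized debug builds are where the missing symbol tends to surface. The sketch below shows that pattern under this assumption; `block_radix_sort_sketch` and `bits_this_pass` are hypothetical names and do not reproduce the rocPRIM class. It assumes C++14 or later because it relies on `std::min` being `constexpr`.

```cpp
#include <algorithm>

template<unsigned int RadixBits>
struct block_radix_sort_sketch
{
    // In-class declaration with initializer. Before C++17 this is not a definition.
    static constexpr unsigned int radix_bits_per_pass = RadixBits;

    // std::min takes its arguments by const reference, which ODR-uses the member.
    static constexpr unsigned int bits_this_pass(unsigned int remaining_bits)
    {
        return std::min(remaining_bits, radix_bits_per_pass);
    }
};

// Out-of-class definition. Required before C++17 when the member is ODR-used;
// from C++17 on the member is implicitly inline and this line is redundant but still valid.
template<unsigned int RadixBits>
constexpr unsigned int block_radix_sort_sketch<RadixBits>::radix_bits_per_pass;
```

In optimized builds the reference is often folded away, which may be why such an omission goes unnoticed until a debug build.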

Merge branch 'fix_custom_16aligned_indexing' into 'develop_stream'

9cac25f

test: fix indexing error test_type_helper<custom_16aligned>::get_random_data See merge request amd/libraries/rocPRIM!541

refactor: Use __GFX<GENERATION>__ to detect NAVI cards

c46a941

Merge branch 'debug_fixes' into 'develop_stream'

027d2ef

fixes for compilation in debug for radix_sort See merge request amd/libraries/rocPRIM!538

docs: Update CHANGELOG for DPP & ROCPRIM_NAVI fixes

0a71e6e

Merge branch 'dpp_broadcast' into 'develop_stream'

730664a

Use __GFX<GENERATION>__ macros for DPP & NAVI detection Closes ROCm#637 See merge request amd/libraries/rocPRIM!543

remove deprecated structs and functions

c7fea30

rename scan_by_key_config_v2 to scan_by_key_config

fc38c7b

remove the option to use custom implemented config for scan_by_key
update tests to not use custom implemented config for scan_by_key

remove the option to use custom implemented config for histogram

668f5d3

update tests to not use custom implemented config for histogram

update config compile time check to a different pattern

c5195d9

update documentation comments for configs

ad58f6e

change documentation comments

1173bc5

change documentation comments on device_radix_sort

c28b2b2

rename radix_sort_config_v2 to radix_sort_config

change documentation comment

97f48ee

add static_assert to check type for reduce_config

update documentation comments

268a37c

remove wrap_scan_config function
add static_assert to disallow custom scan_config type
rename scan_config_v2 to scan_config
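A compile-time check of this kind can be sketched as a type trait combined with a `static_assert`. All names below (`default_config_sketch`, `scan_config_sketch`, `is_scan_config_sketch`, `resolve_config_sketch`, `items_per_block`) and the default values 256 and 4 are hypothetical and chosen for illustration; the actual rocPRIM check may be written differently.

```cpp
#include <type_traits>

// Hypothetical tag meaning "let the library pick a configuration".
struct default_config_sketch {};

// Hypothetical library-provided configuration template.
template<unsigned int BlockSize, unsigned int ItemsPerThread>
struct scan_config_sketch
{
    static constexpr unsigned int block_size       = BlockSize;
    static constexpr unsigned int items_per_thread = ItemsPerThread;
};

// Trait: true only for the default tag or an instantiation of scan_config_sketch.
template<class T>
struct is_scan_config_sketch : std::false_type {};

template<unsigned int B, unsigned int I>
struct is_scan_config_sketch<scan_config_sketch<B, I>> : std::true_type {};

template<>
struct is_scan_config_sketch<default_config_sketch> : std::true_type {};

// Maps the default tag to a concrete configuration (values are arbitrary here).
template<class Config>
struct resolve_config_sketch { using type = Config; };

template<>
struct resolve_config_sketch<default_config_sketch> { using type = scan_config_sketch<256, 4>; };

template<class Config = default_config_sketch>
constexpr unsigned int items_per_block()
{
    // Rejects user-defined look-alike types with a readable message at compile time.
    static_assert(is_scan_config_sketch<Config>::value,
                  "Config must be default_config_sketch or an instance of scan_config_sketch");
    using resolved = typename resolve_config_sketch<Config>::type;
    return resolved::block_size * resolved::items_per_thread;
}
```

Checking the type itself, rather than only the presence of members such as `block_size`, is what rules out custom configuration structs that merely look like the library's own.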

Naraenda and others added 17 commits October 18, 2023 11:28

Block Runlength Decode: Fix incorrect offsets and improve test

b028d2e

Merge branch '639-fix-regression-tuning' into 'develop_stream'

e21a158

Fix performance regression observed during tuning for gfx1030 and gfx1102 Closes ROCm#639 See merge request amd/libraries/rocPRIM!548

Remove duplicate key from .clang-format

e91e745

Merge branch 'improve-block-rld-tests' into 'develop_stream'

c39ee8c

Block Runlength Decode: Fix incorrect offsets and improve test See merge request amd/libraries/rocPRIM!553

Remove additional duplicates from clang-format

f6dda0a

Merge branch 'fix-clang-format' into 'develop_stream'

b3b8028

Remove duplicate key from .clang-format See merge request amd/libraries/rocPRIM!556

Merge branch 'fix-clang-format-2' into 'develop_stream'

a4c53f2

Remove additional duplicates from clang-format See merge request amd/libraries/rocPRIM!558

Fix binary_search upper/lower_bound config tuning

15adf4d

Use specialized configurations for upper, lower, and binary search algorithms when performing tuning

unify language around config params in documentation

04e4afb

Merge branch 'fix-binary-search-tuning' into 'develop_stream'

15e1a67

Fix binary_search upper/lower_bound config tuning See merge request amd/libraries/rocPRIM!555

Merge branch 'fix_documentation' into 'develop_stream'

3a35351

unify language around config params in documentation See merge request amd/libraries/rocPRIM!559

Make the autotune build job run nightly

7075051

remove radix_sort_onesweep autotuning workaround

56fe65d

Merge branch 'add-tuning-build-to-pipeline' into 'develop_stream'

9487932

Add config tuning build to pipeline See merge request amd/libraries/rocPRIM!557

Merge branch '626-check-skipped-device_radix_sort_onesweep-autotune-configuration-in-rocm-5-7-gfx906' into 'develop_stream'

f760858

Check skipped device_radix_sort_onesweep autotune configuration in ROCm 5.7 (gfx906) Closes ROCm#626 See merge request amd/libraries/rocPRIM!560

Resolve doxygen warnings for upstream PR

d07c302

Merge branch 'fix-doxygen-warnings' into 'develop_stream_2023_10_18'

31f12a4

Resolve doxygen warnings for upstream PR See merge request amd/libraries/rocPRIM!561

Naraenda marked this pull request as ready for review October 23, 2023 08:59

Naraenda mentioned this pull request Oct 30, 2023

StreamHPC 2023-10-30 ROCm/rocThrust#338

Merged

stanleytsang-amd requested changes Oct 30, 2023

rocprim/include/rocprim/device/config_types.hpp (review thread resolved)

CHANGELOG.md (outdated; review thread resolved)

sbalint98 and others added 3 commits November 1, 2023 11:09

Enable get_device_from_stream for Windows

8c66863

Use _ENABLE_EXTENDED_ALIGNED_STORAGE for windows build in rmake.py

8c9c64c

Bump unreleased ROCm version

466a946

Naraenda requested a review from stanleytsang-amd November 6, 2023 10:05

stanleytsang-amd approved these changes Nov 14, 2023

stanleytsang-amd merged commit f2347ab into ROCm:develop Nov 14, 2023
