Commit
* Modify adjacent_difference_kernel_impl so that if output_type is void, input_type is used instead
* fix(match_result_type.hpp): deprecate 'match_result_type.hpp' instead of removing it
* change device_batch_memcpy to have an IsMemCpy template param
  started implementing test for device_copy
* separate batch_memcpy and batch_copy into different API calls and header files
  create test for batch_copy
* add benchmark for batch_copy
  add separate config for batch_copy
* update docs
* removed unused variable
* fix review comments
* use warp_size properly instead of hardcoded number
* merge batch_copy and batch_memcpy tests
* merge batch_copy and batch_memcpy benchmarks
* fix unused parameters
* add event destroys to the benchmark
* fix(device_scan.hpp): throw compiler warning on gfx 11 (navi 3x) for incorrect results due to compiler bug
  This change should be reverted once the compiler bug is fixed.
* Resolve "tests for `block_adjacent_difference` and `block_discontinuity` don't compile with `rocprim::half`"
* Update device_adjacent_difference fix for void output type to match CUB
* Update CHANGELOG
* revert: fix(device_scan.hpp): throw compiler warning on gfx 11 (navi 3x) for incorrect results due to compiler bug
  This reverts commit 3f375e458e741d10e46c61119ea93ea1012b7946.
* fix(lookback_scan_state.hpp): fix sporadic failure in device scan algorithm on navi 3 gpus
  The compiler does not emit a 'buffer_gl0_inv' instruction that is required on gfx11. We emit it manually only on this architecture to ensure correct results.
* add tests for supported data types
* clarified block_histogram documentation
  fixed block_histogram test to work with int8_t data
* fix bfloat16 tests
* update changelog
* add comment explainers for commented test cases
* Abstracted bit_cast away
* Implemented in-place radix sort
* Added decomposer arguments and decomposer checking
* Sliced compilation of test_block_radix_sort
* Implemented sort_comparator for custom test type
* Testing custom test types
* Moved out identity_decomposer from detail namespace
* Added custom type benchmarks
* Updated changelog
* Fixed typo in changelog
* Fixed formatting
* Separately testing radix key codec
* Fixed documentation
* Fixed test utils comparator for custom_float_type
* Instantiate test block_radix_sort custom_test_type<double> only when int128 is supported
* Fixed compilation on Windows
* Fixed tests for signed integral keys
* Added extra radix_key_codec test cases
* Fixed device_adjacent_difference kernel config selection and shared memory usage
* Documentation fix after rebase
* Added the memset to the graph for DeviceAdjacentDifferenceLargeTests
* feat(predicate_iterator.hpp): added rocprim::predicate_iterator
* fix(predicate_iterator.hpp): fix msvc build error due to implicit deletion of copy assignment
* docs(predicate_iterator.hpp): fix doxygen build and improve consistency
* refactor(predicate_iterator.hpp): improve naming consistency with other rocprim iterators and algorithms
* fix(predicate_iterator.hpp): add missing out-of-class definition for operator+
* fix(predicate_iterator.hpp): fix predicate_iterator not reading value when underlying iterator dereferences to a non-reference type
* test(benchmark_predicate_iterator.cpp): add benchmark for predicate iterator
* refactor(predicate_iterator.hpp): remove unneeded << operator for std::ostream
* docs(predicate_iterator.hpp): fix spelling
* fix(predicate_iterator.hpp): drop 'const' from 'predicate_' as a class with const members cannot have an implicit copy assign
* refactor(predicate_iterator.hpp): clean up
* Replaced `warp_shuffle_xor` with `warp_swizzle` in `warp_sort_shuffle` where possible
* Replaced `warp_shuffle_xor` with `warp_swizzle` in `block_sort_bitonic` where possible
* Updated changelog
* Generalize warp_swizzle_shuffle function for both block_sort_bitonic and warp_sort_shuffle
* Fix docs warp_swizzle_shuffle
* fix(predicate_iterator.hpp): drop constness of dereference operator and derivatives
  This is required to relax the requirement of functions passed as predicate.
* fix(predicate_iterator.hpp): derive proxy capture type from dereference operator instead of relying on iterator trait
* test(test_predicate_iterator.cpp): extend predicate iterator type tests
* remove old workaround comment
* Fixed descending device_radix_sort for bool keys
* Bool tests in test_device_radix_sort
* Bool tests in test_block_radix_sort
* Replaced std::getenv for Windows with _dupenv_s to prevent MSVC deprecation warning
* Removed malloc from linux version of the __get_env
* Mark << operator as deprecated for iterators
* device_radix_sort uses identity decomposer
* Testing device_radix_sort with custom type [WIP]
* Sorted overloads and updated docs for device_radix_sort
* device_radix_sort public decomposer APIs
* Fixed device_radix_sort with custom decomposer and added tests
* Updated docs and changelog
* Fixed building device_segmented_radix_sort
* Added and tested additional device_radix_sort decomposers overloads
* Enforce begin/end bit being default for floating point radix sort
* Compile time dispatch and cleanup for radix_merge_compare
* Added custom_key benchmarks for device_radix_sort
* Fixed comparator dispatch in device_segmented_radix_sort
* Removed duplicate test_device_radix_sort case
* Iterating from MSB to LSB in decomposed radix_merge_compare
* add decomposer argument
* Added warp_exchange optimization with template recursion
* Added tests to warp_exchange that can use the optimization
* Warp exchange optimization using integer_sequence
* Changed the shuffle warp_exchange to make use of swizzle and use a temp array.
* Added extra benchmarks for warp_exchange where warpsize equals items per thread
* Added non in place tests for warp exchange
* Added documentation for optimized blocked_to_striped_shuffle and striped_to_blocked_shuffle functions of warp_exchange
* Added some comments to warp_exchange
* Use primes for tuning device adjacent difference
* Removed accidental restriction of build targets in build:benchmark job
* remove workaround for old compiler bug
* Removed hotfix for double to double to __half conversion bug
* style(test_utils_hipgraphs.hpp): add used includes
* test: remove superfluous graph calls from tests
  All graph capturing and launching of host-only rocprim calls are unneeded as they don't invoke device code.
* fix(test_device_adjacent_difference.cpp): re-enable use of identity_operator on device_adjacent_difference tests
* feat(test_device_adjacent_difference.cpp): add tests for void value_type in device_adjacent_difference
* Update algorithms descriptions with non-bit-wise reproducibility
* Added RadixBitsPerPass as parameter for the block_radix_sort
* Add static_assert for RadixBitsPerPass from block_radix_sort
* Added tests for block_radix_sort with different number of radix_bits_per_pass
* Update CHANGELOG
* Added some benchmarks with different radix bits per pass
* Remove benchmarks with RadixBitsPerPass equal to 1
* Added check to partition kernel if size is smaller than items_per_block
* Fixed benchmark_device_adjacent_difference formatting
* Remove unnecessary rocprim headers in favor of more specific includes
* Ordering includes of includes of benchmark files
* Added detail includes only for direct usage of detail functions in benchmarks
* declare shared memory at kernel level as workaround for non-optimized builds taking too long
* increase build parallelism
* add debug build and run to ci
* fix leftover instance
* fix copyright dates
* debug benchmark builds
* Deprecate TwiddleIn/TwiddleOut
* Match radix_key_codec with radix_key_codec_inplace
* Remove radix_key_codec_inplace
* Make radix_key_codec part of the public API
* Add radix_key_codec to sphinx docs
* Add radix_key_codec tests for encode/decode/extract_digit consistency
* Add static assert to ensure non-fundamental typed keys do not get an identity_decomposer
* Add ROCPRIM_PRAGMA_MESSAGE to warn about radix_sort.hpp functionality migration
* Add test cases for block_histogram
* Add test cases for block_exchange
* Fix test_block_exchange to avoid UB
* fix(thread_load.hpp): combine asm statements to fix broken behavior in debug builds
  'rocprim::thread_load' used two consecutive asm declarations (a load and a wait), which allowed the compiler to insert code between the two instructions. This bug was only observed when compiling with '-O0'. By joining the two asm declarations, the compiler can no longer insert instructions between the load and the wait. Such inserted instructions could otherwise use incomplete data when they depended on the value being loaded.

---------

Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Nara Prasetya <nara@streamhpc.com>
Co-authored-by: Lőrinc Serfőző <lorinc@streamhpc.com>
Co-authored-by: Nick Breed <nick@streamhpc.com>
Co-authored-by: Nol Moonen <nol@streamhpc.com>
Co-authored-by: Balint Soproni <balint@streamhpc.com>
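
For readers unfamiliar with the encode/decode/extract_digit consistency that the radix_key_codec items refer to, the following is a minimal host-side sketch of the general technique for 32-bit float keys. It is not rocPRIM's implementation: the function names `encode_float_key`, `decode_float_key`, and `extract_digit` are hypothetical and chosen only for illustration. The idea is that encoding maps a float to an unsigned integer whose unsigned order matches the float's numeric order, decoding inverts that mapping exactly, and digit extraction operates on the encoded bits.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not rocPRIM API): map a float to a uint32_t whose
// unsigned ordering matches the numeric ordering of the float.
inline std::uint32_t encode_float_key(float value)
{
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits)); // well-defined bit copy
    // Negative values: flip every bit. Non-negative values: flip only the sign bit.
    const std::uint32_t mask = (bits & 0x80000000u) ? 0xFFFFFFFFu : 0x80000000u;
    return bits ^ mask;
}

// Hypothetical helper: exact inverse of encode_float_key.
inline float decode_float_key(std::uint32_t key)
{
    // An encoded key with the top bit set came from a non-negative float.
    const std::uint32_t mask = (key & 0x80000000u) ? 0x80000000u : 0xFFFFFFFFu;
    const std::uint32_t bits = key ^ mask;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}

// Hypothetical helper: extract num_bits bits of an encoded key, starting at start_bit.
inline std::uint32_t extract_digit(std::uint32_t key, unsigned start_bit, unsigned num_bits)
{
    assert(start_bit < 32 && num_bits > 0 && num_bits < 32);
    return (key >> start_bit) & ((1u << num_bits) - 1u);
}
```

A consistency test in this style would check that `decode_float_key(encode_float_key(x)) == x` for representative inputs and that comparing encoded keys as unsigned integers gives the same order as comparing the original floats.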