Commit
* feat(device_transform): add tuning benchmarks and config generation for device transform
* perf(device_transform): tuned device transform algorithm for better performance
* docs(changelog.md): add 'device_transform' improvements to changelog
* feat(ConfigAutotuneSettings.cmake): allow benchmark_device_transform to tune for more block sizes
* fix(benchmark_device_transform.cpp): fix unused type warning when compiling tuning benchmarks
* perf(device_transform.hpp): updated configs for device transform which uses a wider range of block sizes
* fix(transform_config_template): added missing '::type' in general case of transform config
* refactor(benchmark_device_transform.cpp): remove duplicated code with 'benchmark_device_transform.parallel.hpp'
* docs(changelog.md): removed 'slightly' in device transform performance improvements
* fix(benchmark_device_transform): fix various build errors and warnings
* test(test_device_batch_memcpy.cpp): add simple batch copy test
  This test can be more easily modified to find issues with batch (mem) copy.
* fix(device_batch_memcpy.hpp): use dereference instead of 'rocprim::thread_load/store'
  'thread_load/store' uses inline assembly prohibiting compiler optimization. This also bypasses an issue where 'thread_load' behaves oddly on debug builds.
* revert test(test_device_batch_memcpy.cpp): add simple batch copy test
  This reverts commit 6dafd1c66684e775eae07fe4fd50632a80ca1673.
* test(benchmark_device_adjacent_difference.cpp): increased the default size of input so that in place uint8 benchmarks don't fit in L3 cache on select architectures
* docs(changelog.md): update changelog with benchmark changes
* Added overload for match_any
* Replaced section with match_any() call
* Fixed copyright date
* Fixed formatting
* change match_any to runtime dispatch
* docs(intrinsics/warp): name the correct label_bits in match_any documentation
* unified wavefront definition
* build: Remove force-inline workaround on windows
  The problem mentioned there should be resolved by now.
* ci: enable debug builds on windows
  Supposedly the slowest jobs should now be resolved, so this should work.
* docs: Add CHANGELOG for removing force-inline workaround
* fix clang format
* fix(tests): Add saturating casts and use them for random data generation
  The `static_cast`s can over / underflow, making the maximum value smaller than the minimum. This was triggering an assert on the Microsoft standard library. Technically this was undefined behaviour that went unnoticed on non-debug builds. Saturate the input value to the range of the distribution type instead to prevent this error.
* fix(benchmark_device_adjacent_difference): fix size in bytes instead of number of elements
* Update contributing guidelines
* specify benchmark seed via command line
* refactor lookback sleep dispatch
* add config to tests
* add config tuning for partition
* generic tuning
* add tuned configurations
* Fix "warning: loop not unrolled" with CMAKE_BUILD_TYPE=MinSizeRel (-Os)
  The compiler generates this warning when -Os is set:
  warning: loop not unrolled: the optimizer was unable to perform the requested transformation; the transformation might be disabled or specified as part of an unsupported transformation ordering
  Using static values as both loop bounds fixes these warnings.
  For some reason, other optimization levels do not have this issue: the compiler is able to understand that the number of iterations of the loops is a compile-time value.
* fix(device_partition): re-added workaround for the device_partition family to properly limit block size for the base configuration
* Using .lint:clang-format
* refactor(intrinsics/thread.hpp): remove 'memory_fence_device' workaround for compiler bug on gfx10 and gfx11
* ci(.gitlab-ci.yml): disable debug builds in cmake-minimum due to excessive build times when targeting debug test
* fix(docs): Fixed documentation for thread subdir
* fix(docs): Fixed documentation for the types subdir
* fix undefined behavior in test data generation
* Deprecate thread_load/thread_store
* Ignore thread_load and thread_store deprecation warnings
* Deprecated raw_storage and replaced by uninitialized_array in a few locations
* unsigned char storage in raw_storage to prevent undefined aliasing
* Added ROCPRIM_DONT_SUPPRESS_DEPRECATIONS
* Resolve "Improve rocPRIM test logs"
* improve documentation for configuration tuning
* Refactor device_scan, use is_sleep_scan_state_used and with_scan_state as in other lookback algorithms
* Use device of the current stream in is_sleep_scan_state_used
* Do not build kernels with sleep in lookback state on devices that don't need it (!=gfx908)
* Resolve "Add thread headers to rocprim.hpp and document thread-level methods"
* Resolve "Batch memcpy: disable BENCHMARK_BATCH_MEMCPY_NAIVE"
* Resolve "Fix under- and overflow in minimum and maximum for input data for benchmarks"
* Resolve "CMake build consistency"
* Resolve "Benchmark utility for random segments generates segments of wrong size"
* Adapt device segmented_reduce for large indices within a segment
* Add large indices test
* Update CHANGELOG
* reduce by key tuning
* First commit nth element
* Tests nth element
* Simplified working version nth element on one block
* Added output check for correctness
* nth element sizes larger than 64
* Added equality buckets to nth_element logic
* Added multiple blocks for nth element
* Added test to see if elements did not change
* Debugging synchronization
* Nth element working version only for key with comparator greater and less
* Nth element implemented for key with tests
* Fixed issue for custom types in nth element and added tests
* Added input and output iterators for nth element
* Added some benchmarks for nth element
* Small optimizations nth element
* Debug code nth element
* Made separate kernel for block offset calculations nth element
* Small optimizations nth element
* Moved all block offset calculation to other kernel nth element
* Optimization nth element
* Make use of radix_rank instead of multiple scans
* Start of adding multiple items per thread nth element
* Nth element using less shared memory
* Nth element small optimizations and cleanup code
* Fixed benchmark break nth element after rebase
* nth element local oracle for buckets_store
* Cleanup nth element
* Nth element update tests with random nth element
* Addition of configs for nth element
* Add lookbackstates to nth element
* Cleanup and extra comments in nth_element
* Removed unnecessary test cases and choose nth_element based on seed_value
* Added nth_element to changelog
* Updated benchmark of nth_element based on feedback
* Nth_element updated tests and config based on review
* Documentation updated for nth_element
* Cleanup code nth element
* Nth element changes based on review
* Add documentation Sphinx doc
* Changed nth element to a while loop
* Nth element asserts in device code
* Nth element documentation fixes
* nth element docs crash fix
* nth element lookback state reset
* Nth element changes based on review
* Replaced raw storage with uninitialized_array in nth element
* Changed Nth element to be able to be used with iterators
* nth element fix small mistakes
* Added config for in place nth element
* Changes based on review
* Added c++17 tests nth_element
* Make use of internal merge_path also fix bug with unsigned types for size
* Added test for public merge_path_search
* Fixed thread_load and thread_store bug with float and double
* Made review changes
* Add bug fixes to changelog
* ci: remove trailing newlines in gitlab-ci.yml
* ci: compress autotune artifacts using zstd
* Removed oracles array from nth element
* Remove constraint of 256 for number of buckets nth element
* Apply 1 suggestion(s) to 1 file(s)
  Co-authored-by: Nara Prasetya <nara@streamhpc.com>
* clang-format: trick clang-format into always breaking after c-style function attributes
* add ctz intrinsic
  Counts the number of trailing zero bits. This is just a clean wrapper around __builtin_ctz(ll).
* lookback scan: remove HIP-CPU bits
  - memcpy() (without std) works for HIP
  - We don't really care about HIP-CPU anymore
  This cleans up the source file a bit, and it doesn't seem like this affects any benchmarks.
* lookback scan: reformat
  Makes formatting consistent with clang-format file.
* lookback scan: add reproducibility test
* test: print floats as hexfloat in assert_bit_eq
* add warp_readfirstlane and warp_readlane intrinsics
* lookback scan: add deterministic implementation
* scan: add deterministic overload
* scan_by_key: add deterministic overload
* reduce_by_key: add deterministic overload
* add char and short atomic load/store overloads
  It seems that these are just supported and work fine.
* lookback scan: change flag to be always one byte
  This slightly reduces the amount of memory required for a lookback scan. Also, changing the INVALID value from -1 to 0xFF fixes some sign issues there were before by using unsigned int as flag underlying type.
* lookback scan: swap flag and prefix, allow fast scan for values up to size 7
  Since the prefix flag is always one byte now, we can put it behind the value to get a smaller struct. This helps in some cases, for example, scan_by_key over sizeof(AccType) = 2 now fits in an int instead of a long.
* nara nit f32
* update changelog with mention of deterministic algorithms
* lookback reproducibility test: allocate temporary memory with the right scan operator
* lookback scan: avoid caching large types
  These types are stored in a separate buffer, so we don't need to or load them. Slightly speeds up deterministic scan algorithms when the lookback scan type is > 7 bytes.
* remove assertions in lookback scan, they don't compile properly in debug builds
* lookback reproducibility test: use same functor for both tests
  This enables the test to work with -ffast-math too.
* lookback scan: rotate prefix rather than block_prefix
* lookback scan: also test deterministic in normal tests
* naive implementation
* partial sort benchmark
* Made partial_sort in place and created partial_sort_copy
* Add and fix documentation partial_sort
* Test partial_sort with iterator
* Add partial_sort and partial_sort_copy to the changelog
* Moved partial sort to own file
* Added partial_sort_config
* Merge with nth_element_remove_oracle branch
* Created c++17 test for partial_sort
* Cleanup code based on nth_element review
* Review adaptations
* Added benchmark for partial_sort
* Fixed bug with implicit casting in partial sort
* add static_cast to fix compiler warning
* Restored tests for device histogram_even for half/bfloat16 types
* Removed unused variable and formatting
* ci: Enable debug builds excluding test_block_adjacent_difference/discontinuity
  These tests take an extremely long time to build with clang from ROCm 6.1+.
* test(test_device_batch_memcpy.cpp): fix invalid calls being made to generate_random_data_n
* test(test_device_batch_memcpy.cpp): standardize test names
* test(test_intrinsics.cpp): fix invalid calls being made to test_utils::get_random_data
* ci(.gitlab-ci.yml): add hardened libc++ assertions when building tests with gitlab ci
* docs: update changelog
* docs: fix doxygen errors and warnings
* build(cmake/Dependencies.cmake): build rocm-cmake dependency during populate step when fetching it by source
* refactor(benchmark_config_dispatch.cpp): fix unused variable and function
* chore: bump version to 3.3.0
* Reduce items_per_thread for merge_sort to one for large types
* Reduce block_size for device_merge with large types

---------

Co-authored-by: Nara Prasetya <nara@streamhpc.com>
Co-authored-by: Jaap Blok <jaap@streamhpc.com>
Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Nol Moonen <nol@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Lőrinc Serfőző <lorinc@streamhpc.com>
Co-authored-by: Nick Breed <nick@streamhpc.com>
Co-authored-by: Arsalan Anwari <arsalan@streamhpc.com>
Co-authored-by: Ivan <ivan@streamhpc.com>
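
Note on "fix(tests): Add saturating casts": the entry describes a plain `static_cast` wrapping out-of-range bounds so that the maximum ends up below the minimum. A minimal sketch of the idea follows. The helper name `saturate_cast`, the `sketch` namespace, and the integer-only scope are assumptions made for illustration; the helper actually added to the test utilities may differ in name, signature, and supported types.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

namespace sketch
{

// Hypothetical helper (not the one from the commit): converts an integer to
// another integer type, clamping to the target range instead of wrapping.
template<class To, class From>
constexpr To saturate_cast(From value)
{
    static_assert(std::is_integral<To>::value && std::is_integral<From>::value,
                  "this sketch only handles integer types");
    using limits = std::numeric_limits<To>;

    if(std::is_signed<From>::value && value < From(0))
    {
        // Negative input: an unsigned target saturates at zero.
        if(!std::is_signed<To>::value)
        {
            return To(0);
        }
        // Both types are signed, so the comparison is safe in intmax_t.
        if(static_cast<std::intmax_t>(value) < static_cast<std::intmax_t>(limits::min()))
        {
            return limits::min();
        }
        return static_cast<To>(value);
    }

    // Non-negative input: the comparison is safe in uintmax_t.
    if(static_cast<std::uintmax_t>(value) > static_cast<std::uintmax_t>(limits::max()))
    {
        return limits::max();
    }
    return static_cast<To>(value);
}

} // namespace sketch
```

With bounds of -10 and 300 and a `std::uint8_t` distribution type, a plain `static_cast` yields 246 and 44 (maximum below minimum), whereas the saturating version yields 0 and 255.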
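
Note on "add ctz intrinsic": the entry describes a thin wrapper around `__builtin_ctz(ll)`. A host-side sketch of such a wrapper might look as follows. The `sketch` namespace and the exact overload set are assumptions; the intrinsic added by the commit may have a different signature and device qualifiers. The builtins require a GCC- or Clang-compatible compiler.

```cpp
#include <cassert>

namespace sketch
{

// Counts trailing zero bits of a 32-bit value.
// The result of the builtin is undefined when x == 0.
inline unsigned int ctz(unsigned int x)
{
    return static_cast<unsigned int>(__builtin_ctz(x));
}

// Counts trailing zero bits of a 64-bit value.
// The result of the builtin is undefined when x == 0.
inline unsigned int ctz(unsigned long long x)
{
    return static_cast<unsigned int>(__builtin_ctzll(x));
}

} // namespace sketch
```

For example, `sketch::ctz(8u)` returns 3 because 8 is `0b1000`.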
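
Note on the nth_element and partial_sort entries: the device-level interfaces are not shown in this message, so the sketch below uses the C++ standard library algorithms of the same names to illustrate the usual semantics of the two operations. That the device versions behave the same way in every detail is an assumption, and the function names `kth_smallest` and `smallest_k_sorted` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch
{

// Returns the element that would sit at index n if the input were sorted.
// Precondition: n < values.size().
inline int kth_smallest(std::vector<int> values, std::size_t n)
{
    auto nth = values.begin() + static_cast<std::ptrdiff_t>(n);
    // Afterwards *nth is in its sorted position; elements before it are
    // not greater and elements after it are not smaller, in no fixed order.
    std::nth_element(values.begin(), nth, values.end());
    return *nth;
}

// Returns the k smallest elements in ascending order.
// Precondition: k <= values.size().
inline std::vector<int> smallest_k_sorted(std::vector<int> values, std::size_t k)
{
    auto middle = values.begin() + static_cast<std::ptrdiff_t>(k);
    // Afterwards [begin, middle) holds the k smallest elements, sorted;
    // the order of the remaining elements is unspecified.
    std::partial_sort(values.begin(), middle, values.end());
    values.resize(k);
    return values;
}

} // namespace sketch
```

Both helpers take their input by value, so the caller's data is left untouched; the commit's "in place" and "copy" variants presumably differ on exactly this point.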