Develop stream 2024-06-26 (#575)
* feat(device_transform): add tuning benchmarks and config generation for device transform

* perf(device_transform): tuned device transform algorithm for better performance

* docs(changelog.md): add 'device_transform' improvements to changelog

* feat(ConfigAutotuneSettings.cmake): allow benchmark_device_transform to tune for more block sizes

* fix(benchmark_device_transform.cpp): fix unused type warning when compiling tuning benchmarks

* perf(device_transform.hpp): updated configs for device transform which uses a wider range of block sizes

* fix(transform_config_template): added missing '::type' in general case of transform config

* refactor(benchmark_device_transform.cpp): remove duplicated code with 'benchmark_device_transform.parallel.hpp'

* docs(changelog.md): removed 'slightly' in device transform performance improvements

* fix(benchmark_device_transform): fix various build errors and warnings

* test(test_device_batch_memcpy.cpp): add simple batch copy test

This test can be more easily modified to find issues with batch (mem) copy.

* fix(device_batch_memcpy.hpp): use dereference instead of 'rocprim::thread_load/store'

'thread_load/store' uses inline assembly prohibiting compiler optimization. This also bypasses an issue where 'thread_load' behaves oddly on debug builds.

* revert test(test_device_batch_memcpy.cpp): add simple batch copy test

This reverts commit 6dafd1c66684e775eae07fe4fd50632a80ca1673.

* test(benchmark_device_adjacent_difference.cpp): increased the default size of input so that in place uint8 benchmarks don't fit in L3 cache on select architectures

* docs(changelog.md): update changelog with benchmark changes

* Added overload for match_any

* Replaced section with match_any() call

* Fixed copyright date

* Fixed formatting

* change match_any to runtime dispatch

* docs(intrinsics/warp): name the correct label_bits in match_any documentation

* unified wavefront definition

* build: Remove force-inline workaround on windows

The problem mentioned there should be resolved by now.

* ci: enable debug builds on windows

Supposedly the slowest jobs should now be resolved, so this should work.

* docs: Add CHANGELOG for removing force-inline workaround

* fix clang format

* fix(tests): Add saturating casts and use them for random data generation

The `static_casts` can over / underflow making the maximum value smaller
than the minimum. This was triggering an assert on the microsoft standard
library. Technically this was undefined behaviour that went unnoticed
on non-debug builds.

Saturate the input value to the range of the distribution type instead
to prevent this error.

* fix(benchmark_device_adjacent_difference): fix size in bytes instead of number of elements

* Update contributing guidelines

* specify benchmark seed via command line

* refactor lookback sleep dispatch

* add config to tests

* add config tuning for partition

* generic tuning

* add tuned configurations

* Fix "warning: loop not unrolled" with CMAKE_BUILD_TYPE=MinSizeRel (-Os)

The compiler generates this warning when -Os is set:
warning: loop not unrolled: the optimizer was unable to perform the
requested transformation; the transformation might be disabled or
specified as part of an unsupported transformation ordering

Using static values as both loop bounds fixes these warnings. For some
reason, other optimization levels do not have this issue, the compiler
is able to understand that the number of iterations of the loops is
a compile-time value.

* fix(device_partition): re-added workaround for the device_partition family to properly limit block size for the base configuration

* Using .lint:clang-format

* refactor(intrinsics/thread.hpp): remove 'memory_fence_device' workaround for compiler bug on gfx10 and  gfx11

* ci(.gitlab-ci.yml): disable debug builds in cmake-minimum due to excessive build times when targeting debug test

* fix(docs): Fixed documentation for thread subdir

* fix(docs): Fixed documentation for the types subdir

* fix undefined behavior in test data generation

* Deprecate thread_load/thread_store

* Ignore thread_load and thread_store deprecation warnings

* Deprecated raw_storage and replaced by uninitialized_array in a few locations

* unsigned char storage in raw_storage to prevent undefined aliasing

* Added ROCPRIM_DONT_SUPPRESS_DEPRECATIONS

* Resolve "Improve rocPRIM test logs"

* improve documentation for configuration tuning

* Refactor device_scan, use is_sleep_scan_state_used and with_scan_state as in other lookback algorithms

* Use device of the current stream in is_sleep_scan_state_used

* Do not build kernels with sleep in lookback state on devices that don't need it (!=gfx908)

* Resolve "Add thread headers to rocprim.hpp and document thread-level methods"

* Resolve "Batch memcpy: disable BENCHMARK_BATCH_MEMCPY_NAIVE"

* Resolve "Fix under- and overflow in minimum and maximum for input data for benchmarks"

* Resolve "CMake build consistency"

* Resolve "Benchmark utility for random segments generates segments of wrong size"

* Adapt device segmented_reduce for large indices within a segment

* Add large indices test

* Update CHANGELOG

* reduce by key tuning

* First commit nth element

* Tests nth element

* Simplified working version nth element on one block

* Added output check for correctness

* nth element sizes larger than 64

* Added equality buckets to nth_element logic

* Added multiple blocks for nth element

* Added test to see if elements did not change

* Debugging synchronization

* Nth element working version only for key with comparator greater and less

* Nth element implemented for key with tests

* Fixed issue for custom types in nth element and added tests

* Added input and output iterators for nth element

* Added some benchmarks for nth element

* Small optimizations nth element

* Debug code nth element

* Made separate kernel for block offset calculations nth element

* Small optimizations nth element

* Moved all block offset calculation to other kernel nth element

* Optimization nth element

* Make use of radix_rank instead of multiple scans

* Start of adding multiple items per thread nth element

* Nth element using less shared memory

* Nth element small optimizations and cleanup code

* Fixed benchmark break nth element after rebase

* nth element local oracle for buckets_store

* Cleanup nth element

* Nth element update tests with random nth element

* Addition of configs for nth element

* Add lookbackstates to nth element

* Cleanup and extra comments in nth_element

* Removed unnecessary test cases and choose nth_element based on seed_value

* Added nth_element to changelog

* Updated benchmark of nth_element based on feedback

* Nth_element updated tests and config based on review

* Documentation updated for nth_element

* Cleanup code nth element

* Nth element changes based on review

* Add documentation Sphinx doc

* Changed nth element to a while loop

* Nth element asserts in device code

* Nth element documentation fixes

* nth element docs crash fix

* nth element lookback state reset

* Nth element changes based on review

* Replaced raw storage with uninitialized_array in nth element

* Changed Nth element to be able to be used with iterators

* nth element fix small mistakes

* Added config for in place nth element

* Changes based on review

* Added c++17 tests nth_element

* Make use of internal merge_path also fix bug with unsigned types for size

* Added test for public merge_path_search

* Fixed thread_load and thread_store bug with float and double

* Made review changes

* Add bug fixes to changelog

* ci: remove trailing newlines in gitlab-ci.yml

* ci: compress autotune artifacts using zstd

* Removed oracles array from nth element

* Remove constraint of 256 for number of buckets nth element

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Nara Prasetya <nara@streamhpc.com>

* clang-format: trick clang-format into always breaking after c-style function attributes

* add ctz intrinsic

Counts the number of trailing zero bits. This is just
a clean wrapper around __builtin_ctz(ll).

* lookback scan: remove HIP-CPU bits

- memcpy() (without std) works for HIP
- We don't really care about HIP-CPU anymore

This cleans up the source file a bit, and it doesn't seem like this
affects any benchmarks.

* lookback scan: reformat

Makes formatting consistent with clang-format file.

* lookback scan: add reproducibility test

* test: print floats as hexfloat in assert_bit_eq

* add warp_readfirstlane and warp_readlane intrinsics

* lookback scan: add deterministic implementation

* scan: add deterministic overload

* scan_by_key: add deterministic overload

* reduce_by_key: add deterministic overload

* add char and short atomic load/store overloads

It seems that these are just supported and work fine.

* lookback scan: change flag to be always one byte

This slightly reduces the amount of memory required for a lookback
scan. Also, changing the INVALID value from -1 to 0xFF fixes some
sign issues there were before by using unsigned int as flag
underlying type.

* lookback scan: swap flag and prefix, allow fast scan for values up to size 7

Since the prefix flag is always one byte now, we can put it behind the value
to get a smaller struct. This helps in some cases, for example, scan_by_key
over sizeof(AccType) = 2 now fits in an int instead of a long.

* nara nit f32

* update changelog with mention of deterministic algorithms

* lookback reproducibility test: allocate temporary memory with the right scan operator

* lookback scan: avoid caching large types

These types are stored in a separate buffer, so we don't
need to or load them. Slightly speeds up deterministic
scan algorithms when the lookback scan type is > 7 bytes.

* remove assertions in lookback scan, they don't compile properly in debug builds

* lookback reproducibility test: use same functor for both tests

This enables the test to work with -ffast-math too.

* lookback scan: rotate prefix rather than block_prefix

* lookback scan: also test deterministic in normal tests

* naive implementation

* partial sort benchmark

* Made partial_sort in place and created partial_sort_copy

* Add and fix documentation partial_sort

* Test partial_sort with iterator

* Add partial_sort and partial_sort_copy to the changelog

* Moved partial sort to own file

* Added partial_sort_config

* Merge with nth_element_remove_oracle branch

* Created c++17 test for partial_sort

* Cleanup code based on nth_element review

* Review adaptations

* Added benchmark for partial_sort

* Fixed bug with implicit casting in partial sort

* add static_cast to fix compiler warning

* Restored tests for device histogram_even for half/bfloat16 types

* Removed unused variable and formatting

* ci: Enable debug builds excluding test_block_adjacent_difference/discontinuity

These tests take extremely long time to build with clang from ROCm 6.1+.

* test(test_device_batch_memcpy.cpp): fix invalid calls being made to generate_random_data_n

* test(test_device_batch_memcpy.cpp): standardize test names

* test(test_intrinsics.cpp): fix invalid calls being made to test_utils::get_random_data

* ci(.gitlab-ci.yml): add hardened libc++ assertions when building tests with gitlab ci

* docs: update changelog

* docs: fix doxygen errors and warnings

* build(cmake/Dependencies.cmake): build rocm-cmake dependency during populate step when fetching it by source

* refactor(benchmark_config_dispatch.cpp): fix unused variable and function

* chore: bump version to 3.3.0

* Reduce items_per_thread for merge_sort to one for large types

* Reduce block_size for device_merge with large types

---------

Co-authored-by: Nara Prasetya <nara@streamhpc.com>
Co-authored-by: Jaap Blok <jaap@streamhpc.com>
Co-authored-by: Gergely Meszaros <gergely@streamhpc.com>
Co-authored-by: Nol Moonen <nol@streamhpc.com>
Co-authored-by: Bence Parajdi <bence@streamhpc.com>
Co-authored-by: Beatriz Navidad Vilches <beatriz@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Lőrinc Serfőző <lorinc@streamhpc.com>
Co-authored-by: Nick Breed <nick@streamhpc.com>
Co-authored-by: Arsalan Anwari <arsalan@streamhpc.com>
Co-authored-by: Ivan <ivan@streamhpc.com>
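
The "fix(tests): Add saturating casts" item above describes how a plain cast can wrap a requested range so that the maximum ends up below the minimum. The following is a minimal Python sketch of that effect, not the rocPRIM test helper itself: the function names are hypothetical, and the sketch only models integer wrap-around into an 8-bit unsigned type.

```python
# Hypothetical helpers for illustration; these names are not taken from rocPRIM.
U8_MIN, U8_MAX = 0, 255


def wrapping_cast_u8(value: int) -> int:
    """Model a wrapping conversion to an 8-bit unsigned value (keep the low 8 bits)."""
    return value & 0xFF


def saturate_cast_u8(value: int) -> int:
    """Clamp to the representable range instead of wrapping."""
    return max(U8_MIN, min(U8_MAX, value))


# A requested input range that does not fit into an 8-bit unsigned type.
requested_min, requested_max = -10, 300

# Wrapping turns the bounds into (246, 44): the "minimum" exceeds the "maximum",
# which is the kind of invalid distribution range the commit message describes.
wrapped = (wrapping_cast_u8(requested_min), wrapping_cast_u8(requested_max))

# Saturating keeps the bounds ordered: (0, 255).
saturated = (saturate_cast_u8(requested_min), saturate_cast_u8(requested_max))
```

With saturation the bounds stay ordered for any requested range, so a distribution constructed from them always receives a valid interval.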
12 people authored Aug 3, 2024
1 parent dc23a85 commit dbb52d5
Showing 241 changed files with 22,760 additions and 5,015 deletions.
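
The deterministic scan and reduce-by-key overloads listed in the commit message target operators that are not associative, such as floating-point addition. The sketch below is plain Python rather than the rocPRIM API, and its function names are made up for illustration. It only shows why the grouping of operands matters: if a parallel algorithm may group partial results differently from run to run, the last bits of the result can change.

```python
def scan_left_to_right(values):
    """Inclusive scan with one fixed, sequential combination order."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out


def sum_grouped_from_right(values):
    """Same three inputs, combined as a + (b + c) instead of (a + b) + c."""
    a, b, c = values
    return a + (b + c)


values = [0.1, 0.2, 0.3]

# Last element of the scan is (0.1 + 0.2) + 0.3.
left = scan_left_to_right(values)[-1]

# Different grouping of the same operands.
right = sum_grouped_from_right(values)

# The two groupings disagree in the last bits for these inputs.
groupings_differ = left != right
```

Fixing the combination order, as the sequential scan here does, is what makes repeated runs bitwise identical. The changelog notes that the deterministic variants pay for this with reduced performance.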
37 changes: 36 additions & 1 deletion .clang-format
@@ -41,7 +41,6 @@ AllowShortLoopsOnASingleLine: false
AlwaysBreakAfterReturnType: None
AlwaysBreakBeforeMultilineStrings: false
AlwaysBreakTemplateDeclarations: Yes
AttributeMacros: ['ROCPRIM_DEVICE', 'ROCPRIM_HOST', 'ROCPRIM_HOST_DEVICE', 'ROCPRIM_SHARED_MEMORY', 'ROCPRIM_KERNEL', 'ROCPRIM_INLINE']
BinPackArguments: false
BinPackParameters: false
BitFieldColonSpacing: Both
@@ -135,4 +134,40 @@ SpacesInConditionalStatement: false
SpacesInContainerLiterals: true
SpacesInParentheses: false
SpacesInSquareBrackets: false

AttributeMacros:
- __host__
- __device__
- __global__
- __forceinline__
- __shared__
- __launch_bounds__
- ROCPRIM_DEVICE
- ROCPRIM_HOST
- ROCPRIM_HOST_DEVICE
- ROCPRIM_SHARED_MEMORY
- ROCPRIM_KERNEL
- ROCPRIM_INLINE
- ROCPRIM_FORCE_INLINE
- ROCPRIM_LAUNCH_BOUNDS

# Trick clang into thinking that our C-style attributes are C++-style attributes
# Make sure that the sizes line up for linebreaks etc
Macros:
- __host__=[[host]]
- __device__=[[device]]
- __global__=[[global]]
- __forceinline__=[[forceinline]]
- __shared__=[[shared]]
- __launch_bounds__(x)=[[launch_bounds(x)]]
- __attribute__(x)=[[attribute(x)]]
- ROCPRIM_DEVICE=[[DEVICE____]]
- ROCPRIM_HOST=[[HOST____]]
- ROCPRIM_HOST_DEVICE=[[HOST_DEVICE____]]
- ROCPRIM_SHARED_MEMORY=[[SHARED_MEMORY____]]
- ROCPRIM_KERNEL=[[KERNEL____]]
- ROCPRIM_INLINE=[[INLINE____]]
- ROCPRIM_FORCE_INLINE=[FORCE_INLINE____]]
- ROCPRIM_LAUNCH_BOUNDS(x)=[[launch_bounds(x)____]]
BreakAfterAttributes: Always
---
86 changes: 41 additions & 45 deletions .gitlab-ci.yml
@@ -27,6 +27,7 @@ include:
- /defaults.yaml
- /deps-cmake.yaml
- /deps-docs.yaml
- /deps-format.yaml
- /deps-rocm.yaml
- /deps-vcpkg.yaml
- /deps-windows.yaml
@@ -46,20 +47,7 @@ variables:

clang-format:
extends:
- .deps:rocm
stage: lint
needs: []
tags:
- build
variables:
CLANG_FORMAT: "/opt/rocm/llvm/bin/clang-format"
GIT_CLANG_FORMAT: "/opt/rocm/llvm/bin/git-clang-format"
rules:
- if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
script:
- cd $CI_PROJECT_DIR
- git config --global --add safe.directory $CI_PROJECT_DIR
- scripts/code-format/check-format.sh $CI_MERGE_REQUEST_DIFF_BASE_SHA --binary "$CLANG_FORMAT"
- .lint:clang-format

copyright-date:
extends:
@@ -162,15 +150,22 @@ build:cmake-minimum-apt:
extends:
- .gpus:rocm-gpus
- .rules:build
variables:
EXTRA_CMAKE_CXX_FLAGS: ""
script:
- mkdir -p $BUILD_DIR
- cd $BUILD_DIR
- | # Add hardened libc++ assertions for tests only
if [[ $BUILD_TARGET == "TEST" ]]; then
echo "Configuring with hardened libc++!"
EXTRA_CMAKE_CXX_FLAGS+=" -D_GLIBCXX_ASSERTIONS=ON"
fi
- cmake
-G Ninja
-D CMAKE_CXX_COMPILER="$AMDCLANG"
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
-D CMAKE_CXX_FLAGS="-Wall -Wextra -Werror $EXTRA_CMAKE_CXX_FLAGS"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
-D BUILD_$BUILD_TARGET=ON
-D BUILD_$BUILD_TARGET=ON
-D BUILD_EXAMPLE=ON
-D GPU_TARGETS=$GPU_TARGETS
-D AMDGPU_TEST_TARGETS=$GPU_TARGETS
@@ -197,9 +192,11 @@ build:cmake-latest:
extends:
- .cmake-latest
- .build:common
variables:
BUILD_TYPE: Release
BUILD_TARGET: TEST
parallel:
# Debug builds disabled due to excessive build times for debug test builds
matrix:
- BUILD_TYPE: Release
BUILD_TARGET: [BENCHMARK, TEST]

build:cmake-minimum:
needs: []
@@ -246,20 +243,19 @@ build:windows:
- .deps:visual-studio-devshell
parallel:
matrix:
# Debug is disabled due to extensive link times, tracked in issue 679.
- BUILD_TYPE: [Release]
- BUILD_TYPE: [Debug, Release]
BUILD_TARGET: [BENCHMARK, TEST]
script:
- mkdir -p $CI_PROJECT_DIR/build
- cmake -G Ninja
-S $CI_PROJECT_DIR
-B $CI_PROJECT_DIR/build
-D BUILD_$BUILD_TARGET=ON
- cmake -G Ninja
-S $CI_PROJECT_DIR
-B $CI_PROJECT_DIR/build
-D BUILD_$BUILD_TARGET=ON
-D GPU_TARGETS=$GPU_TARGET
-D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
-D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
- cmake --build "$CI_PROJECT_DIR/build"
-D CMAKE_CXX_COMPILER:PATH="${env:HIP_PATH}\bin\clang++.exe"
-D CMAKE_PREFIX_PATH:PATH="${env:HIP_PATH}"
-D CMAKE_BUILD_TYPE="$BUILD_TYPE"
- cmake --build "$CI_PROJECT_DIR/build"
artifacts:
paths:
- $CI_PROJECT_DIR/build/test/test_*
@@ -281,6 +277,10 @@ autotune:build:
- .cmake-minimum
- .gpus:rocm-gpus
- .rules:benchmark
before_script:
- !reference [".cmake-minimum", before_script]
- $SUDO_CMD apt-get update -qq
- $SUDO_CMD apt-get install -qq -y zstd
variables:
BENCHMARK_TARGETS: benchmark_config_tuning
script:
@@ -301,22 +301,13 @@ autotune:build:
-D GPU_TARGETS=$GPU_TARGETS
- cmake --build . --target $BENCHMARK_TARGETS
- 'rm -rf $BUILD_DIR/benchmark/benchmark*.parallel'
# remove benchmark executables if their size together is too large for gitlab ci to handle
- |
total_size_bytes=0
while read -r file_size; do
total_size_bytes=$((total_size_bytes + file_size))
done < <(stat --format="%s" benchmark/benchmark*)
total_size_gib="$(numfmt --round=down --to-unit=Gi "$total_size_bytes")"
if [ "$total_size_gib" -ge 3 ]; then
printf "Total size: %s (%d bytes) > 3GiB, skipping benchmark executables from the artifact.\n" \
"$(numfmt --to=iec-i "$total_size_bytes")" "$total_size_bytes"
rm benchmark/benchmark*
fi
# The autotune benchmarks get very large, above GitLabs upload limit. Fortunately they compress well.
# We'll put them all in a single archive to compress them to a few hundred MB.
- find benchmark -type f -executable -print0 | tar -I zstd -cvf benchmarks.tar.zstd --null -T -

artifacts:
paths:
- $BUILD_DIR/benchmark/benchmark*
- $BUILD_DIR/benchmarks.tar.zstd
- $BUILD_DIR/.ninja_log
- $BUILD_DIR/deps/googlebenchmark/
expire_in: 1 week
@@ -360,7 +351,7 @@ test:
- cd $CI_PROJECT_DIR/build
- ctest --output-on-failure

# Disabled due to extensive link times.
# Disabled due to extensive link times.
# This is tracked in issue 679
# test-windows-debug:
# extends:
@@ -535,9 +526,13 @@ autotune:execute-tuning:
artifacts:
paths:
- ${AUTOTUNE_RESULT_DIR}/*.json
before_script:
- !reference [".cmake-minimum", before_script]
- $SUDO_CMD apt-get update -qq
- $SUDO_CMD apt-get install -qq -y zstd
script:
- >
cd "${CI_PROJECT_DIR}"
- cd "${CI_PROJECT_DIR}"
- tar -I zstd -xvf "${BUILD_DIR}/benchmarks.tar.zstd" -C "${BUILD_DIR}/"
- |
if [ ! -d "${BUILD_DIR}/benchmark" ]; then
echo "There are no benchmark executables. Run the build job with a BUILD_TARGET."
@@ -553,6 +548,7 @@ autotune:execute-tuning:
--benchmark_filter_regex="${AUTOTUNE_ALGORITHM_REGEX}"
--size="${AUTOTUNE_SIZE}"
--trials="${AUTOTUNE_TRIALS}"
--seed=82589933

autotune:generate-config:
image: python:3.10.5-buster
13 changes: 10 additions & 3 deletions .gitlab/run_benchmarks.py
@@ -1,6 +1,6 @@
#!/usr/bin/env python3

# Copyright (c) 2022-2023 Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2024 Advanced Micro Devices, Inc. All rights reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
@@ -28,7 +28,7 @@
import subprocess
import sys

BenchmarkContext = namedtuple('BenchmarkContext', ['gpu_architecture', 'benchmark_output_dir', 'benchmark_dir', 'benchmark_filename_regex', 'benchmark_filter_regex', 'size', 'trials'])
BenchmarkContext = namedtuple('BenchmarkContext', ['gpu_architecture', 'benchmark_output_dir', 'benchmark_dir', 'benchmark_filename_regex', 'benchmark_filter_regex', 'size', 'trials', 'seed'])

def run_benchmarks(benchmark_context):
def is_benchmark_executable(filename):
@@ -61,6 +61,8 @@ def is_benchmark_executable(filename):
args += ['--size', benchmark_context.size]
if benchmark_context.trials:
args += ['--trials', benchmark_context.trials]
if benchmark_context.seed:
args += ['--seed', benchmark_context.seed]
try:
subprocess.check_call(args)
except subprocess.CalledProcessError as error:
@@ -97,6 +99,10 @@ def main():
help='Controls the number of trial iterations for each benchmark case',
default='',
required=False)
parser.add_argument('--seed',
help='Controls the seed for random number generation for each benchmark case',
default='',
required=False)

args = parser.parse_args()

@@ -107,7 +113,8 @@ def main():
args.benchmark_filename_regex,
args.benchmark_filter_regex,
args.size,
args.trials)
args.trials,
args.seed)

benchmark_run_successful = run_benchmarks(benchmark_context)

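
The hunks above thread a new optional `--seed` argument from the command line into the benchmark invocation, forwarding it only when it is set. The following is a reduced, self-contained sketch of that pattern. It is not the full script: `build_command` and the executable name `./benchmark_example` are hypothetical, and only the `--size`, `--trials` and `--seed` options are modelled.

```python
import argparse


def parse_args(argv):
    """Parse the optional pass-through options; an empty string means 'not set'."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--size', default='', required=False)
    parser.add_argument('--trials', default='', required=False)
    parser.add_argument('--seed', default='', required=False)
    return parser.parse_args(argv)


def build_command(executable, options):
    """Append each option to the benchmark command line only if it was provided."""
    args = [executable]
    if options.size:
        args += ['--size', options.size]
    if options.trials:
        args += ['--trials', options.trials]
    if options.seed:
        args += ['--seed', options.seed]
    return args


# './benchmark_example' is a placeholder name, not a real rocPRIM benchmark.
with_seed = build_command('./benchmark_example', parse_args(['--seed', '82589933']))
without_seed = build_command('./benchmark_example', parse_args([]))
```

Because the default is an empty string, omitting `--seed` leaves the benchmark's own default in place, which the changelog describes as a random seed per benchmark measurement.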
32 changes: 29 additions & 3 deletions CHANGELOG.md
@@ -5,11 +5,36 @@ Documentation for rocPRIM is available at

## Unreleased rocPRIM-3.3.0 for ROCm 6.3.0

### Fixes
### Additions

* Option `--seed` to benchmarks to specify a seed for the generation of random inputs. The default behavior is to keep using a random seed per benchmark measurement.
* Added configuration autotuning to device partition (`rocprim::partition`, `rocprim::partition_two_way`, and `rocprim::partition_three_way`), device select (`rocprim::select`, `rocprim::unique`, and `rocprim::unique_by_key`), and device reduce by key (`rocprim::reduce_by_key`) for improved performance on selected architectures.
* Added `rocprim::uninitialized_array` which provides uninitialized storage in local memory for user-defined types.
* Added large segment support for `rocprim::segmented_reduce`.
* Added a parallel `nth_element` device function similar to `std::nth_element`. This function rearranges the elements so that those smaller than the n-th element come before it and those bigger than the n-th element come after it.
* Added deterministic (bitwise reproducible) algorithm variants `rocprim::deterministic_inclusive_scan`, `rocprim::deterministic_exclusive_scan`, `rocprim::deterministic_inclusive_scan_by_key`, `rocprim::deterministic_exclusive_scan_by_key`, and `rocprim::deterministic_reduce_by_key`. These provide run-to-run stable results with non-associative operators such as float operations, at the cost of reduced performance.
* Added parallel `partial_sort` and `partial_sort_copy` device functions similar to `std::partial_sort` and `std::partial_sort_copy`. These functions rearrange elements such that the elements are the same as a sorted list up to and including the middle index.

### Changes

* Modified the input size in device adjacent difference benchmarks. Observed performance with these benchmarks might be different.
* Changed the default seed for `device_benchmark_segmented_reduce`.

### Fixes

* Fixed an issue where running rtest.py on Windows with an absolute path passed to `--install_dir` caused a `FileNotFound` error.
* rocPRIM functions are no longer forcefully inlined on Windows, significantly reducing the build
time in debug builds.
* `block_load`, `block_store`, `block_shuffle`, `block_exchange` and `warp_exchange` now use placement `new` instead of copy
assignment (`operator=`) when writing to local memory. This fixes the behavior of custom types with non-trivial copy assignments.
* Fixed a bug in the generation of input data for benchmarks, which caused incorrect performance to be reported in specific cases. It may affect the reported performance for one-byte types (`uint8_t` and `int8_t`) and instantiations of `custom_type`. Specifically, device binary search, device histogram, device merge and warp sort are affected.
* Fixed a bug for `rocprim::merge_path_search` where using `unsigned` offsets would output wrong results.
* Fixed a bug for `rocprim::thread_load` and `rocprim::thread_store` where `float` and `double` were not cast to the correct type, resulting in wrong results.
* Fix tests failing when compiling with `-D_GLIBCXX_ASSERTIONS=ON`.

### Deprecations

* `rocprim::thread_load` and `rocprim::thread_store`, use dereference instead. Not all of those functions are available on every device architecture, and their usage can hurt performance, because inline assembly inhibits optimizations.

## Unreleased rocPRIM-3.2.0 for ROCm 6.2.0

@@ -28,10 +53,10 @@ Documentation for rocPRIM is available at
* New `rocprim::batch_copy` function added. Similar to `rocprim::batch_memcpy`, but copies by element, not with memcpy.
* Added more test cases, to better cover supported data types.
* Updated some tests to work with supported data types.
* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
* An optional `decomposer` argument for all member functions of `rocprim::block_radix_sort` and all functions of `device_radix_sort`.
To sort keys of an user-defined type, a decomposer functor should be passed. The decomposer should produce a `rocprim::tuple`
of references to arithmetic types from the key.
* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
* New `rocprim::predicate_iterator` which acts as a proxy for an underlying iterator based on a predicate.
It iterates over proxies that holds the references to the underlying values, but only allow reading and writing if the predicate is `true`.
It can be instantiated with:
* `rocprim::make_predicate_iterator`
@@ -43,6 +68,7 @@ Documentation for rocPRIM is available at

* Improved the performance of `warp_sort_shuffle` and `block_sort_bitonic`.
* Created an optimized version of the `warp_exchange` functions `blocked_to_striped_shuffle` and `striped_to_blocked_shuffle` when the warpsize is equal to the items per thread.
* Improved the performance of `device_transform`.

### Fixes

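
The changelog entries for `nth_element` and `partial_sort` describe the results by their post-conditions. The sketch below expresses those post-conditions as host-side reference checks in Python. It is an illustration under the wording of the changelog (including "up to and including the middle index"), not the device implementation or its tests, and the checker names are hypothetical.

```python
def is_valid_nth_element(result, original, n):
    """Check: same elements, the n-th is in its sorted position, smaller-or-equal
    elements come before it and bigger-or-equal elements come after it."""
    pivot = sorted(original)[n]
    return (sorted(result) == sorted(original)
            and result[n] == pivot
            and all(x <= pivot for x in result[:n])
            and all(x >= pivot for x in result[n + 1:]))


def is_valid_partial_sort(result, original, middle):
    """Check: same elements, and the prefix up to and including `middle`
    matches the fully sorted sequence."""
    expected = sorted(original)
    return (sorted(result) == expected
            and result[:middle + 1] == expected[:middle + 1])


original = [7, 2, 9, 4, 1, 8]

# Elements around index 2 are partitioned but not sorted: valid for nth_element,
# not valid for partial_sort.
partitioned = [2, 1, 4, 9, 7, 8]

# The first three elements are sorted, the tail is in arbitrary order:
# valid for both checks.
prefix_sorted = [1, 2, 4, 9, 7, 8]

nth_ok = is_valid_nth_element(partitioned, original, 2)
partial_ok = is_valid_partial_sort(prefix_sorted, original, 2)
```

The two checks show the difference in guarantees: `nth_element` only fixes the position of one element and partitions around it, while `partial_sort` additionally orders everything before it.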
3 changes: 2 additions & 1 deletion CMakeLists.txt
@@ -45,6 +45,7 @@ endif()
# Build options
option(BUILD_TEST "Build tests (requires googletest)" OFF)
option(BUILD_BENCHMARK "Build benchmarks" OFF)
option(BUILD_NAIVE_BENCHMARK "Build naive benchmarks" OFF)
option(BUILD_EXAMPLE "Build examples" OFF)
option(BUILD_DOCS "Build documentation (requires sphinx)" OFF)
option(USE_HIP_CPU "Prefer HIP-CPU runtime instead of HW acceleration" OFF)
@@ -130,7 +131,7 @@ if(USE_HIP_CPU)
endif()

# Setup VERSION
set(VERSION_STRING "3.2.0")
set(VERSION_STRING "3.3.0")
rocm_setup_version(VERSION ${VERSION_STRING})

# Print configuration summary