Releases: ROCm/rocBLAS
rocBLAS 2.42.0 for ROCm 5.0.1
rocBLAS code for ROCm 5.0.1 is unchanged from rocBLAS for ROCm 5.0.0. The library was rebuilt for the updated ROCm 5.0.1 stack.
rocBLAS 2.42.0 for ROCm 5.0.0
Added
- Added rocblas_get_version_string_size convenience function
- Added rocblas_xtrmm_outofplace, an out-of-place version of rocblas_xtrmm
- Added hpl and trig initialization for gemm_ex to rocblas-bench
- Added source code gemm. It can be used as an alternative to Tensile for debugging and development
- Added option ROCM_MATHLIBS_API_USE_HIP_COMPLEX to opt-in to use hipFloatComplex and hipDoubleComplex
Optimizations
- Improved performance of non-batched and batched single-precision GER for sizes m > 1024. Performance improved by 5-10%, measured on an MI100 (gfx908) GPU.
- Improved performance of non-batched and batched HER for all sizes and data types. Performance improved by 2-17%, measured on an MI100 (gfx908) GPU.
Changed
- Instantiate templated rocBLAS functions to reduce size of librocblas.so
- Removed static library dependency on msgpack
- Removed boost dependencies for clients
Fixed
- Fixed the install script option to build only rocBLAS clients with a pre-built rocBLAS library
- Correctly set output of nrm2_batched_ex and nrm2_strided_batched_ex when given bad input
- Fix for dgmm with side == rocblas_side_left and a negative incx
- Fixed out-of-bounds read for small trsm
- Fixed numerical checking for tbmv_strided_batched
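The dgmm fix above involves a negative vector increment. As a rough illustration, the following reference sketch computes C = diag(x) * A (the side == rocblas_side_left case) and assumes the usual BLAS convention that a negative incx walks the vector backwards from its far end. The function name ref_dgmm_left is hypothetical, and this is not the rocBLAS implementation.

```cpp
#include <cstdint>

// Hypothetical reference routine: C = diag(x) * A, column-major storage.
// Assumption: with incx < 0 the i-th logical element of x is read starting
// from the far end of the array, as in legacy BLAS.
void ref_dgmm_left(std::int64_t m, std::int64_t n,
                   const double* A, std::int64_t lda,
                   const double* x, std::int64_t incx,
                   double* C, std::int64_t ldc)
{
    // For a negative increment, logical element 0 lives at index (1 - m) * incx,
    // which is non-negative, so no access falls before the start of x.
    const std::int64_t start = incx < 0 ? (1 - m) * incx : 0;

    for(std::int64_t j = 0; j < n; ++j)
        for(std::int64_t i = 0; i < m; ++i)
            C[i + j * ldc] = x[start + i * incx] * A[i + j * lda]; // scale row i by x_i
}
```

With x = {2, 3} and a 2 x 1 matrix of ones, incx = 1 scales the rows by 2 and 3, while incx = -1 scales them by 3 and 2.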
rocBLAS 2.41.0 for ROCm 4.5.2
rocBLAS code for ROCm 4.5.2 is unchanged from rocBLAS for ROCm 4.5.0. The library was rebuilt for the updated ROCm 4.5.2 stack.
rocBLAS 2.41.0 for ROCm 4.5.0
Optimizations
- Improved performance of non-batched and batched syr for all sizes and data types
- Improved performance of non-batched and batched hemv for all sizes and data types
- Improved performance of non-batched and batched symv for all sizes and data types
- Improved memory utilization in rocblas-bench, rocblas-test gemm functions, increasing possible runtime sizes.
- Improved performance of non-batched and batched dot, dotc, and dot_ex for small n, e.g. sdot with n <= 31000.
- Improved performance of non-batched and batched trmv for all sizes and matrix types.
- Improved performance of non-batched and batched gemv transpose case for all sizes and datatypes.
- Improved performance of sger and dger for all sizes, in particular the larger dger sizes.
- Improved performance of syrkx for large sizes, including those in rocBLAS Issue #1184.
Changed
- Update from C++14 to C++17.
- Packaging split into a runtime package (called rocblas) and a development package (called rocblas-dev for .deb packages, and rocblas-devel for .rpm packages). The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future ROCm release.
Fixed
- For function geam avoid overflow in offset calculation.
- For function syr avoid overflow in offset calculation.
- For function gemv (Transpose-case) avoid overflow in offset calculation.
- For functions ssyrk and dsyrk, allow conjugate-transpose case to match legacy BLAS. Behavior is the same as the transpose case.
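The offset-overflow fixes above concern index arithmetic on large matrices. The sketch below shows the general failure mode and one common remedy. It assumes 32-bit dimension and leading-dimension arguments, and the helper names are hypothetical rather than taken from rocBLAS.

```cpp
#include <cstdint>

// Hypothetical helper: column-major offset of element (row, col).
// Both operands are widened to 64 bits before the multiplication, so the
// product cannot wrap even when col * lda exceeds the 32-bit range.
inline std::int64_t element_offset(std::int32_t row, std::int32_t col, std::int32_t lda)
{
    return static_cast<std::int64_t>(row) + static_cast<std::int64_t>(col) * lda;
}

// The same formula evaluated entirely in 32 bits. Unsigned types are used so
// that the wraparound is well defined; it is shown only to make the
// truncated result visible.
inline std::uint32_t element_offset_32(std::uint32_t row, std::uint32_t col, std::uint32_t lda)
{
    return row + col * lda;
}
```

For a 70000 x 70001 matrix with lda = 70000, the first element of the last column sits at offset 4,900,000,000, which does not fit in 32 bits, so the narrow version returns a wrapped value.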
rocBLAS 2.39.0 for ROCm 4.3.1
Fixed
- CI testing/benchmark issues.
rocBLAS 2.39.0 for ROCm 4.3.0
Optimizations
- Improved performance of non-batched and batched rocblas_Xgemv for gfx908 when m <= 15000 and n <= 15000
- Improved performance of non-batched and batched rocblas_sgemv and rocblas_dgemv for gfx906 when m <= 6000 and n <= 6000
- Improved the overall performance of non-batched and batched rocblas_cgemv for gfx906
Changed
- Internal-use-only APIs are now prefixed with rocblas_internal_ and marked deprecated to discourage use
rocBLAS-2.38.0 for ROCm 4.2.0
Added
- Added option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Added gemm_ex support for the unpacked int8 input layout on gfx908 GPUs
  - Added new flag rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4 to specify whether the packed layout is used
    - Set rocblas_gemm_flags_pack_int8x4 when using packed int8x4. This flag should always be set on GPUs before gfx908.
    - For gfx908 GPUs, unpacked int8 is supported, so there is no need to set this flag.
    - Note that the default flags value 0 uses unpacked int8, which changes the behaviour of int8 gemm relative to ROCm 4.1.0
- Added a query function rocblas_query_int8_layout_flag to get the preferable layout of int8 for gemm by device
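To make the packed versus unpacked distinction above concrete, here is a minimal host-side sketch of packing a non-transposed, column-major int8 matrix A (m x k) into groups of four along k. The layout shown is an assumption based on the packing described for int8 gemm_ex, the helper name pack_int8x4 is hypothetical, and k is assumed to be a multiple of 4. Check the rocBLAS gemm_ex documentation before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: repack column-major A (m x k, leading dimension lda)
// so that four consecutive k values of the same row are adjacent in memory.
// Assumptions: k % 4 == 0 and m <= lda.
std::vector<std::int8_t> pack_int8x4(int m, int k, const std::int8_t* a, int lda)
{
    const int nb = 4; // number of int8 values per packed group
    std::vector<std::int8_t> packed(static_cast<std::size_t>(lda) * k, 0);

    for(int i_m = 0; i_m < m; ++i_m)
        for(int i_k = 0; i_k < k; ++i_k)
        {
            // Position within the group, then (row, group index) as a column-major pair.
            const std::size_t dst
                = i_k % nb + (static_cast<std::size_t>(i_m) + static_cast<std::size_t>(i_k / nb) * lda) * nb;
            packed[dst] = a[i_m + static_cast<std::size_t>(i_k) * lda];
        }
    return packed;
}
```

For a 2 x 4 matrix, the packed buffer holds the four k values of row 0 followed by the four k values of row 1. On gfx908 with the default flags, no such repacking step is needed.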
Optimizations
- Improved performance of single precision copy, swap, and scal when incx == 1 and incy == 1.
- Improved performance of single precision axpy when incx == 1, incy == 1 and batch_count <= 8192.
- Improved performance of trmm.
Changed
- Change cmake_minimum_required to VERSION 3.16.8
rocBLAS-2.36.0 for ROCm 4.1.0
Added
- Added numerical checking helper function to detect zero/NaN/Inf in the input and output vectors of rocBLAS level 1 and 2 functions.
- Added numerical checking helper function to detect zero/NaN/Inf in the input and output general matrices of rocBLAS level 2 and 3 functions.
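The kind of check described above can be sketched on the host as follows. This is only an illustration of detecting zero, NaN, and Inf values in a strided vector. The struct and function names are hypothetical, and the actual rocBLAS helpers are internal and may work differently.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical result type recording which special values were seen.
struct numerical_check_result
{
    bool has_zero = false;
    bool has_nan  = false;
    bool has_inf  = false;
};

// Scan n elements of a strided real vector. Assumes incx > 0.
template <typename T>
numerical_check_result check_vector(std::int64_t n, const T* x, std::int64_t incx)
{
    numerical_check_result r;
    for(std::int64_t i = 0; i < n; ++i)
    {
        const T v = x[i * incx];
        if(std::isnan(v))
            r.has_nan = true;
        else if(std::isinf(v))
            r.has_inf = true;
        else if(v == T(0))
            r.has_zero = true;
    }
    return r;
}
```

A caller could run this on the inputs before a level 1 or level 2 call and on the outputs afterwards, then report any flag that is set.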
Fixed
- Fixed complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
- Made functions compliant with legacy BLAS for the special values alpha == 0, k == 0, beta == 1, beta == 0.
Optimizations
- Improved performance of single precision axpy_batched and axpy_strided_batched: batch_count >= 8192.
rocBLAS-2.32.0 for ROCm 4.0.0
New Features
- No new features
Known Issues
- None
rocBLAS-2.32.0 for ROCm 3.10.0
New Features
- Improved performance of gemm_batched for NN, general m, n, k, small m, n, k
Known Issues
- None