Releases: ROCm/rocBLAS
rocBLAS 4.2.1 for ROCm 6.2.2
rocBLAS code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.
rocBLAS 4.2.1 for ROCm 6.2.1
Removals
- Removed the Device_Memory_Allocation.pdf link from the documentation
Fixes
- Fixed error/warn message during rocblas_set_stream() call
rocBLAS 4.2.0 for ROCm 6.2.0
Additions
- Level 2 functions and Level 3 trsm have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments
- Cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, axpy
- Benchmark class for common timing code
- An environment variable "ROCBLAS_DEFAULT_ATOMICS_MODE" to set default atomics mode during creation of 'rocblas_handle'
- Extended dot_ex to support single-precision (fp32_r) input and double-precision (fp64_r) output and compute types
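The fp32-input, fp64-output combination added to dot_ex can be illustrated with a host-side sketch. This is not rocBLAS code: the function names below are hypothetical and only show what accumulating single-precision inputs in double precision means, next to an all-fp32 version for comparison.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not the rocBLAS implementation):
// fp32 inputs, products accumulated and returned in fp64.
double dot_f32_in_f64_out(const std::vector<float>& x, const std::vector<float>& y)
{
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    double acc = 0.0;
    for(std::size_t i = 0; i < n; ++i)
        acc += static_cast<double>(x[i]) * static_cast<double>(y[i]);
    return acc;
}

// Same inputs, but accumulated and returned in fp32.
float dot_f32_in_f32_out(const std::vector<float>& x, const std::vector<float>& y)
{
    const std::size_t n = x.size() < y.size() ? x.size() : y.size();
    float acc = 0.0f;
    for(std::size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}
```

For x = {1.0e8f, 1.0f, -1.0e8f} and y = {1.0f, 1.0f, 1.0f}, the fp64 accumulator keeps the small term and returns 1.0, whereas an fp32 accumulator would typically absorb it into the large term and lose it.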
Optimizations
- Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance improved by a factor of 6 for larger problem sizes, as measured on an MI210 GPU
Changes
- Linux AOCL dependency updated to release 4.2 gcc build
- Windows vcpkg dependencies updated to release 2024.02.14
- Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40
Deprecations
- rocblas_gemm_ex3, gemm_batched_ex3 and gemm_strided_batched_ex3 are deprecated and will be removed in the next major release of rocBLAS. Refer to hipBLASLt for future 8-bit float usage: https://github.com/ROCm/hipBLASLt
rocBLAS 4.1.2 for ROCm 6.1.2
Fixes
- Fixed BF16 TT get_solutions
Optimizations
- Tuned gfx942 BBS TN, TT
rocBLAS 4.1.0 for ROCm 6.1.1
rocBLAS code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.
rocBLAS 4.1.0 for ROCm 6.1.0
Additions
- Level 1 and Level 1 Extension functions have an additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments.
- Cache flush timing for gemm_ex.
Changes
- Some Level 2 function argument names have changed from 'm' to 'n' to match legacy BLAS; there was no change in implementation.
- Standardized the use of non-blocking streams for copying results from device to host.
Fixes
- Fixed host-pointer mode reductions for non-blocking streams.
rocBLAS 4.0.0 for ROCm 6.0.2
rocBLAS code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.
rocBLAS 4.0.0 for ROCm 6.0.0
Added
- Added beta API rocblas_gemm_batched_ex3 and rocblas_gemm_strided_batched_ex3
- Added input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
- Added rocblas_status_excluded_from_build to be used when calling functions which require Tensile when using rocBLAS built without Tensile
- Added a system for async kernel launches that sets a failure rocblas_status based on a hipPeekAtLastError discrepancy
Optimized
- Improved trsm performance for small sizes (m < 32 && n < 32)
Deprecated
- In a future release, atomic operations will be disabled by default so that results are repeatable. Atomic operations can always be enabled or disabled using the function rocblas_set_atomics_mode. Enabling atomic operations can improve performance.
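The link between atomic operations and repeatability comes from floating-point addition not being associative: when partial results are combined in whatever order they happen to arrive, the rounded total can differ between runs. The host-side sketch below is an illustration only, not rocBLAS code, and the helper name is hypothetical. It sums the same three fp32 values in two different orders.

```cpp
#include <vector>

// Hypothetical helper: sum the values strictly left to right in fp32.
// Reordering the input stands in for a different arrival order of partial sums.
float sum_in_order(const std::vector<float>& values)
{
    float acc = 0.0f;
    for(float v : values)
        acc += v;
    return acc;
}
```

Summing {1.0e20f, -1.0e20f, 1.0f} gives 1.0f because the large terms cancel first, while {1.0e20f, 1.0f, -1.0e20f} gives 0.0f because the 1.0f is rounded away when added to 1.0e20f. A fixed accumulation order avoids this variation, at the possible cost of performance.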
Removed
- rocblas_gemm_ext2 API function is removed
- The in-place trmm API from Legacy BLAS is removed. It is replaced by an API that supports both in-place and out-of-place trmm
- int8x4 support is removed. int8 support is unchanged
- The #define __STDC_WANT_IEC_60559_TYPES_EXT__ has been removed from rocblas-types.h. Users who want ISO/IEC TS 18661-3:2015 functionality must define __STDC_WANT_IEC_60559_TYPES_EXT__ before including float.h, math.h, and rocblas.h
- The default build removes device code for gfx803 architecture from the fat binary
Fixed
- Made offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  - Level 2: gbmv, gemv, hbmv, sbmv, spmv, tbmv, tpmv, tbsv, tpsv
- Lazy loading to support heterogeneous architecture setup and load appropriate tensile library files based on the device's architecture
- Guard against no-op kernel launches resulting in potential hipGetLastError
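The overflow addressed by the 64-bit safe offset fixes can be sketched on the host. The helpers below are hypothetical and are not taken from the rocBLAS sources. They show the general technique of widening 32-bit indices to 64 bits before multiplying, so that a large leading dimension does not wrap the element offset.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: offset of element (row, col) in a column-major matrix
// with leading dimension lda. Each operand is widened to 64 bits before the
// multiplication, so the product cannot overflow a 32-bit intermediate.
constexpr std::int64_t element_offset(std::int32_t row, std::int32_t col, std::int32_t lda)
{
    return static_cast<std::int64_t>(row)
           + static_cast<std::int64_t>(col) * static_cast<std::int64_t>(lda);
}

// True when an offset cannot be represented as a 32-bit signed index.
constexpr bool exceeds_int32(std::int64_t offset)
{
    return offset > std::numeric_limits<std::int32_t>::max()
           || offset < std::numeric_limits<std::int32_t>::min();
}
```

With lda = 2^30 and col = 3, the offset is 3 * 2^30 = 3221225472, which is larger than the int32 maximum of 2147483647. Computing the same product in 32-bit arithmetic would overflow.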
Changed
- Default verbosity of rocblas-test reduced. To see all tests, set the environment variable GTEST_LISTENER=PASS_LINE_IN_LOG
rocBLAS 3.1.0 for ROCm 5.7.1
rocBLAS code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.
rocBLAS 3.1.0 for ROCm 5.7.0
Added
- YAML lock-step argument scanning for rocblas-bench and rocblas-test clients. See the Programmers Guide for details.
- rocblas-gemm-tune, which finds the best performing GEMM kernel for each of a given set of GEMM problems.
Fixed
- made offset calculations for rocBLAS functions 64-bit safe. Fixes for very large leading dimensions or increments potentially causing overflow:
  - Level 1: axpy, copy, rot, rotm, scal, swap, asum, dot, iamax, iamin, nrm2
  - Level 2: gemv, symv, hemv, trmv, ger, syr, her, syr2, her2, trsv
  - Level 3: gemm, symm, hemm, trmm, syrk, herk, syr2k, her2k, syrkx, herkx, trsm, trtri, dgmm, geam
  - General: set_vector, get_vector, set_matrix, get_matrix
  - Related fixes: internal scalar loads with > 32-bit offsets
- fixed in-place functionality for all trtri sizes
Changed
- dot, when using rocblas_pointer_mode_host, is now synchronous to match legacy BLAS, as it stores results in host memory
- enhanced reporting of installation issues caused by runtime libraries (Tensile)
- standardized internal rocblas C++ interface across most functions
Deprecated
- Removal of the __STDC_WANT_IEC_60559_TYPES_EXT__ define in a future release
Dependencies
- optional use of AOCL BLIS 4.0 on Linux for clients
- optional build tool only dependency on python psutil