rocBLAS documentation is available at https://rocm.docs.amd.com/projects/rocBLAS/en/latest/index.html.
- Level 2 functions and level 3 trsm have additional ILP64 API for both C and FORTRAN (_64 name suffix) with int64_t function arguments
- Cache flush timing for gemm_batched_ex, gemm_strided_batched_ex, axpy
- Benchmark class for common timing code
- An environment variable "ROCBLAS_DEFAULT_ATOMICS_MODE" to set default atomics mode during creation of 'rocblas_handle'
- Extended dot_ex to support single-precision (fp32_r) input and double-precision (fp64_r) output and compute types
- Improved performance of Level 1 dot_batched and dot_strided_batched for all precisions. Performance enhanced by 6 times for bigger problem sizes measured on MI210 GPU
- Linux AOCL dependency updated to release 4.2 gcc build
- Windows vcpkg dependencies updated to release 2024.02.14
- Increased default device workspace from 32 to 128 MiB for architecture gfx9xx with xx >= 40
- rocblas_gemm_ex3, gemm_batched_ex3 and gemm_strided_batched_ex3 are deprecated and will be removed in the next major release of rocBLAS. Refer to hipBLASLt for future 8-bit float usage: https://github.com/ROCm/hipBLASLt
- Level 1 and Level 1 Extension functions have additional ILP64 API for both C and Fortran (`_64` name suffix) with int64_t function arguments
- Cache flush timing for `gemm_ex`
- Some Level 2 function argument names have changed from `m` to `n` to match legacy BLAS; there is no change in implementation
- Standardized the use of non-blocking streams for copying results from device to host
- Fixed host-pointer mode reductions for non-blocking streams
- Beta API `rocblas_gemm_batched_ex3` and `rocblas_gemm_strided_batched_ex3`
- Input/output type f16_r/bf16_r and execution type f32_r support for Level 2 gemv_batched and gemv_strided_batched
- Use of `rocblas_status_excluded_from_build` when calling functions that require Tensile (when using rocBLAS built without Tensile)
- System for asynchronous kernel launches that set a `rocblas_status` failure based on a `hipPeekAtLastError` discrepancy
- TRSM performance for small sizes (m < 32 && n < 32)
- Atomic operations will be disabled by default in a future release of rocBLAS
- `rocblas_gemm_ext2` API function
- In-place trmm API from Legacy BLAS is replaced by an API that supports both in-place and out-of-place trmm
- int8x4 support is removed (int8 support is unchanged)
- `#define __STDC_WANT_IEC_60559_TYPES_EXT__` is removed from `rocblas-types.h` (if you want ISO/IEC TS 18661-3:2015 functionality, you must define `__STDC_WANT_IEC_60559_TYPES_EXT__` before including `float.h`, `math.h`, and `rocblas.h`)
- The default build removes device code for gfx803 architecture from the fat binary
- Made offset calculations for 64-bit rocBLAS functions safe
- Fixes for very large leading dimension or increment potentially causing overflow:
  - Level 2: `gbmv`, `gemv`, `hbmv`, `sbmv`, `spmv`, `tbmv`, `tpmv`, `tbsv`, and `tpsv`
- Lazy loading supports heterogeneous architecture setup and load-appropriate tensile library files, based on device architecture
- Guards against no-op kernel launches that result in a potential `hipGetLastError`
- Reduced the default verbosity of `rocblas-test` (you can see all tests by setting the `GTEST_LISTENER=PASS_LINE_IN_LOG` environment variable)
- YAML lock step argument scans for `rocblas-bench` and `rocblas-test` clients
- `rocblas-gemm-tune` is used to find the best performing GEMM kernel for each set of GEMM problems
- Made offset calculations for 64-bit rocBLAS functions safe
- Fixes for very large leading dimensions or increments potentially causing overflow:
  - Level 1: `axpy`, `copy`, `rot`, `rotm`, `scal`, `swap`, `asum`, `dot`, `iamax`, `iamin`, and `nrm2`
  - Level 2: `gemv`, `symv`, `hemv`, `trmv`, `ger`, `syr`, `her`, `syr2`, `her2`, and `trsv`
  - Level 3: `gemm`, `symm`, `hemm`, `trmm`, `syrk`, `herk`, `syr2k`, `her2k`, `syrkx`, `herkx`, `trsm`, `trtri`, `dgmm`, and `geam`
  - General: `set_vector`, `get_vector`, `set_matrix`, and `get_matrix`
  - Related fixes: internal scalar loads with > 32-bit offsets
  - In-place functionality for all `trtri` sizes
- Dot when using `rocblas_pointer_mode_host` is now synchronous in order to match legacy BLAS as it stores results in host memory
- Enhanced reporting of installation issues caused by runtime libraries (Tensile)
- Standardized internal rocBLAS C++ interface across most functions
- `__STDC_WANT_IEC_60559_TYPES_EXT__` define will be removed in a future release
- Optional use of AOCL BLIS 4.0 on Linux for clients
- Optional build tool-only dependency on Python `psutil`
- Level 2 rocBLAS GEMV performance on gfx90a GPU for non-transposed problems that have small matrices (`m` and `n` <= 32) and large batch counts (`batch_count` >= 256)
- rocBLAS syr2k performance for single, double, and double-complex precision
- rocBLAS her2k performance for double-complex precision
- Improved performance for general sizes on gfx90a
- bf16 inputs and f32 compute support to Level 1 rocBLAS extension functions: `axpy_ex`, `scal_ex`, and `nrm2_ex`
- In-place trmm has been replaced with trmm that has in-place and out-of-place functionality
- `rocblas_query_int8_layout_flag()`
- `rocblas_gemm_flags_pack_int8x4`
- `rocblas_set_device_memory_size()` will be replaced with `rocblas_increase_device_memory_size()`
- `rocblas_is_user_managing_device_memory()`
- `is_complex` helper: use `rocblas_is_complex` instead
- The enum `truncate_t`: use `rocblas_truncate_t` instead
- The value `truncate`: use `rocblas_truncate` instead
- `rocblas_set_int8_type_for_hipblas`
- `rocblas_get_int8_type_for_hipblas`
- Python `joblib` build-only dependency (used in Tensile builds)
- Made 64-bit trsm offset calculations safe
- CMake install fixed on some operating systems when using `install.sh -d --cmake_install`
- Refactored ROTG test code
- `rocblas_geam_ex` functionality for matrix-matrix minimum operations
- HIP Graph support (beta feature) for rocBLAS Level 1, Level 2, and Level 3 (pointer mode host) functions
- Beta features API, exposed using compiler define `ROCBLAS_BETA_FEATURES_API`
- Support for vector initialization in the rocBLAS test framework with negative increments
- Windows build documentation for HIP SDK support
- Scripts for plotting the performance of multiple functions
- Performance improvements for Level 2 rocBLAS GEMV for float and double precision (150-200% improvement for certain problem sizes when (m==n) measured on a gfx90a GPU)
- Performance improvements for Level 2 rocBLAS GER for float, double, and complex float precisions (5-7% improvement for certain problem sizes when measured on a gfx90a GPU)
- Performance improvements for Level 2 rocBLAS SYMV for float and double precisions (120-150% improvement for certain problem sizes measured on both gfx908 and gfx90a GPUs)
- Executable mode setting on `rocblas_gentest.py` client script to avoid potential permission errors with clients `rocblas-test` and `rocblas-bench`
- Deprecated API compatibility with Visual Studio compiler
- Test framework memory exception handling for Level 2 functions when the host memory allocation exceeds the available memory
- `install.sh` internally runs `rmake.py` (also used on Windows) and `rmake.py` can be used directly on Linux (use `--help`)
- rocBLAS client executables all now begin with the `rocblas-` prefix
- `install.sh` no longer has the options `-o --cov` because Tensile will now use the default COV format, which is set by `cmake define Tensile_CODE_OBJECT_VERSION=default`
- Client smoke test dataset added for quick validation using command `rocblas-test --yaml rocblas_smoke.yaml`
- Added stream order device memory allocation as a non-default beta option.
- Improved trsm performance for small sizes by using a substitution method technique
- Improved syr2k and her2k performance significantly by using a block-recursive algorithm
- Level 2, Level 1, and Extension functions: argument checking when the handle is set to `rocblas_pointer_mode_host` now returns the status of `rocblas_status_invalid_pointer` only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode `rocblas_pointer_mode_device`, only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of `rocblas_status_invalid_pointer`. This improves consistency with legacy BLAS behavior.
- Add variable to turn on/off ieee16/ieee32 tests for mixed precision gemm
- Allow hipBLAS to select int8 datatype
- Disallow B == C && ldb != ldc in `rocblas_xtrmm_outofplace`
- Fortran interfaces generalized for Fortran compilers other than GFortran
- Fix for `trsm_strided_batched` `rocblas-bench` performance gathering
- Fix for rocm-smi path in `commandrunner.py` script to match ROCm 5.2 and above
- `install.sh` option `--upgrade_tensile_venv_pip` to upgrade Pip in Tensile Virtual Environment. The corresponding CMake option is `TENSILE_VENV_UPGRADE_PIP`
- `install.sh` option `--relocatable` or `-r` adds rpath and removes ldconf entry on rocBLAS build
- `install.sh` option `--lazy-library-loading` to enable on-demand loading of tensile library files at runtime to speed up rocBLAS initialization
- Support for RHEL9 and CS9
- Added Numerical checking routine for symmetric, Hermitian, and triangular matrices, so that they could be checked for any numerical abnormalities such as NaN, Zero, infinity and denormal value
- `trmm_outofplace` performance improvements for all sizes and data types using block-recursive algorithm
- herkx performance improvements for all sizes and data types using block-recursive algorithm
- syrk/herk performance improvements by utilising optimised syrkx/herkx code
- symm/hemm performance improvements for all sizes and datatypes using block-recursive algorithm
- Unifying library logic file names: affects HBH (->HHS_BH), BBH (->BBS_BH), 4xi8BH (->4xi8II_BH). All HPA types are using the new naming convention now.
- Level 3 function argument checking when the handle is set to rocblas_pointer_mode_host now returns the status of rocblas_status_invalid_pointer only for pointers that must be dereferenced based on the alpha and beta argument values. With handle mode rocblas_pointer_mode_device only pointers that are always dereferenced regardless of alpha and beta values are checked and so may lead to a return status of rocblas_status_invalid_pointer. This improves consistency with legacy BLAS behaviour
- Level 1, 2, and 3 function argument checking for enums is now more rigorously matching legacy BLAS so returns rocblas_status_invalid_value if arguments do not match the accepted subset
- Add quick-return for internal trmm and gemm template functions
- Moved function block sizes to a shared header file
- Level 1, 2, and 3 functions use rocblas_stride datatype for offset
- Modified the matrix and vector memory allocation in our test infrastructure for all Level 1, 2, 3 and BLAS_EX functions
- Added specific initialization for symmetric, Hermitian, and triangular matrix types in our test infrastructure
- Added NaN tests to the test infrastructure for the rest of Level 3, BLAS_EX functions
- Improved logic to #include `<filesystem>` vs `<experimental/filesystem>`
- `install.sh -s` option to build rocblas as a static library
- dot function now sets the device results asynchronously for N <= 0
- `is_complex` helper is now deprecated. Use `rocblas_is_complex` instead
- The enum `truncate_t` and the value `truncate` are now deprecated and will be removed from the ROCm release 6.0. They are replaced by `rocblas_truncate_t` and `rocblas_truncate`, respectively. The new enum `rocblas_truncate_t` and the value `rocblas_truncate` can be used from this ROCm release for an easy transition
- `install.sh` options `--hip-clang`, `--no-hip-clang`, `--merge-files`, `--no-merge-files` are removed
- Packages for test and benchmark executables on all supported operating systems using CPack
- Added denormal number detection to the Numerical checking helper function to detect denormal/subnormal numbers in the input and the output vectors of rocBLAS level 1 and 2 functions
- Added denormal number detection to the numerical checking helper function to detect denormal/subnormal numbers in the input and the output general matrices of rocBLAS level 2 and 3 functions
- Added NaN initialization tests to the YAML files of Level 2 rocBLAS batched and strided-batched functions for testing purposes
- Added memory allocation check to avoid disk swapping during rocblas-test runs by skipping tests
- Improved performance of non-batched and batched her2 for all sizes and data types
- Improved performance of non-batched and batched amin for all data types using shuffle reductions
- Improved performance of non-batched and batched amax for all data types using shuffle reductions
- Improved performance of trsv for all sizes and data types
- Modifying `gemm_ex` for HBH (high-precision F16). The alpha/beta data type remains as F32 without narrowing to F16 and expanding back to F32 in the kernel. This change prevents rounding errors due to alpha/beta conversion in situations where alpha/beta are not exactly represented as an F16
- Modified non-batched and batched asum, nrm2 functions to use shuffle instruction based reductions
- For `gemm`, `gemm_ex`, `gemm_ex2` internal API use `rocblas_stride` datatype for offset
- For symm, hemm, syrk, herk, dgmm, geam internal API use `rocblas_stride` datatype for offset
- AMD copyright year for all rocBLAS files
- For `gemv` (transpose-case), typecasted the 'lda' (offset) datatype to `size_t` during offset calculation to avoid overflow and remove duplicate template functions
- For function her2 avoid overflow in offset calculation
- For trsm when alpha == 0 and on host, allow A to be nullptr
- Fixed memory access issue in trsv
- Fixed git pre-commit script to update only AMD copyright year
- Fixed dgmm, geam test functions to set correct stride values
- For functions ssyr2k and dsyr2k allow trans == `rocblas_operation_conjugate_transpose`
- Fixed compilation error for clients-only build
- Remove Navi12 (gfx1011) from fat binary
- Option to install script for number of jobs to use for rocBLAS and Tensile compilation (`-j`, `--jobs`)
- Option to install script to build clients without using any Fortran (`--clients_no_fortran`)
- `rocblas_client_initialize` function, to perform rocBLAS initialize for clients (benchmark/test) and report the execution time
- Added tests for output of reduction functions when given bad input
- Added user specified initialization (`rand_int`/`trig_float`/`hpl`) for initializing matrices and vectors in `rocblas-bench`
- Improved performance of trsm with side == left and n == 1
- Improved performance of trsm with side == left and m <= 32 along with side == right and n <= 32
- For syrkx and trmm internal API use `rocblas_stride` datatype for offset
- For non-batched and batched gemm_ex functions if the C matrix pointer equals the D matrix pointer (aliased) their respective type and leading dimension arguments must now match
- Test client dependencies updated to GTest 1.11
- Moved non-global false positives reported by cppcheck from file-based suppression to inline suppression. File-based suppression will only be used for global false positives
- Help menu messages in `install.sh`
- For ger function, typecast the 'lda' (offset) datatype to `size_t` during offset calculation to avoid overflow and remove duplicate template functions
- Modified default initialization from `rand_int` to hpl for initializing matrices and vectors in `rocblas-bench`
- For function trmv (non-transposed cases) avoid overflow in offset calculation
- Fixed cppcheck errors/warnings
- Fixed Doxygen warnings
- Added `rocblas_get_version_string_size` convenience function
- Added `rocblas_xtrmm_outofplace`, an out-of-place version of `rocblas_xtrmm`
- Added hpl and trig initialization for `gemm_ex` to `rocblas-bench`
- Added source code gemm. It can be used as an alternative to Tensile for debugging and development
- Added option `ROCM_MATHLIBS_API_USE_HIP_COMPLEX` to opt-in to use `hipFloatComplex` and `hipDoubleComplex`
- Improved performance of non-batched and batched single-precision GER for size m > 1024. Performance enhanced by 5-10% measured on a MI100 (gfx908) GPU.
- Improved performance of non-batched and batched HER for all sizes and data types. Performance enhanced by 2-17% measured on a MI100 (gfx908) GPU.
- Instantiate templated rocBLAS functions to reduce size of `librocblas.so`
- Removed static library dependency on msgpack
- Removed boost dependencies for clients
- Option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Correctly set output of `nrm2_batched_ex` and `nrm2_strided_batched_ex` when given bad input
- Fix for dgmm with side == `rocblas_side_left` and a negative incx
- Fixed out-of-bounds read for small trsm
- Fixed numerical checking for `tbmv_strided_batched`
- Improved performance of non-batched and batched syr for all sizes and data types
- Improved performance of non-batched and batched hemv for all sizes and data types
- Improved performance of non-batched and batched symv for all sizes and data types
- Improved memory utilization in `rocblas-bench`, `rocblas-test` gemm functions, increasing possible runtime sizes.
- Improved performance of non-batched and batched dot, dotc, and dot_ex for small n, e.g. sdot n <= 31000.
- Improved performance of non-batched and batched trmv for all sizes and matrix types.
- Improved performance of non-batched and batched gemv transpose case for all sizes and datatypes.
- Improved performance of sger and dger for all sizes, in particular the larger dger sizes.
- Improved performance of syrkx for large sizes, including those in rocBLAS Issue #1184.
- Update from C++14 to C++17.
- Packaging split into a runtime package (called rocblas) and a development package (called rocblas-dev for `.deb` packages, and rocblas-devel for `.rpm` packages). The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future ROCm release.
- For function geam avoid overflow in offset calculation.
- For function syr avoid overflow in offset calculation.
- For function gemv (Transpose-case) avoid overflow in offset calculation.
- For functions ssyrk and dsyrk, allow conjugate-transpose case to match legacy BLAS. Behavior is the same as the transpose case.
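
Several fixes in this changelog avoid overflow by widening the offset computation (for example casting `lda` to `size_t`, or using `rocblas_stride`). The sketch below only illustrates why that matters and is not rocBLAS code: it assumes a column-major element offset of `row + col * lda` and emulates signed 32-bit wraparound with Python's `ctypes`. The helper names `offset_64bit` and `offset_32bit` are hypothetical.

```python
import ctypes


def offset_64bit(row, col, lda):
    """Column-major element offset computed without any width limit."""
    return row + col * lda


def offset_32bit(row, col, lda):
    """Same offset, truncated to a signed 32-bit integer to emulate overflow."""
    # ctypes integer types do no overflow checking, so the value wraps around.
    return ctypes.c_int32(row + col * lda).value


# Hypothetical sizes: 50000 * 50000 exceeds the signed 32-bit maximum (2**31 - 1).
lda, row, col = 50_000, 0, 50_000
wide = offset_64bit(row, col, lda)
narrow = offset_32bit(row, col, lda)
print("64-bit offset:", wide)
print("32-bit offset:", narrow)
```

With a narrow type the offset becomes negative, which is the kind of invalid address computation the wider offset type prevents.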
- Improved performance of non-batched and batched `rocblas_Xgemv` for gfx908 when m <= 15000 and n <= 15000
- Improved performance of non-batched and batched `rocblas_sgemv` and `rocblas_dgemv` for gfx906 when m <= 6000 and n <= 6000
- Improved the overall performance of non-batched and batched `rocblas_cgemv` for gfx906
- Improved the overall performance of `rocblas_Xtrsv`
- Internal use only APIs prefixed with `rocblas_internal_` and deprecated to discourage use
- Added option to install script to build only rocBLAS clients with a pre-built rocBLAS library
- Supported gemm ext for unpacked int8 input layout on gfx908 GPUs
- Added new flags `rocblas_gemm_flags::rocblas_gemm_flags_pack_int8x4` to specify if using the packed layout
  - Set the `rocblas_gemm_flags_pack_int8x4` when using packed int8x4; this should always be set on GPUs before gfx908.
  - For gfx908 GPUs, unpacked int8 is supported so there is no need to set this flag.
  - Note that the default flags value 0 uses unpacked int8, which changes the behaviour of int8 gemm from ROCm 4.1.0
- Added a query function `rocblas_query_int8_layout_flag` to get the preferable layout of int8 for gemm by device
- Improved performance of single precision copy, swap, and scal when `incx` == 1 and `incy` == 1
- Improved performance of single precision axpy when `incx` == 1, `incy` == 1 and `batch_count` <= 8192
- Improved performance of trmm
- Change `cmake_minimum_required` to VERSION 3.16.8
- Added Numerical checking helper function to detect zero/NaN/Inf in the input and the output vectors of rocBLAS level 1 and 2 functions
- Added Numerical checking helper function to detect zero/NaN/Inf in the input and the output general matrices of rocBLAS level 2 and 3 functions
- Fixed complex unit test bug caused by incorrect caxpy and zaxpy function signatures.
- Make functions compliant with Legacy Blas for special values `alpha` == 0, `k` == 0, `beta` == 1, `beta` == 0
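
As an illustration of the special values named in that entry, here is a minimal reference sketch of the legacy BLAS convention for `C = alpha*A*B + beta*C`, written over plain nested lists. It is not the rocBLAS implementation; the function name `gemm_reference` is hypothetical, and the rules it encodes are the usual reference-BLAS ones: when `alpha == 0` or `k == 0` the product term is skipped, `beta == 1` then leaves C untouched, and `beta == 0` overwrites C without reading it.

```python
def gemm_reference(alpha, a, b, beta, c):
    """Reference C = alpha*A*B + beta*C with legacy BLAS special cases."""
    m = len(c)
    n = len(c[0]) if c else 0
    k = len(a[0]) if a and a[0] else 0

    # Quick return: nothing to compute, C is left untouched.
    if m == 0 or n == 0 or ((alpha == 0 or k == 0) and beta == 1):
        return c

    out = []
    for i in range(m):
        row = []
        for j in range(n):
            # beta == 0 means C is overwritten, never read, so NaN/Inf
            # already stored in C does not propagate into the result.
            acc = 0.0 if beta == 0 else beta * c[i][j]
            # alpha == 0 means A and B are never read.
            if alpha != 0:
                acc += alpha * sum(a[i][p] * b[p][j] for p in range(k))
            row.append(acc)
        out.append(row)
    return out


nan = float("nan")
print(gemm_reference(2.0, [[1.0]], [[3.0]], 0.0, [[nan]]))
print(gemm_reference(0.0, [[nan]], [[nan]], 2.0, [[1.5]]))
```

Treating `beta == 0` as "overwrite" rather than "multiply by zero" is the detail that matters most in practice, because it lets callers pass an uninitialized output matrix.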
- Improved performance of single precision `axpy_batched` and `axpy_strided_batched`: `batch_count` >= 8192
- Improved performance of trmm.
- Add changelog.
- Improved performance of gemm_batched for small m, n, k and NT, NC, TN, TT, TC, CN, CT, CC
- Improved performance of gemv, gemv_batched, gemv_strided_batched: small n large m
- Removed support for legacy hcc compiler
- Add `rot_ex`, `rot_batched_ex`, and `rot_strided_batched_ex`
- Removed `-DUSE_TENSILE_HOST` from `roc::rocblas` CMake usage requirements. This is a rocblas internal variable, and does not need to be defined in user code
- Improved performance of `gemm_batched` for NN, general m, n, k, small m, n, k
- Slight improvements to FP16 Megatron BERT performance on MI50
- Improvements to FP16 Transformer performance on MI50
- Slight improvements to FP32 Transformer performance on MI50
- Improvements to FP32 DLRM Terabyte performance on gfx908
- added two functions:
  - `rocblas_status rocblas_set_atomics_mode(rocblas_atomics_mode mode)`
  - `rocblas_status rocblas_get_atomics_mode(rocblas_atomics_mode mode)`
- added enum `rocblas_atomics_mode`. It can have two values:
  - `rocblas_atomics_allowed`
  - `rocblas_atomics_not_allowed`

  The default is `rocblas_atomics_not_allowed`
- function `rocblas_Xdgmm` algorithm corrected and `incx`=0 support added
- dependencies:
  - `rocblas-tensile` internal component requires msgpack instead of LLVM
- Moved the following files from /opt/rocm/include to /opt/rocm/include/internal:
  - rocblas-auxillary.h
  - rocblas-complex-types.h
  - rocblas-functions.h
  - rocblas-types.h
  - rocblas-version.h
  - rocblas_bfloat16.h

  These files should NOT be included directly as this may lead to errors. Instead, `/opt/rocm/include/rocblas.h` should be included directly. `/opt/rocm/include/rocblas_module.f90` can also be directly used
- Improvements to `rocblas_Xgemm_batched` performance for small m, n, k
- Improvements to `rocblas_Xgemv_batched` and `rocblas_Xgemv_strided_batched` performance for small m (QMCPACK use)
- Improvements to `rocblas_Xdot` (batched and non-batched) performance when both incx and incy are 1
- Improvements to FP32 ONNX BERT performance for MI50
- Significant improvements to FP32 Resnext, Inception Convolution performance for gfx908
- Slight improvements to FP32 DLRM Terabyte performance for gfx908
- Significant improvements to FP32 BDAS performance for gfx908
- Significant improvements to FP32 BDAS performance for MI50 and MI60
- Added substitution method for small trsm sizes with m <= 64 && n <= 64. Increases performance drastically for small batched trsm
- Improvements to User Guide and Design Document
- L1 dot function optimized to utilize shuffle instructions (improvements on bf16, f16, f32 data types)
- L1 dot function added x dot x optimized kernel
- Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
- Adjustments for hipcc (hip-clang) compiler as standard build compiler and Centos8 support
- Added Fortran interface for all rocBLAS functions
- add `geam` complex, `geam_batched`, and `geam_strided_batched`
- add `dgmm`, `dgmm_batched`, and `dgmm_strided_batched`
- Optimized performance
  - ger
    - `rocblas_sger`, `rocblas_dger`
    - `rocblas_sger_batched`, `rocblas_dger_batched`
    - `rocblas_sger_strided_batched`, `rocblas_dger_strided_batched`
  - geru
    - `rocblas_cgeru`, `rocblas_zgeru`
    - `rocblas_cgeru_batched`, `rocblas_zgeru_batched`
    - `rocblas_cgeru_strided_batched`, `rocblas_zgeru_strided_batched`
  - gerc
    - `rocblas_cgerc`, `rocblas_zgerc`
    - `rocblas_cgerc_batched`, `rocblas_zgerc_batched`
    - `rocblas_cgerc_strided_batched`, `rocblas_zgerc_strided_batched`
  - symv
    - `rocblas_ssymv`, `rocblas_dsymv`, `rocblas_csymv`, `rocblas_zsymv`
    - `rocblas_ssymv_batched`, `rocblas_dsymv_batched`, `rocblas_csymv_batched`, `rocblas_zsymv_batched`
    - `rocblas_ssymv_strided_batched`, `rocblas_dsymv_strided_batched`, `rocblas_csymv_strided_batched`, `rocblas_zsymv_strided_batched`
  - sbmv
    - `rocblas_ssbmv`, `rocblas_dsbmv`
    - `rocblas_ssbmv_batched`, `rocblas_dsbmv_batched`
    - `rocblas_ssbmv_strided_batched`, `rocblas_dsbmv_strided_batched`
  - spmv
    - `rocblas_sspmv`, `rocblas_dspmv`
    - `rocblas_sspmv_batched`, `rocblas_dspmv_batched`
    - `rocblas_sspmv_strided_batched`, `rocblas_dspmv_strided_batched`
- improved documentation.
- Fix argument checking in functions to match legacy BLAS.
- Fixed conjugate-transpose version of geam.
- Compilation for GPU Targets: When using the install.sh script for "all" GPU Targets, which is the default, you must first set an environment variable `HCC_AMDGPU_TARGET` listing the GPU targets, e.g. `HCC_AMDGPU_TARGET=gfx803,gfx900,gfx906,gfx908`. If building for a specific architecture(s) using the `-a` | --architecture flag, you should also set the environment variable `HCC_AMDGPU_TARGET` to match. Mismatching the environment variable to the `-a` flag architectures creates builds that may result in `SEGFAULTS` when running on GPUs which weren't specified.
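
Because a mismatch between `HCC_AMDGPU_TARGET` and the `-a` architectures is easy to miss, a small pre-build check can help. The sketch below is not part of rocBLAS or `install.sh`; the helpers `parse_targets` and `targets_match` are hypothetical, and the only assumption taken from the note above is that `HCC_AMDGPU_TARGET` is a comma-separated list of GPU targets.

```python
import os


def parse_targets(value):
    """Split a comma-separated GPU target list into a set of names."""
    return {t.strip() for t in value.split(",") if t.strip()}


def targets_match(architectures, env=None):
    """Return True when HCC_AMDGPU_TARGET lists exactly the given architectures.

    `architectures` is an iterable of target names, i.e. whatever you intend
    to pass to the -a flag; `env` defaults to the process environment.
    """
    env = os.environ if env is None else env
    wanted = {a.strip() for a in architectures if a.strip()}
    return parse_targets(env.get("HCC_AMDGPU_TARGET", "")) == wanted


# Example with an explicit environment mapping instead of os.environ.
example_env = {"HCC_AMDGPU_TARGET": "gfx906,gfx908"}
print(targets_match(["gfx906", "gfx908"], example_env))
print(targets_match(["gfx906"], example_env))
```

Comparing sets rather than strings means ordering and stray whitespace in the variable do not cause false alarms.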