Releases: ROCm/rocBLAS

rocBLAS-2.30.0 for ROCm 3.9.0

27 Oct 20:13
91e553c

New Features

  • Slight improvements to FP16 Megatron BERT performance on MI50
  • Improvements to FP16 Transformer performance on MI50
  • Slight improvements to FP32 Transformer performance on MI50

Known Issues

  • None

rocBLAS-2.28.0 for ROCm 3.8.0

18 Sep 21:32
8a77094

New Features

  • atomics_mode functions added:
    • rocblas_status rocblas_set_atomics_mode(rocblas_atomics_mode mode);
    • rocblas_status rocblas_get_atomics_mode(rocblas_atomics_mode mode);
  • added enum rocblas_atomics_mode. It can have two values:
    rocblas_atomics_allowed
    rocblas_atomics_not_allowed
    The default is rocblas_atomics_not_allowed
  • function rocblas_Xdgmm algorithm corrected and incx=0 support added
  • Additional dependencies needed:
    rocblas-tensile internal component requires msgpack instead of LLVM
  • Moved the following files from /opt/rocm/include to /opt/rocm/include/internal:
    rocblas-auxiliary.h
    rocblas-complex-types.h
    rocblas-functions.h
    rocblas-types.h
    rocblas-version.h
    rocblas_bfloat16.h
    These files should NOT be included directly, as this may lead to errors. Instead, include /opt/rocm/include/rocblas.h. /opt/rocm/include/rocblas_module.f90 can also be used directly.
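
The release notes do not say why an atomics mode exists. One likely motivation is reproducibility: kernels that accumulate with atomic adds can combine partial sums in a different order from run to run, and floating-point addition is not associative. The sketch below is plain host-side C++, not rocBLAS code, and the function names are hypothetical; it only shows that summation order can change a single-precision result.

```cpp
#include <cassert>
#include <vector>

// Sum the values in the order given, accumulating in single precision.
float sum_in_order(const std::vector<float>& values) {
    float acc = 0.0f;
    for (float v : values) {
        acc += v;
    }
    return acc;
}

// Self-check: the same three numbers produce two different float sums,
// because 1.0f is lost when it is added to 1e8f first.
void check_order_dependence() {
    assert(sum_in_order({1e8f, -1e8f, 1.0f}) == 1.0f);
    assert(sum_in_order({1e8f, 1.0f, -1e8f}) == 0.0f);
}
```

With atomics disallowed, a library can fix the accumulation order, which typically trades some performance for run-to-run identical results.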

Known Issues

  • None

rocBLAS-2.26.0 for ROCm 3.7.0

15 Aug 04:26
9d98138

New Features

  • Improvements to User Guide and Design Document
  • L1 dot function optimized to utilize shuffle instructions (improvements on bf16, f16, f32 data types)
  • Added an optimized kernel to the L1 dot function for the x dot x case
  • Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
  • Adjustments for hipcc (hip-clang) compiler as standard build compiler and CentOS 8 support
  • Added Fortran interface for all rocBLAS functions
  • Improvements to rocblas_Xgemm_batched performance for small m, n, k.
  • Improvements to rocblas_Xgemv_batched and rocblas_Xgemv_strided_batched performance for small m (QMCPACK use).
  • Improvements to rocblas_Xdot (batched and non-batched) performance when both incx and incy are 1
  • Improvements to FP32 ONNX BERT performance for MI50
  • Significant improvements to FP32 Resnext, Inception Convolution performance for gfx908
  • Slight improvements to FP32 DLRM Terabyte performance for gfx908
  • Significant improvements to FP32 BDAS performance for gfx908
  • Significant improvements to FP32 BDAS performance for MI50 and MI60
  • Added substitution method for small trsm sizes with m <= 64 && n <= 64. Increases performance drastically for small batched trsm.
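
As an illustration of the substitution method mentioned in the last item, here is a minimal host-side sketch of forward substitution for one trsm case: left side, lower triangular, no transpose, non-unit diagonal, column-major storage. It is not the rocBLAS kernel; the function name and argument layout are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

// Solve L * X = B in place, overwriting B with X.
// L is m x m lower triangular, B is m x n, both stored column-major.
void trsm_lower_substitution(int m, int n,
                             const std::vector<double>& L, int lda,
                             std::vector<double>& B, int ldb) {
    assert(lda >= m && ldb >= m);
    for (int j = 0; j < n; ++j) {          // each right-hand-side column
        for (int i = 0; i < m; ++i) {      // rows from top to bottom
            double s = B[i + j * ldb];
            for (int k = 0; k < i; ++k) {  // subtract already-solved terms
                s -= L[i + k * lda] * B[k + j * ldb];
            }
            assert(L[i + i * lda] != 0.0); // non-singular diagonal required
            B[i + j * ldb] = s / L[i + i * lda];
        }
    }
}
```

For small m and n the whole triangle fits in fast memory, so direct substitution avoids the overhead of blocked or inversion-based approaches; that is presumably why it helps small batched problems.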

Known Issues

  • None

rocBLAS-2.22.0 for ROCm 3.5.0

10 Jul 22:50

Changelist

  • add geam complex, geam_batched, and geam_strided_batched
  • add dgmm, dgmm_batched, and dgmm_strided_batched
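
For readers unfamiliar with dgmm, it multiplies a matrix by a diagonal matrix whose diagonal is stored as a vector. The reference sketch below runs on the host and is not the rocBLAS implementation; the enum, function name, and the restriction to incx >= 0 are assumptions made for this example.

```cpp
#include <cassert>
#include <vector>

enum class Side { Left, Right };

// Side::Left:  C = diag(x) * A  (x supplies m diagonal entries)
// Side::Right: C = A * diag(x)  (x supplies n diagonal entries)
// A and C are m x n, column-major. Only incx >= 0 is handled here;
// with incx == 0 every diagonal entry is x[0].
void dgmm_reference(Side side, int m, int n,
                    const std::vector<float>& A, int lda,
                    const std::vector<float>& x, int incx,
                    std::vector<float>& C, int ldc) {
    assert(lda >= m && ldc >= m && incx >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            const int d = (side == Side::Left) ? i : j; // diagonal index
            C[i + j * ldc] = A[i + j * lda] * x[d * incx];
        }
    }
}
```

Scaling rows corresponds to the left side and scaling columns to the right side, so no full matrix product is needed.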

Optimized performance

  • ger

    • rocblas_sger, rocblas_dger
    • rocblas_sger_batched, rocblas_dger_batched
    • rocblas_sger_strided_batched, rocblas_dger_strided_batched
  • geru

    • rocblas_cgeru, rocblas_zgeru
    • rocblas_cgeru_batched, rocblas_zgeru_batched
    • rocblas_cgeru_strided_batched, rocblas_zgeru_strided_batched
  • gerc

    • rocblas_cgerc, rocblas_zgerc
    • rocblas_cgerc_batched, rocblas_zgerc_batched
    • rocblas_cgerc_strided_batched, rocblas_zgerc_strided_batched
  • symv

    • rocblas_ssymv, rocblas_dsymv, rocblas_csymv, rocblas_zsymv
    • rocblas_ssymv_batched, rocblas_dsymv_batched, rocblas_csymv_batched, rocblas_zsymv_batched
    • rocblas_ssymv_strided_batched, rocblas_dsymv_strided_batched, rocblas_csymv_strided_batched, rocblas_zsymv_strided_batched
  • sbmv

    • rocblas_ssbmv, rocblas_dsbmv
    • rocblas_ssbmv_batched, rocblas_dsbmv_batched
    • rocblas_ssbmv_strided_batched, rocblas_dsbmv_strided_batched
  • spmv

    • rocblas_sspmv, rocblas_dspmv
    • rocblas_sspmv_batched, rocblas_dspmv_batched
    • rocblas_sspmv_strided_batched, rocblas_dspmv_strided_batched
  • improved documentation

  • Fix argument checking in functions to match legacy BLAS

  • Fixed conjugate-transpose version of geam

Known failures

  • Compilation for GPU Targets
    • When using the install.sh script for "all" GPU Targets, which is the default, you must first set an environment variable HCC_AMDGPU_TARGET listing the GPU targets, e.g. HCC_AMDGPU_TARGET=gfx803,gfx900,gfx906,gfx908
    • If building for specific architectures using the -a | --architecture flag, you should also set the environment variable HCC_AMDGPU_TARGET to match.
    • If the environment variable does not match the architectures given to the -a flag, the resulting build may cause segmentation faults when run on GPUs that were not specified.

rocBLAS-2.24.0 for ROCm 3.6.0

11 Jul 00:38

New Features

  • Improvements to User Guide and Design Document
  • L1 dot function optimized to utilize shuffle instructions (improvements on bf16, f16, f32 data types)
  • Added an optimized kernel to the L1 dot function for the x dot x case
  • Standardization of L1 rocblas-bench to use device pointer mode to focus on GPU memory bandwidth
  • Adjustments for hipcc (hip-clang) compiler as standard build compiler and CentOS 8 support
  • Added Fortran interface for all rocBLAS functions
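
The shuffle-based dot optimization listed above can be pictured as a tree reduction in which each lane of a wavefront adds a value read from a lane a fixed offset away. The sketch below simulates that pattern on the host with a vector standing in for the lanes' registers. It is not the rocBLAS kernel; the function names and the power-of-two lane count are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simulate a shuffle-down reduction: lanes[i] plays the role of lane i's
// register. After the loop, lane 0 holds the sum of all lanes.
float shuffle_down_reduce(std::vector<float> lanes) {
    const std::size_t width = lanes.size();
    assert(width > 0 && (width & (width - 1)) == 0); // power of two
    for (std::size_t offset = width / 2; offset > 0; offset /= 2) {
        // All lanes read their neighbour's old value "simultaneously".
        std::vector<float> shifted(width);
        for (std::size_t i = 0; i < width; ++i) {
            shifted[i] = (i + offset < width) ? lanes[i + offset] : lanes[i];
        }
        for (std::size_t i = 0; i < width; ++i) {
            lanes[i] += shifted[i];
        }
    }
    return lanes[0];
}

// The x dot x case needs only one input vector: square, then reduce.
float dot_self(const std::vector<float>& x) {
    std::vector<float> lanes(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        lanes[i] = x[i] * x[i];
    }
    return shuffle_down_reduce(lanes);
}
```

On a GPU the exchange happens in registers, so the reduction needs no shared-memory traffic; reading a single vector for x dot x also halves the memory loads compared with a general dot.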

Known Issues

  • None

rocBLAS-2.2.0

28 Feb 22:11

Changelist:

  • Fix compilation of TRSV, IAMAX, IAMIN
  • Add TRSM test sizes
  • Fix false negative precision failures for f16_r gemm_ex tests
  • Improvements to documentation and addition of sample for i8_r/i32_r gemm_ex
  • Tuning for i8_r/i32_r gemm_ex for MIOpen
  • Add gtest ConfigurableEventListener to reduce Jenkins log file size
  • Initial refactoring of rocblas-bench
  • rocblas_dgemm NT tuning

rocBLAS-2.1.0

01 Feb 02:27

Changelist:

  • Refactor rocBLAS test framework
  • Improved performance of i8_r/i32_r rocblas_gemm_ex on gfx906
  • Addition of simple trsv implementation using trsm
  • Improved performance of trsm
  • Tuning improvements for resnet50 problems
  • Update tuning to use new Tensile solution selection logic
  • rocblas_gemm_ex performance improvement when ldd == ldc and strideD == strideC
  • Bug fixes for IAMIN and TRSV
  • Add Sphinx-based Read the Docs documentation

rocBLAS-2.0.0 for ROCm 2.0

19 Dec 19:46

Changelist:

  • improved performance of fp16/fp32 rocblas_gemm_ex on gfx906
  • support for i8/i32 rocblas_gemm_ex
  • update vega-10 resnet50 tuning
  • refactor testing to be data driven
  • change gemm-ex API solution index from uint32_t to int32_t
  • disable gemm and gemm_ex chunking
  • fix gemv argument checking
  • add performance script for p1b1 benchmark sizes
  • refactor gemm code to reduce use of macros
  • trsm performance regression fix
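
To make the i8/i32 support concrete: the inputs are 8-bit integers while products are accumulated and stored as 32-bit integers, so sums that overflow int8 remain exact. The host-side reference below is a sketch under stated assumptions, not the rocBLAS implementation: the function name is hypothetical, storage is plain column-major without transposes, and any data packing the library may require is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// D = alpha * A * B + beta * C
// A is m x k and B is k x n (int8); C and D are m x n (int32).
void gemm_i8_i32_reference(int m, int n, int k, std::int32_t alpha,
                           const std::vector<std::int8_t>& A, int lda,
                           const std::vector<std::int8_t>& B, int ldb,
                           std::int32_t beta,
                           const std::vector<std::int32_t>& C, int ldc,
                           std::vector<std::int32_t>& D, int ldd) {
    assert(lda >= m && ldb >= k && ldc >= m && ldd >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            std::int32_t acc = 0; // accumulate in 32 bits
            for (int p = 0; p < k; ++p) {
                acc += static_cast<std::int32_t>(A[i + p * lda]) *
                       static_cast<std::int32_t>(B[p + j * ldb]);
            }
            D[i + j * ldd] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

For example, a 1 x 2 row of 127s times a 2 x 1 column of 127s gives 32258, far outside the int8 range but exact in int32.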

rocBLAS-14.3.0 for ROCm 1.9

12 Oct 03:00

Changelist:

  • add rocblas_gemm_strided_batched_ex for mixed precision support
  • tested on ROCm 1.9
  • fix chunking of A and B matrices
  • expand testing of rocblas_gemm
  • sgemm and hgemm tuning on gfx906 for Resnet50 from Tensile V4.6.0
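
A strided batched GEMM treats one allocation as a sequence of matrices, with matrix b starting at base + b * stride. The host-side sketch below shows that addressing, plus one possible reading of mixed precision: float storage with double accumulation. It is not the rocBLAS implementation; the function name, the omission of alpha, beta, and transposes, and the choice of types are all assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// For each batch index b: C_b = A_b * B_b, column-major.
// Matrix b of each operand begins b * stride elements after its base.
void gemm_strided_batched_reference(
    int m, int n, int k,
    const std::vector<float>& A, int lda, std::size_t stride_a,
    const std::vector<float>& B, int ldb, std::size_t stride_b,
    std::vector<float>& C, int ldc, std::size_t stride_c,
    int batch_count) {
    assert(lda >= m && ldb >= k && ldc >= m);
    for (int b = 0; b < batch_count; ++b) {
        const float* a  = A.data() + b * stride_a;
        const float* bm = B.data() + b * stride_b;
        float* c        = C.data() + b * stride_c;
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                double acc = 0.0; // wider compute type than storage type
                for (int p = 0; p < k; ++p) {
                    acc += static_cast<double>(a[i + p * lda]) *
                           static_cast<double>(bm[p + j * ldb]);
                }
                c[i + j * ldc] = static_cast<float>(acc);
            }
        }
    }
}
```

A stride larger than the matrix size leaves padding between batch entries untouched, which lets callers embed the matrices in a larger buffer.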

Known failures:

  • known dgemm failures for m,n < 16

enable gfx906 support

21 Sep 17:44
8490ca9

A small incremental release to enable gfx906 support. To get gfx906 support, ROCm 1.9 or later must be used to build rocBLAS.