Releases: CNugteren/CLBlast

CLBlast 1.6.3

13 Jun 17:50
2a08197

CLBlast version 1.6.3. Changes since previous release (version 1.6.2):

  • Fixed a bug in the GEMMK=1 kernel (with 2D register tiling) when MWG!=NWG
  • CMake fixes for older versions and for the CUDA backend
  • Added tuned parameters for many devices (see doc/tuning.md)

CLBlast 1.6.2

09 Feb 20:40
faa2109

CLBlast version 1.6.2. Changes since previous release (version 1.6.1):

  • Fix a bug in the pre-processor that would cause issues on Arm GPUs
  • Fix DLL install directory in MinGW
  • Modifications to the Python bindings (pyclblast)
    • Convert float scalar values to cl_half for fp16 routines
    • The amax/amin and max/min routines accept unsigned integer buffers for the index
    • Switch to pyproject.toml file for installing Python bindings
    • Build Python bindings using CMake, adding Windows support
  • Generator script now always uses LF line endings, independent of the platform
  • Added tuned parameters for many devices (see doc/tuning.md)

CLBlast 1.6.1

09 Jul 09:30
e3ce21b

CLBlast version 1.6.1. Changes since previous release (version 1.6.0):

  • Fix pointer error in pyclblast on ARM
  • Fix a multithreading bug related to storing objects in the cache
  • Added tuned parameters for many devices (see doc/tuning.md)

CLBlast 1.6.0

21 May 19:22
b0b3028

CLBlast version 1.6.0. Changes since previous release (version 1.5.3):

  • Improved performance on Qualcomm Adreno GPUs:
    • Unique database entries for specific Adreno devices
    • Toggle OpenCL kernel compilation options for Adreno
    • New preprocessor directive RELAX_WORKGROUP_SIZE
  • Fixed a bug in handling of #undef in CLBlast loop unrolling and array-to-register mapping functions
  • Fixed a bug in XAMAX/XAMIN routines related to inadvertently including the increment and offset in the result
  • Fixed a bug in XAMAX/XAMIN routines that would cause only the real part of a complex number to be taken into account
  • Fixed a bug that caused tests to not properly do integer-output testing (for XAMAX/XAMIN)
  • Fixed a minor issue with the expected input buffer size in the TRMV/TBMV/TPMV/TRSV routines
  • Fixed an issue with crashes on Android related to calling clReleaseProgram
  • Fixed two small issues in the plotting script
  • Fixed a documentation bug in the 'ld' requirements
  • Enabled GitHub Actions CI builds for testing and releasing
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see doc/tuning.md)

CLBlast 1.5.3

29 Sep 18:46
d55840e

CLBlast version 1.5.3. Changes since previous release (version 1.5.2):

  • Fix a correctness issue with DGEMM on SM 7.5 Turing GPUs
  • Update cl.hpp to the new opencl.hpp header in the samples
  • Changed the complex sum routine to return the complex sum instead of the absolute complex sum
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see doc/tuning.md)
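
The change to the complex sum routine can be illustrated with a small pure-Python sketch. It does not call CLBlast; the function names are hypothetical, and the reading of "absolute complex sum" as the BLAS ASUM convention (sum of |re| + |im| per element) is an assumption, since the release note does not define it.

```python
def complex_sum(values):
    """Plain complex sum: real and imaginary parts are accumulated separately."""
    total = 0j
    for v in values:
        total += v
    return total


def complex_abs_sum(values):
    """Assumed 'absolute complex sum': sum of |re| + |im| (BLAS ASUM convention)."""
    return sum(abs(v.real) + abs(v.imag) for v in values)


x = [1 + 2j, -3 + 1j, 2 - 4j]
print(complex_sum(x))      # -1j
print(complex_abs_sum(x))  # 13.0
```

Under this reading, the two results differ whenever the elements have mixed signs, which is why the change is visible to callers.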

CLBlast 1.5.2

20 Jan 13:22

CLBlast version 1.5.2. Changes since previous release (version 1.5.1):

  • Changed XAMAX/XAMIN to be more likely to return the first rather than the last min/max index; updated the API docs
  • Added batched routines to pyclblast
  • Added CLBLAST_VERSION_MAJOR/MINOR/PATCH defines in headers to store version numbering
  • Several small improvements to the benchmark script (thanks to 'baryluk')
  • Fixed a bug in the caching when using a context with multiple devices
  • Fixed a bug in the tuners related to global workgroup size not being a multiple of the local
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see doc/tuning.md)
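
The XAMAX/XAMIN tie-breaking change can be pictured with a sequential reference sketch in pure Python. This is not CLBlast code: the function names and the zero-based indices are illustrative assumptions, and the release note only says the first index is "more likely", so a parallel implementation may still behave differently on ties.

```python
def iamax_first(x):
    """Index of the largest absolute value; ties resolve to the FIRST occurrence."""
    if not x:
        raise ValueError("x must be non-empty")
    best = 0
    for i in range(1, len(x)):
        # Strict '>' keeps the earlier index when magnitudes are equal.
        if abs(x[i]) > abs(x[best]):
            best = i
    return best


def iamax_last(x):
    """Same search, but ties resolve to the LAST occurrence."""
    if not x:
        raise ValueError("x must be non-empty")
    best = 0
    for i in range(1, len(x)):
        # '>=' moves to the later index when magnitudes are equal.
        if abs(x[i]) >= abs(x[best]):
            best = i
    return best


data = [1.0, -5.0, 3.0, 5.0]
print(iamax_first(data))  # 1
print(iamax_last(data))   # 3
```

The two variants only disagree when the maximum magnitude occurs more than once, as with -5.0 and 5.0 here.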

CLBlast 1.5.1

18 Feb 09:38

CLBlast version 1.5.1. Changes since previous release (version 1.5.0):

  • Implemented single-kernel version of convolution as GEMM
  • Now catches all exceptions thrown by the tuners
  • Fixed a bug in ISAMIN kernel
  • Fixed an out-of-bounds read/write in the XHAD routine (thanks to etomzak)
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see doc/tuning.md)

CLBlast 1.5.0

04 Dec 21:10

CLBlast version 1.5.0. Changes since previous release (version 1.4.1):

  • Added support for shuffle instructions for NVIDIA GPUs (thanks to 'tyler-utah')
  • Added an option to compile the Netlib API with static OpenCL device and context (-DNETLIB_PERSISTENT_OPENCL=ON)
  • Added a FAQ page to the documentation
  • The tuners now check beforehand for invalid local thread sizes and skip them entirely
  • Made the tuning API (OverrideParameters) more flexible, disregarding superfluous parameters
  • Fixed an issue with conjugate transpose not being executed in certain cases for XOMATCOPY, among others
  • Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
  • Fixed an issue with the preprocessor and the new GEMMK == 1 kernel
  • Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
  • Fixed an issue for certain parameters for AXPY's 'XaxpyFaster' kernel
  • Various minor fixes and enhancements
  • Added non-BLAS routines:
    • SCONVGEMM/DCONVGEMM/HCONVGEMM (convolution as im2col followed by batched GEMM)
    • SCOL2IM/DCOL2IM/CCOL2IM/ZCOL2IM/HCOL2IM (col2im transform as used in machine learning)

CLBlast 1.4.1

14 Jul 10:30

CLBlast version 1.4.1 (bugfix release). Changes since previous release (version 1.4.0):

  • Fixed an access violation under Windows upon releasing the OpenCL program when the driver is already unloaded
  • Fixed an issue with double cl_program release in the CLBlast caching system
  • Added tuned parameters for various devices (see doc/tuning.md)

CLBlast 1.4.0

03 Jun 11:27

CLBlast version 1.4.0. Changes since previous release (version 1.3.0):

  • Added a Python interface to CLBlast ('PyCLBlast')
  • Added CLBlast to Ubuntu PPA and macOS Homebrew package managers
  • Added an API to run the tuners programmatically without any I/O
  • Improved the performance potential by adding a second tunable GEMM kernel with 2D register tiling
  • Added support for Intel-specific subgroup shuffling extensions for faster GEMM on Intel GPUs
  • Re-added a local memory size constraint to the tuners
  • The routine tuners now automatically pick up tuning results from disk from the kernel tuners
  • Updated and reorganised the CLBlast documentation
  • Added a 'canary' region to check for overflows in the tuner and tests (inspired by clARMOR)
  • Added an option to test against and compare performance with Intel's MKL
  • Fixed an access violation when compiled with Visual Studio upon releasing the OpenCL program
  • Fixed incorrect releasing of the OpenCL program resulting in segfaults / access violations
  • Various minor fixes and enhancements
  • Added tuned parameters for various devices (see doc/tuning.md)
  • Added non-BLAS level-1 routines:
    • SHAD/DHAD/CHAD/ZHAD/HHAD (Hadamard element-wise vector-vector product)
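
As a rough illustration of what the Hadamard routines compute, here is a pure-Python reference sketch. It does not use CLBlast or OpenCL; the function name `hadamard` is hypothetical, and the `alpha`/`beta` scaling is an assumption modeled on BLAS conventions rather than something stated in these release notes. With `alpha=1` and `beta=0` it reduces to the plain element-wise product.

```python
def hadamard(alpha, x, y, beta, z):
    """Return alpha * x[i] * y[i] + beta * z[i] for every element i."""
    if not (len(x) == len(y) == len(z)):
        raise ValueError("x, y and z must have the same length")
    return [alpha * xi * yi + beta * zi for xi, yi, zi in zip(x, y, z)]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
z = [1.0, 1.0, 1.0]

print(hadamard(1.0, x, y, 0.0, z))  # [4.0, 10.0, 18.0]
print(hadamard(2.0, x, y, 1.0, z))  # [9.0, 21.0, 37.0]
```

Unlike a dot product, no reduction is performed: the result has the same length as the inputs.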