Releases: CNugteren/CLBlast
CLBlast 1.6.3
CLBlast version 1.6.3. Changes since previous release (version 1.6.2):
- Fixed a bug in the GEMMK=1 kernel (with 2D register tiling) when MWG!=NWG
- CMake fixes for older versions and for the CUDA backend
- Added tuned parameters for many devices (see doc/tuning.md)
CLBlast 1.6.2
CLBlast version 1.6.2. Changes since previous release (version 1.6.1):
- Fix a bug in the pre-processor that would cause issues on Arm GPUs
- Fix DLL install directory in mingw
- Modifications to the Python bindings (pyclblast)
- Convert float scalar values to cl_half for fp16 routines
- Amax/amin, max/min routines accept unsigned integer buffers for index
- Switch to pyproject.toml file for installing Python bindings
- Build Python bindings using CMake, adding Windows support
- Generator script now always uses LF endings, independent of the platform
- Added tuned parameters for many devices (see doc/tuning.md)
CLBlast 1.6.1
CLBlast version 1.6.1. Changes since previous release (version 1.6.0):
- Fix pointer error in pyclblast on ARM
- Fix a multithreading bug related to storing objects in the cache
- Added tuned parameters for many devices (see doc/tuning.md)
CLBlast 1.6.0
CLBlast version 1.6.0. Changes since previous release (version 1.5.3):
- Improved performance on Qualcomm Adreno GPUs:
  - Unique database entries for specific Adreno devices
  - Toggle OpenCL kernel compilation options for Adreno
  - New preprocessor directive RELAX_WORKGROUP_SIZE
- Fixed a bug in handling of #undef in CLBlast loop unrolling and array-to-register mapping functions
- Fixed a bug in XAMAX/XAMIN routines related to inadvertently including the increment and offset in the result
- Fixed a bug in XAMAX/XAMIN routines that would cause only the real part of a complex number to be taken into account
- Fixed a bug that caused tests to not properly do integer-output testing (for XAMAX/XAMIN)
- Fixed a minor issue with the expected input buffer size in the TRMV/TBMV/TPMV/TRSV routines
- Fixed an issue with crashes on Android related to calling clReleaseProgram
- Fixed two small issues in the plotting script
- Fixed a documentation bug in the 'ld' requirements
- Enabled GitHub Actions CI builds for testing and releasing
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
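The XAMAX/XAMIN fixes in this release concern what the routine returns: an index into the logical vector (not a buffer position that includes the offset and increment), computed from both parts of a complex element. The following is a minimal reference sketch in Python, not CLBlast code. The helper name `reference_iamax` is hypothetical, and ranking complex values by |re| + |im| is an assumption borrowed from the reference-BLAS convention rather than something stated in these notes.

```python
def reference_iamax(buffer, n, offset=0, inc=1):
    """Return the 0-based index, within the logical vector, of the element
    with the largest magnitude. Hypothetical helper; assumes n >= 1."""
    def magnitude(value):
        # Assumption: complex values are ranked by |re| + |im| (reference-BLAS style),
        # so the imaginary part is never ignored.
        if isinstance(value, complex):
            return abs(value.real) + abs(value.imag)
        return abs(value)

    best_index = 0
    best_value = magnitude(buffer[offset])
    for k in range(1, n):
        # The buffer position uses offset and inc; the returned index k does not.
        candidate = magnitude(buffer[offset + k * inc])
        if candidate > best_value:  # strict '>' keeps the first maximum on ties
            best_index, best_value = k, candidate
    return best_index


# Logical vector is buffer[2], buffer[4], buffer[6] = 1, -7, 3
print(reference_iamax([9, 9, 1, 9, -7, 9, 3], n=3, offset=2, inc=2))  # prints 1
# Complex case: the second element wins only because its imaginary part counts
print(reference_iamax([1 + 0j, 0.5 + 2j], n=2))  # prints 1
```

A reference like this is the kind of check an integer-output test can compare against, since the expected result is an index rather than a floating-point value.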
CLBlast 1.5.3
CLBlast version 1.5.3. Changes since previous release (version 1.5.2):
- Fix a correctness issue with DGEMM on SM 7.5 Turing GPUs
- Update cl.hpp to the new opencl.hpp header in the samples
- Changed the complex sum routine to return the complex sum instead of the absolute complex sum
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
CLBlast 1.5.2
CLBlast version 1.5.2. Changes since previous release (version 1.5.1):
- Changed XAMAX/XAMIN to be more likely to return the first rather than the last min/max index, updated API docs
- Added batched routines to pyclblast
- Added CLBLAST_VERSION_MAJOR/MINOR/PATCH defines in headers to store version numbering
- Several small improvements to the benchmark script (thanks to 'baryluk')
- Fixed a bug in the caching when using a context with multiple devices
- Fixed a bug in the tuners related to global workgroup size not being a multiple of the local
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
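The tuner fix in this release relates to a general OpenCL constraint: when an explicit local work size is passed, the global work size normally has to be a whole multiple of it (non-uniform work-groups only exist from OpenCL 2.0 onwards). One common way to satisfy this is to round the global size up and let the surplus work-items exit early. The sketch below illustrates that rounding only; the helper name is hypothetical and this is not taken from the CLBlast tuner source.

```python
def round_up_global_size(global_size, local_size):
    """Return the smallest multiple of local_size that is >= global_size.
    Hypothetical helper for illustration."""
    if local_size <= 0:
        raise ValueError("local_size must be positive")
    # Integer ceiling division, then scale back up to a multiple of local_size.
    return ((global_size + local_size - 1) // local_size) * local_size


print(round_up_global_size(1000, 64))  # prints 1024
print(round_up_global_size(1024, 64))  # prints 1024 (already a multiple)
```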
CLBlast 1.5.1
CLBlast version 1.5.1. Changes since previous release (version 1.5.0):
- Implemented single-kernel version of convolution as GEMM
- Now catches all exceptions thrown by the tuners
- Fixed a bug in ISAMIN kernel
- Fixed an out-of-bounds read/write in the XHAD routine (thanks to etomzak)
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
CLBlast 1.5.0
CLBlast version 1.5.0. Changes since previous release (version 1.4.1):
- Added support for shuffle instructions for NVIDIA GPUs (thanks to 'tyler-utah')
- Added an option to compile the Netlib API with static OpenCL device and context (-DNETLIB_PERSISTENT_OPENCL=ON)
- Added a FAQ page to the documentation
- The tuners now check beforehand for invalid local thread sizes and skip those completely
- Made the tuning API (OverrideParameters) more flexible, disregarding superfluous parameters
- Fixed an issue with conjugate transpose not being executed in certain cases for, among others, XOMATCOPY
- Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
- Fixed an issue with the preprocessor and the new GEMMK == 1 kernel
- Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
- Fixed an issue for certain parameters for AXPY's 'XaxpyFaster' kernel
- Various minor fixes and enhancements
- Added non-BLAS routines:
  - SCONVGEMM/DCONVGEMM/HCONVGEMM (convolution as im2col followed by batched GEMM)
  - SCOL2IM/DCOL2IM/CCOL2IM/ZCOL2IM/HCOL2IM (col2im transform as used in machine learning)
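The idea behind "convolution as im2col followed by GEMM" can be sketched in a few lines of plain Python. This is an illustration of the technique only, under simplifying assumptions that are not from the release notes: a single 2D image, one channel, one filter, stride 1, no padding, and cross-correlation (no kernel flip). The function names are hypothetical and unrelated to the CLBlast API.

```python
def im2col(image, kernel_h, kernel_w):
    """Unfold every kernel-sized patch of a 2D image into one column."""
    height, width = len(image), len(image[0])
    out_h, out_w = height - kernel_h + 1, width - kernel_w + 1
    # Row r corresponds to kernel element r; column p to output pixel p.
    columns = [[0] * (out_h * out_w) for _ in range(kernel_h * kernel_w)]
    for kh in range(kernel_h):
        for kw in range(kernel_w):
            for oh in range(out_h):
                for ow in range(out_w):
                    columns[kh * kernel_w + kw][oh * out_w + ow] = image[oh + kh][ow + kw]
    return columns, out_h, out_w


def conv_as_gemm(image, kernel):
    """Cross-correlate image with kernel via im2col and a matrix product."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    columns, out_h, out_w = im2col(image, kernel_h, kernel_w)
    flat_kernel = [value for row in kernel for value in row]  # 1 x K matrix
    # (1 x K) times (K x P) matrix product, written as an explicit loop.
    flat_out = [sum(flat_kernel[k] * columns[k][p] for k in range(len(flat_kernel)))
                for p in range(out_h * out_w)]
    # Reshape the 1 x P result back into an out_h x out_w image.
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv_as_gemm(image, kernel))  # prints [[6, 8], [12, 14]]
```

With several filters the flattened kernels become the rows of a matrix, and with several images the product is repeated per image, which is where a batched GEMM fits in.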
CLBlast 1.4.1
CLBlast version 1.4.1 (bugfix release). Changes since previous release (version 1.4.0):
- Fixed an access violation under Windows upon releasing the OpenCL program when the driver is already unloaded
- Fixed an issue with double cl_program release in the CLBlast caching system
- Added tuned parameters for various devices (see doc/tuning.md)
CLBlast 1.4.0
CLBlast version 1.4.0. Changes since previous release (version 1.3.0):
- Added a Python interface to CLBlast ('PyCLBlast')
- Added CLBlast to Ubuntu PPA and macOS Homebrew package managers
- Added an API to run the tuners programmatically without any I/O
- Improved the performance potential by adding a second tunable GEMM kernel with 2D register tiling
- Added support for Intel specific subgroup shuffling extensions for faster GEMM on Intel GPUs
- Re-added a local memory size constraint to the tuners
- The routine tuners now automatically pick up tuning results from disk from the kernel tuners
- Updated and reorganised the CLBlast documentation
- Added a 'canary' region to check for overflows in the tuner and tests (inspired by clARMOR)
- Added an option to test against and compare performance with Intel's MKL
- Fixed an access violation when compiled with Visual Studio upon releasing the OpenCL program
- Fixed incorrect releasing of the OpenCL program resulting in segfaults / access violations
- Various minor fixes and enhancements
- Added tuned parameters for various devices (see doc/tuning.md)
- Added non-BLAS level-1 routines:
  - SHAD/DHAD/CHAD/ZHAD/HHAD (Hadamard element-wise vector-vector product)
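For readers unfamiliar with the Hadamard product, a minimal host-side reference in Python is shown below. It assumes the scaled form z = alpha * x * y + beta * z; that form is an assumption here and should be checked against the CLBlast API documentation. The function name `reference_had` is hypothetical and this is not the CLBlast implementation.

```python
def reference_had(alpha, x, y, beta, z):
    """Element-wise product with scaling: alpha * x[i] * y[i] + beta * z[i].
    Hypothetical reference helper; returns a new list instead of updating z."""
    if not (len(x) == len(y) == len(z)):
        raise ValueError("x, y and z must have the same length")
    return [alpha * xi * yi + beta * zi for xi, yi, zi in zip(x, y, z)]


print(reference_had(2.0, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0], 1.0, [10.0, 10.0, 10.0]))
# prints [18.0, 30.0, 46.0]
```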