Releases · CNugteren/CLBlast

13 Jun 17:50

CNugteren

1.6.3

2a08197

CLBlast 1.6.3 Latest

CLBlast version 1.6.3. Changes since previous release (version 1.6.2):

Fixed a bug in the GEMMK=1 kernel (with 2D register tiling) when MWG!=NWG
CMake fixes for older versions and for the CUDA backend
Added tuned parameters for many devices (see doc/tuning.md)


09 Feb 20:40

CNugteren

1.6.2

faa2109

CLBlast 1.6.2

CLBlast version 1.6.2. Changes since previous release (version 1.6.1):

Fix a bug in the pre-processor that would cause issues on Arm GPUs
Fix DLL install directory in mingw
Modifications to the Python bindings (pyclblast)
- Convert float scalar values to cl_half for fp16 routines
- Amax/amin, max/min routines accept unsigned integer buffers for index
- Switch to pyproject.toml file for installing Python bindings
- Build Python bindings using CMake, adding Windows support
Generator script now always uses LF line endings, independent of the platform
Added tuned parameters for many devices (see doc/tuning.md)
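
The float-to-cl_half conversion noted for the Python bindings comes down to rounding a Python float to IEEE 754 half precision and handing over its 16-bit pattern (OpenCL's cl_half is a 16-bit unsigned integer type). A minimal sketch using only the standard library; the helper names are hypothetical and are not part of pyclblast, which may well do this differently:

```python
import struct


def float_to_half_bits(value: float) -> int:
    """Round a Python float to IEEE 754 binary16 and return its raw 16-bit pattern."""
    # 'e' is the struct format code for half precision; 'H' reads the same
    # two bytes back as an unsigned 16-bit integer.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_bits_to_float(bits: int) -> float:
    """Inverse of float_to_half_bits: interpret a 16-bit pattern as binary16."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]
```

For example, `float_to_half_bits(1.0)` gives `0x3C00`. Values outside the half-precision range make `struct.pack` raise `OverflowError`, so a caller would need to decide how to handle that case.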


09 Jul 09:30

CNugteren

1.6.1

e3ce21b

CLBlast 1.6.1

CLBlast version 1.6.1. Changes since previous release (version 1.6.0):

Fix pointer error in pyclblast on ARM
Fix a multithreading bug related to storing objects in the cache
Added tuned parameters for many devices (see doc/tuning.md)


21 May 19:22

CNugteren

1.6.0

b0b3028

CLBlast 1.6.0

CLBlast version 1.6.0. Changes since previous release (version 1.5.3):

Improved performance on Qualcomm Adreno GPUs:
- Unique database entries for specific Adreno devices
- Toggle OpenCL kernel compilation options for Adreno
- New preprocessor directive RELAX_WORKGROUP_SIZE
Fixed a bug in handling of #undef in CLBlast loop unrolling and array-to-register mapping functions
Fixed a bug in XAMAX/XAMIN routines related to inadvertently including the increment and offset in the result
Fixed a bug in XAMAX/XAMIN routines that would cause only the real part of a complex number to be taken into account
Fixed a bug that caused tests to not properly do integer-output testing (for XAMAX/XAMIN)
Fixed a minor issue with the expected input buffer size in the TRMV/TBMV/TPMV/TRSV routines
Fixed an issue with crashes on Android related to calling clReleaseProgram
Fixed two small issues in the plotting script
Fixed a documentation bug in the 'ld' requirements
Enabled GitHub Actions CI builds for testing and releasing
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)


29 Sep 18:46

CNugteren

1.5.3

d55840e

CLBlast 1.5.3

CLBlast version 1.5.3. Changes since previous release (version 1.5.2):

Fix a correctness issue with DGEMM on SM 7.5 Turing GPUs
Update cl.hpp to the new opencl.hpp header in the samples
Changed the complex sum routine to return the complex sum instead of the absolute complex sum
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
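
To illustrate the change to the complex sum routine, the sketch below contrasts the two results. It assumes that "absolute complex sum" follows the BLAS ASUM convention of summing |re| + |im| per element; that reading is an assumption, and both function names are hypothetical rather than CLBlast API.

```python
def complex_sum(values):
    """Plain sum of complex numbers (the behaviour described for 1.5.3)."""
    return sum(values, 0j)


def absolute_complex_sum(values):
    """ASUM-style sum of |re| + |im| per element (assumed previous behaviour)."""
    return sum(abs(v.real) + abs(v.imag) for v in values)
```

For the input `[1+2j, -3+1j]` the first function returns `(-2+3j)` and the second returns `7.0`.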


20 Jan 13:22

CNugteren

1.5.2

70016e8

CLBlast 1.5.2

CLBlast version 1.5.2. Changes since previous release (version 1.5.1):

Changed XAMAX/XAMIN to be more likely to return the first rather than the last min/max index; updated the API docs
Added batched routines to pyclblast
Added CLBLAST_VERSION_MAJOR/MINOR/PATCH defines in headers to store version numbering
Several small improvements to the benchmark script (thanks to 'baryluk')
Fixed a bug in the caching when using a context with multiple devices
Fixed a bug in the tuners related to the global workgroup size not being a multiple of the local workgroup size
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
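
The first-versus-last distinction in the XAMAX/XAMIN change only matters when several elements tie for the largest (or smallest) absolute value. The sequential reference below is a sketch of the two tie-breaking rules for real-valued input, using 0-based Python indices; the function names are hypothetical, and it does not model CLBlast's parallel reduction, which is why the release note says "more likely" rather than "always".

```python
def amax_first_index(values):
    """Index of the first element with the largest absolute value."""
    if not values:
        raise ValueError("values must not be empty")
    best_index = 0
    for index, value in enumerate(values):
        # Strict comparison: a later tie never replaces an earlier winner.
        if abs(value) > abs(values[best_index]):
            best_index = index
    return best_index


def amax_last_index(values):
    """Index of the last element with the largest absolute value."""
    if not values:
        raise ValueError("values must not be empty")
    best_index = 0
    for index, value in enumerate(values):
        # Non-strict comparison: a later tie replaces the earlier winner.
        if abs(value) >= abs(values[best_index]):
            best_index = index
    return best_index
```

With `[1.0, -3.0, 3.0, 2.0]`, the first variant returns index 1 and the second returns index 2.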


18 Feb 09:38

CNugteren

1.5.1

8433985

CLBlast 1.5.1

CLBlast version 1.5.1. Changes since previous release (version 1.5.0):

Implemented single-kernel version of convolution as GEMM
Now catches all exceptions thrown by the tuners
Fixed a bug in ISAMIN kernel
Fixed an out-of-bounds read/write in the XHAD routine (thanks to etomzak)
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)


04 Dec 21:10

CNugteren

1.5.0

0c9411c

CLBlast 1.5.0

CLBlast version 1.5.0. Changes since previous release (version 1.4.1):

Added support for shuffle instructions for NVIDIA GPUs (thanks to 'tyler-utah')
Added an option to compile the Netlib API with static OpenCL device and context (-DNETLIB_PERSISTENT_OPENCL=ON)
Added a FAQ page to the documentation
The tuners now check beforehand for invalid local thread sizes and skip those completely
Made the tuning API (OverrideParameters) more flexible, disregarding superfluous parameters
Fixed an issue with conjugate transpose not being executed in certain cases for, among others, XOMATCOPY
Fixed an issue with AMD GPUs and the new GEMMK == 1 kernel
Fixed an issue with the preprocessor and the new GEMMK == 1 kernel
Fixed an issue for unequal MWG and NWG and the new GEMMK == 1 kernel
Fixed an issue for certain parameters for AXPY's 'XaxpyFaster' kernel
Various minor fixes and enhancements
Added non-BLAS routines:
- SCONVGEMM/DCONVGEMM/HCONVGEMM (convolution as im2col followed by batched GEMM)
- SCOL2IM/DCOL2IM/CCOL2IM/ZCOL2IM/HCOL2IM (col2im transform as used in machine learning)
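
The idea behind "convolution as im2col followed by GEMM" can be sketched in a few lines of plain Python. The version below makes simplifying assumptions that are not taken from CLBlast: a single image, a single channel, stride 1, no padding, and cross-correlation (no kernel flip). It therefore needs one matrix multiplication where the batched routine would run one per image. All function names are hypothetical.

```python
def im2col(image, kernel_h, kernel_w):
    """Unfold every kernel_h x kernel_w patch of a 2D image into one column."""
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    # Row r holds kernel offset (r // kernel_w, r % kernel_w);
    # column c holds output position (c // out_w, c % out_w).
    cols = [[image[oy + ky][ox + kx]
             for oy in range(out_h) for ox in range(out_w)]
            for ky in range(kernel_h) for kx in range(kernel_w)]
    return cols, out_h, out_w


def matmul(a, b):
    """Naive matrix product of two lists of rows (stands in for GEMM)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def conv2d_as_gemm(image, kernel):
    """2D cross-correlation computed as im2col followed by one matrix product."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    cols, out_h, out_w = im2col(image, kernel_h, kernel_w)
    flat_kernel = [[v for row in kernel for v in row]]  # 1 x (kernel_h * kernel_w)
    flat_out = matmul(flat_kernel, cols)[0]
    # Fold the flat result back into out_h rows of out_w values.
    return [flat_out[y * out_w:(y + 1) * out_w] for y in range(out_h)]
```

The trade-off is memory for regularity: im2col duplicates overlapping input pixels, but the heavy lifting becomes a dense matrix product that a tuned GEMM kernel handles well.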


14 Jul 10:30

CNugteren

1.4.1

db179a1

CLBlast 1.4.1

CLBlast version 1.4.1 (bugfix release). Changes since previous release (version 1.4.0):

Fixed an access violation under Windows upon releasing the OpenCL program when the driver is already unloaded
Fixed an issue with double cl_program release in the CLBlast caching system
Added tuned parameters for various devices (see doc/tuning.md)


03 Jun 11:27

CNugteren

1.4.0

4471b67

CLBlast 1.4.0

CLBlast version 1.4.0. Changes since previous release (version 1.3.0):

Added a Python interface to CLBlast, 'PyCLBlast'
Added CLBlast to Ubuntu PPA and macOS Homebrew package managers
Added an API to run the tuners programmatically without any I/O
Improved the performance potential by adding a second tunable GEMM kernel with 2D register tiling
Added support for Intel-specific subgroup shuffling extensions for faster GEMM on Intel GPUs
Re-added a local memory size constraint to the tuners
The routine tuners now automatically pick up tuning results from disk from the kernel tuners
Updated and reorganised the CLBlast documentation
Added a 'canary' region to check for overflows in the tuner and tests (inspired by clARMOR)
Added an option to test against and compare performance with Intel's MKL
Fixed an access violation when compiled with Visual Studio upon releasing the OpenCL program
Fixed incorrect releasing of the OpenCL program resulting in segfaults / access violations
Various minor fixes and enhancements
Added tuned parameters for various devices (see doc/tuning.md)
Added non-BLAS level-1 routines:
- SHAD/DHAD/CHAD/ZHAD/HHAD (Hadamard element-wise vector-vector product)
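
As a reference for what a Hadamard product computes, here is a minimal sequential sketch. The `alpha` and `beta` scaling factors follow the usual BLAS-style form z = alpha * x * y + beta * z; that form is an assumption here, since the release note only names the element-wise product, and the function name is hypothetical.

```python
def hadamard(x, y, z=None, alpha=1.0, beta=0.0):
    """Return alpha * x[i] * y[i] + beta * z[i] for every index i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if z is None:
        z = [0.0] * len(x)  # no accumulation when no output vector is given
    if len(z) != len(x):
        raise ValueError("z must have the same length as x and y")
    return [alpha * a * b + beta * c for a, b, c in zip(x, y, z)]
```

With the default scaling, `hadamard([1, 2, 3], [4, 5, 6])` yields `[4.0, 10.0, 18.0]`.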

