Releases · ROCm/Tensile

Make messagepack python dependency optional
TensileCreateLibraryFiles: auto create target for build time lib generation
Tensile cluster tuning tool
Framework for filtering solutions
Workflow for manually editing Kernels
Tuning client design doc
MatrixInstruction for general int8
Tensile integration test for TensileCreateLibrary
Trig float and random narrow init patterns for new client
Summation dimension mirroring (contributed by timlathy & Slimakanzer)
ROCm 4.1 TargetID support in Tensile; source kernels force xnack=OFF
Tensile/Utilities/merge.py revamp for merging logic yaml files
- merge.py now requires python3
- add -v verbosity levels (up to 2)
- add --notrim to retain leading dimensions in sizes
New BoundsCheck design: Access guard page will trigger memory fault
Solution fitness metric
Auto-tuning documentation and build script improvements
Support for High Precision Accumulate FP16/BF16 In FP32 Out
CHANGELOG.md

NOTE: ABI/API breaking changes introduced in this release, related to the addition of a separate pointer to the output Matrix D.

Features

new solution selection logic
add persistent kernel
enable StaggerU
Add 6x6 and 6x4 micro-tiles
Add FUSS kernels
gfx906 DGEMM NN/NT tuning
add dot4 instructions for i8_r/i32_r gemm_ex on gfx906
Matrix D Support.
- GEMM calls now take a separate pointer to the output matrix D, which replaces matrix C as the output (D = alpha*A*B + beta*C).
- Matrix C will now only be used for the input to GEMM calls.
- For this release, Matrix D uses the same stride as Matrix C.
- Previous functionality can be obtained by passing in the same pointer to both Matrix C and D.
fixes for merge script
add scripts for tuning automation
add replacement kernel logic
Improved Tensile run times for large numbers of solutions
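
To illustrate the Matrix D change listed above, here is a minimal reference sketch of the D = alpha*A*B + beta*C semantics in plain Python. It is not Tensile's API: the function name `gemm_reference` and its argument order are hypothetical, and matrices are plain nested lists rather than device pointers with strides. It only shows that C is read as input, D receives the output, and passing the same object for both reproduces the earlier in-place behavior.

```python
def gemm_reference(alpha, a, b, beta, c, d):
    """Hypothetical reference GEMM: d = alpha * (a @ b) + beta * c.

    a is m x k, b is k x n, c and d are m x n (nested lists).
    c is only read; d is only written (unless c and d are the same object).
    """
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = 0
            for l in range(k):
                acc += a[i][l] * b[l][j]
            # Each output element depends only on c[i][j], so writing into d
            # is safe even when d and c alias the same storage.
            d[i][j] = alpha * acc + beta * c[i][j]
    return d


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# Separate output: C is left untouched, D holds the result.
c = [[1, 1], [1, 1]]
d = [[0, 0], [0, 0]]
gemm_reference(2, a, b, 1, c, d)

# Previous behavior: pass the same matrix as both C and D.
c_inplace = [[1, 1], [1, 1]]
gemm_reference(2, a, b, 1, c_inplace, c_inplace)
```

With separate arguments, `c` keeps its input values after the call, while in the aliased call `c_inplace` is overwritten with the result, which matches the "same pointer to both Matrix C and D" case described above.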
