Releases: ROCm/Tensile
Tensile-4.26.0 for ROCm 4.1.0
Added
- Make messagepack python dependency optional
- TensileCreateLibraryFiles: auto create target for build time lib generation
- Tensile cluster tuning tool
- Framework for filtering solutions
- Workflow for manually editing Kernels
- Tuning client design doc
- MatrixInstruction for general int8
- Tensile integration test for TensileCreateLibrary
- Trig float and random narrow init patterns for new client
- Summation dimension mirroring (contributed by timlathy & Slimakanzer)
- ROCm 4.1 TargetID support in Tensile; source kernels force xnack=OFF
- Tensile/Utilities/merge.py revamp for merging logic yaml files
  - merge.py now requires python3
  - add -v verbosity levels (up to 2)
  - add --notrim to retain leading dimensions in sizes
- New BoundsCheck design: Access guard page will trigger memory fault
- Solution fitness metric
- Auto-tuning documentation and build script improvements
- Support for High Precision Accumulate FP16/BF16 In FP32 Out
- CHANGELOG.md
Optimizations
- Refine PersistentKernel: support PKn1, EPS, optimize LW-vmcnt and sMagicDiv2
Fixed
- Targets passed to clang-offload-bundler now use the hipv4 prefix when appropriate
- Fix bugs in the tail-loop branch label and in LR addr restore
- locateExe in Tensile/Common.py looks in defaultPath first
- Honor $ENV{ROCM_PATH} to support relocatable ROCm location
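The two lookup fixes above (check the default path first, and honor ROCM_PATH for relocatable installs) can be illustrated with a short Python sketch. This is not the code in Tensile/Common.py: the function names `locate_exe_sketch` and `rocm_bin_dir`, the `/opt/rocm` fallback, and the `bin` subdirectory are assumptions made for illustration.

```python
import os
import shutil


def rocm_bin_dir(env=None):
    """Return the bin directory of the ROCm install (hypothetical helper).

    ROCM_PATH wins when set; /opt/rocm is an assumed fallback location.
    """
    env = os.environ if env is None else env
    return os.path.join(env.get("ROCM_PATH", "/opt/rocm"), "bin")


def locate_exe_sketch(default_path, exe_name, env=None):
    """Look for exe_name in default_path first, then fall back to PATH."""
    env = os.environ if env is None else env
    candidate = os.path.join(default_path, exe_name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # Not in the default location: search the directories listed in PATH.
    return shutil.which(exe_name, path=env.get("PATH", os.defpath))
```

Checking the default directory before PATH means a tool shipped with the selected ROCm install is preferred over an unrelated copy that happens to appear earlier in PATH.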
Tensile 4.24.0 for rocm 3.10.0
New Features
- No new features
Known Issues
- None
Tensile 4.23.0 for rocm 3.9.0
Merge pull request #1160 from zaliu/master ROCm 3.9 merge develop into master
Tensile-4.22.0 for ROCm 3.8.0
New Features
Known Issues
- None
V4.9.0 Performance improvements
Features
- Improve persistent kernel implementation
- Add sequential indexing to tensile merge script
V4.8.0 Performance improvements, bug fixes, add assembly hpa_igemm
NOTE: ABI/API breaking changes introduced in this release, related to the addition of a separate pointer to the output Matrix D.
Features
- new solution selection logic
- add persistent kernel
- enable StaggerU
- Add 6x6 and 6x4 micro-tiles
- Add FUSS kernels
- gfx906 DGEMM NN/NT tuning
- add dot4 instructions for i8_r/i32_r gemm_ex on gfx906
- Matrix D support
  - GEMM calls now take a separate pointer to the output matrix D, replacing matrix C for output (D = alpha*A*B + beta*C).
  - Matrix C is now used only as an input to GEMM calls.
  - For this release, matrix D uses the same stride as matrix C.
  - Previous functionality can be obtained by passing the same pointer for both matrix C and matrix D.
- fixes for merge script
- add scripts for tuning automation
- add replacement kernel logic
- Improved Tensile run times for large numbers of solutions
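The Matrix D change in V4.8.0 can be illustrated with a small reference sketch in plain Python. It is not Tensile's API: the `gemm` function, its argument order, and the list-of-lists layout are assumptions chosen only to show the separate output buffer and the aliasing case described in the notes.

```python
def gemm(alpha, a, b, beta, c, d):
    """Write alpha * A @ B + beta * C into d (row-major lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            # c[i][j] is read before d[i][j] is written, so c and d may alias.
            d[i][j] = alpha * acc + beta * c[i][j]
    return d


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]

# New behavior: C is input only and the result lands in a separate D.
d = [[0.0, 0.0], [0.0, 0.0]]
gemm(2.0, a, b, 0.5, c, d)

# Previous behavior: pass the same buffer as both C and D.
c_inplace = [row[:] for row in c]
gemm(2.0, a, b, 0.5, c_inplace, c_inplace)
```

After both calls, `c` is unchanged and `c_inplace` holds the same values as `d`, which is the sense in which passing one pointer for both matrices recovers the earlier in-place result.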
V4.7.0 Performance improvements, bug fixes, add assembly hpa_hgemm, initial source hpa_igemm
Features
- add dot2 instructions for fp16/fp32 hpa_hgemm on gfx906
- initial i8/i32 hpa_igemm
- enable fractional loads
- enable precise bounds check
V4.6.0 Performance improvements, Bug fixes, add source hpa_hgemm
Features
- Merge gfx906 code into gfx900/gfx803 code
- Tune hgemm and sgemm for Resnet50 on gfx906
- Add source hpa_hgemm
- Use precise bounds check when possible
- Tested on ROCm 1.9