Releases: ROCm/Tensile
Tensile 4.41.0 for ROCm 6.2.4
Tensile code for ROCm 6.2.4 did not change. The library was rebuilt for the updated ROCm 6.2.4 stack.
Tensile 4.42.0 for ROCm 6.3.1
Tensile code for ROCm 6.3.1 did not change. The library was rebuilt for the updated ROCm 6.3.1 stack.
Tensile 4.42.0 for ROCm 6.3.0
Additions
- add contributor and developer guide
- add testing and documentation for MasterSolutionLibrary.ArchitectureIndexMap and remapSolutionIndicesStartingFrom
- add gfx12 support
- add functions for writing master file
- add tPrint and reconcile printing options
- add Python unit test coverage report
- factor embed library logic into a function and add test
- add clang++ as cxx-compiler option for windows
- add logic to cope with different compilers
- add generateManifest function, rename generateManifest to toFile, and move it to Utilities
- add profiling CI job
- add support for amdclang and use defaults
- add architecture management functions to TensileCreateLibrary
- add TensileCreateLibrary cli reference docs
- add new documentation (sphinx prototype, build out skeleton)
Optimizations
- add prediction model for optimal number of Stream-K tiles to run
- use analytical grid size prediction model for Stream-K
- remap XCC-based workgroup for Stream-K kernels
- add two-tile algorithm with Stream-K after DP
- add atomic 2-tile Stream-K and clean-up tuning parameters
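For readers unfamiliar with Stream-K, the sketch below illustrates the general idea behind the entries above. It is not Tensile's implementation. The total number of inner-loop iterations across all output tiles is divided as evenly as possible among a fixed grid of workgroups, so a workgroup may start or finish partway through a tile. The function names and parameters are hypothetical and chosen only for illustration.

```python
def stream_k_ranges(num_tiles, iters_per_tile, grid_size):
    """Split the flattened iteration space evenly across workgroups.

    Returns one half-open (start, end) iteration range per workgroup.
    Hypothetical helper; not part of Tensile.
    """
    total = num_tiles * iters_per_tile
    base, extra = divmod(total, grid_size)
    ranges = []
    start = 0
    for wg in range(grid_size):
        # The first `extra` workgroups take one additional iteration.
        count = base + (1 if wg < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def tiles_touched(iter_range, iters_per_tile):
    """List the output tiles that an iteration range contributes to."""
    start, end = iter_range
    if start == end:
        return []
    return list(range(start // iters_per_tile, (end - 1) // iters_per_tile + 1))


# 5 tiles of 8 iterations each, shared by 4 workgroups: 10 iterations apiece.
ranges = stream_k_ranges(num_tiles=5, iters_per_tile=8, grid_size=4)
for wg, r in enumerate(ranges):
    print(wg, r, tiles_touched(r, iters_per_tile=8))
```

In this example every workgroup touches two tiles, so each shared tile needs its partial results combined before the final output is written. A two-tile variant would, under the same assumptions, run most tiles in ordinary data-parallel fashion and apply this splitting only to the remainder.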
Changes
- improve rocBLAS build output by allowing warning suppression, ignoring only developer warnings, progress bar and quiet printing
- reorder extensions for Windows in the which function
- remove deprecated flag from CI profiling job
- update amdclang++ and asm directories
- update duplicate marking tests with mocks
- remove diagnostic print, restore print ordering, and add missing print option
- bump rocm-docs-core from 1.2.0 to 1.5.0 in /docs/sphinx
- refactor kernel duplicate matching
- refactor generateLogicDataAndSolutions
- remove globals from prepAsm
- restrict XCC mapping to gfx942
- refactor argument parsing in TensileCreateLibrary
- disable failing rhel9 tests
- change line length for formatting to 100 characters
- change YAML operations to use C libyaml backend
- improve warning wording
- remove deprecated package-library option
- update clang support for Windows
- update supportedCompiler function
- use conditional choices and defaults
- remove duplicate which function and minor cleanup
- refactor sanity check in TensileCreateLibrary
- factor client config logic from TensileCreateLibrary main into createClientConfig
- use glob to find logic files in TensileCreateLibrary
- use function to confirm supported compiler rather than raw logic
- update verifyManifest in TensileCreateLibrary
- update RTD configs
- clean up the CMake to prevent redundant work in client builds
- update Stream-K debug settings
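As an illustration of the glob-based file discovery mentioned above, here is a minimal sketch that collects files recursively under a directory. The helper name, the default `*.yaml` pattern, and the directory layout are assumptions for the example and are not taken from TensileCreateLibrary.

```python
import glob
import os


def find_logic_files(root, pattern="*.yaml"):
    """Return sorted paths under root whose file names match pattern.

    Hypothetical helper. With recursive=True, "**" matches zero or more
    directory levels, so files directly in root are included as well.
    """
    return sorted(glob.glob(os.path.join(root, "**", pattern), recursive=True))
```

Sorting the result keeps the order deterministic across platforms, which helps when the file list feeds a reproducible build step.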
Fixes
- fix Stream-K XCC configs for gfx942
- update WMMA capability command for ISA 10+
- fix progress bar character encoding error on Windows
- fix solution redundancy removal
- fix tuning imports for pyyaml
- fix printing ASM capabilities for ROCm < 6.3
- fix code objects by filtering kernels with build errors and unprocessed kernels
- fully qualify std::get in contraction solutions
- add -v flag and change system invocation
- use conditional imports for new dependencies to fix yaml CSafe load and dump import, and to fix rich terminal print import
- fix comments on scalarStaticDivideAndRemainder
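The libyaml-related entries above concern choosing the C-backed YAML classes when they are available. PyYAML exposes `CSafeLoader` and `CSafeDumper` only when it was built with libyaml bindings, so code that prefers them needs a fallback. The sketch below shows one way to express that fallback. The helper name is hypothetical, and stand-in objects replace the real `yaml` module so the example runs without PyYAML installed. It is not the code used in Tensile.

```python
import types


def select_yaml_backend(yaml_module):
    """Return (loader, dumper), preferring libyaml-backed classes if present.

    Hypothetical helper; pass the imported `yaml` module in real use.
    """
    loader = getattr(yaml_module, "CSafeLoader", None) or yaml_module.SafeLoader
    dumper = getattr(yaml_module, "CSafeDumper", None) or yaml_module.SafeDumper
    return loader, dumper


# Stand-ins for a PyYAML build without and with libyaml bindings.
pure_python = types.SimpleNamespace(SafeLoader="SafeLoader", SafeDumper="SafeDumper")
with_libyaml = types.SimpleNamespace(
    SafeLoader="SafeLoader",
    SafeDumper="SafeDumper",
    CSafeLoader="CSafeLoader",
    CSafeDumper="CSafeDumper",
)

print(select_yaml_backend(pure_python))
print(select_yaml_backend(with_libyaml))
```

With the real module, the selected classes would be passed as `yaml.load(stream, Loader=loader)` and `yaml.dump(data, Dumper=dumper)`.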
Tensile 4.40.0 for ROCm 6.1.2
Tensile code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.
Tensile 4.40.0 for ROCm 6.1.1
Tensile code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.
Tensile 4.41.0 for ROCm 6.2.2
Tensile code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.
Tensile 4.41.0 for ROCm 6.2.1
Tensile code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.
Tensile 4.41.0 for ROCm 6.2.0
Additions
- new tuning script to summarize rocBLAS log file
- new environment variable to test fixed grid size with Stream-K kernels
- new Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
- add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
- add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported, instead of rejecting kernel generation)
- support NT flag for global load and store for gfx94x
- new Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
- add option for rotating buffers timing with cache eviction
- add predicate for arithmetic intensity
- add DirectToVgpr + packing for f8/f16 + TLU cases
- enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
- add test cases for DirectToVgpr + packing
- add batch support for Stream-K kernels and new test cases
- new tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
- enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
- support VectorWidthB (new parameter VectorWidthB)
- support VectorWidth + non SourceSwap
- add test cases for VectorWidthB, VectorWidth + non SourceSwap
- add code owners file
- new environment variables to dynamically adjust number of CUs used in Stream-K
- add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B; effective with GlobalReadVectorWidth=-1)
- add xf32 option to rocblas-bench input creator
Optimizations
- initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py)
- optimize temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
- reverse MFMA order in inner loop for odd outer iteration
- optimize waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
- increase maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)
Changes
- update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
- limit build threads based on CPUs/RAM available on system (for tests)
- update required workspace size for Stream-K, skip kernel initialization when possible
- use fallback libraries for archs without optimized logic
- use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
- remove OCL tests
- disable HostLibraryTests
- reduce extended test time by removing extra parameters in the test config files
- disable InitAccVgprOpt for Stream-K
- skip sgemm 64bit offset tests for gfx94x
- skip DTV, DTL, LSU+MFMA tests for gfx908
- increase extended test timeout to 720 min
- update xfail test (1sum tests only failing on gfx90a)
- update lib logic convertor script
- test limiting CI threads for only gfx11
- remove WGM-related kernargs if they are not needed (WGM=-1,0,1)
- clean up unused old code, mostly related to the old client
- change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
- update efficiency script for new architecture and xf32 datatype
- re-enable negative values for WorkGroupMapping (asm kernel only)
- disable HW monitor for aquavanjaram941
- pre-apply offsets for strided batch kernels
- update tensile build with 16 threads
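The entry above about limiting build threads by available CPUs and RAM can be pictured with the following minimal sketch. The function name and the per-job memory estimate are hypothetical; the release notes do not state the heuristic Tensile actually uses.

```python
import os


def limit_build_threads(cpu_count, available_ram_bytes, ram_per_job_bytes):
    """Cap the worker count by both CPU count and memory budget.

    Hypothetical helper: ram_per_job_bytes is an assumed estimate of the
    peak memory one build job needs. At least one thread is always returned.
    """
    by_ram = available_ram_bytes // ram_per_job_bytes
    return max(1, min(cpu_count, by_ram))


GiB = 1024 ** 3
# Example: whatever CPU count this machine reports, 16 GiB free, 2 GiB per job.
threads = limit_build_threads(os.cpu_count() or 1, 16 * GiB, 2 * GiB)
print(threads)
```

Taking the minimum of the two limits means a machine with many cores but little memory does not oversubscribe RAM, while a machine with ample memory is still bounded by its core count.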
Fixes
- fix WorkspaceCheck implementation when used in rocBLAS
- ignore asm cap check for kernel arg preload for rocm6.0 and older
- fix Stream-K partials cache behavior
- fix MasterSolutionLibrary indexing for multiple architecture build
- fix memory allocation failure with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size)
- fix BufferLoad=False with Stream-K
- fix mismatch issue with GlobalReadCoalesceGroup
- fix rocBLAS build failure on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"])
- fix for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2)
- fix inaccurate vgpr allocation for ClusterLocalRead
- fix mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2
- fix mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if the InnerUnroll index (iui) is 0)
- fix HostLibraryTests on gfx942 and gfx941
- fix LLVM crash issue
- fix for newer windows vcpkg msgpack and vcpkg version package name
- fix an error with DisableKernelPieces + 32bit ShadowLimit
Tensile 4.40.0 for ROCm 6.1.0
Additions
- new DisableKernelPieces values to invalidate local read, local write, and global read
- Stream-K kernel generation, including two-tile Stream-K algorithm by setting StreamK=3
- feature to allow testing Stream-K grid multipliers
- debug output to check occupancy for Stream-K
- reject condition for FractionalLoad + DepthU!=power of 2
- new TENSILE_DB debugging value to dump the common kernel parameters
- predicate for APU libs
- new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
- new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
- new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
- initialization type and general batched options to the rocblas-bench input creator script
Optimizations
- enabled MFMA + LocalSplitU=4 for MT16x16
- enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
- optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
- auto LdsPad calculation for TileMajorLds + MI16x16
- auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth
Changes
- cleared hipErrorNotFound error since it is an expected part of the search
- modified hipcc search path for Linux
- changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
- changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
- changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
- updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
- replaced std::regex with fnmatch()/PathMatchSpec as a workaround for a known std::regex stack overflow bug
Fixes
- hipcc compile append flag parallel-jobs=4
- race condition in Stream-K that appeared with large grids and small sizes
- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
- mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
- incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
- small fix for LdsPad optimization (LdsElement calculation)
Tensile 4.39.0 for ROCm 6.0.2
Tensile code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.