Releases · ROCm/Tensile

add contributor and developer guide
add testing and documentation for MasterSolutionLibrary.ArchitectureIndexMap and remapSolutionIndicesStartingFrom
add gfx12 support
add functions for writing master file
add tPrint and reconcile printing options
add Python unit test coverage report
factor embed library logic into a function and add a test
add clang++ as cxx-compiler option for windows
add logic to cope with different compilers
add generateManifest function, rename generateManifest to toFile, and move to Utilities
add profiling CI job
add support for amdclang and use defaults
add architecture management functions to TensileCreateLibrary
add TensileCreateLibrary cli reference docs
add new documentation (sphinx prototype, build out skeleton)
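
The compiler-related entries above (clang++ on Windows, amdclang support, coping with different compilers) amount to a per-platform allow-list check. A minimal sketch, assuming hipcc is accepted everywhere, clang++ only on Windows, and amdclang++ only on POSIX systems; the name `supported_compiler` and the exact sets are illustrative and are not taken from the Tensile source:

```python
import os

# Assumed allow-lists for illustration only; the real lists live in Tensile.
COMMON_COMPILERS = {"hipcc"}
PER_OS_COMPILERS = {
    "nt": {"clang++"},        # Windows
    "posix": {"amdclang++"},  # Linux and other POSIX systems
}

def supported_compiler(compiler, os_name=os.name):
    """Return True if `compiler` is allowed on the platform named by `os_name`."""
    allowed = COMMON_COMPILERS | PER_OS_COMPILERS.get(os_name, set())
    return compiler in allowed
```

Keeping the check in one function means callers such as argument parsing can reject an unsupported compiler early instead of repeating the platform logic.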

Optimizations

add prediction model for optimal number of Stream-K tiles to run
use analytical grid size prediction model for Stream-K
remap XCC-based workgroup for Stream-K kernels
add two-tile algorithm with Stream-K after DP
add atomic 2-tile Stream-K and clean-up tuning parameters
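
For readers unfamiliar with the two-tile idea named above, the following sketch shows one common way to divide work, as described in the general Stream-K literature: most tiles run data-parallel (DP) in full waves, and the leftover tiles plus one extra wave are split evenly by iteration across all compute units. This is an assumption-laden illustration, not Tensile's kernel code; the function names and the exact split rule are hypothetical.

```python
def two_tile_split(num_tiles, num_cus):
    """Return (sk_tiles, dp_tiles) for a two-tile Stream-K style schedule."""
    remainder = num_tiles % num_cus
    if remainder == 0:
        # Tiles already fill whole waves, so everything can run data-parallel.
        return 0, num_tiles
    if num_tiles < num_cus:
        # Fewer tiles than compute units: share all of them Stream-K style.
        return num_tiles, 0
    # Leftover tiles plus one full wave go to Stream-K, so each compute unit
    # receives between one and two tiles' worth of iterations.
    sk_tiles = remainder + num_cus
    return sk_tiles, num_tiles - sk_tiles

def sk_iteration_ranges(sk_tiles, iters_per_tile, num_cus):
    """Split the Stream-K iterations into near-equal [start, end) ranges."""
    total = sk_tiles * iters_per_tile
    base, extra = divmod(total, num_cus)
    ranges, start = [], 0
    for wg in range(num_cus):
        count = base + (1 if wg < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges
```

With 10 tiles on 4 compute units, this rule assigns 6 tiles to Stream-K and 4 tiles to one data-parallel wave.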

Changes

improve rocBLAS build output by allowing warning suppression, ignoring only developer warnings, progress bar and quiet printing
reorder extensions for Windows in the which function
remove deprecated flag from CI profiling job
update amdclang++ and asm directories
update duplicate marking tests with mocks
remove diagnostic print, restore print ordering, and add missing print option
bump rocm-docs-core from 1.2.0 to 1.5.0 in /docs/sphinx
refactor kernel duplicate matching
refactor generateLogicDataAndSolutions
remove globals from prepAsm
restrict XCC mapping to gfx942
refactor argument parsing in TensileCreateLibrary
disable failing rhel9 tests
change line length for formatting to 100 characters
change YAML operations to use C libyaml backend
improve warning wording
remove deprecated package-library option
update clang support for Windows
update supportedCompiler function
use conditional choices and defaults
remove duplicate which function and minor cleanup
refactor sanity check in TensileCreateLibrary
factor client config logic from TensileCreateLibrary main into createClientConfig
use glob to find logic files in TensileCreateLibrary
use function to confirm supported compiler rather than raw logic
update verifyManifest in TensileCreateLibrary
update RTD configs
clean up the CMake to prevent redundant work in client builds
update Stream-K debug settings
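
The entry on using glob to find logic files can be pictured with Python's standard `glob` module. This is a sketch under assumptions: the helper name `find_logic_files`, the recursive search, and the `.yaml` default extension are illustrative choices, not the code in TensileCreateLibrary.

```python
import glob
import os

def find_logic_files(root, extension=".yaml"):
    """Return a sorted list of files under `root` that end with `extension`."""
    # With recursive=True, "**" matches zero or more directory levels,
    # so files directly inside `root` are found as well.
    pattern = os.path.join(root, "**", "*" + extension)
    return sorted(glob.glob(pattern, recursive=True))
```

Sorting the result keeps the file order stable across platforms, which helps when later steps depend on a deterministic ordering.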

Fixes

fix Stream-K XCC configs for gfx942
update WMMA capability command for ISA 10+
fix progress bar character encoding error on Windows
fix solution redundancy removal
fix tuning imports for pyyaml
fix printing ASM capabilities for ROCm < 6.3
fix code objects by filtering kernels with build errors and unprocessed kernels
fully qualify std::get in contraction solutions
add -v flag and change system invocation
use conditional imports for new dependencies to fix yaml CSafe load and dump import, and to fix rich terminal print import
fix comments on scalarStaticDivideAndRemainder
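
The conditional-import fix above follows a common pattern: prefer a faster or optional backend when it is present and fall back otherwise. A minimal sketch, assuming a small hypothetical helper `first_available`; PyYAML exposes `CSafeLoader` only when it was built with libyaml, which is why the lookup can fail on some installations.

```python
import importlib

def first_available(module_name, attr_names):
    """Return the first attribute in `attr_names` that `module_name` provides.

    Returns None when the module is not installed or has none of the names.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    for name in attr_names:
        attr = getattr(module, name, None)
        if attr is not None:
            return attr
    return None

# Prefer the C-backed loader, fall back to the pure-Python one.
# Loader is None if PyYAML is not installed at all.
Loader = first_available("yaml", ["CSafeLoader", "SafeLoader"])
```

The same helper could be used for an optional terminal-printing dependency, with plain `print` as the fallback when the lookup returns None.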

04 Jun 16:52

rocm-ci

rocm-6.1.2

bf05992

Tensile 4.40.0 for ROCm 6.1.2

Tensile code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.

08 May 17:59

rocm-ci

rocm-6.1.1

bf05992

Tensile 4.40.0 for ROCm 6.1.1

Tensile code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.

27 Sep 16:01

rocm-ci

rocm-6.2.2

dbc2062

Tensile 4.41.0 for ROCm 6.2.2

Tensile code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.

20 Sep 19:57

rocm-ci

rocm-6.2.1

dbc2062

Tensile 4.41.0 for ROCm 6.2.1

Tensile code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.

02 Aug 16:15

rocm-ci

rocm-6.2.0

dbc2062

Tensile 4.41.0 for ROCm 6.2.0

Additions

new tuning script to summarize rocBLAS log file
new environment variable to test fixed grid size with Stream-K kernels
new Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported, instead of rejecting kernel generation)
support NT flag for global load and store for gfx94x
new Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
add option for rotating buffers timing with cache eviction
add predicate for arithmetic intensity
add DirectToVgpr + packing for f8/f16 + TLU cases
enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
add test cases for DirectToVgpr + packing
add batch support for Stream-K kernels and new test cases
new tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
support VectorWidthB (new parameter VectorWidthB)
support VectorWidth + non SourceSwap
add test cases for VectorWidthB, VectorWidth + non SourceSwap
add code owners file
new environment variables to dynamically adjust number of CUs used in Stream-K
add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA and GlobalLoadVectorWidthB, effective with GlobalReadVectorWidth=-1)
add xf32 option to rocblas-bench input creator

Optimizations

initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py)
optimize temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
reverse MFMA order in inner loop for odd outer iteration
optimize waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
raise maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)

Changes

update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
limit build threads based on CPUs/RAM available on system (for tests)
update required workspace size for Stream-K, skip kernel initialization when possible
use fallback libraries for archs without optimized logic
use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
remove OCL tests
disable HostLibraryTests
reduce extended test time by removing extra parameters in the test config files
disable InitAccVgprOpt for Stream-K
skip sgemm 64bit offset tests for gfx94x
skip DTV, DTL, LSU+MFMA tests for gfx908
increase extended test timeout to 720 min
update xfail test (1sum tests only failing on gfx90a)
update lib logic convertor script
test limiting CI threads for only gfx11
remove WGM-related kernargs when they are not needed (WGM=-1,0,1)
clean up unused old code, mostly related to the old client
change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
update efficiency script for new architecture and xf32 datatype
re-enable negative values for WorkGroupMapping (asm kernel only)
disable HW monitor for aquavanjaram941
pre-apply offsets for strided batch kernels
update tensile build with 16 threads
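
The entry on limiting build threads by CPUs and RAM can be sketched as taking the smaller of two bounds. The function name, the parameters, and the idea of a fixed per-thread memory budget are assumptions made for illustration; the notes do not state how Tensile computes the limit.

```python
def limit_build_threads(cpu_count, available_ram_bytes, ram_per_thread_bytes,
                        requested=None):
    """Return a thread count bounded by CPU count and by available RAM.

    `ram_per_thread_bytes` is an assumed memory budget for one build thread.
    """
    by_ram = available_ram_bytes // ram_per_thread_bytes
    limit = max(1, min(cpu_count, by_ram))  # always allow at least one thread
    if requested is not None:
        limit = max(1, min(limit, requested))
    return limit
```

For example, a 16-core machine with 8 GiB free and a 2 GiB per-thread budget would be limited to 4 threads under these assumptions.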

Fixes

fix WorkspaceCheck implementation when used in rocBLAS
ignore asm cap check for kernel arg preload for rocm6.0 and older
fix Stream-K partials cache behavior
fix MasterSolutionLibrary indexing for multiple architecture build
fix memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size)
fix BufferLoad=False with Stream-K
fix mismatch issue with GlobalReadCoalesceGroup
fix rocblas build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"])
fix for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2)
fix inaccurate vgpr allocation for ClusterLocalRead
fix mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2
fix mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of innerUnroll (iui) is 0)
fix HostLibraryTests on gfx942 and gfx941
fix LLVM crash issue
fix for newer windows vcpkg msgpack and vcpkg version package name
fix an error with DisableKernelPieces + 32bit ShadowLimit

16 Apr 19:07

rocm-ci

rocm-6.1.0

be9f7da

Tensile 4.40.0 for ROCm 6.1.0

Additions

new DisableKernelPieces values to invalidate local read, local write, and global read
stream-K kernel generation, including two-tile stream-k algorithm by setting StreamK=3
feature to allow testing stream-k grid multipliers
debug output to check occupancy for Stream-K
reject condition for FractionalLoad + DepthU!=power of 2
new TENSILE_DB debugging value to dump the common kernel parameters
predicate for APU libs
new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
initialization type and general batched options to the rocblas-bench input creator script
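
A reject condition such as the FractionalLoad + DepthU one listed above boils down to a cheap validity check on a parameter dictionary before a kernel is generated. A minimal sketch, assuming a plain dict of parameters and a hypothetical `reject_reason` helper; the real checks in Tensile cover many more parameters.

```python
def is_power_of_two(n):
    """Return True for positive integers that are exact powers of two."""
    return n > 0 and (n & (n - 1)) == 0

def reject_reason(params):
    """Return a message if the parameter combination is rejected, else None."""
    # Any non-zero FractionalLoad value is treated as enabled in this sketch.
    if params.get("FractionalLoad") and not is_power_of_two(params["DepthU"]):
        return "FractionalLoad requires DepthU to be a power of 2"
    return None
```

Returning a reason string rather than a bare boolean makes it easier to report why a candidate kernel was skipped during tuning.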

Optimizations

enabled MFMA + LocalSplitU=4 for MT16x16
enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
auto LdsPad calculation for TileMajorLds + MI16x16
auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth

Changes

cleared hipErrorNotFound error since it is an expected part of the search
modified hipcc search path for Linux
changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
replaced std::regex with fnmatch()/PathMatchSpec as a workaround for a known std::regex stack overflow bug
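
The switch from regular expressions to fnmatch-style matching trades expressive power for simple shell-style wildcards. The change itself is in C++; the sketch below uses Python's standard `fnmatch` module only to show what this kind of pattern looks like. The helper name and the file names are made up for illustration.

```python
import fnmatch

def filter_names(names, pattern):
    """Keep the names that match a shell-style wildcard pattern."""
    # fnmatchcase compares case-sensitively on every platform.
    return [n for n in names if fnmatch.fnmatchcase(n, pattern)]

# Hypothetical file names, used only to demonstrate the pattern syntax.
example = filter_names(
    ["TensileLibrary_gfx942.co", "TensileLibrary_gfx942.dat", "notes.txt"],
    "TensileLibrary_*.co",
)
```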

Fixes

append parallel-jobs=4 flag to hipcc compile
race condition in Stream-K that appeared with large grids and small sizes
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
small fix for LdsPad optimization (LdsElement calculation)

31 Jan 20:12

rocm-ci

rocm-6.0.2

17df881

Tensile 4.39.0 for ROCm 6.0.2

Tensile code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.

