Releases: ROCm/Tensile

Tensile 4.41.0 for ROCm 6.2.4

06 Nov 19:55
81ae953

Tensile code for ROCm 6.2.4 did not change. The library was rebuilt for the updated ROCm 6.2.4 stack.

Tensile 4.42.0 for ROCm 6.3.1

20 Dec 16:12
aca95d1

Tensile code for ROCm 6.3.1 did not change. The library was rebuilt for the updated ROCm 6.3.1 stack.

Tensile 4.42.0 for ROCm 6.3.0

03 Dec 19:49
aca95d1

Additions

  • add contributor and developer guide
  • add testing and documentation for MasterSolutionLibrary.ArchitectureIndexMap and remapSolutionIndicesStartingFrom
  • add gfx12 support
  • add functions for writing master file
  • add tPrint and reconcile printing options
  • add Python unit test coverage report
  • factor embed library logic into a function and add a test
  • add clang++ as cxx-compiler option for windows
  • add logic to cope with different compilers
  • add generateManifest function, rename the previous generateManifest to toFile, and move it to Utilities
  • add profiling CI job
  • add support for amdclang and use defaults
  • add architecture management functions to TensileCreateLibrary
  • add TensileCreateLibrary cli reference docs
  • add new documentation (sphinx prototype, build out skeleton)

Optimizations

  • add prediction model for optimal number of Stream-K tiles to run
  • use analytical grid size prediction model for Stream-K
  • remap XCC-based workgroup for Stream-K kernels
  • add two-tile algorithm with Stream-K after DP
  • add atomic 2-tile Stream-K and clean-up tuning parameters
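
The Stream-K items above concern how GEMM work is divided across compute units. As a rough illustration only, the sketch below follows the general two-tile hybrid idea from the Stream-K literature: full waves of output tiles run data-parallel (DP), and the leftover tiles plus one extra wave are split evenly by iteration across workgroups. The function names `split_tiles` and `stream_k_ranges` are hypothetical, and this is not Tensile's implementation or its prediction model.

```python
def split_tiles(num_tiles, num_cus):
    """Return (dp_tiles, sk_tiles) for a two-tile hybrid schedule.

    Assumption: when tiles do not divide evenly into waves, the remainder
    plus one full wave goes to Stream-K, so each CU gets between one and
    two tiles' worth of Stream-K work.
    """
    full_waves, remainder = divmod(num_tiles, num_cus)
    if remainder == 0:
        return num_tiles, 0      # perfectly quantized: pure data-parallel
    if full_waves == 0:
        return 0, num_tiles      # fewer tiles than CUs: pure Stream-K
    sk_tiles = remainder + num_cus
    return num_tiles - sk_tiles, sk_tiles


def stream_k_ranges(sk_tiles, iters_per_tile, num_workgroups):
    """Split the flattened iteration space evenly across workgroups.

    Each workgroup receives a contiguous [start, end) range of iterations;
    a range may cross tile boundaries, which is where partial results
    have to be combined.
    """
    total = sk_tiles * iters_per_tile
    base, extra = divmod(total, num_workgroups)
    ranges = []
    start = 0
    for wg in range(num_workgroups):
        count = base + (1 if wg < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


if __name__ == "__main__":
    dp, sk = split_tiles(num_tiles=10, num_cus=4)
    print(dp, sk)
    print(stream_k_ranges(sk, iters_per_tile=8, num_workgroups=4))
```

With 10 tiles on 4 CUs, 4 tiles run as one data-parallel wave and the remaining 6 tiles are shared so that every workgroup performs the same number of iterations.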

Changes

  • improve rocBLAS build output by allowing warning suppression, ignoring only developer warnings, progress bar and quiet printing
  • reorder extensions for Windows in the which function
  • remove deprecated flag from CI profiling job
  • update amdclang++ and asm directories
  • update duplicate marking tests with mocks
  • remove diagnostic print, restore print ordering, and add missing print option
  • bump rocm-docs-core from 1.2.0 to 1.5.0 in /docs/sphinx
  • refactor kernel duplicate matching
  • refactor generateLogicDataAndSolutions
  • remove globals from prepAsm
  • restrict XCC mapping to gfx942
  • refactor argument parsing in TensileCreateLibrary
  • disable failing rhel9 tests
  • change line length for formatting to 100 characters
  • change YAML operations to use C libyaml backend
  • improve warning wording
  • remove deprecated package-library option
  • update clang support for Windows
  • update supportedCompiler function
  • use conditional choices and defaults
  • remove duplicate which function and minor cleanup
  • refactor sanity check in TensileCreateLibrary
  • factor client config logic from TensileCreateLibrary main into createClientConfig
  • use glob to find logic files in TensileCreateLibrary
  • use function to confirm supported compiler rather than raw logic
  • update verifyManifest in TensileCreateLibrary
  • update RTD configs
  • clean up the CMake to prevent redundant work in client builds
  • update Stream-K debug settings
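
As an illustration of the glob-based logic file lookup mentioned above, a minimal sketch might look like the following. The helper name `find_logic_files`, the `*.yaml` default pattern, and the recursive search are assumptions made for the example, not the code in TensileCreateLibrary.

```python
import glob
import os


def find_logic_files(root, pattern="*.yaml"):
    """Return a sorted list of files under root whose names match pattern.

    "**" with recursive=True matches zero or more directory levels, so
    files directly in root are found as well as files in subdirectories.
    """
    return sorted(glob.glob(os.path.join(root, "**", pattern), recursive=True))


if __name__ == "__main__":
    # The directory name here is a placeholder.
    for path in find_logic_files("library_logic"):
        print(path)
```

Sorting the result keeps the processing order stable across file systems, which helps when build output needs to be reproducible.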

Fixes

  • fix Stream-K XCC configs for gfx942
  • update WMMA capability command for ISA 10+
  • fix progress bar character encoding error on Windows
  • fix solution redundancy removal
  • fix tuning imports for pyyaml
  • fix printing ASM capabilities for ROCm < 6.3
  • fix code objects by filtering kernels with build errors and unprocessed kernels
  • fully qualify std::get in contraction solutions
  • add -v flag and change system invocation
  • use conditional imports for new dependencies to fix yaml CSafe load and dump import, and to fix rich terminal print import
  • fix comments on scalarStaticDivideAndRemainder
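
The conditional-import fix above can be pictured with a short sketch. It prefers the libyaml-backed PyYAML classes when they exist and falls back to plain printing when rich is not installed. The names `pick_yaml_backend` and `status` are hypothetical; Tensile's own import handling may be organized differently.

```python
def pick_yaml_backend():
    """Return (loader, dumper) classes, preferring the libyaml-backed ones.

    PyYAML only exposes CSafeLoader/CSafeDumper when it was built against
    libyaml, so the pure-Python classes serve as the fallback.
    """
    import yaml  # PyYAML; raises ImportError if it is not installed

    loader = getattr(yaml, "CSafeLoader", yaml.SafeLoader)
    dumper = getattr(yaml, "CSafeDumper", yaml.SafeDumper)
    return loader, dumper


try:
    # Optional dependency: use rich for terminal output when available.
    from rich.console import Console

    _console = Console()

    def status(message):
        _console.print(message)

except ImportError:

    def status(message):
        print(message)


if __name__ == "__main__":
    status("selecting YAML backend")
    loader, dumper = pick_yaml_backend()
    status(loader.__name__)
```

Guarding the optional imports this way lets the same script run on systems where the newer dependencies are missing, at the cost of slower YAML parsing or plainer output.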

Tensile 4.40.0 for ROCm 6.1.2

04 Jun 16:52
bf05992

Tensile code for ROCm 6.1.2 did not change. The library was rebuilt for the updated ROCm 6.1.2 stack.

Tensile 4.40.0 for ROCm 6.1.1

08 May 17:59
bf05992

Tensile code for ROCm 6.1.1 did not change. The library was rebuilt for the updated ROCm 6.1.1 stack.

Tensile 4.41.0 for ROCm 6.2.2

27 Sep 16:01
dbc2062

Tensile code for ROCm 6.2.2 did not change. The library was rebuilt for the updated ROCm 6.2.2 stack.

Tensile 4.41.0 for ROCm 6.2.1

20 Sep 19:57
dbc2062

Tensile code for ROCm 6.2.1 did not change. The library was rebuilt for the updated ROCm 6.2.1 stack.

Tensile 4.41.0 for ROCm 6.2.0

02 Aug 16:15
dbc2062

Additions

  • new tuning script to summarize rocBLAS log file
  • new environment variable to test fixed grid size with Stream-K kernels
  • new Stream-K dynamic mode to run large problems at slightly reduced CU count if it improves work division and power
  • add reject conditions for SourceKernel + PrefetchGlobalRead/LoopDoWhile
  • add reject condition for PreloadKernelArguments (disable PreloadKernelArguments if not supported, instead of rejecting kernel generation)
  • support NT flag for global load and store for gfx94x
  • new Kernarg preloading feature (DelayRemainingArgument: initiate the load of the remaining (non-preloaded) arguments, updated AsmCaps, AsmRegisterPool to track registers for arguments and preload)
  • add option for rotating buffers timing with cache eviction
  • add predicate for arithmetic intensity
  • add DirectToVgpr + packing for f8/f16 + TLU cases
  • enable negative values for ExtraLatencyForLR to reduce interval of local read and wait for DTV
  • add test cases for DirectToVgpr + packing
  • add batch support for Stream-K kernels and new test cases
  • new tuning scripts to analyze rocblas-bench results and remove tuned sizes from liblogic
  • enable VgprForLocalReadPacking + PrefetchLocalRead=1 (removed the reject condition for VFLRP + PLR=1, added test cases for VFLRP + PLR=1)
  • support VectorWidthB (new parameter VectorWidthB)
  • support VectorWidth + non SourceSwap
  • add test cases for VectorWidthB, VectorWidth + non SourceSwap
  • add code owners file
  • new environment variables to dynamically adjust number of CUs used in Stream-K
  • add new parameters to specify global load width for A and B separately (GlobalLoadVectorWidthA, B; effective with GlobalReadVectorWidth=-1)
  • add xf32 option to rocblas-bench input creator
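
For readers unfamiliar with the arithmetic intensity predicate listed above, the quantity is usually the ratio of floating-point operations to bytes moved. The sketch below uses a common GEMM estimate and assumes A, B, and C are each touched once; the function name and the exact byte accounting are assumptions, and Tensile's predicate may define the value differently.

```python
def arithmetic_intensity(m, n, k, bytes_per_element):
    """Estimate FLOPs per byte for C = A * B with A (m x k) and B (k x n).

    Assumption: one multiply and one add per inner-product term, and each
    matrix is read or written exactly once.
    """
    flops = 2 * m * n * k
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # Square half-precision problem versus a skinny one.
    print(arithmetic_intensity(1024, 1024, 1024, 2))
    print(arithmetic_intensity(1024, 1, 1024, 2))
```

Large square problems have high intensity and tend to be compute bound, while skinny problems have low intensity and tend to be memory bound, which is why a predicate on this value can help select between kernels.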

Optimizations

  • initialization optimizations (reordered init code for PreloadKernelArguments opt, used s_mov_b64 for 64 bit address copy, used v_mov_b64/ds_read_b64 for C register initialization, added undefine AddressC/D with PreloadKernelArguments, optimized waitcnt for prefetch global read with DirectToVgpr, refactored waitcnt code for DTV and moved all asm related code to KernelWriterAssembly.py)
  • optimize temp vgpr allocation for ClusterLocalRead (added if condition to allocate temp vgpr only for 8bit datatype)
  • reverse MFMA order in inner loop for odd outer iteration
  • optimize waitcnt lgkmcnt for 1LDSBuffer + PGR>1 (removed redundant waitcnt lgkmcnt after 1LDSBuffer sync)
  • raise maximum value of DepthU to 1024 (used globalParameters MaxDepthU to define maximum value of DepthU)

Changes

  • update rocBLAS-bench-input-create script (added number of iteration based on performance, rotating buffer flag)
  • limit build threads based on CPUs/RAM available on system (for tests)
  • update required workspace size for Stream-K, skip kernel initialization when possible
  • use fallback libraries for archs without optimized logic
  • use hipMemcpyAsync for validation (replace hipMemcpy with hipMemcpyAsync + hipStreamSynchronize in ReferenceValidator)
  • remove OCL tests
  • disable HostLibraryTests
  • reduce extended test time by removing extra parameters in the test config files
  • disable InitAccVgprOpt for Stream-K
  • skip sgemm 64bit offset tests for gfx94x
  • skip DTV, DTL, LSU+MFMA tests for gfx908
  • increase extended test timeout to 720 min
  • update xfail test (1sum tests only failing on gfx90a)
  • update lib logic convertor script
  • test limiting CI threads for only gfx11
  • WGM related kernargs are removed if they are not needed (WGM=-1,0,1)
  • clean up unused old code, mostly related to the old client
  • change GSUA to SingleBuffer if GlobalSplitU=1 + MultipleBuffer, instead of rejecting it
  • update efficiency script for new architecture and xf32 datatype
  • re-enable negative values for WorkGroupMapping (asm kernel only)
  • disable HW monitor for aquavanjaram941
  • pre-apply offsets for strided batch kernels
  • update tensile build with 16 threads
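
The thread-limiting change above can be sketched as taking the smaller of a CPU-based and a RAM-based limit. The helper names and the 2 GiB per-job budget are assumptions chosen for the example; the values Tensile's tests actually use are not stated in these notes.

```python
import os


def max_build_threads(cpu_count, ram_bytes, bytes_per_job=2 * 1024**3):
    """Return a thread count bounded by both CPU count and available RAM.

    bytes_per_job is a hypothetical memory budget for one build job.
    At least one thread is always allowed.
    """
    by_cpu = max(1, cpu_count)
    by_ram = max(1, ram_bytes // bytes_per_job)
    return min(by_cpu, by_ram)


def detect_ram_bytes():
    """Return physical RAM in bytes, or None where sysconf cannot report it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    ram = detect_ram_bytes()
    threads = cpus if ram is None else max_build_threads(cpus, ram)
    print(threads)
```

Capping by memory as well as by core count avoids out-of-memory failures on machines with many cores but comparatively little RAM.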

Fixes

  • fix WorkspaceCheck implementation when used in rocBLAS
  • ignore asm cap check for kernel arg preload for rocm6.0 and older
  • fix Stream-K partials cache behavior
  • fix MasterSolutionLibrary indexing for multiple architecture build
  • fix memory allocation fail with FlushMemorySize + StridedBatched/Batched cases (multiply batch count size when calculating array size)
  • fix BufferLoad=False with Stream-K
  • fix mismatch issue with GlobalReadCoalesceGroup
  • fix rocblas build fail on gfx11 (used state["ISA"] for reject conditions instead of globalParameters["CurrentISA"])
  • fix for LdsPad auto (fixed incorrect value assignment for autoAdjusted, set LdsBlockSizePerPadA or B = 0 if stride is not power of 2)
  • fix inaccurate vgpr allocation for ClusterLocalRead
  • fix mismatch issue with LdsBlockSizePerPad + MT1(or 0) not power of 2
  • fix mismatch issue with InitAccOpt + InnerUnroll (use const 0 for src1 of MFMA only if index of InnerUnroll (iui) is 0)
  • fix HostLibraryTests on gfx942 and gfx941
  • fix LLVM crash issue
  • fix for newer windows vcpkg msgpack and vcpkg version package name
  • fix an error with DisableKernelPieces + 32bit ShadowLimit

Tensile 4.40.0 for ROCm 6.1.0

16 Apr 19:07
be9f7da

Additions

  • new DisableKernelPieces values to invalidate local read, local write, and global read
  • Stream-K kernel generation, including two-tile Stream-K algorithm by setting StreamK=3
  • feature to allow testing Stream-K grid multipliers
  • debug output to check occupancy for Stream-K
  • reject condition for FractionalLoad + DepthU!=power of 2
  • new TENSILE_DB debugging value to dump the common kernel parameters
  • predicate for APU libs
  • new parameter (ClusterLocalRead) to turn on/off wider local read opt for TileMajorLDS
  • new parameter (ExtraLatencyForLR) to add extra interval between local read and wait
  • new logic to check LDS size with auto LdsPad(=1) and change LdsPad to 0 if LDS overflows
  • initialization type and general batched options to the rocblas-bench input creator script

Optimizations

  • enabled MFMA + LocalSplitU=4 for MT16x16
  • enabled (DirectToVgpr + MI4x4) and supported skinny MacroTile
  • optimized postGSU kernel: separate postGSU kernels for different GSU values, loop unroll for GSU loop, wider global load depending on array size, and parallel reduction depending on array size
  • auto LdsPad calculation for TileMajorLds + MI16x16
  • auto LdsPad calculation for UnrollMajorLds + MI16x16 + VectorWidth

Changes

  • cleared hipErrorNotFound error since it is an expected part of the search
  • modified hipcc search path for Linux
  • changed PCI ID from 32bit to 64bit for ROCm SMI HW monitor
  • changed LdsBlockSizePerPad to LdsBlockSizePerPadA, B to specify LBSPP separately
  • changed the default value of LdsPadA, B, LdsBlockSizePerPadA, B from 0 to -1
  • updated test cases according to parameter changes for LdsPad, LBSPP and ClusterLocalRead
  • replaced std::regex with fnmatch()/PathMatchSpec as a workaround for a known std::regex stack overflow bug
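
To show the kind of matching that replaces regular expressions in the last item above, here is a small sketch using Python's fnmatch module, which follows the same shell-style wildcard rules as the C fnmatch() function. The file names are made up, and the actual change is in Tensile's C++ code, not in Python.

```python
import fnmatch
import re


def matches(name, pattern):
    """Case-sensitive shell-style match: * matches any run of characters."""
    return fnmatch.fnmatchcase(name, pattern)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    names = ["TensileLibrary_gfx90a.dat", "TensileLibrary_gfx942.dat", "notes.txt"]
    print([n for n in names if matches(n, "*gfx90a*")])

    # The same wildcard pattern expressed as a regular expression.
    regex = re.compile(fnmatch.translate("*.dat"))
    print([n for n in names if regex.match(n)])
```

Wildcard matching covers the common file-selection cases with much simpler machinery than a general regular expression engine.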

Fixes

  • append parallel-jobs=4 flag to hipcc compile
  • race condition in Stream-K that appeared with large grids and small sizes
  • mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and TailLoop
  • mismatch issue with LdsPad + LdsBlockSizePerPad!=0 and SplitLds
  • incorrect reject condition check for DirectToLds + LdsBlockSizePerPad=-1 case
  • small fix for LdsPad optimization (LdsElement calculation)

Tensile 4.39.0 for ROCm 6.0.2

31 Jan 20:12
17df881

Tensile code for ROCm 6.0.2 did not change. The library was rebuilt for the updated ROCm 6.0.2 stack.