V4.3.0 Performance Improvements and Bug Fixes

@amcamd amcamd released this 28 Jun 21:36
· 3861 commits to master since this release

Features

  • Use source kernels for k<=128 to fix the stride_b=0, batch_count > 1 case
  • __hfma no longer needed
  • Modify default handling for LdsPad: if -1, pad only the TLU=0 cases
  • Combine second-to-last MAC iter into common loop
  • Reset local pointers at iteration based on PrefetchLocalRead
  • Multi-thread the kernel writing, which provides a 3X-4X build speedup
  • Support -1 default LdsPad (matches VectorWidth)
  • Refactor .yaml files
  • Optimize overhang calculation
  • Use glvw in overhang calculation
  • Enable CodeFromFiles
  • Feature detect invalid kernel
  • Change order to better match write batching reclaim algorithm
  • Allocate LoopCounters in middle of SGPRs so tmp sgpr recovery works
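
As a rough illustration of the multi-threaded kernel writing item above, the sketch below fans kernel file output across a thread pool. It is not the project's actual code: `render_kernel`, `write_kernel`, `write_kernels`, the `.s` extension, and the `workers` parameter are all hypothetical names chosen for this example, and the 3X-4X figure is the release's own claim, not something this sketch measures.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def render_kernel(name):
    # Hypothetical stand-in for kernel source generation.
    return f"// kernel {name}\n"


def write_kernel(out_dir, name):
    # Write one kernel source file and return its path.
    path = os.path.join(out_dir, f"{name}.s")
    with open(path, "w") as f:
        f.write(render_kernel(name))
    return path


def write_kernels(out_dir, names, workers=4):
    # Each kernel goes to its own file, so the writes are independent
    # and can be handed to a pool. map() returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda n: write_kernel(out_dir, n), names))
```

Calling `write_kernels(out_dir, ["k0", "k1"])` would return the two file paths in the same order as the names. In CPython, threads mainly help when the work is I/O-bound. If kernel generation is CPU-bound, a process pool may be the better fit.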