V4.3.0 Performance Improvements and Bug Fixes

amcamd released this 28 Jun 21:36

510b8e2

Features

Use source kernels for k <= 128 to fix the stride_b = 0, batch_count > 1 case
__hfma no longer needed
Modify default handling for LdsPad: if -1, only pad the TLU=0 cases
Combine second-to-last MAC iter into common loop
Reset local pointers at iteration based on PrefetchLocalRead
Multi-thread the kernel writing, which gives a 3X-4X build speedup
Support -1 default LdsPad (matches VectorWidth)
Refactor .yaml files
Optimize overhang calculation
Use glvw in overhang calculation
Enable CodeFromFiles
Feature detect invalid kernel
Change order to better match write batching reclaim algorithm
Allocate LoopCounters in middle of SGPRs so tmp sgpr recovery works
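
The two LdsPad items above (the -1 default matches VectorWidth, and with -1 only the TLU=0 cases are padded) can be read together as a small rule. The sketch below is one possible interpretation only, not the project's actual code: the function name resolve_lds_pad and its parameters are hypothetical, and treating "not padded" as a pad of 0 is an assumption.

```python
def resolve_lds_pad(lds_pad: int, vector_width: int, tlu: bool) -> int:
    """Return the effective LDS padding for one tile (illustrative only).

    lds_pad      -- requested LdsPad value; -1 means "use the default"
    vector_width -- the kernel's VectorWidth
    tlu          -- True when TLU=1, False when TLU=0
    """
    if lds_pad != -1:
        # An explicit value is used unchanged.
        return lds_pad
    # Default (-1): pad by VectorWidth, but only in the TLU=0 case.
    # Assumption: the TLU=1 case gets no padding at all.
    return 0 if tlu else vector_width


if __name__ == "__main__":
    for requested, vw, tlu in [(-1, 4, False), (-1, 4, True), (2, 4, True)]:
        print(requested, vw, tlu, "->", resolve_lds_pad(requested, vw, tlu))
```

Under this reading, an explicit LdsPad always wins and the default only ever adds padding in the TLU=0 case.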