V4.3.0 Performance Improvements and Bug Fixes
Features
- Use source kernels for k<=128 to fix the stride_b=0, batch_count > 1 case
- __hfma no longer needed
- Modify default handling for LdsPad, if -1, only pad the TLU=0 cases
- Combine second-to-last MAC iter into common loop
- Reset local pointers at iteration based on PrefetchLocalRead
- Multi-thread kernel writing; provides a 3X-4X build speedup
- Support -1 default LdsPad (matches VectorWidth)
- Refactor .yaml files
- Optimize overhang calculation
- Use glvw in overhang calculation
- Enable CodeFromFiles
- Feature-detect invalid kernels
- Change order to better match write batching reclaim algorithm
- Allocate LoopCounters in middle of SGPRs so tmp sgpr recovery works
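
Note on the LdsPad default: the two LdsPad items above can be read together as "a value of -1 selects a default, which pads only the TLU=0 cases and uses VectorWidth as the pad amount". A minimal sketch of that reading follows. The function name and signature are hypothetical and are not taken from the project source; treat it as an illustration of the rule under that assumption, not as the actual implementation.

```python
def resolve_lds_pad(lds_pad: int, tlu: bool, vector_width: int) -> int:
    """Hypothetical helper: resolve an LdsPad setting to a concrete pad.

    Assumption: -1 means "pick the default", and the default pads only
    the TLU=0 case, with the pad amount matching VectorWidth.
    """
    if lds_pad != -1:
        # An explicit value is used unchanged.
        return lds_pad
    # Default handling: no pad when TLU=1, VectorWidth when TLU=0.
    return 0 if tlu else vector_width


if __name__ == "__main__":
    for tlu in (False, True):
        print(f"TLU={int(tlu)} -> pad {resolve_lds_pad(-1, tlu, 4)}")
```

An explicit LdsPad still overrides the default in this sketch, so existing configurations that set a value would behave as before.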