v4.0.2 Performance improvements and initial mixed precision support
Features
- Initial mixed precision support
- Performance Improvements
- Use Buffer Load for global reads (save registers, reduce instruction count)
- Support DirectToLds (save registers, reduce latency)
- Reduce global read offset vgprs (save registers)
- Use Buffer Store for global stores (reduce instruction count)
- Optimize global store address calculation (reduce instruction count)
- Support LdsPad to reduce LDS write bank conflicts
- Improve debug for assembly path (asserts, state dump, init LDS)
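
As a rough illustration of why padding LDS can reduce bank conflicts (the LdsPad item above), the sketch below counts how many threads land on the same bank when each thread accesses one row of a tile. It is a simplified model, not the library's implementation: the 32-bank, word-addressed layout, the column-style access pattern, and the helper name `bank_conflict_degree` are all assumptions made for this example.

```python
NUM_BANKS = 32  # assumption: 32 banks, one 4-byte word per bank per cycle


def bank_conflict_degree(rows, row_stride_words, column=0, num_banks=NUM_BANKS):
    """Return the largest number of threads that map to a single bank.

    Thread r accesses element (r, column) of a tile whose rows are
    row_stride_words apart. The bank is the word address modulo num_banks.
    A result of 1 means the access is conflict-free in this model.
    """
    counts = {}
    for r in range(rows):
        bank = (r * row_stride_words + column) % num_banks
        counts[bank] = counts.get(bank, 0) + 1
    return max(counts.values())


# Row stride equal to the bank count: every thread maps to the same bank.
unpadded = bank_conflict_degree(rows=32, row_stride_words=32)

# One word of padding per row shifts each row to a different bank.
padded = bank_conflict_degree(rows=32, row_stride_words=32 + 1)

print(unpadded, padded)  # prints "32 1"
```

In this model a stride that is a multiple of the bank count serializes all 32 accesses, while a single padding word spreads them across all banks. The cost is the extra LDS space consumed by the padding.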