V4.4.0 Performance Improvements and Bug Fixes
Features
- Support Global Split U for half and double
- Support Local Split U for half and hpa
- Fix beta for hpa
- Add AssertFree0ElementMultiple requirement and runtime launch check
- Intercept solution selection logic and call the hgemm HIP kernel when the size of the summation index or of the first free index is odd
- Correct reordered_schedules fallback for hgemm
- Disable PreciseBoundsCheck
- Update rocblas_hgemm_asm_full.yaml to call source with VW=2 for m, n, k <= 32
- Update rocblas_hgemm_asm_full.yaml to call source with VW=1 for m, n, k == 1
- Use alternating sign in random init for half
- Use hipGetDevice in place of hipCtxGetDevice
- Use _Float16 in place of __fp16
- Add device to llvm_fma_v2f16
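
The hgemm interception rule and the AssertFree0ElementMultiple launch check listed above can be illustrated with a small host-side sketch. This is not the library's code: the names `HgemmPath`, `isMultipleOf`, and `selectHgemmPath` are hypothetical, and the sketch assumes that "odd" refers to the problem sizes of the first free index and the summation index.

```cpp
#include <cstddef>

// Hypothetical names for illustration only; not the library's actual API.
enum class HgemmPath { AssemblyKernel, HipSourceKernel };

// True when `size` is a whole multiple of `multiple`. A requirement such as
// AssertFree0ElementMultiple expresses this kind of condition on the first
// free index, and a launch-time check would reject sizes that fail it.
constexpr bool isMultipleOf(std::size_t size, std::size_t multiple) {
    return multiple != 0 && size % multiple == 0;
}

// Assumed reading of the release note: if the first free index size or the
// summation index size is odd, route the call to the HIP source kernel;
// otherwise the assembly kernel remains eligible.
constexpr HgemmPath selectHgemmPath(std::size_t free0Size, std::size_t summationSize) {
    const bool bothEven = isMultipleOf(free0Size, 2) && isMultipleOf(summationSize, 2);
    return bothEven ? HgemmPath::AssemblyKernel : HgemmPath::HipSourceKernel;
}
```

For example, under these assumptions `selectHgemmPath(64, 33)` selects the HIP source kernel because the summation size is odd, while `selectHgemmPath(64, 64)` leaves the assembly kernel eligible.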