Replace cvt instructions with bitwise operations in s8->bf16 conversions
Hopper has very low throughput for conversion instructions, so these operations quickly become an ALU bottleneck. Restating the conversion in terms of bitwise ops and SIMD bf16 instructions increases throughput significantly and translates to meaningful speedups (e.g. 10% end-to-end on one matmul I was looking at).
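
For illustration only, here is one way an s8->bf16 conversion can be phrased with bitwise ops plus a single bf16 subtraction. This is a scalar Python sketch of the general idea, not the code from this commit: the helper names `bf16_bits_to_float` and `s8_to_bf16_bitwise` are hypothetical, and bf16 is emulated as the upper 16 bits of a float32. On the GPU the same steps would presumably run on packed registers, with the subtraction done by a packed bf16 instruction.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit pattern as bf16 (the upper half of an IEEE float32)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def s8_to_bf16_bitwise(x: int) -> float:
    """Convert a signed 8-bit integer to its bf16 value without an int->float cast."""
    if not -128 <= x <= 127:
        raise ValueError("x must fit in a signed 8-bit integer")
    byte = x & 0xFF  # two's-complement byte: x == (byte & 0x7F) - 128 * (byte >> 7)

    # 0x4300 is bf16 128.0. At that exponent each of the 7 mantissa bits steps
    # by exactly 1.0, so OR-ing in the low 7 bits yields 128 + (byte & 0x7F).
    biased = bf16_bits_to_float(0x4300 | (byte & 0x7F))

    # The sign bit of the byte lands on the lowest exponent bit:
    # 0x4300 is 128.0 (non-negative input), 0x4380 is 256.0 (negative input).
    offset = bf16_bits_to_float(0x4300 | (byte & 0x80))

    # Both operands and the difference are integers of at most 8 significant
    # bits, so the subtraction is exact in bf16.
    return biased - offset


if __name__ == "__main__":
    for value in (-128, -1, 0, 1, 127):
        print(value, s8_to_bf16_bitwise(value))
```

The design point is that the only floating-point arithmetic left is one subtraction; everything else is masking and OR-ing constants into the input bits, which avoids the conversion instruction entirely.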