CUTLASS 1.0.1
CUTLASS 1.0.1.
Intra-threadblock reduction added for small threadblock tile sizes
- sgemm_64x128x16, sgemm_128x128x16, sgemm_128x64x16, sgemm_128x32x16, sgemm_64x64x16, sgemm_64x32x16
- igemm_32x32x128
- GEMM K residue handled during prologue prior to mainloop
Replaced Google Test copy with submodule. Use git submodule init