Why is there a 16x8x16 TensorOp for tf32 but not a 16x16x8? #1382
-
I am learning about CUTLASS with the ultimate goal of accelerating a batched multiply-add of small matrix ops (like 16x16x8). Most of the terminology in this library still eludes me, so please forgive me if I am asking something obvious. According to the CUDA docs, there exists only one tensor core operation for tf32, which is 16x16x8. OTOH, the CUTLASS docs list several available TensorOps. How come there is a 16x8x16 mode but not a 16x16x8 one?
Replies: 1 comment 1 reply
-
The 16x16x8 instruction shape is only supported by WMMA (the first table you show documents the shapes supported by WMMA). CUTLASS uses the PTX API (`mma.sync.*`) for Ampere tensor core ops, which for tf32 natively supports an instruction shape of 16x8x4 or 16x8x8.
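For reference, here is a minimal sketch (my own illustration, not from this discussion) of what the `mma.sync` path looks like for tf32 on sm_80+, using the m16n8k8 shape mentioned above. The function name `mma_m16n8k8_tf32` is hypothetical; the per-thread fragment sizes follow the PTX ISA layout for this shape, and distributing the tile across the warp (e.g. via `ldmatrix` or CUTLASS iterators) is assumed to happen elsewhere:

```cuda
#include <cstdint>

// One warp-wide tf32 MMA with shape m16n8k8 via inline PTX (sm_80+).
// Per-thread fragments (PTX ISA layout for mma.m16n8k8 with tf32):
//   a: 4 x tf32 values (as raw 32-bit registers) of the 16x8 A tile
//   b: 2 x tf32 values (as raw 32-bit registers) of the  8x8 B tile
//   c: 4 x f32 accumulators of the 16x8 C tile (read-modify-write)
__device__ void mma_m16n8k8_tf32(uint32_t const (&a)[4],
                                 uint32_t const (&b)[2],
                                 float (&c)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]));
}
```

Larger tf32 TensorOp shapes in the CUTLASS tables (such as 16x8x16) are, as I understand it, built by tiling these native instructions along k, and WMMA's 16x16x8 tf32 fragment is likewise lowered by the compiler onto the smaller native shapes.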