@krzysz00, @sjw36, @jerryyin This is a very first draft of this task, but I thought to get the ball rolling.
Goal
Adding WMMA support to rocMLIR
What is WMMA
WMMA is a new set of intrinsics available on RDNA3. A simpler explanation (and a hello-world HIP application) can be found here:
https://gpuopen.com/learn/wmma_on_rdna3/
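Before going through the flavours, it may help to pin down the math a single WMMA instruction performs: a 16x16x16 multiply-accumulate, D = A*B + C. The plain-Python sketch below (the function name is mine, purely illustrative) models only the tile-level semantics, not the per-lane register layout.

```python
# Reference semantics of one WMMA tile operation: D = A * B + C,
# where A and B are 16x16 input tiles and C/D are 16x16 accumulators.
# With f16 inputs and an f32 accumulator this corresponds to the
# wmma_f32_16x16x16_f16 flavour.

TILE = 16

def wmma_16x16x16(a, b, c):
    """Multiply two 16x16 tiles and add a 16x16 accumulator tile."""
    d = [[0.0] * TILE for _ in range(TILE)]
    for i in range(TILE):
        for j in range(TILE):
            acc = c[i][j]
            for k in range(TILE):
                acc += a[i][k] * b[k][j]
            d[i][j] = acc
    return d
```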
Different WMMA flavours
There are 12 versions of the intrinsics in total:
__builtin_amdgcn_wmma_f32_16x16x16_f16_{w32,w64}
__builtin_amdgcn_wmma_f32_16x16x16_bf16_{w32,w64}
__builtin_amdgcn_wmma_f16_16x16x16_f16_{w32,w64}
__builtin_amdgcn_wmma_bf16_16x16x16_bf16_{w32,w64}
__builtin_amdgcn_wmma_i32_16x16x16_iu8_{w32,w64}
__builtin_amdgcn_wmma_i32_16x16x16_iu4_{w32,w64}
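The 12 names all follow the same scheme, `__builtin_amdgcn_wmma_<out>_16x16x16_<in>_<wave>`. The snippet below just enumerates the six (output, input) type pairs listed above for both wave sizes, to make the scheme explicit.

```python
# The six (output type, input type) pairs from the list above,
# expanded for both wave sizes to give the 12 intrinsic names.
PAIRS = [
    ("f32", "f16"),
    ("f32", "bf16"),
    ("f16", "f16"),
    ("bf16", "bf16"),
    ("i32", "iu8"),
    ("i32", "iu4"),
]

INTRINSICS = [
    f"__builtin_amdgcn_wmma_{out}_16x16x16_{inp}_{wave}"
    for out, inp in PAIRS
    for wave in ("w32", "w64")
]
```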
The output is always stored in 8 VGPRs for the w32 version and in 4 VGPRs for the w64 version. Because of that, the `fp16` and `bf16` versions have an `OPSEL` bit that decides whether to store the data in the high or the low half of each VGPR. This means that the return type of the different intrinsics in LLVM can be:

- `8xfloat`, `8xi32`, `16xf16`, `16xbf16` for the w32 version
- `4xfloat`, `4xi32`, `8xf16`, `8xbf16` for the w64 version

Add support in the backend (`AMDGPU.td` and `ROCDLOps.td`)

I came up with the following `AMDGPU` operation:

Notes:
- `f16->f16` (and `bf16->bf16`) wmma outputs a logical `8x2` or `16x2` vector; the translation of a 2D vector in LLVM results in an `llvm.array`, which is hard to flatten out. So the 2D -> 1D conversion logic needs to happen before we issue the wmma operation.
- `int8`: while `ui8` and `si8` are supported in MLIR, they are not supported in LLVM. This means that when the MLIR -> LLVM conversion happens, `unrealized_cast`s appear. So the best way to go is to let the user of `wmma` specify whether the `int8` input is signed or not.
- The LLVM intrinsics expect a `vector<4xi32>` instead of a `vector<16xi8>`. I implemented this conversion inside the `AMDGPUToROCDL` conversion.

The implementation is available at: #1035
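Both register-layout notes above come down to packing narrow elements into 32-bit VGPR lanes. The plain-Python sketch below (function names are mine, not from the patch) models the two cases: the `OPSEL`-controlled placement of a 16-bit result in the high or low half of a 32-bit register, and the `vector<16xi8>` -> `vector<4xi32>` repacking done for the `iu8` inputs.

```python
import struct

def pack_f16_opsel(halves, opsel_hi):
    """Place each 16-bit value in a 32-bit register, in the high half
    when opsel_hi is set, otherwise in the low half (the unused half
    is left as zero in this model)."""
    shift = 16 if opsel_hi else 0
    return [(h & 0xFFFF) << shift for h in halves]

def pack_i8_to_i32(bytes16):
    """Repack 16 i8 values into 4 i32 words, little-endian,
    as a bitcast of vector<16xi8> to vector<4xi32> would."""
    raw = struct.pack("<16b", *bytes16)
    return list(struct.unpack("<4i", raw))
```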
Add support in the front-end (`GridwiseGemmToBlockWise`, `BlockwiseGemmToThreadWise`, `ThreadWiseGemLowering`)

This depends on the previous task and can be done later.