[RFC] Reduction fusion support for GPU-gemm like kernels #925

manupak · 2022-12-23T14:07:25Z

manupak
Dec 23, 2022
Collaborator

[RFC] Reduction fusion support for GPU-gemm like kernels

Introduction

This RFC proposes to fuse axes-wide reduction operations to gemm-like operations in roCMLIR.

A simple example would be as follows :

func.func @test_matmul_reduce_max(%arg0: tensor<12x12x32xf32>, %arg1: tensor<12x32x12xf32>) -> tensor<12x12x1xf32> attributes {kernel, arch = ""} {
    %0 = "tosa.matmul"(%arg0, %arg1) : (tensor<12x12x32xf32>, tensor<12x32x12xf32>) -> tensor<12x12x12xf32>
    %1 = "tosa.reduce_sum"(%0) {axis = 2 : i64} : (tensor<12x12x12xf32>) -> tensor<12x12x1xf32>
    return %1 : tensor<12x12x1xf32>
}

Terminology

IFM : Input Feature Map

Scope

Mostly there are two types of non-filter related reduction operations in the ML space.
There are :

R1. Sliding window type reductions [Out of Scope]

Examples for this category would be as MaxPool and AvgPool operations. These operations are similiar to convolultion operation where a window is slided in the IFM, however instead of getting a dot product of the moving window and the extracted window, a reduction is made on the extracted window. Hence, these have closer semantics to be a peer to our gemm-like Ops.

Hence, not covered here.

R2. Axes-wide reductions [In Scope]

Examples for this category would be reduce_sum and reduce_max like operations. These operations could be fused to the preceding gemm-like operation to save on bandwidth and improve performance. Hence these are are in scope for this RFC.

Proposed Methodology

Initially, we will create a "rock.reduce" operator with versioning where we gradually increase the complexity observing the gains.
Once we have the initial version, we will start working the fusing the folded/legalized rock.reduce with a preceding gemm-like Op.

Why not just work on the fusion straight away ?

It would be easier to experiment with varied implementations easily and reason about them and also would serve as a baseline until the optimal is known.
Additionally, if there are orphan reductions (possibly preceded by unsupported non-rock ops) we can still run them on the GPU.

As for the staged iterative developement, I would propose 3 stages as follows :

Directly just doing reductions to using atomics to global memory
Using LDS for the reduction in blocks and doing per-block atomic reductions
Do per-wave DPP based reductions -> store per-wave partial reduction them to LDS -> do atomic reductions to global memory per block

Stage 1 : Directly just doing reductions to using atomics to global memory

This would be the easiest to develop and will stand as the baseline for the other stages.
The main idea behind this approach would be we'll be basically looping over the reduction axis and storing them

func.func @test_matmul_reduce_max(%arg0: memref<2x12x32xf32>, %arg1: memref<2x32x12xf32>, %arg2: memref<2x12x1xf32>) attributes {arch = "", block_size = 64 : i32, grid_size = 2 : i32, kernel} {
    %cst = arith.constant dense<0.000000e+00> : vector<8xf32>
    %c0 = arith.constant 0 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
 
    ;LDS to store inputs and its transforms
    %0 = rock.alloc() : memref<1024xf32, 3>
    %1 = rock.transform %0 by #transform_map0 : memref<1024xf32, 3> to memref<512xf32, 3>
    %2 = rock.transform %1 by #transform_map1 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
    %3 = rock.transform %0 by #transform_map2 : memref<1024xf32, 3> to memref<512xf32, 3>
    %4 = rock.transform %3 by #transform_map3 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
 
    ; VGPR for thread outputs
    %5 = rock.alloc() : memref<16xf32, 5>
    %6 = rock.transform %5 by #transform_map4 : memref<16xf32, 5> to memref<4x4xf32, 5>
    rock.fill(%5, %cst_0) : memref<16xf32, 5>, f32
 
    %7 = rock.workgroup_id : index
    %8 = rock.workitem_id : index
 
    ...

    rock.lds_barrier
    rock.blockwise_gemm %6 += %2 * %4 {params = #general_gemm_params} : memref<4x4xf32, 5> += memref<16x32x1xf32, 3> * memref<16x32x1xf32, 3>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg3) = [#transform_map21](%7, %8, %c0), (%arg4, %arg5, %arg6) = [#transform_map22, #transform_map23, #transform_map24, #transform_map25](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
 
      ; loading the gemm output value to VGPR
      %11 = rock.alloc() : memref<1xf32, 5>
      %12 = rock.in_bounds_load %5[%arg3] : memref<16xf32, 5>, index -> f32
 
      ; We do the reduction to globals in atomics
      ; Also there is only atomic_add but there are more intrinsics in llvm.
      rock.buffer_store atomic_add %12 -> %arg2[%arg4, %arg5, %c0] {leftOobDims = [], rightOobDims = [1 : i32]} : f32 -> memref<2x12x1xf32>, index, index, index
      rock.yield
    }
    return
  } // end of function

Stage 2 : Using LDS for the reduction in blocks and doing per-block atomic reductions

Prerequisties :

We would need to rock dialect operator to handle out of bounds ld/st to LDS

This basically going to allow the threads to use LDS as the workspace to perform the parallel tree reduction.

First, we create a space in the LDS to hold $[i_1, i_2, ... , blockSize, ... , i_{n-1}]$
Then, in the CopyOutput transform for loop, we just do the intra-thread reduction and store it the unique location per thread per block.
Then after the $[i_1, i_2, ... , blockSize, ... , i_{n-1}]$ is filled with partial reductions, we run parallel tree reduction on it and save 0th element to the global output.

func.func @test_matmul_reduce_max(%arg0: memref<2x12x32xf32>, %arg1: memref<2x32x12xf32>, %arg2: memref<2x12x1xf32>) attributes {arch = "", block_size = 64 : i32, grid_size = 2 : i32, kernel} {
    %cst = arith.constant dense<0.000000e+00> : vector<8xf32>
    %c0 = arith.constant 0 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
 
    ; LDS to store inputs and its transforms
    %0 = rock.alloc() : memref<1024xf32, 3>
    %1 = rock.transform %0 by #transform_map0 : memref<1024xf32, 3> to memref<512xf32, 3>
    %2 = rock.transform %1 by #transform_map1 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
    %3 = rock.transform %0 by #transform_map2 : memref<1024xf32, 3> to memref<512xf32, 3>
    %4 = rock.transform %3 by #transform_map3 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
 
    ; VGPR for thread outputs
    %5 = rock.alloc() : memref<16xf32, 5>
    %6 = rock.transform %5 by #transform_map4 : memref<16xf32, 5> to memref<4x4xf32, 5>
    rock.fill(%5, %cst_0) : memref<16xf32, 5>, f32
 
    ; LDS for Gemm Output
    %gemm_o_lds_flattened = rock.alloc() : memref<${24xblockSize}xf32, 3>
    rock.fill(%gemm_o_lds_flattened, %cst_0) : memref<${24xblockSize}xf32, 3>, f32 
    %gemm_o_lds = rock.transform %gemm_o_lds_flattened  by #transform_map_gemm_o_dims : memref<${24xblockSize}xf32, 3> to memref<2x12xblockSizexf32, 3>
 
    %7 = rock.workgroup_id : index
    %8 = rock.workitem_id : index

    ...

    rock.lds_barrier
    rock.blockwise_gemm %6 += %2 * %4 {params = #general_gemm_params} : memref<4x4xf32, 5> += memref<16x32x1xf32, 3> * memref<16x32x1xf32, 3>
    %gtid_acc = rock.alloc() : memref<1xf32, 5>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg3) = [#transform_map21](%7, %8, %c0), (%arg4, %arg5) = [<New set of transform maps that omits reduction axis>](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
 
      ; loading the gemm output value to VGPR
      %11 = rock.alloc() : memref<1xf32, 5>
      %12 = rock.in_bounds_load %5[%arg3] : memref<16xf32, 5>, index -> f32
      rock.in_bounds_store %12 -> %11[%c0] : f32 -> memref<1xf32, 5>, index
 
      ; loading the reduction thread-based accumulator
      ; Here needs something that can load from lds with bound checks
      %gtid_acc_val = rock.global_load %gemm_o_lds[%arg4, %arg5, %8] {leftOobDims = [], rightOobDims = [1 : i32]} : memref<2x12xtotal_threadsxf32, 5>, index, index, index -> f32
      rock.in_bounds_store  %gtid_acc_val -> %gtid_acc[%c0] : f32 -> memref<1xf32, 5>, index
 
      linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%11 : memref<1xf32, 5>) outs(%gtid_acc : memref<1xf32, 5>) {
      ^bb0(%arg7: f32, %arg8: f32):
        %14 = arith.addf %arg7, %arg8 : f32
        linalg.yield %14 : f32
      }
 
      ; This is not supported (yet) and global might not be the right name
      ; The main difference from a non-reduction fused kernel is that we store the lds to start the reduction later.
      rock.global_store %gtid_acc[%c0] -> %gemm_o_lds[%arg4, %arg5, %8] storeMethod( set) {leftOobDims = [], length = 1 : index, rightOobDims = [1 : i32]} : memref<1xf32, 5> -> memref<2x12xblockSizexf32>, index, index, index
      rock.yield
    }
 
    %cblock_size = arith.constant ${blockSize} : index
    %c1 = arith.constant 1 : index
    %cblock_size_half = arith.shrsi %cblock_size %c1 : index
 
    ; Here we start doing the reduction from memref<2x12xblockSizexf32> to memref<2x12x1xf32>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg4, %arg5) = [<New set of transform maps that omits reduction axis>](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
      scf.while (%stride = %cblock_size_half ) : (i32) -> (){
        %cond = arith.cmpi sgt, %stride, %c0 : i32
        scf.condition(%cond) %stride : i32
      } do {
        %tid_less_than_stide = arith.cmpi slt, %gtid, %stride : index
        scf.if %tid_less_than_stide {
          %gtid_acc = rock.alloc() : memref<1xf32, 5>
          %gtid_acc_s = rock.alloc() : memref<1xf32, 5>
          %gtid_s = arith.addi %8, %stride : index
          ; Here needs something that can load from lds with bound checks
          rock.lds_barrier
          %gtid_acc_val = rock.global_load %gemm_o_lds[%arg4, %arg5, %8] {leftOobDims = [], rightOobDims = [1 : i32]} : memref<2x12xblockSizexf32, 5>, index, index, index -> f32
          %gtid_acc_val_s = rock.global_load %gemm_o_lds[%arg4, %arg5, %gtid_s] {leftOobDims = [], rightOobDims = [1 : i32]} : memref<2x12xblockSizexf32, 5>, index, index, index -> f32
          rock.in_bounds_store  %gtid_acc_val -> %gtid_acc[%c0] : f32 -> memref<1xf32, 5>, index          
          rock.in_bounds_store  %gtid_acc_val_s -> %gtid_acc_s[%c0] : f32 -> memref<1xf32, 5>, index
 
          linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%gtid_acc_s : memref<1xf32, 5>) outs(%gtid_acc : memref<1xf32, 5>) {
          ^bb0(%arg7: f32, %arg8: f32):
            %14 = arith.addf %arg7, %arg8 : f32
            linalg.yield %14 : f32
          }
          %gtid_acc_red_val = rock.in_bounds_load %gtid_acc[%c0] : memref<1xf32, 5>, index -> f32
 
          %is_stride_equal_to_1 = arith.cmpi rq, %stride, %c1: index
          scf.if %is_stride_equal_to_1 {
            ; If the stride is 1, that mean it is the last reduction per block hence we do atomic reduction to global memory
            ; Also there is only has atomic_add but there are more intrinsics in llvm.
            rock.buffer_store atomic_add %gtid_acc_red_val -> %arg2[%arg4, %arg5, %c0] {leftOobDims = [], rightOobDims = [1 : i32]} : f32 -> memref<2x12x1xf32>, index, index, index
            scf.yield
          }
          else {
            ; Stride is not equal to one -- we are still in partial reductions...
            ; Here needs something that can store to lds with bound checks
            rock.global_store %gtid_acc[%c0] -> %gemm_o_lds[%arg4, %arg5, %gtid] storeMethod( set) {leftOobDims = [], length = 1 : index, rightOobDims = [1 : i32]} : memref<1xf32, 5> -> memref<2x12xtotal_threadsxf32>, index, index, index
            scf.yield
          }
           
          scf.yield
        } else {
          scf.yield
        }
      rock.yield
    } ; end of scf.while
 
   } ; end of reduction loop
    return
  } ; end of function

Stage 3 : Do per-wave DPP based reductions -> store per-wave partial reduction them to LDS -> do atomic reductions to global memory per block

Prerequisties :

We would need DPP support in some dialect in MLIR
We would need to rock dialect operator to handle out of bounds ld/st to LDS

Once we have this, we can decide to make the kernel generation conditionally emit the LDS bits based on wavesPerBlock.

Case 3.1: If wavesPerBlock is low

func.func @test_matmul_reduce_max(%arg0: memref<2x12x32xf32>, %arg1: memref<2x32x12xf32>, %arg2: memref<2x12x1xf32>) attributes {arch = "", block_size = 64 : i32, grid_size = 2 : i32, kernel} {
    %cst = arith.constant dense<0.000000e+00> : vector<8xf32>
    %c0 = arith.constant 0 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
 
    ; LDS to store inputs and its transforms
    %0 = rock.alloc() : memref<1024xf32, 3>
    %1 = rock.transform %0 by #transform_map0 : memref<1024xf32, 3> to memref<512xf32, 3>
    %2 = rock.transform %1 by #transform_map1 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
    %3 = rock.transform %0 by #transform_map2 : memref<1024xf32, 3> to memref<512xf32, 3>
    %4 = rock.transform %3 by #transform_map3 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
 
    ; VGPR for thread outputs
    %5 = rock.alloc() : memref<16xf32, 5>
    %6 = rock.transform %5 by #transform_map4 : memref<16xf32, 5> to memref<4x4xf32, 5>
    rock.fill(%5, %cst_0) : memref<16xf32, 5>, f32
  
    %7 = rock.workgroup_id : index
    %8 = rock.workitem_id : index
    %clog2wave_size = arith.constant ${log2waveSize} : index
    %wave_size = arith.constant ${waveSize} : index
    %wave_size_minus_1 = arith.subi ${waveSize}, %c1 : index
    %wave_thread_id = arith.xori %8, %wave_size_minus_1 : index

    ...

    rock.lds_barrier
    rock.blockwise_gemm %6 += %2 * %4 {params = #general_gemm_params} : memref<4x4xf32, 5> += memref<16x32x1xf32, 3> * memref<16x32x1xf32, 3>
    ; reductionMask = calculate a mask for lanes where zero is present where reductions are available.
    %creduction_mask = arith.constant dense<${reductionMask}> : vector<${waveSize}xindex>
    %is_wave_thread_reduction_dest = arith.cmpi ult %creduction_mask[wave_thread_id], %c0 : index
    %gtid_acc = rock.alloc() : memref<1xf32, 5>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg3) = [#transform_map21](%7, %8, %c0), (%arg4, %arg5) = [<New set of transform maps that omits reduction axis>](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
 
      ; loading the gemm output value to VGPR
      %11 = rock.alloc() : memref<1xf32, 5>
      %12 = rock.in_bounds_load %5[%arg3] : memref<16xf32, 5>, index -> f32
      rock.in_bounds_store %12 -> %11[%c0] : f32 -> memref<1xf32, 5>, index
 
      ; This is a future instruction that will reduce the values in mirroring registers of threads in the wave
      ; The instruction will use the reduction stride to generate single or multiple reductions..
      ; For e.g. : if all the threads in the wave hold values to be stored to same %arg4, %arg5 (say AAAAAAAA assuming hypothetical waveSize==8) --> reduction stride = [1]
      ; Therefore it should reduce everything down to %wave_thread_id=0.
      ; For e.g. : if every other threads in the wave hold values to be stored to same %arg4, %arg5 (say ABABABAB assuming hypothetical waveSize==8) --> reduction stride = [2]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,1].
      ; For e.g. : for AABBAABB --> reduction stride = [2, 1]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,2].
      ; For e.g. : for AAAABBBB --> reduction stride = [4, 1]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,4].
      ; For e.g. : for AABBCCDD --> reduction stride = [4, 2, 1]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,2,4,6].
      ; For e.g. : for ABCDEFGH --> reduction stride = [8]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,1,2,3,4,5,6,7]. (i.e. no reduction)
      rock.dpp_reduce_sum %gtid_acc {reductionStride = reduction stride } :  memref<1xf32, 5>
 
      ; We do the reduction to globals in atomics
      ; Also there is only atomic_fadd but there are more intrinsics in llvm.
      %gtid_acc_val = rock.in_bounds_load %gtid_acc[%c0] : memref<16xf32, 5>, index -> f32
      scf.if %is_wave_thread_reduction_dest {
        rock.buffer_store atomic_add %gtid_acc_red_val -> %arg2[%arg4, %arg5, %c0] {leftOobDims = [], rightOobDims = [1 : i32]} : f32 -> memref<2x12x1xf32>, index, index, index
      }
      rock.yield
    }

    return
  } ; end of function

Case 3.2: If wavesPerBlock is high:

func.func @test_matmul_reduce_max(%arg0: memref<2x12x32xf32>, %arg1: memref<2x32x12xf32>, %arg2: memref<2x12x1xf32>) attributes {arch = "", block_size = 64 : i32, grid_size = 2 : i32, kernel} {
    %cst = arith.constant dense<0.000000e+00> : vector<8xf32>
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
 
    ; LDS to store inputs and its transforms
    %0 = rock.alloc() : memref<1024xf32, 3>
    %1 = rock.transform %0 by #transform_map0 : memref<1024xf32, 3> to memref<512xf32, 3>
    %2 = rock.transform %1 by #transform_map1 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
    %3 = rock.transform %0 by #transform_map2 : memref<1024xf32, 3> to memref<512xf32, 3>
    %4 = rock.transform %3 by #transform_map3 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
 
    ; VGPR for thread outputs
    %5 = rock.alloc() : memref<16xf32, 5>
    %6 = rock.transform %5 by #transform_map4 : memref<16xf32, 5> to memref<4x4xf32, 5>
    rock.fill(%5, %cst_0) : memref<16xf32, 5>, f32
 
    ; LDS to store per wave Gemm output
    ; def wavesPerBlock = blockSize/waveSize
    ; NOTE : we can decide to omit this based on a wavesPerBlock heuristic
    %gemm_o_lds_flattened = rock.alloc() : memref<${24xwavesPerBlock}xf32, 3>
    rock.fill(%gemm_o_lds_flattened, %cst_0) : memref<${24xwavesPerBlock}xf32, 3>, f32 
    %gemm_o_lds = rock.transform %5 by #transform_map_gemm_o_dims : memref<${24xwavesPerBlock}xf32, 3> to memref<2x12xwavesPerBlockxf32, 3>
 
    %7 = rock.workgroup_id : index
    %8 = rock.workitem_id : index
    %clog2wave_size = arith.constant ${log2waveSize} : index
    %wave_size = arith.constant ${waveSize} : index
    %wave_size_minus_1 = arith.subi ${waveSize}, %c1 : index
    %wave_thread_id = arith.xori %8, %wave_size_minus_1 : index
    %wave_id = arith.shrui %8, $clog2wave_size : index

    ...

    rock.lds_barrier
    rock.blockwise_gemm %6 += %2 * %4 {params = #general_gemm_params} : memref<4x4xf32, 5> += memref<16x32x1xf32, 3> * memref<16x32x1xf32, 3>
    ; reductionMask = calculate a mask for lanes where zero is present where reductions are available.
    %creduction_mask = arith.constant dense<${reductionMask}> : vector<${waveSize}xindex>
    %is_wave_thread_reduction_dest = arith.cmpi ult %creduction_mask[wave_thread_id], %c0 : index
    %gtid_acc = rock.alloc() : memref<1xf32, 5>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg3) = [#transform_map21](%7, %8, %c0), (%arg4, %arg5) = [<New set of transform maps that omits reduction axis>](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
 
      ; loading the gemm output value to VGPR
      %11 = rock.alloc() : memref<1xf32, 5>
      %12 = rock.in_bounds_load %5[%arg3] : memref<16xf32, 5>, index -> f32
      rock.in_bounds_store %12 -> %11[%c0] : f32 -> memref<1xf32, 5>, index
 
      ; loading the reduction thread-based accumulator
      ; Here needs something that can load from lds with bound checks
      %gtid_acc_val = rock.global_load %gemm_o_lds[%arg4, %arg5, %wave_id] {leftOobDims = [], rightOobDims = [1 : i32]} : memref<2x12xtotal_threadsxf32, 5>, index, index, index -> f32
      rock.in_bounds_store  %gtid_acc_val -> %gtid_acc[%c0] : f32 -> memref<1xf32, 5>, index
 
      linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%11 : memref<1xf32, 5>) outs(%gtid_acc : memref<1xf32, 5>) {
      ^bb0(%arg7: f32, %arg8: f32):
        %14 = arith.addf %arg7, %arg8 : f32
        linalg.yield %14 : f32
      }

      ; This is a future instruction that will reduce the values in mirroring registers of threads in the wave
      ; The instruction will use the reduction stride to generate single or multiple reductions..
      ; For e.g. : if all the threads in the wave hold values to be stored to same %arg4, %arg5 (say AAAAAAAA assuming hypothetical waveSize==8) --> reduction stride = [1]
      ; Therefore it should reduce everything down to %wave_thread_id=0.
      ; For e.g. : if every other threads in the wave hold values to be stored to same %arg4, %arg5 (say ABABABAB assuming hypothetical waveSize==8) --> reduction stride = [2]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,1].
      ; For e.g. : for AABBAABB --> reduction stride = [2, 1]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,2].
      ; For e.g. : for AAAABBBB --> reduction stride = [4, 1]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,4].
      ; For e.g. : for AABBCCDD --> reduction stride = [4, 2, 1]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,2,4,6].
      ; For e.g. : for ABCDEFGH --> reduction stride = [8]
      ; Therefore it should reduce everything down to %wave_thread_id in [0,1,2,3,4,5,6,7]. (i.e. no reduction)
      rock.dpp_reduce_sum %gtid_acc {reductionStride = reduction stride } :  memref<1xf32, 5>
 
      ; This is not supported (yet) and global might not be the right name
      ; The main difference from a non-reduction fused kernel is that we store the lds to start the reduction later.
      ; NOTE: we can decide to omit this based on a wavesPerBlock heuristic and directly do the atomic add reduction to global memory similiar to Stage 1.
      scf.if %is_wave_thread_reduction_dest {
        rock.global_store %gtid_acc[%c0] -> %gemm_o_lds[%arg4, %arg5, %wave_id] storeMethod( set) {leftOobDims = [], length = 1 : index, rightOobDims = [1 : i32]} : memref<1xf32, 5> -> memref<2x12xblockSizexf32>, index, index, index
      }
      rock.yield
    }
 
    %cwaves_per_block = arith.constant ${wavesPerBlock} : index
    %c1 = arith.constant 1 : index
    %cwaves_per_block_half = arith.shrsi %cblock_size %c1 : index
 
    ; Here we start doing the reduction from memref<2x12xwavesPerBlockxf32> to memref<2x12x1xf32>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg4, %arg5) = [<New set of transform maps that omits reduction axis>](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
      scf.while (%stride = %cwaves_per_block_half ) : (i32) -> (){
        %cond = arith.cmpi sgt, %stride, %c0 : i32
        scf.condition(%cond) %stride : i32
      } do {
        %tid_less_than_stide = arith.cmpi slt, %gtid, %stride : index
        scf.if %tid_less_than_stide {
          %gtid_acc = rock.alloc() : memref<1xf32, 5>
          %gtid_acc_s = rock.alloc() : memref<1xf32, 5>
          %gtid_s = arith.addi %8, %stride : index
          ; Here needs something that can load from lds with bound checks
          rock.lds_barrier
          %gtid_acc_val = rock.global_load %gemm_o_lds[%arg4, %arg5, %8] {leftOobDims = [], rightOobDims = [1 : i32]} : memref<2x12xwavesPerBlockxf32, 5>, index, index, index -> f32
          %gtid_acc_val_s = rock.global_load %gemm_o_lds[%arg4, %arg5, %gtid_s] {leftOobDims = [], rightOobDims = [1 : i32]} : memref<2x12xwavesPerBlockxf32, 5>, index, index, index -> f32
          rock.in_bounds_store  %gtid_acc_val -> %gtid_acc[%c0] : f32 -> memref<1xf32, 5>, index          
          rock.in_bounds_store  %gtid_acc_val_s -> %gtid_acc_s[%c0] : f32 -> memref<1xf32, 5>, index
 
          linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel"]} ins(%gtid_acc_s : memref<1xf32, 5>) outs(%gtid_acc : memref<1xf32, 5>) {
          ^bb0(%arg7: f32, %arg8: f32):
            %14 = arith.addf %arg7, %arg8 : f32
            linalg.yield %14 : f32
          }
          %gtid_acc_red_val = rock.in_bounds_load %gtid_acc[%c0] : memref<1xf32, 5>, index -> f32
 
          %is_stride_equal_to_1 = arith.cmpi rq, %stride, %c1: index
          scf.if %is_stride_equal_to_1 {
            ; If the stride is 1, that mean it is the last reduction per block hence we do atomic reduction to global memory
            ; Also there is only atomic_fadd but there are more intrinsics in llvm.
            rock.buffer_store atomic_add %gtid_acc_red_val -> %arg2[%arg4, %arg5, %c0] {leftOobDims = [], rightOobDims = [1 : i32]} : f32 -> memref<2x12x1xf32>, index, index, index
            scf.yield
          }
          else {
            ; Stride is not equal to one -- we are still in partial reductions...
            ; Here needs something that can store to lds with bound checks
            rock.global_store %gtid_acc[%c0] -> %gemm_o_lds[%arg4, %arg5, %gtid] storeMethod( set) {leftOobDims = [], length = 1 : index, rightOobDims = [1 : i32]} : memref<1xf32, 5> -> memref<2x12xwavesPerBlockxf32>, index, index, index
            scf.yield
          }
           
          scf.yield
        } else {
          scf.yield
        }
      rock.yield
    } ; end of scf.while
 
   } ; end of reduction loop
    return
  } ; end of function

Implementation Plan

This is subjective to discussion with other teams (compiler/CK) if we can get data if we can be certain to skip
certain stages.

Why rock.reduce ?

It provides a stable interface into rock dialect for reductions that could stand as entry point to standalone lowering as well as fusions.
We dont need to maintain/support upstream components (i.e. TosaToVector)
It does not preclude us from adding/mapping support for similiar ops like rock.reduce by adding a conversion from them to rock.reduce
- I thought this was the reason that we have rock.gemm (and others) instead of being a backend to linalg/vector.

Alternatives for rock.reduce operator

vector.multi_reduction -> vector.contract

The semantics of the operator is similiar to rock.reduce except they work on Nd vectors instead of memrefs/tensors. In the lowering they get canonicalized to vector.contract which is similiar to a linalg.generic.

Pros :
- The operator exists; no need for a new operator.
Cons :
- AFAIK, vector.contract does not have general lowering; it simply has a special cased lowering where the it matches a matmul.
- We would need to TosaToVector to VectorToRock conversion to get to coordinate transforms for the standalone case.
- We would have to deal with vectors/tensors together which sounds messy to me.
linalg.generic version of a reduction

An instance of LA generic could be used to represent a reduction operation at a lower granularity using operator space affine accesses.

Pros :
- We could define a pattern to recognize the axis-wide reduction; no need for a new operator
Cons :
- Feels like a "raising" rather than a lowering here because we just recover information that was present in tosa.reduce_x operator. I see the same problem as lowering/collapsing of reshape/transforms and trying to recover them rock.transforms later.
- linalg only has linalg to spirv as lowering for GPU targets and does not support reductions in the threadspace fully there (AFAIK, limited to reductions within a block). Also, I m not sure whether we want to go down spirv path.
- Due to above reason, we would need 'some' conversion to threadSpace -> tensorSpace; thus would need something like linalg to rock.

Stage 1

Standalone

func.func @test_reduce_sum(%arg0: tensor<2x12x12xf32>) -> tensor<2x12x1xf32> attributes {kernel, arch = ""} {
  %0 = "tosa.reduce_sum"(%arg0) {axis = 2 : i64} : (tensor<2x12x12xf32>) -> tensor<2x12x1xf32>
  return %0 : tensor<2x12x1xf32>
}

| tosa-to-rock

V

// -----// IR Dump After TosaToRockPass (tosa-to-rock) //----- //
func.func @test_fusion(%arg0: tensor<2x12x12xf32>) -> tensor<2x12x1xf32> attributes {arch = "", kernel} {
  %0 = bufferization.alloc_tensor() : tensor<2x12x1xf32>
  %1 = rock.reduce  sum(%arg0, %0) {axis = 2 : index, blockSize = 256 : i32, gridSize = 2 : i32} : (tensor<2x12x12xf32>, tensor<2x12x1xf32>) -> tensor<2x12x1xf32>
  return %1 : tensor<2x12x1xf32>
}

| Bufferization

V

module {
  func.func @test_fusion(%arg0: memref<2x12x12xf32>, %arg1: memref<2x12x1xf32>) attributes {arch = "", kernel} {
    rock.reduce  sum(%arg0, %arg1) {axis = 2 : index, blockSize = 256 : i32, gridSize = 2 : i32} : (memref<2x12x12xf32>, memref<2x12x1xf32>)
    return
  }
}

| rock-lower-reduce

V

#transform_map0 = #rock.transform_map<affine_map<(d0, d1, d2) -> (d0 * 256 + d1 + d2)> by [<Unmerge{2, 256, 1} ["bid", "tid", "iter"] at [0, 1, 2] -> ["flatDim"] at [0]>] bounds = [2, 256, 1] -> [512]>
#transform_map1 = #rock.transform_map<affine_map<(d0) -> (d0)> by [<Pad{0, 224} ["flatDim"] at [0] -> ["flatDim"] at [0]>] bounds = [512] -> [288]>
#transform_map2 = #rock.transform_map<affine_map<(d0) -> (d0 floordiv 144, (d0 mod 144) floordiv 12, d0 mod 12)> by [<Merge{2, 12, 12} ["flatDim"] at [0] -> ["m0", "m1", "m2"] at [0, 1, 2]>] bounds = [288] -> [2, 12, 12]>
module {
  func.func @test_fusion(%arg0: memref<2x12x12xf32>, %arg1: memref<2x12x1xf32>) attributes {arch = "", kernel} {
    %0 = rock.workgroup_id : index
    %1 = rock.workitem_id : index
    %c0 = arith.constant 0 : index
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg2, %arg3, %arg4) = [#transform_map0, #transform_map1, #transform_map2](%0, %1, %c0) bounds [1, 1, 1] strides [1, 1, 1] {
      %2 = rock.global_load %arg0[%arg2, %arg3, %arg4] {leftOobDims = [], rightOobDims = [0 : i32]} : memref<2x12x12xf32> -> f32
      rock.buffer_store  atomic_add %2 -> %arg1[%arg2, %arg3, %c0] {leftOobDims = [], rightOobDims = [0 : i32]} : f32 -> memref<2x12x1xf32>, index, index, index
      rock.yield
    }
    return
  }
}

Fusion

func.func @test_matmul_reduce_max(%arg0: tensor<12x12x32xf32>, %arg1: tensor<12x32x12xf32>) -> tensor<12x12x1xf32> attributes {kernel, arch = ""} {
    %0 = "tosa.matmul"(%arg0, %arg1) : (tensor<12x12x32xf32>, tensor<12x32x12xf32>) -> tensor<12x12x12xf32>
    %1 = "tosa.reduce_sum"(%0) {axis = 2 : i64} : (tensor<12x12x12xf32>) -> tensor<12x12x1xf32>
    return %1 : tensor<12x12x1xf32>
}

| Highlevel pipeline

V

module {
  func.func @test_matmul_reduce_max(%arg0: memref<12x12x32xf32>, %arg1: memref<12x32x12xf32>, %arg2: memref<12x12x1xf32>) attributes {arch = "", kernel} {
    %0 = memref.alloc() {alignment = 128 : i64} : memref<12x12x12xf32>
    rock.gemm %0 = %arg0 * %arg1 features =  none storeMethod =  set {arch = ""} : memref<12x12x12xf32> = memref<12x12x32xf32> * memref<12x32x12xf32>
    rock.reduce  sum(%0, %arg2) {axis = 2 : index, blockSize = 256 : i32, gridSize = 7 : i32} : (memref<12x12x12xf32>, memref<12x12x1xf32>)
    return
  }
}

| AlignTiling (rock-linalg-align)

V

Here we align/fuse the TransformingFor that would otherwise be present in the rock.reduce to respect the ThreadSpace to TensorSpace mapping and remove the loads as the value is already live.

func.func @test_matmul_reduce_max(%arg0: memref<2x12x32xf32>, %arg1: memref<2x32x12xf32>, %arg2: memref<2x12x1xf32>) attributes {arch = "", block_size = 64 : i32, grid_size = 2 : i32, kernel} {
    %cst = arith.constant dense<0.000000e+00> : vector<8xf32>
    %c0 = arith.constant 0 : index
    %cst_0 = arith.constant 0.000000e+00 : f32
 
    ;LDS to store inputs and its transforms
    %0 = rock.alloc() : memref<1024xf32, 3>
    %1 = rock.transform %0 by #transform_map0 : memref<1024xf32, 3> to memref<512xf32, 3>
    %2 = rock.transform %1 by #transform_map1 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
    %3 = rock.transform %0 by #transform_map2 : memref<1024xf32, 3> to memref<512xf32, 3>
    %4 = rock.transform %3 by #transform_map3 : memref<512xf32, 3> to memref<16x32x1xf32, 3>
 
    ; VGPR for thread outputs
    %5 = rock.alloc() : memref<16xf32, 5>
    %6 = rock.transform %5 by #transform_map4 : memref<16xf32, 5> to memref<4x4xf32, 5>
    rock.fill(%5, %cst_0) : memref<16xf32, 5>, f32
 
    %7 = rock.workgroup_id : index
    %8 = rock.workitem_id : index
 
    ...

    rock.lds_barrier
    rock.blockwise_gemm %6 += %2 * %4 {params = #general_gemm_params} : memref<4x4xf32, 5> += memref<16x32x1xf32, 3> * memref<16x32x1xf32, 3>
    rock.transforming_for {forceUnroll, useIndexDiffs} (%arg3) = [#transform_map21](%7, %8, %c0), (%arg4, %arg5, %arg6) = [#transform_map22, #transform_map23, #transform_map24, #transform_map25](%7, %8, %c0) bounds [1, 1, 16] strides [1, 1, 1] {
 
      ; loading the gemm output value to VGPR
      %11 = rock.alloc() : memref<1xf32, 5>
      %12 = rock.in_bounds_load %5[%arg3] : memref<16xf32, 5>, index -> f32
 
      ; We do the reduction to globals in atomics
      ; Also there is only atomic_add but there are more intrinsics in llvm.
      rock.buffer_store atomic_add %12 -> %arg2[%arg4, %arg5, %c0] {leftOobDims = [], rightOobDims = [1 : i32]} : f32 -> memref<2x12x1xf32>, index, index, index
      rock.yield
    }
    return
  }

Implement rock.reduce sum operator + lowering from tosa.reduce_sum --> rock.reduce sum
- https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/772
Implement fusion for tosa.reduce_sum
- https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/773
Implement atomic_max support for amdgpu, rock.buffer_store atomic_fmax
- https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/774
Implement rock.reduce max operator + lowering from tosa.reduce_max --> rock.reduce max
- https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/775
Implement fusion for tosa.reduce_max
- https://github.com/ROCmSoftwarePlatform/llvm-project-private/issues/776

Stage 2

Prototype/Investigate LDS reduction for scenarios where OOBs are not needed & Compare performance w.r.t Stage 1
If performant, add support for LDS rock.buffer_load / rock.buffer_store with software OOB checks.
Implement rock.reduce sum {use_lds=True} operator
Implement fusion support for rock.reduce sum {use_lds=True} operator
Implement rock.reduce max {use_lds=True} operator
Implement fusion support for rock.reduce max {use_lds=True} operator

Stage 3

Implement rock.dpp_reduce sum operator
- We might need to discuss & agree on the format, one approach is proposed above to use reduction stride pattern.
Implement rock.dpp_reduce max operator
[3.1] Implement rock.reduce sum {use_dpp=True, use_lds=False} operator
[3.1] Implement rock.reduce max {use_dpp=True, use_lds=False} operator
If LDS reduction is performant && wavesPerBlock is high generally, [3.2] Implement rock.reduce sum/max {use_dpp=True, use_lds=True} fusion support.
Compare performance of all stages across representative set of workloads.

Other Anticipated changes

[Small] rock-copy-opt

We would need to handle the that there is going to be linalg.fill to initialiaze reduction workspace.

[Small/Medium] rock-fold-transpose

I would like to call this something like "rock-fold-gemm-output" (Its not really relevant to the RFC :))

Now the gemm output buffer is going to be different to fused-kernel output and its not addressing reconfiguration.
So it should not fold the outputs in this case (well supporting the other cases where it should) and allow the GEMM
lowering to happen.
(May need few discussion along vectorization implications)

[Medium] Insert transposes post-gemm for tree reduction to allow better vectorization ?

This might be a new pass that we may want to add to transpose gemm output layout if we want reduction
to happen in vectors. I need to dig into vectorization bits understand more.

Alternatives

Do Stage 1 --> Stage 3.1 --> Stage 2 --> Stage 3.2 if we find wavesPerBlock is generally low.

jungpark-mlir · 2023-01-03T11:21:42Z

jungpark-mlir
Jan 3, 2023
Collaborator

Question for the case 1.1 is,
How do we determine the block size of the fused kernel? Does the reduction always fits into a blocksize for the gemm?

18 replies

krzysz00 Jan 3, 2023
Maintainer

I'm suggesting the very naive strategy of having everyone fire off atomic writes to the final location in global memory and letting the hardware sort it out.

And I'll note that we can skip the LDS step by having the accumulation be across waves and then to global. That is, we use DPP to compute the log2(threads_per_waves) bits - or maybe even things like log4 or other clevelness - and then a designated thread per wave does a global write

krzysz00 Jan 3, 2023
Maintainer

One thing I'd suggest we do is ask around other teams - possibly the compiler folks might know - for how reductions are typically implemented. There may be hardware support we can exploit.

manupak Jan 3, 2023
Collaborator Author

Yea its about how much of hierarchy we want to exploit when we map the reduction tree.

I agree with DPP in waves -- but thought more as an improvement on LDS but not entirely skipping it.
Skipping LDS bits, would mean that there would be atomics reductions per wave as opposed to per block.

I m sure CK folks should know this -- so I ll try and reach out.

jerryyin Jan 3, 2023
Maintainer

We allocate a buffer on LDS per block (to the amount of threads in the block)

Can we reuse the already allocated LDS since they are done at this point? I'm asking because the LDS for some populated gemm kernels are already tight.

then each block will do partial reductions per block and store the partial reductions to global memory

I think a global partial reductions allocation is probably un-avoidable. Just can't see how we can finish reduction without it.

then for last stage we do a tail of reductions of all blocks

Do we have __threadfence() like instructions so that after this, a block could do the final step without having a launch another tail of reduction on partial result?

manupak Jan 4, 2023
Collaborator Author

Can we reuse the already allocated LDS since they are done at this point? I'm asking because the LDS for some populated gemm kernels are already tight.

I think its valid a optimization but feels like general optimization bufferization pass should do to collapse allocates based on liveness. I ll have a look.

I think a global partial reductions allocation is probably un-avoidable. Just can't see how we can finish reduction without it.

Agreed but wanted to reduce as much as possible towards the later implementations. See the described stages now.

Do we have __threadfence() like instructions so that after this, a block could do the final step without having a launch another tail of reduction on partial result?

We could do this but the stage based approach just used atomic operations anyway due to needing to global syncs otherwise.

manupak · 2023-01-04T19:08:55Z

manupak
Jan 4, 2023
Collaborator Author

Hi All,

Thanks for the review! helped a lot in forming the v2 of this proposal now.

I've updated the discussion for staged approach to capture the points of discussion.
There are few unknowns that I need to dig but I think broader ideas are there.

0 replies

jerryyin · 2023-01-06T14:35:07Z

jerryyin
Jan 6, 2023
Maintainer

Stage 1:

I see addition base reduction only in the example. Could you define what operation we want to support in the plan as well? I want a crispy definition here because if there are operation beyond addition we want to support, we can likely have another developer to work on it in parallel as well.
Is the last transform map a common pattern in the fusion side? And did you develop that using an existing similar paradigm? I'm curious about how that's done so would appreciate if you can share me the command line to populate that example.
How would you alter the transform map for reduction in different axis?
Testing wise, how do you plan to do it? I'd suggest some plan about (between cpu and gpu):
- Test standalone reduction (for stage 2 and 3)
- Test reduction with fusion (for stage 1, 2 and 3)

Stage 2:

I see lds allocated but not used in the sample code? I am assuming %gemm_o_lds_flattened would be used in successive access instead of global load/stores?

Stage 3:

stage 3.1 should go in parallel with entire process. Desirably there's a op for in-warp reduction using DPP and we have its independent unit test to verify its correctness. This doesn't need to depend on the execution of the reduction itself.

Misc:

I can imagine atomic add being particular hostile to performance when there's reduction that happen to carry on the longest axis (dimension of largest size). Any alternatives to atomic add?

2 replies

manupak Jan 6, 2023
Collaborator Author

Thanks @jerryyin for taking a look!

Could you define what operation we want to support in the plan as well? I want a crispy definition here because if there are operation beyond addition we want to support, we can likely have another developer to work on it in parallel as well.

I have roughly said this in the implementation plan. for BERTs, we would need MAX in addition to ADD.
Yes it would be great to get it hooked atomic_fmax hooked up in amdgpu and rock dialects.

However beyond that, given the exploratory nature of this task (via experimentation or chats with other teams), I would suggest we could hold it up until we know what works for us in terms of the stages -- and just implement that version.

Is the last transform map a common pattern in the fusion side? And did you develop that using an existing similar paradigm? I'm curious about how that's done so would appreciate if you can share me the command line to populate that example.

Yes, those are created in AlignTiling (a.k.a rock-linalg-align) pass.
E.g. : https://github.com/ROCmSoftwarePlatform/rocMLIR/blob/develop/mlir/test/fusion/tosa-to-rock-tp-add-tp.mlir

How would you alter the transform map for reduction in different axis?

We dont actually. In stage 1, It is just doing atomic reduction (add,max) to in the output coordinates just omitting (really it would ....x1x... in the output so we just put %c0) the axis that is being reduced.

Testing wise, how do you plan to do it? I'd suggest some plan about (between cpu and gpu):

Actually I was planning to do all three versions standalone as well as fused.

I see lds allocated but not used in the sample code? I am assuming %gemm_o_lds_flattened would be used in successive access instead of global load/stores?

oops a bug :). fixed now. It is being used immediately create appropriate views.

stage 3.1 should go in parallel with entire process. Desirably there's a op for in-warp reduction using DPP and we have its independent unit test to verify its correctness. This doesn't need to depend on the execution of the reduction itself.

Agreed. It can go in parallel.
I've also proposed alternate implementation process where Stage 3.1 is attempted soon after Stage 1 -- It is just I felt higher confidence in getting stage 2 (since we have all the boiler plate for that) done but otherwise it doesn't need to be serialized.

I can imagine atomic add being particular hostile to performance when there's reduction that happen to carry on the longest axis (dimension of largest size). Any alternatives to atomic add?

I concur.

Hence with all the tricks (Stage 3.2) it should just be a single atomic add per block.
To outline sync costs : it would log2(#waves/#blocks) lds syncs + #blocks atomics.

jerryyin Jan 10, 2023
Maintainer

so we just put %c0) the axis that is being reduced

Hah, that's right, thanks for reminding me about it.

Hence with all the tricks (Stage 3.2) it should just be a single atomic add per block.

Just for understanding: You mean the use of lds as a media to aggregate the intermediate result. Then only when it is last step that you will atomic write the lds result to global? This is versus in 3.1 which, every wave reduction requires an atomic?

After your explanations this proposal LGTM. Sounds like an actionable plan. Please feel free to share if you obtained more information from CK, we should be able to work on some validation part of this for now. Thanks for your work!

manupak · 2024-10-14T14:18:56Z

manupak
Oct 14, 2024
Collaborator Author

Stage 2 will be complete with #1668

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] Reduction fusion support for GPU-gemm like kernels #925

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 20 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

[RFC] Reduction fusion support for GPU-gemm like kernels #925

manupak Dec 23, 2022 Collaborator

[RFC] Reduction fusion support for GPU-gemm like kernels

Introduction

Terminology

Scope

R1. Sliding window type reductions [Out of Scope]

R2. Axes-wide reductions [In Scope]

Proposed Methodology

Why not just work on the fusion straight away ?

Stage 1 : Directly just doing reductions to using atomics to global memory

Stage 2 : Using LDS for the reduction in blocks and doing per-block atomic reductions

Prerequisties :

Stage 3 : Do per-wave DPP based reductions -> store per-wave partial reduction them to LDS -> do atomic reductions to global memory per block

Prerequisties :

Case 3.1: If wavesPerBlock is low

Case 3.2: If wavesPerBlock is high:

Implementation Plan

Why rock.reduce ?

Alternatives for rock.reduce operator

Stage 1

Standalone

Fusion

Stage 2

Stage 3

Other Anticipated changes

[Small] rock-copy-opt

[Small/Medium] rock-fold-transpose

[Medium] Insert transposes post-gemm for tree reduction to allow better vectorization ?

Alternatives

Replies: 4 comments · 20 replies

jungpark-mlir Jan 3, 2023 Collaborator

krzysz00 Jan 3, 2023 Maintainer

krzysz00 Jan 3, 2023 Maintainer

manupak Jan 3, 2023 Collaborator Author

jerryyin Jan 3, 2023 Maintainer

manupak Jan 4, 2023 Collaborator Author

manupak Jan 4, 2023 Collaborator Author

jerryyin Jan 6, 2023 Maintainer

manupak Jan 6, 2023 Collaborator Author

jerryyin Jan 10, 2023 Maintainer

manupak Oct 14, 2024 Collaborator Author

manupak
Dec 23, 2022
Collaborator

Replies: 4 comments 20 replies

jungpark-mlir
Jan 3, 2023
Collaborator

krzysz00 Jan 3, 2023
Maintainer

krzysz00 Jan 3, 2023
Maintainer

manupak Jan 3, 2023
Collaborator Author

jerryyin Jan 3, 2023
Maintainer

manupak Jan 4, 2023
Collaborator Author

manupak
Jan 4, 2023
Collaborator Author

jerryyin
Jan 6, 2023
Maintainer

manupak Jan 6, 2023
Collaborator Author

jerryyin Jan 10, 2023
Maintainer

manupak
Oct 14, 2024
Collaborator Author