-
Question for the case 1.1 is,
-
Hi All, Thanks for the review! It helped a lot in forming v2 of this proposal. I've updated the discussion with the staged approach to capture the points of discussion.
-
Stage 1:
Stage 2:
Stage 3:
Misc:
-
Stage 2 will be complete with #1668
-
[RFC] Reduction fusion support for GPU gemm-like kernels
Introduction
This RFC proposes to fuse axes-wide reduction operations into gemm-like operations in rocMLIR.
A simple example would be as follows:
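The original example did not survive extraction; the following illustrative TOSA-level snippet shows the pattern this RFC targets (shapes and exact assembly syntax vary by MLIR version): a matmul whose output is immediately reduced along one axis.

```mlir
// A gemm-like op followed by an axes-wide reduction: the reduce_sum
// rereads the whole matmul output from memory unless it is fused.
func.func @gemm_then_reduce(%a: tensor<1x64x128xf32>, %b: tensor<1x128x256xf32>)
    -> tensor<1x1x256xf32> {
  %0 = tosa.matmul %a, %b
      : (tensor<1x64x128xf32>, tensor<1x128x256xf32>) -> tensor<1x64x256xf32>
  %1 = tosa.reduce_sum %0 {axis = 1 : i32}
      : (tensor<1x64x256xf32>) -> tensor<1x1x256xf32>
  return %1 : tensor<1x1x256xf32>
}
```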
Terminology
IFM : Input Feature Map
Scope
Broadly, there are two types of non-filter-related reduction operations in the ML space:
R1. Sliding window type reductions [Out of Scope]
Examples of this category are MaxPool- and AvgPool-like operations. These operations are similar to a convolution in that a window slides over the IFM; however, instead of taking a dot product of the moving window and the extracted window, a reduction is performed on the extracted window. Hence, these have semantics closer to being a peer of our gemm-like ops rather than a fusion candidate, and are not covered here.
R2. Axes-wide reductions [In Scope]
Examples of this category are reduce_sum- and reduce_max-like operations. These operations can be fused into the preceding gemm-like operation to save bandwidth and improve performance. Hence, these are in scope for this RFC.
Proposed Methodology
Initially, we will create a "rock.reduce" operator with versioning, gradually increasing the complexity while observing the gains.
Once we have the initial version, we will start working on fusing the folded/legalized rock.reduce with a preceding gemm-like op. A strawman of what the op could look like follows.
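The following is purely hypothetical syntax for the proposed op; the actual op definition, attribute names, and shapes would be settled during implementation.

```mlir
// Hypothetical rock.reduce syntax: sum-reduce axis 1 of a bufferized
// gemm output into a workspace buffer. All names here are illustrative.
rock.reduce sum %gemmOut into %reduced
    {axis = 1 : index}
    : memref<1x64x256xf32> into memref<1x1x256xf32>
```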
Why not just work on the fusion straight away?
A standalone operator makes it easier to experiment with varied implementations and reason about them, and it also serves as a baseline until the optimal approach is known.
Additionally, if there are orphan reductions (possibly preceded by unsupported non-rock ops), we can still run them on the GPU.
For the staged iterative development, I would propose three stages as follows:
Stage 1 : Directly doing reductions using atomics to global memory
This would be the easiest to develop and will stand as the baseline for the other stages.
The main idea behind this approach is that each thread loops over the reduction axis and atomically accumulates the values into the output in global memory.
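A minimal sketch of this idea, written against the upstream gpu/memref/scf dialects rather than the eventual rock lowering (the function name, buffer shapes, and one-column-per-thread mapping are all illustrative):

```mlir
gpu.module @kernels {
  // Each thread owns one output column and walks the reduction axis,
  // accumulating into global memory with atomics. This is the
  // bandwidth-heavy baseline that later stages improve on.
  gpu.func @reduce_sum_atomic(%in: memref<64x256xf32>,
                              %out: memref<256xf32>) kernel {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c64 = arith.constant 64 : index
    %col = gpu.thread_id x
    scf.for %r = %c0 to %c64 step %c1 {
      %v = memref.load %in[%r, %col] : memref<64x256xf32>
      // One global atomic per element of the reduction axis.
      %old = memref.atomic_rmw addf %v, %out[%col]
          : (f32, memref<256xf32>) -> f32
    }
    gpu.return
  }
}
```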
Stage 2 : Using LDS for the reduction in blocks and doing per-block atomic reductions
Prerequisites:
This is basically going to allow the threads to use LDS as a workspace to perform a parallel tree reduction.
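A sketch of the shape of Stage 2, using the upstream gpu.all_reduce op (which MLIR expands into a workgroup-level tree reduction) as a stand-in for the rock-generated code; the 1-D scalar reduction and all shapes are purely illustrative:

```mlir
gpu.module @kernels {
  // Each block tree-reduces its tile through LDS, then issues a single
  // global atomic per block instead of one per element.
  gpu.func @reduce_sum_lds(%in: memref<65536xf32>,
                           %out: memref<1xf32>) kernel {
    %tid = gpu.thread_id x
    %bid = gpu.block_id x
    %bdim = gpu.block_dim x
    %base = arith.muli %bid, %bdim : index
    %idx = arith.addi %base, %tid : index
    %v = memref.load %in[%idx] : memref<65536xf32>
    // Parallel tree reduction across the workgroup.
    %sum = gpu.all_reduce add %v {} : (f32) -> (f32)
    %c0 = arith.constant 0 : index
    %isLead = arith.cmpi eq, %tid, %c0 : index
    scf.if %isLead {
      %old = memref.atomic_rmw addf %sum, %out[%c0]
          : (f32, memref<1xf32>) -> f32
    }
    gpu.return
  }
}
```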
Stage 3 : Do per-wave DPP-based reductions -> store per-wave partial reductions to LDS -> do atomic reductions to global memory per block
Prerequisites:
Once we have this, we can decide to make the kernel generation conditionally emit the LDS bits based on wavesPerBlock; a sketch of the low-wave case follows the two cases below.
Case 3.1: If wavesPerBlock is low:
Case 3.2: If wavesPerBlock is high:
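For illustration, the low-wavesPerBlock case could skip the LDS stage entirely and look roughly like this; gpu.subgroup_reduce stands in for the DPP-based wave reduction we would emit on AMD hardware, and all names and shapes are illustrative:

```mlir
gpu.module @kernels {
  // Case 3.1 sketch: reduce within each wave (DPP on AMD GPUs), then
  // one global atomic per wave, with no LDS staging.
  gpu.func @reduce_sum_wave(%in: memref<65536xf32>,
                            %out: memref<1xf32>) kernel {
    %tid = gpu.thread_id x
    %bid = gpu.block_id x
    %bdim = gpu.block_dim x
    %base = arith.muli %bid, %bdim : index
    %idx = arith.addi %base, %tid : index
    %v = memref.load %in[%idx] : memref<65536xf32>
    // Cross-lane reduction within the wave.
    %wsum = gpu.subgroup_reduce add %v : (f32) -> f32
    %lane = gpu.lane_id
    %c0 = arith.constant 0 : index
    %isLane0 = arith.cmpi eq, %lane, %c0 : index
    scf.if %isLane0 {
      %old = memref.atomic_rmw addf %wsum, %out[%c0]
          : (f32, memref<1xf32>) -> f32
    }
    gpu.return
  }
}
```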
Implementation Plan
This is subject to discussion with other teams (compiler/CK): if we can get data that lets us be certain, we may skip certain stages.
Why rock.reduce?
Alternatives for rock.reduce operator
vector.multi_reduction -> vector.contract
The semantics of these operators are similar to rock.reduce, except that they work on nD vectors instead of memrefs/tensors. In the lowering, they get canonicalized to vector.contract, which is similar to a linalg.generic. An example follows.
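For example (illustrative shapes; the accumulator operand and exact assembly syntax depend on the MLIR version):

```mlir
func.func @vec_reduce(%v: vector<64x256xf32>,
                      %acc: vector<256xf32>) -> vector<256xf32> {
  // Sum-reduce dimension 0; canonicalization can turn this into
  // vector.contract.
  %r = vector.multi_reduction <add>, %v, %acc [0]
      : vector<64x256xf32> to vector<256xf32>
  return %r : vector<256xf32>
}
```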
Pros :
Cons :
linalg.generic version of a reduction
An instance of linalg.generic could be used to represent a reduction operation at a lower granularity, using affine indexing-map accesses. A sketch follows.
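For comparison, a column-sum as a linalg.generic might look like this (shapes illustrative; the iterator-type spelling varies across MLIR versions):

```mlir
#map_in  = affine_map<(d0, d1) -> (d0, d1)>
#map_out = affine_map<(d0, d1) -> (d1)>
func.func @la_reduce(%in: tensor<64x256xf32>,
                     %init: tensor<256xf32>) -> tensor<256xf32> {
  // d0 is the reduction axis; the output indexing map drops it.
  %r = linalg.generic
      {indexing_maps = [#map_in, #map_out],
       iterator_types = ["reduction", "parallel"]}
      ins(%in : tensor<64x256xf32>)
      outs(%init : tensor<256xf32>) {
  ^bb0(%a: f32, %acc: f32):
    %s = arith.addf %a, %acc : f32
    linalg.yield %s : f32
  } -> tensor<256xf32>
  return %r : tensor<256xf32>
}
```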
Pros :
Cons :
Stage 1
Standalone:
    tosa-to-rock
        |
        V
    Bufferization
        |
        V
    rock-lower-reduce

Fusion:
    High-level pipeline
        |
        V
    AlignTiling (rock-linalg-align)
Here we align/fuse the TransformingFor that would otherwise be present in the rock.reduce so that it respects the ThreadSpace-to-TensorSpace mapping, and we remove the loads, as the value is already live.
Stage 2
Stage 3
Other Anticipated changes
[Small] rock-copy-opt
We would need to handle the fact that there is going to be a linalg.fill to initialize the reduction workspace, roughly as sketched below.
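That is, the pass will see something like the following (workspace shape and function name are illustrative):

```mlir
func.func @init_workspace(%workspace: memref<256xf32>) {
  // Zero-initialize the reduction workspace before the kernel
  // accumulates into it.
  %zero = arith.constant 0.0 : f32
  linalg.fill ins(%zero : f32) outs(%workspace : memref<256xf32>)
  return
}
```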
[Small/Medium] rock-fold-transpose
I would like to call this something like "rock-fold-gemm-output" (it's not really relevant to the RFC :))
Now the gemm output buffer is going to be different from the fused-kernel output, and this is not just an addressing reconfiguration. So the pass should not fold the outputs in this case (while still supporting the other cases where it should) and should allow the GEMM lowering to happen.
(May need some discussion around vectorization implications.)
[Medium] Insert transposes post-gemm for tree reduction to allow better vectorization?
This might be a new pass that we may want to add to transpose the gemm output layout if we want the reduction to happen on vectors. I need to dig into the vectorization bits to understand more.
Alternatives