Replies: 5 comments 8 replies
-
Re blockwise fill, we can probably adapt some of the utility kernels code for the memset kernel, probably as a more general op. But yeah, I think having this as a kernel could work just fine. [off topic] One potential thought is whether we can extend the accelerator gemm setup to cover …
-
One thing I'd want to keep an eye on is opportunities to reuse buffers.
-
I've read through and updated with a more realistic example lowering. I hope this is clearer.
-
GEMM-SOFTMAX-GEMM is only part of the FlashAttention optimization; do we have a plan to tackle the other algorithms, especially those that can help with memory efficiency?
-
Added a section for updates/comparison against FA2. @zhanglx13 @sjw36 @sunway513 @krzysz00
-
This design discussion is about how to implement gemm-softmax-gemm fusion in rocMLIR.
It is based on the algorithm presented in the paper known as "FlashAttention": https://arxiv.org/pdf/2205.14135.pdf
Problem definition
The attention subgraph found in most transformer models can be expressed as follows:
O = ATTENTION(Q, K, V)

Let $$Q, K, V \in R^{N \times d}$$; then

$$O = \text{SOFTMAX}(QK^T)\,V$$

where SOFTMAX is

$$\text{SOFTMAX}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

Note: SOFTMAX is conducted along the row axis.
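For concreteness, this is what the unfused computation looks like in NumPy terms (a sketch only; shapes follow the definitions above):

```python
import numpy as np

def attention_reference(Q, K, V):
    """Unfused reference: GEMM -> row-wise SOFTMAX -> GEMM.

    Q, K, V are (N, d) arrays, matching the definitions above.
    """
    S = Q @ K.T                              # first GEMM, (N, N)
    S = S - S.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    P = np.exp(S)
    P = P / P.sum(axis=-1, keepdims=True)    # row-wise SOFTMAX
    return P @ V                             # second GEMM, (N, d)
```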
So to do this in a single fused kernel, we need to figure out a way to tile the whole computation.
Solution: FlashAttention
The paper describes a solution that roughly translates to the following pseudo code:
DISCLAIMER: this could be error prone as I decoded it in a day, so a review is more than welcome here.
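In NumPy terms, the tiling roughly looks like the sketch below (a restatement of the paper's Algorithm 1, not the actual kernel code; the Br/Bc block sizes and the assumption that N divides them evenly are simplifications of mine):

```python
import numpy as np

def flash_attention_v1(Q, K, V, Br=64, Bc=64):
    """Tiled GEMM-softmax-GEMM in the style of FlashAttention (v1).

    O, l, m are the running output, row sums and row maxima; every K/V
    column block re-reads and re-writes them for the row blocks it touches.
    Assumes N is divisible by Br and Bc to keep the sketch short.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    l = np.zeros((N, 1))                     # running softmax row sums
    m = np.full((N, 1), -np.inf)             # running softmax row maxima

    for j in range(0, N, Bc):                # outer loop over K/V column blocks
        Kj, Vj = K[j:j + Bc], V[j:j + Bc]
        for i in range(0, N, Br):            # inner loop over Q row blocks
            Sij = Q[i:i + Br] @ Kj.T                          # (Br, Bc) score tile
            mij = Sij.max(axis=-1, keepdims=True)
            Pij = np.exp(Sij - mij)
            lij = Pij.sum(axis=-1, keepdims=True)

            mi, li = m[i:i + Br], l[i:i + Br]
            m_new = np.maximum(mi, mij)
            l_new = np.exp(mi - m_new) * li + np.exp(mij - m_new) * lij

            # Correct the previously accumulated output, add the new block,
            # and renormalize by the updated row sum (v1 does this every time).
            O[i:i + Br] = (li * np.exp(mi - m_new) * O[i:i + Br]
                           + np.exp(mij - m_new) * (Pij @ Vj)) / l_new
            m[i:i + Br], l[i:i + Br] = m_new, l_new
    return O
```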
Restrictions
O, l, m are read and written by every block in a row of blocks, i.e. every K/V column block re-reads and re-writes the same Q row block's running O, l, m.

Short-term solution
CK more or less uses the same algorithm for the compute.
Basically, we introduce a rock.gridwise_attention op that lowers roughly to the following pseudo-IR:
Updates from Flash Attention 2
Paper: https://tridao.me/publications/flash2/flash2.pdf
There are 3 main changes in FlashAttention-2:
Algorithm change
In v1, the partial output O is fully renormalized on every inner iteration: it is rescaled by the running row-max correction factor and divided by the updated row sum each time a new block is accumulated.
NOTE: softmax normalization is an elementwise subtraction of the row max followed by a division by the row sum.
In v2, only the row-max correction factor is applied inside the loop; the division by the row sum is deferred to a single rescaling after the last block has been accumulated.
As a summary, in v2 the number of ops needed for the correction is reduced.
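A minimal sketch of that difference, per Q row block, using the same notation as above (my restatement, not the kernel code):

```python
import numpy as np

def attention_row_block_v2(Qi, K, V, Bc=64):
    """v2-style inner loop for one Q row block: only the exp(m_old - m_new)
    correction is applied per iteration; the division by the row sum l is
    done once after the last K/V block (contrast with the v1 sketch above,
    which multiplies by l_old and divides by l_new on every iteration)."""
    Br, d = Qi.shape
    N = K.shape[0]
    Oi = np.zeros((Br, d))
    li = np.zeros((Br, 1))
    mi = np.full((Br, 1), -np.inf)

    for j in range(0, N, Bc):                        # loop over K/V column blocks
        Sij = Qi @ K[j:j + Bc].T
        m_new = np.maximum(mi, Sij.max(axis=-1, keepdims=True))
        Pij = np.exp(Sij - m_new)
        corr = np.exp(mi - m_new)                    # the only per-iteration correction
        li = corr * li + Pij.sum(axis=-1, keepdims=True)
        Oi = corr * Oi + Pij @ V[j:j + Bc]
        mi = m_new

    return Oi / li                                   # single normalization at the end
```

Note that this sketch already reflects the loop-order swap described next: each Q row block owns its own loop over the K/V column blocks.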
Parallelism change
"These ideas of swapping the order of the loop (outer loop over row blocks and inner loop over column blocks, instead of the other way round in the original FlashAttention paper), as well as parallelizing over the sequence length dimension were first suggested and implemented by Phil Tillet in the Triton [17] implementation." from the paper
Wave partitioning change
I think this is how CK does it anyway, i.e. splitting Q across warps such that the QK^T rows stay within a warp for iteration / reductions.
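Illustratively (the names and the contiguous-slice layout below are assumptions of this sketch, not the actual CK/rocMLIR mapping):

```python
def q_rows_owned_by_wave(wave_id, waves_per_block, block_m):
    """Each wave owns a contiguous slice of the workgroup's Q-tile rows, so
    the row-wise max/sum reductions over its QK^T rows never cross waves
    and need no LDS round-trip."""
    rows_per_wave = block_m // waves_per_block
    start = wave_id * rows_per_wave
    return range(start, start + rows_per_wave)
```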
Features required
Long-term ideas for generality
This is basically a mix of two main features:

- Fusion of softmax into a preceding gemm
- Fusion of a gemm into a preceding gemm (that may itself be fused with softmax/element-wise ops)
Short-term solution: high-level task breakdown

- New OP - blockwise_reduction
- New OP - blockwise_fill (a sketch of the intended semantics of both new blockwise ops follows this list)
- New OP - gridwise.attention op and its lowering (explained as an example above)
- [Optional] IMPROVE - blockwise_* LDS->Reg ops.
  NOTE: Add a member function to obtain a virtual memref type with an annotation of the per-thread sub-view of the output.
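A rough sketch of the semantics intended for the two new blockwise ops, expressed in NumPy terms (the real ops work on LDS/register tiles and their exact signatures may differ):

```python
import numpy as np

def blockwise_fill(tile_shape, value, dtype=np.float32):
    """blockwise_fill: materialize a workgroup tile holding a constant,
    e.g. 0 for the O/l accumulators or -inf for the running row max."""
    return np.full(tile_shape, value, dtype=dtype)

def blockwise_reduction(score_tile, kind):
    """blockwise_reduction: per-row reduction over a (Br, Bc) score tile,
    producing the softmax statistics (row max or row sum)."""
    if kind == "max":
        return score_tile.max(axis=-1, keepdims=True)
    if kind == "sum":
        return score_tile.sum(axis=-1, keepdims=True)
    raise ValueError(f"unsupported reduction kind: {kind}")
```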