[midend] Optimization on matmul-transpose-b vectorization #465

Open · wants to merge 1 commit into main
Conversation

@JuniMay (Contributor) commented Feb 21, 2025

The current vectorization pass for matmul-transpose-b reduces the product vector in every iteration of the inner loop and accumulates the scalar result into the output element. This commit changes it to an elementwise addition onto a vector accumulator and performs the reduction once after the inner loop, with reassoc enabled.
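
To make the change concrete, here is a scalar-equivalent C sketch of the two accumulation strategies for a single output element. The names dot_current and dot_opt0 are illustrative, a and b point at the relevant rows of A and B, n is assumed to be a multiple of 32, and the masked tail handling and the pre-existing value of C[i][j] are omitted; the pass itself emits the MLIR shown further below, not C.

  // Current pass: reduce the 32-lane product to a scalar in every iteration
  // of the inner loop and accumulate that scalar into the output element.
  float dot_current(const float *a, const float *b, int n) {
      float c = 0.0f;
      for (int k = 0; k < n; k += 32) {
          float partial = 0.0f;
          for (int l = 0; l < 32; ++l)        // vector.reduction <add> per step
              partial += a[k + l] * b[k + l];
          c += partial;                        // scalar accumulate into C[i][j]
      }
      return c;
  }

  // After this commit: keep a 32-lane accumulator alive across the inner loop
  // (the iter_args value in the MLIR) and reduce it once after the loop.
  float dot_opt0(const float *a, const float *b, int n) {
      float acc[32] = {0.0f};
      for (int k = 0; k < n; k += 32)
          for (int l = 0; l < 32; ++l)         // elementwise multiply-add
              acc[l] += a[k + l] * b[k + l];
      float c = 0.0f;
      for (int l = 0; l < 32; ++l)             // single reduction after the loop;
          c += acc[l];                         // reassoc lets the compiler reorder it
      return c;
  }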

This leads to a ~4x speedup on musepi:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchMatmulTransposeB/matmul                   334 ms          334 ms            2
benchMatmulTransposeB/matmul_vectorized        707 ms          707 ms            1
benchMatmulTransposeB/matmul_opt0              174 ms          174 ms            4
------------------------------------------------
MatmulTransposeB-vectorized PASS
MatmulTransposeB-opt0 PASS
------------------------------------------------

With the -mllvm --riscv-v-vector-bits-min=256 option enabled, the speedup is ~3x:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchMatmulTransposeB/matmul                   333 ms          333 ms            2
benchMatmulTransposeB/matmul_vectorized        325 ms          325 ms            2
benchMatmulTransposeB/matmul_opt0              108 ms          108 ms            6
------------------------------------------------
MatmulTransposeB-vectorized PASS
MatmulTransposeB-opt0 PASS
------------------------------------------------

And if reduction reassoc is enabled on the current vectorization result (i.e., only adding fastmath<reassoc> to the existing per-iteration vector.reduction), there is still a speedup:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchMatmulTransposeB/matmul                   334 ms          333 ms            2
benchMatmulTransposeB/matmul_vectorized        155 ms          155 ms            5
benchMatmulTransposeB/matmul_opt0              108 ms          108 ms            6
------------------------------------------------
MatmulTransposeB-vectorized PASS
MatmulTransposeB-opt0 PASS
------------------------------------------------

The vectorized code after this commit, compared to the current version:

  // current vectorization result
  func.func @matmul_transpose_b_kernel_vectorized(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.splat %cst : vector<32xf32>
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_0 = memref.dim %arg1, %c0 : memref<?x?xf32>
    %dim_1 = memref.dim %arg1, %c1 : memref<?x?xf32>
    affine.for %arg3 = #map(%c0) to #map(%dim) {
      affine.for %arg4 = #map(%c0) to #map(%dim_0) {
        affine.for %arg5 = #map(%c0) to #map1(%dim_1) {
          %1 = arith.muli %arg5, %c32 : index
          %2 = arith.subi %dim_1, %1 : index
          %3 = arith.cmpi sge, %2, %c32 : index
          scf.if %3 {
            %4 = affine.vector_load %arg0[%arg3, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %5 = affine.vector_load %arg1[%arg4, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %6 = arith.mulf %4, %5 : vector<32xf32>
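            // reduce the lane products to a scalar and accumulate it into the output element on every iteration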
            %7 = vector.reduction <add>, %6 : vector<32xf32> into f32
            %8 = memref.load %arg2[%arg3, %arg4] : memref<?x?xf32>
            %9 = arith.addf %7, %8 : f32
            memref.store %9, %arg2[%arg3, %arg4] : memref<?x?xf32>
          } else {
            %4 = affine.vector_load %arg0[%arg3, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %5 = vector.create_mask %2 : vector<32xi1>
            %6 = arith.muli %arg5, %c32 : index
            %7 = vector.maskedload %arg0[%arg3, %6], %5, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %8 = vector.maskedload %arg1[%arg4, %6], %5, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %9 = arith.mulf %7, %8 : vector<32xf32>
            %10 = vector.reduction <add>, %9 : vector<32xf32> into f32
            %11 = memref.load %arg2[%arg3, %arg4] : memref<?x?xf32>
            %12 = arith.addf %10, %11 : f32
            memref.store %12, %arg2[%arg3, %arg4] : memref<?x?xf32>
          }
        }
      }
    }
    return
  }

  // after this commit
  func.func @matmul_transpose_b_kernel_opt0(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.splat %cst : vector<32xf32>
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_0 = memref.dim %arg1, %c0 : memref<?x?xf32>
    %dim_1 = memref.dim %arg1, %c1 : memref<?x?xf32>
    affine.for %arg3 = #map(%c0) to #map(%dim) {
      affine.for %arg4 = #map(%c0) to #map(%dim_0) {
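        // carry a vector accumulator across the inner loop via iter_args instead of reducing in every iteration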
        %1 = affine.for %arg5 = #map(%c0) to #map1(%dim_1) iter_args(%arg6 = %0) -> (vector<32xf32>) {
          %4 = arith.muli %arg5, %c32 : index
          %5 = arith.subi %dim_1, %4 : index
          %6 = arith.cmpi sge, %5, %c32 : index
          %7 = scf.if %6 -> (vector<32xf32>) {
            %8 = affine.vector_load %arg0[%arg3, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %9 = affine.vector_load %arg1[%arg4, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %10 = arith.mulf %8, %9 : vector<32xf32>
            %11 = arith.addf %arg6, %10 : vector<32xf32>
            scf.yield %11 : vector<32xf32>
          } else {
            %8 = vector.create_mask %5 : vector<32xi1>
            %9 = arith.muli %arg5, %c32 : index
            %10 = vector.maskedload %arg0[%arg3, %9], %8, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %11 = vector.maskedload %arg1[%arg4, %9], %8, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %12 = arith.mulf %10, %11 : vector<32xf32>
            %13 = arith.addf %arg6, %12 : vector<32xf32>
            scf.yield %13 : vector<32xf32>
          }
          affine.yield %7 : vector<32xf32>
        }
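        // after the inner loop: a single reduction with C[i][j] as the accumulator, with reassoc enabled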
        %2 = memref.load %arg2[%arg3, %arg4] : memref<?x?xf32>
        %3 = vector.reduction <add>, %1, %2 fastmath<reassoc> : vector<32xf32> into f32
        memref.store %3, %arg2[%arg3, %arg4] : memref<?x?xf32>
      }
    }
    return
  }

A simple test case is also added under the tests directory.
