[midend] Optimization on matmul-transpose-b vectorization #465

Open · wants to merge 1 commit into main
Conversation

@JuniMay (Contributor) commented Feb 21, 2025

The current vectorization pass for matmul-transpose-b reduces the product vector in every iteration of the inner loop and accumulates the scalar result into the output element. This commit changes it to an elementwise addition onto a vector accumulator and performs the reduction once after the inner loop, with reassoc enabled.
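
To make the change concrete, here is a scalar-equivalent C sketch of the two accumulation strategies for a single output element. The names dot_current and dot_opt0 are illustrative, a and b point at the relevant rows of A and B, n is assumed to be a multiple of 32, and the masked tail handling and the pre-existing value of C[i][j] are omitted; the pass itself emits the MLIR shown further below, not C.

  // Current pass: reduce the 32-lane product to a scalar in every iteration
  // of the inner loop and accumulate that scalar into the output element.
  float dot_current(const float *a, const float *b, int n) {
      float c = 0.0f;
      for (int k = 0; k < n; k += 32) {
          float partial = 0.0f;
          for (int l = 0; l < 32; ++l)        // vector.reduction <add> per step
              partial += a[k + l] * b[k + l];
          c += partial;                        // scalar accumulate into C[i][j]
      }
      return c;
  }

  // After this commit: keep a 32-lane accumulator alive across the inner loop
  // (the iter_args value in the MLIR) and reduce it once after the loop.
  float dot_opt0(const float *a, const float *b, int n) {
      float acc[32] = {0.0f};
      for (int k = 0; k < n; k += 32)
          for (int l = 0; l < 32; ++l)         // elementwise multiply-add
              acc[l] += a[k + l] * b[k + l];
      float c = 0.0f;
      for (int l = 0; l < 32; ++l)             // single reduction after the loop;
          c += acc[l];                         // reassoc lets the compiler reorder it
      return c;
  }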

This leads to a ~4x speedup on musepi:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchMatmulTransposeB/matmul                   334 ms          334 ms            2
benchMatmulTransposeB/matmul_vectorized        707 ms          707 ms            1
benchMatmulTransposeB/matmul_opt0              174 ms          174 ms            4
------------------------------------------------
MatmulTransposeB-vectorized PASS
MatmulTransposeB-opt0 PASS
------------------------------------------------

With the -mllvm --riscv-v-vector-bits-min=256 option enabled, the speedup is ~3x:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchMatmulTransposeB/matmul                   333 ms          333 ms            2
benchMatmulTransposeB/matmul_vectorized        325 ms          325 ms            2
benchMatmulTransposeB/matmul_opt0              108 ms          108 ms            6
------------------------------------------------
MatmulTransposeB-vectorized PASS
MatmulTransposeB-opt0 PASS
------------------------------------------------

And if reduction reassoc is enabled on the current vectorization result (i.e., only adding fastmath<reassoc> to the existing per-iteration vector.reduction), there is still a speedup:

----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
benchMatmulTransposeB/matmul                   334 ms          333 ms            2
benchMatmulTransposeB/matmul_vectorized        155 ms          155 ms            5
benchMatmulTransposeB/matmul_opt0              108 ms          108 ms            6
------------------------------------------------
MatmulTransposeB-vectorized PASS
MatmulTransposeB-opt0 PASS
------------------------------------------------

The vectorized code after this commit, compared to the current version:

  // current vectorization result
  func.func @matmul_transpose_b_kernel_vectorized(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.splat %cst : vector<32xf32>
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_0 = memref.dim %arg1, %c0 : memref<?x?xf32>
    %dim_1 = memref.dim %arg1, %c1 : memref<?x?xf32>
    affine.for %arg3 = #map(%c0) to #map(%dim) {
      affine.for %arg4 = #map(%c0) to #map(%dim_0) {
        affine.for %arg5 = #map(%c0) to #map1(%dim_1) {
          %1 = arith.muli %arg5, %c32 : index
          %2 = arith.subi %dim_1, %1 : index
          %3 = arith.cmpi sge, %2, %c32 : index
          scf.if %3 {
            %4 = affine.vector_load %arg0[%arg3, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %5 = affine.vector_load %arg1[%arg4, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %6 = arith.mulf %4, %5 : vector<32xf32>
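            // reduce the lane products to a scalar and accumulate it into the output element on every iteration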
            %7 = vector.reduction <add>, %6 : vector<32xf32> into f32
            %8 = memref.load %arg2[%arg3, %arg4] : memref<?x?xf32>
            %9 = arith.addf %7, %8 : f32
            memref.store %9, %arg2[%arg3, %arg4] : memref<?x?xf32>
          } else {
            %4 = affine.vector_load %arg0[%arg3, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %5 = vector.create_mask %2 : vector<32xi1>
            %6 = arith.muli %arg5, %c32 : index
            %7 = vector.maskedload %arg0[%arg3, %6], %5, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %8 = vector.maskedload %arg1[%arg4, %6], %5, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %9 = arith.mulf %7, %8 : vector<32xf32>
            %10 = vector.reduction <add>, %9 : vector<32xf32> into f32
            %11 = memref.load %arg2[%arg3, %arg4] : memref<?x?xf32>
            %12 = arith.addf %10, %11 : f32
            memref.store %12, %arg2[%arg3, %arg4] : memref<?x?xf32>
          }
        }
      }
    }
    return
  }

  // after this commit
  func.func @matmul_transpose_b_kernel_opt0(%arg0: memref<?x?xf32>, %arg1: memref<?x?xf32>, %arg2: memref<?x?xf32>) {
    %c0 = arith.constant 0 : index
    %c1 = arith.constant 1 : index
    %c32 = arith.constant 32 : index
    %cst = arith.constant 0.000000e+00 : f32
    %0 = vector.splat %cst : vector<32xf32>
    %dim = memref.dim %arg0, %c0 : memref<?x?xf32>
    %dim_0 = memref.dim %arg1, %c0 : memref<?x?xf32>
    %dim_1 = memref.dim %arg1, %c1 : memref<?x?xf32>
    affine.for %arg3 = #map(%c0) to #map(%dim) {
      affine.for %arg4 = #map(%c0) to #map(%dim_0) {
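        // carry a vector accumulator across the inner loop via iter_args instead of reducing in every iteration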
        %1 = affine.for %arg5 = #map(%c0) to #map1(%dim_1) iter_args(%arg6 = %0) -> (vector<32xf32>) {
          %4 = arith.muli %arg5, %c32 : index
          %5 = arith.subi %dim_1, %4 : index
          %6 = arith.cmpi sge, %5, %c32 : index
          %7 = scf.if %6 -> (vector<32xf32>) {
            %8 = affine.vector_load %arg0[%arg3, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %9 = affine.vector_load %arg1[%arg4, %arg5 * 32] : memref<?x?xf32>, vector<32xf32>
            %10 = arith.mulf %8, %9 : vector<32xf32>
            %11 = arith.addf %arg6, %10 : vector<32xf32>
            scf.yield %11 : vector<32xf32>
          } else {
            %8 = vector.create_mask %5 : vector<32xi1>
            %9 = arith.muli %arg5, %c32 : index
            %10 = vector.maskedload %arg0[%arg3, %9], %8, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %11 = vector.maskedload %arg1[%arg4, %9], %8, %0 : memref<?x?xf32>, vector<32xi1>, vector<32xf32> into vector<32xf32>
            %12 = arith.mulf %10, %11 : vector<32xf32>
            %13 = arith.addf %arg6, %12 : vector<32xf32>
            scf.yield %13 : vector<32xf32>
          }
          affine.yield %7 : vector<32xf32>
        }
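        // after the inner loop: a single reduction with C[i][j] as the accumulator, with reassoc enabled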
        %2 = memref.load %arg2[%arg3, %arg4] : memref<?x?xf32>
        %3 = vector.reduction <add>, %1, %2 fastmath<reassoc> : vector<32xf32> into f32
        memref.store %3, %arg2[%arg3, %arg4] : memref<?x?xf32>
      }
    }
    return
  }

A simple test case is also added under the tests directory.
