[Midend] Enhancements and Optimizations and [Examples] Added MLIRLinalg Examples for Various Optimization Options #384

Open
wants to merge 33 commits into base: main

Conversation

6somehow

[Midend] Enhancements and Optimizations

  • BatchMatMul:

    • Updated unroll, tile, and vectorization strategies for BatchMatMul operations.
    • Enhanced the SCF (Structured Control Flow) version of BatchMatMul to support further optimization work.
  • Conv2D NHWC-FHWC:

    • Improved vectorization and tiling techniques for Conv2D operations in NHWC-FHWC layout.
    • These updates provide a vectorization option and a tile+vectorization option for buddy-opt.
  • Depthwise Conv2D NHWC-HWC:

    • Introduced vectorization and tiling strategies for Depthwise Conv2D in the NHWC-HWC format, enabling more efficient execution.

[Examples] Added MLIRLinalg Examples for Various Optimization Options

New MLIRLinalg examples were added to demonstrate the usage of the new buddy-opt optimization options, including:

  1. conv-nhwc-fhwc-optimize: Conv2D NHWC-FHWC layout vectorization optimization.
  2. conv-nhwc-fhwc-tile-optimize: Conv2D NHWC-FHWC vectorization with tiling optimizations.
  3. depthwise-conv-nhwc-hwc-optimize: Depthwise Conv2D NHWC-HWC layout vectorization optimization.
  4. batchmatmul-tile-optimize: tiling and unrolling optimization for BatchMatMul.
  5. batchmatmul-scf-optimize: SCF-based optimization for BatchMatMul.
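
As a quick reference, the two Conv2D options can be invoked as follows (a sketch mirroring the Makefile targets quoted later in this conversation; the input file and output path are placeholders, and buddy-opt stands for the built ${BUDDY_OPT} binary):

buddy-opt linalg-conv2d_nhwc_fhwc.mlir \
    -conv-nhwc-fhwc-optimize="vec-size=16" \
    -o ./log.mlir

buddy-opt linalg-conv2d_nhwc_fhwc.mlir \
    -conv-nhwc-fhwc-tile-optimize="vec-size=16 tiling-height=2 tiling-width=3" \
    -o ./log.mlir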

Example MLIR Files:

  • batchmatmul
  • conv2d_nhwc_fhwc
  • depthwise_conv_2d_nhwc_hwc
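
For a sense of what these example files exercise, a minimal batch_matmul kernel might look like the following (a sketch only, not necessarily the exact contents of the added files):

func.func @batch_matmul(%A: memref<?x?x?xf32>, %B: memref<?x?x?xf32>,
                        %C: memref<?x?x?xf32>) {
  // The new passes rewrite this linalg op into tiled/vectorized loops.
  linalg.batch_matmul
    ins(%A, %B : memref<?x?x?xf32>, memref<?x?x?xf32>)
    outs(%C : memref<?x?x?xf32>)
  return
}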

These updates collectively improve the performance of matrix and convolution operations in MLIR, providing optimized patterns for common workloads like BatchMatMul and convolution layers.

Somehow6 and others added 30 commits June 18, 2024 20:12
… scf version for further work. Update conv2dnhwcfhwc vectorization and tile. Update depthwise_conv2dnhwchwc vectorization and tile
…ize 2.conv-nhwc-fhwc-tile-optimize 3.depthwise-conv-nhwc-hwc-optimize 4.batchmatmul-tile-optimize 5.batchmatmul-scf-optimize . Example mlir: batchmatmul conv2d_nhwc_fhwc depthwise_conv_2d_nhwc_hwc
…ize 2.conv-nhwc-fhwc-tile-optimize 3.depthwise-conv-nhwc-hwc-optimize 4.batchmatmul-tile-optimize 5.batchmatmul-scf-optimize . Example mlir: batchmatmul conv2d_nhwc_fhwc depthwise_conv_2d_nhwc_hwc
…ion [Examples] Added MLIRLinalg Examples for Various Optimization Options
…ion [Examples] Added MLIRLinalg Examples for Various Optimization Options
…ion [Examples] Added MLIRLinalg Examples for Various Optimization Options. fixed thirdparty.
loc, AffineMap::get(1, 0, d0.ceilDiv(tilingOC)), OC);

// clang-format off
// Step 1: Create outermost loops.

Do we have a Step 2 in this file? If not, the word "Step" should be removed.

Value FW = rewriter.create<memref::DimOp>(loc, filter, 2); // FW

// clang-format off
// Step 1: Create outermost loops.

Do we have a Step 2 in this file? If not, the word "Step" should be removed.

Comment on lines +63 to +87
linalg-conv2d_nhwc_fhwc-optimize-lower:
@${BUDDY_OPT} linalg-conv2d_nhwc_fhwc.mlir \
-conv-nhwc-fhwc-optimize="vec-size=16" \
-o ./log.mlir

linalg-conv2d_nhwc_fhwc-tile-optimize-lower:
@${BUDDY_OPT} linalg-conv2d_nhwc_fhwc.mlir \
-conv-nhwc-fhwc-tile-optimize="vec-size=16 tiling-height=2 tiling-width=3" \
-o ./log.mlir

linalg-conv2d_nhwc_fhwc-optimize-run:
@${BUDDY_OPT} linalg-conv2d_nhwc_fhwc.mlir ${MLIR_OPT_OPTIONS} \
-conv-nhwc-fhwc-optimize="vec-size=16" \
-lower-affine -convert-scf-to-cf \
-convert-vector-to-llvm -finalize-memref-to-llvm -convert-arith-to-llvm \
-convert-func-to-llvm -reconcile-unrealized-casts | \
${MLIR_CPU_RUNNER} ${OPT_FLAG} -e main -entry-point-result=void -shared-libs=${MLIR_RUNNER_UTILS} -shared-libs=${MLIR_C_RUNNER_UTILS}

linalg-conv2d_nhwc_fhwc-tile-optimize-run:
@${BUDDY_OPT} linalg-conv2d_nhwc_fhwc.mlir ${MLIR_OPT_OPTIONS} \
-conv-nhwc-fhwc-tile-optimize="vec-size=16 tiling-height=2 tiling-width=3" \
-lower-affine -convert-scf-to-cf \
-convert-vector-to-llvm -finalize-memref-to-llvm -convert-arith-to-llvm \
-convert-func-to-llvm -reconcile-unrealized-casts | \
${MLIR_CPU_RUNNER} ${OPT_FLAG} -e main -entry-point-result=void -shared-libs=${MLIR_RUNNER_UTILS} -shared-libs=${MLIR_C_RUNNER_UTILS}

It would be better to follow the ordering of the existing test targets. For example, the tests in this code should be ordered:

  1. linalg-conv2d_nhwc_fhwc-optimize-lower
  2. linalg-conv2d_nhwc_fhwc-optimize-run
  3. linalg-conv2d_nhwc_fhwc-tile-optimize-lower
  4. linalg-conv2d_nhwc_fhwc-tile-optimize-run

Comment on lines +85 to 262
                    AffineApplyOp>(
                    loc, AffineMap::get(1, 0, d0 + j * vecSize), ivG);

                Value i = builder.create<TransferReadOp>(
                    loc, vecTy, input,
                    ValueRange{ivA, ivE, rowInput, columnInput});

                auto protectedF = builder.create<affine::AffineIfOp>(
                    loc, vecTy,
                    IntegerSet::get(1, 1, {s0 - 1 - d0}, {false}),
                    ValueRange{rowFilter, FH}, true);

                // if row in range, read normally.
                auto thenBuilder = protectedF.getThenBodyBuilder();
                Value normalReadVec = thenBuilder.create<TransferReadOp>(
                    loc, vecTy, filter,
                    ValueRange{ivB, ivE, rowFilter, columnFilter});
                thenBuilder.create<affine::AffineYieldOp>(loc, normalReadVec);

                // if row out of range, give back N empty vector.
                auto elseBuilder = protectedF.getElseBodyBuilder();
                Value emptyVec = elseBuilder.create<SplatOp>(loc, vecTy, cf0);
                elseBuilder.create<affine::AffineYieldOp>(loc, emptyVec);

                iList.push_back(i);
                fList.push_back(protectedF->getOpResult(0));
              }
            }
            Value lastResult =
                builder.create<memref::LoadOp>(loc, buffer, c0);
            for (int i = 0; i < kernelM; ++i) {
              for (int j = 0; j < kernelN; ++j) {
                lastResult = builder.create<vector::FMAOp>(
                    loc, vecTy, iList[i * kernelN + j],
                    fList[i * kernelN + j], lastResult);
              }
            }

            builder.create<memref::StoreOp>(loc, lastResult, buffer, c0);
          });
        });
      });

      Value reduceVec = builder.create<memref::LoadOp>(loc, buffer, c0);
      Value reducedRes = builder.create<vector::ReductionOp>(
          loc, vector::CombiningKind::ADD, reduceVec);
      Value bias = builder.create<memref::LoadOp>(
          loc, output, ValueRange{ivA, ivB, ivC, ivD});
      Value addRes = builder.create<arith::AddFOp>(loc, bias, reducedRes);
      builder.create<memref::StoreOp>(
          loc, addRes, output, ValueRange{ivA, ivB, ivC, ivD});
    });
  });
});

Is the purpose of modifying this code just beautification/formatting? Would the original code look better?

Value FW = rewriter.create<memref::DimOp>(loc, filter, 1); // FW

// clang-format off
// Step 1: Create outermost loops.

Do we have a Step 2 in this file? If not, the word "Step" should be removed.

Comment on lines +84 to +96

const Value zeroElementTypeVec =
isa<IntegerType>(elementType)
? rewriter
.create<vector::BroadcastOp>(
loc, VectorType::get({affineVectorSize}, elementType),
zeroElementType)
.getResult()
: rewriter
.create<vector::SplatOp>(
loc, VectorType::get({affineVectorSize}, elementType),
zeroElementType)
.getResult();

What is the reason for this change? (If it is an actual functional change, there is no need to add a comment in the code; I am just wondering.)

Comment on lines +16 to +22
add_mlir_library(BatchMatMulTileOptimization
BatchMatMulTileOptimize.cpp
)

add_mlir_library(BatchMatMulSCFOptimization
BatchMatMulSCFOptimize.cpp
)

Do BatchMatMulOptimize, BatchMatMulTileOptimize, and BatchMatMulSCFOptimize each need their own add_mlir_library call? Should all three of them belong to BatchMatMulOptimization?
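
For illustration only, the consolidation suggested above might look roughly like this (assuming a BatchMatMulOptimize.cpp source file and the existing BatchMatMulOptimization target name; this is not code from the PR):

add_mlir_library(BatchMatMulOptimization
  BatchMatMulOptimize.cpp
  BatchMatMulTileOptimize.cpp
  BatchMatMulSCFOptimize.cpp
)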

@xlinsist

It seems that the FileCheck tests for the MLIR files in the examples/MLIRLinalg directory do not pass the lit check. After modifying this part, you need to run the ninja check-buddy command in the build folder to verify that the modified part is correct.

[screenshot]

@xlinsist

Running make linalg-depthwise_conv_2d_nhwc_hwc-optimize-lower leads to a lowering error:

[screenshot]

@xlinsist

Running either linalg-batch-matmul-tile-optimize-lower or linalg-batch-matmul-scf-optimize-lower leads to the following error:

[screenshot]

@xlinsist left a comment

When I try to understand the optimization passes in this PR, I have some questions that I would like to verify:

  1. We don't need to modify the strategies of the original ConvOptimize.cpp and BatchMatMulOptimize pass implementations, right? But I see that there are code changes to them in this PR.

  2. linalg-conv2d_nhwc_fhwc-optimize vectorizes the Channel dimension, and the number of vector elements in one iteration is fixed to 16, with no tail processing implemented. Does this mean the pass is only applicable to scenarios where the number of channels is divisible by 16? (See the tail-loop sketch after this list.)

  3. The block size of linalg-conv2d_nhwc_fhwc-tile-optimize is 2x3, which I assume is the tile size for Height and Width? It would be better to add comments in the code elaborating the optimization strategies.

  4. The lowerings of linalg-batch-matmul-scf-optimize, linalg-batch-matmul-tile-optimize, and linalg-depthwise_conv_2d_nhwc_hwc-optimize in the Makefile report errors. It is recommended to ensure that the examples in examples/MLIRLinalg can be generated correctly before integrating these passes into buddy-benchmark for Ops testing.
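
A standalone illustration of the tail-processing concern in point 2 (plain C++ with hypothetical values; not code from this PR): the usual remedy is to split the channel loop into full vec-size iterations plus a scalar remainder loop.

#include <cstdio>

int main() {
  const int channels = 37; // hypothetical channel count, not divisible by 16
  const int vecSize = 16;  // matches the fixed vec-size=16 discussed above
  const int fullIters = channels / vecSize; // complete vector iterations
  const int tail = channels % vecSize;      // leftover channels
  for (int i = 0; i < fullIters; ++i)
    std::printf("vector iteration over channels [%d, %d)\n", i * vecSize,
                (i + 1) * vecSize);
  if (tail != 0)
    std::printf("scalar tail loop over the last %d channels\n", tail);
  return 0;
}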
