Default Pipeline
Our current prototype is built from standard MLIR passes, driven by `mlir::OpPassManager` (link) and a local helper (link).
See an analysis of the passes on an example IR here.
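For readers unfamiliar with the mechanism, the sketch below shows how a pipeline is typically registered on top of `mlir::OpPassManager`. It is illustrative only: the pipeline name `example-pipeline`, the helper `buildExamplePipeline`, and the pass selection are assumptions, not the project's actual registration code.

```cpp
// Minimal sketch of registering a pass pipeline with mlir::OpPassManager.
// The pipeline name and the pass selection are illustrative assumptions.
#include "mlir/Pass/PassManager.h"
#include "mlir/Pass/PassRegistry.h"
#include "mlir/Transforms/Passes.h"

static void buildExamplePipeline(mlir::OpPassManager &pm) {
  // Upstream passes used here only to demonstrate the mechanism.
  pm.addPass(mlir::createCanonicalizerPass());
  pm.addPass(mlir::createCSEPass());
}

// Exposes the pipeline as `--example-pipeline` in opt-like tools.
static mlir::PassPipelineRegistration<>
    pipeline("example-pipeline", "Illustrative default pipeline",
             buildExamplePipeline);
```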
There are three main files that implement the pipeline (and plenty more for the passes):
- `DefaultPipeline.cpp`: The main entry point.
- `DefaultTppPasses.cpp`: The CPU pipeline.
- `GpuPipeline.cpp`: The GPU pipeline.
The main entry point calls either the CPU or the GPU pipelines first (depending on the target), then the final lowering to LLVM dialect and cleanups.
```
| Linalg + Tensor | --> (CPU Pipeline) --> (Lowering + Cleanups) --> | LLVM dialect |
                    \-> (GPU Pipeline) --/
```
Figure 1: Overview of the default pipeline.
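As a rough sketch of that structure (the helper names `buildCpuPipeline`, `buildGpuPipeline`, and `buildLoweringAndCleanups` are placeholders for illustration, not the actual functions in `DefaultPipeline.cpp`):

```cpp
// Sketch of the entry-point dispatch described above. All names here are
// illustrative placeholders, not the real DefaultPipeline.cpp API.
#include "mlir/Pass/PassManager.h"

namespace example {
enum class Target { cpu, gpu };

// Placeholders standing in for DefaultTppPasses.cpp and GpuPipeline.cpp.
void buildCpuPipeline(mlir::OpPassManager &pm);
void buildGpuPipeline(mlir::OpPassManager &pm);
// Shared back half: lowering to the LLVM dialect plus cleanups.
void buildLoweringAndCleanups(mlir::OpPassManager &pm);

void buildDefaultPipeline(mlir::OpPassManager &pm, Target target) {
  // Target-specific front half.
  if (target == Target::gpu)
    buildGpuPipeline(pm);
  else
    buildCpuPipeline(pm);
  // Common final lowering and cleanups.
  buildLoweringAndCleanups(pm);
}
} // namespace example
```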
The CPU pipeline is composed of multiple steps in a single chain.
There is also a testing mode that bypasses the whole CPU pipeline: everything is lowered from Linalg directly to loops, bufferized, and the rest of the pipeline then continues as usual. We use this mode to make sure our passes are correct; it can be ignored from a design point of view.
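A minimal sketch of what such a bypass could look like, using only upstream MLIR passes (the `linalgToLoops` flag and the surrounding function are assumptions; the real pipeline's ordering and options may differ):

```cpp
// Sketch of the Linalg-to-loops testing bypass. The flag name and structure
// are assumptions; only the upstream pass factories are real MLIR APIs.
#include "mlir/Dialect/Bufferization/Transforms/Passes.h"
#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Pass/PassManager.h"

void buildCpuFrontHalf(mlir::OpPassManager &pm, bool linalgToLoops) {
  if (linalgToLoops) {
    // Testing mode: skip the TPP passes entirely. Bufferize first so that
    // convert-linalg-to-loops (which operates on memrefs) can apply.
    pm.addPass(mlir::bufferization::createOneShotBufferizePass());
    pm.addPass(mlir::createConvertLinalgToLoopsPass());
    return;
  }
  // Normal mode: the pre-TPP and TPP pass chain described below goes here.
}
```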
The main CPU chain is composed of the following steps (in order):
- Pre-TPP passes: These need to run before the main TPP passes so that packing and tiling work on a simpler form.
  - Add In Place: Pre-optimization that replaces `zero + matmul + bias` with `bias + matmul` to avoid two extra operations (see the sketch after this list).
  - BrGEMM To GEMM: Certain cases can be simplified if the batch reduction is unnecessary.
- TPP Passes: The core of our work.
  - Some early convolution passes, converting convolutions to matmuls.
  - Pack Matmul: Find optimal shapes and re-layout tensors for best CPU caching (see the layout sketch after this list).
  - Pack VNNI: On supporting hardware, additional type packing for correct instruction utilization (VNNI, BFMMLA, etc.).
  - Propagate Packing: Push packed shapes through element-wise ops until the next pack instruction.
  - Constant Fold Pack: Replace constant packs with packed constants, if possible.
  - Simplify Pack: Merge `pack`/`unpack` pairs when possible.
  - Cleanups & artifact removal.
  - Lower Packing: Lower packs to optimal loops, linalg, copies.
- Bufferization: Converts tensors to memrefs, still at Linalg + Loops semantics.
- Linalg Lowering: Linalg to LIBXSMM mapping.
  - Linalg to XSMM: Convert Linalg operations to the XSMM dialect (micro-kernel tile ops).
  - Combine XSMM: Fuse matmuls with element-wise ops if the library allows it (`fused_brgemm`).
  - Fold XSMM flags: Further folding of independent ops (on operand producers) into flags (on consumer operations).
  - Verify XSMM: Final pass to make sure the library semantics are correct.
- Forall to Parallel: Due to an OpenMP dialect restriction, we need to convert `scf.forall` into `scf.parallel`.
- Local dialects lowering: `perf` and `check` dialects to loops or the LLVM dialect.
- Loop tiling & extension configuration: Low-level loop tiling (xsmm dialect) + AMX config tricks.
  - LICM: Loop-invariant code motion (pre-parallelization).
  - Parallel loop tiling: 2D parallelization for multi-threading.
  - AMX Tile Config: Setup + LICM + hoisting.
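To make the Add In Place rewrite concrete, the tiny standalone sketch below (plain C++, illustrative shapes and values) checks the identity it relies on: seeding the matmul accumulator with the bias produces the same result as zero-initializing, multiplying, and adding the bias afterwards.

```cpp
// Numeric check of the Add In Place identity: bias-seeded accumulation equals
// zero-init + matmul + bias-add. Shapes and values are illustrative only.
#include <cassert>
#include <vector>

int main() {
  const int M = 2, N = 2, K = 3;
  std::vector<float> A = {1, 2, 3, 4, 5, 6};  // M x K
  std::vector<float> B = {1, 0, 2, 1, 0, 3};  // K x N
  std::vector<float> bias = {10, 20, 30, 40}; // M x N

  // Variant 1: zero-init, matmul, then add bias (three conceptual ops).
  std::vector<float> c1(M * N, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        c1[m * N + n] += A[m * K + k] * B[k * N + n];
  for (int i = 0; i < M * N; ++i)
    c1[i] += bias[i];

  // Variant 2: seed the accumulator with the bias, then matmul (two ops).
  std::vector<float> c2 = bias;
  for (int m = 0; m < M; ++m)
    for (int n = 0; n < N; ++n)
      for (int k = 0; k < K; ++k)
        c2[m * N + n] += A[m * K + k] * B[k * N + n];

  for (int i = 0; i < M * N; ++i)
    assert(c1[i] == c2[i]);
  return 0;
}
```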
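The packing steps can also be made concrete. The sketch below (plain C++, no MLIR; block sizes and names are illustrative) shows the blocked re-layout that Pack Matmul performs conceptually: an M x N matrix is rearranged into [M/BM][N/BN][BM][BN] tiles so that each BM x BN block is contiguous and cache-friendly for the micro-kernel.

```cpp
// Conceptual sketch of a blocked ("packed") layout: BM x BN tiles stored
// contiguously, so a micro-kernel can stream one tile at a time from cache.
// Illustrative only; the real pass emits pack ops, not this code.
#include <cstddef>
#include <vector>

std::vector<float> packMatrix(const std::vector<float> &plain, size_t M,
                              size_t N, size_t BM, size_t BN) {
  // Assumes M % BM == 0 and N % BN == 0 for simplicity.
  std::vector<float> packed(M * N);
  size_t idx = 0;
  for (size_t mb = 0; mb < M / BM; ++mb)   // outer tile rows
    for (size_t nb = 0; nb < N / BN; ++nb) // outer tile cols
      for (size_t mi = 0; mi < BM; ++mi)   // inner row within a tile
        for (size_t ni = 0; ni < BN; ++ni) // inner col within a tile
          packed[idx++] = plain[(mb * BM + mi) * N + (nb * BN + ni)];
  return packed;
}
```

Roughly speaking, Pack VNNI applies the same idea but additionally splits the reduction dimension so that pairs (or quadruples) of low-precision elements line up with a single VNNI or BFMMLA instruction.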
The GPU pipeline is not documented here yet (TODO).
After either CPU or GPU pipelines finish, there's a final lowering stage to convert the resulting IR into LLVM dialect. Most of these passes apply to either CPU or GPU, but running them on generic IR should not break anything.
The main post-processing passes in the lowering are (see the sketch after this list):
- Partial lowering: Cascade dialect conversion into loops + func.
  - Linalg to loops: For any remaining (non-XSMM) ops.
  - SCF to OpenMP: Expand `scf.parallel` into OMP loops.
  - Arith/Affine lowering.
- Final lowering to LLVM dialect.
  - Vector, memref, SCF, OpenMP, Math, etc.
  - GPU lowering passes: this should really only be run for GPU targets.
    - TODO
    - Left overs? Is this related to the GPU pass?
  - Func to LLVM
  - Arith to LLVM
- Cleanups.
  - Canonicalizer pass.
  - CSE pass.
  - Reconcile unrealized casts.
  - DCE pass.
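As a rough sketch of such a lowering and cleanup stage, using only upstream MLIR passes (pass selection and ordering are illustrative, exact factory names vary across MLIR versions, the project-specific passes are omitted, and `buildLoweringAndCleanups` is the same hypothetical helper name used earlier):

```cpp
// Sketch of a final lowering + cleanup stage built from upstream MLIR passes.
// Selection and ordering are illustrative, not the project's actual pipeline.
#include "mlir/Conversion/Passes.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

void buildLoweringAndCleanups(mlir::OpPassManager &pm) {
  // Expand scf.parallel into OpenMP constructs, then lower control flow.
  pm.addPass(mlir::createConvertSCFToOpenMPPass());
  pm.addPass(mlir::createConvertSCFToCFPass());

  // Lower the remaining dialects into the LLVM dialect.
  pm.addPass(mlir::createConvertOpenMPToLLVMPass());
  pm.addPass(mlir::createConvertFuncToLLVMPass());
  pm.addPass(mlir::createArithToLLVMConversionPass());

  // Cleanups.
  pm.addPass(mlir::createCanonicalizerPass());
  pm.addPass(mlir::createCSEPass());
  pm.addPass(mlir::createReconcileUnrealizedCastsPass());
}
```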