
Default Pipeline


Our current prototype is based on standard MLIR passes using the mlir::OpPassManager (link) and a local helper (link).

See an analysis of the passes on an example IR here.

There are three main files that implement the pipeline (and plenty more for the passes).

The Pipeline

The main entry point calls either the CPU or the GPU pipeline first (depending on the target), followed by the final lowering to the LLVM dialect and cleanups.

| Linalg + Tensor | --> (CPU Pipeline) --> (Lowering + Cleanups) --> | LLVM dialect |
                    \-> (GPU Pipeline) -/

Figure 1: Overview of the default pipeline.
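
As a rough sketch of how this hangs together in code, the entry point just populates an mlir::OpPassManager with the target-specific pipeline, followed by the shared lowering and cleanups. The helper names below are hypothetical, not the ones used in the repository:

```cpp
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Hypothetical helpers standing in for the real pipeline builders.
void buildCpuPipeline(mlir::OpPassManager &pm);
void buildGpuPipeline(mlir::OpPassManager &pm);
void buildLowering(mlir::OpPassManager &pm);

// Schematic top-level pipeline: target-specific passes first, then the
// shared lowering to the LLVM dialect and cleanups (see Figure 1).
void buildDefaultPipeline(mlir::OpPassManager &pm, bool targetIsGpu) {
  if (targetIsGpu)
    buildGpuPipeline(pm);
  else
    buildCpuPipeline(pm);

  buildLowering(pm);
  pm.addPass(mlir::createCanonicalizerPass());
  pm.addPass(mlir::createCSEPass());
}
```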

CPU Pipeline

The CPU pipeline is composed of multiple steps in a single chain.

There is a testing mode that bypasses the whole CPU pipeline: it simply lowers Linalg to loops and bufferizes, then continues with the rest of the pipeline. We use this mode to make sure our passes are correct; it can be ignored from a design point of view.
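
A minimal sketch of what that bypass could look like using only upstream passes (the exact passes and order in the repository may differ; upstream Linalg-to-loops expects bufferized ops, so bufferization comes first in this sketch):

```cpp
#include "mlir/Dialect/Bufferization/Transforms/Passes.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Pass/PassManager.h"

// Testing-only bypass: skip the TPP passes entirely, bufferize, lower all
// remaining Linalg ops to plain loops, and let the rest of the pipeline run.
void buildLoopsFallback(mlir::OpPassManager &pm) {
  pm.addPass(mlir::bufferization::createOneShotBufferizePass());
  pm.addNestedPass<mlir::func::FuncOp>(mlir::createConvertLinalgToLoopsPass());
}
```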

The main CPU chain is composed of the following steps (in order):

  1. Pre-TPP passes: These need to run before the main TPP passes so that packing and tiling work on a simpler form.
    • Add In Place: Pre-optimization that replaces zero+matmul+bias with bias+matmul to avoid two extra operations.
    • BrGEMM To GEMM: Certain cases can be simplified if the batch reduction is unnecessary.
  2. TPP Passes: The core of our work.
    • Some early convolution passes, converting them to matmuls.
    • Pack Matmul: Find optimal shapes and re-layout tensors for best CPU caching (a layout sketch follows this list).
    • Pack VNNI: On hardware that supports it, additional type packing for correct instruction utilization (VNNI, BFMMLA, etc.).
    • Propagate Packing: Push packed shapes through element-wise ops until the next pack instruction.
    • Constant Fold Pack: Replace constant packs with packed constants, if possible.
    • Simplify Pack: Merge pack/unpack pairs when possible.
    • Cleanups & artifact removal
  3. Lower Packing: Lower packs into optimal loops, Linalg ops, or copies.
  4. Bufferization: Convert tensors to memrefs, still at Linalg + Loops semantics (see the bufferization sketch after this list).
  5. Linalg Lowering: Linalg to LIBXSMM mapping.
    • Linalg to XSMM: Convert linalg operations to XSMM dialect (micro-kernel tile ops).
    • Combine XSMM: Fuse matmuls with element-wise ops if the library allows it (fused_brgemm).
    • Fold XSMM flags: Further folding of independent ops (on operand producers) into flags (on consumer operations).
    • Verify XSMM: Final pass to make sure the library semantics are respected.
  6. Forall to Parallel: Due to an OpenMP dialect restriction, we need to convert scf.forall into scf.parallel.
  7. Local dialect lowering: Lower the perf and check dialects to loops or the LLVM dialect.
  8. Loop tiling & extension configuration: Low-level loop tiling (xsmm dialect) + AMX config tricks (see the sketch after this list).
    • LICM: Loop-invariant code motion (pre-parallelization).
    • Parallel loop tiling: 2D parallelization for multi-threading.
    • AMX Tile Config: Setup + LICM + hoisting.
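
To give an intuition for the re-layout done by Pack Matmul (step 2 above), here is a small, purely illustrative sketch of how an element of a row-major M x K matrix maps into a blocked [M/mb][K/kb][mb][kb] layout; the 32x32 block size is an example, not necessarily what the pass picks:

```cpp
#include <array>
#include <cstdint>

// Illustrative only: block sizes are example values, not what Pack Matmul
// actually chooses. Maps a (row, col) element of an M x K row-major matrix
// into a cache-friendly blocked layout [M/mb][K/kb][mb][kb].
constexpr int64_t mb = 32, kb = 32;

std::array<int64_t, 4> packedIndex(int64_t row, int64_t col) {
  return {row / mb, col / kb, row % mb, col % kb};
}
```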
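
For step 4, a minimal sketch using upstream one-shot bufferization; whether function boundaries are bufferized is an assumption made for the sketch, not a statement of the project's actual configuration:

```cpp
#include "mlir/Dialect/Bufferization/Transforms/OneShotAnalysis.h"
#include "mlir/Dialect/Bufferization/Transforms/Passes.h"
#include "mlir/Pass/PassManager.h"

// Step 4, schematically: one-shot bufferization turns tensor SSA values into
// memrefs while the IR is still at Linalg + Loops level. The option below is
// an assumed setting, not necessarily the one used in the repository.
void addBufferization(mlir::OpPassManager &pm) {
  mlir::bufferization::OneShotBufferizationOptions options;
  options.bufferizeFunctionBoundaries = true;
  pm.addPass(mlir::bufferization::createOneShotBufferizePass(options));
}
```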
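
For steps 6 and 8, the upstream passes for LICM and 2D parallel-loop tiling look roughly like this; the tile sizes are placeholders, and the scf.forall-to-scf.parallel conversion is left as a comment because the exact pass is project-specific:

```cpp
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/SCF/Transforms/Passes.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Steps 6 and 8, schematically. Tile sizes are placeholders, not tuned values.
void addLowLevelLoopPasses(mlir::OpPassManager &pm) {
  // Step 6: scf.forall -> scf.parallel (project-specific pass, omitted here).
  // Step 8: LICM before parallelization, then 2D parallel-loop tiling.
  pm.addPass(mlir::createLoopInvariantCodeMotionPass());
  pm.addNestedPass<mlir::func::FuncOp>(
      mlir::createParallelLoopTilingPass({2, 8}));
}
```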

GPU Pipeline

TODO.

Lowering

After either CPU or GPU pipelines finish, there's a final lowering stage to convert the resulting IR into LLVM dialect. Most of these passes apply to either CPU or GPU, but running them on generic IR should not break anything.

The main post-processing passes in the lowering are:

  1. Partial lowering: Cascade dialect conversion into loops + func (see the sketch after this list).
    • Linalg to loops: For any remaining (non-XSMM) ops.
    • SCF to OpenMP: Expand scf.parallel into OMP loops.
    • Arith/Affine lowering.
  2. Final lowering to LLVM dialect.
    • Vector, memref, SCF, OpenMP, Math, etc.
  3. GPU lowering passes: these should really only run for GPU targets.
    • TODO
  4. Leftovers? Is this related to the GPU pass?
    • Func to LLVM
    • Arith to LLVM
  5. Cleanups.
    • Canonicalizer pass.
    • CSE pass.
    • Reconcile unrealized casts.
    • DCE pass.
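
A hedged sketch of this lowering stage assembled purely from upstream conversion passes; the actual pass set and ordering in the repository may differ, and GPU-specific passes are left out:

```cpp
#include "mlir/Conversion/Passes.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/Linalg/Passes.h"
#include "mlir/Pass/PassManager.h"
#include "mlir/Transforms/Passes.h"

// Schematic lowering stage built from upstream conversions only.
void buildLowering(mlir::OpPassManager &pm) {
  // Partial lowering: leftover Linalg ops to loops, scf.parallel to OpenMP.
  pm.addNestedPass<mlir::func::FuncOp>(mlir::createConvertLinalgToLoopsPass());
  pm.addPass(mlir::createConvertSCFToOpenMPPass());

  // Final lowering to the LLVM dialect (vector/memref/math conversions
  // would also go here).
  pm.addPass(mlir::createConvertFuncToLLVMPass());
  pm.addPass(mlir::createArithToLLVMConversionPass());

  // Cleanups.
  pm.addPass(mlir::createCanonicalizerPass());
  pm.addPass(mlir::createCSEPass());
  pm.addPass(mlir::createReconcileUnrealizedCastsPass());
}
```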