
[mlir] [dataflow] unify semantics of program point #110344

Merged — 1 commit merged into llvm:main on Oct 11, 2024

Conversation

cxy-1993 (Contributor)

The concept of a 'program point' in the original data-flow framework is ambiguous: it can refer to either an operation or a block itself, and this representation has different interpretations in forward and backward data-flow analysis. In forward data-flow analysis, the program point of an operation represents the state after the operation, while in backward data-flow analysis it represents the state before the operation. When using forward or backward data-flow analysis, it is crucial to handle this distinction carefully to ensure correctness.
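For illustration, here is a minimal sketch of the pre-patch behavior (a hypothetical snippet; it uses the old `getLatticeFor(ProgramPoint dependent, LatticeAnchor anchor)` overload visible in the diff below, with `prev` standing in for some adjacent operation):

```cpp
// Pre-patch: ProgramPoint was a PointerUnion<Operation *, Block *>, so state
// anchored directly on an operation was read direction-dependently.
const AbstractDenseLattice *state = getLatticeFor(op, prev);
// In a dense forward analysis:  `state` is the program state *after* `prev`.
// In a dense backward analysis: `state` is the program state *before* `prev`.
```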

This patch refactors the definition of a program point, unifying the interpretation of program points in both forward and backward data-flow analysis.

How to integrate this patch?

For dense forward data-flow analysis and all other analyses (except dense backward data-flow analysis), the program point corresponding to the original operation can be obtained with `getProgramPointAfter(op)`, and the program point corresponding to the original block with `getProgramPointBefore(block)`.

For dense backward data-flow analysis, the program point corresponding to the original operation can be obtained with `getProgramPointBefore(op)`, and the program point corresponding to the original block with `getProgramPointAfter(block)`.
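As a concrete forward-analysis example, the migration pattern looks roughly like this (a sketch distilled from the `StackArrays.cpp` hunk in the diff below; error handling omitted):

```cpp
// Old: Operation*/Block* were passed directly as program points, e.g.
//   before = getLatticeFor(op, prev);            // state after `prev`, or
//   before = getLatticeFor(op, op->getBlock());  // block-entry state.
// New: request explicit before/after program points from the framework.
mlir::ProgramPoint *point = getProgramPointAfter(op);  // state after `op`
const mlir::dataflow::AbstractDenseLattice *before =
    getLatticeFor(point, getProgramPointBefore(op));   // state before `op`
```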

NOTE: If you need to get the lattice of another data-flow analysis from within a dense backward data-flow analysis, you should still use the dense forward data-flow convention. For example, to get the `Executable` state of a block in dense backward data-flow analysis and add a dependency on the current operation, you should write:

`getOrCreateFor<Executable>(getProgramPointBefore(op), getProgramPointBefore(block))`

In the case above, we use `getProgramPointBefore(op)` because the analysis we are running is dense backward data-flow, and we use `getProgramPointBefore(block)` because the lattice we query is produced by an analysis that is not dense backward data-flow.
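Put together, the start of a dense backward analysis's operation visit might look like this (a hedged sketch modeled on the forward-analysis `StackArrays.cpp` code in the diff below; the surrounding method is assumed):

```cpp
// The dependent point follows the backward convention (before `op`); the
// Executable state is still anchored at block entry, per the NOTE above.
mlir::ProgramPoint *dependent = getProgramPointBefore(op);
const auto *executable = getOrCreateFor<mlir::dataflow::Executable>(
    dependent, getProgramPointBefore(op->getBlock()));
if (!executable->isLive())
  return mlir::success(); // containing block is dead; nothing to propagate
```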

Related discussion: https://discourse.llvm.org/t/rfc-unify-the-semantics-of-program-points/80671/8
Corresponding PSA: https://discourse.llvm.org/t/psa-program-point-semantics-change/81479

llvmbot added labels on Sep 28, 2024: mlir:core (MLIR Core Infrastructure), mlir, flang (Flang issues not falling into any other category), mlir:arith, flang:fir-hlfir
llvmbot (Member) commented Sep 28, 2024

@llvm/pr-subscribers-flang-fir-hlfir
@llvm/pr-subscribers-mlir-arith

@llvm/pr-subscribers-mlir

Author: donald chen (cxy-1993)

Changes



Patch is 87.78 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110344.diff

18 Files Affected:

  • (modified) flang/lib/Optimizer/Transforms/StackArrays.cpp (+13-11)
  • (modified) mlir/include/mlir/Analysis/DataFlow/DeadCodeAnalysis.h (+1-1)
  • (modified) mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h (+36-40)
  • (modified) mlir/include/mlir/Analysis/DataFlow/SparseAnalysis.h (+22-13)
  • (modified) mlir/include/mlir/Analysis/DataFlowFramework.h (+174-47)
  • (modified) mlir/include/mlir/IR/Block.h (+21)
  • (modified) mlir/lib/Analysis/DataFlow/DeadCodeAnalysis.cpp (+46-30)
  • (modified) mlir/lib/Analysis/DataFlow/DenseAnalysis.cpp (+69-79)
  • (modified) mlir/lib/Analysis/DataFlow/IntegerRangeAnalysis.cpp (+2-2)
  • (modified) mlir/lib/Analysis/DataFlow/LivenessAnalysis.cpp (+1-1)
  • (modified) mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp (+50-37)
  • (modified) mlir/lib/Analysis/DataFlowFramework.cpp (+18-11)
  • (modified) mlir/lib/Dialect/Arith/Transforms/IntRangeOptimizations.cpp (+1-1)
  • (modified) mlir/test/lib/Analysis/DataFlow/TestDeadCodeAnalysis.cpp (+9-6)
  • (modified) mlir/test/lib/Analysis/DataFlow/TestDenseBackwardDataFlowAnalysis.cpp (+7-8)
  • (modified) mlir/test/lib/Analysis/DataFlow/TestDenseForwardDataFlowAnalysis.cpp (+4-3)
  • (modified) mlir/test/lib/Analysis/DataFlow/TestSparseBackwardDataFlowAnalysis.cpp (+1-1)
  • (modified) mlir/test/lib/Analysis/TestDataFlowFramework.cpp (+16-15)
diff --git a/flang/lib/Optimizer/Transforms/StackArrays.cpp b/flang/lib/Optimizer/Transforms/StackArrays.cpp
index a8f1a744cda5fe..02a594b1e0cd37 100644
--- a/flang/lib/Optimizer/Transforms/StackArrays.cpp
+++ b/flang/lib/Optimizer/Transforms/StackArrays.cpp
@@ -376,7 +376,7 @@ mlir::LogicalResult AllocationAnalysis::visitOperation(
     }
   } else if (mlir::isa<fir::ResultOp>(op)) {
     mlir::Operation *parent = op->getParentOp();
-    LatticePoint *parentLattice = getLattice(parent);
+    LatticePoint *parentLattice = getLattice(getProgramPointAfter(parent));
     assert(parentLattice);
     mlir::ChangeResult parentChanged = parentLattice->join(*after);
     propagateIfChanged(parentLattice, parentChanged);
@@ -397,28 +397,29 @@ void AllocationAnalysis::setToEntryState(LatticePoint *lattice) {
 /// Mostly a copy of AbstractDenseLattice::processOperation - the difference
 /// being that call operations are passed through to the transfer function
 mlir::LogicalResult AllocationAnalysis::processOperation(mlir::Operation *op) {
+  mlir::ProgramPoint *point = getProgramPointAfter(op);
   // If the containing block is not executable, bail out.
-  if (!getOrCreateFor<mlir::dataflow::Executable>(op, op->getBlock())->isLive())
+  if (op->getBlock() != nullptr &&
+      !getOrCreateFor<mlir::dataflow::Executable>(
+           point, getProgramPointBefore(op->getBlock()))
+           ->isLive())
     return mlir::success();
 
   // Get the dense lattice to update
-  mlir::dataflow::AbstractDenseLattice *after = getLattice(op);
+  mlir::dataflow::AbstractDenseLattice *after = getLattice(point);
 
   // If this op implements region control-flow, then control-flow dictates its
   // transfer function.
   if (auto branch = mlir::dyn_cast<mlir::RegionBranchOpInterface>(op)) {
-    visitRegionBranchOperation(op, branch, after);
+    visitRegionBranchOperation(point, branch, after);
     return mlir::success();
   }
 
   // pass call operations through to the transfer function
 
   // Get the dense state before the execution of the op.
-  const mlir::dataflow::AbstractDenseLattice *before;
-  if (mlir::Operation *prev = op->getPrevNode())
-    before = getLatticeFor(op, prev);
-  else
-    before = getLatticeFor(op, op->getBlock());
+  const mlir::dataflow::AbstractDenseLattice *before =
+      getLatticeFor(point, getProgramPointBefore(op));
 
   /// Invoke the operation transfer function
   return visitOperationImpl(op, *before, after);
@@ -453,9 +454,10 @@ StackArraysAnalysisWrapper::analyseFunction(mlir::Operation *func) {
     return mlir::failure();
   }
 
-  LatticePoint point{func};
+  LatticePoint point{solver.getProgramPointAfter(func)};
   auto joinOperationLattice = [&](mlir::Operation *op) {
-    const LatticePoint *lattice = solver.lookupState<LatticePoint>(op);
+    const LatticePoint *lattice =
+        solver.lookupState<LatticePoint>(solver.getProgramPointAfter(op));
     // there will be no lattice for an unreachable block
     if (lattice)
       (void)point.join(*lattice);
diff --git a/mlir/include/mlir/Analysis/DataFlow/DeadCodeAnalysis.h b/mlir/include/mlir/Analysis/DataFlow/DeadCodeAnalysis.h
index 80c8b86c63678a..2250db823b5519 100644
--- a/mlir/include/mlir/Analysis/DataFlow/DeadCodeAnalysis.h
+++ b/mlir/include/mlir/Analysis/DataFlow/DeadCodeAnalysis.h
@@ -182,7 +182,7 @@ class DeadCodeAnalysis : public DataFlowAnalysis {
 
   /// Visit an operation with control-flow semantics and deduce which of its
   /// successors are live.
-  LogicalResult visit(ProgramPoint point) override;
+  LogicalResult visit(ProgramPoint *point) override;
 
 private:
   /// Find and mark symbol callables with potentially unknown callsites as
diff --git a/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h b/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h
index 7917f1e3ba6485..2e32bd1bc14617 100644
--- a/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h
+++ b/mlir/include/mlir/Analysis/DataFlow/DenseAnalysis.h
@@ -36,8 +36,7 @@ enum class CallControlFlowAction { EnterCallee, ExitCallee, ExternalCallee };
 //===----------------------------------------------------------------------===//
 
 /// This class represents a dense lattice. A dense lattice is attached to
-/// operations to represent the program state after their execution or to blocks
-/// to represent the program state at the beginning of the block. A dense
+/// program point to represent the program state at the program point.
 /// lattice is propagated through the IR by dense data-flow analysis.
 class AbstractDenseLattice : public AnalysisState {
 public:
@@ -59,15 +58,13 @@ class AbstractDenseLattice : public AnalysisState {
 //===----------------------------------------------------------------------===//
 
 /// Base class for dense forward data-flow analyses. Dense data-flow analysis
-/// attaches a lattice between the execution of operations and implements a
-/// transfer function from the lattice before each operation to the lattice
-/// after. The lattice contains information about the state of the program at
-/// that point.
+/// attaches a lattice to program points and implements a transfer function from
+/// the lattice before each operation to the lattice after. The lattice contains
+/// information about the state of the program at that program point.
 ///
-/// In this implementation, a lattice attached to an operation represents the
-/// state of the program after its execution, and a lattice attached to block
-/// represents the state of the program right before it starts executing its
-/// body.
+/// Visit a program point in forward dense data-flow analysis will invoke the
+/// transfer function of the operation preceding the program point iterator.
+/// Visit a program point at the begining of block will visit the block itself.
 class AbstractDenseForwardDataFlowAnalysis : public DataFlowAnalysis {
 public:
   using DataFlowAnalysis::DataFlowAnalysis;
@@ -76,13 +73,14 @@ class AbstractDenseForwardDataFlowAnalysis : public DataFlowAnalysis {
   /// may modify the program state; that is, every operation and block.
   LogicalResult initialize(Operation *top) override;
 
-  /// Visit a program point that modifies the state of the program. If this is a
-  /// block, then the state is propagated from control-flow predecessors or
-  /// callsites. If this is a call operation or region control-flow operation,
-  /// then the state after the execution of the operation is set by control-flow
-  /// or the callgraph. Otherwise, this function invokes the operation transfer
-  /// function.
-  LogicalResult visit(ProgramPoint point) override;
+  /// Visit a program point that modifies the state of the program. If the
+  /// program point is at the beginning of a block, then the state is propagated
+  /// from control-flow predecessors or callsites.  If the operation before
+  /// program point iterator is a call operation or region control-flow
+  /// operation, then the state after the execution of the operation is set by
+  /// control-flow or the callgraph. Otherwise, this function invokes the
+  /// operation transfer function before the program point iterator.
+  LogicalResult visit(ProgramPoint *point) override;
 
 protected:
   /// Propagate the dense lattice before the execution of an operation to the
@@ -91,15 +89,14 @@ class AbstractDenseForwardDataFlowAnalysis : public DataFlowAnalysis {
                                            const AbstractDenseLattice &before,
                                            AbstractDenseLattice *after) = 0;
 
-  /// Get the dense lattice after the execution of the given lattice anchor.
+  /// Get the dense lattice on the given lattice anchor.
   virtual AbstractDenseLattice *getLattice(LatticeAnchor anchor) = 0;
 
-  /// Get the dense lattice after the execution of the given program point and
-  /// add it as a dependency to a lattice anchor. That is, every time the
-  /// lattice after anchor is updated, the dependent program point must be
-  /// visited, and the newly triggered visit might update the lattice after
-  /// dependent.
-  const AbstractDenseLattice *getLatticeFor(ProgramPoint dependent,
+  /// Get the dense lattice on the given lattice anchor and add dependent as its
+  /// dependency. That is, every time the lattice after anchor is updated, the
+  /// dependent program point must be visited, and the newly triggered visit
+  /// might update the lattice on dependent.
+  const AbstractDenseLattice *getLatticeFor(ProgramPoint *dependent,
                                             LatticeAnchor anchor);
 
   /// Set the dense lattice at control flow entry point and propagate an update
@@ -153,7 +150,7 @@ class AbstractDenseForwardDataFlowAnalysis : public DataFlowAnalysis {
   /// Visit a program point within a region branch operation with predecessors
   /// in it. This can either be an entry block of one of the regions of the
   /// parent operation itself.
-  void visitRegionBranchOperation(ProgramPoint point,
+  void visitRegionBranchOperation(ProgramPoint *point,
                                   RegionBranchOpInterface branch,
                                   AbstractDenseLattice *after);
 
@@ -294,14 +291,12 @@ class DenseForwardDataFlowAnalysis
 //===----------------------------------------------------------------------===//
 
 /// Base class for dense backward dataflow analyses. Such analyses attach a
-/// lattice between the execution of operations and implement a transfer
-/// function from the lattice after the operation ot the lattice before it, thus
-/// propagating backward.
+/// lattice to program point and implement a transfer function from the lattice
+/// after the operation to the lattice before it, thus propagating backward.
 ///
-/// In this implementation, a lattice attached to an operation represents the
-/// state of the program before its execution, and a lattice attached to a block
-/// represents the state of the program before the end of the block, i.e., after
-/// its terminator.
+/// Visit a program point in dense backward data-flow analysis will invoke the
+/// transfer function of the operation following the program point iterator.
+/// Visit a program point at the end of block will visit the block itself.
 class AbstractDenseBackwardDataFlowAnalysis : public DataFlowAnalysis {
 public:
   /// Construct the analysis in the given solver. Takes a symbol table
@@ -321,9 +316,9 @@ class AbstractDenseBackwardDataFlowAnalysis : public DataFlowAnalysis {
   /// operations, the state is propagated using the transfer function
   /// (visitOperation).
   ///
-  /// Note: the transfer function is currently *not* invoked for operations with
-  /// region or call interface, but *is* invoked for block terminators.
-  LogicalResult visit(ProgramPoint point) override;
+  /// Note: the transfer function is currently *not* invoked before operations
+  /// with region or call interface, but *is* invoked before block terminators.
+  LogicalResult visit(ProgramPoint *point) override;
 
 protected:
   /// Propagate the dense lattice after the execution of an operation to the
@@ -337,10 +332,11 @@ class AbstractDenseBackwardDataFlowAnalysis : public DataFlowAnalysis {
   /// block.
   virtual AbstractDenseLattice *getLattice(LatticeAnchor anchor) = 0;
 
-  /// Get the dense lattice before the execution of the program point in
-  /// `anchor` and declare that the `dependent` program point must be updated
-  /// every time `point` is.
-  const AbstractDenseLattice *getLatticeFor(ProgramPoint dependent,
+  /// Get the dense lattice on the given lattice anchor and add dependent as its
+  /// dependency. That is, every time the lattice after anchor is updated, the
+  /// dependent program point must be visited, and the newly triggered visit
+  /// might update the lattice before dependent.
+  const AbstractDenseLattice *getLatticeFor(ProgramPoint *dependent,
                                             LatticeAnchor anchor);
 
   /// Set the dense lattice before at the control flow exit point and propagate
@@ -400,7 +396,7 @@ class AbstractDenseBackwardDataFlowAnalysis : public DataFlowAnalysis {
   /// (from which the state is propagated) in or after it. `regionNo` indicates
   /// the region that contains the successor, `nullopt` indicating the successor
   /// of the branch operation itself.
-  void visitRegionBranchOperation(ProgramPoint point,
+  void visitRegionBranchOperation(ProgramPoint *point,
                                   RegionBranchOpInterface branch,
                                   RegionBranchPoint branchPoint,
                                   AbstractDenseLattice *before);
diff --git a/mlir/include/mlir/Analysis/DataFlow/SparseAnalysis.h b/mlir/include/mlir/Analysis/DataFlow/SparseAnalysis.h
index 933790b4f2a6eb..dce7ab3bb5ee79 100644
--- a/mlir/include/mlir/Analysis/DataFlow/SparseAnalysis.h
+++ b/mlir/include/mlir/Analysis/DataFlow/SparseAnalysis.h
@@ -179,18 +179,22 @@ class Lattice : public AbstractSparseLattice {
 /// operands to the lattices of the results. This analysis will propagate
 /// lattices across control-flow edges and the callgraph using liveness
 /// information.
+///
+/// Visit a program point in sparse forward data-flow analysis will invoke the
+/// transfer function of the operation preceding the program point iterator.
+/// Visit a program point at the begining of block will visit the block itself.
 class AbstractSparseForwardDataFlowAnalysis : public DataFlowAnalysis {
 public:
   /// Initialize the analysis by visiting every owner of an SSA value: all
   /// operations and blocks.
   LogicalResult initialize(Operation *top) override;
 
-  /// Visit a program point. If this is a block and all control-flow
-  /// predecessors or callsites are known, then the arguments lattices are
-  /// propagated from them. If this is a call operation or an operation with
-  /// region control-flow, then its result lattices are set accordingly.
-  /// Otherwise, the operation transfer function is invoked.
-  LogicalResult visit(ProgramPoint point) override;
+  /// Visit a program point. If this is at beginning of block and all
+  /// control-flow predecessors or callsites are known, then the arguments
+  /// lattices are propagated from them. If this is after call operation or an
+  /// operation with region control-flow, then its result lattices are set
+  /// accordingly.  Otherwise, the operation transfer function is invoked.
+  LogicalResult visit(ProgramPoint *point) override;
 
 protected:
   explicit AbstractSparseForwardDataFlowAnalysis(DataFlowSolver &solver);
@@ -221,7 +225,7 @@ class AbstractSparseForwardDataFlowAnalysis : public DataFlowAnalysis {
 
   /// Get a read-only lattice element for a value and add it as a dependency to
   /// a program point.
-  const AbstractSparseLattice *getLatticeElementFor(ProgramPoint point,
+  const AbstractSparseLattice *getLatticeElementFor(ProgramPoint *point,
                                                     Value value);
 
   /// Set the given lattice element(s) at control flow entry point(s).
@@ -251,7 +255,8 @@ class AbstractSparseForwardDataFlowAnalysis : public DataFlowAnalysis {
   /// operation `branch`, which can either be the entry block of one of the
   /// regions or the parent operation itself, and set either the argument or
   /// parent result lattices.
-  void visitRegionSuccessors(ProgramPoint point, RegionBranchOpInterface branch,
+  void visitRegionSuccessors(ProgramPoint *point,
+                             RegionBranchOpInterface branch,
                              RegionBranchPoint successor,
                              ArrayRef<AbstractSparseLattice *> lattices);
 };
@@ -312,7 +317,7 @@ class SparseForwardDataFlowAnalysis
 
   /// Get the lattice element for a value and create a dependency on the
   /// provided program point.
-  const StateT *getLatticeElementFor(ProgramPoint point, Value value) {
+  const StateT *getLatticeElementFor(ProgramPoint *point, Value value) {
     return static_cast<const StateT *>(
         AbstractSparseForwardDataFlowAnalysis::getLatticeElementFor(point,
                                                                     value));
@@ -377,10 +382,10 @@ class AbstractSparseBackwardDataFlowAnalysis : public DataFlowAnalysis {
   /// under it.
   LogicalResult initialize(Operation *top) override;
 
-  /// Visit a program point. If this is a call operation or an operation with
+  /// Visit a program point. If it is after call operation or an operation with
   /// block or region control-flow, then operand lattices are set accordingly.
   /// Otherwise, invokes the operation transfer function (`visitOperationImpl`).
-  LogicalResult visit(ProgramPoint point) override;
+  LogicalResult visit(ProgramPoint *point) override;
 
 protected:
   explicit AbstractSparseBackwardDataFlowAnalysis(
@@ -445,14 +450,14 @@ class AbstractSparseBackwardDataFlowAnalysis : public DataFlowAnalysis {
   /// Get the lattice element for a value, and also set up
   /// dependencies so that the analysis on the given ProgramPoint is re-invoked
   /// if the value changes.
-  const AbstractSparseLattice *getLatticeElementFor(ProgramPoint point,
+  const AbstractSparseLattice *getLatticeElementFor(ProgramPoint *point,
                                                     Value value);
 
   /// Get the lattice elements for a range of values, and also set up
   /// dependencies so that the analysis on the given ProgramPoint is re-invoked
   /// if any of the values change.
   SmallVector<const AbstractSparseLattice *>
-  getLatticeElementsFor(ProgramPoint point, ValueRange values);
+  getLatticeElementsFor(ProgramPoint *point, ValueRange values);
 
   SymbolTableCollection &symbolTable;
 };
@@ -465,6 +470,10 @@ class AbstractSparseBackwardDataFlowAnalysis : public DataFlowAnalysis {
 /// backwards across the IR by implementing transfer functions for operations.
 ///
 /// `StateT` is expected to be a subclass of `AbstractSparseLattice`.
+///
+/// Visit a program point in sparse backward data-flow analysis will invoke the
+/// transfer function of the operation preceding the program point iterator.
+/// Visit a program point at the begining of block will visit the block itself.
 template <typename StateT>
 class SparseBackwardDataFlowAnalysis
     : public AbstractSparseBackwardDataFlowAnalysis {
diff --git a/mlir/include/mlir/Analysis/DataFlowFramework.h b/mlir/include/mlir/Analysis/DataFlowFramework.h
index b0450ecdbd99b8..969664dc7a4fe3 100644
--- a/mlir/include/mlir/Analysis/DataFlowFramework.h
+++ b/mlir/include/mlir/Analysis/DataFlowFramework.h
@@ -18,10 +18,12 @@
 
 #include "mlir/IR/Operation.h"
 #include "mlir/Support/StorageUniquer.h"
+#include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/Support/Compiler.h"
 #include "llvm/Support/TypeName.h"
 #include <queue>
+#include <tuple>
 
 namespace mlir {
 
@@ -51,23 +53,104 @@ class AnalysisState;
 
 /// Program point represents a specific location in the execution of a program.
 /// A sequence of program points can be combined into a control flow graph.
-struct ProgramPoint : public PointerUnion<Operation *, Block *> {
-  using ParentTy = PointerUnion<Operation *, Block *>;
-  /// Inherit constructors.
-  using ParentTy::PointerUnion;
-  /// Allow implicit conversion from the parent type.
-  ProgramPoint(ParentTy point = nullptr) : ParentTy(point) {}
-  /// Allow implicit conversions from operation wrappers.
-  /// TODO: For Windows only. Find a better solution.
-  template <typename OpT, typename = std::enable_if_t<
-                              std::is_convertible<OpT, Operation *>::value &&
-                              !std::is_same<OpT, Operation *>::value>>
-  ProgramPoint(OpT op) : ParentTy(op) {}
+struct ProgramPoint : public StorageUniquer::BaseStorage {
+  /// Creates a new program point at the given location.
+  ProgramPoint(Block *parentBlock, Block::iterator pp)
+      : block(parentBlock), point(pp) {}
+
+  /// Creates a new program point at the given operation.
+  ProgramPoint(Operation *op) : op(op) {}
+
+  /// The concrete key type used by the storage uniquer. This class is uniqued
+  /// by its contents.
+  using KeyTy = std::tuple<Block *, Block::iterator, Operation *>;
+
+  /// Create a empty program point.
+  ProgramPoint() {}
+
+  /// Create a new program point from the given program point.
+  ProgramPoint(const ProgramPoint &point) {
+    this->block ...
[truncated]

llvmbot (Member) commented Sep 28, 2024

@llvm/pr-subscribers-mlir-core

tblah (Contributor) left a comment:

Flang changes LGTM. Please wait for somebody else to approve MLIR changes.

cxy-1993 (Contributor, Author) commented Oct 2, 2024:

@Mogball Could you please help review this patch? Thanks!

cxy-1993 (Contributor, Author) commented Oct 8, 2024:

ping @Mogball @ftynse @stellaraccident

stellaraccident (Contributor) left a comment:

This looks ok to me but I'm hardly authoritative on this code. I think @Mogball is probably still the best reviewer and we should give a few more days before finding an alternate. I'm also going to patch this in to a downstream and make sure it doesn't break any tests.

cxy-1993 (Contributor, Author) commented:

ping @Mogball

cxy-1993 merged commit 4b3f251 into llvm:main on Oct 11, 2024. 14 checks passed.
yzhang93 added a commit to iree-org/iree that referenced this pull request Oct 15, 2024
- Also bump torch-mlir to
[edd1bbec46fc08318163c9dc0eb45decee63ec5b](https://github.com/llvm/torch-mlir/tree/edd1bbec46fc08318163c9dc0eb45decee63ec5b).
- No local patch is carried.
- Most of the changes are made due to
llvm/llvm-project#110344

---------

Signed-off-by: yzhang93 <zhyuhang88@gmail.com>
DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024
Jokeren pushed a commit to triton-lang/triton that referenced this pull request Nov 18, 2024
Fixes #5122.

The `ProgramPoint`
[here](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1087)
is created on the stack. Then its address is
[passed](https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/lib/Analysis/AxisInfo.cpp#L1088-L1089)
to the MLIR `SparseAnalysis` code, where it is [added as a
dependency](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L311)
and later
[dereferenced](https://github.com/llvm/llvm-project/blob/33ff9e43b4c5bdc3da31c6b11ad51d35a69bec5f/mlir/lib/Analysis/DataFlow/SparseAnalysis.cpp#L90).
By the time the `ProgramPoint` is dereferenced in
`AbstractSparseForwardDataFlowAnalysis::visit`,
`AxisInfoAnalysis::visitForOpInductionVar` will have finished and the
`ProgramPoint` stack variable will have been destroyed. This leads to a
segfault (which can be reproduced on the base rev with the lit test
added in this PR).

The code modified in this PR was originally added in #4927, in
conjunction with updating the `llvm-project` hash to `b5cc222d7429`.
However, as noted in llvm/llvm-project#110344
(the `llvm-project` PR that made the refactoring prompting the
`AxisInfo.cpp` change in #4927):

> For dense forward data-flow analysis and other analysis (except dense
backward data-flow analysis), the program point corresponding to the
original operation can be obtained by `getProgramPointAfter(op)`

As the `AxisInfoAnalysis` (in Triton) inherits from
`SparseForwardDataFlowAnalysis` (in MLIR), in this PR we follow the
above, which resolves the segfault (the `ProgramPoint` is now
stored in the instance-level state of the pass).

P.S. The lit test added in this PR is not exactly minimal. However, I
did my best to minimize it starting from the 400-line repro TTGIR in
#5122. Further minimization does not seem to expose the segfault.
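
A minimal self-contained C++ sketch of the lifetime bug and the fix described above (the `ProgramPoint`, `Solver`, and `Analysis` types here are stand-ins for illustration, not the actual MLIR or Triton APIs):

```cpp
#include <iostream>
#include <vector>

// Stand-in for a program point: a small value type whose *address* is
// what the solver records.
struct ProgramPoint {
  int id;
};

// Stand-in for the dataflow solver: it stores raw pointers to program
// points, so every pointee must outlive the whole analysis run.
struct Solver {
  std::vector<const ProgramPoint *> deps;
  void addDependency(const ProgramPoint *point) { deps.push_back(point); }
  void visitAll() const {
    for (const ProgramPoint *point : deps)
      std::cout << "visiting point " << point->id << '\n'; // late dereference
  }
};

// Buggy pattern: the point lives on this function's stack, its address
// escapes into the solver, and it dangles once the function returns.
void visitBuggy(Solver &solver) {
  ProgramPoint point{42};
  solver.addDependency(&point); // dangling pointer after return
}

// Fixed pattern (what the Triton PR does): keep the point in state that
// outlives the visitation, e.g. as a member of the analysis object.
struct Analysis {
  ProgramPoint point{42}; // instance-level storage
  void visitFixed(Solver &solver) { solver.addDependency(&point); }
};

int main() {
  Solver solver;
  Analysis analysis;
  analysis.visitFixed(solver);
  solver.visitAll(); // safe: `analysis` still owns the point here
}
```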
bertmaher pushed a commit to triton-lang/triton that referenced this pull request Nov 18, 2024
Jokeren added a commit to triton-lang/triton that referenced this pull request Dec 6, 2024
Use pytest's `tmp_path` in `test_irsource.py` (#5145)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[TEST] Make mixed matmul test deterministic (#5151)

This prevents surprises when some value may go above the tolerance
threshold

Fix `gtest_discover_tests` timeout argument (#5149)

`gtest_discover_tests` runs the built unittest executable to create a
distinct CMake target for every individual unittest in each executable.
However, this was previously noted to time out on macOS frequently
(because macOS scans newly built executables for viruses, or
something...) but the timeout argument was incorrectly specified.

[Triton] Remove upstream bug workaround (NFC) (#5152)

Upstream handling of splatted bools in `DenseElementsAttr` was fixed, so
the workaround can be removed when lowering `arith.constant` to
TritonGPU.

Co-authored-by: peterbell10 <peterbell10@openai.com>

[Triton] Generate local MLIR reproducers when possible (#5155)

By setting a reproducer path, the pass manager will dump a standard MLIR
reproducer before each pass manager invocation. This PR also enables
additional local crash reproducer generation (to the same path set
through the env var), which tries to narrow down the specific pass that
failed, if the pass pipeline fails at any point.

Revert "[AMD][Pipeliner] Improve clustering and add prefetch (#4881)" (#5157)

This reverts commit cc25374
due to perf regressions.

[IR] Add typing for tensor descriptor types (#5147)

Currently tensor descriptors are just typed as `!tt.ptr<i8>`, which
exposes the assumption that it's using a TMA descriptor. This changes it to
a custom type `!tt.tensordesc<tensor<...>>` which is lowered to a
pointer type in the LLVM IR.

I also add two new IR Ops which are used to cast between pointers and
tensordesc objects.
```mlir
tt.reinterpret_tensor_descriptor %ptr : !tt.ptr<i8> to !tt.tensordesc<...>
triton_nvidia_gpu.tensor_desc_to_tma_ptr %desc : !tt.tensordesc<...> -> !tt.ptr<i8>
```

Really both of these should be nvidia-specific but the first is exposed
in the triton IR to keep support for the by-value TMA descriptor API
around while we figure out if it's possible to update to the new style.

Load backend dialects in `IRSource` to make sure `parse_mlir_module` works for third_party backends (#5146)

The changes from #4924 do not
take into account the situation where the `ttgir` level contains dialects
defined in third_party plugins (at least that's my understanding).

I'd also like to point out that the second use of `parse_mlir_module`
function (via `parse` function call) happens after the dialects are
loaded for the backend as well, which is why I thought my changes make
sense.

I hope this implementation will suit Triton, or maybe one can suggest
other options.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[BACKEND][NVIDIA] Add DotOp Hoisting Pass for WGMMA and Add Lowering for SMEM-to-MMAv3 DotOp Copy (#5003)

Hopper has two kinds of WGMMAs, "SS" (both operands in shmem) and "RS"
(LHS operand A in registers).
In cases where we apply elementwise operations on A before WGMMA, Triton
previously would copy A from global memory (GMEM) into registers (RF),
perform the elementwise ops, and then copy to shared memory (SMEM) to
perform SS WGMMA.

This PR adds an optimization for the case above to use RS GEMM. This
requires the following changes:

- In TritonGPU OptimizeDotOperands pass, add optimizations to change SS
GEMM into RS GEMM.
- Add TritonGPU -> LLVM lowering for copying from SMEM to RF in MMA v3
dotOperand layout.

NOTE: This may not see perf gain, and may even see perf loss, for
certain shapes (e.g. small-K), and additional optimizations are in a
separate [PR](openxla#19) (still more
optimizations are WIP). Please advise on the merging strategy.

Restore the CentOS 7 build (#5158)

We likely need it for the PyTorch 2.6 release

[BACKEND] Add folder for `addptr(ptr, 0) -> ptr` (#5166)

I noticed this rather obvious pattern was missing. It might come up for
example if you have an expression like:
```python
ptrs = ptr + y_stride * tl.arange(0, YBLOCK)[:, None]
```
and the `YBLOCK` is set to 1 during autotuning.
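
For reference, a folder for this kind of identity typically looks like the sketch below in MLIR. This is a hedged illustration only: it assumes an `AddPtrOp` with `getPtr()`/`getOffset()` accessors and a declared fold hook, and is not necessarily Triton's actual implementation.

```cpp
#include "mlir/IR/Matchers.h" // mlir::matchPattern, mlir::m_Zero

// Hedged sketch, not Triton's actual code: fold addptr(ptr, 0) -> ptr.
mlir::OpFoldResult AddPtrOp::fold(FoldAdaptor /*adaptor*/) {
  // With a known zero offset the op is the identity on its pointer
  // operand, so fold to that operand directly.
  if (mlir::matchPattern(getOffset(), mlir::m_Zero()))
    return getPtr();
  return {}; // not foldable
}
```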

[TritonGPU] Fix incorrect mask operand used in for loop pipeliner (#5161)

When the OOB values for a `tt.load` are non-zero, the for loop pipeliner
needs to generate an `arith.select` to mask the loaded values with the
default OOB value. However, if the load memory requires a layout change,
the wrong mask operand was being passed to the `arith.select`, causing a
shape mismatch. The fix is to just use the mask operand of the
original `tt.load` op.

Fixes #4739
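
The masking itself boils down to a single `arith.select` keyed on the load's own mask. A hedged C++ sketch of the pattern (the function and value names are illustrative, not the pipeliner's actual code):

```cpp
#include "mlir/Dialect/Arith/IR/Arith.h"
#include "mlir/IR/Builders.h"

// Illustrative only: select the loaded value where the mask is set and
// fall back to the default OOB value elsewhere.
mlir::Value maskLoadedValues(mlir::OpBuilder &builder, mlir::Location loc,
                             mlir::Value loaded, mlir::Value origMask,
                             mlir::Value oobValue) {
  // Use the *original* load's mask so shapes line up even when the load
  // result later goes through a layout conversion.
  return builder.create<mlir::arith::SelectOp>(loc, origMask, loaded,
                                               oobValue);
}
```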

[BACKEND] Cleanup redundant broadcast combine pattern (#5167)

Summary of changes:
- Remove `broadcast(cst) -> cst` from the triton-combine pass since it's
redundant with the existing folder.
- Reorder the triton-combine pass to come after the canonicalize pass,
to simplify pattern matching
- Cleanup patterns in triton-reorder-broadcast that called
`Op::canonicalize` in favor of `Op::getCanonicalizationPatterns`.

[AMD] NFC: Drop duplicated moveUpTranspose (#5168)

It was duplicated due to resolving merge conflicts.

[Triton] Default diagnostic handler only filters for errors (#5173)

A regular SourceMgrDiagnosticHandler is causing all remarks to be
emitted even if the user doesn't ask for it!

[AMD] Refactor instruction scheduling hints (#5144)

- Renamed instruction scheduling variants
- Enabled `buffer-ops` for `local-prefetch`
- Added documentation regarding current variants

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

[AMD] Enable mixed precision matmul test (#5177)

This commit enables the mixed precision matmul test
for the AMD backend. For FP8 E4M3, we test
`fp8e4m3fnuz` given that it's natively supported on
the MI300 series.

Update to llvm/llvm-project@bd9145c8c213 (#5180)

This pulls in llvm/llvm-project@bd9145c8c213
to enable ASan on AMD backend.

[AMD] Implement RepOrder for AMD MMA layouts (#5126)

Implement RepOrder methods for MFMA and WMMA layouts. Both layouts have
a row-major rep layout. Also, the
isTranspose flag in the MFMA layout does not affect RepOrder, meaning
RepOrder is row major in both cases.

Co-authored-by: Ognjen Plavsic <ognjen.plavsic@luxoft.com>

[BACKEND] Fix ProgramPoint passing in AxisInfoAnalysis (#5181)

[INTERPRETER] Fix argument passing for internal parameters in function declarations (#5169)

[NFC] Use references instead of copies in a few places (#5118)

Apply fixes suggested by Coverity static analysis.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[BACKEND] Add missing precondition in optimize acc init (#5184)

We need a scalar select to be able to do this optimization.

[BACKEND] Fix accumulator init optimization for integer matmuls (#5192)

[AMD][Pipeliner] Reland "Improve clustering and add prefetch" (#5175)

This unreverts commit 38c6284
to reland #4881
with the following fixes:

* Still keep `scheduleGlobalLoadLocalStore` as is--it turns out
not to be ready to be replaced yet; further iteration on it is needed.
* Turn on `TRITON_HIP_STREAM_PREFETCH` if the instruction
  scheduling variant is `local-prefetch`, given it's needed there.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

[AMD] Define an extract slice operation (#4804)

This commit introduces an extract_slice operation for the AMD backend
to enable extracting a slice of a tensor in registers without data
exchange.
It enables breaking down large tiles of tensors into smaller ones
for better instruction interleaving and scheduling.

This can be useful for hiding global memory latency when a global
load/store can be efficiently split into several loads/stores to be
overlapped with compute, e.g. for attention.

[BACKEND] Fix getElemsPerThread for mmav3 dot operand (#5189)

In the mmav3 case the number of elements per thread should be independent
of the element type; we should only consider kWidth.
TODO: it should also be true for MMAv2 but the logic is a bit more
complicated.

Also enable larger block_m in mixed mode tests to exercise MMAv3 case

[INTERPRETER][NFC] Rename `tensor_shape` -> `block_shape` in interpreter (#5195)

`tensor_shape` is a confusing name and doesn't match block pointer's
semantic.
`block_shape` is much clearer.

[LAYOUTS] Implement LL conversion for DotOperand(Hopper) (#5193)

We also rewrite the way we implement DotOperand(Ampere) and mma Ampere
to promote code reuse. I also started using what I believe is a rather
compact pattern to write these things, where you first call `identityND`
with the `repOrder`, which gives you an LL with the dims in the correct
order, and then you construct the final layout by specifying the tiles
by multiplying `identity1D` maps. Using this allowed me to heavily
simplify the handling of the `warps` of `DotOperand`, which used to be a
tad messy.

Update README.md to remove triton conference (#5198)

It happened two months ago

[PROTON] Add `proton.state` utility (#5110)

`state` is different from `scope` in several ways:

1. State is not recursive; each operation can have only a single state.
The innermost state will overwrite the outermost state.
2. A state is a suffix, meaning that the original call path will append
the state above the name of each kernel.
3. State is compatible with both Python and shadow contexts.

[CI] remove unused inductor workflows (#5073)

Torch inductor tests were completely offloaded to Meta a few
months ago. They are currently disabled on GitHub.

Signed-off-by: Sébastien Han <seb@redhat.com>

[INTERPRETER] Fix lower bound check for block pointers (#5201)

We forgot to check `offset >= 0` previously.

Now it should match the semantics in the GPU backend:

https://github.com/triton-lang/triton/blob/7bce3613755e26953518962d02315dfd343dc50c/lib/Dialect/Triton/Transforms/RewriteTensorPointer.cpp#L136

[IR] Remove memdesc from `tt.trans` and implements `ttg.memdesc_trans` (#5194)

[LLs] [BE] Simplify identityND (#5199)

The auxiliary function `identityND` used to take an `order` parameter,
that comes from triton, and a set of dimensions. Now, the order in
triton is defined wrt. `dim0..dim<rank-1>`, so the dimension arg was
redundant. This was quite confusing.

We see that in all the uses of `identityND`, we would pass the canonical
dimensions, other than in one that we simply remove as it was not
necessary.

We remove the dims arg and simply return a layout with output dims
`dim0..dim<rank-1>`.

[MXFP] Fix packing for mxfp4 type (#5197)

When packing, element 0 should be in the lower bits; until this PR
it was in the higher bits.

[LAYOUTS] Unify the implementation of getShapePerCTA (#5183)

We unify it and simplify its API (it was taking an unused `shape`
parameter). While doing this, we found that the previous implementation
was incorrect at least for `AMDWmmaEncodingAttr`, as this layout was
using the shape parameter.

Interestingly enough the doc in the header file for this function noted
that the function is indeed independent of the tensor shape, even though
the function does take a shape as an input!

https://github.com/triton-lang/triton/blob/0bd30a2f3192204c5a50d5ffde27ad8493f6c026/include/triton/Dialect/TritonGPU/IR/Dialect.h#L113-L114

[BACKEND] Use the LL API to replace the using of legacy layout attribute API. (#5196)

The util function `getDistributedLayoutStr` uses the `DistributedLayout`
attribute interface, which is not flexible for third-party extensions.
Use the `getInDimSize` of the `LinearLayout`, which is better since the
legacy layout has been converted to the `LinearLayout`.

There is no new test case since it is only a change in API usage.

[CI] Fix ccache cache restoration to improve build times (#5202)

This improves a warm-cache macOS build from ~25 mins to 2 mins.

[CI] Fix `du` failling if cache restore fails (#5206)

Follow up to #5202

It's currently failing with the error
```
du: /Users/runner/.triton/**: No such file or directory
Error: Process completed with exit code 1.
```
which happens because even though the `.triton` directory exists, it is
empty. This instead uses du on `.triton` with a depth of 1.

[BACKEND][LAYOUT] Use LL for AMDMfma related layout conversions (#5210)

[BUILD] Add option to limit number of parallel link jobs (#5212)

[CI] Fix cache not saving (#5213)

1. [CI] Fix cache not saving

    Re-using the output of the cache restore step was recommended by the
    `actions/cache` docs, but it doesn't work here because we actually
    start from a clean cache when we run save, so there is no output
    available to read.

    This is one of the annoyances of testing in the PR while main is a
    different environment.
2. Bump macOS timeout

[LAYOUTS] Implement IR support for LinearLayouts (#5170)

We also exercise this in scale_dot, where we enable support for warps of
arbitrary shape (before we just allowed `[num_warps, 1]`).

With this infra in place, it should be rather easy to move from the
legacy layouts to using LLs to represent all of our layouts.

Something I'm concerned about is the amount of recomputation that
happens when calling methods like `getSizePerThread` and the like, where
we keep recomputing the result. There might be an optimisation
opportunity here where we cache the result of all these functions.

We choose the IR representation of an LL via its canonical form + a
`repOrder` for several reasons:
- It's generally more compact
- It's easier to CSE, so it's easier to see when two layouts are in fact
  the same.
- A technical reason: the `toLinearLayout` function returns a tensor
  with dimensions `dim0, ..., dim<rank-1>`, in other words, it "forgets"
  the repetition order. Without the repetition order, we cannot recover
  the tile size of the argument. In particular, we cannot recover
  `getSizePerThread`. There is an argument to be made about whether
  `getSizePerThread` is useful on its own, or whether it is
  `getElemsPerThread` the real useful abstraction here, but for now, we
  keep both for BC.

[CI] Run tests when CI is manually triggered (#5216)

Currently you can manually call a workflow dispatch, but it won't
actually run the tests because the variable enable_integration isn't
set.

[PROTON] Introduce the Proton dialect as a third-party plugin for intra-kernel perf tooling (#5119)

This PR introduces the `Proton Dialect` to enable intra kernel profiling
and tooling for Triton. As a third-party dialect, it serves as the
building blocks to create 3rd-party perf tools (e.g., profilers,
analysis, modeling) for Triton compiler developers in a compiler-centric
way, such as an intra-kernel latency profiler to understand software
pipelining, warp specialization, and CTA fine-grained orchestration
(e.g., cuda core, tensor core, TMA). Future developments would integrate
this dialect with the existing Proton backend profiling infrastructure
to make it a powerful and general perf tool utility. As a first step,
this PR adds some basic boilerplate code and mechanics, and the
`proton.record` op for the `Proton Dialect`.

---------

Co-authored-by: Yuanwei Fang <fywkevin@fb.com>
Co-authored-by: Keren Zhou <kerenzhou@openai.com>

[DRAFT] Completely remove `MemDesc` from the Triton dialect (#5208)

After this PR, `MemDesc` will be a type only in the TritonGPU dialect,
as will the `TensorOrMemDesc` interface.

[AMD] Prevent wrong reordering of scf operations (#5203)

The pass was reordering scf.if operations without checking the extra
dependencies coming from the region.
For now just prevent this case although this part of the code might
still be fragile.

[AMD] Cover default case in MfmaGroup (#5218)

If you build using the `CMakeLists.txt` and not `setup.py` and you build
in `Release` then you get

```
/__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp: In function ‘std::pair<mlir::Type, mlir::Type> mlir::TypesFromMfmaId(MLIRContext*, MfmaTypeId)’:
Warning: /__w/triton/triton/third_party/amd/lib/TritonAMDGPUTransforms/MfmaGroup.cpp:240:1: warning: control reaches end of non-void function [-Wreturn-type]
```
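
The usual fix for this warning is to make the fall-through after an exhaustive switch explicit. A minimal sketch with illustrative names (LLVM-style code would typically use `llvm_unreachable` rather than `std::abort`):

```cpp
#include <cstdlib>

enum class MfmaTypeId { Fp32, Fp16, Int8 }; // illustrative enum

int widthFor(MfmaTypeId id) {
  switch (id) {
  case MfmaTypeId::Fp32: return 32;
  case MfmaTypeId::Fp16: return 16;
  case MfmaTypeId::Int8: return 8;
  }
  // Explicitly cover the "default" path so control never reaches the end
  // of a non-void function, even if the enum later gains new values.
  std::abort();
}
```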

Allow Layouts to propagate to local_load (#5219)

While working on some higher dimension tensor kernels, I noticed poor
performance due to the fact that layouts wouldn't propagate to local
loads. Since we do allow layout folding with local store and local
alloc, this seems like a bit of an oversight.

The change gives a 40% speed improvement on certain kernels for NVIDIA
GPUs.

This also removes asserts in lowering for higher dimensional kernels. As
far as I can tell, those restrictions aren't required in practice.

- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
[rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] I have added tests.
- [x] The `lit` tests I have added follow these [best
practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices)

[BACKEND] Fix transpose optimization missed during refactor (#5226)

[AMD] Use warp shuffle for fp8 MFMA to dot operand layout conversion (#5139)

Adding a shortcut case for fp8 MFMA to dot operand layout conversion
that avoids using shared memory, to speed up FP8 attention kernels.

[LAYOUTS] [BE] Simplify Ampere/Hopper paths introduced in #5189 (#5200)

We simplify the implementation of `getElemsPerThread` and strengthen the
preconditions of `getRepForOperand`.

More generally, we should try to minimise the calls to `isAmpere` and
`isHopper` throughout the codebase. I'll do a pass fixing many of these
once we land LLs for `ldmatrix` and Hopper.

[BACKEND] Use LL to simplify redundant elements check and fix related issues (#5225)

Make TMA tests compatible with older CUDA toolchains (#5221)

TMA fences require CUDA toolchain 12.3 or greater, but current gating
does not check the CUDA toolchain version. This causes
`test_experimental_tma.py` to fail when run with older CUDA toolchains.

Before the fix, with cuda-12.0:
```
55 failed, 9 passed in 18.11s
```

With cuda-12.4:
```
64 passed in 11.99s
```

After the fix, with cuda-12.0:
```
9 passed, 55 skipped in 4.26s
```

With cuda-12.4:
```
64 passed in 11.96s
```

[CMake] Add C as project language (#5217)

If you build with `-DTRITON_BUILD_UT=OFF` on Mac you will get something
like

```
-- Looking for histedit.h
CMake Error at /opt/homebrew/Cellar/cmake/3.30.5/share/cmake/Modules/CheckIncludeFile.cmake:90 (try_compile):
  Unknown extension ".c" for file
-- Looking for histedit.h - not found

    /Users/runner/work/triton/triton/triton-build/CMakeFiles/CMakeScratch/TryCompile-QA06d6/CheckIncludeFile.c

  try_compile() works only for enabled languages.  Currently these are:

    CXX

  See project() command to enable other languages.
Call Stack (most recent call first):
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/FindLibEdit.cmake:28 (check_include_file)
  llvm-bd9145c8-macos-arm64/lib/cmake/llvm/LLVMConfig.cmake:177 (find_package)
  llvm-bd9145c8-macos-arm64/lib/cmake/mlir/MLIRConfig.cmake:10 (find_package)
```

because `C` isn't an enabled project language.

[AMD] Fix slow compilation due to inlining print calls (#5153)

This PR disables inlining of print-related functions, which speeds up
compilation of test_scan_layouts dramatically.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

[AMD] Re-enable overflow test in test_reduce_layouts (#5233)

#5153 fixed
the issue, but we missed enabling one of the disabled
cases.

[BACKEND] Fix a missed transpose optimization during refactor (#5236)

Revert "Allow Layouts to propogate to local_load" (#5237)

This is causing some performance regression. I'll investigate and reland
it.
Reverts #5219

Revert "[AMD] Use warp shuffle for MFMA to Dot operand layout conversion (FP8)" (#5240)

It is causing performance regression, revert until it can be
investigated
Reverts #5139

Updated README.md to show the steps for overriding a kernel's IR (#5239)

Ensure device context before launching kernel (#3731)

If a kernel is launched on a thread which has not initialized a CUDA
context (as can happen in the linked issue), it will throw an error. A
simple fix is to call `cudaFree(0)` to establish a device context.

Fixes #3729
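
A minimal sketch of the idiom using the CUDA runtime API (error handling reduced to a single check):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // cudaFree(nullptr) is a documented no-op that forces lazy CUDA context
  // initialization on the calling thread, so later kernel launches from
  // this thread see a valid device context.
  cudaError_t err = cudaFree(nullptr);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "context init failed: %s\n",
                 cudaGetErrorString(err));
    return 1;
  }
  // ... safe to launch kernels from this thread now ...
  return 0;
}
```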

[LLVM] Update to llvm-project@86b69c3 (#5242)

This includes llvm/llvm-project#115627

[BUILD] Add a stable symlink to llvm in the triton cache (#5234)

Currently the llvm path changes every time the pin updates which makes
it annoying to use the included tools. e.g. I use the tablegen language
server, but currently need to update my editor config every time the
llvm pin changes.

This adds a stable symlink which for me is
`~/.triton/llvm/llvm-macos-x64`. This will always point to the most
recent version of llvm used to build triton.

As a bonus this also refactors the symlink update code which was
copy-pasted a few times.

[PIPELINER] tweak pipeline heuristic (#5247)

Don't pipeline the dot accumulator in the default heuristic.
Finer-grained control will allow the user to decide.

Allow Layouts to propagate to local_load (#5219) (#5249)

recommit of #5219

Co-authored-by: Matthew Brookhart <matthewbrookhart@gmail.com>

Windows related changes in `CMakeLists.txt` (#5186)

Upstreaming some of our Windows related changes assuming that there is
interest in this
#5094 (comment)
and hoping that it will not make it much more difficult to support this
CMake file.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[AMD] NFC: Unified header guard in third_party/amd (#5244)

This commit unified the names of header guards in third_party/amd.

[AMD] NFC: Drop v2 Suffix from Stream Pipeline (#5251)

Since StreamPipelineV2 has been the default for a while, this
commit promotes StreamPipelineV2 to the general
StreamPipeline by removing the 'v2' suffix.

[NFC] Cleanup references to unused index dialect (#5257)

Also cleans up some includes clang thinks are unused.

[BUILD] Ensure parent directory exists before creating symlinks (#5258)

Fixes #5256

[BACKEND] Fold transpose(splat_const) (#5259)

Add folding for a transpose of a splat constant.

---------

Co-authored-by: peterbell10 <peterbell10@live.co.uk>

[LAYOUTS] Use LLs for Hopper whenever we wouldn't use ldmatrix (#5235)

The legacy path has some bugs for cases like `kWidth=1`. I'm starting to
port Hopper to use LLs to try to isolate them.

[AMD] NFC: Cleanup namespace hierarchy (#5246)

Refactored the namespace hierarchy by squeezing separate
namespace hierarchies together.

[AMD] Fix unhandled profile event in RoctracerProfiler (#5252)

Fixes proton unit tests when upgrading to ROCm 6.2 by
adding missing event handlers.

The magic number is replaced with the corresponding enum
value, which was added by upgrading the HIP headers in
#5077.

Fix Blocked FMA path in isLayoutOK (#5260)

Fixes
https://github.com/triton-lang/triton/pull/5235/files/de18e21ddf5bf03f17f779fef032d53ea87a53a0#r1858955613

[Tutorial] Remove incorrect caching from softmax tutorial (#5162)

The fused softmax implementation in the tutorial precompiles the kernel
to query the register usage of the kernel, based on the parameters used
to specialize the kernel. On top of this, it implements a simple caching
system for this step based on just the block size.

As noted in #4739, this
caching is incorrect, because it's also not keyed on the `num_stages`
constexpr argument or the shapes of the tensors. Since triton already
has its own JIT compilation cache, and this caching bit is not really
relevant to the tutorial, just remove it to get rid of the footgun.

[INSTRUMENTATION] Generalize code in `test_gpuhello.py` (#5263)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

Create an aggregate `check-triton-unit` target (#5150)

This adds a CMake target `check-triton-unit` that builds and runs all
Triton unittests written in gtest. This makes it more convenient to
rebuild and run all unittests at once with finer granularity (instead of
`ninja; ctest`).

[NFC] Add `test_bessel` into `test_libdevice.py` (#5261)

Just a port of one of our tests. I didn't find any similar ones in
Triton itself, so this should increase the test coverage.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[NFC] Add functional regression test for cummax with bool type (#5264)

This kernel was obtained using PyTorch inductor some time ago.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[AMD] NFC: Unified comment style (#5248)

Script:

egrep -nrI --exclude-dir "backend" "^\s*/\*+" third_party/amd

[AMD] Upgrade AMD CI docker image (#5230)

This commit updates the CI to use a new docker image that
contains ROCm 6.2.2 with ASan support and PyTorch 2.5.1.

This also switches to ubuntu's default clang toolchain instead
of using the one which comes with ROCm.

Implement `dot_scaled(mmav3)` (#5269)

As per title

[BUILD] Some CMake cleanup/modernisation (#5271)

- Prefer `find_package` over ad-hoc variable passing
- Prefer `target_` api vs global `_directories` apis
- Use `target_link_options` to specify link options instead of
`target_link_libraries`

Closes #5270

[DIALECT] Rename `triton_gpu` to `ttg` and `triton_nvidia_gpu` to `ttng` (#5266)

It may cause changes for downstream tasks, but we think it's beneficial
to shorten dialect names and make them consistent. That is, we are using
`tt` to represent the `triton` dialect.

[BACKEND] Fix inline asm bug for multiple packed <32bit output (#5273)

Resolves #5272

- Fixes logic for walking result struct from LLVM InlineAsm in case of
multiple sub-32bit results
- Adds lit test

[NVIDIA][Backend] Add CoalesceAsyncCopy Pass for in-DotOpEnc Upcasting (#5222)

This is a follow-up to the dotOp hoisting optimization for WGMMA
(MMAv3). See
#5003 (comment)

In short, when upcasting operand A in registers prior to WGMMA and when
pipelining is enabled, `AsyncCopyGlobalToLocal`'s src gmem blocked
encoding will have `sizePerThread` > smem view's `vec` (along the
contiguous dimension). This results in multiple `cp.async`
instructions being generated for a contiguous global data segment,
and hence uncoalesced loads. This was previously confirmed in ncu.
See above comment for an example.

I've added a generalized fix in a new pass after the pipeliner. I've
reused the logic in the LLVM lowering for `AsyncCopyGlobalToLocal` to
calculate the max contiguous copy size. I compare that to the blockEnc's
`sizePerThread` along the inner (contiguous) dimension. If the former is
less than the latter, I set the latter to the former.
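
The clamping described above reduces to a `min`; a tiny sketch with illustrative names:

```cpp
#include <algorithm>

// Illustrative only: shrink a thread's contiguous ownership so that no
// single cp.async copy spans more elements than it can load coalesced.
unsigned clampSizePerThread(unsigned sizePerThreadInner,
                            unsigned maxContiguousCopySize) {
  return std::min(sizePerThreadInner, maxContiguousCopySize);
}
```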

When A is k-major, I can verify a small perf improvement and that ncu no
longer reports uncoalesced loads.
When A is m-major, this pass is a no-op because `copy size ==
sizePerThread == 16`

ptal, thanks @ThomasRaoux

[Triton] Add `tl.gather` with a naive codegen implementation (#5262)

This PR adds a `tl.gather` builtin that implements a local gather along
a single axis, with semantics matching `torch.gather`. `tl.gather`
generates a `tt.gather` op, which is piped through the compiler mostly
untouched at the moment, since the codegen is very naive.

The `tt.gather` is implemented by writing the source tensor into shared
memory and then performing a gather out of shared memory, thus it
requires scratch space to be allocated. In a follow-up, I will implement
an optimized layout rule for the op that ensures the gather axis fits
into a single warp, allowing the gather to be implemented using warp
shuffles.

There are other avenues for optimization as well: `tt.gather(tt.load)`
where the load only has one use can be lowered into a DMA from global
memory to shared, and then gather directly from shared.

[NVIDIA][Launcher] Ensure device context is valid before calling getPointer (#5276)

[CMAKE] Add `triton-tensor-layout` dep to lit tests (#5275)

Noticed this when `triton_gpu` was renamed to `ttg`.

[BACKEND] Fix and document logic for creating warp shapes in MMAv3 (#5277)

[NFC] Remove dead code for python<3.8 (#5280)

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[NFC] Remove `CMAKE_VERBOSE_MAKEFILE` var (#5282)

Warning:
```bash
  CMake Warning:
    Manually-specified variables were not used by the project:

      CMAKE_VERBOSE_MAKEFILE
```

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[AMD] Use Linear Layout conversions for AMDWmma (#5255)

Enable LL conversions for WMMA as well as for MFMA layouts.

See also: #5210

Signed-off-by: Ilya Veselov <iveselov.nn@gmail.com>

Add tests for 3D local_load local_alloc and relax asserts (#5285)

Also switch 3D dot_operand cases to use the linear layout path. This may
be suboptimal in some cases, but it solves the functionality problems,
which is more important. There is ongoing work from Mario that should
get the code quality to be good again soon.

[Build] Don't require Development.Embed python component (#5287)

This component is missing from the wheel building image, so we need to
make the requirement more specific.

https://github.com/triton-lang/triton/actions/runs/12081047335/job/33689420657#step:6:332

[NFC] Remove unused forOp argument from `setStageCluster` (#5288)


[PROTON] Don't use designated initializers in `CuptiPCSampling.cpp` as they require C++20 (#5291)

Hi @Jokeren,

these changes relate to your PR:
#4674, so I would like to ask
if this was done on purpose (considering that the project declares
support for the C++17 standard)?

I discovered this while trying to compile proton using MSVC. It looked
like this:
`\CuptiPCSampling.cpp(18): error C7555: use of designated initializers
requires at least '/std:c++20'`.

This might also be a good opportunity to ask you about your plans to
transition Triton to C++20.

---------

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

Add back missing check

Replace triton_gpu with ttg

Define `pytest-forked` and `pytest-xdist` as `tests` target deps (#5292)

This way, the dependencies needed for testing are localized in one place
- `setup.py` (instead of several), which makes maintenance easier.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[BUILD] Skip installing test related python packages (#5294)

#5292 failed because of the macOS
build. Since we don't run any tests on macOS anyway, it's fine to simply
skip them.

[TESTING] Add golden sample test for pipelining matmul with descriptors (#5289)


Specify in `setup.py` that `setuptools>=40.8.0` is a required dependency (#5293)

Closes #5090

vancoykendall is right that the dependency is used not only during the
build. However, for now I added it to `setup.py`, since the migration of
dependencies to `pyproject.toml` has not yet occurred.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[TOOLS] Improve `generate-test-checks.py` (#5300)

- Format the doc string using the `reStructuredText` format.
- Lift the example instructions from the `.mlir.in` file to the
docstring. Previously we matched the `module` keyword twice and
encountered errors such as `assert len(output_segments) ==
len(source_segments),`. It's also fine to update the regex to something
like `\bmodule` to solve the problem, but I think lifting it from the
input file is just simpler.

[NFC][DIALECT] Remove dependency on `mlir::tensor::TensorDialect` (#5303)

[IR] Improve `ttg.memdesc` (#5296)

- Add an `allocShape` field to denote the shape of a memory descriptor
when it's allocated. The value will be propagated to all its descendants
created through `subview` ops.
- Make `encoding` and `memorySpace` fields required instead of optional.
- Implement the `getAlias` function for `#ttg.shared_memory` to shorten
its length in `.mlir` files

Update

Update

[Pipeliner] Handle masking for atomic_rmw (#5231)

This commit adds support for atomic_rmw in the function
predicateOp, which masks operations during scheduling.

[TESTS] Forward fix for CI break (#5323)

PR #5231 was authored before the `triton_gpu` -> `ttg` rename and CI is
currently broken.

Search for `ptxas` only for cuda backend in `supports_tma` function (#5314)

For other backends, `ptxas` may not be installed.

Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>

[LLVM] Update to llvm/llvm-project@1f20eee6dc36 (#5308)

This pulls in the AMDGPU backend support for the
gfx950 target.

We need to fix the rewrites in `Combine.td` given that
llvm/llvm-project#112700 adds
a new attribute for denorm mode for `arith.addf`.

---------

Co-authored-by: Lei Zhang <antiagainst@gmail.com>

[AMD][BACKEND] Add gfx950 target definitions. (#5281)

Enable new arch target since backend support has been added.

[AMD] Adjust local_store and global_load ordering (#5254)

This commit adjusts local store and global load
ordering to let local store be ahead of global
load when they are not in the same stage. It
should help GEMM kernel performance.

Re-align main and llvm-head (#5334)

We have a couple of PRs that landed in the `llvm-head` branch that are
not in `main`.

Merging those into `main` to prevent further divergence between
branches.

---------

Co-authored-by: Won-Kyu Park <wkpark@gmail.com>
Co-authored-by: Lei Zhang <antiagainst@gmail.com>

[PIPELINER] Cleanup of LoopScheduling.cpp, introduction of AssignLatencies (#5176)

This change breaks down LoopScheduling into two sub-passes: latency
assignment and actual scheduling.
Latency assignment is a transformation that analyzes the loop and, based
on the requested number of stages, assigns "latencies" to the ops that
are going to be converted to async ops by the pipeliner. Latencies are
expressed in terms of the number of iterations of the loop and can be
thought of as per-operation num_stages.
The scheduling transformation takes these latencies and builds a
pipeliner schedule based on them. The process of building a schedule was
slightly rewritten to simplify the code and clean up the logic that was
no longer needed after recent refactoring.
Breaking down the scheduling into latency assignment and proper
scheduling has a number of purposes:
1. The code became more modular, with cleaner interfaces that help with
maintenance
2. Both parts can be tested in isolation; I have added lit tests for
both pieces. We can finally test our pipeliner infrastructure in
manageable chunks
3. It opens up the opportunity to expose per-op "latencies" to the
frontend, enabling user-defined schedules right from the language level

The next step in the cleanup process is to clearly separate the lowering
and pipelining phases.

Labels
flang:fir-hlfir flang Flang issues not falling into any other category mlir:arith mlir:core MLIR Core Infrastructure mlir