
[VPlan] Implement interleaving as VPlan-to-VPlan transform. #95842

Merged: 33 commits, Sep 21, 2024

Conversation

@fhahn (Contributor) commented Jun 17, 2024

This patch implements explicit interleaving as a VPlan transform. In follow-up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in each recipe). It also allows for more general optimizations (e.g. avoiding code generation for recipes that are uniform across parts).

It also unifies the logic dealing with unrolled parts in a single place, rather than spreading it across multiple places (e.g. previously the VPlan post-processing for header-phi recipes).

In the initial implementation, a number of recipes still take the unrolled part as an additional, optional argument if their execution depends on it.

The computation of start/step values for scalable inductions changed slightly: previously the step would be computed as a scalar and then splatted; now vscale gets splatted and multiplied by the step in a vector multiply.

This has been split off from #94339, which also includes changes to simplify VPTransformState and recipes' ::execute.

The current version mostly leaves the existing ::execute implementations untouched and instead sets VPTransformState::UF to 1.

A follow-up patch will clean up all references to VPTransformState::UF.

Another follow-up patch will simplify VPTransformState to only store a single vector value per VPValue.

@llvmbot (Collaborator) commented Jun 17, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes



Patch is 161.29 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95842.diff

29 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+9)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+31-15)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+49-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+108-49)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+418)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-gather-scatter.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-live-out-pointer-induction.ll (+4-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-accesses.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-interleave.ll (+6-7)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/epilog-vectorization-inductions.ll (+5-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/induction-costs.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/interleaving.ll (+34-34)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr47437.ll (+24-24)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll (+12-21)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+42-42)
  • (modified) llvm/test/Transforms/LoopVectorize/float-induction.ll (+13-15)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+39-38)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-and-scalarize-only.ll (-2)
  • (modified) llvm/test/Transforms/LoopVectorize/pr45679-fold-tail-by-masking.ll (+12-12)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop-uf4.ll (+60-60)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-inductions.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+13-6)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index c03c278fcebe7..61fc02cdf9b8b 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -161,6 +161,15 @@ class VPBuilder {
     return tryInsertInstruction(
         new VPInstruction(Opcode, Operands, WrapFlags, DL, Name));
   }
+
+  VPInstruction *createFPOp(unsigned Opcode,
+                            std::initializer_list<VPValue *> Operands,
+                            DebugLoc DL = {}, const Twine &Name = "",
+                            FastMathFlags FMFs = {}) {
+    auto *Op = new VPInstruction(Opcode, Operands, FMFs, DL, Name);
+    return tryInsertInstruction(Op);
+  }
+
   VPValue *createNot(VPValue *Operand, DebugLoc DL = {},
                      const Twine &Name = "") {
     return createInstruction(VPInstruction::Not, {Operand}, DL, Name);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 408083765ccb1..732499ba9a33a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3476,10 +3476,7 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
   // initial value for the recurrence when jumping to the scalar loop.
   VPValue *VPExtract = LO->getOperand(0);
   using namespace llvm::VPlanPatternMatch;
-  assert(match(VPExtract, m_VPInstruction<VPInstruction::ExtractFromEnd>(
-                              m_VPValue(), m_VPValue())) &&
-         "FOR LiveOut expects to use an extract from end.");
-  Value *ResumeScalarFOR = State.get(VPExtract, UF - 1, true);
+  Value *ResumeScalarFOR = State.get(VPExtract, 0, true);
 
   // Fix the initial value of the original recurrence in the scalar loop.
   PHINode *ScalarHeaderPhi = LO->getPhi();
@@ -7429,6 +7426,8 @@ LoopVectorizationPlanner::executePlan(
       "expanded SCEVs to reuse can only be used during epilogue vectorization");
   (void)IsEpilogueVectorization;
 
+  VPlanTransforms::interleave(BestVPlan, BestUF,
+                              OrigLoop->getHeader()->getModule()->getContext());
   VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
 
   LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF
@@ -9152,42 +9151,59 @@ void VPWidenPointerInductionRecipe::execute(VPTransformState &State) {
 
   auto *IVR = getParent()->getPlan()->getCanonicalIV();
   PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0, /*IsScalar*/ true));
+  unsigned CurrentPart = 0;
+  if (getNumOperands() == 5)
+    CurrentPart =
+        cast<ConstantInt>(getOperand(4)->getLiveInIRValue())->getZExtValue();
   Type *PhiType = IndDesc.getStep()->getType();
 
   // Build a pointer phi
   Value *ScalarStartValue = getStartValue()->getLiveInIRValue();
   Type *ScStValueType = ScalarStartValue->getType();
-  PHINode *NewPointerPhi = PHINode::Create(ScStValueType, 2, "pointer.phi",
-                                           CanonicalIV->getIterator());
+  PHINode *NewPointerPhi = nullptr;
 
   BasicBlock *VectorPH = State.CFG.getPreheaderBBFor(this);
-  NewPointerPhi->addIncoming(ScalarStartValue, VectorPH);
+  if (getNumOperands() == 5) {
+    auto *GEP = cast<GetElementPtrInst>(State.get(getOperand(3), 0));
+    NewPointerPhi = cast<PHINode>(GEP->getPointerOperand());
+  } else {
+    NewPointerPhi =
+        PHINode::Create(ScStValueType, 2, "pointer.phi", CanonicalIV);
+    NewPointerPhi->addIncoming(ScalarStartValue, VectorPH);
+  }
 
   // A pointer induction, performed by using a gep
   BasicBlock::iterator InductionLoc = State.Builder.GetInsertPoint();
+  unsigned UF = getNumOperands() == 2
+                    ? 1
+                    : cast<ConstantInt>(getOperand(2)->getLiveInIRValue())
+                          ->getZExtValue();
 
   Value *ScalarStepValue = State.get(getOperand(1), VPIteration(0, 0));
   Value *RuntimeVF = getRuntimeVF(State.Builder, PhiType, State.VF);
   Value *NumUnrolledElems =
-      State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, State.UF));
-  Value *InductionGEP = GetElementPtrInst::Create(
-      State.Builder.getInt8Ty(), NewPointerPhi,
-      State.Builder.CreateMul(ScalarStepValue, NumUnrolledElems), "ptr.ind",
-      InductionLoc);
+      State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, UF));
   // Add induction update using an incorrect block temporarily. The phi node
   // will be fixed after VPlan execution. Note that at this point the latch
   // block cannot be used, as it does not exist yet.
   // TODO: Model increment value in VPlan, by turning the recipe into a
   // multi-def and a subclass of VPHeaderPHIRecipe.
-  NewPointerPhi->addIncoming(InductionGEP, VectorPH);
+  if (getNumOperands() != 5) {
+    Value *InductionGEP = GetElementPtrInst::Create(
+        State.Builder.getInt8Ty(), NewPointerPhi,
+        State.Builder.CreateMul(ScalarStepValue, NumUnrolledElems), "ptr.ind",
+        InductionLoc);
+
+    NewPointerPhi->addIncoming(InductionGEP, VectorPH);
+  }
 
   // Create UF many actual address geps that use the pointer
   // phi as base and a vectorized version of the step value
   // (<step*0, ..., step*N>) as offset.
   for (unsigned Part = 0; Part < State.UF; ++Part) {
     Type *VecPhiType = VectorType::get(PhiType, State.VF);
-    Value *StartOffsetScalar =
-        State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, Part));
+    Value *StartOffsetScalar = State.Builder.CreateMul(
+        RuntimeVF, ConstantInt::get(PhiType, CurrentPart));
     Value *StartOffset =
         State.Builder.CreateVectorSplat(State.VF, StartOffsetScalar);
     // Create a vector of consecutive numbers from zero to VF.
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 348a2be5072b4..2e47704fe274c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -821,6 +821,10 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
   // FIXME: Model VF * UF computation completely in VPlan.
   VFxUF.setUnderlyingValue(
       createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF));
+  if (VF.getNumUsers() > 0) {
+    VF.setUnderlyingValue(
+        createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
+  }
 
   // When vectorizing the epilogue loop, the canonical induction start value
   // needs to be changed from zero to the value after the main vector loop.
@@ -846,6 +850,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
 /// Assumes a single pre-header basic-block was created for this. Introduce
 /// additional basic-blocks as needed, and fill them all.
 void VPlan::execute(VPTransformState *State) {
+  State->UF = 1;
   // Initialize CFG state.
   State->CFG.PrevVPBB = nullptr;
   State->CFG.ExitBB = State->CFG.PrevBB->getSingleSuccessor();
@@ -890,6 +895,9 @@ void VPlan::execute(VPTransformState *State) {
       // Move the last step to the end of the latch block. This ensures
       // consistent placement of all induction updates.
       Instruction *Inc = cast<Instruction>(Phi->getIncomingValue(1));
+      if (isa<VPWidenIntOrFpInductionRecipe>(&R) && R.getNumOperands() == 4)
+        Inc->setOperand(0, State->get(R.getOperand(3), 0));
+
       Inc->moveBefore(VectorLatchBB->getTerminator()->getPrevNode());
       continue;
     }
@@ -1254,6 +1262,10 @@ void VPlanIngredient::print(raw_ostream &O) const {
 
 template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);
 
+bool VPValue::isDefinedOutsideVectorRegions() const {
+  return !hasDefiningRecipe() || !getDefiningRecipe()->getParent()->getParent();
+}
+
 void VPValue::replaceAllUsesWith(VPValue *New) {
   replaceUsesWithIf(New, [](VPUser &, unsigned) { return true; });
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 5bb88e4a57dc3..706851f84de23 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -701,6 +701,8 @@ class VPLiveOut : public VPUser {
 
   PHINode *getPhi() const { return Phi; }
 
+  bool onlyFirstPartUsed(const VPValue *Op) const override { return true; }
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the VPLiveOut to \p O.
   void print(raw_ostream &O, VPSlotTracker &SlotTracker) const;
@@ -1330,6 +1332,9 @@ class VPInstruction : public VPRecipeWithIRFlags {
   /// Returns true if this VPInstruction produces a scalar value from a vector,
   /// e.g. by performing a reduction or extracting a lane.
   bool isVectorToScalar() const;
+
+  /// Return the interleave count from the VPInstruction's last argument.
+  unsigned getInterleaveCount() const;
 };
 
 /// VPWidenRecipe is a recipe for producing a copy of vector type its
@@ -1611,6 +1616,9 @@ class VPVectorPointerRecipe : public VPRecipeWithIRFlags {
                                      isInBounds(), getDebugLoc());
   }
 
+  /// Return the current part for this vector pointer.
+  unsigned getPartForRecipe() const;
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
@@ -1951,6 +1959,9 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe {
 
   /// Returns true, if the phi is part of an in-loop reduction.
   bool isInLoop() const { return IsInLoop; }
+
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
 };
 
 /// A recipe for vectorizing a phi-node as a sequence of mask-based select
@@ -2593,6 +2604,9 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {
   /// Generate the canonical scalar induction phi of the vector loop.
   void execute(VPTransformState &State) override;
 
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
@@ -2637,7 +2651,9 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
   ~VPActiveLaneMaskPHIRecipe() override = default;
 
   VPActiveLaneMaskPHIRecipe *clone() override {
-    return new VPActiveLaneMaskPHIRecipe(getOperand(0), getDebugLoc());
+    auto *R = new VPActiveLaneMaskPHIRecipe(getOperand(0), getDebugLoc());
+    R->addOperand(getOperand(1));
+    return R;
   }
 
   VP_CLASSOF_IMPL(VPDef::VPActiveLaneMaskPHISC)
@@ -2715,6 +2731,9 @@ class VPWidenCanonicalIVRecipe : public VPSingleDefRecipe {
   /// step = <VF*UF, VF*UF, ..., VF*UF>.
   void execute(VPTransformState &State) override;
 
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
@@ -2827,6 +2846,9 @@ class VPScalarIVStepsRecipe : public VPRecipeWithIRFlags {
            "Op must be an operand of the recipe");
     return true;
   }
+
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
 };
 
 /// VPBasicBlock serves as the leaf of the Hierarchical Control-Flow Graph. It
@@ -3145,6 +3167,8 @@ class VPlan {
   /// Represents the loop-invariant VF * UF of the vector loop region.
   VPValue VFxUF;
 
+  VPValue VF;
+
   /// Holds a mapping between Values and their corresponding VPValue inside
   /// VPlan.
   Value2VPValueTy Value2VPValue;
@@ -3232,6 +3256,7 @@ class VPlan {
 
   /// Returns VF * UF of the vector loop region.
   VPValue &getVFxUF() { return VFxUF; }
+  VPValue &getVF() { return VF; }
 
   void addVF(ElementCount VF) { VFs.insert(VF); }
 
@@ -3665,6 +3690,29 @@ inline bool isUniformAfterVectorization(VPValue *VPV) {
     return VPI->isVectorToScalar();
   return false;
 }
+
+/// Checks if \p C is uniform across all VFs and UFs. It is considered as such
+/// if it is either defined outside the vector region or its operand is known to
+/// be uniform across all VFs and UFs (e.g. VPDerivedIV or VPCanonicalIVPHI).
+inline bool isUniformAcrossVFsAndUFs(VPValue *V) {
+  if (auto *VPI = dyn_cast_or_null<VPInstruction>(V->getDefiningRecipe())) {
+    return VPI ==
+           VPI->getParent()->getPlan()->getCanonicalIV()->getBackedgeValue();
+  }
+  if (isa<VPCanonicalIVPHIRecipe, VPDerivedIVRecipe, VPExpandSCEVRecipe>(V))
+    return true;
+  if (isa<VPReplicateRecipe>(V) && cast<VPReplicateRecipe>(V)->isUniform() &&
+      (isa<LoadInst, StoreInst>(V->getUnderlyingValue())) &&
+      all_of(V->getDefiningRecipe()->operands(),
+             [](VPValue *Op) { return Op->isDefinedOutsideVectorRegions(); }))
+    return true;
+
+  auto *C = dyn_cast_or_null<VPScalarCastRecipe>(V->getDefiningRecipe());
+  return C && (C->isDefinedOutsideVectorRegions() ||
+               isa<VPDerivedIVRecipe>(C->getOperand(0)) ||
+               isa<VPCanonicalIVPHIRecipe>(C->getOperand(0)));
+}
+
 } // end namespace vputils
 
 } // end namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index a3ff6395bb39e..50aae9f28c562 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -393,9 +393,9 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
     if (Part != 0)
       return State.get(this, 0, /*IsScalar*/ true);
 
+    unsigned UF = getInterleaveCount();
     Value *ScalarTC = State.get(getOperand(0), {0, 0});
-    Value *Step =
-        createStepForVF(Builder, ScalarTC->getType(), State.VF, State.UF);
+    Value *Step = createStepForVF(Builder, ScalarTC->getType(), State.VF, UF);
     Value *Sub = Builder.CreateSub(ScalarTC, Step);
     Value *Cmp = Builder.CreateICmp(CmpInst::Predicate::ICMP_UGT, ScalarTC, Step);
     Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
@@ -428,14 +428,15 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
   }
   case VPInstruction::CanonicalIVIncrementForPart: {
     auto *IV = State.get(getOperand(0), VPIteration(0, 0));
-    if (Part == 0)
-      return IV;
-
-    // The canonical IV is incremented by the vectorization factor (num of SIMD
-    // elements) times the unroll part.
-    Value *Step = createStepForVF(Builder, IV->getType(), State.VF, Part);
-    return Builder.CreateAdd(IV, Step, Name, hasNoUnsignedWrap(),
-                             hasNoSignedWrap());
+    if (getNumOperands() == 2) {
+      // The canonical IV is incremented by the vectorization factor (num of
+      // SIMD elements) times the unroll part.
+      Value *Step = createStepForVF(Builder, IV->getType(), State.VF,
+                                    getInterleaveCount());
+      return Builder.CreateAdd(IV, Step, Name, hasNoUnsignedWrap(),
+                               hasNoSignedWrap());
+    }
+    return IV;
   }
   case VPInstruction::BranchOnCond: {
     if (Part != 0)
@@ -483,8 +484,7 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
     return CondBr;
   }
   case VPInstruction::ComputeReductionResult: {
-    if (Part != 0)
-      return State.get(this, 0, /*IsScalar*/ true);
+    unsigned NumParts = getNumOperands() - 1;
 
     // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
     // and will be removed by breaking up the recipe further.
@@ -495,11 +495,10 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
 
     RecurKind RK = RdxDesc.getRecurrenceKind();
 
-    VPValue *LoopExitingDef = getOperand(1);
     Type *PhiTy = OrigPhi->getType();
-    VectorParts RdxParts(State.UF);
-    for (unsigned Part = 0; Part < State.UF; ++Part)
-      RdxParts[Part] = State.get(LoopExitingDef, Part, PhiR->isInLoop());
+    VectorParts RdxParts(NumParts);
+    for (unsigned Part = 0; Part != NumParts; ++Part)
+      RdxParts[Part] = State.get(getOperand(1 + Part), 0, PhiR->isInLoop());
 
     // If the vector reduction can be performed in a smaller type, we truncate
     // then extend the loop exit value to enable InstCombine to evaluate the
@@ -507,7 +506,7 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
     // TODO: Handle this in truncateToMinBW.
     if (State.VF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
       Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), State.VF);
-      for (unsigned Part = 0; Part < State.UF; ++Part)
+      for (unsigned Part = 0; Part < NumParts; ++Part)
         RdxParts[Part] = Builder.CreateTrunc(RdxParts[Part], RdxVecTy);
     }
     // Reduce all of the unrolled parts into a single vector.
@@ -517,12 +516,12 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
       Op = Instruction::Or;
 
     if (PhiR->isOrdered()) {
-      ReducedPartRdx = RdxParts[State.UF - 1];
+      ReducedPartRdx = RdxParts[NumParts - 1];
     } else {
       // Floating-point operations should have some FMF to enable the reduction.
       IRBuilderBase::FastMathFlagGuard FMFG(Builder);
       Builder.setFastMathFlags(RdxDesc.getFastMathFlags());
-      for (unsigned Part = 1; Part < State.UF; ++Part) {
+      for (unsigned Part = 1; Part < NumParts; ++Part) {
         Value *RdxPart = RdxParts[Part];
         if (Op != Instruction::ICmp && Op != Instruction::FCmp)
           ReducedPartRdx = Builder.CreateBinOp(
@@ -603,6 +602,13 @@ bool VPInstruction::isVectorToScalar() const {
          getOpcode() == VPInstruction::ComputeReductionResult;
 }
 
+unsigned VPInstruction::getInterleaveCount() const {
+  return getNumOperands() == 1
+             ? 1
+             : cast<ConstantInt>(getOperand(1)->getLiveInIRValue())
+                   ->getZExtValue();
+}
+
 #if !defined(NDEBUG)
 bool VPInstruction::isFPMathOp() const {
   // Inspired by FPMathOperator::classof. Notable differences are that we don't
@@ -1215,24 +1221,32 @@ void VPWidenIntOrFpInductionRecipe::execute(VPTransformState &State) {
     MulOp = Instruction::FMul;
   }
 
-  // Multiply the vectorization factor by the step using integer or
-  // floating-point arithmetic as appropriate.
-  Type *StepType = Step->getType();
-  Value *RuntimeVF;
-  if (Step->getType()->isFloatingPointTy())
-    RuntimeVF = getRuntimeVFAsFloat(Builder, StepType, State.VF);
-  else
-    RuntimeVF = getRuntimeVF(Builder, StepType, State.VF);
-  Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);
-
-  // Create a vector splat to use in the induction update.
-  //
-  // FIXME: If the step is non-constant, we create the vector splat with
-  //        IRBuilder. IRBuilder can constant-fold the multiply, but it doesn't
-  //        handle a constant vector splat.
-  Value *SplatVF = isa<Constant>(Mul)
-                       ? ConstantVector::getSplat(State.VF, cast<Constant>(Mul))
-                       : Builder.CreateVectorSplat(State.VF, Mul);
+  Value *SplatVF;
+  if (getNumOperands() == 4) {
+    // Need to create stuff in PH.
+    SplatVF = State.get(getOperand(2), 0);
+  } else {
+
+    // Multiply the vectorization factor by the step using integer or
+    // floating-point arithmetic as appropriate.
+    Type *StepType = Step->getType();
+    Value *RuntimeVF;
+    if (Step->getType()->isFloatingPointTy())
+      RuntimeVF = getRuntimeVFAsFloat(Builder, StepType, State.VF);
+    else
+      RuntimeVF = getRuntimeVF(Builder, StepType, State.VF);
+    Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);
+
+    // Create a vector splat to use in the induction update.
+    //
+    // FIXME: If the step is non-constant, we create the vector splat with
+    //        IRBuilder. IRBuilder can constant-fold the multiply, but it
+    //        doesn't handle a constant vector splat.
+    SplatVF = isa<Constant>(Mul)
+                  ? ConstantVector::getSplat(State.VF, cast<Constant>(Mul))
+                  : Builder.CreateVectorSplat(State.VF, Mul);
+  }
+
   Builder.restoreIP(CurrIP);
 
   // We may need to add the step a number of times, depending on the unroll
@@ -1362,7 +1376,8 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
     EndLane = StartLane + 1;
   }
   for (unsigned Part = StartPart; Part < EndPart; ++Part) {
-    Value *StartIdx0 = createStepForVF(Builder, IntStepTy, State.VF, Part);
+    Value *StartIdx0 =
+        cre...
[truncated]

@ayalz ayalz (Collaborator) left a comment

First scan, some comments may be obsolete - drafted before InterleaveState was introduced.

In some cases splitting recipes by dedicating a separate recipe to code generated inside the preheader may simplify interleaving, as the latter need not be cloned across UF.

@@ -3476,10 +3476,7 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
// initial value for the recurrence when jumping to the scalar loop.
VPValue *VPExtract = LO->getOperand(0);
Collaborator

Should this still be called "Extract" given that the assert for ExtractFromEnd is dropped, or simply
VPExtract >> VPLastElement,
// Extract the last vector element ... >> // Retrieve the last vector element ...
?

unsigned CurrentPart = 0;
if (getNumOperands() == 5)
CurrentPart =
cast<ConstantInt>(getOperand(4)->getLiveInIRValue())->getZExtValue();
Collaborator

Define CurrentPart later, closer to where it's needed?

Better record the constant Part as an unsigned field, possibly zero if needed, rather than disrupting the traditional operands?

@@ -821,6 +821,10 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
// FIXME: Model VF * UF computation completely in VPlan.
VFxUF.setUnderlyingValue(
createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF));
if (VF.getNumUsers() > 0) {
VF.setUnderlyingValue(
createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
Collaborator

Suggested change
createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
getRuntimeVF(Builder, TripCountV->getType(), State.VF));

Contributor Author

Updated, thanks!

@@ -890,6 +895,9 @@ void VPlan::execute(VPTransformState *State) {
// Move the last step to the end of the latch block. This ensures
// consistent placement of all induction updates.
Instruction *Inc = cast<Instruction>(Phi->getIncomingValue(1));
if (isa<VPWidenIntOrFpInductionRecipe>(&R) && R.getNumOperands() == 4)
Inc->setOperand(0, State->get(R.getOperand(3), 0));

Collaborator

Good to add some explanation why Inc's first operand is set to R's last operand (if it has four).

Better to place this after the Inc->moveBefore() below, or before the "// Move the last step" comment above.

BTW, above setting of Phi also suffers from WidenPointer recipe producing both a phi and a gep.

Contributor Author

Added a comment and moved, thanks!

@@ -184,7 +184,7 @@ class VPValue {
/// Returns true if the VPValue is defined outside any vector regions, i.e. it
/// is a live-in value.
/// TODO: Also handle recipes defined in pre-header blocks.
Collaborator

This TODO is now taken care of? Can it be pushed independently, or is it only testable with this patch?

Contributor Author

Can be pushed independently, but I think not visible with the current tests

Collaborator

ok, so the TODO should be cleared, along with the "i.e. it is a live-in value" preceding it.

Collaborator

Reminder?

Contributor Author

Removed, thanks!

1, InterleavedValues.getInterleavedValue(H.getOperand(1), IC - 1));
continue;
}
if (InterleavedValues.contains(H.getVPSingleValue()) ||
Collaborator

Add some explanation, along with what I stands for?

Contributor Author

Added a comment to document the while loop

return;
InterleaveState InterleavedValues;

SmallPtrSet<VPRecipeBase *, 8> ToSkip;
Collaborator

Explain why ToSkip is needed.

Contributor Author

Done, thanks!

void addUniform(VPSingleDefRecipe *R, unsigned IC) {
auto Ins = InterleavedValues.insert({R, {}});
for (unsigned I = 1; I != IC; ++I)
Ins.first->second.push_back(R);
Collaborator

An alternative is for getInterleavedValue() to treat uniform values as live-ins or Part0 values?

Contributor Author

Left as is for now, as there are some cases which need handling here but are not caught by uniform analysis, I think
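The bookkeeping being discussed can be sketched in isolation: a map from each value to its copies for parts 1..UF-1, where a uniform value simply reuses its part-0 definition and unmapped values behave like live-ins. This is a toy model with hypothetical types (`Value`, `PartMap`), not LLVM's actual VPlan API:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a VPlan value.
struct Value {
  int Id;
};

// Maps each value to its copies for parts 1..UF-1; part 0 is the value
// itself, so it is never stored.
struct PartMap {
  std::unordered_map<Value *, std::vector<Value *>> Parts;

  // A uniform value is identical in every part: record UF-1 references
  // back to the original instead of cloning it.
  void addUniform(Value *V, unsigned UF) {
    auto &Copies = Parts[V];
    for (unsigned I = 1; I != UF; ++I)
      Copies.push_back(V);
  }

  // Resolve the value to use for a given part. Part 0 and unmapped
  // values (live-ins) resolve to the original value.
  Value *getValueForPart(Value *V, unsigned Part) {
    if (Part == 0)
      return V;
    auto It = Parts.find(V);
    if (It == Parts.end())
      return V;
    return It->second[Part - 1];
  }
};
```

With this shape, the alternative the reviewer raises — treating uniforms like live-ins inside the getter — would make `addUniform` unnecessary for values the uniform analysis catches.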

}
}
}

Collaborator:

Document. Handles non-uniform, non-header-phi recipes.

Contributor Author:

Added brief comment, thanks!

if (!ScalarVFOnly) {
InterleavedValues.remapOperands(&R, IC - 1);
return;
}
Collaborator:

Also handle ScalarVFOnly case here?

Contributor Author:

Done, thanks!

This patch implements explicit interleaving as a VPlan transform. In
follow-up patches this will allow simplifying VPTransform state
(no need to store unrolled parts) as well as recipe execution (no
need to generate code for multiple parts in each recipe). It also
allows for more general optimizations (e.g. avoiding generating code
for recipes that are uniform across parts).

In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument, if their execution
depends on the unrolled part.

The computation for start/step values for scalable inductions changed
slightly. Previously the step would be computed as scalar and then
splatted, now vscale gets splatted and multiplied by the step in a
vector mul.

This has been split off from llvm#94339,
which also includes changes to simplify VPTransformState and recipes'
::execute.

The current version mostly leaves existing ::execute untouched and
instead sets VPTransformState::UF to 1.
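The start/step change described above can be illustrated numerically with a toy model (plain C++ stand-ins, not LLVM code): computing the per-part step as a scalar and splatting it produces the same lanes as splatting vscale and performing the multiply as a vector mul, which is what the transform now emits.

```cpp
#include <cassert>
#include <vector>

// Toy model of a vector value: VF lanes of 64-bit integers.
using Vec = std::vector<long long>;

static Vec splat(long long X, unsigned VF) { return Vec(VF, X); }

// Old scheme: compute the per-part step as a scalar, then splat it.
static Vec stepSplattedScalar(long long Step, long long VScale, unsigned VF) {
  long long ScalarStep = Step * VScale; // scalar mul
  return splat(ScalarStep, VF);
}

// New scheme: splat vscale and multiply by the splatted step with a
// vector mul.
static Vec stepVectorMul(long long Step, long long VScale, unsigned VF) {
  Vec VScaleV = splat(VScale, VF);
  Vec StepV = splat(Step, VF);
  Vec R(VF);
  for (unsigned I = 0; I != VF; ++I)
    R[I] = VScaleV[I] * StepV[I];
  return R;
}
```

Both schemes yield identical lane values; the difference is only in which IR operations get emitted (a scalar mul plus splat versus two splats plus a vector mul).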
@fhahn fhahn force-pushed the vplan-unroll-explicit-step-1 branch 2 times, most recently from 6f0eb83 to 74ec867 on August 14, 2024 20:03
@fhahn fhahn force-pushed the vplan-unroll-explicit-step-1 branch from 74ec867 to 4abc317 on August 14, 2024 20:05
@fhahn
Contributor Author

fhahn commented Aug 26, 2024

ping :)

Collaborator @ayalz left a comment:

Still have parts to go over, but various comments have accumulated so starting the pipe, yet to be completed.

One issue worth discussing is names:

  • "interleaving" may be confused with "interleave-groups"
  • IC is synonymous to UF and to NumParts, implying that Interleaving is aka Unrolling ("in-place", and Jamming). E.g., "interleaveByUF()" should accept UF as parameter rather than IC, and perhaps better called unrollByUF().
  • "replicate" as in recipe and region is also synonymous to "interleaving", i.e., making copies in-place.
    Should we pick one term for consistency, throughout - "replicating" or "unrolling" rather than "interleaving" to avoid confusion with interleave-groups?

Another issue is retaining State.UF (set to 1) and having execute()'s iterate over its Parts, as a gradual step with reduced diff. This may raise some eyebrows when looking at the code, even if temporarily until State.UF and its loops are retired. Is it worth trying to keep both versions alive, say with a flag, or will that only complicate things further, needlessly, and detrimental - inflating back the patch size.

Comment on lines 167 to 168
DebugLoc DL = {}, const Twine &Name = "",
FastMathFlags FMFs = {}) {
Collaborator:

nit, for consistency:

Suggested change:
- DebugLoc DL = {}, const Twine &Name = "",
- FastMathFlags FMFs = {}) {
+ FastMathFlags FMFs = {}, DebugLoc DL = {}, const Twine &Name = "") {

Flags tend to follow Operands as they can be considered optional constant operands (as in MLIR's Attributes), Name tends to appear last. Matching the order in the delegated call to VPInstruction's constructor.
Both callers below feed all parameters anyway, although one could use the default empty name (only).

Contributor Author:

Updated, thanks!

VPInstruction *Mul;
if (IVTy->isFloatingPointTy())
Mul = Builder.createFPOp(Instruction::FMul, {Step, Scale},
IV->getDebugLoc(), "", FMFs);
Collaborator:

Suggested change:
- IV->getDebugLoc(), "", FMFs);
+ FMFs, IV->getDebugLoc());

can drop empty Name parameter if last, consistent with following call to createNaryOp.

Contributor Author:

Adjusted thanks

@@ -7392,6 +7392,8 @@ LoopVectorizationPlanner::executePlan(
"expanded SCEVs to reuse can only be used during epilogue vectorization");
(void)IsEpilogueVectorization;

VPlanTransforms::interleaveByUF(
BestVPlan, BestUF, OrigLoop->getHeader()->getModule()->getContext());
Collaborator:

Potential TODO thought: have interleaveByUF() and its follower optimizeForVFAndUF() be (a late) part of the planning stage, facilitating more accurate cost estimation, rather than here at the execution stage? This involves fixing the VF(s) and keeping a single VPlan - for the best VF (or two - for main and epilog), and determining best UF, before these transforms are applied.

Contributor Author:

Sounds good for a follow-up, this will help improve cost modeling once we completed the transition to the VPlan-based cost model.

Added TODO.

@@ -572,8 +572,7 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) {
return SplitBlock;
}

VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() {
VPRegionBlock *P = getParent();
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) {
Collaborator:

Add a comment that this is (only) to support both const and non-const versions?

Contributor Author:

Done thanks!
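The const/non-const sharing idiom behind getEnclosingLoopRegionImpl can be shown with a minimal standalone sketch (hypothetical Block/Region types, not the real VPlan classes): a single static template deduces T as Region or const Region from its argument, so the walk is written only once.

```cpp
#include <cassert>

// Hypothetical stand-ins for VPlan's block/region types.
struct Region {
  Region *Parent = nullptr;
  bool IsLoopRegion = false;
};

struct Block {
  Region *Parent = nullptr;
  Region *getParent() { return Parent; }
  const Region *getParent() const { return Parent; }
};

// Shared implementation for const and non-const callers: T deduces to
// Region or const Region, so both overloads reuse the same walk.
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) {
  while (P && !P->IsLoopRegion)
    P = P->Parent;
  return P;
}

Region *getEnclosingLoopRegion(Block *B) {
  return getEnclosingLoopRegionImpl(B->getParent());
}

const Region *getEnclosingLoopRegion(const Block *B) {
  return getEnclosingLoopRegionImpl(B->getParent());
}
```

Passing the parent region (rather than the block) into the template keeps deduction simple, matching the author's remark below about not having to spell out the return type.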

@@ -846,6 +850,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
/// Assumes a single pre-header basic-block was created for this. Introduce
/// additional basic-blocks as needed, and fill them all.
void VPlan::execute(VPTransformState *State) {
State->UF = 1;
Collaborator:

Hmm, InnerLoopUnroller used to override a few methods, but indeed seems ready to be retired.

R.getVPSingleValue()->replaceAllUsesWith(
InterleavedValues.getInterleavedValue(Op0, IC - Offset));
} else {
InterleavedValues.remapOperands(&R, IC - 1);
Collaborator:

Worth explaining - extracting from end of (vector of) last part?

Contributor Author:

Done thanks!

if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R)) {
if (isa<StoreInst>(RepR->getUnderlyingValue()) &&
RepR->getOperand(1)->isDefinedOutsideVectorRegions()) {
InterleavedValues.remapOperands(&R, IC - 1);
Collaborator:

Stores need not be added as they have no users; with an invariant address, does keeping the last part suffice?

Contributor Author:

Added comment thanks

}
if (auto *II = dyn_cast<IntrinsicInst>(RepR->getUnderlyingValue())) {
if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl) {
InterleavedValues.addUniform(RepR, IC);
Collaborator:

Free of operands.

Contributor Author:

Yep, also treated as uniforms so remapping wouldn't be needed.


if (isa<VPInstruction>(&R) &&
vputils::onlyFirstPartUsed(R.getVPSingleValue())) {
InterleavedValues.addUniform(cast<VPInstruction>(&R), IC);
Collaborator:

What about (re)mapping operands?

Contributor Author:

Not needed as we only use the first part.

InterleavedValues.addUniform(cast<VPInstruction>(&R), IC);
return;
}

Collaborator:

Some explanation what's coming next - replicating non-uniforms?

Contributor Author:

Added thanks!

Contributor Author @fhahn left a comment:

Addressed comments and also moved most unroll helper functions to UnrollState, to reduce number of parameters passed and improve clarity

> Still have parts to go over, but various comments have accumulated so starting the pipe, yet to be completed.
>
> One issue worth discussing is names:
>
> • "interleaving" may be confused with "interleave-groups"
> • IC is synonymous to UF and to NumParts, implying that Interleaving is aka Unrolling ("in-place", and Jamming). E.g., "interleaveByUF()" should accept UF as parameter rather than IC, and perhaps better called unrollByUF().
> • "replicate" as in recipe and region is also synonymous to "interleaving", i.e., making copies in-place. Should we pick one term for consistency, throughout - "replicating" or "unrolling" rather than "interleaving" to avoid confusion with interleave-groups?

Excellent point, tried to consistently use unroll in the latest version.

> Another issue is retaining State.UF (set to 1) and having execute()'s iterate over its Parts, as a gradual step with reduced diff. This may raise some eyebrows when looking at the code, even if temporarily until State.UF and its loops are retired. Is it worth trying to keep both versions alive, say with a flag, or will that only complicate things further, needlessly, and detrimental - inflating back the patch size.

I don't think a flag is needed (and may be even more confusing), if we follow up swiftly with the mechanical change that removes the loops over parts in ::execute.

Comment on lines 167 to 168
DebugLoc DL = {}, const Twine &Name = "",
FastMathFlags FMFs = {}) {
Contributor Author:

Updated, thanks!

@@ -7392,6 +7392,8 @@ LoopVectorizationPlanner::executePlan(
"expanded SCEVs to reuse can only be used during epilogue vectorization");
(void)IsEpilogueVectorization;

VPlanTransforms::interleaveByUF(
BestVPlan, BestUF, OrigLoop->getHeader()->getModule()->getContext());
Contributor Author:

Sounds good for a follow-up, this will help improve cost modeling once we completed the transition to the VPlan-based cost model.

Added TODO.

@@ -572,8 +572,7 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) {
return SplitBlock;
}

VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() {
VPRegionBlock *P = getParent();
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) {
Contributor Author:

Done thanks!

@@ -582,6 +581,14 @@ VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() {
return P;
}

const VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() const {
return getEnclosingLoopRegionImpl(getParent());
Contributor Author:

Updated to getEnclosingLoopRegionForRegion, thanks!

Passing the region block keeps the template simpler; otherwise we would need to specify the return type, I think.

if (VF.getNumUsers() > 0) {
VF.setUnderlyingValue(
getRuntimeVF(Builder, TripCountV->getType(), State.VF));
}
Contributor Author:

Moved thanks.

Probably best to first land #95305, which adds VF separately.

}
if (auto *II = dyn_cast<IntrinsicInst>(RepR->getUnderlyingValue())) {
if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl) {
InterleavedValues.addUniform(RepR, IC);
Contributor Author:

Yep, also treated as uniforms so remapping wouldn't be needed.

InterleavedValues.addUniform(cast<VPInstruction>(&R), IC);
return;
}

Contributor Author:

Added thanks!

interleaveBlock(VPB, Plan, IC, InterleavedValues, TypeInfo, ToSkip,
PhisToRemap);
}

Contributor Author:

That should have been added a while ago

@@ -106,6 +106,8 @@ struct VPlanTransforms {
/// this transformation.
/// \returns true if the transformation succeeds, or false if it doesn't.
static bool tryAddExplicitVectorLength(VPlan &Plan);

static void interleaveByUF(VPlan &Plan, unsigned IC, LLVMContext &Ctx);
Contributor Author:

Updated and also renamed interleave->unroll

; CHECK-NEXT: vp<[[VPTR3:%.+]]> = vector-pointer vp<[[PADD1]]>
; CHECK-NEXT: vp<[[VPTR4:%.+]]> = vector-pointer vp<[[PADD1]]>
; CHECK-NEXT: WIDEN store vp<[[VPTR3]]>, ir<%add>
; CHECK-NEXT: WIDEN store vp<[[VPTR4]]>, ir<%add>.1
Contributor Author:

It's not redundant; vp<[[VPTR4:%.+]]> = vector-pointer vp<[[PADD1]]> was missing a check for the parts operand (ir<1>), which I added now.


github-actions bot commented Sep 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor Author @fhahn left a comment:

Latest comments should be addressed. Also moved the unroll code to a separate file.

Also split off the isDefinedOutsideLoopRegions changes to 4eb9838 after landing VPlan-based LICM, which allowed for independent testing.

// The newly added recipes are added to ToSkip to avoid interleaving them
// again.
VPValue *Prev = IV;
Builder.setInsertPoint(IV->getParent(), InsertPtForPhi);
Contributor Author:

Updated to replace createFPOp with createNaryOp, which takes optional fast-math flags, allowing it to unify the code.

Could ID.getInductionOpcode() work for non-floats too?

Unfortunately no, as it may be nullptr in some cases for ints (e.g. extends)

Comment on lines 1818 to 1819
VPValue *Op0;
VPValue *Op1;
Contributor Author:

Done, thanks

}

if (vputils::onlyFirstPartUsed(VPI)) {
addUniformForAllParts(cast<VPInstruction>(&R));
Contributor Author:

CalculateTripCountMinusVF is now handled by isUniformAcrossVFandUF, so removed here. Updated to pass VPI, thanks!

continue;
}

auto *SingleDef = dyn_cast<VPSingleDefRecipe>(&R);
Contributor Author:

Left as is for now, as something like all_of(R.definedValues(), [](VPValue *Op) { return vputils::isUniformAcrossVFsAndUFs(Op); }) would also include recipes without any defined values.

DenseMap<VPValue *, SmallVector<VPValue *>> VPV2Parts;

void unrollReplicateRegion(VPRegionBlock *VPR);
void unrollRecipe(VPRecipeBase &R);
Contributor Author:

Done, thanks!

// ..., UF-1.
DenseMap<VPValue *, SmallVector<VPValue *>> VPV2Parts;

void unrollReplicateRegion(VPRegionBlock *VPR);
Contributor Author:

Done, thanks!

@fhahn fhahn merged commit 8ec4067 into llvm:main Sep 21, 2024
5 of 8 checks passed
@fhahn fhahn deleted the vplan-unroll-explicit-step-1 branch September 21, 2024 18:47
@llvm-ci
Collaborator

llvm-ci commented Sep 22, 2024

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1205

Here is the relevant piece of the build log for reference:
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/35/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-23620-35-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=35 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


@dyung
Collaborator

dyung commented Sep 22, 2024

@fhahn this change seems to be causing an assertion failure in the compiler with details in #109581, can you take a look?

fhahn added a commit that referenced this pull request Sep 22, 2024
State.UF is not needed any longer after 8ec4067
(#95842). Clean it up,
simplifying ::execute of existing recipes.
@fhahn
Contributor Author

fhahn commented Sep 22, 2024

@dyung this should be fixed by 53266f7, which runs additional cleanup of recipes after unrolling.

@fhahn
Contributor Author

fhahn commented Sep 22, 2024

Also landed 06c3a7d as follow-up to remove VPTransformState::UF, will follow up with more clean-ups simplifying VPTransformState tomorrow.

fhahn added a commit that referenced this pull request Sep 23, 2024
After 8ec4067 (#95842),
VPTransformState only stores a single vector value per VPValue.

Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and rename to VPV2Vector for clarity.

Also remove the redundant Part argument from various accessors.
fhahn added a commit that referenced this pull request Sep 23, 2024
After 8ec4067 (#95842),
VPTransformState only stores a single scalar vector per VPValue.

Simplify the code by replacing the nested SmallVector in PerPartScalars with
a single SmallVector and rename to VPV2Scalars for clarity.
fhahn added a commit that referenced this pull request Sep 25, 2024
After 8ec4067 (#95842),
only the lane part of VPIteration is used.

Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
This patch implements explicit unrolling by UF as VPlan transform.

PR: llvm#95842
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
State.UF is not needed any longer after 8ec4067
(llvm#95842). Clean it up,
simplifying ::execute of existing recipes.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single vector value per VPValue.

Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and rename to VPV2Vector for clarity.

Also remove the redundant Part argument from various accessors.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single scalar vector per VPValue.

Simplify the code by replacing the nested SmallVector in PerPartScalars with
a single SmallVector and rename to VPV2Scalars for clarity.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
After 8ec4067 (llvm#95842),
only the lane part of VPIteration is used.

Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
Update isDefinedOutsideLoopRegions to check if a recipe is defined
outside any region. Split off already approved
llvm#95842 now that this can be
tested separately after landing VPlan-based LICM
llvm#107501
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
This patch implements explicit unrolling by UF as VPlan transform.

PR: llvm#95842
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
State.UF is not needed any longer after 8ec4067
(llvm#95842). Clean it up,
simplifying ::execute of existing recipes.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single vector value per VPValue.

Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and rename to VPV2Vector for clarity.

Also remove the redundant Part argument from various accessors.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single scalar vector per VPValue.

Simplify the code by replacing the nested SmallVector in PerPartScalars with
a single SmallVector and rename to VPV2Scalars for clarity.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
After 8ec4067 (llvm#95842),
only the lane part of VPIteration is used.

Simplify the code by replacing remaining uses of VPIteration with VPLane directly.