
[VPlan] Implement interleaving as VPlan-to-VPlan transform. #95842

Merged: 33 commits, Sep 21, 2024

Conversation

@fhahn (Contributor) commented Jun 17, 2024

This patch implements explicit interleaving as a VPlan transform. In follow-up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in each recipe). It also allows for more general optimizations (e.g. avoiding code generation for recipes that are uniform across parts).

It also unifies the logic dealing with unrolled parts in a single place, rather than spreading it across multiple places (e.g. previously the VPlan post-processing for header-phi recipes).

In the initial implementation, a number of recipes still take the unrolled part as an additional, optional argument if their execution depends on it.

The computation of start/step values for scalable inductions changed slightly: previously the step would be computed as a scalar and then splatted; now vscale gets splatted and multiplied by the step in a vector multiply.

This has been split off from #94339, which also includes changes to simplify VPTransformState and recipes' ::execute.

The current version mostly leaves the existing ::execute implementations untouched and instead sets VPTransformState::UF to 1.

A follow-up patch will clean up all references to VPTransformState::UF.

Another follow-up patch will simplify VPTransformState to only store a single vector value per VPValue.

@llvmbot (Collaborator) commented Jun 17, 2024

@llvm/pr-subscribers-llvm-transforms

Author: Florian Hahn (fhahn)

Changes



Patch is 161.29 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95842.diff

29 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h (+9)
  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+31-15)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+12)
  • (modified) llvm/lib/Transforms/Vectorize/VPlan.h (+49-1)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+108-49)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+418)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+2)
  • (modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-gather-scatter.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions-unusual-types.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-live-out-pointer-induction.ll (+4-1)
  • (modified) llvm/test/Transforms/LoopVectorize/AArch64/sve-widen-phi.ll (+1-1)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/interleaved-accesses.ll (+6-6)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/uniform-load-store.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/RISCV/vectorize-force-tail-with-evl-interleave.ll (+6-7)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/epilog-vectorization-inductions.ll (+5-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/induction-costs.ll (-3)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/interleaving.ll (+34-34)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/pr47437.ll (+24-24)
  • (modified) llvm/test/Transforms/LoopVectorize/X86/uniform_mem_op.ll (+12-21)
  • (modified) llvm/test/Transforms/LoopVectorize/first-order-recurrence.ll (+42-42)
  • (modified) llvm/test/Transforms/LoopVectorize/float-induction.ll (+13-15)
  • (modified) llvm/test/Transforms/LoopVectorize/induction.ll (+39-38)
  • (modified) llvm/test/Transforms/LoopVectorize/interleave-and-scalarize-only.ll (-2)
  • (modified) llvm/test/Transforms/LoopVectorize/pr45679-fold-tail-by-masking.ll (+12-12)
  • (modified) llvm/test/Transforms/LoopVectorize/reduction-inloop-uf4.ll (+60-60)
  • (modified) llvm/test/Transforms/LoopVectorize/scalable-inductions.ll (+2-2)
  • (modified) llvm/test/Transforms/LoopVectorize/vplan-printing-before-execute.ll (+13-6)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index c03c278fcebe7..61fc02cdf9b8b 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -161,6 +161,15 @@ class VPBuilder {
     return tryInsertInstruction(
         new VPInstruction(Opcode, Operands, WrapFlags, DL, Name));
   }
+
+  VPInstruction *createFPOp(unsigned Opcode,
+                            std::initializer_list<VPValue *> Operands,
+                            DebugLoc DL = {}, const Twine &Name = "",
+                            FastMathFlags FMFs = {}) {
+    auto *Op = new VPInstruction(Opcode, Operands, FMFs, DL, Name);
+    return tryInsertInstruction(Op);
+  }
+
   VPValue *createNot(VPValue *Operand, DebugLoc DL = {},
                      const Twine &Name = "") {
     return createInstruction(VPInstruction::Not, {Operand}, DL, Name);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 408083765ccb1..732499ba9a33a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3476,10 +3476,7 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
   // initial value for the recurrence when jumping to the scalar loop.
   VPValue *VPExtract = LO->getOperand(0);
   using namespace llvm::VPlanPatternMatch;
-  assert(match(VPExtract, m_VPInstruction<VPInstruction::ExtractFromEnd>(
-                              m_VPValue(), m_VPValue())) &&
-         "FOR LiveOut expects to use an extract from end.");
-  Value *ResumeScalarFOR = State.get(VPExtract, UF - 1, true);
+  Value *ResumeScalarFOR = State.get(VPExtract, 0, true);
 
   // Fix the initial value of the original recurrence in the scalar loop.
   PHINode *ScalarHeaderPhi = LO->getPhi();
@@ -7429,6 +7426,8 @@ LoopVectorizationPlanner::executePlan(
       "expanded SCEVs to reuse can only be used during epilogue vectorization");
   (void)IsEpilogueVectorization;
 
+  VPlanTransforms::interleave(BestVPlan, BestUF,
+                              OrigLoop->getHeader()->getModule()->getContext());
   VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
 
   LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF
@@ -9152,42 +9151,59 @@ void VPWidenPointerInductionRecipe::execute(VPTransformState &State) {
 
   auto *IVR = getParent()->getPlan()->getCanonicalIV();
   PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0, /*IsScalar*/ true));
+  unsigned CurrentPart = 0;
+  if (getNumOperands() == 5)
+    CurrentPart =
+        cast<ConstantInt>(getOperand(4)->getLiveInIRValue())->getZExtValue();
   Type *PhiType = IndDesc.getStep()->getType();
 
   // Build a pointer phi
   Value *ScalarStartValue = getStartValue()->getLiveInIRValue();
   Type *ScStValueType = ScalarStartValue->getType();
-  PHINode *NewPointerPhi = PHINode::Create(ScStValueType, 2, "pointer.phi",
-                                           CanonicalIV->getIterator());
+  PHINode *NewPointerPhi = nullptr;
 
   BasicBlock *VectorPH = State.CFG.getPreheaderBBFor(this);
-  NewPointerPhi->addIncoming(ScalarStartValue, VectorPH);
+  if (getNumOperands() == 5) {
+    auto *GEP = cast<GetElementPtrInst>(State.get(getOperand(3), 0));
+    NewPointerPhi = cast<PHINode>(GEP->getPointerOperand());
+  } else {
+    NewPointerPhi =
+        PHINode::Create(ScStValueType, 2, "pointer.phi", CanonicalIV);
+    NewPointerPhi->addIncoming(ScalarStartValue, VectorPH);
+  }
 
   // A pointer induction, performed by using a gep
   BasicBlock::iterator InductionLoc = State.Builder.GetInsertPoint();
+  unsigned UF = getNumOperands() == 2
+                    ? 1
+                    : cast<ConstantInt>(getOperand(2)->getLiveInIRValue())
+                          ->getZExtValue();
 
   Value *ScalarStepValue = State.get(getOperand(1), VPIteration(0, 0));
   Value *RuntimeVF = getRuntimeVF(State.Builder, PhiType, State.VF);
   Value *NumUnrolledElems =
-      State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, State.UF));
-  Value *InductionGEP = GetElementPtrInst::Create(
-      State.Builder.getInt8Ty(), NewPointerPhi,
-      State.Builder.CreateMul(ScalarStepValue, NumUnrolledElems), "ptr.ind",
-      InductionLoc);
+      State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, UF));
   // Add induction update using an incorrect block temporarily. The phi node
   // will be fixed after VPlan execution. Note that at this point the latch
   // block cannot be used, as it does not exist yet.
   // TODO: Model increment value in VPlan, by turning the recipe into a
   // multi-def and a subclass of VPHeaderPHIRecipe.
-  NewPointerPhi->addIncoming(InductionGEP, VectorPH);
+  if (getNumOperands() != 5) {
+    Value *InductionGEP = GetElementPtrInst::Create(
+        State.Builder.getInt8Ty(), NewPointerPhi,
+        State.Builder.CreateMul(ScalarStepValue, NumUnrolledElems), "ptr.ind",
+        InductionLoc);
+
+    NewPointerPhi->addIncoming(InductionGEP, VectorPH);
+  }
 
   // Create UF many actual address geps that use the pointer
   // phi as base and a vectorized version of the step value
   // (<step*0, ..., step*N>) as offset.
   for (unsigned Part = 0; Part < State.UF; ++Part) {
     Type *VecPhiType = VectorType::get(PhiType, State.VF);
-    Value *StartOffsetScalar =
-        State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, Part));
+    Value *StartOffsetScalar = State.Builder.CreateMul(
+        RuntimeVF, ConstantInt::get(PhiType, CurrentPart));
     Value *StartOffset =
         State.Builder.CreateVectorSplat(State.VF, StartOffsetScalar);
     // Create a vector of consecutive numbers from zero to VF.
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 348a2be5072b4..2e47704fe274c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -821,6 +821,10 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
   // FIXME: Model VF * UF computation completely in VPlan.
   VFxUF.setUnderlyingValue(
       createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF));
+  if (VF.getNumUsers() > 0) {
+    VF.setUnderlyingValue(
+        createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
+  }
 
   // When vectorizing the epilogue loop, the canonical induction start value
   // needs to be changed from zero to the value after the main vector loop.
@@ -846,6 +850,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
 /// Assumes a single pre-header basic-block was created for this. Introduce
 /// additional basic-blocks as needed, and fill them all.
 void VPlan::execute(VPTransformState *State) {
+  State->UF = 1;
   // Initialize CFG state.
   State->CFG.PrevVPBB = nullptr;
   State->CFG.ExitBB = State->CFG.PrevBB->getSingleSuccessor();
@@ -890,6 +895,9 @@ void VPlan::execute(VPTransformState *State) {
       // Move the last step to the end of the latch block. This ensures
       // consistent placement of all induction updates.
       Instruction *Inc = cast<Instruction>(Phi->getIncomingValue(1));
+      if (isa<VPWidenIntOrFpInductionRecipe>(&R) && R.getNumOperands() == 4)
+        Inc->setOperand(0, State->get(R.getOperand(3), 0));
+
       Inc->moveBefore(VectorLatchBB->getTerminator()->getPrevNode());
       continue;
     }
@@ -1254,6 +1262,10 @@ void VPlanIngredient::print(raw_ostream &O) const {
 
 template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);
 
+bool VPValue::isDefinedOutsideVectorRegions() const {
+  return !hasDefiningRecipe() || !getDefiningRecipe()->getParent()->getParent();
+}
+
 void VPValue::replaceAllUsesWith(VPValue *New) {
   replaceUsesWithIf(New, [](VPUser &, unsigned) { return true; });
 }
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 5bb88e4a57dc3..706851f84de23 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -701,6 +701,8 @@ class VPLiveOut : public VPUser {
 
   PHINode *getPhi() const { return Phi; }
 
+  bool onlyFirstPartUsed(const VPValue *Op) const override { return true; }
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the VPLiveOut to \p O.
   void print(raw_ostream &O, VPSlotTracker &SlotTracker) const;
@@ -1330,6 +1332,9 @@ class VPInstruction : public VPRecipeWithIRFlags {
   /// Returns true if this VPInstruction produces a scalar value from a vector,
   /// e.g. by performing a reduction or extracting a lane.
   bool isVectorToScalar() const;
+
+  /// Return the interleave count from the VPInstruction's last argument.
+  unsigned getInterleaveCount() const;
 };
 
 /// VPWidenRecipe is a recipe for producing a copy of vector type its
@@ -1611,6 +1616,9 @@ class VPVectorPointerRecipe : public VPRecipeWithIRFlags {
                                      isInBounds(), getDebugLoc());
   }
 
+  /// Return the current part for this vector pointer.
+  unsigned getPartForRecipe() const;
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
@@ -1951,6 +1959,9 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe {
 
   /// Returns true, if the phi is part of an in-loop reduction.
   bool isInLoop() const { return IsInLoop; }
+
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
 };
 
 /// A recipe for vectorizing a phi-node as a sequence of mask-based select
@@ -2593,6 +2604,9 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {
   /// Generate the canonical scalar induction phi of the vector loop.
   void execute(VPTransformState &State) override;
 
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
@@ -2637,7 +2651,9 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
   ~VPActiveLaneMaskPHIRecipe() override = default;
 
   VPActiveLaneMaskPHIRecipe *clone() override {
-    return new VPActiveLaneMaskPHIRecipe(getOperand(0), getDebugLoc());
+    auto *R = new VPActiveLaneMaskPHIRecipe(getOperand(0), getDebugLoc());
+    R->addOperand(getOperand(1));
+    return R;
   }
 
   VP_CLASSOF_IMPL(VPDef::VPActiveLaneMaskPHISC)
@@ -2715,6 +2731,9 @@ class VPWidenCanonicalIVRecipe : public VPSingleDefRecipe {
   /// step = <VF*UF, VF*UF, ..., VF*UF>.
   void execute(VPTransformState &State) override;
 
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
+
 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   /// Print the recipe.
   void print(raw_ostream &O, const Twine &Indent,
@@ -2827,6 +2846,9 @@ class VPScalarIVStepsRecipe : public VPRecipeWithIRFlags {
            "Op must be an operand of the recipe");
     return true;
   }
+
+  /// Return the current part for this scalar step.
+  unsigned getPartForRecipe() const;
 };
 
 /// VPBasicBlock serves as the leaf of the Hierarchical Control-Flow Graph. It
@@ -3145,6 +3167,8 @@ class VPlan {
   /// Represents the loop-invariant VF * UF of the vector loop region.
   VPValue VFxUF;
 
+  VPValue VF;
+
   /// Holds a mapping between Values and their corresponding VPValue inside
   /// VPlan.
   Value2VPValueTy Value2VPValue;
@@ -3232,6 +3256,7 @@ class VPlan {
 
   /// Returns VF * UF of the vector loop region.
   VPValue &getVFxUF() { return VFxUF; }
+  VPValue &getVF() { return VF; }
 
   void addVF(ElementCount VF) { VFs.insert(VF); }
 
@@ -3665,6 +3690,29 @@ inline bool isUniformAfterVectorization(VPValue *VPV) {
     return VPI->isVectorToScalar();
   return false;
 }
+
+/// Checks if \p C is uniform across all VFs and UFs. It is considered as such
+/// if it is either defined outside the vector region or its operand is known to
+/// be uniform across all VFs and UFs (e.g. VPDerivedIV or VPCanonicalIVPHI).
+inline bool isUniformAcrossVFsAndUFs(VPValue *V) {
+  if (auto *VPI = dyn_cast_or_null<VPInstruction>(V->getDefiningRecipe())) {
+    return VPI ==
+           VPI->getParent()->getPlan()->getCanonicalIV()->getBackedgeValue();
+  }
+  if (isa<VPCanonicalIVPHIRecipe, VPDerivedIVRecipe, VPExpandSCEVRecipe>(V))
+    return true;
+  if (isa<VPReplicateRecipe>(V) && cast<VPReplicateRecipe>(V)->isUniform() &&
+      (isa<LoadInst, StoreInst>(V->getUnderlyingValue())) &&
+      all_of(V->getDefiningRecipe()->operands(),
+             [](VPValue *Op) { return Op->isDefinedOutsideVectorRegions(); }))
+    return true;
+
+  auto *C = dyn_cast_or_null<VPScalarCastRecipe>(V->getDefiningRecipe());
+  return C && (C->isDefinedOutsideVectorRegions() ||
+               isa<VPDerivedIVRecipe>(C->getOperand(0)) ||
+               isa<VPCanonicalIVPHIRecipe>(C->getOperand(0)));
+}
+
 } // end namespace vputils
 
 } // end namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index a3ff6395bb39e..50aae9f28c562 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -393,9 +393,9 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
     if (Part != 0)
       return State.get(this, 0, /*IsScalar*/ true);
 
+    unsigned UF = getInterleaveCount();
     Value *ScalarTC = State.get(getOperand(0), {0, 0});
-    Value *Step =
-        createStepForVF(Builder, ScalarTC->getType(), State.VF, State.UF);
+    Value *Step = createStepForVF(Builder, ScalarTC->getType(), State.VF, UF);
     Value *Sub = Builder.CreateSub(ScalarTC, Step);
     Value *Cmp = Builder.CreateICmp(CmpInst::Predicate::ICMP_UGT, ScalarTC, Step);
     Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
@@ -428,14 +428,15 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
   }
   case VPInstruction::CanonicalIVIncrementForPart: {
     auto *IV = State.get(getOperand(0), VPIteration(0, 0));
-    if (Part == 0)
-      return IV;
-
-    // The canonical IV is incremented by the vectorization factor (num of SIMD
-    // elements) times the unroll part.
-    Value *Step = createStepForVF(Builder, IV->getType(), State.VF, Part);
-    return Builder.CreateAdd(IV, Step, Name, hasNoUnsignedWrap(),
-                             hasNoSignedWrap());
+    if (getNumOperands() == 2) {
+      // The canonical IV is incremented by the vectorization factor (num of
+      // SIMD elements) times the unroll part.
+      Value *Step = createStepForVF(Builder, IV->getType(), State.VF,
+                                    getInterleaveCount());
+      return Builder.CreateAdd(IV, Step, Name, hasNoUnsignedWrap(),
+                               hasNoSignedWrap());
+    }
+    return IV;
   }
   case VPInstruction::BranchOnCond: {
     if (Part != 0)
@@ -483,8 +484,7 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
     return CondBr;
   }
   case VPInstruction::ComputeReductionResult: {
-    if (Part != 0)
-      return State.get(this, 0, /*IsScalar*/ true);
+    unsigned NumParts = getNumOperands() - 1;
 
     // FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
     // and will be removed by breaking up the recipe further.
@@ -495,11 +495,10 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
 
     RecurKind RK = RdxDesc.getRecurrenceKind();
 
-    VPValue *LoopExitingDef = getOperand(1);
     Type *PhiTy = OrigPhi->getType();
-    VectorParts RdxParts(State.UF);
-    for (unsigned Part = 0; Part < State.UF; ++Part)
-      RdxParts[Part] = State.get(LoopExitingDef, Part, PhiR->isInLoop());
+    VectorParts RdxParts(NumParts);
+    for (unsigned Part = 0; Part != NumParts; ++Part)
+      RdxParts[Part] = State.get(getOperand(1 + Part), 0, PhiR->isInLoop());
 
     // If the vector reduction can be performed in a smaller type, we truncate
     // then extend the loop exit value to enable InstCombine to evaluate the
@@ -507,7 +506,7 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
     // TODO: Handle this in truncateToMinBW.
     if (State.VF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
       Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), State.VF);
-      for (unsigned Part = 0; Part < State.UF; ++Part)
+      for (unsigned Part = 0; Part < NumParts; ++Part)
         RdxParts[Part] = Builder.CreateTrunc(RdxParts[Part], RdxVecTy);
     }
     // Reduce all of the unrolled parts into a single vector.
@@ -517,12 +516,12 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
       Op = Instruction::Or;
 
     if (PhiR->isOrdered()) {
-      ReducedPartRdx = RdxParts[State.UF - 1];
+      ReducedPartRdx = RdxParts[NumParts - 1];
     } else {
       // Floating-point operations should have some FMF to enable the reduction.
       IRBuilderBase::FastMathFlagGuard FMFG(Builder);
       Builder.setFastMathFlags(RdxDesc.getFastMathFlags());
-      for (unsigned Part = 1; Part < State.UF; ++Part) {
+      for (unsigned Part = 1; Part < NumParts; ++Part) {
         Value *RdxPart = RdxParts[Part];
         if (Op != Instruction::ICmp && Op != Instruction::FCmp)
           ReducedPartRdx = Builder.CreateBinOp(
@@ -603,6 +602,13 @@ bool VPInstruction::isVectorToScalar() const {
          getOpcode() == VPInstruction::ComputeReductionResult;
 }
 
+unsigned VPInstruction::getInterleaveCount() const {
+  return getNumOperands() == 1
+             ? 1
+             : cast<ConstantInt>(getOperand(1)->getLiveInIRValue())
+                   ->getZExtValue();
+}
+
 #if !defined(NDEBUG)
 bool VPInstruction::isFPMathOp() const {
   // Inspired by FPMathOperator::classof. Notable differences are that we don't
@@ -1215,24 +1221,32 @@ void VPWidenIntOrFpInductionRecipe::execute(VPTransformState &State) {
     MulOp = Instruction::FMul;
   }
 
-  // Multiply the vectorization factor by the step using integer or
-  // floating-point arithmetic as appropriate.
-  Type *StepType = Step->getType();
-  Value *RuntimeVF;
-  if (Step->getType()->isFloatingPointTy())
-    RuntimeVF = getRuntimeVFAsFloat(Builder, StepType, State.VF);
-  else
-    RuntimeVF = getRuntimeVF(Builder, StepType, State.VF);
-  Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);
-
-  // Create a vector splat to use in the induction update.
-  //
-  // FIXME: If the step is non-constant, we create the vector splat with
-  //        IRBuilder. IRBuilder can constant-fold the multiply, but it doesn't
-  //        handle a constant vector splat.
-  Value *SplatVF = isa<Constant>(Mul)
-                       ? ConstantVector::getSplat(State.VF, cast<Constant>(Mul))
-                       : Builder.CreateVectorSplat(State.VF, Mul);
+  Value *SplatVF;
+  if (getNumOperands() == 4) {
+    // Need to create stuff in PH.
+    SplatVF = State.get(getOperand(2), 0);
+  } else {
+
+    // Multiply the vectorization factor by the step using integer or
+    // floating-point arithmetic as appropriate.
+    Type *StepType = Step->getType();
+    Value *RuntimeVF;
+    if (Step->getType()->isFloatingPointTy())
+      RuntimeVF = getRuntimeVFAsFloat(Builder, StepType, State.VF);
+    else
+      RuntimeVF = getRuntimeVF(Builder, StepType, State.VF);
+    Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);
+
+    // Create a vector splat to use in the induction update.
+    //
+    // FIXME: If the step is non-constant, we create the vector splat with
+    //        IRBuilder. IRBuilder can constant-fold the multiply, but it
+    //        doesn't handle a constant vector splat.
+    SplatVF = isa<Constant>(Mul)
+                  ? ConstantVector::getSplat(State.VF, cast<Constant>(Mul))
+                  : Builder.CreateVectorSplat(State.VF, Mul);
+  }
+
   Builder.restoreIP(CurrIP);
 
   // We may need to add the step a number of times, depending on the unroll
@@ -1362,7 +1376,8 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
     EndLane = StartLane + 1;
   }
   for (unsigned Part = StartPart; Part < EndPart; ++Part) {
-    Value *StartIdx0 = createStepForVF(Builder, IntStepTy, State.VF, Part);
+    Value *StartIdx0 =
+        cre...
[truncated]

@ayalz ayalz (Collaborator) left a comment

First scan, some comments may be obsolete - drafted before InterleaveState was introduced.

In some cases splitting recipes by dedicating a separate recipe to code generated inside the preheader may simplify interleaving, as the latter need not be cloned across UF.

@@ -3476,10 +3476,7 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
// initial value for the recurrence when jumping to the scalar loop.
VPValue *VPExtract = LO->getOperand(0);
Collaborator

Should this still be called "Extract" given that the assert for ExtractFromEnd is dropped, or simply
VPExtract >> VPLastElement,
// Extract the last vector element ... >> // Retrieve the last vector element ...
?

unsigned CurrentPart = 0;
if (getNumOperands() == 5)
CurrentPart =
cast<ConstantInt>(getOperand(4)->getLiveInIRValue())->getZExtValue();
Collaborator

Define CurrentPart later, closer to where it's needed?

Better record the constant Part as an unsigned field, possibly zero if needed, rather than disrupting the traditional operands?

@@ -821,6 +821,10 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
// FIXME: Model VF * UF computation completely in VPlan.
VFxUF.setUnderlyingValue(
createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF));
if (VF.getNumUsers() > 0) {
VF.setUnderlyingValue(
createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
Collaborator

Suggested change
createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
getRuntimeVF(Builder, TripCountV->getType(), State.VF));

Contributor Author

Updated, thanks!

@@ -890,6 +895,9 @@ void VPlan::execute(VPTransformState *State) {
// Move the last step to the end of the latch block. This ensures
// consistent placement of all induction updates.
Instruction *Inc = cast<Instruction>(Phi->getIncomingValue(1));
if (isa<VPWidenIntOrFpInductionRecipe>(&R) && R.getNumOperands() == 4)
Inc->setOperand(0, State->get(R.getOperand(3), 0));

Collaborator

Good to add some explanation why Inc's first operand is set to R's last operand (if it has four).

Better to place this after the Inc->moveBefore() below, or before the "// Move the last step" comment above.

BTW, above setting of Phi also suffers from WidenPointer recipe producing both a phi and a gep.

Contributor Author

Added a comment and moved, thanks!

@@ -184,7 +184,7 @@ class VPValue {
/// Returns true if the VPValue is defined outside any vector regions, i.e. it
/// is a live-in value.
/// TODO: Also handle recipes defined in pre-header blocks.
Collaborator

This TODO is now taken care of? Can it be pushed independently, or is it only testable with this patch?

Contributor Author

Can be pushed independently, but I think not visible with the current tests

Collaborator

ok, so the TODO should be cleared, along with the "i.e. it is a live-in value" preceding it.

Collaborator

Reminder?

Contributor Author

Removed, thanks!

1, InterleavedValues.getInterleavedValue(H.getOperand(1), IC - 1));
continue;
}
if (InterleavedValues.contains(H.getVPSingleValue()) ||
Collaborator

Add some explanation, along with what I stands for?

Contributor Author

Added a comment to document the while loop

return;
InterleaveState InterleavedValues;

SmallPtrSet<VPRecipeBase *, 8> ToSkip;
Collaborator

Explain why ToSkip is needed.

Contributor Author

Done, thanks!

void addUniform(VPSingleDefRecipe *R, unsigned IC) {
auto Ins = InterleavedValues.insert({R, {}});
for (unsigned I = 1; I != IC; ++I)
Ins.first->second.push_back(R);
Collaborator

An alternative is for getInterleavedValue() to treat uniform values as live-ins or Part0 values?

Contributor Author

Left as is for now, as there are some cases which need handling here but are not caught by uniform analysis, I think
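The bookkeeping being discussed can be sketched in isolation: a map from each value to its copies for parts 1..UF-1, where a uniform value simply reuses its part-0 definition and unmapped values behave like live-ins. This is a toy model with hypothetical types (`Value`, `PartMap`), not LLVM's actual VPlan API:

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for a VPlan value.
struct Value {
  int Id;
};

// Maps each value to its copies for parts 1..UF-1; part 0 is the value
// itself, so it is never stored.
struct PartMap {
  std::unordered_map<Value *, std::vector<Value *>> Parts;

  // A uniform value is identical in every part: record UF-1 references
  // back to the original instead of cloning it.
  void addUniform(Value *V, unsigned UF) {
    auto &Copies = Parts[V];
    for (unsigned I = 1; I != UF; ++I)
      Copies.push_back(V);
  }

  // Resolve the value to use for a given part. Part 0 and unmapped
  // values (live-ins) resolve to the original value.
  Value *getValueForPart(Value *V, unsigned Part) {
    if (Part == 0)
      return V;
    auto It = Parts.find(V);
    if (It == Parts.end())
      return V;
    return It->second[Part - 1];
  }
};
```

With this shape, the alternative the reviewer raises — treating uniforms like live-ins inside the getter — would make `addUniform` unnecessary for values the uniform analysis catches.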

}
}
}

Collaborator:

Document. Handles non-uniform, non-header-phi recipes.

Contributor Author:

Added brief comment, thanks!

if (!ScalarVFOnly) {
InterleavedValues.remapOperands(&R, IC - 1);
return;
}
Collaborator:

Also handle ScalarVFOnly case here?

Contributor Author:

Done, thanks!

This patch implements explicit interleaving as a VPlan transform. In
follow-up patches this will allow simplifying VPTransform state
(no need to store unrolled parts) as well as recipe execution (no
need to generate code for multiple parts in each recipe). It also
allows for more general optimizations (e.g. avoiding generating code
for recipes that are uniform across parts).

In the initial implementation, a number of recipes still take the
unrolled part as an additional, optional argument, if their execution
depends on the unrolled part.

The computation for start/step values for scalable inductions changed
slightly. Previously the step would be computed as scalar and then
splatted, now vscale gets splatted and multiplied by the step in a
vector mul.

This has been split off from llvm#94339,
which also includes changes to simplify VPTransformState and recipes'
::execute.

The current version mostly leaves existing ::execute untouched and
instead sets VPTransformState::UF to 1.
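The start/step change described above can be illustrated numerically with a toy model (plain C++ stand-ins, not LLVM code): computing the per-part step as a scalar and splatting it produces the same lanes as splatting vscale and performing the multiply as a vector mul, which is what the transform now emits.

```cpp
#include <cassert>
#include <vector>

// Toy model of a vector value: VF lanes of 64-bit integers.
using Vec = std::vector<long long>;

static Vec splat(long long X, unsigned VF) { return Vec(VF, X); }

// Old scheme: compute the per-part step as a scalar, then splat it.
static Vec stepSplattedScalar(long long Step, long long VScale, unsigned VF) {
  long long ScalarStep = Step * VScale; // scalar mul
  return splat(ScalarStep, VF);
}

// New scheme: splat vscale and multiply by the splatted step with a
// vector mul.
static Vec stepVectorMul(long long Step, long long VScale, unsigned VF) {
  Vec VScaleV = splat(VScale, VF);
  Vec StepV = splat(Step, VF);
  Vec R(VF);
  for (unsigned I = 0; I != VF; ++I)
    R[I] = VScaleV[I] * StepV[I];
  return R;
}
```

Both schemes yield identical lane values; the difference is only in which IR operations get emitted (a scalar mul plus splat versus two splats plus a vector mul).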
@fhahn fhahn force-pushed the vplan-unroll-explicit-step-1 branch 2 times, most recently from 6f0eb83 to 74ec867 on August 14, 2024 20:03
@fhahn fhahn force-pushed the vplan-unroll-explicit-step-1 branch from 74ec867 to 4abc317 on August 14, 2024 20:05
@fhahn
Contributor Author

fhahn commented Aug 26, 2024

ping :)

Collaborator @ayalz left a comment:

Still have parts to go over, but various comments have accumulated so starting the pipe, yet to be completed.

One issue worth discussing is names:

  • "interleaving" may be confused with "interleave-groups"
  • IC is synonymous to UF and to NumParts, implying that Interleaving is aka Unrolling ("in-place", and Jamming). E.g., "interleaveByUF()" should accept UF as parameter rather than IC, and perhaps better called unrollByUF().
  • "replicate" as in recipe and region is also synonymous to "interleaving", i.e., making copies in-place.
    Should we pick one term for consistency, throughout - "replicating" or "unrolling" rather than "interleaving" to avoid confusion with interleave-groups?

Another issue is retaining State.UF (set to 1) and having execute()'s iterate over its Parts, as a gradual step with reduced diff. This may raise some eyebrows when looking at the code, even if temporarily until State.UF and its loops are retired. Is it worth trying to keep both versions alive, say with a flag, or will that only complicate things further, needlessly, and detrimental - inflating back the patch size.

Comment on lines 167 to 168
DebugLoc DL = {}, const Twine &Name = "",
FastMathFlags FMFs = {}) {
Collaborator:

nit, for consistency:

Suggested change:
- DebugLoc DL = {}, const Twine &Name = "",
- FastMathFlags FMFs = {}) {
+ FastMathFlags FMFs = {}, DebugLoc DL = {}, const Twine &Name = "") {

Flags tend to follow Operands as they can be considered optional constant operands (as in MLIR's Attributes), Name tends to appear last. Matching the order in the delegated call to VPInstruction's constructor.
Both callers below feed all parameters anyway, although one could use the default empty name (only).

Contributor Author:

Updated, thanks!

VPInstruction *Mul;
if (IVTy->isFloatingPointTy())
Mul = Builder.createFPOp(Instruction::FMul, {Step, Scale},
IV->getDebugLoc(), "", FMFs);
Collaborator:

Suggested change:
- IV->getDebugLoc(), "", FMFs);
+ FMFs, IV->getDebugLoc());

can drop empty Name parameter if last, consistent with following call to createNaryOp.

Contributor Author:

Adjusted thanks

@@ -7392,6 +7392,8 @@ LoopVectorizationPlanner::executePlan(
"expanded SCEVs to reuse can only be used during epilogue vectorization");
(void)IsEpilogueVectorization;

VPlanTransforms::interleaveByUF(
BestVPlan, BestUF, OrigLoop->getHeader()->getModule()->getContext());
Collaborator:

Potential TODO thought: have interleaveByUF() and its follower optimizeForVFAndUF() be (a late) part of the planning stage, facilitating more accurate cost estimation, rather than here at the execution stage? This involves fixing the VF(s) and keeping a single VPlan - for the best VF (or two - for main and epilog), and determining best UF, before these transforms are applied.

Contributor Author:

Sounds good for a follow-up, this will help improve cost modeling once we completed the transition to the VPlan-based cost model.

Added TODO.

@@ -572,8 +572,7 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) {
return SplitBlock;
}

VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() {
VPRegionBlock *P = getParent();
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) {
Collaborator:

Add a comment that this is (only) to support both const and non-const versions?

Contributor Author:

Done thanks!
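The const/non-const sharing idiom behind getEnclosingLoopRegionImpl can be shown with a minimal standalone sketch (hypothetical Block/Region types, not the real VPlan classes): a single static template deduces T as Region or const Region from its argument, so the walk is written only once.

```cpp
#include <cassert>

// Hypothetical stand-ins for VPlan's block/region types.
struct Region {
  Region *Parent = nullptr;
  bool IsLoopRegion = false;
};

struct Block {
  Region *Parent = nullptr;
  Region *getParent() { return Parent; }
  const Region *getParent() const { return Parent; }
};

// Shared implementation for const and non-const callers: T deduces to
// Region or const Region, so both overloads reuse the same walk.
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) {
  while (P && !P->IsLoopRegion)
    P = P->Parent;
  return P;
}

Region *getEnclosingLoopRegion(Block *B) {
  return getEnclosingLoopRegionImpl(B->getParent());
}

const Region *getEnclosingLoopRegion(const Block *B) {
  return getEnclosingLoopRegionImpl(B->getParent());
}
```

Passing the parent region (rather than the block) into the template keeps deduction simple, matching the author's remark below about not having to spell out the return type.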

@@ -846,6 +850,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
/// Assumes a single pre-header basic-block was created for this. Introduce
/// additional basic-blocks as needed, and fill them all.
void VPlan::execute(VPTransformState *State) {
State->UF = 1;
Collaborator:

Hmm, InnerLoopUnroller used to override a few methods, but indeed seems ready to be retired.

R.getVPSingleValue()->replaceAllUsesWith(
InterleavedValues.getInterleavedValue(Op0, IC - Offset));
} else {
InterleavedValues.remapOperands(&R, IC - 1);
Collaborator:

Worth explaining - extracting from end of (vector of) last part?

Contributor Author:

Done thanks!

if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R)) {
if (isa<StoreInst>(RepR->getUnderlyingValue()) &&
RepR->getOperand(1)->isDefinedOutsideVectorRegions()) {
InterleavedValues.remapOperands(&R, IC - 1);
Collaborator:

Stores need not be added as they have no users; with an invariant address, does keeping the last part suffice?

Contributor Author:

Added comment thanks

}
if (auto *II = dyn_cast<IntrinsicInst>(RepR->getUnderlyingValue())) {
if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl) {
InterleavedValues.addUniform(RepR, IC);
Collaborator:

Free of operands.

Contributor Author:

Yep, also treated as uniforms so remapping wouldn't be needed.


if (isa<VPInstruction>(&R) &&
vputils::onlyFirstPartUsed(R.getVPSingleValue())) {
InterleavedValues.addUniform(cast<VPInstruction>(&R), IC);
Collaborator:

What about (re)mapping operands?

Contributor Author:

Not needed as we only use the first part.

InterleavedValues.addUniform(cast<VPInstruction>(&R), IC);
return;
}

Collaborator:

Some explanation what's coming next - replicating non-uniforms?

Contributor Author:

Added thanks!

Contributor Author @fhahn left a comment:

Addressed comments and also moved most unroll helper functions to UnrollState, to reduce number of parameters passed and improve clarity

> Still have parts to go over, but various comments have accumulated so starting the pipe, yet to be completed.
>
> One issue worth discussing is names:
>
> • "interleaving" may be confused with "interleave-groups"
> • IC is synonymous to UF and to NumParts, implying that Interleaving is aka Unrolling ("in-place", and Jamming). E.g., "interleaveByUF()" should accept UF as parameter rather than IC, and perhaps better called unrollByUF().
> • "replicate" as in recipe and region is also synonymous to "interleaving", i.e., making copies in-place. Should we pick one term for consistency, throughout - "replicating" or "unrolling" rather than "interleaving" to avoid confusion with interleave-groups?

Excellent point, tried to consistently use unroll in the latest version.

> Another issue is retaining State.UF (set to 1) and having execute()'s iterate over its Parts, as a gradual step with reduced diff. This may raise some eyebrows when looking at the code, even if temporarily until State.UF and its loops are retired. Is it worth trying to keep both versions alive, say with a flag, or will that only complicate things further, needlessly, and detrimental - inflating back the patch size.

I don't think a flag is needed (and may be even more confusing), if we follow up swiftly with the mechanical change that removes the loops over parts in ::execute.

Comment on lines 167 to 168
DebugLoc DL = {}, const Twine &Name = "",
FastMathFlags FMFs = {}) {
Contributor Author:

Updated, thanks!

@@ -7392,6 +7392,8 @@ LoopVectorizationPlanner::executePlan(
"expanded SCEVs to reuse can only be used during epilogue vectorization");
(void)IsEpilogueVectorization;

VPlanTransforms::interleaveByUF(
BestVPlan, BestUF, OrigLoop->getHeader()->getModule()->getContext());
Contributor Author:

Sounds good for a follow-up, this will help improve cost modeling once we completed the transition to the VPlan-based cost model.

Added TODO.

@@ -572,8 +572,7 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) {
return SplitBlock;
}

VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() {
VPRegionBlock *P = getParent();
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) {
Contributor Author:

Done thanks!

@@ -582,6 +581,14 @@ VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() {
return P;
}

const VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() const {
return getEnclosingLoopRegionImpl(getParent());
Contributor Author:

Updated to getEnclosingLoopRegionForRegion, thanks!

Passing the region block keeps the template simpler; otherwise we would need to specify the return type, I think.

if (VF.getNumUsers() > 0) {
VF.setUnderlyingValue(
getRuntimeVF(Builder, TripCountV->getType(), State.VF));
}
Contributor Author:

Moved thanks.

Probably best to first land #95305, which adds VF separately.

}
if (auto *II = dyn_cast<IntrinsicInst>(RepR->getUnderlyingValue())) {
if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl) {
InterleavedValues.addUniform(RepR, IC);
Contributor Author:

Yep, also treated as uniforms so remapping wouldn't be needed.

InterleavedValues.addUniform(cast<VPInstruction>(&R), IC);
return;
}

Contributor Author:

Added thanks!

interleaveBlock(VPB, Plan, IC, InterleavedValues, TypeInfo, ToSkip,
PhisToRemap);
}

Contributor Author:

That should have been added a while ago

@@ -106,6 +106,8 @@ struct VPlanTransforms {
/// this transformation.
/// \returns true if the transformation succeeds, or false if it doesn't.
static bool tryAddExplicitVectorLength(VPlan &Plan);

static void interleaveByUF(VPlan &Plan, unsigned IC, LLVMContext &Ctx);
Contributor Author:

Updated and also renamed interleave->unroll

; CHECK-NEXT: vp<[[VPTR3:%.+]]> = vector-pointer vp<[[PADD1]]>
; CHECK-NEXT: vp<[[VPTR4:%.+]]> = vector-pointer vp<[[PADD1]]>
; CHECK-NEXT: WIDEN store vp<[[VPTR3]]>, ir<%add>
; CHECK-NEXT: WIDEN store vp<[[VPTR4]]>, ir<%add>.1
Contributor Author:

It's not redundant; vp<[[VPTR4:%.+]]> = vector-pointer vp<[[PADD1]]> was missing a check for the parts operand (ir<1>), which I added now.


github-actions bot commented Sep 5, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

Contributor Author @fhahn left a comment:

Latest comments should be addressed. Also moved the unroll code to a separate file.

Also split off the isDefinedOutsideLoopRegions changes to 4eb9838 after landing VPlan-based LICM, which allowed for independent testing.

// The newly added recipes are added to ToSkip to avoid interleaving them
// again.
VPValue *Prev = IV;
Builder.setInsertPoint(IV->getParent(), InsertPtForPhi);
Contributor Author:

Updated to replace createFPOp with createNaryOp, which takes optional fast-math flags, allowing it to unify the code.

Could ID.getInductionOpcode() work for non-floats too?

Unfortunately no, as it may be nullptr in some cases for ints (e.g. extends)

Comment on lines 1818 to 1819
VPValue *Op0;
VPValue *Op1;
Contributor Author:

Done, thanks

}

if (vputils::onlyFirstPartUsed(VPI)) {
addUniformForAllParts(cast<VPInstruction>(&R));
Contributor Author:

CalculateTripCountMinusVF is now handled by isUniformAcrossVFandUF, so removed here. Updated to pass VPI, thanks!

continue;
}

auto *SingleDef = dyn_cast<VPSingleDefRecipe>(&R);
Contributor Author:

Left as is for now, as something like all_of(R.definedValues(), [](VPValue *Op) { return vputils::isUniformAcrossVFsAndUFs(Op); }) would also include recipes without any defined values.

DenseMap<VPValue *, SmallVector<VPValue *>> VPV2Parts;

void unrollReplicateRegion(VPRegionBlock *VPR);
void unrollRecipe(VPRecipeBase &R);
Contributor Author:

Done, thanks!

// ..., UF-1.
DenseMap<VPValue *, SmallVector<VPValue *>> VPV2Parts;

void unrollReplicateRegion(VPRegionBlock *VPR);
Contributor Author:

Done, thanks!

@fhahn fhahn merged commit 8ec4067 into llvm:main Sep 21, 2024
5 of 8 checks passed
@fhahn fhahn deleted the vplan-unroll-explicit-step-1 branch September 21, 2024 18:47
@llvm-ci
Collaborator

llvm-ci commented Sep 22, 2024

LLVM Buildbot has detected a new failure on builder lld-x86_64-win running on as-worker-93 while building llvm at step 7 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1205

Here is the relevant piece of the build log for reference:
Step 7 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM-Unit :: Support/./SupportTests.exe/35/87' FAILED ********************
Script(shard):
--
GTEST_OUTPUT=json:C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe-LLVM-Unit-23620-35-87.json GTEST_SHUFFLE=0 GTEST_TOTAL_SHARDS=87 GTEST_SHARD_INDEX=35 C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe
--

Script:
--
C:\a\lld-x86_64-win\build\unittests\Support\.\SupportTests.exe --gtest_filter=ProgramEnvTest.CreateProcessLongPath
--
C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(160): error: Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp(163): error: fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied



C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:160
Expected equality of these values:
  0
  RC
    Which is: -2

C:\a\lld-x86_64-win\llvm-project\llvm\unittests\Support\ProgramTest.cpp:163
fs::remove(Twine(LongPath)): did not return errc::success.
error number: 13
error message: permission denied




********************


@dyung
Collaborator

dyung commented Sep 22, 2024

@fhahn this change seems to be causing an assertion failure in the compiler with details in #109581, can you take a look?

fhahn added a commit that referenced this pull request Sep 22, 2024
State.UF is not needed any longer after 8ec4067
(#95842). Clean it up,
simplifying ::execute of existing recipes.
@fhahn
Contributor Author

fhahn commented Sep 22, 2024

@dyung this should be fixed by 53266f7, which runs additional cleanup of recipes after unrolling.

@fhahn
Contributor Author

fhahn commented Sep 22, 2024

Also landed 06c3a7d as follow-up to remove VPTransformState::UF, will follow up with more clean-ups simplifying VPTransformState tomorrow.

fhahn added a commit that referenced this pull request Sep 23, 2024
After 8ec4067 (#95842),
VPTransformState only stores a single vector value per VPValue.

Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and rename to VPV2Vector for clarity.

Also remove the redundant Part argument from various accessors.
fhahn added a commit that referenced this pull request Sep 23, 2024
After 8ec4067 (#95842),
VPTransformState only stores a single scalar vector per VPValue.

Simplify the code by replacing the nested SmallVector in PerPartScalars with
a single SmallVector and rename to VPV2Scalars for clarity.
fhahn added a commit that referenced this pull request Sep 25, 2024
After 8ec4067 (#95842),
only the lane part of VPIteration is used.

Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
This patch implements explicit unrolling by UF as VPlan transform.

PR: llvm#95842
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
State.UF is not needed any longer after 8ec4067
(llvm#95842). Clean it up,
simplifying ::execute of existing recipes.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single vector value per VPValue.

Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and rename to VPV2Vector for clarity.

Also remove the redundant Part argument from various accessors.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single scalar vector per VPValue.

Simplify the code by replacing the nested SmallVector in PerPartScalars with
a single SmallVector and rename to VPV2Scalars for clarity.
augusto2112 pushed a commit to augusto2112/llvm-project that referenced this pull request Sep 26, 2024
After 8ec4067 (llvm#95842),
only the lane part of VPIteration is used.

Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
Update isDefinedOutsideLoopRegions to check if a recipe is defined
outside any region. Split off already approved
llvm#95842 now that this can be
tested separately after landing VPlan-based LICM
llvm#107501
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
This patch implements explicit unrolling by UF as VPlan transform.

PR: llvm#95842
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
State.UF is not needed any longer after 8ec4067
(llvm#95842). Clean it up,
simplifying ::execute of existing recipes.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single vector value per VPValue.

Simplify the code by replacing the SmallVector in PerPartOutput with a
single Value * and rename to VPV2Vector for clarity.

Also remove the redundant Part argument from various accessors.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
After 8ec4067 (llvm#95842),
VPTransformState only stores a single scalar vector per VPValue.

Simplify the code by replacing the nested SmallVector in PerPartScalars with
a single SmallVector and rename to VPV2Scalars for clarity.
xgupta pushed a commit to xgupta/llvm-project that referenced this pull request Oct 4, 2024
After 8ec4067 (llvm#95842),
only the lane part of VPIteration is used.

Simplify the code by replacing remaining uses of VPIteration with VPLane directly.