-
Notifications
You must be signed in to change notification settings - Fork 12k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[VPlan] Implement interleaving as VPlan-to-VPlan transform. #95842
Conversation
@llvm/pr-subscribers-llvm-transforms Author: Florian Hahn (fhahn) ChangesThis patch implements explicit interleaving as VPlan transform. In follow up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in a each recipe). It also allows for more general optimziations (e.g. avoid generating code for recipes that are uniform-across parts). In the initial implementation, a number of recipes still take the unrolled part as additional, optional argument, if their execution depends on the unrolled part. The computation for start/step values for scalable inductions changed slightly. Previously the step would be computed as scalar and then splatted, now vscale gets splatted and multiplied by the step in a vector mul. This has been split off #94339 which also includes changes to simplify VPTransfomState and recipes' ::execute. The current version mostly leaves existing ::execute untouched and instead sets VPTransfomState::UF to 1. Patch is 161.29 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/95842.diff 29 Files Affected:
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
index c03c278fcebe7..61fc02cdf9b8b 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h
@@ -161,6 +161,15 @@ class VPBuilder {
return tryInsertInstruction(
new VPInstruction(Opcode, Operands, WrapFlags, DL, Name));
}
+
+ VPInstruction *createFPOp(unsigned Opcode,
+ std::initializer_list<VPValue *> Operands,
+ DebugLoc DL = {}, const Twine &Name = "",
+ FastMathFlags FMFs = {}) {
+ auto *Op = new VPInstruction(Opcode, Operands, FMFs, DL, Name);
+ return tryInsertInstruction(Op);
+ }
+
VPValue *createNot(VPValue *Operand, DebugLoc DL = {},
const Twine &Name = "") {
return createInstruction(VPInstruction::Not, {Operand}, DL, Name);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 408083765ccb1..732499ba9a33a 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -3476,10 +3476,7 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO,
// initial value for the recurrence when jumping to the scalar loop.
VPValue *VPExtract = LO->getOperand(0);
using namespace llvm::VPlanPatternMatch;
- assert(match(VPExtract, m_VPInstruction<VPInstruction::ExtractFromEnd>(
- m_VPValue(), m_VPValue())) &&
- "FOR LiveOut expects to use an extract from end.");
- Value *ResumeScalarFOR = State.get(VPExtract, UF - 1, true);
+ Value *ResumeScalarFOR = State.get(VPExtract, 0, true);
// Fix the initial value of the original recurrence in the scalar loop.
PHINode *ScalarHeaderPhi = LO->getPhi();
@@ -7429,6 +7426,8 @@ LoopVectorizationPlanner::executePlan(
"expanded SCEVs to reuse can only be used during epilogue vectorization");
(void)IsEpilogueVectorization;
+ VPlanTransforms::interleave(BestVPlan, BestUF,
+ OrigLoop->getHeader()->getModule()->getContext());
VPlanTransforms::optimizeForVFAndUF(BestVPlan, BestVF, BestUF, PSE);
LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF
@@ -9152,42 +9151,59 @@ void VPWidenPointerInductionRecipe::execute(VPTransformState &State) {
auto *IVR = getParent()->getPlan()->getCanonicalIV();
PHINode *CanonicalIV = cast<PHINode>(State.get(IVR, 0, /*IsScalar*/ true));
+ unsigned CurrentPart = 0;
+ if (getNumOperands() == 5)
+ CurrentPart =
+ cast<ConstantInt>(getOperand(4)->getLiveInIRValue())->getZExtValue();
Type *PhiType = IndDesc.getStep()->getType();
// Build a pointer phi
Value *ScalarStartValue = getStartValue()->getLiveInIRValue();
Type *ScStValueType = ScalarStartValue->getType();
- PHINode *NewPointerPhi = PHINode::Create(ScStValueType, 2, "pointer.phi",
- CanonicalIV->getIterator());
+ PHINode *NewPointerPhi = nullptr;
BasicBlock *VectorPH = State.CFG.getPreheaderBBFor(this);
- NewPointerPhi->addIncoming(ScalarStartValue, VectorPH);
+ if (getNumOperands() == 5) {
+ auto *GEP = cast<GetElementPtrInst>(State.get(getOperand(3), 0));
+ NewPointerPhi = cast<PHINode>(GEP->getPointerOperand());
+ } else {
+ NewPointerPhi =
+ PHINode::Create(ScStValueType, 2, "pointer.phi", CanonicalIV);
+ NewPointerPhi->addIncoming(ScalarStartValue, VectorPH);
+ }
// A pointer induction, performed by using a gep
BasicBlock::iterator InductionLoc = State.Builder.GetInsertPoint();
+ unsigned UF = getNumOperands() == 2
+ ? 1
+ : cast<ConstantInt>(getOperand(2)->getLiveInIRValue())
+ ->getZExtValue();
Value *ScalarStepValue = State.get(getOperand(1), VPIteration(0, 0));
Value *RuntimeVF = getRuntimeVF(State.Builder, PhiType, State.VF);
Value *NumUnrolledElems =
- State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, State.UF));
- Value *InductionGEP = GetElementPtrInst::Create(
- State.Builder.getInt8Ty(), NewPointerPhi,
- State.Builder.CreateMul(ScalarStepValue, NumUnrolledElems), "ptr.ind",
- InductionLoc);
+ State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, UF));
// Add induction update using an incorrect block temporarily. The phi node
// will be fixed after VPlan execution. Note that at this point the latch
// block cannot be used, as it does not exist yet.
// TODO: Model increment value in VPlan, by turning the recipe into a
// multi-def and a subclass of VPHeaderPHIRecipe.
- NewPointerPhi->addIncoming(InductionGEP, VectorPH);
+ if (getNumOperands() != 5) {
+ Value *InductionGEP = GetElementPtrInst::Create(
+ State.Builder.getInt8Ty(), NewPointerPhi,
+ State.Builder.CreateMul(ScalarStepValue, NumUnrolledElems), "ptr.ind",
+ InductionLoc);
+
+ NewPointerPhi->addIncoming(InductionGEP, VectorPH);
+ }
// Create UF many actual address geps that use the pointer
// phi as base and a vectorized version of the step value
// (<step*0, ..., step*N>) as offset.
for (unsigned Part = 0; Part < State.UF; ++Part) {
Type *VecPhiType = VectorType::get(PhiType, State.VF);
- Value *StartOffsetScalar =
- State.Builder.CreateMul(RuntimeVF, ConstantInt::get(PhiType, Part));
+ Value *StartOffsetScalar = State.Builder.CreateMul(
+ RuntimeVF, ConstantInt::get(PhiType, CurrentPart));
Value *StartOffset =
State.Builder.CreateVectorSplat(State.VF, StartOffsetScalar);
// Create a vector of consecutive numbers from zero to VF.
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.cpp b/llvm/lib/Transforms/Vectorize/VPlan.cpp
index 348a2be5072b4..2e47704fe274c 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlan.cpp
@@ -821,6 +821,10 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
// FIXME: Model VF * UF computation completely in VPlan.
VFxUF.setUnderlyingValue(
createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF));
+ if (VF.getNumUsers() > 0) {
+ VF.setUnderlyingValue(
+ createStepForVF(Builder, TripCountV->getType(), State.VF, 1));
+ }
// When vectorizing the epilogue loop, the canonical induction start value
// needs to be changed from zero to the value after the main vector loop.
@@ -846,6 +850,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV,
/// Assumes a single pre-header basic-block was created for this. Introduce
/// additional basic-blocks as needed, and fill them all.
void VPlan::execute(VPTransformState *State) {
+ State->UF = 1;
// Initialize CFG state.
State->CFG.PrevVPBB = nullptr;
State->CFG.ExitBB = State->CFG.PrevBB->getSingleSuccessor();
@@ -890,6 +895,9 @@ void VPlan::execute(VPTransformState *State) {
// Move the last step to the end of the latch block. This ensures
// consistent placement of all induction updates.
Instruction *Inc = cast<Instruction>(Phi->getIncomingValue(1));
+ if (isa<VPWidenIntOrFpInductionRecipe>(&R) && R.getNumOperands() == 4)
+ Inc->setOperand(0, State->get(R.getOperand(3), 0));
+
Inc->moveBefore(VectorLatchBB->getTerminator()->getPrevNode());
continue;
}
@@ -1254,6 +1262,10 @@ void VPlanIngredient::print(raw_ostream &O) const {
template void DomTreeBuilder::Calculate<VPDominatorTree>(VPDominatorTree &DT);
+bool VPValue::isDefinedOutsideVectorRegions() const {
+ return !hasDefiningRecipe() || !getDefiningRecipe()->getParent()->getParent();
+}
+
void VPValue::replaceAllUsesWith(VPValue *New) {
replaceUsesWithIf(New, [](VPUser &, unsigned) { return true; });
}
diff --git a/llvm/lib/Transforms/Vectorize/VPlan.h b/llvm/lib/Transforms/Vectorize/VPlan.h
index 5bb88e4a57dc3..706851f84de23 100644
--- a/llvm/lib/Transforms/Vectorize/VPlan.h
+++ b/llvm/lib/Transforms/Vectorize/VPlan.h
@@ -701,6 +701,8 @@ class VPLiveOut : public VPUser {
PHINode *getPhi() const { return Phi; }
+ bool onlyFirstPartUsed(const VPValue *Op) const override { return true; }
+
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the VPLiveOut to \p O.
void print(raw_ostream &O, VPSlotTracker &SlotTracker) const;
@@ -1330,6 +1332,9 @@ class VPInstruction : public VPRecipeWithIRFlags {
/// Returns true if this VPInstruction produces a scalar value from a vector,
/// e.g. by performing a reduction or extracting a lane.
bool isVectorToScalar() const;
+
+ /// Return the interleave count from the VPInstruction's last argument.
+ unsigned getInterleaveCount() const;
};
/// VPWidenRecipe is a recipe for producing a copy of vector type its
@@ -1611,6 +1616,9 @@ class VPVectorPointerRecipe : public VPRecipeWithIRFlags {
isInBounds(), getDebugLoc());
}
+ /// Return the current part for this vector pointer.
+ unsigned getPartForRecipe() const;
+
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
@@ -1951,6 +1959,9 @@ class VPReductionPHIRecipe : public VPHeaderPHIRecipe {
/// Returns true, if the phi is part of an in-loop reduction.
bool isInLoop() const { return IsInLoop; }
+
+ /// Return the current part for this scalar step.
+ unsigned getPartForRecipe() const;
};
/// A recipe for vectorizing a phi-node as a sequence of mask-based select
@@ -2593,6 +2604,9 @@ class VPCanonicalIVPHIRecipe : public VPHeaderPHIRecipe {
/// Generate the canonical scalar induction phi of the vector loop.
void execute(VPTransformState &State) override;
+ /// Return the current part for this scalar step.
+ unsigned getPartForRecipe() const;
+
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
@@ -2637,7 +2651,9 @@ class VPActiveLaneMaskPHIRecipe : public VPHeaderPHIRecipe {
~VPActiveLaneMaskPHIRecipe() override = default;
VPActiveLaneMaskPHIRecipe *clone() override {
- return new VPActiveLaneMaskPHIRecipe(getOperand(0), getDebugLoc());
+ auto *R = new VPActiveLaneMaskPHIRecipe(getOperand(0), getDebugLoc());
+ R->addOperand(getOperand(1));
+ return R;
}
VP_CLASSOF_IMPL(VPDef::VPActiveLaneMaskPHISC)
@@ -2715,6 +2731,9 @@ class VPWidenCanonicalIVRecipe : public VPSingleDefRecipe {
/// step = <VF*UF, VF*UF, ..., VF*UF>.
void execute(VPTransformState &State) override;
+ /// Return the current part for this scalar step.
+ unsigned getPartForRecipe() const;
+
#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
/// Print the recipe.
void print(raw_ostream &O, const Twine &Indent,
@@ -2827,6 +2846,9 @@ class VPScalarIVStepsRecipe : public VPRecipeWithIRFlags {
"Op must be an operand of the recipe");
return true;
}
+
+ /// Return the current part for this scalar step.
+ unsigned getPartForRecipe() const;
};
/// VPBasicBlock serves as the leaf of the Hierarchical Control-Flow Graph. It
@@ -3145,6 +3167,8 @@ class VPlan {
/// Represents the loop-invariant VF * UF of the vector loop region.
VPValue VFxUF;
+ VPValue VF;
+
/// Holds a mapping between Values and their corresponding VPValue inside
/// VPlan.
Value2VPValueTy Value2VPValue;
@@ -3232,6 +3256,7 @@ class VPlan {
/// Returns VF * UF of the vector loop region.
VPValue &getVFxUF() { return VFxUF; }
+ VPValue &getVF() { return VF; }
void addVF(ElementCount VF) { VFs.insert(VF); }
@@ -3665,6 +3690,29 @@ inline bool isUniformAfterVectorization(VPValue *VPV) {
return VPI->isVectorToScalar();
return false;
}
+
+/// Checks if \p C is uniform across all VFs and UFs. It is considered as such
+/// if it is either defined outside the vector region or its operand is known to
+/// be uniform across all VFs and UFs (e.g. VPDerivedIV or VPCanonicalIVPHI).
+inline bool isUniformAcrossVFsAndUFs(VPValue *V) {
+ if (auto *VPI = dyn_cast_or_null<VPInstruction>(V->getDefiningRecipe())) {
+ return VPI ==
+ VPI->getParent()->getPlan()->getCanonicalIV()->getBackedgeValue();
+ }
+ if (isa<VPCanonicalIVPHIRecipe, VPDerivedIVRecipe, VPExpandSCEVRecipe>(V))
+ return true;
+ if (isa<VPReplicateRecipe>(V) && cast<VPReplicateRecipe>(V)->isUniform() &&
+ (isa<LoadInst, StoreInst>(V->getUnderlyingValue())) &&
+ all_of(V->getDefiningRecipe()->operands(),
+ [](VPValue *Op) { return Op->isDefinedOutsideVectorRegions(); }))
+ return true;
+
+ auto *C = dyn_cast_or_null<VPScalarCastRecipe>(V->getDefiningRecipe());
+ return C && (C->isDefinedOutsideVectorRegions() ||
+ isa<VPDerivedIVRecipe>(C->getOperand(0)) ||
+ isa<VPCanonicalIVPHIRecipe>(C->getOperand(0)));
+}
+
} // end namespace vputils
} // end namespace llvm
diff --git a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
index a3ff6395bb39e..50aae9f28c562 100644
--- a/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
+++ b/llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp
@@ -393,9 +393,9 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
if (Part != 0)
return State.get(this, 0, /*IsScalar*/ true);
+ unsigned UF = getInterleaveCount();
Value *ScalarTC = State.get(getOperand(0), {0, 0});
- Value *Step =
- createStepForVF(Builder, ScalarTC->getType(), State.VF, State.UF);
+ Value *Step = createStepForVF(Builder, ScalarTC->getType(), State.VF, UF);
Value *Sub = Builder.CreateSub(ScalarTC, Step);
Value *Cmp = Builder.CreateICmp(CmpInst::Predicate::ICMP_UGT, ScalarTC, Step);
Value *Zero = ConstantInt::get(ScalarTC->getType(), 0);
@@ -428,14 +428,15 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
}
case VPInstruction::CanonicalIVIncrementForPart: {
auto *IV = State.get(getOperand(0), VPIteration(0, 0));
- if (Part == 0)
- return IV;
-
- // The canonical IV is incremented by the vectorization factor (num of SIMD
- // elements) times the unroll part.
- Value *Step = createStepForVF(Builder, IV->getType(), State.VF, Part);
- return Builder.CreateAdd(IV, Step, Name, hasNoUnsignedWrap(),
- hasNoSignedWrap());
+ if (getNumOperands() == 2) {
+ // The canonical IV is incremented by the vectorization factor (num of
+ // SIMD elements) times the unroll part.
+ Value *Step = createStepForVF(Builder, IV->getType(), State.VF,
+ getInterleaveCount());
+ return Builder.CreateAdd(IV, Step, Name, hasNoUnsignedWrap(),
+ hasNoSignedWrap());
+ }
+ return IV;
}
case VPInstruction::BranchOnCond: {
if (Part != 0)
@@ -483,8 +484,7 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
return CondBr;
}
case VPInstruction::ComputeReductionResult: {
- if (Part != 0)
- return State.get(this, 0, /*IsScalar*/ true);
+ unsigned NumParts = getNumOperands() - 1;
// FIXME: The cross-recipe dependency on VPReductionPHIRecipe is temporary
// and will be removed by breaking up the recipe further.
@@ -495,11 +495,10 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
RecurKind RK = RdxDesc.getRecurrenceKind();
- VPValue *LoopExitingDef = getOperand(1);
Type *PhiTy = OrigPhi->getType();
- VectorParts RdxParts(State.UF);
- for (unsigned Part = 0; Part < State.UF; ++Part)
- RdxParts[Part] = State.get(LoopExitingDef, Part, PhiR->isInLoop());
+ VectorParts RdxParts(NumParts);
+ for (unsigned Part = 0; Part != NumParts; ++Part)
+ RdxParts[Part] = State.get(getOperand(1 + Part), 0, PhiR->isInLoop());
// If the vector reduction can be performed in a smaller type, we truncate
// then extend the loop exit value to enable InstCombine to evaluate the
@@ -507,7 +506,7 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
// TODO: Handle this in truncateToMinBW.
if (State.VF.isVector() && PhiTy != RdxDesc.getRecurrenceType()) {
Type *RdxVecTy = VectorType::get(RdxDesc.getRecurrenceType(), State.VF);
- for (unsigned Part = 0; Part < State.UF; ++Part)
+ for (unsigned Part = 0; Part < NumParts; ++Part)
RdxParts[Part] = Builder.CreateTrunc(RdxParts[Part], RdxVecTy);
}
// Reduce all of the unrolled parts into a single vector.
@@ -517,12 +516,12 @@ Value *VPInstruction::generatePerPart(VPTransformState &State, unsigned Part) {
Op = Instruction::Or;
if (PhiR->isOrdered()) {
- ReducedPartRdx = RdxParts[State.UF - 1];
+ ReducedPartRdx = RdxParts[NumParts - 1];
} else {
// Floating-point operations should have some FMF to enable the reduction.
IRBuilderBase::FastMathFlagGuard FMFG(Builder);
Builder.setFastMathFlags(RdxDesc.getFastMathFlags());
- for (unsigned Part = 1; Part < State.UF; ++Part) {
+ for (unsigned Part = 1; Part < NumParts; ++Part) {
Value *RdxPart = RdxParts[Part];
if (Op != Instruction::ICmp && Op != Instruction::FCmp)
ReducedPartRdx = Builder.CreateBinOp(
@@ -603,6 +602,13 @@ bool VPInstruction::isVectorToScalar() const {
getOpcode() == VPInstruction::ComputeReductionResult;
}
+unsigned VPInstruction::getInterleaveCount() const {
+ return getNumOperands() == 1
+ ? 1
+ : cast<ConstantInt>(getOperand(1)->getLiveInIRValue())
+ ->getZExtValue();
+}
+
#if !defined(NDEBUG)
bool VPInstruction::isFPMathOp() const {
// Inspired by FPMathOperator::classof. Notable differences are that we don't
@@ -1215,24 +1221,32 @@ void VPWidenIntOrFpInductionRecipe::execute(VPTransformState &State) {
MulOp = Instruction::FMul;
}
- // Multiply the vectorization factor by the step using integer or
- // floating-point arithmetic as appropriate.
- Type *StepType = Step->getType();
- Value *RuntimeVF;
- if (Step->getType()->isFloatingPointTy())
- RuntimeVF = getRuntimeVFAsFloat(Builder, StepType, State.VF);
- else
- RuntimeVF = getRuntimeVF(Builder, StepType, State.VF);
- Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);
-
- // Create a vector splat to use in the induction update.
- //
- // FIXME: If the step is non-constant, we create the vector splat with
- // IRBuilder. IRBuilder can constant-fold the multiply, but it doesn't
- // handle a constant vector splat.
- Value *SplatVF = isa<Constant>(Mul)
- ? ConstantVector::getSplat(State.VF, cast<Constant>(Mul))
- : Builder.CreateVectorSplat(State.VF, Mul);
+ Value *SplatVF;
+ if (getNumOperands() == 4) {
+ // Need to create stuff in PH.
+ SplatVF = State.get(getOperand(2), 0);
+ } else {
+
+ // Multiply the vectorization factor by the step using integer or
+ // floating-point arithmetic as appropriate.
+ Type *StepType = Step->getType();
+ Value *RuntimeVF;
+ if (Step->getType()->isFloatingPointTy())
+ RuntimeVF = getRuntimeVFAsFloat(Builder, StepType, State.VF);
+ else
+ RuntimeVF = getRuntimeVF(Builder, StepType, State.VF);
+ Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);
+
+ // Create a vector splat to use in the induction update.
+ //
+ // FIXME: If the step is non-constant, we create the vector splat with
+ // IRBuilder. IRBuilder can constant-fold the multiply, but it
+ // doesn't handle a constant vector splat.
+ SplatVF = isa<Constant>(Mul)
+ ? ConstantVector::getSplat(State.VF, cast<Constant>(Mul))
+ : Builder.CreateVectorSplat(State.VF, Mul);
+ }
+
Builder.restoreIP(CurrIP);
// We may need to add the step a number of times, depending on the unroll
@@ -1362,7 +1376,8 @@ void VPScalarIVStepsRecipe::execute(VPTransformState &State) {
EndLane = StartLane + 1;
}
for (unsigned Part = StartPart; Part < EndPart; ++Part) {
- Value *StartIdx0 = createStepForVF(Builder, IntStepTy, State.VF, Part);
+ Value *StartIdx0 =
+ cre...
[truncated]
|
1a08a04
to
c927070
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First scan, some comments may be obsolete - drafted before InterleaveState was introduced.
In some cases splitting recipes by dedicating a separate recipe to code generated inside the preheader may simplify interleaving, as the latter need not be cloned across UF.
@@ -3476,10 +3476,7 @@ void InnerLoopVectorizer::fixFixedOrderRecurrence(VPLiveOut *LO, | |||
// initial value for the recurrence when jumping to the scalar loop. | |||
VPValue *VPExtract = LO->getOperand(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this still be called "Extract" given that the assert for ExtractFromEnd is dropped, or simply
VPExtract
>> VPLastElement
,
// Extract the last vector element ...
>> // Retrieve the last vector element ...
?
unsigned CurrentPart = 0; | ||
if (getNumOperands() == 5) | ||
CurrentPart = | ||
cast<ConstantInt>(getOperand(4)->getLiveInIRValue())->getZExtValue(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define CurrentPart
later, closer to where it's needed?
Better record the constant Part as an unsigned field, possibly zero if needed, rather than disrupting the traditional operands?
@@ -821,6 +821,10 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV, | |||
// FIXME: Model VF * UF computation completely in VPlan. | |||
VFxUF.setUnderlyingValue( | |||
createStepForVF(Builder, TripCountV->getType(), State.VF, State.UF)); | |||
if (VF.getNumUsers() > 0) { | |||
VF.setUnderlyingValue( | |||
createStepForVF(Builder, TripCountV->getType(), State.VF, 1)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
createStepForVF(Builder, TripCountV->getType(), State.VF, 1)); | |
getRuntimeVF(Builder, TripCountV->getType(), State.VF)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -890,6 +895,9 @@ void VPlan::execute(VPTransformState *State) { | |||
// Move the last step to the end of the latch block. This ensures | |||
// consistent placement of all induction updates. | |||
Instruction *Inc = cast<Instruction>(Phi->getIncomingValue(1)); | |||
if (isa<VPWidenIntOrFpInductionRecipe>(&R) && R.getNumOperands() == 4) | |||
Inc->setOperand(0, State->get(R.getOperand(3), 0)); | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to add some explanation why Inc's first operand is set to R's last operand (if it has four).
Better to place this after Inc->moveBefore()
below or before //Move the last
explanation above.
BTW, above setting of Phi also suffers from WidenPointer recipe producing both a phi and a gep.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment and moved, thanks!
@@ -184,7 +184,7 @@ class VPValue { | |||
/// Returns true if the VPValue is defined outside any vector regions, i.e. it | |||
/// is a live-in value. | |||
/// TODO: Also handle recipes defined in pre-header blocks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This TODO is now taken care of? Can be pushed separately, or tested only w/ this patch? Can be pushed independently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can be pushed independently, but I think not visible with the current tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, so the TODO should be cleared, along with the "i.e. it is a live-in value" preceding it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reminder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed, thanks!
1, InterleavedValues.getInterleavedValue(H.getOperand(1), IC - 1)); | ||
continue; | ||
} | ||
if (InterleavedValues.contains(H.getVPSingleValue()) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add some explanation, along with what I
stands for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment to document the while loop
return; | ||
InterleaveState InterleavedValues; | ||
|
||
SmallPtrSet<VPRecipeBase *, 8> ToSkip; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explain why ToSkip is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
void addUniform(VPSingleDefRecipe *R, unsigned IC) { | ||
auto Ins = InterleavedValues.insert({R, {}}); | ||
for (unsigned I = 1; I != IC; ++I) | ||
Ins.first->second.push_back(R); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An alternative is for getInterleavedValue() to treat uniform values as live-ins or Part0 values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left as is for now, as there are some cases which need handling here but not caught by uniform analysis I think
} | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document. Handles non-uniform, non-header-phi recipes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added brief comment, thanks!
if (!ScalarVFOnly) { | ||
InterleavedValues.remapOperands(&R, IC - 1); | ||
return; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also handle ScalarVFOnly case here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
c927070
to
8bd7b65
Compare
This patch implements explicit interleaving as VPlan transform. In follow up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in a each recipe). It also allows for more general optimziations (e.g. avoid generating code for recipes that are uniform-across parts). In the initial implementation, a number of recipes still take the unrolled part as additional, optional argument, if their execution depends on the unrolled part. The computation for start/step values for scalable inductions changed slightly. Previously the step would be computed as scalar and then splatted, now vscale gets splatted and multiplied by the step in a vector mul. This has been split off llvm#94339 which also includes changes to simplify VPTransfomState and recipes' ::execute. The current version mostly leaves existing ::execute untouched and instead sets VPTransfomState::UF to 1.
6f0eb83
to
74ec867
Compare
74ec867
to
4abc317
Compare
ping :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still have parts to go over, but various comments have accumulated so starting the pipe, yet to be completed.
One issue worth discussing is names:
- "interleaving" may be confused with "interleave-groups"
- IC is synonymous to UF and to NumParts, implying that Interleaving is aka Unrolling ("in-place", and Jamming). E.g., "interleaveByUF()" should accept UF as parameter rather than IC, and perhaps better called unrollByUF().
- "replicate" as in recipe and region is also synonymous to "interleaving", i.e., making copies in-place.
Should we pick one term for consistency, throughout - "replicating" or "unrolling" rather than "interleaving" to avoid confusion with interleave-groups?
Another issue is retaining State.UF (set to 1) and having execute()'s iterate over its Parts, as a gradual step with reduced diff. This may raise some eyebrows when looking at the code, even if temporarily until State.UF and its loops are retired. Is it worth trying to keep both versions alive, say with a flag, or will that only complicate things further, needlessly, and detrimental - inflating back the patch size.
DebugLoc DL = {}, const Twine &Name = "", | ||
FastMathFlags FMFs = {}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, for consistency:
DebugLoc DL = {}, const Twine &Name = "", | |
FastMathFlags FMFs = {}) { | |
FastMathFlags FMFs = {}, DebugLoc DL = {}, const Twine &Name = "") { |
Flags tend to follow Operands as they can be considered optional constant operands (as in MLIR's Attributes), Name tends to appear last. Matching the order in the delegated call to VPInstruction's constructor.
Both callers below feed all parameters anyway, although one could use the default empty name (only).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
VPInstruction *Mul; | ||
if (IVTy->isFloatingPointTy()) | ||
Mul = Builder.createFPOp(Instruction::FMul, {Step, Scale}, | ||
IV->getDebugLoc(), "", FMFs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IV->getDebugLoc(), "", FMFs); | |
FMFs, IV->getDebugLoc()); |
can drop empty Name parameter if last, consistent with following call to createNaryOp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted thanks
@@ -7392,6 +7392,8 @@ LoopVectorizationPlanner::executePlan( | |||
"expanded SCEVs to reuse can only be used during epilogue vectorization"); | |||
(void)IsEpilogueVectorization; | |||
|
|||
VPlanTransforms::interleaveByUF( | |||
BestVPlan, BestUF, OrigLoop->getHeader()->getModule()->getContext()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potential TODO thought: have interleaveByUF() and its follower optimizeForVFAndUF() be (a late) part of the planning stage, facilitating more accurate cost estimation, rather than here at the execution stage? This involves fixing the VF(s) and keeping a single VPlan - for the best VF (or two - for main and epilog), and determining best UF, before these transforms are applied.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good for a follow-up, this will help improve cost modeling once we completed the transition to the VPlan-based cost model.
Added TODO.
@@ -572,8 +572,7 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) { | |||
return SplitBlock; | |||
} | |||
|
|||
VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() { | |||
VPRegionBlock *P = getParent(); | |||
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment that this is (only) to support both const and non-const versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done thanks!
@@ -846,6 +850,7 @@ void VPlan::prepareToExecute(Value *TripCountV, Value *VectorTripCountV, | |||
/// Assumes a single pre-header basic-block was created for this. Introduce | |||
/// additional basic-blocks as needed, and fill them all. | |||
void VPlan::execute(VPTransformState *State) { | |||
State->UF = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, InnerLoopUnroller used to override a few methods, but indeed seems ready to be retired.
R.getVPSingleValue()->replaceAllUsesWith( | ||
InterleavedValues.getInterleavedValue(Op0, IC - Offset)); | ||
} else { | ||
InterleavedValues.remapOperands(&R, IC - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Worth explaining - extracting from end of (vector of) last part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done thanks!
if (auto *RepR = dyn_cast<VPReplicateRecipe>(&R)) { | ||
if (isa<StoreInst>(RepR->getUnderlyingValue()) && | ||
RepR->getOperand(1)->isDefinedOutsideVectorRegions()) { | ||
InterleavedValues.remapOperands(&R, IC - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stores need not be added as they have no users, if invariant address suffice to keep last part?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment thanks
} | ||
if (auto *II = dyn_cast<IntrinsicInst>(RepR->getUnderlyingValue())) { | ||
if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl) { | ||
InterleavedValues.addUniform(RepR, IC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Free of operands.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, also treated as uniforms so remapping wouldn't be needed.
|
||
if (isa<VPInstruction>(&R) && | ||
vputils::onlyFirstPartUsed(R.getVPSingleValue())) { | ||
InterleavedValues.addUniform(cast<VPInstruction>(&R), IC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about (re)mapping operands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not needed as we only use the first part.
InterleavedValues.addUniform(cast<VPInstruction>(&R), IC); | ||
return; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some explanation what's coming next - replicating non-uniforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed comments and also moved most unroll helper functions to UnrollState, to reduce number of parameters passed and improve clarity
Still have parts to go over, but various comments have accumulated so starting the pipe, yet to be completed.
One issue worth discussing is names:
"interleaving" may be confused with "interleave-groups"
IC is synonymous to UF and to NumParts, implying that Interleaving is aka Unrolling ("in-place", and Jamming). E.g., "interleaveByUF()" should accept UF as parameter rather than IC, and perhaps better called unrollByUF().
"replicate" as in recipe and region is also synonymous to "interleaving", i.e., making copies in-place.
Should we pick one term for consistency, throughout - "replicating" or "unrolling" rather than "interleaving" to avoid confusion with interleave-groups?
Excellent point, tried to consistently use unroll in the latest version.
Another issue is retaining State.UF (set to 1) and having execute()'s iterate over its Parts, as a gradual step with reduced diff. This may raise some eyebrows when looking at the code, even if temporarily until State.UF and its loops are retired. Is it worth trying to keep both versions alive, say with a flag, or will that only complicate things further, needlessly, and detrimental - inflating back the patch size.
I don't think a flag is needed (and may be even more confusing), if we follow up swiftly with the mechanical change that removes the loops over parts in ::execute.
DebugLoc DL = {}, const Twine &Name = "", | ||
FastMathFlags FMFs = {}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, thanks!
@@ -7392,6 +7392,8 @@ LoopVectorizationPlanner::executePlan( | |||
"expanded SCEVs to reuse can only be used during epilogue vectorization"); | |||
(void)IsEpilogueVectorization; | |||
|
|||
VPlanTransforms::interleaveByUF( | |||
BestVPlan, BestUF, OrigLoop->getHeader()->getModule()->getContext()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good for a follow-up, this will help improve cost modeling once we completed the transition to the VPlan-based cost model.
Added TODO.
@@ -572,8 +572,7 @@ VPBasicBlock *VPBasicBlock::splitAt(iterator SplitAt) { | |||
return SplitBlock; | |||
} | |||
|
|||
VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() { | |||
VPRegionBlock *P = getParent(); | |||
template <typename T> static T *getEnclosingLoopRegionImpl(T *P) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done thanks!
@@ -582,6 +581,14 @@ VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() { | |||
return P; | |||
} | |||
|
|||
const VPRegionBlock *VPBasicBlock::getEnclosingLoopRegion() const { | |||
return getEnclosingLoopRegionImpl(getParent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to getEnclosingLoopRegionForRegion
, thanks!
Passing the region block keeps the template simpler, otherwise we would need to specify the return type I think
if (VF.getNumUsers() > 0) { | ||
VF.setUnderlyingValue( | ||
getRuntimeVF(Builder, TripCountV->getType(), State.VF)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved thanks.
Probably best to first lane #95305 which adds VF separately.
} | ||
if (auto *II = dyn_cast<IntrinsicInst>(RepR->getUnderlyingValue())) { | ||
if (II->getIntrinsicID() == Intrinsic::experimental_noalias_scope_decl) { | ||
InterleavedValues.addUniform(RepR, IC); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, also treated as uniforms so remapping wouldn't be needed.
InterleavedValues.addUniform(cast<VPInstruction>(&R), IC); | ||
return; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added thanks!
interleaveBlock(VPB, Plan, IC, InterleavedValues, TypeInfo, ToSkip, | ||
PhisToRemap); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That should have been added a while ago
@@ -106,6 +106,8 @@ struct VPlanTransforms { | |||
/// this transformation. | |||
/// \returns true if the transformation succeeds, or false if it doesn't. | |||
static bool tryAddExplicitVectorLength(VPlan &Plan); | |||
|
|||
static void interleaveByUF(VPlan &Plan, unsigned IC, LLVMContext &Ctx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated and also renamed interleave->unroll
; CHECK-NEXT: vp<[[VPTR3:%.+]]> = vector-pointer vp<[[PADD1]]> | ||
; CHECK-NEXT: vp<[[VPTR4:%.+]]> = vector-pointer vp<[[PADD1]]> | ||
; CHECK-NEXT: WIDEN store vp<[[VPTR3]]>, ir<%add> | ||
; CHECK-NEXT: WIDEN store vp<[[VPTR4]]>, ir<%add>.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not redundant,vp<[[VPTR4:%.+]]> = vector-pointer vp<[[PADD1]]>
was missing checking for the parts operand (ir<1>
) which I added now.
✅ With the latest revision this PR passed the C/C++ code formatter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Latest comments should be addressed. Also moved the unroll code to a separate file.
Also split off the isDefinedOutsideLoopRegions changes to 4eb9838 after landing VPlan-based LICM, which allowed for independent testing.
// The newly added recipes are added to ToSkip to avoid interleaving them | ||
// again. | ||
VPValue *Prev = IV; | ||
Builder.setInsertPoint(IV->getParent(), InsertPtForPhi); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to replace cerateFPOp with createNaryOp which takes optional fast-math flags , allowing it to unify the code.
Could ID.getInductionOpcode() work for non-floats too?
Unfortunately no, as it may be nullptr in some cases for ints (e.g. extends)
VPValue *Op0; | ||
VPValue *Op1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks
} | ||
|
||
if (vputils::onlyFirstPartUsed(VPI)) { | ||
addUniformForAllParts(cast<VPInstruction>(&R)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CalculateTripCountMinusVF
is now handled by isUniformAcrossVFandUF, so removed here. Updated to pass VPI, thanks!
continue; | ||
} | ||
|
||
auto *SingleDef = dyn_cast<VPSingleDefRecipe>(&R); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left as is for now, as something like all_of(R.definedvalues(), [](VPValue *Op) { return vputils::isUniformAcrossVFsAndUFs(Op); })
would also include recipes without any defined values
DenseMap<VPValue *, SmallVector<VPValue *>> VPV2Parts; | ||
|
||
void unrollReplicateRegion(VPRegionBlock *VPR); | ||
void unrollRecipe(VPRecipeBase &R); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
// ..., UF-1. | ||
DenseMap<VPValue *, SmallVector<VPValue *>> VPV2Parts; | ||
|
||
void unrollReplicateRegion(VPRegionBlock *VPR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, thanks!
LLVM Buildbot has detected a new failure on builder Full details are available at: https://lab.llvm.org/buildbot/#/builders/146/builds/1205 Here is the relevant piece of the build log for the reference
|
Also landed 06c3a7d as follow-up to remove VPTransformState::UF, will follow up with more clean-ups simplifying VPTransformState tomorrow. |
This patch implements explicit unrolling by UF as VPlan transform. In follow up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in an each recipe). It also allows for more general optimziations (e.g. avoid generating code for recipes that are uniform-across parts). It also unifies the logic dealing with unrolled parts in a single place, rather than spreading it out across multiple places (e.g. VPlan post processing for header-phi recipes previously.) In the initial implementation, a number of recipes still take the unrolled part as additional, optional argument, if their execution depends on the unrolled part. The computation for start/step values for scalable inductions changed slightly. Previously the step would be computed as scalar and then splatted, now vscale gets splatted and multiplied by the step in a vector mul. This has been split off llvm#94339 which also includes changes to simplify VPTransfomState and recipes' ::execute. The current version mostly leaves existing ::execute untouched and instead sets VPTransfomState::UF to 1. A follow-up patch will clean up all references to VPTransformState::UF. Another follow-up patch will simplify VPTransformState to only store a single vector value per VPValue. PR: llvm#95842
State.UF is not needed any longer after 8ec4067 (llvm#95842). Clean it up, simplifying ::execute of existing recipes.
After 8ec4067 (llvm#95842), VPTransformState only stores a single vector value per VPValue. Simplify the code by replacing the SmallVector in PerPartOutput with a single Value * and rename to VPV2Vector for clarity. Also remove the redundant Part argument from various accessors.
After 8ec4067 (llvm#95842), VPTransformState only stores a single scalar vector per VPValue. Simplify the code by replacing the nested SmallVector in PerPartScalars with a single SmallVector and rename to VPV2Scalars for clarity.
After 8ec4067 (llvm#95842), only the lane part of VPIteration is used. Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
Update isDefinedOutsideLoopRegions to check if a recipe is defined outside any region. Split off already approved llvm#95842 now that this can be tested separately after landing VPlan-based LICM llvm#107501
This patch implements explicit unrolling by UF as VPlan transform. In follow up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in an each recipe). It also allows for more general optimziations (e.g. avoid generating code for recipes that are uniform-across parts). It also unifies the logic dealing with unrolled parts in a single place, rather than spreading it out across multiple places (e.g. VPlan post processing for header-phi recipes previously.) In the initial implementation, a number of recipes still take the unrolled part as additional, optional argument, if their execution depends on the unrolled part. The computation for start/step values for scalable inductions changed slightly. Previously the step would be computed as scalar and then splatted, now vscale gets splatted and multiplied by the step in a vector mul. This has been split off llvm#94339 which also includes changes to simplify VPTransfomState and recipes' ::execute. The current version mostly leaves existing ::execute untouched and instead sets VPTransfomState::UF to 1. A follow-up patch will clean up all references to VPTransformState::UF. Another follow-up patch will simplify VPTransformState to only store a single vector value per VPValue. PR: llvm#95842
State.UF is not needed any longer after 8ec4067 (llvm#95842). Clean it up, simplifying ::execute of existing recipes.
After 8ec4067 (llvm#95842), VPTransformState only stores a single vector value per VPValue. Simplify the code by replacing the SmallVector in PerPartOutput with a single Value * and rename to VPV2Vector for clarity. Also remove the redundant Part argument from various accessors.
After 8ec4067 (llvm#95842), VPTransformState only stores a single scalar vector per VPValue. Simplify the code by replacing the nested SmallVector in PerPartScalars with a single SmallVector and rename to VPV2Scalars for clarity.
After 8ec4067 (llvm#95842), only the lane part of VPIteration is used. Simplify the code by replacing remaining uses of VPIteration with VPLane directly.
This patch implements explicit interleaving as VPlan transform. In follow up patches this will allow simplifying VPTransform state (no need to store unrolled parts) as well as recipe execution (no need to generate code for multiple parts in an each recipe). It also allows for more general optimziations (e.g. avoid generating code for recipes that are uniform-across parts).
It also unifies the logic dealing with unrolled parts in a single place, rather than spreading it out across multiple places (e.g. VPlan post processing for header-phi recipes previously.)
In the initial implementation, a number of recipes still take the unrolled part as additional, optional argument, if their execution depends on the unrolled part.
The computation for start/step values for scalable inductions changed slightly. Previously the step would be computed as scalar and then splatted, now vscale gets splatted and multiplied by the step in a vector mul.
This has been split off #94339 which also includes changes to simplify VPTransfomState and recipes' ::execute.
The current version mostly leaves existing ::execute untouched and instead sets VPTransfomState::UF to 1.
A follow-up patch will clean up all references to VPTransformState::UF.
Another follow-up patch will simplify VPTransformState to only store a single vector value per VPValue.