[X86] X86FixupVectorConstants - load+sign-extend vector constants that can be stored in a truncated form #79815
Conversation
@llvm/pr-subscribers-backend-x86

Author: Simon Pilgrim (RKSimon)

Changes

Reduce the size of the vector constant by storing it in the constant pool in a truncated form, and sign-extend it as part of the load. I've extended the existing FixupConstant functionality to support these sext constant rebuilds - we still select the smallest stored constant entry and prefer vzload/broadcast/vextload for same bitwidth to avoid domain flips.

I intend to add the matching load+zero-extend handling in a future PR, but that requires some alterations to the existing MC shuffle comments handling first.

NOTE: Some of the FixupConstant tables are currently created on the fly as they are dependent on the supported ISAs (HasAVX2 etc.) - should we split these (to allow initializer lists instead) and have duplicate FixupConstant calls to avoid additional stack use?

Patch is 6.04 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/79815.diff

223 Files Affected:
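To illustrate the effect on codegen, one of the updated test checks further down in this thread shows a full-width 128-bit integer constant load being replaced by a sign-extending load of an 8-byte constant (the element values are unchanged, only the stored width shrinks):

before: vmovdqa   {{.*#+}} xmm1 = [0,9,0,11,0,13,0,15]
after:  vpmovsxbw {{.*#+}} xmm1 = [0,9,0,11,0,13,0,15]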
diff --git a/llvm/lib/Target/X86/X86FixupVectorConstants.cpp b/llvm/lib/Target/X86/X86FixupVectorConstants.cpp
index d4af94c7f92ee7e..ee2bb15f6dcbc33 100644
--- a/llvm/lib/Target/X86/X86FixupVectorConstants.cpp
+++ b/llvm/lib/Target/X86/X86FixupVectorConstants.cpp
@@ -10,7 +10,7 @@
// replace them with smaller constant pool entries, including:
// * Converting AVX512 memory-fold instructions to their broadcast-fold form
// * Broadcasting of full width loads.
-// * TODO: Sign/Zero extension of full width loads.
+// * TODO: Zero extension of full width loads.
//
//===----------------------------------------------------------------------===//
@@ -216,8 +216,8 @@ static Constant *rebuildConstant(LLVMContext &Ctx, Type *SclTy,
// Attempt to rebuild a normalized splat vector constant of the requested splat
// width, built up of potentially smaller scalar values.
-static Constant *rebuildSplatableConstant(const Constant *C,
- unsigned SplatBitWidth) {
+static Constant *rebuildSplatCst(const Constant *C, unsigned NumElts,
+ unsigned SplatBitWidth) {
std::optional<APInt> Splat = getSplatableConstant(C, SplatBitWidth);
if (!Splat)
return nullptr;
@@ -238,8 +238,8 @@ static Constant *rebuildSplatableConstant(const Constant *C,
return rebuildConstant(OriginalType->getContext(), SclTy, *Splat, NumSclBits);
}
-static Constant *rebuildZeroUpperConstant(const Constant *C,
- unsigned ScalarBitWidth) {
+static Constant *rebuildZeroUpperCst(const Constant *C, unsigned NumElts,
+ unsigned ScalarBitWidth) {
Type *Ty = C->getType();
Type *SclTy = Ty->getScalarType();
unsigned NumBits = Ty->getPrimitiveSizeInBits();
@@ -265,57 +265,93 @@ static Constant *rebuildZeroUpperConstant(const Constant *C,
return nullptr;
}
-typedef std::function<Constant *(const Constant *, unsigned)> RebuildFn;
+static Constant *rebuildExtCst(const Constant *C, bool IsSExt,
+ unsigned NumElts,
+ unsigned SrcEltBitWidth) {
+ Type *Ty = C->getType();
+ unsigned NumBits = Ty->getPrimitiveSizeInBits();
+ unsigned DstEltBitWidth = NumBits / NumElts;
+ assert((NumBits % NumElts) == 0 && (NumBits % SrcEltBitWidth) == 0 &&
+ (DstEltBitWidth % SrcEltBitWidth) == 0 &&
+ (DstEltBitWidth > SrcEltBitWidth) && "Illegal extension width");
+
+ if (std::optional<APInt> Bits = extractConstantBits(C)) {
+ assert((Bits->getBitWidth() / DstEltBitWidth) == NumElts &&
+ (Bits->getBitWidth() % DstEltBitWidth) == 0 &&
+ "Unexpected constant extension");
+
+ // Ensure every vector element can be represented by the src bitwidth.
+ APInt TruncBits = APInt::getZero(NumElts * SrcEltBitWidth);
+ for (unsigned I = 0; I != NumElts; ++I) {
+ APInt Elt = Bits->extractBits(DstEltBitWidth, I * DstEltBitWidth);
+ if ((IsSExt && Elt.getSignificantBits() > SrcEltBitWidth) ||
+ (!IsSExt && Elt.getActiveBits() > SrcEltBitWidth))
+ return nullptr;
+ TruncBits.insertBits(Elt.trunc(SrcEltBitWidth), I * SrcEltBitWidth);
+ }
+
+ return rebuildConstant(Ty->getContext(), Ty->getScalarType(), TruncBits,
+ SrcEltBitWidth);
+ }
+
+ return nullptr;
+}
+static Constant *rebuildSExtCst(const Constant *C, unsigned NumElts,
+ unsigned SrcEltBitWidth) {
+ return rebuildExtCst(C, true, NumElts, SrcEltBitWidth);
+}
bool X86FixupVectorConstantsPass::processInstruction(MachineFunction &MF,
MachineBasicBlock &MBB,
MachineInstr &MI) {
unsigned Opc = MI.getOpcode();
MachineConstantPool *CP = MI.getParent()->getParent()->getConstantPool();
+ bool HasSSE41 = ST->hasSSE41();
bool HasAVX2 = ST->hasAVX2();
bool HasDQI = ST->hasDQI();
bool HasBWI = ST->hasBWI();
bool HasVLX = ST->hasVLX();
- auto FixupConstant =
- [&](unsigned OpBcst256, unsigned OpBcst128, unsigned OpBcst64,
- unsigned OpBcst32, unsigned OpBcst16, unsigned OpBcst8,
- unsigned OpUpper64, unsigned OpUpper32, unsigned OperandNo) {
- assert(MI.getNumOperands() >= (OperandNo + X86::AddrNumOperands) &&
- "Unexpected number of operands!");
-
- if (auto *C = X86::getConstantFromPool(MI, OperandNo)) {
- // Attempt to detect a suitable splat/vzload from increasing constant
- // bitwidths.
- // Prefer vzload vs broadcast for same bitwidth to avoid domain flips.
- std::tuple<unsigned, unsigned, RebuildFn> FixupLoad[] = {
- {8, OpBcst8, rebuildSplatableConstant},
- {16, OpBcst16, rebuildSplatableConstant},
- {32, OpUpper32, rebuildZeroUpperConstant},
- {32, OpBcst32, rebuildSplatableConstant},
- {64, OpUpper64, rebuildZeroUpperConstant},
- {64, OpBcst64, rebuildSplatableConstant},
- {128, OpBcst128, rebuildSplatableConstant},
- {256, OpBcst256, rebuildSplatableConstant},
- };
- for (auto [BitWidth, Op, RebuildConstant] : FixupLoad) {
- if (Op) {
- // Construct a suitable constant and adjust the MI to use the new
- // constant pool entry.
- if (Constant *NewCst = RebuildConstant(C, BitWidth)) {
- unsigned NewCPI =
- CP->getConstantPoolIndex(NewCst, Align(BitWidth / 8));
- MI.setDesc(TII->get(Op));
- MI.getOperand(OperandNo + X86::AddrDisp).setIndex(NewCPI);
- return true;
- }
- }
+ struct FixupEntry {
+ int Op;
+ int NumCstElts;
+ int BitWidth;
+ std::function<Constant *(const Constant *, unsigned, unsigned)>
+ RebuildConstant;
+ };
+ auto FixupConstant = [&](ArrayRef<FixupEntry> Fixups, unsigned OperandNo) {
+#ifdef EXPENSIVE_CHECKS
+ assert(llvm::is_sorted(Fixups,
+ [](const FixupEntry &A, const FixupEntry &B) {
+ return (A.NumCstElts * A.BitWidth) <
+ (B.NumCstElts * B.BitWidth);
+ }) &&
+ "Constant fixup table not sorted in ascending constant size");
+#endif
+ assert(MI.getNumOperands() >= (OperandNo + X86::AddrNumOperands) &&
+ "Unexpected number of operands!");
+ if (auto *C = X86::getConstantFromPool(MI, OperandNo)) {
+ for (const FixupEntry &Fixup : Fixups) {
+ if (Fixup.Op) {
+ // Construct a suitable constant and adjust the MI to use the new
+ // constant pool entry.
+ if (Constant *NewCst =
+ Fixup.RebuildConstant(C, Fixup.NumCstElts, Fixup.BitWidth)) {
+ unsigned NewCPI =
+ CP->getConstantPoolIndex(NewCst, Align(Fixup.BitWidth / 8));
+ MI.setDesc(TII->get(Fixup.Op));
+ MI.getOperand(OperandNo + X86::AddrDisp).setIndex(NewCPI);
+ return true;
}
}
- return false;
- };
+ }
+ }
+ return false;
+ };
- // Attempt to convert full width vector loads into broadcast/vzload loads.
+ // Attempt to detect a suitable vzload/broadcast/vextload from increasing
+ // constant bitwidths.
+ // Prefer vzload/broadcast/vextload for same bitwidth to avoid domain flips.
switch (Opc) {
/* FP Loads */
case X86::MOVAPDrm:
@@ -323,82 +359,161 @@ bool X86FixupVectorConstantsPass::processInstruction(MachineFunction &MF,
case X86::MOVUPDrm:
case X86::MOVUPSrm:
// TODO: SSE3 MOVDDUP Handling
- return FixupConstant(0, 0, 0, 0, 0, 0, X86::MOVSDrm, X86::MOVSSrm, 1);
+ return FixupConstant({{X86::MOVSSrm, 1, 32, rebuildZeroUpperCst},
+ {X86::MOVSDrm, 1, 64, rebuildZeroUpperCst}},
+ 1);
case X86::VMOVAPDrm:
case X86::VMOVAPSrm:
case X86::VMOVUPDrm:
case X86::VMOVUPSrm:
- return FixupConstant(0, 0, X86::VMOVDDUPrm, X86::VBROADCASTSSrm, 0, 0,
- X86::VMOVSDrm, X86::VMOVSSrm, 1);
+ return FixupConstant({{X86::VMOVSSrm, 1, 32, rebuildZeroUpperCst},
+ {X86::VBROADCASTSSrm, 1, 32, rebuildSplatCst},
+ {X86::VMOVSDrm, 1, 64, rebuildZeroUpperCst},
+ {X86::VMOVDDUPrm, 1, 64, rebuildSplatCst}},
+ 1);
case X86::VMOVAPDYrm:
case X86::VMOVAPSYrm:
case X86::VMOVUPDYrm:
case X86::VMOVUPSYrm:
- return FixupConstant(0, X86::VBROADCASTF128rm, X86::VBROADCASTSDYrm,
- X86::VBROADCASTSSYrm, 0, 0, 0, 0, 1);
+ return FixupConstant({{X86::VBROADCASTSSYrm, 1, 32, rebuildSplatCst},
+ {X86::VBROADCASTSDYrm, 1, 64, rebuildSplatCst},
+ {X86::VBROADCASTF128rm, 1, 128, rebuildSplatCst}},
+ 1);
case X86::VMOVAPDZ128rm:
case X86::VMOVAPSZ128rm:
case X86::VMOVUPDZ128rm:
case X86::VMOVUPSZ128rm:
- return FixupConstant(0, 0, X86::VMOVDDUPZ128rm, X86::VBROADCASTSSZ128rm, 0,
- 0, X86::VMOVSDZrm, X86::VMOVSSZrm, 1);
+ return FixupConstant({{X86::VMOVSSZrm, 1, 32, rebuildZeroUpperCst},
+ {X86::VBROADCASTSSZ128rm, 1, 32, rebuildSplatCst},
+ {X86::VMOVSDZrm, 1, 64, rebuildZeroUpperCst},
+ {X86::VMOVDDUPZ128rm, 1, 64, rebuildSplatCst}},
+ 1);
case X86::VMOVAPDZ256rm:
case X86::VMOVAPSZ256rm:
case X86::VMOVUPDZ256rm:
case X86::VMOVUPSZ256rm:
- return FixupConstant(0, X86::VBROADCASTF32X4Z256rm, X86::VBROADCASTSDZ256rm,
- X86::VBROADCASTSSZ256rm, 0, 0, 0, 0, 1);
+ return FixupConstant(
+ {{X86::VBROADCASTSSZ256rm, 1, 32, rebuildSplatCst},
+ {X86::VBROADCASTSDZ256rm, 1, 64, rebuildSplatCst},
+ {X86::VBROADCASTF32X4Z256rm, 1, 128, rebuildSplatCst}},
+ 1);
case X86::VMOVAPDZrm:
case X86::VMOVAPSZrm:
case X86::VMOVUPDZrm:
case X86::VMOVUPSZrm:
- return FixupConstant(X86::VBROADCASTF64X4rm, X86::VBROADCASTF32X4rm,
- X86::VBROADCASTSDZrm, X86::VBROADCASTSSZrm, 0, 0, 0, 0,
+ return FixupConstant({{X86::VBROADCASTSSZrm, 1, 32, rebuildSplatCst},
+ {X86::VBROADCASTSDZrm, 1, 64, rebuildSplatCst},
+ {X86::VBROADCASTF32X4rm, 1, 128, rebuildSplatCst},
+ {X86::VBROADCASTF64X4rm, 1, 256, rebuildSplatCst}},
1);
/* Integer Loads */
case X86::MOVDQArm:
- case X86::MOVDQUrm:
- return FixupConstant(0, 0, 0, 0, 0, 0, X86::MOVQI2PQIrm, X86::MOVDI2PDIrm,
- 1);
+ case X86::MOVDQUrm: {
+ FixupEntry Fixups[] = {
+ {HasSSE41 ? X86::PMOVSXBQrm : 0, 2, 8, rebuildSExtCst},
+ {X86::MOVDI2PDIrm, 1, 32, rebuildZeroUpperCst},
+ {HasSSE41 ? X86::PMOVSXBDrm : 0, 4, 8, rebuildSExtCst},
+ {HasSSE41 ? X86::PMOVSXWQrm : 0, 2, 16, rebuildSExtCst},
+ {X86::MOVQI2PQIrm, 1, 64, rebuildZeroUpperCst},
+ {HasSSE41 ? X86::PMOVSXBWrm : 0, 8, 8, rebuildSExtCst},
+ {HasSSE41 ? X86::PMOVSXWDrm : 0, 4, 16, rebuildSExtCst},
+ {HasSSE41 ? X86::PMOVSXDQrm : 0, 2, 32, rebuildSExtCst}};
+ return FixupConstant(Fixups, 1);
+ }
case X86::VMOVDQArm:
- case X86::VMOVDQUrm:
- return FixupConstant(0, 0, HasAVX2 ? X86::VPBROADCASTQrm : X86::VMOVDDUPrm,
- HasAVX2 ? X86::VPBROADCASTDrm : X86::VBROADCASTSSrm,
- HasAVX2 ? X86::VPBROADCASTWrm : 0,
- HasAVX2 ? X86::VPBROADCASTBrm : 0, X86::VMOVQI2PQIrm,
- X86::VMOVDI2PDIrm, 1);
+ case X86::VMOVDQUrm: {
+ FixupEntry Fixups[] = {
+ {HasAVX2 ? X86::VPBROADCASTBrm : 0, 1, 8, rebuildSplatCst},
+ {HasAVX2 ? X86::VPBROADCASTWrm : 0, 1, 16, rebuildSplatCst},
+ {X86::VPMOVSXBQrm, 2, 8, rebuildSExtCst},
+ {X86::VMOVDI2PDIrm, 1, 32, rebuildZeroUpperCst},
+ {HasAVX2 ? X86::VPBROADCASTDrm : X86::VBROADCASTSSrm, 1, 32,
+ rebuildSplatCst},
+ {X86::VPMOVSXBDrm, 4, 8, rebuildSExtCst},
+ {X86::VPMOVSXWQrm, 2, 16, rebuildSExtCst},
+ {X86::VMOVQI2PQIrm, 1, 64, rebuildZeroUpperCst},
+ {HasAVX2 ? X86::VPBROADCASTQrm : X86::VMOVDDUPrm, 1, 64,
+ rebuildSplatCst},
+ {X86::VPMOVSXBWrm, 8, 8, rebuildSExtCst},
+ {X86::VPMOVSXWDrm, 4, 16, rebuildSExtCst},
+ {X86::VPMOVSXDQrm, 2, 32, rebuildSExtCst}};
+ return FixupConstant(Fixups, 1);
+ }
case X86::VMOVDQAYrm:
- case X86::VMOVDQUYrm:
- return FixupConstant(
- 0, HasAVX2 ? X86::VBROADCASTI128rm : X86::VBROADCASTF128rm,
- HasAVX2 ? X86::VPBROADCASTQYrm : X86::VBROADCASTSDYrm,
- HasAVX2 ? X86::VPBROADCASTDYrm : X86::VBROADCASTSSYrm,
- HasAVX2 ? X86::VPBROADCASTWYrm : 0, HasAVX2 ? X86::VPBROADCASTBYrm : 0,
- 0, 0, 1);
+ case X86::VMOVDQUYrm: {
+ FixupEntry Fixups[] = {
+ {HasAVX2 ? X86::VPBROADCASTBYrm : 0, 1, 8, rebuildSplatCst},
+ {HasAVX2 ? X86::VPBROADCASTWYrm : 0, 1, 16, rebuildSplatCst},
+ {HasAVX2 ? X86::VPBROADCASTDYrm : X86::VBROADCASTSSYrm, 1, 32,
+ rebuildSplatCst},
+ {HasAVX2 ? X86::VPMOVSXBQYrm : 0, 4, 8, rebuildSExtCst},
+ {HasAVX2 ? X86::VPBROADCASTQYrm : X86::VBROADCASTSDYrm, 1, 64,
+ rebuildSplatCst},
+ {HasAVX2 ? X86::VPMOVSXBDYrm : 0, 8, 8, rebuildSExtCst},
+ {HasAVX2 ? X86::VPMOVSXWQYrm : 0, 4, 16, rebuildSExtCst},
+ {HasAVX2 ? X86::VBROADCASTI128rm : X86::VBROADCASTF128rm, 1, 128,
+ rebuildSplatCst},
+ {HasAVX2 ? X86::VPMOVSXBWYrm : 0, 16, 8, rebuildSExtCst},
+ {HasAVX2 ? X86::VPMOVSXWDYrm : 0, 8, 16, rebuildSExtCst},
+ {HasAVX2 ? X86::VPMOVSXDQYrm : 0, 4, 32, rebuildSExtCst}};
+ return FixupConstant(Fixups, 1);
+ }
case X86::VMOVDQA32Z128rm:
case X86::VMOVDQA64Z128rm:
case X86::VMOVDQU32Z128rm:
- case X86::VMOVDQU64Z128rm:
- return FixupConstant(0, 0, X86::VPBROADCASTQZ128rm, X86::VPBROADCASTDZ128rm,
- HasBWI ? X86::VPBROADCASTWZ128rm : 0,
- HasBWI ? X86::VPBROADCASTBZ128rm : 0,
- X86::VMOVQI2PQIZrm, X86::VMOVDI2PDIZrm, 1);
+ case X86::VMOVDQU64Z128rm: {
+ FixupEntry Fixups[] = {
+ {HasBWI ? X86::VPBROADCASTBZ128rm : 0, 1, 8, rebuildSplatCst},
+ {HasBWI ? X86::VPBROADCASTWZ128rm : 0, 1, 16, rebuildSplatCst},
+ {X86::VPMOVSXBQZ128rm, 2, 8, rebuildSExtCst},
+ {X86::VMOVDI2PDIZrm, 1, 32, rebuildZeroUpperCst},
+ {X86::VPBROADCASTDZ128rm, 1, 32, rebuildSplatCst},
+ {X86::VPMOVSXBDZ128rm, 4, 8, rebuildSExtCst},
+ {X86::VPMOVSXWQZ128rm, 2, 16, rebuildSExtCst},
+ {X86::VMOVQI2PQIZrm, 1, 64, rebuildZeroUpperCst},
+ {X86::VPBROADCASTQZ128rm, 1, 64, rebuildSplatCst},
+ {HasBWI ? X86::VPMOVSXBWZ128rm : 0, 8, 8, rebuildSExtCst},
+ {X86::VPMOVSXWDZ128rm, 4, 16, rebuildSExtCst},
+ {X86::VPMOVSXDQZ128rm, 2, 32, rebuildSExtCst}};
+ return FixupConstant(Fixups, 1);
+ }
case X86::VMOVDQA32Z256rm:
case X86::VMOVDQA64Z256rm:
case X86::VMOVDQU32Z256rm:
- case X86::VMOVDQU64Z256rm:
- return FixupConstant(0, X86::VBROADCASTI32X4Z256rm, X86::VPBROADCASTQZ256rm,
- X86::VPBROADCASTDZ256rm,
- HasBWI ? X86::VPBROADCASTWZ256rm : 0,
- HasBWI ? X86::VPBROADCASTBZ256rm : 0, 0, 0, 1);
+ case X86::VMOVDQU64Z256rm: {
+ FixupEntry Fixups[] = {
+ {HasBWI ? X86::VPBROADCASTBZ256rm : 0, 1, 8, rebuildSplatCst},
+ {HasBWI ? X86::VPBROADCASTWZ256rm : 0, 1, 16, rebuildSplatCst},
+ {X86::VPBROADCASTDZ256rm, 1, 32, rebuildSplatCst},
+ {X86::VPMOVSXBQZ256rm, 4, 8, rebuildSExtCst},
+ {X86::VPBROADCASTQZ256rm, 1, 64, rebuildSplatCst},
+ {X86::VPMOVSXBDZ256rm, 8, 8, rebuildSExtCst},
+ {X86::VPMOVSXWQZ256rm, 4, 16, rebuildSExtCst},
+ {X86::VBROADCASTI32X4Z256rm, 1, 128, rebuildSplatCst},
+ {HasBWI ? X86::VPMOVSXBWZ256rm : 0, 16, 8, rebuildSExtCst},
+ {X86::VPMOVSXWDZ256rm, 8, 16, rebuildSExtCst},
+ {X86::VPMOVSXDQZ256rm, 4, 32, rebuildSExtCst}};
+ return FixupConstant(Fixups, 1);
+ }
case X86::VMOVDQA32Zrm:
case X86::VMOVDQA64Zrm:
case X86::VMOVDQU32Zrm:
- case X86::VMOVDQU64Zrm:
- return FixupConstant(X86::VBROADCASTI64X4rm, X86::VBROADCASTI32X4rm,
- X86::VPBROADCASTQZrm, X86::VPBROADCASTDZrm,
- HasBWI ? X86::VPBROADCASTWZrm : 0,
- HasBWI ? X86::VPBROADCASTBZrm : 0, 0, 0, 1);
+ case X86::VMOVDQU64Zrm: {
+ FixupEntry Fixups[] = {
+ {HasBWI ? X86::VPBROADCASTBZrm : 0, 1, 8, rebuildSplatCst},
+ {HasBWI ? X86::VPBROADCASTWZrm : 0, 1, 16, rebuildSplatCst},
+ {X86::VPBROADCASTDZrm, 1, 32, rebuildSplatCst},
+ {X86::VPBROADCASTQZrm, 1, 64, rebuildSplatCst},
+ {X86::VPMOVSXBQZrm, 8, 8, rebuildSExtCst},
+ {X86::VBROADCASTI32X4rm, 1, 128, rebuildSplatCst},
+ {X86::VPMOVSXBDZrm, 16, 8, rebuildSExtCst},
+ {X86::VPMOVSXWQZrm, 8, 16, rebuildSExtCst},
+ {X86::VBROADCASTI64X4rm, 1, 256, rebuildSplatCst},
+ {HasBWI ? X86::VPMOVSXBWZrm : 0, 32, 8, rebuildSExtCst},
+ {X86::VPMOVSXWDZrm, 16, 16, rebuildSExtCst},
+ {X86::VPMOVSXDQZrm, 8, 32, rebuildSExtCst}};
+ return FixupConstant(Fixups, 1);
+ }
}
auto ConvertToBroadcastAVX512 = [&](unsigned OpSrc32, unsigned OpSrc64) {
@@ -423,7 +538,9 @@ bool X86FixupVectorConstantsPass::processInstruction(MachineFunction &MF,
if (OpBcst32 || OpBcst64) {
unsigned OpNo = OpBcst32 == 0 ? OpNoBcst64 : OpNoBcst32;
- return FixupConstant(0, 0, OpBcst64, OpBcst32, 0, 0, 0, 0, OpNo);
+ FixupEntry Fixups[] = {{(int)OpBcst32, 32, 32, rebuildSplatCst},
+ {(int)OpBcst64, 64, 64, rebuildSplatCst}};
+ return FixupConstant(Fixups, OpNo);
}
return false;
};
diff --git a/llvm/lib/Target/X86/X86MCInstLower.cpp b/llvm/lib/Target/X86/X86MCInstLower.cpp
index 58ebe023cd61eca..55d7868c530536f 100644
--- a/llvm/lib/Target/X86/X86MCInstLower.cpp
+++ b/llvm/lib/Target/X86/X86MCInstLower.cpp
@@ -1582,6 +1582,36 @@ static void printBroadcast(const MachineInstr *MI, MCStreamer &OutStreamer,
}
}
+static bool printSignExtend(const MachineInstr *MI, MCStreamer &OutStreamer,
+ int SrcEltBits, int DstEltBits) {
+ auto *C = X86::getConstantFromPool(*MI, 1);
+ if (C && C->getType()->getScalarSizeInBits() == SrcEltBits) {
+ if (auto *CDS = dyn_cast<ConstantDataSequential>(C)) {
+ int NumElts = CDS->getNumElements();
+ std::string Comment;
+ raw_string_ostream CS(Comment);
+
+ const MachineOperand &DstOp = MI->getOperand(0);
+ CS << X86ATTInstPrinter::getRegisterName(DstOp.getReg()) << " = ";
+ CS << "[";
+ for (int i = 0; i != NumElts; ++i) {
+ if (i != 0)
+ CS << ",";
+ if (CDS->getElementType()->isIntegerTy()) {
+ APInt Elt = CDS->getElementAsAPInt(i).sext(DstEltBits);
+ printConstant(Elt, CS);
+ } else
+ CS << "?";
+ }
+ CS << "]";
+ OutStreamer.AddComment(CS.str());
+ return true;
+ }
+ }
+
+ return false;
+}
+
void X86AsmPrinter::EmitSEHInstruction(const MachineInstr *MI) {
assert(MF->hasWinCFI() && "SEH_ instruction in function without WinCFI?");
assert((getSubtarget().isOSWindows() || TM.getTargetTriple().isUEFI()) &&
@@ -1844,7 +1874,7 @@ static void addConstantComments(const MachineInstr *MI,
case X86::VMOVQI2PQIrm:
case X86::VMOVQI2PQIZrm:
printZeroUpperMove(MI, OutStreamer, 64, 128, "mem[0],zero");
- break;
+ break;
case X86::MOVSSrm:
case X86::VMOVSSrm:
@@ -1979,6 +2009,36 @@ static void addConstantComments(const MachineInstr *MI,
case X86::VPBROADCASTBZrm:
printBroadcast(MI, OutStreamer, 64, 8);
break;
+
+#define MOVX_CASE(Prefix, Ext, Type, Suffix) \
+ case X86::Prefix##PMOV##Ext##Type##Suffix##rm:
+
+#define CASE_MOVX_RM(Ext, Type) \
+ MOVX_CASE(, Ex...
[truncated]
✅ With the latest revision this PR passed the C/C++ code formatter.
ca12714 to 528d9f0 (force-pushed)
@@ -750,7 +750,7 @@ define void @vec128_i16_widen_to_i32_factor2_broadcast_to_v4i32_factor4(ptr %in.
; AVX512BW-SLOW-LABEL: vec128_i16_widen_to_i32_factor2_broadcast_to_v4i32_factor4:
; AVX512BW-SLOW: # %bb.0:
; AVX512BW-SLOW-NEXT: vmovdqa64 (%rdi), %zmm0
-; AVX512BW-SLOW-NEXT: vmovdqa {{.*#+}} xmm1 = [0,9,0,11,0,13,0,15]
+; AVX512BW-SLOW-NEXT: vpmovsxbw {{.*#+}} xmm1 = [0,9,0,11,0,13,0,15]
Can the smaller memory size outweigh the drawbacks in the micro-arch? E.g.:
- vpmovsxbw uses one more port
- vmovdqa can be eliminated by the uarch
That was my reasoning for always preferring vzload/broadcast over vextload for the same constant load size.
I don't get the point. Doesn't vzload have the same TP/ports as vextload? And if vzload is preferred, why not a vpmovzxbw here?
My initial question was about "load size" vs. "inst cost", but we are already using broadcasts, which have a cost too (e.g., vpbroadcastb/w), so it should not be a problem.
This is how I see it regarding the preference between vzload/broadcast/vextload:
- vzload shouldn't ever need a shuffle port to zero the upper elements, and the fp/int domain versions are equally available, so we don't introduce a domain-crossing penalty.
- broadcast sometimes needs a shuffle port (especially for the 8/16-bit variants); AVX1 only has fp-domain broadcasts, but AVX2+ has good fp/int domain equivalents.
- vextload always needs a shuffle port and is only ever int domain.
Maybe I should add this as a comment somewhere in the code?
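(For reference, such a comment might read roughly as follows; the wording here is a sketch of the ordering described above, not the text that eventually landed.)

// Fixup preference when the stored constant sizes are equal:
//  1. vzload (MOVSS/MOVSD/MOVD/MOVQ style): never needs a shuffle port and has
//     both fp- and int-domain forms, so no domain-crossing penalty.
//  2. broadcast: may need a shuffle port (notably the 8/16-bit variants); AVX1
//     only has fp-domain broadcasts, AVX2+ adds int-domain equivalents.
//  3. vextload (VPMOVSX*): always needs a shuffle port and is int domain only.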
Yes, a comment is great.
"vzload shouldn't ever need a shuffle port to zero the upper elements"
But this doesn't match what I got from uops.info, where vpmovzx has the same shuffle port as vpmovsx.
Do you mean vpmovzx (vzextload which I haven't added yet due to some shuffle comment problems) or vmovd/vmovq (vzload) which just zeros the upper elements?
Yes, I mean vpmovzx. Sorry I mixed them up.
OK, the current plan is to add vpmovzx handling in a future PR; there isn't any difference in perf between equivalent vpmovsx and vpmovzx ops, so I'm not sure which I'll actually prefer (most likely vpmovsx, to reduce test churn).
I've added a comment describing the preferences
f246840 to ccc0450 (force-pushed)
I have checked the code and a dozen test files. No mistakes found so far, so LGTM.
It would be better to pre-commit an NFC patch for the refactoring changes related to rebuildZeroUpperCst and rebuildSplatCst, to double-check that no tests are affected by potential typos.
ccc0450 to 108d32a (force-pushed)
…take array of sorted candidates Pulled out of #79815 - refactors the internal FixupConstant logic to just accept an array of vzload/broadcast candidates that are pre-sorted in ascending constant pool size
0b9a3c4 to d9b3354 (force-pushed)
…take array of sorted candidates Pulled out of llvm#79815 - refactors the internal FixupConstant logic to just accept an array of vzload/broadcast candidates that are pre-sorted in ascending constant pool size
…t can be stored in a truncated form Reduce the size of the vector constant by storing it in the constant pool in a truncated form, and sign-extend it as part of the load. I intend to add the matching load+zero-extend handling in a future patch, but that requires some alterations to the existing MC shuffle comments handling first. I've extended the existing FixupConstant functionality to support these constant rebuilds as well - we still select the smallest stored constant entry and prefer vzload/broadcast/vextload for same bitwidth to avoid domain flips. NOTE: Some of the FixupConstant tables are currently created on the fly as they are dependent on the supported ISAs (HasAVX2 etc.) - should we split these (to allow initializer lists instead) and have duplicate FixupConstant calls to avoid so much stack use?
d9b3354 to 57dc23d (force-pushed)
…e stored in a truncated form Further develops the vsextload support added in llvm#79815 - reduces the size of the vector constant by storing it in the constant pool in a truncated form, and zero-extend it as part of the load.
…t can be stored in a truncated form (llvm#79815) Reduce the size of the vector constant by storing it in the constant pool in a truncated form, and sign-extend it as part of the load. I've extended the existing FixupConstant functionality to support these sext constant rebuilds - we still select the smallest stored constant entry and prefer vzload/broadcast/vextload for same bitwidth to avoid domain flips. I intend to add the matching load+zero-extend handling in a future PR, but that requires some alterations to the existing MC shuffle comments handling first.
…e stored in a truncated form (llvm#80428) Further develops the vsextload support added in llvm#79815 / b5d35fe - reduces the size of the vector constant by storing it in the constant pool in a truncated form, and zero-extend it as part of the load.
Reduce the size of the vector constant by storing it in the constant pool in a truncated form, and sign-extend it as part of the load.
I've extended the existing FixupConstant functionality to support these sext constant rebuilds - we still select the smallest stored constant entry and prefer vzload/broadcast/vextload for same bitwidth to avoid domain flips.
I intend to add the matching load+zero-extend handling in a future PR, but that requires some alterations to the existing MC shuffle comments handling first.
NOTE: Some of the FixupConstant tables are currently created on the fly as they are dependent on the supported ISAs (HasAVX2 etc.) - should we split these (to allow initializer lists instead) and have duplicate FixupConstant calls to avoid additional stack use?
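To make the feasibility check concrete, here is a minimal standalone sketch (plain C++, not the LLVM implementation, which performs the equivalent per-element test on APInt values inside rebuildExtCst) of deciding whether a vector constant can be stored at a narrower element width and recovered by sign-extension on load:

// Standalone sketch: can every 32-bit element be recovered by sign-extending
// from SrcBits? If so, the constant pool entry can shrink from 4*NumElts bytes
// to (SrcBits/8)*NumElts bytes and be reloaded with e.g. VPMOVSXBD/VPMOVSXWD.
#include <cstdint>
#include <cstdio>

static bool fitsSExt(const int32_t *Elts, int NumElts, int SrcBits) {
  const int32_t Max = (1 << (SrcBits - 1)) - 1; // 127 for SrcBits == 8
  const int32_t Min = -(1 << (SrcBits - 1));    // -128 for SrcBits == 8
  for (int I = 0; I != NumElts; ++I)
    if (Elts[I] < Min || Elts[I] > Max)
      return false;
  return true;
}

int main() {
  // <4 x i32> <0, 9, -11, 13> occupies 16 bytes as i32, but every element is
  // representable as a sign-extended i8, so 4 bytes of storage would suffice.
  const int32_t C[] = {0, 9, -11, 13};
  printf("fits i8:  %d\n", fitsSExt(C, 4, 8));
  printf("fits i16: %d\n", fitsSExt(C, 4, 16));
  return 0;
}

The pass itself additionally keeps each candidate table sorted by stored constant size, so the smallest viable encoding is chosen first.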