-
Notifications
You must be signed in to change notification settings - Fork 12k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[InstCombine] Extend folding of aggregate construction to cases when source aggregates are partially available #100828
base: main
Are you sure you want to change the base?
Conversation
…source aggregates are partially available Function foldAggregateConstructionIntoAggregateReuse can fold insertvalue(phi(extractvalue(src1), extractvalue(src2))) into phi(src1, src2) when we can find source aggregates in all predecessors. This patch extends it to handle following case insertvalue(phi(extractvalue(src1), elm2)) into phi(src1, insertvalue(elm2)) with the condition that the predecessor without source aggregate has only one successor.
@llvm/pr-subscribers-llvm-transforms Author: None (weiguozhi) ChangesFunction foldAggregateConstructionIntoAggregateReuse can fold This patch extends it to handle following case Full diff: https://github.com/llvm/llvm-project/pull/100828.diff 4 Files Affected:
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp b/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
index 753ed55523c84..5377bf6f033d0 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineVectorOps.cpp
@@ -1106,6 +1106,7 @@ Instruction *InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse(
// from which all the elements were originally extracted from?
// Note that we want for the map to have stable iteration order!
SmallDenseMap<BasicBlock *, Value *, 4> SourceAggregates;
+ bool FoundSrcAgg = false;
for (BasicBlock *Pred : Preds) {
std::pair<decltype(SourceAggregates)::iterator, bool> IV =
SourceAggregates.insert({Pred, nullptr});
@@ -1117,9 +1118,35 @@ Instruction *InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse(
// aggregate produced by OrigIVI must have been originally extracted from
// the same aggregate. Is that so? Can we find said original aggregate?
SourceAggregate = FindCommonSourceAggregate(UseBB, Pred);
- if (Describe(SourceAggregate) != AggregateDescription::Found)
- return nullptr; // Give up.
- IV.first->second = *SourceAggregate;
+ if (Describe(SourceAggregate) == AggregateDescription::Found) {
+ FoundSrcAgg = true;
+ IV.first->second = *SourceAggregate;
+ } else {
+ // If UseBB is the single successor of Pred, we can add InsertValue to
+ // Pred.
+ if (succ_size(Pred) != 1)
+ return nullptr; // Give up.
+ }
+ }
+
+ if (!FoundSrcAgg)
+ return nullptr;
+
+ // For predecessors without appropriate source aggregate, create one in the
+ // predecessor.
+ for (auto &It : SourceAggregates) {
+ if (Describe(It.second) == AggregateDescription::Found)
+ continue;
+
+ BasicBlock *Pred = It.first;
+ Builder.SetInsertPoint(Pred, Pred->getTerminator()->getIterator());
+ Value *V = PoisonValue::get(AggTy);
+ for (auto I : enumerate(AggElts)) {
+ Value *Elt = (*I.value())->DoPHITranslation(UseBB, Pred);
+ V = Builder.CreateInsertValue(V, Elt, I.index());
+ }
+
+ It.second = V;
}
// All good! Now we just need to thread the source aggregates here.
diff --git a/llvm/test/Transforms/InstCombine/fold-aggregate-reconstruction.ll b/llvm/test/Transforms/InstCombine/fold-aggregate-reconstruction.ll
new file mode 100644
index 0000000000000..a75e9f1580ab9
--- /dev/null
+++ b/llvm/test/Transforms/InstCombine/fold-aggregate-reconstruction.ll
@@ -0,0 +1,215 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -S -passes=instcombine < %s | FileCheck %s
+
+declare {ptr, i64} @bar(i64)
+
+; Basic test.
+define {ptr, i64} @test1(i1 %cond1, ptr %p1, ptr %p2) {
+; CHECK-LABEL: define { ptr, i64 } @test1(
+; CHECK-SAME: i1 [[COND1:%.*]], ptr [[P1:%.*]], ptr [[P2:%.*]]) {
+; CHECK-NEXT: br i1 [[COND1]], label %[[BBB1:.*]], label %[[BBB2:.*]]
+; CHECK: [[BBB1]]:
+; CHECK-NEXT: [[CALL1:%.*]] = call { ptr, i64 } @bar(i64 0)
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[BBB2]]:
+; CHECK-NEXT: [[VAL21:%.*]] = load ptr, ptr [[P1]], align 8
+; CHECK-NEXT: [[VAL22:%.*]] = load i64, ptr [[P2]], align 4
+; CHECK-NEXT: [[TMP1:%.*]] = insertvalue { ptr, i64 } poison, ptr [[VAL21]], 0
+; CHECK-NEXT: [[TMP2:%.*]] = insertvalue { ptr, i64 } [[TMP1]], i64 [[VAL22]], 1
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: [[RES_MERGED:%.*]] = phi { ptr, i64 } [ [[CALL1]], %[[BBB1]] ], [ [[TMP2]], %[[BBB2]] ]
+; CHECK-NEXT: ret { ptr, i64 } [[RES_MERGED]]
+;
+ br i1 %cond1, label %bbb1, label %bbb2
+
+bbb1:
+ %call1 = call {ptr, i64} @bar(i64 0)
+ %val11 = extractvalue { ptr, i64 } %call1, 0
+ %val12 = extractvalue { ptr, i64 } %call1, 1
+ br label %exit
+
+bbb2:
+ %val21 = load ptr, ptr %p1
+ %val22 = load i64, ptr %p2
+ br label %exit
+
+exit:
+ %val1 = phi ptr [%val11, %bbb1], [%val21, %bbb2]
+ %val2 = phi i64 [%val12, %bbb1], [%val22, %bbb2]
+ %tmp = insertvalue { ptr, i64 } poison, ptr %val1, 0
+ %res = insertvalue { ptr, i64 } %tmp, i64 %val2, 1
+ ret {ptr, i64} %res
+}
+
+; Test with more predecessors.
+define {ptr, i64} @test2(i1 %cond1, i1 %cond2, ptr %p1, ptr %p2) {
+; CHECK-LABEL: define { ptr, i64 } @test2(
+; CHECK-SAME: i1 [[COND1:%.*]], i1 [[COND2:%.*]], ptr [[P1:%.*]], ptr [[P2:%.*]]) {
+; CHECK-NEXT: br i1 [[COND1]], label %[[BBB1:.*]], label %[[BBB4:.*]]
+; CHECK: [[BBB1]]:
+; CHECK-NEXT: br i1 [[COND2]], label %[[BBB2:.*]], label %[[BBB3:.*]]
+; CHECK: [[BBB2]]:
+; CHECK-NEXT: [[CALL1:%.*]] = call { ptr, i64 } @bar(i64 0)
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[BBB3]]:
+; CHECK-NEXT: [[CALL2:%.*]] = call { ptr, i64 } @bar(i64 1)
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[BBB4]]:
+; CHECK-NEXT: [[VAL31:%.*]] = load ptr, ptr [[P1]], align 8
+; CHECK-NEXT: [[VAL32:%.*]] = load i64, ptr [[P2]], align 4
+; CHECK-NEXT: [[TMP1:%.*]] = insertvalue { ptr, i64 } poison, ptr [[VAL31]], 0
+; CHECK-NEXT: [[TMP2:%.*]] = insertvalue { ptr, i64 } [[TMP1]], i64 [[VAL32]], 1
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: [[RES_MERGED:%.*]] = phi { ptr, i64 } [ [[CALL1]], %[[BBB2]] ], [ [[CALL2]], %[[BBB3]] ], [ [[TMP2]], %[[BBB4]] ]
+; CHECK-NEXT: ret { ptr, i64 } [[RES_MERGED]]
+;
+ br i1 %cond1, label %bbb1, label %bbb4
+
+bbb1:
+ br i1 %cond2, label %bbb2, label %bbb3
+
+bbb2:
+ %call1 = call {ptr, i64} @bar(i64 0)
+ %val11 = extractvalue { ptr, i64 } %call1, 0
+ %val12 = extractvalue { ptr, i64 } %call1, 1
+ br label %exit
+
+bbb3:
+ %call2 = call {ptr, i64} @bar(i64 1)
+ %val21 = extractvalue { ptr, i64 } %call2, 0
+ %val22 = extractvalue { ptr, i64 } %call2, 1
+ br label %exit
+
+bbb4:
+ %val31 = load ptr, ptr %p1
+ %val32 = load i64, ptr %p2
+ br label %exit
+
+exit:
+ %val1 = phi ptr [%val11, %bbb2], [%val21, %bbb3], [%val31, %bbb4]
+ %val2 = phi i64 [%val12, %bbb2], [%val22, %bbb3], [%val32, %bbb4]
+ %tmp = insertvalue { ptr, i64 } poison, ptr %val1, 0
+ %res = insertvalue { ptr, i64 } %tmp, i64 %val2, 1
+ ret {ptr, i64} %res
+}
+
+; Test with multiple PHI instructions.
+define {ptr, i64} @test3(i1 %cond1, i1 %cond2, ptr %val31, i64 %val32) {
+; CHECK-LABEL: define { ptr, i64 } @test3(
+; CHECK-SAME: i1 [[COND1:%.*]], i1 [[COND2:%.*]], ptr [[VAL31:%.*]], i64 [[VAL32:%.*]]) {
+; CHECK-NEXT: br i1 [[COND1]], label %[[BBB1:.*]], label %[[BBB4:.*]]
+; CHECK: [[BBB1]]:
+; CHECK-NEXT: br i1 [[COND2]], label %[[BBB2:.*]], label %[[BBB3:.*]]
+; CHECK: [[BBB2]]:
+; CHECK-NEXT: [[CALL1:%.*]] = call { ptr, i64 } @bar(i64 0)
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[BBB3]]:
+; CHECK-NEXT: [[CALL2:%.*]] = call { ptr, i64 } @bar(i64 1)
+; CHECK-NEXT: br label %[[BBB5:.*]]
+; CHECK: [[BBB4]]:
+; CHECK-NEXT: [[CALL3:%.*]] = call { ptr, i64 } @bar(i64 2)
+; CHECK-NEXT: [[TMP1:%.*]] = insertvalue { ptr, i64 } poison, ptr [[VAL31]], 0
+; CHECK-NEXT: [[TMP2:%.*]] = insertvalue { ptr, i64 } [[TMP1]], i64 [[VAL32]], 1
+; CHECK-NEXT: br label %[[BBB5]]
+; CHECK: [[BBB5]]:
+; CHECK-NEXT: [[DOTMERGED:%.*]] = phi { ptr, i64 } [ [[CALL2]], %[[BBB3]] ], [ [[TMP2]], %[[BBB4]] ]
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: [[RES_MERGED:%.*]] = phi { ptr, i64 } [ [[CALL1]], %[[BBB2]] ], [ [[DOTMERGED]], %[[BBB5]] ]
+; CHECK-NEXT: ret { ptr, i64 } [[RES_MERGED]]
+;
+ br i1 %cond1, label %bbb1, label %bbb4
+
+bbb1:
+ br i1 %cond2, label %bbb2, label %bbb3
+
+bbb2:
+ %call1 = call {ptr, i64} @bar(i64 0)
+ %val11 = extractvalue { ptr, i64 } %call1, 0
+ %val12 = extractvalue { ptr, i64 } %call1, 1
+ br label %exit
+
+bbb3:
+ %call2 = call {ptr, i64} @bar(i64 1)
+ %val21 = extractvalue { ptr, i64 } %call2, 0
+ %val22 = extractvalue { ptr, i64 } %call2, 1
+ br label %bbb5
+
+bbb4:
+ %call3 = call {ptr, i64} @bar(i64 2)
+ br label %bbb5
+
+bbb5:
+ %val41 = phi ptr [%val21, %bbb3], [%val31, %bbb4]
+ %val42 = phi i64 [%val22, %bbb3], [%val32, %bbb4]
+ br label %exit
+
+exit:
+ %val1 = phi ptr [%val11, %bbb2], [%val41, %bbb5]
+ %val2 = phi i64 [%val12, %bbb2], [%val42, %bbb5]
+ %tmp = insertvalue { ptr, i64 } poison, ptr %val1, 0
+ %res = insertvalue { ptr, i64 } %tmp, i64 %val2, 1
+ ret {ptr, i64} %res
+}
+
+; Negative test, bbb4 has multiple successors, so we don't add insertvalue to it.
+define {ptr, i64} @test4(i1 %cond1, i1 %cond2, ptr %p1, ptr %p2) {
+; CHECK-LABEL: define { ptr, i64 } @test4(
+; CHECK-SAME: i1 [[COND1:%.*]], i1 [[COND2:%.*]], ptr [[P1:%.*]], ptr [[P2:%.*]]) {
+; CHECK-NEXT: br i1 [[COND1]], label %[[BBB1:.*]], label %[[BBB4:.*]]
+; CHECK: [[BBB1]]:
+; CHECK-NEXT: br i1 [[COND2]], label %[[BBB2:.*]], label %[[BBB3:.*]]
+; CHECK: [[BBB2]]:
+; CHECK-NEXT: [[CALL1:%.*]] = call { ptr, i64 } @bar(i64 0)
+; CHECK-NEXT: [[VAL11:%.*]] = extractvalue { ptr, i64 } [[CALL1]], 0
+; CHECK-NEXT: [[VAL12:%.*]] = extractvalue { ptr, i64 } [[CALL1]], 1
+; CHECK-NEXT: br label %[[EXIT:.*]]
+; CHECK: [[BBB3]]:
+; CHECK-NEXT: [[CALL2:%.*]] = call { ptr, i64 } @bar(i64 1)
+; CHECK-NEXT: [[VAL21:%.*]] = extractvalue { ptr, i64 } [[CALL2]], 0
+; CHECK-NEXT: [[VAL22:%.*]] = extractvalue { ptr, i64 } [[CALL2]], 1
+; CHECK-NEXT: br label %[[EXIT]]
+; CHECK: [[BBB4]]:
+; CHECK-NEXT: [[VAL31:%.*]] = load ptr, ptr [[P1]], align 8
+; CHECK-NEXT: [[VAL32:%.*]] = load i64, ptr [[P2]], align 4
+; CHECK-NEXT: [[COND3_NOT:%.*]] = icmp eq i64 [[VAL32]], 0
+; CHECK-NEXT: br i1 [[COND3_NOT]], label %[[EXIT]], label %[[BBB4]]
+; CHECK: [[EXIT]]:
+; CHECK-NEXT: [[VAL1:%.*]] = phi ptr [ [[VAL11]], %[[BBB2]] ], [ [[VAL21]], %[[BBB3]] ], [ [[VAL31]], %[[BBB4]] ]
+; CHECK-NEXT: [[VAL2:%.*]] = phi i64 [ [[VAL12]], %[[BBB2]] ], [ [[VAL22]], %[[BBB3]] ], [ [[VAL32]], %[[BBB4]] ]
+; CHECK-NEXT: [[TMP:%.*]] = insertvalue { ptr, i64 } poison, ptr [[VAL1]], 0
+; CHECK-NEXT: [[RES:%.*]] = insertvalue { ptr, i64 } [[TMP]], i64 [[VAL2]], 1
+; CHECK-NEXT: ret { ptr, i64 } [[RES]]
+;
+ br i1 %cond1, label %bbb1, label %bbb4
+
+bbb1:
+ br i1 %cond2, label %bbb2, label %bbb3
+
+bbb2:
+ %call1 = call {ptr, i64} @bar(i64 0)
+ %val11 = extractvalue { ptr, i64 } %call1, 0
+ %val12 = extractvalue { ptr, i64 } %call1, 1
+ br label %exit
+
+bbb3:
+ %call2 = call {ptr, i64} @bar(i64 1)
+ %val21 = extractvalue { ptr, i64 } %call2, 0
+ %val22 = extractvalue { ptr, i64 } %call2, 1
+ br label %exit
+
+bbb4:
+ %val31 = load ptr, ptr %p1
+ %val32 = load i64, ptr %p2
+ %cond3 = icmp ne i64 %val32, 0
+ br i1 %cond3, label %bbb4, label %exit
+
+exit:
+ %val1 = phi ptr [%val11, %bbb2], [%val21, %bbb3], [%val31, %bbb4]
+ %val2 = phi i64 [%val12, %bbb2], [%val22, %bbb3], [%val32, %bbb4]
+ %tmp = insertvalue { ptr, i64 } poison, ptr %val1, 0
+ %res = insertvalue { ptr, i64 } %tmp, i64 %val2, 1
+ ret {ptr, i64} %res
+}
diff --git a/llvm/test/Transforms/InstCombine/merging-multiple-stores-into-successor.ll b/llvm/test/Transforms/InstCombine/merging-multiple-stores-into-successor.ll
index fbf58d47a32d2..41e3197e5a2f0 100644
--- a/llvm/test/Transforms/InstCombine/merging-multiple-stores-into-successor.ll
+++ b/llvm/test/Transforms/InstCombine/merging-multiple-stores-into-successor.ll
@@ -166,14 +166,13 @@ define %struct.half @one_elem_struct_merge(i1 %cond, %struct.half %a, half %b) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 [[COND:%.*]], label [[BB0:%.*]], label [[BB1:%.*]]
; CHECK: BB0:
-; CHECK-NEXT: [[TMP0:%.*]] = extractvalue [[STRUCT_HALF:%.*]] [[A:%.*]], 0
; CHECK-NEXT: br label [[SINK:%.*]]
; CHECK: BB1:
+; CHECK-NEXT: [[TMP0:%.*]] = insertvalue [[STRUCT_HALF:%.*]] poison, half [[B:%.*]], 0
; CHECK-NEXT: br label [[SINK]]
; CHECK: sink:
-; CHECK-NEXT: [[STOREMERGE:%.*]] = phi half [ [[TMP0]], [[BB0]] ], [ [[B:%.*]], [[BB1]] ]
-; CHECK-NEXT: [[VAL1:%.*]] = insertvalue [[STRUCT_HALF]] poison, half [[STOREMERGE]], 0
-; CHECK-NEXT: ret [[STRUCT_HALF]] [[VAL1]]
+; CHECK-NEXT: [[VAL1_MERGED:%.*]] = phi [[STRUCT_HALF]] [ [[A:%.*]], [[BB0]] ], [ [[TMP0]], [[BB1]] ]
+; CHECK-NEXT: ret [[STRUCT_HALF]] [[VAL1_MERGED]]
;
entry:
%alloca = alloca i64
diff --git a/llvm/test/Transforms/InstCombine/phi-aware-aggregate-reconstruction.ll b/llvm/test/Transforms/InstCombine/phi-aware-aggregate-reconstruction.ll
index 6b7ac445839d2..706ce4e02f231 100644
--- a/llvm/test/Transforms/InstCombine/phi-aware-aggregate-reconstruction.ll
+++ b/llvm/test/Transforms/InstCombine/phi-aware-aggregate-reconstruction.ll
@@ -97,28 +97,26 @@ end:
ret { i32, i32 } %i8
}
-; When coming from %left, elements are swapped
-define { i32, i32 } @negative_test2({ i32, i32 } %agg_left, { i32, i32 } %agg_right, i1 %c) {
-; CHECK-LABEL: @negative_test2(
+; When coming from %left, elements are swapped, and new insertvalue is created.
+; From right side the old aggregate can be directly reused.
+define { i32, i32 } @positive_test2({ i32, i32 } %agg_left, { i32, i32 } %agg_right, i1 %c) {
+; CHECK-LABEL: @positive_test2(
; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 [[C:%.*]], label [[LEFT:%.*]], label [[RIGHT:%.*]]
; CHECK: left:
; CHECK-NEXT: [[I2:%.*]] = extractvalue { i32, i32 } [[AGG_LEFT:%.*]], 1
; CHECK-NEXT: [[I0:%.*]] = extractvalue { i32, i32 } [[AGG_LEFT]], 0
; CHECK-NEXT: call void @foo()
+; CHECK-NEXT: [[TMP0:%.*]] = insertvalue { i32, i32 } poison, i32 [[I2]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = insertvalue { i32, i32 } [[TMP0]], i32 [[I0]], 1
; CHECK-NEXT: br label [[END:%.*]]
; CHECK: right:
-; CHECK-NEXT: [[I4:%.*]] = extractvalue { i32, i32 } [[AGG_RIGHT:%.*]], 1
-; CHECK-NEXT: [[I3:%.*]] = extractvalue { i32, i32 } [[AGG_RIGHT]], 0
; CHECK-NEXT: call void @bar()
; CHECK-NEXT: br label [[END]]
; CHECK: end:
-; CHECK-NEXT: [[I5:%.*]] = phi i32 [ [[I2]], [[LEFT]] ], [ [[I3]], [[RIGHT]] ]
-; CHECK-NEXT: [[I6:%.*]] = phi i32 [ [[I0]], [[LEFT]] ], [ [[I4]], [[RIGHT]] ]
+; CHECK-NEXT: [[I8_MERGED:%.*]] = phi { i32, i32 } [ [[TMP1]], [[LEFT]] ], [ [[AGG_RIGHT:%.*]], [[RIGHT]] ]
; CHECK-NEXT: call void @baz()
-; CHECK-NEXT: [[I7:%.*]] = insertvalue { i32, i32 } undef, i32 [[I5]], 0
-; CHECK-NEXT: [[I8:%.*]] = insertvalue { i32, i32 } [[I7]], i32 [[I6]], 1
-; CHECK-NEXT: ret { i32, i32 } [[I8]]
+; CHECK-NEXT: ret { i32, i32 } [[I8_MERGED]]
;
entry:
%i0 = extractvalue { i32, i32 } %agg_left, 0
@@ -417,14 +415,14 @@ define { i32, i32 } @test8({ i32, i32 } %agg_left, { i32, i32 } %agg_right, i1 %
; CHECK: left:
; CHECK-NEXT: call void @foo()
; CHECK-NEXT: switch i32 [[VAL_LEFT:%.*]], label [[IMPOSSIBLE:%.*]] [
-; CHECK-NEXT: i32 -42, label [[END:%.*]]
-; CHECK-NEXT: i32 42, label [[END]]
+; CHECK-NEXT: i32 -42, label [[END:%.*]]
+; CHECK-NEXT: i32 42, label [[END]]
; CHECK-NEXT: ]
; CHECK: right:
; CHECK-NEXT: call void @bar()
; CHECK-NEXT: switch i32 [[VAL_RIGHT:%.*]], label [[IMPOSSIBLE]] [
-; CHECK-NEXT: i32 42, label [[END]]
-; CHECK-NEXT: i32 -42, label [[END]]
+; CHECK-NEXT: i32 42, label [[END]]
+; CHECK-NEXT: i32 -42, label [[END]]
; CHECK-NEXT: ]
; CHECK: impossible:
; CHECK-NEXT: unreachable
|
add an insertvalue to PredBB.
llvm/test/Transforms/InstCombine/fold-aggregate-reconstruction.ll
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This patch causes InstCombine to hang when building https://github.com/jqlang/jq:
; timeout 5s bin/opt -passes=instcombine reduced.ll -S
define { i64, ptr } @minmax_by({ i64, ptr } %0, i1 %1, i1 %2, { i64, ptr } %3) {
%5 = extractvalue { i64, ptr } %0, 0
%6 = extractvalue { i64, ptr } %0, 1
br label %7
7: ; preds = %14, %4
%.sroa.01.0 = phi i64 [ %5, %4 ], [ %.sroa.01.1, %14 ]
%.sroa.3.0 = phi ptr [ %6, %4 ], [ %.sroa.3.1, %14 ]
br i1 %1, label %8, label %15
8: ; preds = %7
br i1 %2, label %9, label %12
9: ; preds = %8
%10 = extractvalue { i64, ptr } %3, 0
%11 = extractvalue { i64, ptr } %3, 1
br label %14
12: ; preds = %8
%13 = getelementptr { i64, ptr }, ptr null, i32 0, i32 0
br label %14
14: ; preds = %12, %9
%.sroa.01.1 = phi i64 [ %10, %9 ], [ %.sroa.01.0, %12 ]
%.sroa.3.1 = phi ptr [ %11, %9 ], [ %.sroa.3.0, %12 ]
br label %7
15: ; preds = %7
%.fca.0.insert = insertvalue { i64, ptr } zeroinitializer, i64 %.sroa.01.0, 0
%.fca.1.insert = insertvalue { i64, ptr } %.fca.0.insert, ptr %.sroa.3.0, 1
ret { i64, ptr } %.fca.1.insert
}
|
Thanks for the quick test. The new insertvalue is added to a predecessor with only one successor, so it intends to reduce the number of executed insertvalue instruction. But it doesn't behave like this when a loop is involved. The loop exiting edges and loop latches don't guarantee reduced dynamic number of insertvalue instructions. We should check these cases. |
ping |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another reproducer causing instcombine to hang:
bin/opt -passes=instcombine reduced.ll
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"
define { i64, ptr } @minmax_by({ i64, ptr } %0, i1 %1, i1 %2, { i64, ptr } %3) {
%5 = extractvalue { i64, ptr } %0, 0
%6 = extractvalue { i64, ptr } %0, 1
br label %7
7: ; preds = %14, %4
%.sroa.01.0 = phi i64 [ %5, %4 ], [ %.sroa.01.1, %14 ]
%.sroa.3.0 = phi ptr [ %6, %4 ], [ %.sroa.3.1, %14 ]
br i1 %1, label %8, label %15
8: ; preds = %7
br i1 %2, label %9, label %12
9: ; preds = %8
%10 = extractvalue { i64, ptr } %3, 0
%11 = extractvalue { i64, ptr } %3, 1
br label %14
12: ; preds = %8
%13 = getelementptr { i64, ptr }, ptr null, i32 0, i32 0
br label %14
14: ; preds = %12, %9
%.sroa.01.1 = phi i64 [ %10, %9 ], [ %.sroa.01.0, %12 ]
%.sroa.3.1 = phi ptr [ %11, %9 ], [ %.sroa.3.0, %12 ]
br label %7
15: ; preds = %7
%.fca.0.insert = insertvalue { i64, ptr } zeroinitializer, i64 %.sroa.01.0, 0
%.fca.1.insert = insertvalue { i64, ptr } %.fca.0.insert, ptr %.sroa.3.0, 1
ret { i64, ptr } %.fca.1.insert
}
BTW, I don't think this transform is beneficial since it blocks some select-related optimizations:
Before this patch:
define hidden { i64, i64 } @"_ZN4core6result19Result$LT$T$C$E$GT$7map_err17h84015741f1851455E.llvm.9397651224479784633"(i64 noundef %0, i64 %1) unnamed_addr #0 {
%3 = insertvalue { i64, i64 } poison, i64 %0, 0
%4 = insertvalue { i64, i64 } %3, i64 %1, 1
ret { i64, i64 } %4
}
After this patch:
define hidden { i64, i64 } @"_ZN4core6result19Result$LT$T$C$E$GT$7map_err17h84015741f1851455E.llvm.9397651224479784633"(i64 noundef %0, i64 %1) unnamed_addr #0 {
%3 = icmp eq i64 %0, -9223372036854775807
%4 = insertvalue { i64, i64 } poison, i64 %0, 0
%5 = insertvalue { i64, i64 } %4, i64 %1, 1
%.merged = select i1 %3, { i64, i64 } { i64 -9223372036854775807, i64 undef }, { i64, i64 } %5
ret { i64, i64 } %.merged
}
Can you provide some performance data to demonstrate the usefulness of this patch?
The test case is extracted from hot path of tcmalloc, so it has real world impact. This patch uses LoopInfo, so you need to run it with
I ran the patch on your smaller test case, and got
How did you run it? |
I run it with |
With -O3 I got
Thank you for the LoopInfo information. In order to make it more usable I should avoid the LoopInfo usage. I will think it more. |
Now if I need to add insertvalue into predecessor, I restrict it to 1, UseBB == OrigBB, then it can not cross a loop |
ping |
Some quick thoughts:
|
BFI in InstCombine has limited usage, it's usually not available, so remove it. Also avoid constructing constant aggregate.
The original code removes fully redundant insertvalue, this patch enhances it to remove partial redundant insertvalue, so it actually drop redundant insertvalue. Avoiding constant aggregate sounds reasonable, I added code to check it.
Deleted.
Thanks for the info. Currently I restrict it to UseBB == OrigBB && succ_size(Pred) == 1, it can already avoid cycles.
Changed it to checking for unconditional branch. |
@dtcxzyw could you help to explain these numbers? Execution time? File size? Number of llvm IR instructions?
I can guess out the numbers in other two parts, the llvm stats result and the changed lines in diff result. |
It is the number of instructions executed by |
So it's the number of instructions executed by opt, not the number of instructions executed by the generated code. Am I right? |
Yes |
Thanks for the confirmation. So it seems compile time impact is neutral. I can only see two files from dtcxzyw/llvm-opt-benchmark@78a3cec. Both files reduced the dynamic number of insertvalue instructions, so it has positive performance impact. |
ping |
// If UseBB is the single successor of Pred, we can add InsertValue to | ||
// Pred. | ||
auto *BI = dyn_cast<BranchInst>(Pred->getTerminator()); | ||
if (!(BI && BI->isUnconditional())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (!(BI && BI->isUnconditional())) | |
if (!BI || !BI->isUnconditional()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
ping |
I don't think we can benefit from introducing more aggregate phi nodes. If these phi nodes are converted into selects in subsequent passes (e.g., SimplifyCFG), it is hard for InstCombine to fold. See dtcxzyw/llvm-opt-benchmark#1351 (comment). If you think this patch will reduce total number of instructions, I don't agree with this argument. Can you provide some performance data on SPEC/llvm-test-suite? Compilation time impact looks ok. Don't worry about this :) I will accept this patch if:
|
extract (select (cond, insert(agg, elem), FV)) -> select (cond, elem, extract(FV))
Add new instcombine pattern to transform
so the regression of dtcxzyw/llvm-opt-benchmark#1351 (comment) can be fixed. |
I tested it on speccpu 2006 int, this patch is triggerred 19 times for 483.xalancbmk only, but the generated code is same. So no impact. The real world impact of this patch is to a hot path in tcmalloc. In function _ZN8tcmalloc17tcmalloc_internal14TCMallocPolicyINS0_12CppOomPolicyENS0_18DefaultAlignPolicyENS0_25AllocationAccessHotPolicyENS0_17InvokeHooksPolicyENS0_21IsSizeReturningPolicyENS0_24LocalNumaPartitionPolicyEE10to_pointerEPvmocPolicyINS0_12CppOomPolicyENS0_18DefaultAlignPolicyENS0_25AllocationAccessHotPolicyENS0_17InvokeHooksPolicyENS0_21IsSizeReturningPolicyENS0_24LocalNumaPartitionPolicyEE10to_pointerEPvm (will be inlined into tcmalloc_size_returning_operator_new) there is a code sequence similar to
Usually function foldAggregateConstructionIntoAggregateReuse can optimize away all of the insertvalue/extractvalue,
But in one of our applications built with thinlto, with profile information thinlto causes the function _ZN8tcmalloc17tcmalloc_internal14TCMallocPolicyINS0_12CppOomPolicyENS0_18DefaultAlignPolicyENS0_25AllocationAccessHotPolicyENS0_17InvokeHooksPolicyENS0_21IsSizeReturningPolicyENS0_24LocalNumaPartitionPolicyEE10to_pointerEPvmocPolicyINS0_12CppOomPolicyENS0_18DefaultAlignPolicyENS0_25AllocationAccessHotPolicyENS0_17InvokeHooksPolicyENS0_21IsSizeReturningPolicyENS0_24LocalNumaPartitionPolicyEE10to_pointerEPvm to be inlined, so I got
This instruction sequence can't be handled by the original foldAggregateConstructionIntoAggregateReuse, so worse tcmalloc_size_returning_operator_new was generated. |
%tmp = insertvalue { ptr, i64 } poison, ptr %val1, 0 | ||
%res = insertvalue { ptr, i64 } %tmp, i64 %val2, 1 | ||
ret {ptr, i64} %res | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this optimization doesn't improve the final codegen: https://godbolt.org/z/YrW8zvo3x
Does it improve tcmalloc's performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add "tail" to the call instruction, then you can see the function call is changed to a tailcall, and function prolog and epilog is simplified(removed). In the real code the %exit block has 4 or 5 predecessors ending with tail call.
In our application it does improve tcmalloc's performance because tcmalloc_size_returning_operator_new is a hot function.
IR diff looks good to me. I believe the regression can be addressed by using |
ping |
@weiguozhi Are you willing to try this? |
Yes, I tried it. As you have noticed it doesn't handle non-const aggregates. We can enhance FoldOpIntoSelect to do the non-const transform in aef5402
|
Can you file a separate PR? |
Function foldAggregateConstructionIntoAggregateReuse can fold
insertvalue(phi(extractvalue(src1), extractvalue(src2)))
into
phi(src1, src2)
when we can find source aggregates in all predecessors.
This patch extends it to handle following case
insertvalue(phi(extractvalue(src1), elm2))
into
phi(src1, insertvalue(elm2))
with the condition that the predecessor without source aggregate has only one successor.