[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582
Conversation
@llvm/pr-subscribers-backend-x86

Author: Quentin Dian (DianQK)

Changes

Fixes #78578. We should add a count check on the predecessors to avoid the code-size explosion. I found a strange argument during my investigation:

llvm-project/llvm/lib/CodeGen/TailDuplicator.cpp, lines 76 to 77 in 4b2381a
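The referenced lines, reproduced verbatim from the diff context further below:

```cpp
static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
                                      cl::Hidden);
```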
We don't use this argument anywhere. Also, an issue with AsmPrinter may be causing this test case to print two line breaks, which makes the test case fail. I haven't checked, but I don't think it affects the review.

Patch is 29.43 KiB, truncated to 20.00 KiB below; full version: https://github.com/llvm/llvm-project/pull/78582.diff

3 Files Affected:
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 5ed67bd0a121ed..e76d63d3c0d66f 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -76,6 +76,11 @@ static cl::opt<bool>
static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
cl::Hidden);
+static cl::opt<unsigned> TailDupPredSizeLimit(
+ "tail-dup-pred-size-limit",
+ cl::desc("Maximum predecessors to consider tail duplicating."), cl::init(8),
+ cl::Hidden);
+
void TailDuplicator::initMF(MachineFunction &MFin, bool PreRegAlloc,
const MachineBranchProbabilityInfo *MBPIin,
MBFIWrapper *MBFIin,
@@ -565,6 +570,8 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
if (TailBB.isSuccessor(&TailBB))
return false;
+ if (TailDupPredSizeLimit < TailBB.pred_size())
+ return false;
// Set the limit on the cost to duplicate. When optimizing for size,
// duplicate only one, because one branch instruction can be eliminated to
// compensate for the duplication.
diff --git a/llvm/test/CodeGen/X86/mul-constant-result.ll b/llvm/test/CodeGen/X86/mul-constant-result.ll
index 1f9e7a93ad0b90..73c764a3f53da1 100644
--- a/llvm/test/CodeGen/X86/mul-constant-result.ll
+++ b/llvm/test/CodeGen/X86/mul-constant-result.ll
@@ -28,162 +28,132 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
; X86-NEXT: .LBB0_4:
; X86-NEXT: decl %ecx
; X86-NEXT: cmpl $31, %ecx
-; X86-NEXT: ja .LBB0_35
+; X86-NEXT: ja .LBB0_31
; X86-NEXT: # %bb.5:
; X86-NEXT: jmpl *.LJTI0_0(,%ecx,4)
; X86-NEXT: .LBB0_6:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
+; X86-NEXT: jmp .LBB0_40
; X86-NEXT: .LBB0_7:
-; X86-NEXT: .cfi_def_cfa_offset 8
; X86-NEXT: leal (%eax,%eax,8), %ecx
; X86-NEXT: leal (%ecx,%ecx,2), %ecx
-; X86-NEXT: jmp .LBB0_9
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
; X86-NEXT: .LBB0_8:
; X86-NEXT: movl %eax, %ecx
; X86-NEXT: shll $4, %ecx
-; X86-NEXT: jmp .LBB0_9
-; X86-NEXT: .LBB0_10:
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_9:
; X86-NEXT: leal (%eax,%eax,4), %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_11:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_10:
; X86-NEXT: shll $2, %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_13:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_11:
+; X86-NEXT: leal (%eax,%eax,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_12:
; X86-NEXT: leal (%eax,%eax,2), %ecx
-; X86-NEXT: jmp .LBB0_14
-; X86-NEXT: .LBB0_15:
+; X86-NEXT: leal (%eax,%ecx,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_13:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: jmp .LBB0_12
-; X86-NEXT: .LBB0_16:
+; X86-NEXT: leal (%eax,%eax,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_14:
; X86-NEXT: leal (%eax,%eax,4), %ecx
; X86-NEXT: leal (%ecx,%ecx,4), %ecx
-; X86-NEXT: jmp .LBB0_9
-; X86-NEXT: .LBB0_17:
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_15:
; X86-NEXT: leal (%eax,%eax,4), %eax
-; X86-NEXT: jmp .LBB0_12
-; X86-NEXT: .LBB0_19:
+; X86-NEXT: leal (%eax,%eax,4), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_17:
; X86-NEXT: shll $4, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_20:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_18:
; X86-NEXT: shll $2, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_21:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_19:
; X86-NEXT: shll $3, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_22:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_20:
; X86-NEXT: shll $5, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_23:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_21:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: .LBB0_33:
; X86-NEXT: leal (%eax,%eax,8), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_24:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_22:
; X86-NEXT: leal (%eax,%eax,4), %ecx
-; X86-NEXT: .LBB0_14:
; X86-NEXT: leal (%eax,%ecx,4), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_25:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_23:
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_26:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_24:
; X86-NEXT: leal (%eax,%eax,4), %ecx
; X86-NEXT: leal (%eax,%ecx,4), %ecx
-; X86-NEXT: jmp .LBB0_9
-; X86-NEXT: .LBB0_27:
+; X86-NEXT: addl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_25:
; X86-NEXT: leal (%eax,%eax), %ecx
; X86-NEXT: shll $4, %eax
-; X86-NEXT: jmp .LBB0_28
-; X86-NEXT: .LBB0_29:
+; X86-NEXT: subl %ecx, %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_26:
; X86-NEXT: leal (,%eax,8), %ecx
-; X86-NEXT: jmp .LBB0_38
-; X86-NEXT: .LBB0_30:
+; X86-NEXT: jmp .LBB0_33
+; X86-NEXT: .LBB0_27:
; X86-NEXT: leal (%eax,%eax,8), %ecx
-; X86-NEXT: jmp .LBB0_32
-; X86-NEXT: .LBB0_31:
+; X86-NEXT: leal (%eax,%ecx,2), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_28:
; X86-NEXT: leal (%eax,%eax,4), %ecx
-; X86-NEXT: .LBB0_32:
; X86-NEXT: leal (%eax,%ecx,2), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_34:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_29:
+; X86-NEXT: leal (%eax,%eax,8), %eax
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_30:
; X86-NEXT: movl %eax, %ecx
; X86-NEXT: shll $5, %ecx
-; X86-NEXT: jmp .LBB0_38
-; X86-NEXT: .LBB0_35:
+; X86-NEXT: jmp .LBB0_33
+; X86-NEXT: .LBB0_31:
; X86-NEXT: xorl %eax, %eax
-; X86-NEXT: .LBB0_36:
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_37:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_32:
; X86-NEXT: leal (%eax,%eax,2), %ecx
; X86-NEXT: shll $3, %ecx
-; X86-NEXT: .LBB0_38:
+; X86-NEXT: .LBB0_33:
; X86-NEXT: subl %eax, %ecx
; X86-NEXT: movl %ecx, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_39:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_34:
; X86-NEXT: shll $2, %eax
-; X86-NEXT: .LBB0_12:
; X86-NEXT: leal (%eax,%eax,4), %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_40:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_35:
; X86-NEXT: shll $3, %eax
-; X86-NEXT: jmp .LBB0_18
-; X86-NEXT: .LBB0_41:
+; X86-NEXT: jmp .LBB0_39
+; X86-NEXT: .LBB0_36:
; X86-NEXT: leal (%eax,%eax,8), %ecx
; X86-NEXT: leal (%ecx,%ecx,2), %ecx
; X86-NEXT: addl %eax, %eax
-; X86-NEXT: .LBB0_9:
; X86-NEXT: addl %ecx, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_42:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_37:
; X86-NEXT: leal (%eax,%eax), %ecx
; X86-NEXT: shll $5, %eax
-; X86-NEXT: .LBB0_28:
; X86-NEXT: subl %ecx, %eax
-; X86-NEXT: popl %esi
-; X86-NEXT: .cfi_def_cfa_offset 4
-; X86-NEXT: retl
-; X86-NEXT: .LBB0_43:
-; X86-NEXT: .cfi_def_cfa_offset 8
+; X86-NEXT: jmp .LBB0_40
+; X86-NEXT: .LBB0_38:
; X86-NEXT: leal (%eax,%eax,8), %eax
-; X86-NEXT: .LBB0_18:
+; X86-NEXT: .LBB0_39:
; X86-NEXT: leal (%eax,%eax,2), %eax
+; X86-NEXT: .LBB0_40:
; X86-NEXT: popl %esi
; X86-NEXT: .cfi_def_cfa_offset 4
; X86-NEXT: retl
@@ -199,154 +169,131 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
; X64-HSW-NEXT: cmovel %ecx, %eax
; X64-HSW-NEXT: decl %edi
; X64-HSW-NEXT: cmpl $31, %edi
-; X64-HSW-NEXT: ja .LBB0_31
+; X64-HSW-NEXT: ja .LBB0_28
; X64-HSW-NEXT: # %bb.1:
; X64-HSW-NEXT: jmpq *.LJTI0_0(,%rdi,8)
; X64-HSW-NEXT: .LBB0_2:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
+; X64-HSW-NEXT: jmp .LBB0_37
; X64-HSW-NEXT: .LBB0_3:
; X64-HSW-NEXT: leal (%rax,%rax,8), %ecx
; X64-HSW-NEXT: leal (%rcx,%rcx,2), %ecx
-; X64-HSW-NEXT: jmp .LBB0_22
+; X64-HSW-NEXT: jmp .LBB0_21
; X64-HSW-NEXT: .LBB0_4:
; X64-HSW-NEXT: movl %eax, %ecx
; X64-HSW-NEXT: shll $4, %ecx
-; X64-HSW-NEXT: jmp .LBB0_22
+; X64-HSW-NEXT: jmp .LBB0_21
; X64-HSW-NEXT: .LBB0_5:
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: .LBB0_13:
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
+; X64-HSW-NEXT: jmp .LBB0_36
; X64-HSW-NEXT: .LBB0_6:
; X64-HSW-NEXT: shll $2, %eax
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
+; X64-HSW-NEXT: jmp .LBB0_36
+; X64-HSW-NEXT: .LBB0_7:
+; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
+; X64-HSW-NEXT: jmp .LBB0_37
; X64-HSW-NEXT: .LBB0_8:
; X64-HSW-NEXT: leal (%rax,%rax,2), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_10:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_9:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: .LBB0_7:
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_11:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_10:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rcx,%rcx,4), %ecx
-; X64-HSW-NEXT: jmp .LBB0_22
-; X64-HSW-NEXT: .LBB0_12:
+; X64-HSW-NEXT: jmp .LBB0_21
+; X64-HSW-NEXT: .LBB0_11:
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_14:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_13:
; X64-HSW-NEXT: shll $4, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_15:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_14:
; X64-HSW-NEXT: shll $2, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_16:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_15:
; X64-HSW-NEXT: shll $3, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_17:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_16:
; X64-HSW-NEXT: shll $5, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_18:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_17:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: .LBB0_29:
; X64-HSW-NEXT: leal (%rax,%rax,8), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_19:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_18:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_20:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_19:
; X64-HSW-NEXT: addl %eax, %eax
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_21:
+; X64-HSW-NEXT: jmp .LBB0_36
+; X64-HSW-NEXT: .LBB0_20:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,4), %ecx
-; X64-HSW-NEXT: .LBB0_22:
+; X64-HSW-NEXT: .LBB0_21:
; X64-HSW-NEXT: addl %eax, %ecx
; X64-HSW-NEXT: movl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_23:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_22:
; X64-HSW-NEXT: leal (%rax,%rax), %ecx
; X64-HSW-NEXT: shll $4, %eax
; X64-HSW-NEXT: subl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_25:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_23:
; X64-HSW-NEXT: leal (,%rax,8), %ecx
-; X64-HSW-NEXT: jmp .LBB0_34
-; X64-HSW-NEXT: .LBB0_26:
+; X64-HSW-NEXT: jmp .LBB0_30
+; X64-HSW-NEXT: .LBB0_24:
; X64-HSW-NEXT: leal (%rax,%rax,8), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_27:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_25:
; X64-HSW-NEXT: leal (%rax,%rax,4), %ecx
; X64-HSW-NEXT: leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_30:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_26:
+; X64-HSW-NEXT: leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_27:
; X64-HSW-NEXT: movl %eax, %ecx
; X64-HSW-NEXT: shll $5, %ecx
-; X64-HSW-NEXT: jmp .LBB0_34
-; X64-HSW-NEXT: .LBB0_31:
+; X64-HSW-NEXT: jmp .LBB0_30
+; X64-HSW-NEXT: .LBB0_28:
; X64-HSW-NEXT: xorl %eax, %eax
-; X64-HSW-NEXT: .LBB0_32:
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_33:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_29:
; X64-HSW-NEXT: leal (%rax,%rax,2), %ecx
; X64-HSW-NEXT: shll $3, %ecx
-; X64-HSW-NEXT: .LBB0_34:
+; X64-HSW-NEXT: .LBB0_30:
; X64-HSW-NEXT: subl %eax, %ecx
; X64-HSW-NEXT: movl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_36:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_31:
; X64-HSW-NEXT: shll $2, %eax
; X64-HSW-NEXT: leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_37:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_32:
; X64-HSW-NEXT: shll $3, %eax
-; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_38:
+; X64-HSW-NEXT: jmp .LBB0_36
+; X64-HSW-NEXT: .LBB0_33:
; X64-HSW-NEXT: leal (%rax,%rax,8), %ecx
; X64-HSW-NEXT: leal (%rcx,%rcx,2), %ecx
; X64-HSW-NEXT: addl %eax, %eax
; X64-HSW-NEXT: addl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_39:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_34:
; X64-HSW-NEXT: leal (%rax,%rax), %ecx
; X64-HSW-NEXT: shll $5, %eax
; X64-HSW-NEXT: subl %ecx, %eax
-; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT: retq
-; X64-HSW-NEXT: .LBB0_40:
+; X64-HSW-NEXT: jmp .LBB0_37
+; X64-HSW-NEXT: .LBB0_35:
; X64-HSW-NEXT: leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT: .LBB0_36:
; X64-HSW-NEXT: leal (%rax,%rax,2), %eax
+; X64-HSW-NEXT: .LBB0_37:
; X64-HSW-NEXT: # kill: def $eax killed $eax killed $rax
; X64-HSW-NEXT: retq
%3 = icmp eq i32 %1, 0
diff --git a/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
new file mode 100644
index 00000000000000..47b9fcaa7d6c85
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
@@ -0,0 +1,242 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=3 < %s | FileCheck %s -check-prefix=LIMIT
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=4 < %s | FileCheck %s -check-prefix=NOLIMIT
+
+define i32 @foo(ptr %0, i32 %1) {
+ ; LIMIT-LABEL: name: foo
+ ; LIMIT: bb.0 (%ir-block.2):
+ ; LIMIT-NEXT: successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+ ; LIMIT-NEXT: liveins: $rdi, $esi
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[COPY:%[0-9]+]]:gr32 = COPY $esi
+ ; LIMIT-NEXT: [[COPY1:%[0-9]+]]:gr64 = COPY $rdi
+ ; LIMIT-NEXT: [[SHR32ri:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 1, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri:%[0-9]+]]:gr32 = AND32ri [[SHR32ri]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.1 (%ir-block.5):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.2 (%ir-block.7):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm1:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.3 (%ir-block.10):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm2:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.4 (%ir-block.13):
+ ; LIMIT-NEXT: successors: %bb.6(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm3:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
+ ; LIMIT-NEXT: JMP_1 %bb.6
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.5.default.unreachable2:
+ ; LIMIT-NEXT: successors:
+ ; LIMIT: bb.6 (%ir-block.16):
+ ; LIMIT-NEXT: successors: %bb.7(0x20000000), %bb.8(0x20000000), %bb.9(0x20000000), %bb.10(0x20000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[PHI:%[0-9]+]]:gr32 = PHI [[SHR32ri3]], %bb.4, [[SHR32ri2]], %bb.3, [[SHR32ri1]], %bb.2, [[MOV32rm]], %bb.1
+ ; LIMIT-NEXT: [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri4]], 7, implicit-def dead $eflags
+ ; LIMIT-NEXT: [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri1]], %subreg.sub_32bit
+ ; LIMIT-NEXT: JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg :: (load (s64) from jump-table)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.7 (%ir-block.20):
+ ; LIMIT-NEXT: successors: %bb.11(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm4:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: JMP_1 %bb.11
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: bb.8 (%ir-block.22):
+ ; LIMIT-NEXT: successors: %bb.11(0x80000000)
+ ; LIMIT-NEXT: {{ $}}
+ ; LIMIT-NEXT: [[MOV32rm5:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+ ; LIMIT-NEXT: [[SHR32r...
[truncated]
I can verify that the initial OOM case is back to normal. Also, when the default branch is removed, the generated instructions improve somewhat.

With the default branch:

Without the default branch:
Text diff:

diff --git a/output.s b/output.s
index 322d0d0..6ca97d0 100644
--- a/output.s
+++ b/output.s
@@ -1,5 +1,5 @@
.text
- .file "oom_manual.c"
+ .file "oom_manual2.c"
.globl f1 # -- Begin function f1
.p2align 4, 0x90
.type f1,@function
@@ -33,12805 +33,12788 @@ f1: # @f1
movl %eax, %ecx
shrl %ecx
andl $127, %ecx
- cmpl $126, %ecx
- ja .LBB0_15
-# %bb.1:
jmpq *.LJTI0_0(,%rcx,8)

I can see many changes like this throughout the diff.
@aeubanks mind taking a look at this small patch? Is this limit reasonable/consistent with other similar limits, etc.? Do we need more data to back up why this particular bound was chosen?

The test should be an MIR test, right? Then it's less prone to various changes affecting the exact codegen of the IR.

Do we actually know why the previous patch caused things to blow up in this pass, i.e., where the memory usage spike actually happened? Was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated? Or is there an underlying algorithmic problem that we can fix?
Yes, but wouldn't that make this test case poorly maintained? I'm not sure, but I can change it.
It looks like the default branch of the switch creates an if statement (a compare instruction), so the dispatch gets two successors; see the sketch after this comment.
I think so.
I'm not sure, but I can try if needed. But I don't know much about MIR and performance work.
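A minimal sketch of what I mean, mirroring the output.s diff above (the `.LBB0_default` label is made up for illustration):

```asm
# With a reachable default, the switch lowers to a guard block
# (compare + conditional branch to the default case) followed by the
# indirect jump through the jump table:
        cmpl    $126, %ecx
        ja      .LBB0_default          # hypothetical default-case label
# %bb.1:
        jmpq    *.LJTI0_0(,%rcx,8)     # jump-table dispatch

# Without the default, the guard disappears and only the indirect
# jump remains -- exactly what the output.s diff shows being removed:
        jmpq    *.LJTI0_0(,%rcx,8)
```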
What do you mean by "poorly maintained"? Given that the option works on the number of MIR instructions, we should keep the input MIR instruction count consistent.

If tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. But hopefully somebody who actually understands the pass better can comment.
The cause is that early-tailduplication turns the single jump-table dispatch into duplicated indirect branches in every predecessor, as the output.s diff above shows.
If you want to limit the amount of tail duplication, I would like to see partial tail duplication based on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.
Right. Disabling tail duplication for blocks with 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the number of duplications. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP). A sketch of the idea follows.
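A rough sketch of that idea, not the actual MBP code; `HotThreshold` and `duplicateInto` are hypothetical names:

```cpp
// Hypothetical sketch: duplicate the tail only into predecessors whose
// incoming edge is hot according to branch-probability/profile data.
SmallVector<MachineBasicBlock *, 8> HotPreds;
for (MachineBasicBlock *Pred : TailBB.predecessors())
  if (MBPI.getEdgeProbability(Pred, &TailBB) >= HotThreshold)
    HotPreds.push_back(Pred);
// Duplicate after the scan so we don't mutate the predecessor list
// while iterating over it.
for (MachineBasicBlock *Pred : HotPreds)
  duplicateInto(Pred, TailBB); // hypothetical helper
```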
But the currently given example does not contain profile information, so I don't think that addresses the current problem.
I think the argument is: the currently proposed solution may harm performance in some cases, and that loss can be mitigated at least in the presence of profile information (so, with that mitigation, maybe the overall cost is low enough to be worth shipping). Also, even in the absence of real profile information, I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding), which might mean the mitigation fires even then.
We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and see regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.
I'm not sure I want to use PGO, since most applications won't be built with PGO. But I think the limiting conditions here can be adjusted; perhaps we could consider the number of duplicated instructions instead.
I've written down the compilation times in the issue. I also tried a simple runtime benchmark. The code is below:
int src(void) {
  return -1;
}
/* f1 is the large switch-heavy function from oom_manual.c / oom_manual2.c. */
extern int f1(unsigned int *b);
int main(int argc, char **argv) {
  int r = argc;
  unsigned int b[] = { -1, -2, -3 };
  /* Call f1 many times so the branch-heavy code dominates the runtime. */
  for (int i = 0; i < 1000000; i++) {
    r += f1(b);
  }
  return r;
}
clang -O1 oom_manual.c main.c -o oom_manual
clang -O1 oom_manual2.c main.c -o oom_manual2
ls -lh oom_manual oom_manual2 output:
hyperfine -i -N --runs 200 --warmup 50 ./oom_manual ./oom_manual2 output:
function run_perf() {
echo "perf stat $1"
perf stat -x \; \
-e instructions \
-e instructions:u \
-e cycles \
-e task-clock \
-e branches \
-e branch-misses \
$1
}
run_perf ./oom_manual
run_perf ./oom_manual2 output:
I am trying to change the code to see the results of different scenarios.
I made some progress. There are usually only two instructions that can be duplicated; indirect branches increase this limit to 20:

llvm-project/llvm/lib/CodeGen/TailDuplicator.cpp, lines 591 to 602 in 4b2381a
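The referenced code, approximately (quoted from memory of that revision, so treat the exact wording as an approximation; `TailDupIndirectBranchSize` is the cl::opt defaulting to 20):

```cpp
// If the target has hardware branch prediction that can handle indirect
// branches, duplicating them can often make them predictable when there
// are common paths through the code.  The limit needs to be high enough
// to allow undoing the effects of tail merging and other rearrangement
// of the conditional branch at the bottom of the loop.
bool HasIndirectbr = false;
if (!TailBB.empty())
  HasIndirectbr = TailBB.back().isIndirectBranch();

if (HasIndirectbr && PreRegAlloc)
  MaxDuplicateCount = TailDupIndirectBranchSize;
```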
I understand from the comments that this is meant to improve the accuracy of branch prediction. I want to know whether that is still appropriate with numerous indirect branches. So I did some experimenting at https://github.com/DianQK/llvm-tail-dup-indirect-succ-size, using the perf stat results from above. One of my results is as follows:

If this is the right route, I'll continue to figure out the other two problems.
@efriedma-quic I've put most of the analysis into this comment. In a nutshell, I speculate that duplicating critical BBs makes the CFG exceptionally complex, especially within loops, which may be the primary reason for the increased time consumption in other passes. A critical BB refers to one with multiple predecessors and multiple successors.
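A minimal LLVM-IR illustration of such a critical block (labels and values are hypothetical):

```llvm
; %crit has two predecessors (%bb1, %bb2) and two successors
; (%left, %right). Tail-duplicating it into both predecessors
; multiplies edges (P+S edges become up to P*S) and each duplicated
; value needs new PHI entries in every successor.
crit:
  %v = phi i32 [ %a, %bb1 ], [ %b, %bb2 ]
  %c = icmp eq i32 %v, 0
  br i1 %c, label %left, label %right
```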
Ping. If the new changes are suitable, I hope to catch the final release of 18.1.0.

Ping.

Ping. I assume this is an uncommon scenario, since there has never been feedback on similar compile-time issues before.
LGTM. The new check looks sufficiently narrow (requires many predecessors and many successors) to just do this.
We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^ It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

I apologize for my late reply. (*^^*)

I'll verify at the end of the week that increasing to 128 has no noticeable impact on compilation time.
Could you try with …?

Sure, I'll run the comparison before and after this commit with ….

Benchmark compiled with clang after this commit with ….

Actually, just ….

Thanks, I will continue to investigate this.
Hmm, I tried LLVM 18 and the main (c8864bc) branch on Raspberry Pi 4 (arm64), but I didn't find any performance issues:
…96089) This patch reverts #81585 as #78582 has landed. Now clang works well with reproducer #79993 (comment).
…" (llvm#96089) This patch reverts llvm#81585 as llvm#78582 has been landed. Now clang works well with reproducer llvm#79993 (comment).
This adjusts the threshold logic added in llvm#78582 to only trigger for cases where there are actually phis to duplicate in either TailBB or in one of the successors. In cases where there are no phis, we only have to pay the cost of extra edges, with no explosion in PHI-related instructions. This improves performance of Python on some inputs by 2-3% on Apple Silicon CPUs.
…onsider tail duplicating blocks (llvm#78582)" This reverts commit 86a7828. Now, we only consider computed GOTOs.
Fixes #78578.
Duplicating a BB which has both multiple predecessors and multiple successors will result in a complex CFG and may also create a huge number of PHI nodes. See #78578 (comment) for a detailed description of the limit.
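A minimal sketch of the guard this lands on, per the PR title and the description above (the option names and the exact placement are assumptions, not quoted from the final patch):

```cpp
// Hypothetical rendering of the final check: only bail out when the
// block is "critical", i.e. has many predecessors AND many successors,
// since that combination is what explodes edges and PHI nodes.
if (TailBB.pred_size() > TailDupPredSize &&
    TailBB.succ_size() > TailDupSuccSize)
  return false;
```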