[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

DianQK · 2024-01-18T14:00:03Z

Duplicating a BB which has both multiple predecessors and successors will result in a complex CFG and also may cause huge amount of PHI nodes. See #78578 (comment) for a detailed description of the limit.

llvmbot · 2024-01-18T14:00:35Z

@llvm/pr-subscribers-backend-x86

Author: Quentin Dian (DianQK)

Changes

Fixes #78578.

We should add a count check to the predecessors to avoid the code size explosion.

I found a strange argument during my investigation.

llvm-project/llvm/lib/CodeGen/TailDuplicator.cpp

Lines 76 to 77 in 4b2381a

    
           static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U), 
        
                                                 cl::Hidden);

We didn't use -tail-dup-limit while setting a meaningless value.

Also, it may be that an issue with AsmPrinter is causing this use case to print two line breaks. This makes the test case fail. I haven't checked, but I don't think it at least affects the review.

Patch is 29.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78582.diff

3 Files Affected:

(modified) llvm/lib/CodeGen/TailDuplicator.cpp (+7)
(modified) llvm/test/CodeGen/X86/mul-constant-result.ll (+135-188)
(added) llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll (+242)

diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 5ed67bd0a121ed..e76d63d3c0d66f 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -76,6 +76,11 @@ static cl::opt<bool>
 static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
                                       cl::Hidden);
 
+static cl::opt<unsigned> TailDupPredSizeLimit(
+    "tail-dup-pred-size-limit",
+    cl::desc("Maximum predecessors to consider tail duplicating."), cl::init(8),
+    cl::Hidden);
+
 void TailDuplicator::initMF(MachineFunction &MFin, bool PreRegAlloc,
                             const MachineBranchProbabilityInfo *MBPIin,
                             MBFIWrapper *MBFIin,
@@ -565,6 +570,8 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   if (TailBB.isSuccessor(&TailBB))
     return false;
 
+  if (TailDupPredSizeLimit < TailBB.pred_size())
+    return false;
   // Set the limit on the cost to duplicate. When optimizing for size,
   // duplicate only one, because one branch instruction can be eliminated to
   // compensate for the duplication.
diff --git a/llvm/test/CodeGen/X86/mul-constant-result.ll b/llvm/test/CodeGen/X86/mul-constant-result.ll
index 1f9e7a93ad0b90..73c764a3f53da1 100644
--- a/llvm/test/CodeGen/X86/mul-constant-result.ll
+++ b/llvm/test/CodeGen/X86/mul-constant-result.ll
@@ -28,162 +28,132 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
 ; X86-NEXT:  .LBB0_4:
 ; X86-NEXT:    decl %ecx
 ; X86-NEXT:    cmpl $31, %ecx
-; X86-NEXT:    ja .LBB0_35
+; X86-NEXT:    ja .LBB0_31
 ; X86-NEXT:  # %bb.5:
 ; X86-NEXT:    jmpl *.LJTI0_0(,%ecx,4)
 ; X86-NEXT:  .LBB0_6:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
+; X86-NEXT:    jmp .LBB0_40
 ; X86-NEXT:  .LBB0_7:
-; X86-NEXT:    .cfi_def_cfa_offset 8
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,2), %ecx
-; X86-NEXT:    jmp .LBB0_9
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
 ; X86-NEXT:  .LBB0_8:
 ; X86-NEXT:    movl %eax, %ecx
 ; X86-NEXT:    shll $4, %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_10:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_9:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_11:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_10:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_13:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_11:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_12:
 ; X86-NEXT:    leal (%eax,%eax,2), %ecx
-; X86-NEXT:    jmp .LBB0_14
-; X86-NEXT:  .LBB0_15:
+; X86-NEXT:    leal (%eax,%ecx,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_13:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    jmp .LBB0_12
-; X86-NEXT:  .LBB0_16:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_14:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,4), %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_17:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_15:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    jmp .LBB0_12
-; X86-NEXT:  .LBB0_19:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_17:
 ; X86-NEXT:    shll $4, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_20:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_18:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_21:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_19:
 ; X86-NEXT:    shll $3, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_22:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_20:
 ; X86-NEXT:    shll $5, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_23:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_21:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:  .LBB0_33:
 ; X86-NEXT:    leal (%eax,%eax,8), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_24:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_22:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
-; X86-NEXT:  .LBB0_14:
 ; X86-NEXT:    leal (%eax,%ecx,4), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_25:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_23:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_26:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_24:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
 ; X86-NEXT:    leal (%eax,%ecx,4), %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_27:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_25:
 ; X86-NEXT:    leal (%eax,%eax), %ecx
 ; X86-NEXT:    shll $4, %eax
-; X86-NEXT:    jmp .LBB0_28
-; X86-NEXT:  .LBB0_29:
+; X86-NEXT:    subl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_26:
 ; X86-NEXT:    leal (,%eax,8), %ecx
-; X86-NEXT:    jmp .LBB0_38
-; X86-NEXT:  .LBB0_30:
+; X86-NEXT:    jmp .LBB0_33
+; X86-NEXT:  .LBB0_27:
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
-; X86-NEXT:    jmp .LBB0_32
-; X86-NEXT:  .LBB0_31:
+; X86-NEXT:    leal (%eax,%ecx,2), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_28:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
-; X86-NEXT:  .LBB0_32:
 ; X86-NEXT:    leal (%eax,%ecx,2), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_34:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_29:
+; X86-NEXT:    leal (%eax,%eax,8), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_30:
 ; X86-NEXT:    movl %eax, %ecx
 ; X86-NEXT:    shll $5, %ecx
-; X86-NEXT:    jmp .LBB0_38
-; X86-NEXT:  .LBB0_35:
+; X86-NEXT:    jmp .LBB0_33
+; X86-NEXT:  .LBB0_31:
 ; X86-NEXT:    xorl %eax, %eax
-; X86-NEXT:  .LBB0_36:
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_37:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_32:
 ; X86-NEXT:    leal (%eax,%eax,2), %ecx
 ; X86-NEXT:    shll $3, %ecx
-; X86-NEXT:  .LBB0_38:
+; X86-NEXT:  .LBB0_33:
 ; X86-NEXT:    subl %eax, %ecx
 ; X86-NEXT:    movl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_39:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_34:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:  .LBB0_12:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_40:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_35:
 ; X86-NEXT:    shll $3, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_41:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_36:
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,2), %ecx
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:  .LBB0_9:
 ; X86-NEXT:    addl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_42:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_37:
 ; X86-NEXT:    leal (%eax,%eax), %ecx
 ; X86-NEXT:    shll $5, %eax
-; X86-NEXT:  .LBB0_28:
 ; X86-NEXT:    subl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_43:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_38:
 ; X86-NEXT:    leal (%eax,%eax,8), %eax
-; X86-NEXT:  .LBB0_18:
+; X86-NEXT:  .LBB0_39:
 ; X86-NEXT:    leal (%eax,%eax,2), %eax
+; X86-NEXT:  .LBB0_40:
 ; X86-NEXT:    popl %esi
 ; X86-NEXT:    .cfi_def_cfa_offset 4
 ; X86-NEXT:    retl
@@ -199,154 +169,131 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
 ; X64-HSW-NEXT:    cmovel %ecx, %eax
 ; X64-HSW-NEXT:    decl %edi
 ; X64-HSW-NEXT:    cmpl $31, %edi
-; X64-HSW-NEXT:    ja .LBB0_31
+; X64-HSW-NEXT:    ja .LBB0_28
 ; X64-HSW-NEXT:  # %bb.1:
 ; X64-HSW-NEXT:    jmpq *.LJTI0_0(,%rdi,8)
 ; X64-HSW-NEXT:  .LBB0_2:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_37
 ; X64-HSW-NEXT:  .LBB0_3:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,2), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
+; X64-HSW-NEXT:    jmp .LBB0_21
 ; X64-HSW-NEXT:  .LBB0_4:
 ; X64-HSW-NEXT:    movl %eax, %ecx
 ; X64-HSW-NEXT:    shll $4, %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
+; X64-HSW-NEXT:    jmp .LBB0_21
 ; X64-HSW-NEXT:  .LBB0_5:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:  .LBB0_13:
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_36
 ; X64-HSW-NEXT:  .LBB0_6:
 ; X64-HSW-NEXT:    shll $2, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_7:
+; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
+; X64-HSW-NEXT:    jmp .LBB0_37
 ; X64-HSW-NEXT:  .LBB0_8:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_10:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_9:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_7:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_11:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_10:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,4), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
-; X64-HSW-NEXT:  .LBB0_12:
+; X64-HSW-NEXT:    jmp .LBB0_21
+; X64-HSW-NEXT:  .LBB0_11:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_14:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_13:
 ; X64-HSW-NEXT:    shll $4, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_15:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_14:
 ; X64-HSW-NEXT:    shll $2, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_16:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_15:
 ; X64-HSW-NEXT:    shll $3, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_17:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_16:
 ; X64-HSW-NEXT:    shll $5, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_18:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_17:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_29:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_19:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_18:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_20:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_19:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_21:
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_20:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %ecx
-; X64-HSW-NEXT:  .LBB0_22:
+; X64-HSW-NEXT:  .LBB0_21:
 ; X64-HSW-NEXT:    addl %eax, %ecx
 ; X64-HSW-NEXT:    movl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_23:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_22:
 ; X64-HSW-NEXT:    leal (%rax,%rax), %ecx
 ; X64-HSW-NEXT:    shll $4, %eax
 ; X64-HSW-NEXT:    subl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_25:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_23:
 ; X64-HSW-NEXT:    leal (,%rax,8), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_34
-; X64-HSW-NEXT:  .LBB0_26:
+; X64-HSW-NEXT:    jmp .LBB0_30
+; X64-HSW-NEXT:  .LBB0_24:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_27:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_25:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_30:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_26:
+; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_27:
 ; X64-HSW-NEXT:    movl %eax, %ecx
 ; X64-HSW-NEXT:    shll $5, %ecx
-; X64-HSW-NEXT:    jmp .LBB0_34
-; X64-HSW-NEXT:  .LBB0_31:
+; X64-HSW-NEXT:    jmp .LBB0_30
+; X64-HSW-NEXT:  .LBB0_28:
 ; X64-HSW-NEXT:    xorl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_32:
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_33:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_29:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %ecx
 ; X64-HSW-NEXT:    shll $3, %ecx
-; X64-HSW-NEXT:  .LBB0_34:
+; X64-HSW-NEXT:  .LBB0_30:
 ; X64-HSW-NEXT:    subl %eax, %ecx
 ; X64-HSW-NEXT:    movl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_36:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_31:
 ; X64-HSW-NEXT:    shll $2, %eax
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_37:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_32:
 ; X64-HSW-NEXT:    shll $3, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_38:
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_33:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,2), %ecx
 ; X64-HSW-NEXT:    addl %eax, %eax
 ; X64-HSW-NEXT:    addl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_39:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_34:
 ; X64-HSW-NEXT:    leal (%rax,%rax), %ecx
 ; X64-HSW-NEXT:    shll $5, %eax
 ; X64-HSW-NEXT:    subl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_40:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_35:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT:  .LBB0_36:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
+; X64-HSW-NEXT:  .LBB0_37:
 ; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
 ; X64-HSW-NEXT:    retq
   %3 = icmp eq i32 %1, 0
diff --git a/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
new file mode 100644
index 00000000000000..47b9fcaa7d6c85
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
@@ -0,0 +1,242 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=3 < %s | FileCheck %s -check-prefix=LIMIT
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=4 < %s | FileCheck %s -check-prefix=NOLIMIT
+
+define i32 @foo(ptr %0, i32 %1) {
+  ; LIMIT-LABEL: name: foo
+  ; LIMIT: bb.0 (%ir-block.2):
+  ; LIMIT-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; LIMIT-NEXT:   liveins: $rdi, $esi
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[COPY:%[0-9]+]]:gr32 = COPY $esi
+  ; LIMIT-NEXT:   [[COPY1:%[0-9]+]]:gr64 = COPY $rdi
+  ; LIMIT-NEXT:   [[SHR32ri:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 1, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[AND32ri:%[0-9]+]]:gr32 = AND32ri [[SHR32ri]], 7, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[SUBREG_TO_REG:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri]], %subreg.sub_32bit
+  ; LIMIT-NEXT:   JMP64m $noreg, 8, [[SUBREG_TO_REG]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.1 (%ir-block.5):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.2 (%ir-block.7):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm1:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.3 (%ir-block.10):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm2:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.4 (%ir-block.13):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm3:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.5.default.unreachable2:
+  ; LIMIT-NEXT:   successors:
+  ; LIMIT: bb.6 (%ir-block.16):
+  ; LIMIT-NEXT:   successors: %bb.7(0x20000000), %bb.8(0x20000000), %bb.9(0x20000000), %bb.10(0x20000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[PHI:%[0-9]+]]:gr32 = PHI [[SHR32ri3]], %bb.4, [[SHR32ri2]], %bb.3, [[SHR32ri1]], %bb.2, [[MOV32rm]], %bb.1
+  ; LIMIT-NEXT:   [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri4]], 7, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri1]], %subreg.sub_32bit
+  ; LIMIT-NEXT:   JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg :: (load (s64) from jump-table)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.7 (%ir-block.20):
+  ; LIMIT-NEXT:   successors: %bb.11(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm4:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   JMP_1 %bb.11
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.8 (%ir-block.22):
+  ; LIMIT-NEXT:   successors: %bb.11(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm5:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32r...
[truncated]

DianQK · 2024-01-18T14:18:26Z

I can verify that the initial OOM case is back to normal.

Also, when the default branch is removed, the instructions have some improvements.

With the default branch:

Iterations:        100
Instructions:      710800
Total Cycles:      166608
Total uOps:        710800

Dispatch Width:    6
uOps Per Cycle:    4.27
IPC:               4.27
Block RThroughput: 1216.5

Without the default branch:

Iterations:        100
Instructions:      709000
Total Cycles:      158311
Total uOps:        709000

Dispatch Width:    6
uOps Per Cycle:    4.48
IPC:               4.48
Block RThroughput: 1205.0

Text diff?

diff --git a/output.s b/output.s
index 322d0d0..6ca97d0 100644
--- a/output.s
+++ b/output.s
@@ -1,5 +1,5 @@
 	.text
-	.file	"oom_manual.c"
+	.file	"oom_manual2.c"
 	.globl	f1                              # -- Begin function f1
 	.p2align	4, 0x90
 	.type	f1,@function
@@ -33,12805 +33,12788 @@ f1:                                     # @f1
 	movl	%eax, %ecx
 	shrl	%ecx
 	andl	$127, %ecx
-	cmpl	$126, %ecx
-	ja	.LBB0_15
-# %bb.1:
 	jmpq	*.LJTI0_0(,%rcx,8)

I can see many cmpl instructions disappearing.

dwblaikie · 2024-01-18T19:06:34Z

@aeubanks mind taking a look at this small patch? Is this limit reasonable/consistent with other similar limits, etc? Do we need more data to back up why this particular bound was chosen?

aeubanks · 2024-01-18T20:35:19Z

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

aeubanks · 2024-01-18T20:49:51Z

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened? was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated? or is there an underlying algorithmic problem that we can fix?

DianQK · 2024-01-19T14:32:48Z

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but that would make this test case poorly maintained? I'm not sure, but I can change this.

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened?

It looks like the default branch of switch creates an if statement (compare instruction)? This results in two successors.
The previous patch did one thing, replace the default branch with an unreachable instruction. Then TailDuplicator can duplicate blocks into predecessors.
Since we revert the previous patch, we can't reproduce it directly in the main branch. But it can be reproduced by replacing it with an unreachable default branch in the C source code. e.g.

    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \

was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated?

I think so.
I checked the time consumption of lld with -time-trace (llc -O1 oom.bc -time-trace -time-trace-file=trace.json -filetype=null). It takes 4s to complete on my device. About 1.3s is Early Tail Duplication. And 1.3s is Eliminate PHI nodes for register allocation. (I tried perf, but maybe it's the environment of my machine that makes this unavailable.)
I had to use TimeProfiler. tailDuplicateAndUpdate and tailDuplicateAndUpdate occupy half the time, respectively.

or is there an underlying algorithmic problem that we can fix?

I'm not sure, but I can try if I need to. But I don't know much about MIR and performance improvements.
I'm also a little concerned about the code size, which seems to increase from about 50K to about 200K for the initial example. :3

aeubanks · 2024-01-19T18:28:30Z

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but that would make this test case poorly maintained? I'm not sure, but I can change this.

What do you mean "poorly maintained"?

Given that the option works on number of MIR instructions, we should keep the input MIR instruction count consistent.

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened?

It looks like the default branch of switch creates an if statement (compare instruction)? This results in two successors. The previous patch did one thing, replace the default branch with an unreachable instruction. Then TailDuplicator can duplicate blocks into predecessors. Since we revert the previous patch, we can't reproduce it directly in the main branch. But it can be reproduced by replacing it with an unreachable default branch in the C source code. e.g.
    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \
was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated?

I think so. I checked the time consumption of lld with -time-trace (llc -O1 oom.bc -time-trace -time-trace-file=trace.json -filetype=null). It takes 4s to complete on my device. About 1.3s is Early Tail Duplication. And 1.3s is Eliminate PHI nodes for register allocation. (I tried perf, but maybe it's the environment of my machine that makes this unavailable.) I had to use TimeProfiler. tailDuplicateAndUpdate and tailDuplicateAndUpdate occupy half the time, respectively.

or is there an underlying algorithmic problem that we can fix?

I'm not sure, but I can try if I need to. But I don't know much about MIR and performance improvements. I'm also a little concerned about the code size, which seems to increase from about 50K to about 200K for the initial example. :3

if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment

bzEq · 2024-01-19T19:46:48Z

The cause is early-tailduplication transforms normal switch to computed-gotos, which means a lot of indirect branches are generated, adding quadratic number of edges inside these cases, that's to say at the end of each case, the control flow is able to jump to other cases directly. As a consequence, passes run after early-tailduplication get lag due to a more complex CFG.
You may be interested in https://reviews.llvm.org/D110613.

weiguozhi · 2024-01-19T21:21:53Z

If you want to limit the number of tail duplication, I want to see a partial tail duplication base on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.

DianQK · 2024-01-21T10:53:14Z

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but that would make this test case poorly maintained? I'm not sure, but I can change this.

What do you mean "poorly maintained"?

Given that the option works on number of MIR instructions, we should keep the input MIR instruction count consistent.

@aeubanks
I have changed the test case. That was my mistake, I mistakenly thought the mir file would be very large.

if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment

@aeubanks
I've updated the issue. I added all the details I could.

The cause is early-tailduplication transforms normal switch to computed-gotos, which means a lot of indirect branches are generated, adding quadratic number of edges inside these cases, that's to say at the end of each case, the control flow is able to jump to other cases directly. As a consequence, passes run after early-tailduplication get lag due to a more complex CFG. You may be interested in https://reviews.llvm.org/D110613.

@bzEq
This looks like a very similar patch. But it looks like it's been revert for unknown reasons. What I've found so far is that the main time is early-tailduplication.

If you want to limit the number of tail duplication, I want to see a partial tail duplication base on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.

@weiguozhi
Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

weiguozhi · 2024-01-22T19:30:52Z

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disable tail duplication for blocks with 8 predecessors may hurt performance of some applications. Duplicate blocks into hot predecessors only can still give you the benefit of tail duplication, and at the same time limit the number of duplication. MachineBlockPlacement does the same thing in the late tail duplication (embedded in the MBP).

DianQK · 2024-01-23T04:01:56Z

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disable tail duplication for blocks with 8 predecessors may hurt performance of some applications. Duplicate blocks into hot predecessors only can still give you the benefit of tail duplication, and at the same time limit the number of duplication. MachineBlockPlacement does the same thing in the late tail duplication (embedded in the MBP).

But the currently given example does not contain profile information. I don't think this is a solution to the current problem to be considered.

dwblaikie · 2024-01-23T16:30:45Z

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disable tail duplication for blocks with 8 predecessors may hurt performance of some applications. Duplicate blocks into hot predecessors only can still give you the benefit of tail duplication, and at the same time limit the number of duplication. MachineBlockPlacement does the same thing in the late tail duplication (embedded in the MBP).

But the currently given example does not contain profile information. I don't think this is a solution to the current problem to be considered.

I think the argument is: The currently proposed solution may harm performance in some cases - and that that loss can be mitigated at least in the presence of profile information. (& so with that mitigation, maybe the overall cost is low enough to be worth shipping) - also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then.

nikic · 2024-01-23T20:01:04Z

We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.

DianQK · 2024-01-25T14:19:47Z

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disable tail duplication for blocks with 8 predecessors may hurt performance of some applications. Duplicate blocks into hot predecessors only can still give you the benefit of tail duplication, and at the same time limit the number of duplication. MachineBlockPlacement does the same thing in the late tail duplication (embedded in the MBP).

But the currently given example does not contain profile information. I don't think this is a solution to the current problem to be considered.

I think the argument is: The currently proposed solution may harm performance in some cases - and that that loss can be mitigated at least in the presence of profile information. (& so with that mitigation, maybe the overall cost is low enough to be worth shipping) - also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then.

I'm not sure if I want to use PGO, since most applications won't be built with PGO. But I think the limiting conditions here can be adjusted. Perhaps consider the number of instructions duplicated.

DianQK · 2024-01-25T14:28:12Z

We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.

I've written down the compilation time in the issue. I simply tried the runtime benchmark.

The oom_manual.c file is to replace the default branch by __builtin_unreachable().

clang -v:

clang version 17.0.6
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/j7mhyvnbbm4pk7vriilrc5wvh2kims5p-clang-17.0.6/bin

main.c:

int src(void) {
  return -1;
}

extern int f1(unsigned int *b);

int main(int argc, char **argv) {
  int r = argc;
  unsigned int b[] = { -1, -2, -3 };
  for (int i = 0; i < 1000000; i++) {
    r += f1(b);
  }
  return r;
}

build.sh:

clang -O1 oom_manual.c main.c -o oom_manual
clang -O1 oom_manual2.c main.c -o oom_manual2
ls -lh oom_manual oom_manual2

output:

-rwxr-xr-x 1 dianqk users  56K Jan 25 22:15 oom_manual
-rwxr-xr-x 1 dianqk users 192K Jan 25 22:15 oom_manual2

hyperfine.sh:

hyperfine -i -N --runs 200 --warmup 50 ./oom_manual ./oom_manual2

output:

Benchmark 1: ./oom_manual
  Time (mean ± σ):      12.4 ms ±   0.0 ms    [User: 12.2 ms, System: 0.2 ms]
  Range (min … max):    12.4 ms …  12.6 ms    200 runs

Benchmark 2: ./oom_manual2
  Time (mean ± σ):      14.9 ms ±   0.3 ms    [User: 14.7 ms, System: 0.2 ms]
  Range (min … max):    14.8 ms …  18.8 ms    200 runs

Summary
  ./oom_manual ran
    1.20 ± 0.02 times faster than ./oom_manual2

perf.sh:

function run_perf() {
  echo "perf stat $1"
  perf stat -x \; \
    -e instructions \
    -e instructions:u \
    -e cycles \
    -e task-clock \
    -e branches \
    -e branch-misses \
    $1
}

run_perf ./oom_manual
run_perf ./oom_manual2

output:

perf stat ./oom_manual
251193019;;instructions:u;12839070;100.00;3.63;insn per cycle
251193019;;instructions:u;12839070;100.00;3.63;insn per cycle
69202217;;cycles:u;12839070;100.00;5.390;GHz
12.84;msec;task-clock:u;12839070;100.00;0.981;CPUs utilized
56047969;;branches:u;12839070;100.00;4.365;G/sec
2876;;branch-misses:u;12839070;100.00;0.01;of all branches
perf stat ./oom_manual2
424193026;;instructions:u;15665683;100.00;5.12;insn per cycle
424193026;;instructions:u;15665683;100.00;5.12;insn per cycle
82892122;;cycles:u;15665683;100.00;5.291;GHz
15.67;msec;task-clock:u;15665683;100.00;0.981;CPUs utilized
32047975;;branches:u;15665683;100.00;2.046;G/sec
2827;;branch-misses:u;15665683;100.00;0.01;of all branches

I am trying to change the code to see the results of different scenarios.

DianQK · 2024-02-05T13:36:01Z

I made some progress.

There are usually only two instructions that can be duplicated. Indirect branches increase this limit to 20.

llvm-project/llvm/lib/CodeGen/TailDuplicator.cpp

Lines 591 to 602 in 4b2381a

    
           // If the target has hardware branch prediction that can handle indirect 
        
           // branches, duplicating them can often make them predictable when there 
        
           // are common paths through the code.  The limit needs to be high enough 
        
           // to allow undoing the effects of tail merging and other optimizations 
        
           // that rearrange the predecessors of the indirect branch. 
        
           bool HasIndirectbr = false; 
        
           if (!TailBB.empty()) 
        
             HasIndirectbr = TailBB.back().isIndirectBranch(); 
        
           if (HasIndirectbr && PreRegAlloc) 
        
             MaxDuplicateCount = TailDupIndirectBranchSize;

I understand from the comments that this is to improve the accuracy of branch prediction. I want to know that if it is appropriate with numerous of indirect branches. So I did some experimenting on https://github.com/DianQK/llvm-tail-dup-indirect-succ-size.

I'm using the result of instructions:u and branch-misses:u that don't call f1 as a baseline. The arguments of dup are -tail-dup-indirect-size=20, and the arguments of nodup are -tail-dup-indirect-size=2. I think the replication of this scenario may not make sense.

One of my results is as follows:

==================
Iterations: 1
dup: instructions: 418, branches: 32, branch-misses: 64
nodup: instructions: 280, branches: 53, branch-misses: 1
==================
Iterations: 2
dup: instructions: 840, branches: 63, branch-misses: 60
nodup: instructions: 565, branches: 106, branch-misses: 30
==================
Iterations: 3
dup: instructions: 1260, branches: 93, branch-misses: 46
nodup: instructions: 857, branches: 159, branch-misses: 62
==================
Iterations: 4
dup: instructions: 1681, branches: 123, branch-misses: 48
nodup: instructions: 1135, branches: 212, branch-misses: 74
==================
Iterations: 5
dup: instructions: 2102, branches: 152, branch-misses: 46
nodup: instructions: 1434, branches: 265, branch-misses: 67
==================
Iterations: 6
dup: instructions: 2536, branches: 183, branch-misses: 55
nodup: instructions: 1705, branches: 318, branch-misses: 82
==================
Iterations: 7
dup: instructions: 2944, branches: 212, branch-misses: 60
nodup: instructions: 1990, branches: 371, branch-misses: 77
==================
Iterations: 8
dup: instructions: 3371, branches: 243, branch-misses: 45
nodup: instructions: 2279, branches: 424, branch-misses: 57
==================
Iterations: 9
dup: instructions: 3793, branches: 273, branch-misses: 48
nodup: instructions: 2565, branches: 477, branch-misses: 56
==================
Iterations: 10
dup: instructions: 4232, branches: 309, branch-misses: 61
nodup: instructions: 2870, branches: 536, branch-misses: 49
==================
Iterations: 100
dup: instructions: 42138, branches: 3015, branch-misses: 54
nodup: instructions: 28540, branches: 5312, branch-misses: 55
==================
Iterations: 1000
dup: instructions: 421057, branches: 30020, branch-misses: 42
nodup: instructions: 285055, branches: 53018, branch-misses: 84
==================
Iterations: 10000
dup: instructions: 4210078, branches: 300027, branch-misses: 51
nodup: instructions: 2850075, branches: 530024, branch-misses: 15
==================
Iterations: 100000
dup: instructions: 42100100, branches: 3000035, branch-misses: 60
nodup: instructions: 28500098, branches: 5300032, branch-misses: 52
==================
Iterations: 1000000
dup: instructions: 421000146, branches: 30000068, branch-misses: 157
nodup: instructions: 285000149, branches: 53000061, branch-misses: 117
==================
Iterations: 10000000
dup: instructions: 4210000439, branches: 300000341, branch-misses: 749
nodup: instructions: 2850000384, branches: 530000291, branch-misses: 621
==================
Iterations: 100000000
dup: instructions: 42100003182, branches: 3000003075, branch-misses: 6277
nodup: instructions: 28500002671, branches: 5300002564, branch-misses: 5186

If this is the right route, I'll continue to figure out the other two problems.

Why does execute more instructions? (Related to increased PHI?)
Finding a suitable successor size?

DianQK · 2024-02-24T13:28:41Z

@efriedma-quic I've put most of the analysis into this comment.

In a nutshell, I speculate that duplicating critical BBs makes CFG exceptionally complex, especially within loops, which may be the primary reason for increased time consumption by other passes. A critical BB refers to one with multiple predecessors and multiple successors.

DianQK · 2024-02-27T01:53:26Z

Ping. If the new changes are suitable, I hope to catch up on the final release of 18.1.0.

dtcxzyw · 2024-03-22T14:08:41Z

Ping.

DianQK · 2024-04-15T13:38:28Z

Ping. I assume this is an uncommon scenario, since there has never been feedback on similar compile-time issues before.

nikic

LGTM. The new check looks sufficiently narrow (requires many predecessors and many successors) to just do this.

alexfh · 2024-04-22T11:28:40Z

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

alexfh · 2024-04-25T21:48:34Z

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^

It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

DianQK · 2024-04-25T22:17:42Z

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^

It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

I apologize for my late reply. (*^^*)
I think we can adjust the values of tail-dup-succ-size and tail-dup-pred-size, and based on my experiments, this compilation time is acceptable up to 100.

DianQK · 2024-04-25T22:22:06Z

I'll verify at the end of the week that increasing to 128 has no noticeable impact on compilation time.

DianQK · 2024-04-27T12:40:15Z

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

Could you try with -tail-dup-pred-size=64 -tail-dup-succ-size=64? It seems I can't reproduce it on x86-64.

alexfh · 2024-05-02T11:13:45Z

Sure, I'll run the comparison before and after this commit with -tail-dup-pred-size=64 -tail-dup-succ-size=64. And yes, this only reproduces on AArch64, but not on any flavor of x86-64 I have access to.

alexfh · 2024-05-02T14:21:32Z

Benchmark compiled with clang after this commit with -mllvm=-tail-dup-pred-size=64 -mllvm=-tail-dup-succ-size=64 has no regression compared to before this commit. I'm running with 32 just in case. Also note that this benchmark is in no way considered representative of any real workloads we care about.

alexfh · 2024-05-02T14:50:59Z

Actually, just -mllvm=-tail-dup-succ-size=32 (without touching -tail-dup-pred-size) is enough to fix the regression, while -mllvm=-tail-dup-succ-size=24 -mllvm=-tail-dup-succ-size=24 is not. Thus, the optimal value of -tail-dup-succ-size for this particular benchmark on aarch64 is somewhere between 24 and 32 (with the default value of -tail-dup-pred-size - 16).

DianQK · 2024-05-06T00:06:12Z

Thanks, I will continue to investigate this.

DianQK · 2024-05-12T10:09:34Z

Hmm, I tried LLVM 18 and the main (c8864bc) branch on Raspberry Pi 4 (arm64), but I didn't find any performance issues:


 Performance counter stats for './llvm18':

          1,805.64 msec task-clock:u                     #    0.994 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   25.476 /sec
     2,653,746,964      cycles:u                         #    1.470 GHz
     2,656,672,132      instructions:u                   #    1.00  insn per cycle
   <not supported>      branches:u
           247,193      branch-misses:u

       1.815950443 seconds time elapsed

       1.800769000 seconds user
       0.004009000 seconds sys


Sum: 3273600000

 Performance counter stats for './main':

          1,784.68 msec task-clock:u                     #    0.998 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   25.775 /sec
     2,653,681,171      cycles:u                         #    1.487 GHz
     2,656,672,132      instructions:u                   #    1.00  insn per cycle
   <not supported>      branches:u
           241,177      branch-misses:u

       1.788022549 seconds time elapsed

       1.780709000 seconds user
       0.003991000 seconds sys

…96089) This patch reverts #81585 as #78582 has been landed. Now clang works well with reproducer #79993 (comment).

…" (llvm#96089) This patch reverts llvm#81585 as llvm#78582 has been landed. Now clang works well with reproducer llvm#79993 (comment).

This adjusts the threshold logic added in llvm#78582 to only trigger for cases where there are actually phis to duplicate in either TailBB or in one of the successors. In cases there are no phis, we only have to pay the cost of extra edges, but have no explosion in PHI related instructions. This improves performance of Python on some inputs by 2-3% on Apple Silicon CPUs.

…onsider tail duplicating blocks (llvm#78582)" This reverts commit 86a7828. Now, we only consider computed GOTOs.

llvmbot added the backend:X86 label Jan 18, 2024

DianQK requested review from TNorthover, dwblaikie, bjope and mikaelholmen January 18, 2024 14:04

dwblaikie requested a review from aeubanks January 18, 2024 19:05

bzEq requested review from efriedma-quic, qcolombet, MatzeB and weiguozhi January 19, 2024 19:47

DianQK force-pushed the tail-dup-pred-size-limit branch from 8777d96 to 028c8c7 Compare January 21, 2024 10:43

DianQK mentioned this pull request Jan 31, 2024

[MIRPrinter] Don't print space when there is no successor #80143

Merged

DianQK force-pushed the tail-dup-pred-size-limit branch from 028c8c7 to b0e06a2 Compare February 5, 2024 13:17

DianQK changed the title ~~[TailDuplicator] Add a limit on the number of indirect branch successors~~ [TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks Feb 24, 2024

nikic approved these changes Apr 16, 2024

View reviewed changes

DianQK merged commit 86a7828 into llvm:main Apr 17, 2024
3 of 4 checks passed

DianQK deleted the tail-dup-pred-size-limit branch April 17, 2024 12:27

DianQK mentioned this pull request May 25, 2024

Remove the default branch of a switch #73446

Closed

dtcxzyw mentioned this pull request Jun 19, 2024

Reland "[CVP] Check whether the default case is reachable (#79993)" #96089

Merged

dtcxzyw added a commit that referenced this pull request Jun 20, 2024

Reland "[CVP] Check whether the default case is reachable (#79993)" (#…

394e321

…96089) This patch reverts #81585 as #78582 has been landed. Now clang works well with reproducer #79993 (comment).

nikic mentioned this pull request Aug 31, 2024

performance regression in clang-19 when using computed goto #106846

Open

DianQK mentioned this pull request Nov 5, 2024

[TailDuplicator] Only duplicate the blocks containing computed gotos #114990

Open

fhahn mentioned this pull request Nov 13, 2024

[TaiDup] Allow large number of predecessors/successors without phis. #116072

Open

DianQK added a commit to DianQK/llvm-project that referenced this pull request Nov 20, 2024

Revert "[TailDuplicator] Add maximum predecessors and successors to c…

5e4c0f4

…onsider tail duplicating blocks (llvm#78582)" This reverts commit 86a7828. Now, we only consider computed GOTOs.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

DianQK commented Jan 18, 2024 •

edited

Loading

llvmbot commented Jan 18, 2024

DianQK commented Jan 18, 2024

dwblaikie commented Jan 18, 2024

aeubanks commented Jan 18, 2024

aeubanks commented Jan 18, 2024

DianQK commented Jan 19, 2024

aeubanks commented Jan 19, 2024

bzEq commented Jan 19, 2024 •

edited

Loading

weiguozhi commented Jan 19, 2024

DianQK commented Jan 21, 2024

weiguozhi commented Jan 22, 2024

DianQK commented Jan 23, 2024

dwblaikie commented Jan 23, 2024

nikic commented Jan 23, 2024

DianQK commented Jan 25, 2024

DianQK commented Jan 25, 2024

DianQK commented Feb 5, 2024

DianQK commented Feb 24, 2024

DianQK commented Feb 27, 2024

dtcxzyw commented Mar 22, 2024

DianQK commented Apr 15, 2024

nikic left a comment

alexfh commented Apr 22, 2024

alexfh commented Apr 25, 2024

DianQK commented Apr 25, 2024 •

edited

Loading

DianQK commented Apr 25, 2024

DianQK commented Apr 27, 2024

alexfh commented May 2, 2024

alexfh commented May 2, 2024

alexfh commented May 2, 2024 •

edited

Loading

DianQK commented May 6, 2024

DianQK commented May 12, 2024

[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

Conversation

DianQK commented Jan 18, 2024 • edited Loading

llvmbot commented Jan 18, 2024

DianQK commented Jan 18, 2024

dwblaikie commented Jan 18, 2024

aeubanks commented Jan 18, 2024

aeubanks commented Jan 18, 2024

DianQK commented Jan 19, 2024

aeubanks commented Jan 19, 2024

bzEq commented Jan 19, 2024 • edited Loading

weiguozhi commented Jan 19, 2024

DianQK commented Jan 21, 2024

weiguozhi commented Jan 22, 2024

DianQK commented Jan 23, 2024

dwblaikie commented Jan 23, 2024

nikic commented Jan 23, 2024

DianQK commented Jan 25, 2024

DianQK commented Jan 25, 2024

DianQK commented Feb 5, 2024

DianQK commented Feb 24, 2024

DianQK commented Feb 27, 2024

dtcxzyw commented Mar 22, 2024

DianQK commented Apr 15, 2024

nikic left a comment

Choose a reason for hiding this comment

alexfh commented Apr 22, 2024

alexfh commented Apr 25, 2024

DianQK commented Apr 25, 2024 • edited Loading

DianQK commented Apr 25, 2024

DianQK commented Apr 27, 2024

alexfh commented May 2, 2024

alexfh commented May 2, 2024

alexfh commented May 2, 2024 • edited Loading

DianQK commented May 6, 2024

DianQK commented May 12, 2024

DianQK commented Jan 18, 2024 •

edited

Loading

bzEq commented Jan 19, 2024 •

edited

Loading

DianQK commented Apr 25, 2024 •

edited

Loading

alexfh commented May 2, 2024 •

edited

Loading