[LoopVectorize] Refine runtime memory check costs when there is an outer loop #76034

Merged
merged 6 commits into llvm:main from the low_trip_memcheck branch on Jan 26, 2024

Conversation

david-arm
Contributor

@david-arm david-arm commented Dec 20, 2023

When we generate runtime memory checks for an inner loop, it's possible that these checks are invariant in the outer loop and so will get hoisted out. In such cases, the effective cost of the checks should be reduced to reflect the outer loop trip count.

This fixes a 25% performance regression introduced by commit

49b0e6d

when building the SPEC2017 x264 benchmark with PGO, where we decided the inner loop trip count wasn't high enough to warrant the (incorrectly) high cost of the runtime checks. Also, when runtime memory checks consist entirely of diff checks, they are likely to be outer loop invariant.
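
The adjustment can be illustrated with a small standalone sketch (plain C++ rather than the actual LLVM code; the function and variable names below are illustrative only): memory checks that are invariant in an outer loop are only paid once per outer-loop execution, so their cost is divided by the outer trip count (or by 2 when the trip count is unknown) and clamped to a minimum of 1.

```cpp
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <optional>

// Sketch of the scaling rule described above (not the LLVM implementation).
static uint64_t effectiveMemCheckCost(uint64_t MemCheckCost,
                                      bool OuterLoopInvariant,
                                      std::optional<uint64_t> OuterTripCount) {
  if (!OuterLoopInvariant)
    return MemCheckCost;
  // Unknown outer trip count: assume the outer loop runs at least twice.
  uint64_t Divisor = OuterTripCount.value_or(2);
  // Never let the hoisted checks become completely free.
  return std::max<uint64_t>(1, MemCheckCost / Divisor);
}

int main() {
  // Raw memcheck cost of 6 in all four cases.
  std::cout << effectiveMemCheckCost(6, true, 3) << '\n';            // 2
  std::cout << effectiveMemCheckCost(6, true, 64) << '\n';           // 1
  std::cout << effectiveMemCheckCost(6, true, std::nullopt) << '\n'; // 3
  std::cout << effectiveMemCheckCost(6, false, 3) << '\n';           // 6
}
```

In the patch itself, the same scaling is applied to MemCheckCost inside GeneratedRTChecks::getCost() before it is added to RTCheckCost.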

@llvmbot
Collaborator

llvmbot commented Dec 20, 2023

@llvm/pr-subscribers-llvm-transforms

Author: David Sherwood (david-arm)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/76034.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+33-4)
  • (added) llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll (+193)
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index f82e161fb846d1..8287090800dd37 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -2070,7 +2070,7 @@ class GeneratedRTChecks {
     }
   }
 
-  InstructionCost getCost() {
+  InstructionCost getCost(Loop *OuterLoop) {
     if (SCEVCheckBlock || MemCheckBlock)
       LLVM_DEBUG(dbgs() << "Calculating cost of runtime checks:\n");
 
@@ -2091,16 +2091,45 @@ class GeneratedRTChecks {
         LLVM_DEBUG(dbgs() << "  " << C << "  for " << I << "\n");
         RTCheckCost += C;
       }
-    if (MemCheckBlock)
+    if (MemCheckBlock) {
+      InstructionCost MemCheckCost = 0;
       for (Instruction &I : *MemCheckBlock) {
         if (MemCheckBlock->getTerminator() == &I)
           continue;
         InstructionCost C =
             TTI->getInstructionCost(&I, TTI::TCK_RecipThroughput);
         LLVM_DEBUG(dbgs() << "  " << C << "  for " << I << "\n");
-        RTCheckCost += C;
+        MemCheckCost += C;
+      }
+
+      // If the runtime memory checks are being created inside an outer loop
+      // we should find out if these checks are outer loop invariant. If so,
+      // the checks will be hoisted out and so the effective cost will reduce
+      // according to the outer loop trip count.
+      if (OuterLoop) {
+        ScalarEvolution *SE = MemCheckExp.getSE();
+        const SCEV *Cond = SE->getSCEV(MemRuntimeCheckCond);
+        if (SE->isLoopInvariant(Cond, OuterLoop)) {
+          if (std::optional<unsigned> OuterTC =
+                  getSmallBestKnownTC(*SE, OuterLoop))
+            MemCheckCost /= *OuterTC;
+          else {
+            // It seems reasonable to assume that we can reduce the effective
+            // cost of the checks even when we know nothing about the trip
+            // count. Here I've assumed that the outer loop executes at least
+            // twice.
+            MemCheckCost /= 2;
+          }
+
+          // Let's ensure the cost is always at least 1.
+          if (MemCheckCost == 0)
+            MemCheckCost = 1;
+        }
       }
 
+      RTCheckCost += MemCheckCost;
+    }
+
     if (SCEVCheckBlock || MemCheckBlock)
       LLVM_DEBUG(dbgs() << "Total cost of runtime checks: " << RTCheckCost
                         << "\n");
@@ -9754,7 +9783,7 @@ static bool areRuntimeChecksProfitable(GeneratedRTChecks &Checks,
                                        std::optional<unsigned> VScale, Loop *L,
                                        ScalarEvolution &SE,
                                        ScalarEpilogueLowering SEL) {
-  InstructionCost CheckCost = Checks.getCost();
+  InstructionCost CheckCost = Checks.getCost(L->getParentLoop());
   if (!CheckCost.isValid())
     return false;
 
diff --git a/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll
new file mode 100644
index 00000000000000..b740b055822991
--- /dev/null
+++ b/llvm/test/Transforms/LoopVectorize/AArch64/low_trip_memcheck_cost.ll
@@ -0,0 +1,193 @@
+; RUN: opt -p loop-vectorize -mattr=+sve -debug-only=loop-vectorize -S -disable-output < %s 2>&1 | FileCheck %s
+
+target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
+target triple = "aarch64-unknown-linux-gnu"
+
+
+define void @outer_no_tc(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %m, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_no_tc'
+; CHECK:      Calculating cost of runtime checks:
+; CHECK:      Total cost of runtime checks: 3
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+  %cmp24 = icmp sgt i64 %m, 0
+  %cmp222 = icmp sgt i64 %n, 0
+  %or.cond = and i1 %cmp24, %cmp222
+  br i1 %or.cond, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+  %i.025.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+  %mul.us = mul nsw i64 %i.025.us, %n
+  br label %for.body4.us
+
+for.body4.us:
+  %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+  %add.us = add nuw nsw i64 %j.023.us, %mul.us
+  %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+  %0 = load i8, ptr %arrayidx.us, align 1
+  %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+  %1 = load i8, ptr %arrayidx7.us, align 1
+  %add9.us = add i8 %1, %0
+  store i8 %add9.us, ptr %arrayidx7.us, align 1
+  %inc.us = add nuw nsw i64 %j.023.us, 1
+  %exitcond.not = icmp eq i64 %inc.us, %n
+  br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+  %inc12.us = add nuw nsw i64 %i.025.us, 1
+  %exitcond27.not = icmp eq i64 %inc12.us, %m
+  br i1 %exitcond27.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:
+  ret void
+}
+
+
+define void @outer_known_tc3(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_known_tc3'
+; CHECK:      Calculating cost of runtime checks:
+; CHECK:      Total cost of runtime checks: 2
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+  %cmp222 = icmp sgt i64 %n, 0
+  br i1 %cmp222, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+  %i.024.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+  %mul.us = mul nsw i64 %i.024.us, %n
+  br label %for.body4.us
+
+for.body4.us:
+  %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+  %add.us = add nuw nsw i64 %j.023.us, %mul.us
+  %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+  %0 = load i8, ptr %arrayidx.us, align 1
+  %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+  %1 = load i8, ptr %arrayidx7.us, align 1
+  %add9.us = add i8 %1, %0
+  store i8 %add9.us, ptr %arrayidx7.us, align 1
+  %inc.us = add nuw nsw i64 %j.023.us, 1
+  %exitcond.not = icmp eq i64 %inc.us, %n
+  br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+  %inc12.us = add nuw nsw i64 %i.024.us, 1
+  %exitcond26.not = icmp eq i64 %inc12.us, 3
+  br i1 %exitcond26.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:
+  ret void
+}
+
+
+define void @outer_known_tc64(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_known_tc64'
+; CHECK:      Calculating cost of runtime checks:
+; CHECK:      Total cost of runtime checks: 1
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+  %cmp222 = icmp sgt i64 %n, 0
+  br i1 %cmp222, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+  %i.024.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+  %mul.us = mul nsw i64 %i.024.us, %n
+  br label %for.body4.us
+
+for.body4.us:
+  %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+  %add.us = add nuw nsw i64 %j.023.us, %mul.us
+  %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+  %0 = load i8, ptr %arrayidx.us, align 1
+  %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+  %1 = load i8, ptr %arrayidx7.us, align 1
+  %add9.us = add i8 %1, %0
+  store i8 %add9.us, ptr %arrayidx7.us, align 1
+  %inc.us = add nuw nsw i64 %j.023.us, 1
+  %exitcond.not = icmp eq i64 %inc.us, %n
+  br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+  %inc12.us = add nuw nsw i64 %i.024.us, 1
+  %exitcond26.not = icmp eq i64 %inc12.us, 64
+  br i1 %exitcond26.not, label %for.cond.cleanup, label %for.cond1.preheader.us
+
+for.cond.cleanup:
+  ret void
+}
+
+
+define void @outer_pgo_3(ptr nocapture noundef %a, ptr nocapture noundef readonly %b, i64 noundef %m, i64 noundef %n) vscale_range(1,16) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_pgo_3'
+; CHECK:      Calculating cost of runtime checks:
+; CHECK:      Total cost of runtime checks: 2
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:32
+entry:
+  %cmp222 = icmp sgt i64 %n, 0
+  br i1 %cmp222, label %for.cond1.preheader.us, label %for.cond.cleanup
+
+for.cond1.preheader.us:
+  %i.024.us = phi i64 [ %inc12.us, %for.cond1.for.cond.cleanup3_crit_edge.us ], [ 0, %entry ]
+  %mul.us = mul nsw i64 %i.024.us, %n
+  br label %for.body4.us
+
+for.body4.us:
+  %j.023.us = phi i64 [ 0, %for.cond1.preheader.us ], [ %inc.us, %for.body4.us ]
+  %add.us = add nuw nsw i64 %j.023.us, %mul.us
+  %arrayidx.us = getelementptr inbounds i8, ptr %b, i64 %add.us
+  %0 = load i8, ptr %arrayidx.us, align 1
+  %arrayidx7.us = getelementptr inbounds i8, ptr %a, i64 %add.us
+  %1 = load i8, ptr %arrayidx7.us, align 1
+  %add9.us = add i8 %1, %0
+  store i8 %add9.us, ptr %arrayidx7.us, align 1
+  %inc.us = add nuw nsw i64 %j.023.us, 1
+  %exitcond.not = icmp eq i64 %inc.us, %n
+  br i1 %exitcond.not, label %for.cond1.for.cond.cleanup3_crit_edge.us, label %for.body4.us
+
+for.cond1.for.cond.cleanup3_crit_edge.us:
+  %inc12.us = add nuw nsw i64 %i.024.us, 1
+  %exitcond26.not = icmp eq i64 %inc12.us, %m
+  br i1 %exitcond26.not, label %for.cond.cleanup, label %for.cond1.preheader.us, !prof !0
+
+for.cond.cleanup:
+  ret void
+}
+
+
+define void @outer_known_tc3_full_range_checks(ptr nocapture noundef %dst, ptr nocapture noundef readonly %src, i64 noundef %n) {
+; CHECK-LABEL: LV: Checking a loop in 'outer_known_tc3_full_range_checks'
+; CHECK:      Calculating cost of runtime checks:
+; CHECK:      Total cost of runtime checks: 2
+; CHECK-NEXT: LV: Minimum required TC for runtime checks to be profitable:8
+entry:
+  br label %outer.loop
+
+outer.loop:
+  %outer.iv = phi i64 [ 0, %entry ], [ %outer.iv.next, %inner.exit ]
+  %0 = mul nsw i64 %outer.iv, %n
+  br label %inner.loop
+
+inner.loop:
+  %iv.inner = phi i64 [ 0, %outer.loop ], [ %iv.inner.next, %inner.loop ]
+  %1 = add nuw nsw i64 %iv.inner, %0
+  %arrayidx.us = getelementptr inbounds i32, ptr %src, i64 %1
+  %2 = load i32, ptr %arrayidx.us, align 4
+  %arrayidx8.us = getelementptr inbounds i32, ptr %dst, i64 %1
+  %3 = load i32, ptr %arrayidx8.us, align 4
+  %add9.us = add nsw i32 %3, %2
+  store i32 %add9.us, ptr %arrayidx8.us, align 4
+  %iv.inner.next = add nuw nsw i64 %iv.inner, 1
+  %inner.exit.cond = icmp eq i64 %iv.inner.next, %n
+  br i1 %inner.exit.cond, label %inner.exit, label %inner.loop
+
+inner.exit:
+  %outer.iv.next = add nuw nsw i64 %outer.iv, 1
+  %outer.exit.cond = icmp eq i64 %outer.iv.next, 3
+  br i1 %outer.exit.cond, label %outer.exit, label %outer.loop
+
+outer.exit:
+  ret void
+}
+
+
+!0 = !{!"branch_weights", i32 10, i32 20}
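
As background for the outer_pgo_3 test above (a reading of the test, not part of the diff): the !prof branch weights on the outer-loop latch say the exit edge is taken with weight 10 and the backedge with weight 20, so the estimated outer trip count is roughly

$$\text{estimated outer TC} \approx \frac{w_{\text{exit}} + w_{\text{backedge}}}{w_{\text{exit}}} = \frac{10 + 20}{10} = 3,$$

which is why its expected runtime-check cost (2) matches the outer_known_tc3 case.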

@david-arm david-arm changed the title from "Low trip memcheck" to "[LoopVectorize] Refine runtime memory check costs when there is an outer loop" on Dec 20, 2023
Collaborator

@davemgreen davemgreen left a comment


This seems OK from what I can tell. LGTM

@@ -2070,7 +2070,7 @@ class GeneratedRTChecks {
}
}

InstructionCost getCost() {
InstructionCost getCost(Loop *OuterLoop) {
Contributor


It might be better to store the inner or outer loop directly in the struct; the inner loop is passed in to Create and there are 2 places that check for an outer loop already (via if (auto *PL = LI->getLoopFor(LoopVectorPreHeader)))
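
For illustration, a minimal sketch of what caching the outer loop in the struct might look like, using hypothetical, simplified stand-in types rather than the real LLVM classes (only the getLoopFor call mirrors an actual LLVM API; the class and member names are placeholders):

```cpp
// Hypothetical stand-ins for the real LLVM types; this only shows the shape
// of the suggested refactor, not the actual GeneratedRTChecks class.
struct BasicBlock {};
struct Loop {};
struct LoopInfo {
  // In LLVM, LoopInfo::getLoopFor returns the innermost loop containing the
  // block, or nullptr if the block is not inside any loop.
  Loop *getLoopFor(const BasicBlock *) const { return nullptr; }
};

class RTChecksSketch {
  LoopInfo &LI;
  BasicBlock *LoopVectorPreHeader = nullptr;
  // Cached once at creation time and then reused by getCost() and the other
  // sites that previously re-derived it via LI.getLoopFor(LoopVectorPreHeader).
  Loop *OuterLoop = nullptr;

public:
  explicit RTChecksSketch(LoopInfo &LI) : LI(LI) {}

  void create(BasicBlock *Preheader) {
    LoopVectorPreHeader = Preheader;
    OuterLoop = LI.getLoopFor(LoopVectorPreHeader);
  }

  bool hasOuterLoop() const { return OuterLoop != nullptr; }
};

int main() {
  LoopInfo LI;
  BasicBlock Preheader;
  RTChecksSketch Checks(LI);
  Checks.create(&Preheader);
  return Checks.hasOuterLoop() ? 0 : 1;
}
```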

Contributor


Could you update the places that look up the outer loop to use the new variable as well?

@david-arm
Contributor Author

Gentle ping!

Contributor

@Rin18 Rin18 left a comment


One small comment, but otherwise LGTM! I'll leave it to someone else more familiar with the code to approve the change.

…ter loop
@david-arm
Contributor Author

Trying a rebase to see if that fixes the failing RISCV test that appears unrelated to my change

Contributor

@fhahn fhahn left a comment


Thanks for the updates!

> Trying a rebase to see if that fixes the failing RISCV test that appears unrelated to my change

Alternatively you could use the Update Branch button to merge in the changes from current main, which I think is preferred to rebases, as the GH UI may lose comments after rebases.

@david-arm
Contributor Author

> Thanks for the updates!
>
> > Trying a rebase to see if that fixes the failing RISCV test that appears unrelated to my change
>
> Alternatively you could use the Update Branch button to merge in the changes from current main, which I think is preferred to rebases, as the GH UI may lose comments after rebases.

Yeah I used to use the "Update Branch" button, but I find it makes life very difficult for reviewers (and authors) and I'd like to stop using it in future. The problem is that in Phabricator we could do stacked patches quite easily, i.e. NFC test patch followed by code change. The purpose of this is to allow people to see the impact of the code changes. In this new world we're supposed to always add a new commit every time we address comments, which completely defeats the whole purpose of splitting the test patch up from the code change. Quite often reviewers ask authors to rewrite tests or add new ones, which means that when you add a new commit you can no longer see the impact of your code change. It feels like we should either:

  1. Stop splitting commits up into NFC test + code change patch, because it's not that useful anymore. Or
  2. Force push the same two commits every time similar to how we did things on Phabricator. That way when we do change the tests it's still possible for the reviewer to see what effect the code change has.

To be honest either approach is unsatisfactory, but I feel option 2) gives the reviewer a more fighting chance of seeing cause and effect!

@fhahn
Contributor

fhahn commented Jan 19, 2024

> To be honest either approach is unsatisfactory, but I feel option 2) gives the reviewer a more fighting chance of seeing cause and effect!

IMO including a separate commit for precommitting tests in a feature PR isn't ideal, as the default view in GH shows the diff for all changes compared to the main branch and when merging the PR all commits get squashed into one.

It's now possible to get something close to Phabricator's stacked reviews discussed here https://discourse.llvm.org/t/update-on-github-pull-requests/71540/146.

@david-arm
Contributor Author

> > To be honest either approach is unsatisfactory, but I feel option 2) gives the reviewer a more fighting chance of seeing cause and effect!
>
> IMO including a separate commit for precommitting tests in a feature PR isn't ideal, as the default view in GH shows the diff for all changes compared to the main branch and when merging the PR all commits get squashed into one.
>
> It's now possible to get something close to Phabricator's stacked reviews discussed here https://discourse.llvm.org/t/update-on-github-pull-requests/71540/146.

That's true, but to be honest the idea of creating loads of user branches on main fills me with dread. :) I've not tried it myself so I don't know how much extra overhead is involved. If you have examples where you have done this successfully/efficiently I'd be genuinely curious to see how it's done?

@david-arm
Contributor Author

> Could you update the places that look up the outer loop to use the new variable as well?

Hi @fhahn, I couldn't find any other instances in GeneratedRTChecks where we call getParentLoop(). I'm not sure where else I could use OuterLoop?

@fhahn
Contributor

fhahn commented Jan 23, 2024

> > Could you update the places that look up the outer loop to use the new variable as well?
>
> Hi @fhahn, I couldn't find any other instances in GeneratedRTChecks where we call getParentLoop(). I'm not sure where else I could use OuterLoop?

There are 2 places that do something like if (auto *PL = LI->getLoopFor(LoopVectorPreHeader)..., which effectively checks if there's an outer loop

@david-arm
Contributor Author

Hi @fhahn, ok thanks. It wasn't obvious to me that those cases were checking for outer loops, so thanks for pointing them out! I've changed them now.

Contributor

@fhahn fhahn left a comment


LGTM, thanks!

…ter loop
@david-arm david-arm merged commit 962fbaf into llvm:main Jan 26, 2024
3 of 4 checks passed
llvmbot pushed a commit to llvmbot/llvm-project that referenced this pull request Jan 29, 2024
…ter loop (llvm#76034)

(cherry picked from commit 962fbaf)
tstellar pushed a commit that referenced this pull request Jan 29, 2024
…ter loop (#76034)

(cherry picked from commit 962fbaf)
tstellar pushed a commit to tstellar/llvm-project that referenced this pull request Feb 14, 2024
…ter loop (llvm#76034)

(cherry picked from commit 962fbaf)
tstellar pushed a commit to tstellar/llvm-project that referenced this pull request Feb 14, 2024
…ter loop (llvm#76034)

(cherry picked from commit 962fbaf)
tstellar pushed a commit to tstellar/llvm-project that referenced this pull request Feb 14, 2024
…ter loop (llvm#76034)

(cherry picked from commit 962fbaf)
tstellar pushed a commit to tstellar/llvm-project that referenced this pull request Feb 14, 2024
…ter loop (llvm#76034)

(cherry picked from commit 962fbaf)
@david-arm david-arm deleted the low_trip_memcheck branch May 1, 2024 11:55
@pointhex pointhex mentioned this pull request May 7, 2024