
[MemProf] Support cloning for indirect calls with ThinLTO #110625

Merged
merged 7 commits on Oct 11, 2024

Conversation

teresajohnson (Contributor)

This patch enables support for cloning in indirect callsites.

This is done by synthesizing callsite records for each virtual call
target from the profile metadata. In the thin link all the synthesized
records for a particular indirect callsite initially share the same
context node, but support is added to partition the callsites and
outgoing edges based on the callee function, creating a separate node
for each target.

In the LTO backend, when cloning is needed we first perform indirect
call promotion, then change the target of the new direct call to the
desired clone.

Note this is ThinLTO-specific, since for regular LTO indirect call
promotion should have already occurred.
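
To make the backend step concrete, here is a minimal C++ sketch of the promote-then-redirect flow, under stated assumptions: it is not the patch's exact code (the patch drives this from per-target clone decisions in the summary and goes through the ICP utilities such as pgo::promoteIndirectCall), and the function and variable names are illustrative. It uses the existing isLegalToPromote and promoteCallWithIfThenElse utilities from llvm/Transforms/Utils/CallPromotionUtils.h.

#include "llvm/IR/Function.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/Transforms/Utils/CallPromotionUtils.h"

using namespace llvm;

// Hypothetical sketch: promote one indirect call to a direct call to
// TargetFunction, guarded by a runtime callee comparison, then retarget
// the new direct call at the memprof clone chosen for this context.
static void promoteAndRedirect(CallBase &CB, Function *TargetFunction,
                               Function *CloneOfTarget) {
  const char *Reason = nullptr;
  if (!isLegalToPromote(CB, TargetFunction, &Reason))
    return; // e.g. signature mismatch; keep the indirect call as-is.
  // Emits: if (callee == TargetFunction) direct call; else indirect call.
  CallBase &DirectCall = promoteCallWithIfThenElse(CB, TargetFunction);
  // Apply the cloning decision: the clone has the same signature as the
  // original target, so simply retarget the promoted direct call.
  DirectCall.setCalledFunction(CloneOfTarget);
}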

@llvmbot llvmbot added LTO Link time optimization (regular/full LTO or ThinLTO) llvm:ir llvm:analysis llvm:transforms labels Oct 1, 2024
@llvmbot (Member) commented Oct 1, 2024

@llvm/pr-subscribers-lto
@llvm/pr-subscribers-llvm-ir

@llvm/pr-subscribers-llvm-transforms

Author: Teresa Johnson (teresajohnson)

Patch is 50.42 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/110625.diff

7 Files Affected:

  • (modified) llvm/include/llvm/IR/ModuleSummaryIndex.h (+5)
  • (modified) llvm/include/llvm/Transforms/IPO/MemProfContextDisambiguation.h (+6-1)
  • (modified) llvm/lib/Analysis/ModuleSummaryAnalysis.cpp (+36-23)
  • (modified) llvm/lib/IR/AsmWriter.cpp (+2-2)
  • (modified) llvm/lib/Passes/PassBuilderPipelines.cpp (+5-2)
  • (modified) llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp (+479-18)
  • (added) llvm/test/ThinLTO/X86/memprof-icp.ll (+309)
diff --git a/llvm/include/llvm/IR/ModuleSummaryIndex.h b/llvm/include/llvm/IR/ModuleSummaryIndex.h
index 288aba16702f56..1cfe7c15f97dbc 100644
--- a/llvm/include/llvm/IR/ModuleSummaryIndex.h
+++ b/llvm/include/llvm/IR/ModuleSummaryIndex.h
@@ -200,7 +200,12 @@ struct ValueInfo {
     return getRef()->second.SummaryList;
   }
 
+  // Even if the index is built with GVs available, we may not have one for
+  // summary entries synthesized for profiled indirect call targets.
+  bool hasName() const { return !haveGVs() || getValue(); }
+
   StringRef name() const {
+    assert(!haveGVs() || getRef()->second.U.GV);
     return haveGVs() ? getRef()->second.U.GV->getName()
                      : getRef()->second.U.Name;
   }
diff --git a/llvm/include/llvm/Transforms/IPO/MemProfContextDisambiguation.h b/llvm/include/llvm/Transforms/IPO/MemProfContextDisambiguation.h
index f5dce01c34e7aa..fa067ad62ed1b3 100644
--- a/llvm/include/llvm/Transforms/IPO/MemProfContextDisambiguation.h
+++ b/llvm/include/llvm/Transforms/IPO/MemProfContextDisambiguation.h
@@ -43,8 +43,13 @@ class MemProfContextDisambiguation
   // ThinLTO backend via opt (to simulate distributed ThinLTO).
   std::unique_ptr<ModuleSummaryIndex> ImportSummaryForTesting;
 
+  // Whether we are building with SamplePGO. This is needed for correctly
+  // updating profile metadata on speculatively promoted calls.
+  bool SamplePGO;
+
 public:
-  MemProfContextDisambiguation(const ModuleSummaryIndex *Summary = nullptr);
+  MemProfContextDisambiguation(const ModuleSummaryIndex *Summary = nullptr,
+                               bool SamplePGO = false);
 
   PreservedAnalyses run(Module &M, ModuleAnalysisManager &AM);
 
diff --git a/llvm/lib/Analysis/ModuleSummaryAnalysis.cpp b/llvm/lib/Analysis/ModuleSummaryAnalysis.cpp
index 2d4961dade9db1..1bd9ee651d2b0b 100644
--- a/llvm/lib/Analysis/ModuleSummaryAnalysis.cpp
+++ b/llvm/lib/Analysis/ModuleSummaryAnalysis.cpp
@@ -81,6 +81,11 @@ static cl::opt<std::string> ModuleSummaryDotFile(
     "module-summary-dot-file", cl::Hidden, cl::value_desc("filename"),
     cl::desc("File to emit dot graph of new summary into"));
 
+static cl::opt<bool> EnableMemProfIndirectCallSupport(
+    "enable-memprof-indirect-call-support", cl::init(true), cl::Hidden,
+    cl::desc(
+        "Enable MemProf support for summarizing and cloning indirect calls"));
+
 extern cl::opt<bool> ScalePartialSampleProfileWorkingSetSize;
 
 extern cl::opt<unsigned> MaxNumVTableAnnotations;
@@ -404,6 +409,11 @@ static void computeFunctionSummary(
       if (HasLocalsInUsedOrAsm && CI && CI->isInlineAsm())
         HasInlineAsmMaybeReferencingInternal = true;
 
+      // Compute this once per indirect call.
+      uint32_t NumCandidates = 0;
+      uint64_t TotalCount = 0;
+      MutableArrayRef<InstrProfValueData> CandidateProfileData;
+
       auto *CalledValue = CB->getCalledOperand();
       auto *CalledFunction = CB->getCalledFunction();
       if (CalledValue && !CalledFunction) {
@@ -481,9 +491,7 @@ static void computeFunctionSummary(
           }
         }
 
-        uint32_t NumCandidates;
-        uint64_t TotalCount;
-        auto CandidateProfileData =
+        CandidateProfileData =
             ICallAnalysis.getPromotionCandidatesForInstruction(&I, TotalCount,
                                                                NumCandidates);
         for (const auto &Candidate : CandidateProfileData)
@@ -495,16 +503,6 @@ static void computeFunctionSummary(
       if (!IsThinLTO)
         continue;
 
-      // TODO: Skip indirect calls for now. Need to handle these better, likely
-      // by creating multiple Callsites, one per target, then speculatively
-      // devirtualize while applying clone info in the ThinLTO backends. This
-      // will also be important because we will have a different set of clone
-      // versions per target. This handling needs to match that in the ThinLTO
-      // backend so we handle things consistently for matching of callsite
-      // summaries to instructions.
-      if (!CalledFunction)
-        continue;
-
       // Ensure we keep this analysis in sync with the handling in the ThinLTO
       // backend (see MemProfContextDisambiguation::applyImport). Save this call
       // so that we can skip it in checking the reverse case later.
@@ -555,13 +553,24 @@ static void computeFunctionSummary(
         SmallVector<unsigned> StackIdIndices;
         for (auto StackId : InstCallsite)
           StackIdIndices.push_back(Index.addOrGetStackIdIndex(StackId));
-        // Use the original CalledValue, in case it was an alias. We want
-        // to record the call edge to the alias in that case. Eventually
-        // an alias summary will be created to associate the alias and
-        // aliasee.
-        auto CalleeValueInfo =
-            Index.getOrInsertValueInfo(cast<GlobalValue>(CalledValue));
-        Callsites.push_back({CalleeValueInfo, StackIdIndices});
+        if (CalledFunction) {
+          // Use the original CalledValue, in case it was an alias. We want
+          // to record the call edge to the alias in that case. Eventually
+          // an alias summary will be created to associate the alias and
+          // aliasee.
+          auto CalleeValueInfo =
+              Index.getOrInsertValueInfo(cast<GlobalValue>(CalledValue));
+          Callsites.push_back({CalleeValueInfo, StackIdIndices});
+        } else if (EnableMemProfIndirectCallSupport) {
+          // For indirect callsites, create multiple Callsites, one per target.
+          // This enables having a different set of clone versions per target,
+          // and we will apply the cloning decisions while speculatively
+          // devirtualizing in the ThinLTO backends.
+          for (const auto &Candidate : CandidateProfileData) {
+            auto CalleeValueInfo = Index.getOrInsertValueInfo(Candidate.Value);
+            Callsites.push_back({CalleeValueInfo, StackIdIndices});
+          }
+        }
       }
     }
   }
@@ -1214,9 +1223,13 @@ bool llvm::mayHaveMemprofSummary(const CallBase *CB) {
     if (CI && CalledFunction->isIntrinsic())
       return false;
   } else {
-    // TODO: For now skip indirect calls. See comments in
-    // computeFunctionSummary for what is needed to handle this.
-    return false;
+    // Skip inline assembly calls.
+    if (CI && CI->isInlineAsm())
+      return false;
+    // Skip direct calls via Constant.
+    if (!CalledValue || isa<Constant>(CalledValue))
+      return false;
+    return true;
   }
   return true;
 }
diff --git a/llvm/lib/IR/AsmWriter.cpp b/llvm/lib/IR/AsmWriter.cpp
index 70e3af941bf77b..f07cfcaa0124da 100644
--- a/llvm/lib/IR/AsmWriter.cpp
+++ b/llvm/lib/IR/AsmWriter.cpp
@@ -3613,7 +3613,7 @@ void AssemblyWriter::printSummary(const GlobalValueSummary &Summary) {
 
 void AssemblyWriter::printSummaryInfo(unsigned Slot, const ValueInfo &VI) {
   Out << "^" << Slot << " = gv: (";
-  if (!VI.name().empty())
+  if (VI.hasName() && !VI.name().empty())
     Out << "name: \"" << VI.name() << "\"";
   else
     Out << "guid: " << VI.getGUID();
@@ -3627,7 +3627,7 @@ void AssemblyWriter::printSummaryInfo(unsigned Slot, const ValueInfo &VI) {
     Out << ")";
   }
   Out << ")";
-  if (!VI.name().empty())
+  if (VI.hasName() && !VI.name().empty())
     Out << " ; guid = " << VI.getGUID();
   Out << "\n";
 }
diff --git a/llvm/lib/Passes/PassBuilderPipelines.cpp b/llvm/lib/Passes/PassBuilderPipelines.cpp
index 8f151a99b11709..916967f8ee02b9 100644
--- a/llvm/lib/Passes/PassBuilderPipelines.cpp
+++ b/llvm/lib/Passes/PassBuilderPipelines.cpp
@@ -1714,7 +1714,8 @@ ModulePassManager PassBuilder::buildThinLTODefaultPipeline(
     // For ThinLTO we must apply the context disambiguation decisions early, to
     // ensure we can correctly match the callsites to summary data.
     if (EnableMemProfContextDisambiguation)
-      MPM.addPass(MemProfContextDisambiguation(ImportSummary));
+      MPM.addPass(MemProfContextDisambiguation(
+          ImportSummary, PGOOpt && PGOOpt->Action == PGOOptions::SampleUse));
 
     // These passes import type identifier resolutions for whole-program
     // devirtualization and CFI. They must run early because other passes may
@@ -1927,7 +1928,9 @@ PassBuilder::buildLTODefaultPipeline(OptimizationLevel Level,
   // amount of additional cloning required to distinguish the allocation
   // contexts.
   if (EnableMemProfContextDisambiguation)
-    MPM.addPass(MemProfContextDisambiguation());
+    MPM.addPass(MemProfContextDisambiguation(
+        /*Summary=*/nullptr,
+        PGOOpt && PGOOpt->Action == PGOOptions::SampleUse));
 
   // Optimize globals again after we ran the inliner.
   MPM.addPass(GlobalOptPass());
diff --git a/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp b/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
index 5e7d2c3c713d32..84e177355003d4 100644
--- a/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
+++ b/llvm/lib/Transforms/IPO/MemProfContextDisambiguation.cpp
@@ -29,6 +29,7 @@
 #include "llvm/ADT/SmallSet.h"
 #include "llvm/ADT/SmallVector.h"
 #include "llvm/ADT/Statistic.h"
+#include "llvm/Analysis/IndirectCallPromotionAnalysis.h"
 #include "llvm/Analysis/MemoryProfileInfo.h"
 #include "llvm/Analysis/ModuleSummaryAnalysis.h"
 #include "llvm/Analysis/OptimizationRemarkEmitter.h"
@@ -43,6 +44,8 @@
 #include "llvm/Support/GraphWriter.h"
 #include "llvm/Support/raw_ostream.h"
 #include "llvm/Transforms/IPO.h"
+#include "llvm/Transforms/Utils/Instrumentation.h"
+#include "llvm/Transforms/Utils/CallPromotionUtils.h"
 #include "llvm/Transforms/Utils/Cloning.h"
 #include <deque>
 #include <sstream>
@@ -449,9 +452,10 @@ class CallsiteContextGraph {
     }
   };
 
-  /// Helpers to remove callee edges that have allocation type None (due to not
+  /// Helpers to remove edges that have allocation type None (due to not
   /// carrying any context ids) after transformations.
   void removeNoneTypeCalleeEdges(ContextNode *Node);
+  void removeNoneTypeCallerEdges(ContextNode *Node);
   void
   recursivelyRemoveNoneTypeCalleeEdges(ContextNode *Node,
                                        DenseSet<const ContextNode *> &Visited);
@@ -483,6 +487,14 @@ class CallsiteContextGraph {
   /// multiple different callee target functions.
   void handleCallsitesWithMultipleTargets();
 
+  // Try to partition calls on the given node (already placed into the AllCalls
+  // array) by callee function, creating new copies of Node as needed to hold
+  // calls with different callees, and moving the callee edges appropriately.
+  // Returns true if partitioning was successful.
+  bool partitionCallsByCallee(
+      ContextNode *Node, ArrayRef<CallInfo> AllCalls,
+      std::vector<std::pair<CallInfo, ContextNode *>> &NewCallToNode);
+
   /// Save lists of calls with MemProf metadata in each function, for faster
   /// iteration.
   MapVector<FuncTy *, std::vector<CallInfo>> FuncToCallsWithMetadata;
@@ -563,6 +575,12 @@ class CallsiteContextGraph {
   calleesMatch(CallTy Call, EdgeIter &EI,
                MapVector<CallInfo, ContextNode *> &TailCallToContextNodeMap);
 
+  // Return the callee function of the given call, or nullptr if it can't be
+  // determined
+  const FuncTy *getCalleeFunc(CallTy Call) {
+    return static_cast<DerivedCCG *>(this)->getCalleeFunc(Call);
+  }
+
   /// Returns true if the given call targets the given function, or if we were
   /// able to identify the call chain through intermediate tail calls (in which
   /// case FoundCalleeChain will be populated).
@@ -670,6 +688,14 @@ class CallsiteContextGraph {
                                      bool NewClone = false,
                                      DenseSet<uint32_t> ContextIdsToMove = {});
 
+  /// Change the caller of the edge at the given callee edge iterator to be
+  /// NewCaller, performing the necessary context id and allocation type
+  /// updates. The iterator is updated as the edge is removed from the list of
+  /// callee edges in the original caller. This is similar to the above
+  /// moveEdgeToExistingCalleeClone, but a simplified version of it as we always
+  /// move the given edge and all of its context ids.
+  void moveCalleeEdgeToNewCaller(EdgeIter &CalleeEdgeI, ContextNode *NewCaller);
+
   /// Recursively perform cloning on the graph for the given Node and its
   /// callers, in order to uniquely identify the allocation behavior of an
   /// allocation given its context. The context ids of the allocation being
@@ -731,6 +757,7 @@ class ModuleCallsiteContextGraph
                               Instruction *>;
 
   uint64_t getStackId(uint64_t IdOrIndex) const;
+  const Function *getCalleeFunc(Instruction *Call);
   bool calleeMatchesFunc(
       Instruction *Call, const Function *Func, const Function *CallerFunc,
       std::vector<std::pair<Instruction *, Function *>> &FoundCalleeChain);
@@ -808,6 +835,7 @@ class IndexCallsiteContextGraph
                               IndexCall>;
 
   uint64_t getStackId(uint64_t IdOrIndex) const;
+  const FunctionSummary *getCalleeFunc(IndexCall &Call);
   bool calleeMatchesFunc(
       IndexCall &Call, const FunctionSummary *Func,
       const FunctionSummary *CallerFunc,
@@ -1003,6 +1031,20 @@ void CallsiteContextGraph<
   }
 }
 
+template <typename DerivedCCG, typename FuncTy, typename CallTy>
+void CallsiteContextGraph<
+    DerivedCCG, FuncTy, CallTy>::removeNoneTypeCallerEdges(ContextNode *Node) {
+  for (auto EI = Node->CallerEdges.begin(); EI != Node->CallerEdges.end();) {
+    auto Edge = *EI;
+    if (Edge->AllocTypes == (uint8_t)AllocationType::None) {
+      assert(Edge->ContextIds.empty());
+      Edge->Caller->eraseCalleeEdge(Edge.get());
+      EI = Node->CallerEdges.erase(EI);
+    } else
+      ++EI;
+  }
+}
+
 template <typename DerivedCCG, typename FuncTy, typename CallTy>
 typename CallsiteContextGraph<DerivedCCG, FuncTy, CallTy>::ContextEdge *
 CallsiteContextGraph<DerivedCCG, FuncTy, CallTy>::ContextNode::
@@ -2025,6 +2067,21 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
     AllCalls.push_back(Node->Call);
     AllCalls.insert(AllCalls.end(), Node->MatchingCalls.begin(),
                     Node->MatchingCalls.end());
+
+    // First see if we can partition the calls by callee function, creating new
+    // nodes to host each set of calls calling the same callees. This is
+    // necessary for support indirect calls with ThinLTO, for which we
+    // synthesized CallsiteInfo records for each target. They will all have the
+    // same callsite stack ids and would be sharing a context node at this
+    // point. We need to perform separate cloning for each, which will be
+    // applied along with speculative devirtualization in the ThinLTO backends
+    // as needed. Note this does not currently support looking through tail
+    // calls, it is unclear if we need that for indirect call targets.
+    // First partition calls by callee func. Map indexed by func, value is
+    // struct with list of matching calls, assigned node.
+    if (partitionCallsByCallee(Node, AllCalls, NewCallToNode))
+      continue;
+
     auto It = AllCalls.begin();
     // Iterate through the calls until we find the first that matches.
     for (; It != AllCalls.end(); ++It) {
@@ -2098,6 +2155,130 @@ void CallsiteContextGraph<DerivedCCG, FuncTy,
     NonAllocationCallToContextNodeMap[Call] = Node;
 }
 
+template <typename DerivedCCG, typename FuncTy, typename CallTy>
+bool CallsiteContextGraph<DerivedCCG, FuncTy, CallTy>::partitionCallsByCallee(
+    ContextNode *Node, ArrayRef<CallInfo> AllCalls,
+    std::vector<std::pair<CallInfo, ContextNode *>> &NewCallToNode) {
+  // Struct to keep track of all the calls having the same callee function,
+  // and the node we eventually assign to them. Eventually we will record the
+  // context node assigned to this group of calls.
+  struct CallsWithSameCallee {
+    std::vector<CallInfo> Calls;
+    ContextNode *Node = nullptr;
+  };
+
+  // First partition calls by callee function. Build map from each function
+  // to the list of matching calls.
+  DenseMap<const FuncTy *, CallsWithSameCallee> CalleeFuncToCallInfo;
+  for (auto ThisCall : AllCalls) {
+    auto *F = getCalleeFunc(ThisCall.call());
+    if (F)
+      CalleeFuncToCallInfo[F].Calls.push_back(ThisCall);
+  }
+
+  // Next, walk through all callee edges. For each callee node, get its
+  // containing function and see if it was recorded in the above map (meaning we
+  // have at least one matching call). Build another map from each callee node
+  // with a matching call to the structure instance created above containing all
+  // the calls.
+  DenseMap<ContextNode *, CallsWithSameCallee *> CalleeNodeToCallInfo;
+  for (const auto &Edge : Node->CalleeEdges) {
+    if (!Edge->Callee->hasCall())
+      continue;
+    const FuncTy *ProfiledCalleeFunc = NodeToCallingFunc[Edge->Callee];
+    if (CalleeFuncToCallInfo.contains(ProfiledCalleeFunc))
+      CalleeNodeToCallInfo[Edge->Callee] =
+          &CalleeFuncToCallInfo[ProfiledCalleeFunc];
+  }
+
+  // If there are entries in the second map, then there were no matching
+  // calls/callees, nothing to do here. Return so we can go to the handling that
+  // looks through tail calls.
+  if (CalleeNodeToCallInfo.empty())
+    return false;
+
+  // Walk through all callee edges again. Any and all callee edges that didn't
+  // match any calls (callee not in the CalleeNodeToCallInfo map) are moved to a
+  // new caller node (UnmatchedCalleesNode) which gets a null call so that it is
+  // ignored during cloning. If it is in the map, then we use the node recorded
+  // in that entry (creating it if needed), and move the callee edge to it.
+  // The first callee will use the original node instead of creating a new one.
+  // Note that any of the original calls on this node (in AllCalls) that didn't
+  // have a callee function automatically get dropped from the node as part of
+  // this process.
+  ContextNode *UnmatchedCalleesNode = nullptr;
+  // Track whether we already assigned original node to a callee.
+  bool UsedOrigNode = false;
+  assert(NodeToCallingFunc[Node]);
+  for (auto EI = Node->CalleeEdges.begin(); EI != Node->CalleeEdges.end();) {
+    auto Edge = *EI;
+    if (!Edge->Callee->hasCall()) {
+      ++EI;
+      continue;
+    }
+
+    // Will be updated below to point to whatever (caller) node this callee edge
+    // should be moved to.
+    ContextNode *CallerNodeToUse = nullptr;
+
+    // Handle the case where there were no matching calls first. Move this
+    // callee edge to the UnmatchedCalleesNode, creating it if needed.
+    if (!CalleeNodeToCallInfo.contains(Edge->Callee)) {
+      if (!UnmatchedCalleesNode)
+        UnmatchedCalleesNode =
+            createNewNode(/*IsAllocation=*/false, NodeToCallingFunc[Node]);
+      CallerNodeToUse = UnmatchedCalleesNode;
+    } else {
+      // Look up the information recorded for this callee node, and use the
+      // recorded caller node (creating it if needed).
+      auto *Info = CalleeNodeToCallInfo[Edge->Callee];
+      if (!Info->Node) {
+        // If we haven't assigned any callees to the original node use it.
+        if (!UsedOrigNode) {
+          Info->Node = Node;
+          // Clear the set of matching calls which will be updated below.
+          Node->MatchingCalls.clear();
+          UsedOrigNode = true;
+        } else
+          Info->Node =
+              createNewNode(/*IsAllocation=*/false, NodeToCallingFunc[Node]);
+        assert(!Info->Calls.empty());
+        // The first call becomes the primary call for this caller node, and the
+        // rest go in the matching calls list.
+        Info->Node->setCall(Info->Calls.front());
+        Info->Node->MatchingCalls.insert(Info->Node->MatchingCalls.end(),
+                                         Info->Calls.begin() + 1,
+                                         Info->Calls.end());
+        // Save the primary call to node correspondence so that we can update
+        // the NonAllocationCallToContextNodeMap, which is being iterated in the
+        // caller of this function.
+        NewCallToNode.push_back({Info->Node->Call, Info->Node});
+      }
+      CallerNodeToUse = Info->Node;
+    }
+
+    // Don't need to move edge if we are using the original node;
+    if (CallerNodeToUse == Node) {
+      ++EI;
+      continue;
+    }
+
+    moveCalleeEdgeToNewCaller(EI, CallerNodeToUse);
+  }
+  // Now that ...
[truncated]

@llvmbot (Member) commented Oct 1, 2024

@llvm/pr-subscribers-llvm-analysis



github-actions bot commented Oct 1, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@teresajohnson (Contributor Author)

I had forgotten to test the new internal option that can be used to disable this support; just added that.

uint64_t TotalCount;
unsigned CallsiteInfoStartIndex;
};
std::map<CallBase *, ICallAnalysisData> ICallAnalysisMap;
Contributor

We iterate over this map on L4466 and use pointer keys, so this could lead to non-determinism in the order things are processed. Since it shouldn't matter, let's change it to DenseMap?

Contributor Author

Actually, it's probably best that we do the promotions in a consistent order, so this is a good point. I also realized we don't even need a map, so I have moved the CallBase into the ICallAnalysisData struct and changed this to a vector. Doing that made me realize that we were (harmlessly but unnecessarily) reinserting the same info into the map multiple times in the loop over CandidateProfileData, which would have been an issue for the vector, so I moved the insertion into the new vector outside the loop over the candidate data.
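
For illustration, a sketch of the reworked bookkeeping described in this reply; the field and variable names follow the quoted snippet, but the exact layout in the patch may differ. Moving the CallBase into the struct lets a plain vector replace the pointer-keyed map, so promotions are processed in the order the calls were visited.

#include <vector>
#include "llvm/IR/InstrTypes.h"
#include "llvm/ProfileData/InstrProf.h"

using namespace llvm;

// One record per indirect call that has promotion candidates. Iterating a
// vector visits calls in insertion order, whereas a std::map keyed by
// CallBase* orders entries by pointer value, which is not deterministic
// across runs.
struct ICallAnalysisData {
  CallBase *CB = nullptr; // previously the map key
  std::vector<InstrProfValueData> CandidateProfileData;
  uint32_t NumCandidates = 0;
  uint64_t TotalCount = 0;
  size_t CallsiteInfoStartIndex = 0;
};

// Before: std::map<CallBase *, ICallAnalysisData> ICallAnalysisMap;
std::vector<ICallAnalysisData> ICallAnalysisInfo; // after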

std::vector<InstrProfValueData> CandidateProfileData;
uint32_t NumCandidates;
uint64_t TotalCount;
unsigned CallsiteInfoStartIndex;
Contributor

nit: A little strange to have an unsigned when its peers use explicit types. Should this be size_t though, since it's an index?

Contributor Author

The types of the others come from the ICP helper code, and those seemed like overkill for the index. Good idea to use size_t; done.

// Data structures for saving indirect call profile info for use in ICP with
// cloning.
struct ICallAnalysisData {
std::vector<InstrProfValueData> CandidateProfileData;
Contributor

I guess for InstrProfValueData it doesn't matter whether it's iFDO or SPGO being used along with MemProf?

Contributor Author

Correct, it is built from the VP metadata, which exists for indirect calls in both cases.
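
As a hedged illustration of why the profile source does not matter here: both instrumented FDO and SamplePGO attach the same "VP" value-profile !prof metadata to indirect calls, so a consumer of that metadata is agnostic to how the profile was collected. The walker below is illustrative only; the patch itself goes through the ICP analysis helpers rather than reading the metadata node by hand.

#include "llvm/IR/Constants.h"
#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"

using namespace llvm;

// Illustrative only: visit the (target GUID, count) pairs from an indirect
// call's value-profile metadata. In IR this looks like:
//   call void %fp(), !prof !0
//   !0 = !{!"VP", i32 0, i64 1600, i64 <guid1>, i64 1000, i64 <guid2>, i64 600}
// where i32 0 is the indirect-call-target value kind and i64 1600 the total.
static void walkVPMetadata(const CallBase &CB) {
  MDNode *ProfMD = CB.getMetadata(LLVMContext::MD_prof);
  if (!ProfMD || ProfMD->getNumOperands() < 3)
    return;
  auto *Tag = dyn_cast<MDString>(ProfMD->getOperand(0));
  if (!Tag || Tag->getString() != "VP")
    return;
  // Operands 3, 5, 7, ... are target values; 4, 6, 8, ... are their counts.
  for (unsigned I = 3; I + 1 < ProfMD->getNumOperands(); I += 2) {
    uint64_t Target =
        mdconst::extract<ConstantInt>(ProfMD->getOperand(I))->getZExtValue();
    uint64_t Count =
        mdconst::extract<ConstantInt>(ProfMD->getOperand(I + 1))->getZExtValue();
    (void)Target; (void)Count; // consume, e.g. to build InstrProfValueData
  }
}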

ICallAnalysis.getPromotionCandidatesForInstruction(
CB, TotalCount, NumCandidates);
if (!CandidateProfileData.empty()) {
unsigned CallsiteInfoStartStartIndex =
Contributor

Should this be size_t? Also maybe use std::distance and drop the static_cast?

Contributor Author

done

Comment on lines 4555 to 4560
CallBase *CBClone;
// Copy 0 is the original function.
if (!J)
CBClone = CB;
else
CBClone = cast<CallBase>((*VMaps[J - 1])[CB]);
Contributor

Same simplification as the one suggested above?

Contributor Author

Did it the same way as above.

} else if (CB->isTailCall()) {
CloneCallsite(StackNode, CB, CalledFunction);
}
} else if (CB->isTailCall() && CalledFunction) {
Contributor

Is it necessary to check CalledFunction here? I believe a tail call should always be direct and thus we should always have CalledFunction.

Maybe an assert is more appropriate if you agree.

Contributor Author

Yep, it is necessary because in the IR indirect calls can in fact be marked as tail calls (there are lots of examples in the tests).
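
A small illustrative helper making that point explicit (hypothetical code, not from the patch): in LLVM IR, "tail call void %fp()" through a function pointer is perfectly legal, so isTailCall() does not imply a known callee and both checks are needed.

#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

using namespace llvm;

// getCalledFunction() is null for indirect calls even when the call is
// marked "tail" (e.g. "tail call void %fp()"), so a direct tail call must
// check both properties.
static bool isDirectTailCall(const CallBase &CB) {
  const auto *CI = dyn_cast<CallInst>(&CB);
  return CI && CI->isTailCall() && CB.getCalledFunction() != nullptr;
}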

;; using llvm-reduce with the expected FileCheck input.

;; -- virtfunc.h: --
;; #include <unistd.h>
Contributor

nit: This header is needed in virtfunc_main.cc not here.

Contributor Author

moved

llvm/test/ThinLTO/X86/memprof-icp.ll (resolved)

!0 = !{i32 1, !"ProfileSummary", !1}
!1 = !{!2, !3, !4, !5, !6, !7, !8, !9, !10, !11}
!2 = !{!"ProfileFormat", !"InstrProf"}
Contributor

Is it worth adding a test for SampleProf?

Maybe add a TODO somewhere in the code to add a test once we have a text format, to prevent any changes from breaking our usage.

Contributor Author

They have the same VP metadata for indirect calls, so I don't know how valuable that is. But I added a TODO at the top.

@teresajohnson (Contributor Author) left a comment

I've addressed most of the comments except for the refactoring suggestions, which I will do in a follow-on update, most likely tomorrow.

@@ -43,8 +43,13 @@ class MemProfContextDisambiguation
// ThinLTO backend via opt (to simulate distributed ThinLTO).
std::unique_ptr<ModuleSummaryIndex> ImportSummaryForTesting;

// Whether we are building with SamplePGO. This is needed for correctly
// updating profile metadata on speculatively promoted calls.
bool SamplePGO;
Contributor Author

done


Comment on lines 4467 to 4468
auto *CB = ICallInfo.first;
auto &Info = ICallInfo.second;
Contributor Author

done


@teresajohnson (Contributor Author)

I've finished addressing all the comments.

@snehasish (Contributor) left a comment

lgtm

Comment on lines 4485 to 4486
SmallVectorImpl<std::unique_ptr<ValueToValueMapTy>> &VMaps,
SmallVector<ICallAnalysisData> &ICallAnalysisInfo,
Contributor

Can these be ArrayRefs?

Contributor Author

done
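
A hedged sketch of the agreed change; the parameter types come from the quoted snippet, while the function name applyICPForClones is made up for illustration. ArrayRef is a non-owning, read-only view, so callers can pass a SmallVector, std::vector, or C array without copying, and the signature documents that the callee will not grow or mutate the sequence.

#include <memory>
#include "llvm/ADT/ArrayRef.h"
#include "llvm/Transforms/Utils/ValueMapper.h"

using namespace llvm;

struct ICallAnalysisData; // pass-local struct, see the sketch earlier above

// Before: callers were forced into specific mutable container types.
//   void applyICPForClones(
//       SmallVectorImpl<std::unique_ptr<ValueToValueMapTy>> &VMaps,
//       SmallVector<ICallAnalysisData> &ICallAnalysisInfo);
// After: read-only views over any contiguous storage.
void applyICPForClones(ArrayRef<std::unique_ptr<ValueToValueMapTy>> VMaps,
                       ArrayRef<ICallAnalysisData> ICallAnalysisInfo);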

const char *Reason = nullptr;
if (!isLegalToPromote(*CB, TargetFunction, &Reason)) {
ORE.emit([&]() {
return OptimizationRemarkMissed(DEBUG_TYPE, "UnableToPromote", CB)
Contributor

Should we customize this for memprof like we do for the one on L4566?

Contributor Author

done
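
For illustration, a hypothetical memprof-flavored version of the quoted remark; the remark name MemProfUnableToPromote and the message wording are assumptions, not necessarily what the patch landed with.

// Hypothetical sketch extending the quoted fragment: tag the missed remark
// with a memprof-specific name and include the legality failure reason.
const char *Reason = nullptr;
if (!isLegalToPromote(*CB, TargetFunction, &Reason)) {
  ORE.emit([&]() {
    return OptimizationRemarkMissed(DEBUG_TYPE, "MemProfUnableToPromote", CB)
           << "Memprof cannot promote indirect call: "
           << ore::NV("Reason", Reason);
  });
}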

@teresajohnson teresajohnson merged commit 1de7165 into llvm:main Oct 11, 2024
5 of 7 checks passed
@llvm-ci (Collaborator) commented Oct 11, 2024

LLVM Buildbot has detected a new failure on builder sanitizer-aarch64-linux-fuzzer running on sanitizer-buildbot11 while building llvm at step 2 "annotate".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/159/builds/8032

Here is the relevant piece of the build log for reference:
Step 2 (annotate) failure: 'python ../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py' (failure) (timed out)
...
[5/7] Linking CXX shared library /home/b/sanitizer-aarch64-linux-fuzzer/build/llvm_build0/lib/clang/20/lib/aarch64-unknown-linux-gnu/libclang_rt.hwasan.so
[6/7] Linking CXX shared library /home/b/sanitizer-aarch64-linux-fuzzer/build/llvm_build0/lib/clang/20/lib/aarch64-unknown-linux-gnu/libclang_rt.ubsan_standalone.so
[7/7] Linking CXX shared library /home/b/sanitizer-aarch64-linux-fuzzer/build/llvm_build0/lib/clang/20/lib/aarch64-unknown-linux-gnu/libclang_rt.asan.so
[241/243] No install step for 'runtimes'
[243/243] Completed 'runtimes'
a708aa98043551eaf675db1947467dfb  llvm_build0/bin/clang
@@@BUILD_STEP get fuzzer-test-suite @@@
Already up to date.
@@@BUILD_STEP test libxml2-v2.9.2 fuzzer@@@
Cloning into 'SRC'...
command timed out: 1200 seconds without output running [b'python', b'../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1343.468447
Step 9 (test libxml2-v2.9.2 fuzzer) failure: test libxml2-v2.9.2 fuzzer (failure)
@@@BUILD_STEP test libxml2-v2.9.2 fuzzer@@@
Cloning into 'SRC'...

command timed out: 1200 seconds without output running [b'python', b'../sanitizer_buildbot/sanitizers/zorg/buildbot/builders/sanitizers/buildbot_selector.py'], attempting to kill
process killed by signal 9
program finished with exit code -1
elapsedTime=1343.468447

DanielCChen pushed a commit to DanielCChen/llvm-project that referenced this pull request Oct 16, 2024
bricknerb pushed a commit to bricknerb/llvm-project that referenced this pull request Oct 17, 2024