Add metrics to track local and global memory arbitrations separately (facebookincubator#9224)

Summary:
We trigger memory arbitration for two different reasons: (1) a query exceeds its own memory limit; (2) the memory arbitrator doesn't have enough free capacity to satisfy a query's memory arbitration request. The latter indicates that we are over-provisioning the worker memory, or that the queries all happen to be running at their peak at the same time. We don't expect memory arbitration to handle sustained high memory usage; it should help with transient peaks in memory usage. Otherwise, the performance of the whole worker will be severely degraded. Case (1) can run in parallel and shouldn't affect other running queries or block their memory arbitration if the system has free capacity. We might consider a follow-up optimization for case (1). For now, this PR adds metrics to monitor the two arbitration events separately.

Pull Request resolved: facebookincubator#9224

Reviewed By: bikramSingh91, oerling

Differential Revision: D55261366

Pulled By: xiaoxmeng

fbshipit-source-id: 3258b6cef04c7afde4cce0c0d5cdaa19bbc919e8
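
The two triggers described in the summary can be illustrated with a small standalone sketch. This is not the Velox implementation: `CounterSink`, `GrowRequest`, `classify`, and `recordArbitration` are hypothetical names made up for illustration, and the classification rule is a simplification that assumes a request is "local" when it would push the query past its own limit and "global" when the arbitrator lacks free capacity for it. Only the two metric name strings are taken from this commit.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Metric names as added in this commit (Counters.h).
constexpr const char* kLocalMetric = "velox.arbitrator_local_arbitration_count";
constexpr const char* kGlobalMetric =
    "velox.arbitrator_global_arbitration_count";

// Hypothetical in-memory counter sink; a stand-in for a real metrics
// backend, not Velox's RECORD_METRIC_VALUE macro.
struct CounterSink {
  std::unordered_map<std::string, uint64_t> counts;

  void increment(const std::string& name) {
    ++counts[name];
  }

  uint64_t get(const std::string& name) const {
    auto it = counts.find(name);
    return it == counts.end() ? 0 : it->second;
  }
};

enum class ArbitrationKind { kNone, kLocal, kGlobal };

// Hypothetical, simplified view of a query's request to grow its memory.
struct GrowRequest {
  uint64_t queryUsedBytes; // memory the query currently holds
  uint64_t queryLimitBytes; // per-query memory capacity limit
  uint64_t requestBytes; // additional memory being requested
};

// Assumed rule: case (1) applies when the request exceeds the query's own
// headroom; case (2) applies when the arbitrator has too little free capacity.
ArbitrationKind classify(const GrowRequest& r, uint64_t arbitratorFreeBytes) {
  const uint64_t headroom = r.queryLimitBytes > r.queryUsedBytes
      ? r.queryLimitBytes - r.queryUsedBytes
      : 0;
  if (r.requestBytes > headroom) {
    return ArbitrationKind::kLocal;
  }
  if (r.requestBytes > arbitratorFreeBytes) {
    return ArbitrationKind::kGlobal;
  }
  return ArbitrationKind::kNone;
}

// Classifies the request and bumps the matching counter, so the two kinds of
// arbitration events can be monitored separately.
ArbitrationKind recordArbitration(
    const GrowRequest& r,
    uint64_t arbitratorFreeBytes,
    CounterSink& sink) {
  const ArbitrationKind kind = classify(r, arbitratorFreeBytes);
  if (kind == ArbitrationKind::kLocal) {
    sink.increment(kLocalMetric);
  } else if (kind == ArbitrationKind::kGlobal) {
    sink.increment(kGlobalMetric);
  }
  return kind;
}
```

In the actual change, the counters are recorded at the reclaim sites in `SharedArbitrator.cpp` rather than through an up-front classification like this one; the sketch only shows why keeping two separate counters makes the two causes distinguishable on a dashboard.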
Joe-Abraham · Mar 23, 2024 · 458339f
1 parent dcc3c88
commit 458339f

Showing 4 changed files with 37 additions and 0 deletions.
diff --git a/velox/common/base/Counters.cpp b/velox/common/base/Counters.cpp
@@ -86,6 +86,23 @@ void registerVeloxMetrics() {
   DEFINE_METRIC(
       kMetricArbitratorRequestsCount, facebook::velox::StatType::COUNT);
 
+  // The number of arbitrations that reclaim used memory from the query
+  // which initiated the memory arbitration request itself. This ensures the
+  // memory arbitration request won't exceed the per-query memory capacity
+  // limit.
+  DEFINE_METRIC(
+      kMetricArbitratorLocalArbitrationCount, facebook::velox::StatType::COUNT);
+
+  // The number of arbitrations which ensure the total allocated query capacity
+  // won't exceed the arbitrator capacity limit. It may or may not reclaim
+  // memory from the query which initiated the memory arbitration request. This
+  // indicates the Velox runtime doesn't have enough memory to run all the
+  // queries at their peak memory usage. We have to trigger spilling to let them
+  // run to completion.
+  DEFINE_METRIC(
+      kMetricArbitratorGlobalArbitrationCount,
+      facebook::velox::StatType::COUNT);
+
   // The number of times a query level memory pool is aborted as a result of a
   // memory arbitration process. The memory pool aborted will eventually result
   // in a cancelling the original query.

diff --git a/velox/common/base/Counters.h b/velox/common/base/Counters.h
@@ -64,6 +64,12 @@ constexpr folly::StringPiece kMetricMemoryPoolReservationLeakBytes{
 constexpr folly::StringPiece kMetricArbitratorRequestsCount{
     "velox.arbitrator_requests_count"};
 
+constexpr folly::StringPiece kMetricArbitratorLocalArbitrationCount{
+    "velox.arbitrator_local_arbitration_count"};
+
+constexpr folly::StringPiece kMetricArbitratorGlobalArbitrationCount{
+    "velox.arbitrator_global_arbitration_count"};
+
 constexpr folly::StringPiece kMetricArbitratorAbortedCount{
     "velox.arbitrator_aborted_count"};
 

diff --git a/velox/common/memory/SharedArbitrator.cpp b/velox/common/memory/SharedArbitrator.cpp
@@ -432,6 +432,7 @@ bool SharedArbitrator::arbitrateMemory(
   }
 
   VELOX_CHECK_LT(freedBytes, growTarget);
+  RECORD_METRIC_VALUE(kMetricArbitratorGlobalArbitrationCount);
   freedBytes += reclaimUsedMemoryFromCandidatesBySpill(
       requestor, candidates, growTarget - freedBytes);
   if (requestor->aborted()) {
@@ -547,6 +548,7 @@ uint64_t SharedArbitrator::reclaim(
     try {
       freedBytes = pool->shrink(targetBytes);
       if (freedBytes < targetBytes) {
+        RECORD_METRIC_VALUE(kMetricArbitratorLocalArbitrationCount);
         pool->reclaim(
             targetBytes - freedBytes, memoryReclaimWaitMs_, reclaimerStats);
       }

diff --git a/velox/docs/monitoring/metrics.rst b/velox/docs/monitoring/metrics.rst
@@ -117,6 +117,18 @@ Memory Management
      - Count
      - The number of times a memory arbitration request was initiated by a
        memory pool attempting to grow its capacity.
+   * - arbitrator_local_arbitration_count
+     - Count
+     - The number of arbitrations that reclaim used memory from the query which initiated
+       the memory arbitration request itself. This ensures the memory arbitration request won't
+       exceed the per-query memory capacity limit.
+   * - arbitrator_global_arbitration_count
+     - Count
+     - The number of arbitrations which ensure the total allocated query capacity won't exceed
+       the arbitrator capacity limit. It may or may not reclaim memory from the query which
+       initiated the memory arbitration request. This indicates the Velox runtime doesn't have
+       enough memory to run all the queries at their peak memory usage. We have to trigger
+       spilling to let them run to completion.
    * - arbitrator_aborted_count
      - Count
      - The number of times a query level memory pool is aborted as a result of