[Concurrent Segment Search] Explore different metrics/stats which will be useful with concurrent segment search #7359

sohami · 2023-05-02T03:59:43Z

Placeholder task to explore and add metrics that will be useful for the concurrent segment search execution model. These metrics can: i) provide insight into the performance of shard-level requests (min/max/avg latencies across requests at the index/node level), ii) show how many requests used the concurrent search path versus the sequential path, iii) show the concurrency used across requests at the index/node level, etc.

jed326 · 2023-08-23T22:12:29Z

Existing metrics:

https://opensearch.org/docs/latest/api-reference/nodes-apis/nodes-stats/
https://opensearch.org/docs/latest/api-reference/index-apis/index/
- There should be an <index>/_stats API as well, but it looks like we're missing documentation for it.

jed326 · 2023-08-28T21:22:22Z

New Metrics:

| Metric | Description |
| --- | --- |
| search.concurrent_query_total | The total number of query operations using concurrent segment search. |
| search.concurrent_query_time_in_millis | The total time for all query operations using concurrent segment search, in milliseconds. |
| search.concurrent_query_current | The number of query operations using concurrent segment search that are currently running. |
| search.query_avg_concurrency | The average concurrency of query operations using concurrent segment search. |
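
As a rough illustration of how the proposed counters could feed the insights mentioned at the top of this issue (average latency, concurrent vs sequential share), here is a minimal sketch. The class and method names are hypothetical, and it assumes that `query_total` counts both concurrent and sequential queries, which this thread does not confirm.

```java
/** Hypothetical helper; not part of OpenSearch. Derives ratios from the proposed counters. */
final class ConcurrentSearchRatios {

    private ConcurrentSearchRatios() {}

    /** Average latency per concurrent query: concurrent_query_time_in_millis / concurrent_query_total. */
    static double avgConcurrentQueryMillis(long concurrentQueryTotal, long concurrentQueryTimeInMillis) {
        // Guard against division by zero when no concurrent queries have run yet.
        return concurrentQueryTotal == 0 ? 0.0 : (double) concurrentQueryTimeInMillis / concurrentQueryTotal;
    }

    /** Share of queries that took the concurrent path (assumes query_total includes them). */
    static double concurrentShare(long queryTotal, long concurrentQueryTotal) {
        return queryTotal == 0 ? 0.0 : (double) concurrentQueryTotal / queryTotal;
    }
}
```

For example, 4 concurrent queries taking 200 ms in total out of 10 queries overall would give an average of 50 ms and a concurrent share of 0.4.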

Sample requests for reference (without new metrics):

curl -X GET "localhost:9200/my-index-000001/_stats/search?pretty"
{
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_all" : {
    "primaries" : {
      "search" : {
        "open_contexts" : 0,
        "query_total" : 0,
        "query_time_in_millis" : 0,
        "query_current" : 0,
        "fetch_total" : 0,
        "fetch_time_in_millis" : 0,
        "fetch_current" : 0,
        "scroll_total" : 0,
        "scroll_time_in_millis" : 0,
        "scroll_current" : 0,
        "point_in_time_total" : 0,
        "point_in_time_time_in_millis" : 0,
        "point_in_time_current" : 0,
        "suggest_total" : 0,
        "suggest_time_in_millis" : 0,
        "suggest_current" : 0
      }
    },
    "total" : {
      "search" : {
        "open_contexts" : 0,
        "query_total" : 0,
        "query_time_in_millis" : 0,
        "query_current" : 0,
        "fetch_total" : 0,
        "fetch_time_in_millis" : 0,
        "fetch_current" : 0,
        "scroll_total" : 0,
        "scroll_time_in_millis" : 0,
        "scroll_current" : 0,
        "point_in_time_total" : 0,
        "point_in_time_time_in_millis" : 0,
        "point_in_time_current" : 0,
        "suggest_total" : 0,
        "suggest_time_in_millis" : 0,
        "suggest_current" : 0
      }
    }
  },
  "indices" : {
    "my-index-000001" : {
      "uuid" : "FQKBJoW9T-KdlI8KHLCThA",
      "primaries" : {
        "search" : {
          "open_contexts" : 0,
          "query_total" : 0,
          "query_time_in_millis" : 0,
          "query_current" : 0,
          "fetch_total" : 0,
          "fetch_time_in_millis" : 0,
          "fetch_current" : 0,
          "scroll_total" : 0,
          "scroll_time_in_millis" : 0,
          "scroll_current" : 0,
          "point_in_time_total" : 0,
          "point_in_time_time_in_millis" : 0,
          "point_in_time_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        }
      },
      "total" : {
        "search" : {
          "open_contexts" : 0,
          "query_total" : 0,
          "query_time_in_millis" : 0,
          "query_current" : 0,
          "fetch_total" : 0,
          "fetch_time_in_millis" : 0,
          "fetch_current" : 0,
          "scroll_total" : 0,
          "scroll_time_in_millis" : 0,
          "scroll_current" : 0,
          "point_in_time_total" : 0,
          "point_in_time_time_in_millis" : 0,
          "point_in_time_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        }
      }
    }
  }
}

curl "localhost:9200/_nodes/stats/indices/search?pretty"
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "runTask",
  "nodes" : {
    "5zw1q4MxTFyoobBDhUESEQ" : {
      "timestamp" : 1693249140022,
      "name" : "runTask-0",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client"
      ],
      "attributes" : {
        "testattr" : "test",
        "shard_indexing_pressure_enabled" : "true"
      },
      "indices" : {
        "search" : {
          "open_contexts" : 0,
          "query_total" : 0,
          "query_time_in_millis" : 0,
          "query_current" : 0,
          "fetch_total" : 0,
          "fetch_time_in_millis" : 0,
          "fetch_current" : 0,
          "scroll_total" : 0,
          "scroll_time_in_millis" : 0,
          "scroll_current" : 0,
          "point_in_time_total" : 0,
          "point_in_time_time_in_millis" : 0,
          "point_in_time_current" : 0,
          "suggest_total" : 0,
          "suggest_time_in_millis" : 0,
          "suggest_current" : 0
        }
      }
    }
  }
}

Reference PRs for PIT changes:

reta · 2023-08-29T14:19:36Z

Thanks @sohami, two more to suggest (the naming could be improved):

| Metric | Description |
| --- | --- |
| search.concurrent_pool_queue_size | The queue size of the index searcher pool |
| search.concurrent_pool_wait_time | The amount of time the index searcher tasks spend in queue vs being scheduled right away |

These metrics should help with proper index searcher thread pool sizing I think.

jed326 · 2023-08-29T14:27:00Z

@reta thanks for the suggestion! It seems like these metrics should go under thread_pool metrics instead of under search metrics. I do agree that they would both be useful, though. What do you think?

sohami · 2023-08-29T15:38:56Z

> It seems like these metrics should go under thread_pool metrics instead of under search metrics

Thread pool queue size stats are already available for all thread pools via the _cat/thread_pool API. For pool_wait_time, I like the idea of adding it to the thread_pool metrics so that it is available for all pools and not only for search.

jed326 · 2023-08-30T02:10:09Z

Neither the search.query_avg_concurrency metric nor the thread_pool.pool_wait_time metric is straightforward to implement.

`search.query_avg_concurrency`

The MeanMetric class gives us an easy way to compute the mean within a shard, but the stats get summed up across shards in the total result:

OpenSearch/server/src/main/java/org/opensearch/action/admin/indices/stats/IndexShardStats.java

Lines 84 to 94 in e354201

    
    public CommonStats getTotal() {
        if (total != null) {
            return total;
        }
        CommonStats stats = new CommonStats();
        for (ShardStats shard : shards) {
            stats.add(shard.getStats());
        }
        total = stats;
        return stats;
    }

This makes it difficult to compute the average concurrency across all of the shards, for two reasons. First, we only want to consider shards whose average concurrency is > 0, because average concurrency only covers requests that use concurrent search. Second, we need some way to track how many shards with a value > 0 are in the overall response and factor that count into the result.
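
One possible way around both obstacles, sketched here under assumptions rather than taken from the OpenSearch code base, is to carry two additive counters per shard (the sum of per-query concurrency and the number of concurrent queries) and divide only when the value is read. Sums stay correct when shard stats are added together, and shards without concurrent queries contribute zero to both counters, so they drop out naturally. Note that this yields a query-weighted average rather than an average of per-shard averages. The class name `ConcurrencyStats` and its methods are hypothetical.

```java
/** Hypothetical per-shard stats holder; not an existing OpenSearch class. */
final class ConcurrencyStats {
    private long concurrencySum;     // sum of the concurrency used by each concurrent query
    private long concurrentQueries;  // number of queries that used concurrent search

    /** Records one query that ran with the given concurrency. */
    void record(int concurrency) {
        concurrencySum += concurrency;
        concurrentQueries++;
    }

    /** Merges another shard's counters; both fields are plain sums, so addition is safe. */
    void add(ConcurrencyStats other) {
        concurrencySum += other.concurrencySum;
        concurrentQueries += other.concurrentQueries;
    }

    /** Average concurrency over all recorded concurrent queries, or 0 if there were none. */
    double average() {
        return concurrentQueries == 0 ? 0.0 : (double) concurrencySum / concurrentQueries;
    }
}
```

Merging a shard that recorded concurrencies 4 and 2, a shard with no concurrent queries, and a shard that recorded 6 gives (4 + 2 + 6) / 3 = 4.0.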

`thread_pool.pool_wait_time`

The existing thread_pool metrics come from the ThreadPoolExecutor class in java.util.concurrent. See https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/ThreadPoolExecutor.html for more details.

OpenSearch/server/src/main/java/org/opensearch/threadpool/ThreadPool.java

Lines 384 to 396 in aca2e9d

    
    if (holder.executor() instanceof ThreadPoolExecutor) {
        ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) holder.executor();
        threads = threadPoolExecutor.getPoolSize();
        queue = threadPoolExecutor.getQueue().size();
        active = threadPoolExecutor.getActiveCount();
        largest = threadPoolExecutor.getLargestPoolSize();
        completed = threadPoolExecutor.getCompletedTaskCount();
        RejectedExecutionHandler rejectedExecutionHandler = threadPoolExecutor.getRejectedExecutionHandler();
        if (rejectedExecutionHandler instanceof XRejectedExecutionHandler) {
            rejected = ((XRejectedExecutionHandler) rejectedExecutionHandler).rejected();
        }
    }
    stats.add(new ThreadPoolStats.Stats(name, threads, queue, active, rejected, largest, completed));

The executor class does not provide wait time, so we would need to implement our own wait time calculation. This is a fairly involved change that affects all thread pools, so I will create a separate issue to track it; I do believe wait time is a valuable metric to have.
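
To give a feel for what such a calculation could look like, here is a minimal sketch using only `java.util.concurrent`. It is an assumption about one possible approach, not the design adopted by OpenSearch: the hypothetical `WaitTimeTrackingExecutor` stamps each task at submission and records the elapsed time when a worker thread starts running it.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical executor that measures how long tasks wait before they start running. */
final class WaitTimeTrackingExecutor extends ThreadPoolExecutor {
    private final LongAdder totalWaitNanos = new LongAdder();
    private final LongAdder startedTasks = new LongAdder();

    WaitTimeTrackingExecutor(int threads) {
        super(threads, threads, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
    }

    @Override
    public void execute(Runnable task) {
        // Stamp the task on the submitting thread, before it can be queued.
        final long submittedAt = System.nanoTime();
        super.execute(() -> {
            // Runs on a worker thread: the difference is the time between submission and start.
            totalWaitNanos.add(System.nanoTime() - submittedAt);
            startedTasks.increment();
            task.run();
        });
    }

    /** Cumulative wait time of all started tasks, in milliseconds. */
    long totalWaitMillis() {
        return TimeUnit.NANOSECONDS.toMillis(totalWaitNanos.sum());
    }

    /** Number of tasks that have started running. */
    long startedTasks() {
        return startedTasks.sum();
    }
}
```

Because `submit` delegates to `execute` in `AbstractExecutorService`, tasks submitted either way are measured. Wrapping the task in a lambda does mean the queue no longer holds the caller's original `Runnable`, which is one reason a real change across all thread pools would need more care than this sketch shows.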

jed326 · 2023-08-30T18:59:24Z

Tracking the remaining metrics in separate issues:

sohami added the enhancement and untriaged labels May 2, 2023

mch2 added the Search label May 9, 2023

dbwiddis removed the untriaged label May 12, 2023

macohen removed the Search label May 15, 2023

sohami assigned jed326 Aug 21, 2023

jed326 mentioned this issue Aug 29, 2023:
- Adding concurrent search versions of query count/time metrics #9622 (merged)

This was referenced Aug 30, 2023:
- [Concurrent Segment Search] Implement search.query_avg_concurrency metric #9644 (closed)
- [Concurrent Segment Search] Implement thread_pool.pool_wait_time #9645 (closed)

sohami closed this as completed in #9622 Aug 30, 2023
