From 59b985a63233b45fde6d1d399358fbe828799687 Mon Sep 17 00:00:00 2001 From: Stef Nestor <26751266+stefnestor@users.noreply.github.com> Date: Thu, 12 Sep 2024 07:46:40 -0600 Subject: [PATCH] (Docs+) Flush out Resource+Task troubleshooting (#111773) * (Docs+) Flush out Resource+Task troubleshooting --------- Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com> Co-authored-by: David Turner --- .../modules/indices/circuit_breaker.asciidoc | 3 +- docs/reference/tab-widgets/cpu-usage.asciidoc | 34 ++++----- .../common-issues/high-cpu-usage.asciidoc | 23 +++++- .../common-issues/rejected-requests.asciidoc | 52 ++++++++++++-- .../common-issues/task-queue-backlog.asciidoc | 72 ++++++++++++++----- 5 files changed, 135 insertions(+), 49 deletions(-) diff --git a/docs/reference/modules/indices/circuit_breaker.asciidoc b/docs/reference/modules/indices/circuit_breaker.asciidoc index a5a787e23d170..452d4e99704ce 100644 --- a/docs/reference/modules/indices/circuit_breaker.asciidoc +++ b/docs/reference/modules/indices/circuit_breaker.asciidoc @@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node. To prevent this from happening, a special <> is used, which limits the memory allocation during the execution of a <> query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException` -is thrown and a descriptive error message is returned to the user. +is thrown and a descriptive error message including `circuit_breaking_exception` +is returned to the user. This <> can be configured using the following settings: diff --git a/docs/reference/tab-widgets/cpu-usage.asciidoc b/docs/reference/tab-widgets/cpu-usage.asciidoc index 575cf459ee5be..c6272228965eb 100644 --- a/docs/reference/tab-widgets/cpu-usage.asciidoc +++ b/docs/reference/tab-widgets/cpu-usage.asciidoc @@ -1,30 +1,20 @@ // tag::cloud[] -From your deployment menu, click **Performance**. The page's **CPU Usage** chart -shows your deployment's CPU usage as a percentage. +* (Recommended) Enable {cloud}/ec-monitoring-setup.html[logs and metrics]. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page. ++ +You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email. -High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide -smaller clusters with a performance boost when needed. The **CPU credits** -chart shows your remaining CPU credits, measured in seconds of CPU time. +* From your deployment menu, view the {cloud}/ec-saas-metrics-accessing.html[**Performance**] page. On this page, you can view two key metrics: +** **CPU usage**: Your deployment's CPU usage, represented as a percentage. +** **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time. -You can also use the <> to get the current CPU usage -for each node. - -// tag::cpu-usage-cat-nodes[] -[source,console] ----- -GET _cat/nodes?v=true&s=cpu:desc ----- - -The response's `cpu` column contains the current CPU usage as a percentage. The -`name` column contains the node's name. -// end::cpu-usage-cat-nodes[] +{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment +to provide smaller clusters with performance boosts when needed. 
High CPU
+usage can deplete these credits, which might lead to {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[performance degradation] and {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[increased cluster response times].
 // end::cloud[]
 
 // tag::self-managed[]
-
-Use the <> to get the current CPU usage for each node.
-
-include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]
-
+* Enable <>. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
++
+You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
 // end::self-managed[]
diff --git a/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc b/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
index 858683ef97a6d..96a9a8f1e32b7 100644
--- a/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
@@ -9,12 +9,29 @@ If a thread pool is depleted, {es} will <> related to the thread pool. For
 example, if the `search` thread pool is depleted, {es} will reject search
 requests until more threads are available.
 
+You might experience high CPU usage if a <>, and therefore the nodes assigned to that tier, is handling more traffic than other tiers. This imbalance in resource utilization is also known as <>.
+
 [discrete]
 [[diagnose-high-cpu-usage]]
 ==== Diagnose high CPU usage
 
 **Check CPU usage**
 
+You can check the CPU usage per node using the <>:
+
+// tag::cpu-usage-cat-nodes[]
+[source,console]
+----
+GET _cat/nodes?v=true&s=cpu:desc
+----
+
+The response's `cpu` column contains the current CPU usage as a percentage.
+The `name` column contains the node's name. Elevated but transient CPU usage is
+normal. However, if CPU usage is elevated for an extended duration, it should be
+investigated.
+
+To track CPU usage over time, we recommend enabling monitoring:
+
 include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
 
 **Check hot threads**
@@ -24,11 +41,13 @@ threads API>> to check for resource-intensive threads running on the node.
 
 [source,console]
 ----
-GET _nodes/my-node,my-other-node/hot_threads
+GET _nodes/hot_threads
 ----
 // TEST[s/\/my-node,my-other-node//]
 
-This API returns a breakdown of any hot threads in plain text.
+This API returns a breakdown of any hot threads in plain text. High CPU usage
+frequently correlates with <>.
 
 [discrete]
 [[reduce-cpu-usage]]
diff --git a/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc b/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc
index 497bddc562c69..c863709775fcd 100644
--- a/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc
@@ -23,9 +23,52 @@ To check the number of rejected tasks for each thread pool, use the
 
 [source,console]
 ----
-GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
+GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
 ----
 
+`write` thread pool rejections frequently appear in the response of the erring
+API and in the correlating node log as an `EsRejectedExecutionException` that
+mentions either `QueueResizingEsThreadPoolExecutor` or `queue capacity`.
+
+These errors are often related to <>.
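+
+For example, one way to focus on the `write` thread pool alone is to pass its
+name to the same cat API. This is a minimal sketch; request whichever columns
+you need.
+
+[source,console]
+----
+GET /_cat/thread_pool/write?v=true&h=id,name,queue,active,rejected,completed
+----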
+
+[discrete]
+[[check-circuit-breakers]]
+==== Check circuit breakers
+
+To check the number of tripped <>, use the
+<>.
+
+[source,console]
+----
+GET /_nodes/stats/breaker
+----
+
+These statistics are cumulative from node startup. For more information, see
+<>.
+
+[discrete]
+[[check-indexing-pressure]]
+==== Check indexing pressure
+
+To check the number of <> rejections, use the
+<>.
+
+[source,console]
+----
+GET _nodes/stats?human&filter_path=nodes.*.indexing_pressure
+----
+
+These stats are cumulative from node startup.
+
+Indexing pressure rejections appear as an `EsRejectedExecutionException` and
+indicate that a request was rejected due to `coordinating_and_primary_bytes`,
+`coordinating`, `primary`, or `replica`.
+
+These errors are often related to <>,
+<> sizing, or the ingest target's
+<>.
+
 [discrete]
 [[prevent-rejected-requests]]
 ==== Prevent rejected requests
@@ -34,9 +77,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
 If {es} regularly rejects requests and other tasks, your cluster likely has
 high CPU usage or high JVM memory pressure. For tips, see
 <> and
-<>.
-
-**Prevent circuit breaker errors**
-
-If you regularly trigger circuit breaker errors, see <>
-for tips on diagnosing and preventing them.
\ No newline at end of file
+<>.
\ No newline at end of file
diff --git a/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc b/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
index 1ff5bf2e5c311..5aa6a0129c2d4 100644
--- a/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
@@ -1,10 +1,10 @@
 [[task-queue-backlog]]
 === Task queue backlog
 
-A backlogged task queue can prevent tasks from completing and
-put the cluster into an unhealthy state.
-Resource constraints, a large number of tasks being triggered at once,
-and long running tasks can all contribute to a backlogged task queue.
+A backlogged task queue can prevent tasks from completing and put the cluster
+into an unhealthy state. Resource constraints, a large number of tasks being
+triggered at once, and long running tasks can all contribute to a backlogged
+task queue.
 
 [discrete]
 [[diagnose-task-queue-backlog]]
@@ -12,39 +12,77 @@ and long running tasks can all contribute to a backlogged task queue.
 
 **Check the thread pool status**
 
-A <> can result in <>.
+A <> can result in
+<>.
 
-You can use the <> to
-see the number of active threads in each thread pool and
-how many tasks are queued, how many have been rejected, and how many have completed.
+Thread pool depletion might be restricted to a specific <>. If <> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
+
+You can use the <> to see the number of
+active threads in each thread pool and how many tasks are queued, how many
+have been rejected, and how many have completed.
 
 [source,console]
 ----
 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
 ----
 
+The `active` and `queue` statistics are instantaneous, while the `rejected` and
+`completed` statistics are cumulative from node startup.
+
 **Inspect the hot threads on each node**
 
-If a particular thread pool queue is backed up,
-you can periodically poll the <> API
-to determine if the thread has sufficient
-resources to progress and gauge how quickly it is progressing.
+If a particular thread pool queue is backed up, you can periodically poll the
+<> API to determine if the thread
+has sufficient resources to progress and gauge how quickly it is progressing.
 
 [source,console]
 ----
 GET /_nodes/hot_threads
 ----
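+
+As a sketch of one way to tune the sample, the hot threads API also accepts
+optional `threads` and `interval` query parameters, so you can report more
+threads over a longer sampling window. The values below are examples only;
+the defaults are usually a reasonable starting point.
+
+[source,console]
+----
+GET /_nodes/hot_threads?threads=10&interval=1s
+----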
 
-**Look for long running tasks**
+**Look for long running node tasks**
+
+Long-running tasks can also cause a backlog. You can use the <> API to get
+information about the node tasks that are running. Check the
+`running_time_in_nanos` to identify tasks that are taking an excessive amount
+of time to complete.
+
+[source,console]
+----
+GET /_tasks?pretty=true&human=true&detailed=true
+----
 
-Long-running tasks can also cause a backlog.
-You can use the <> API to get information about the tasks that are running.
-Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.
+If a particular `action` is suspected, you can filter the tasks further. The
+most common long-running tasks are <>- or search-related.
 
+* Filter for <> actions:
++
 [source,console]
 ----
-GET /_tasks?filter_path=nodes.*.tasks
+GET /_tasks?human&detailed&actions=indices:data/write/bulk
+----
+
+* Filter for search actions:
++
+[source,console]
 ----
+GET /_tasks?human&detailed&actions=indices:data/read/search
+----
+
+The API response may contain additional task information, including the
+`description` and `headers` fields, which provide the task parameters, target,
+and requestor. You can use this information to perform further diagnosis.
+
+**Look for long running cluster tasks**
+
+A task backlog might also appear as a delay in synchronizing the cluster state.
+You can use the <> to get information
+about the pending cluster state sync tasks that are running.
+
+[source,console]
+----
+GET /_cluster/pending_tasks
+----
+
+Check the `time_in_queue` to identify tasks that are taking an excessive amount
+of time to complete.
 
 [discrete]
 [[resolve-task-queue-backlog]]