From 59b985a63233b45fde6d1d399358fbe828799687 Mon Sep 17 00:00:00 2001 From: Stef Nestor <26751266+stefnestor@users.noreply.github.com> Date: Thu, 12 Sep 2024 07:46:40 -0600 Subject: [PATCH] (Docs+) Flush out Resource+Task troubleshooting (#111773) * (Docs+) Flush out Resource+Task troubleshooting --------- Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com> Co-authored-by: David Turner --- .../modules/indices/circuit_breaker.asciidoc | 3 +- docs/reference/tab-widgets/cpu-usage.asciidoc | 34 ++++----- .../common-issues/high-cpu-usage.asciidoc | 23 +++++- .../common-issues/rejected-requests.asciidoc | 52 ++++++++++++-- .../common-issues/task-queue-backlog.asciidoc | 72 ++++++++++++++----- 5 files changed, 135 insertions(+), 49 deletions(-) diff --git a/docs/reference/modules/indices/circuit_breaker.asciidoc b/docs/reference/modules/indices/circuit_breaker.asciidoc index a5a787e23d170..452d4e99704ce 100644 --- a/docs/reference/modules/indices/circuit_breaker.asciidoc +++ b/docs/reference/modules/indices/circuit_breaker.asciidoc @@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node. To prevent this from happening, a special <> is used, which limits the memory allocation during the execution of a <> query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException` -is thrown and a descriptive error message is returned to the user. +is thrown and a descriptive error message including `circuit_breaking_exception` +is returned to the user. This <> can be configured using the following settings: diff --git a/docs/reference/tab-widgets/cpu-usage.asciidoc b/docs/reference/tab-widgets/cpu-usage.asciidoc index 575cf459ee5be..c6272228965eb 100644 --- a/docs/reference/tab-widgets/cpu-usage.asciidoc +++ b/docs/reference/tab-widgets/cpu-usage.asciidoc @@ -1,30 +1,20 @@ // tag::cloud[] -From your deployment menu, click **Performance**. The page's **CPU Usage** chart -shows your deployment's CPU usage as a percentage. +* (Recommended) Enable {cloud}/ec-monitoring-setup.html[logs and metrics]. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page. ++ +You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email. -High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide -smaller clusters with a performance boost when needed. The **CPU credits** -chart shows your remaining CPU credits, measured in seconds of CPU time. +* From your deployment menu, view the {cloud}/ec-saas-metrics-accessing.html[**Performance**] page. On this page, you can view two key metrics: +** **CPU usage**: Your deployment's CPU usage, represented as a percentage. +** **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time. -You can also use the <> to get the current CPU usage -for each node. - -// tag::cpu-usage-cat-nodes[] -[source,console] ----- -GET _cat/nodes?v=true&s=cpu:desc ----- - -The response's `cpu` column contains the current CPU usage as a percentage. The -`name` column contains the node's name. -// end::cpu-usage-cat-nodes[] +{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment +to provide smaller clusters with performance boosts when needed. 
High CPU
+usage can deplete these credits, which might lead to {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[performance degradation] and {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[increased cluster response times].
 // end::cloud[]
 
 // tag::self-managed[]
-
-Use the <> to get the current CPU usage for each node.
-
-include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]
-
+* Enable <>. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
++
+You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
 // end::self-managed[]
diff --git a/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc b/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
index 858683ef97a6d..96a9a8f1e32b7 100644
--- a/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/high-cpu-usage.asciidoc
@@ -9,12 +9,29 @@ If a thread pool is depleted, {es} will <> related to the thread pool. For
 example, if the `search` thread pool is depleted, {es} will reject search
 requests until more threads are available.
 
+You might experience high CPU usage if a <>, and therefore the nodes assigned to that tier, is handling more traffic than other tiers. This imbalance in resource utilization is also known as <>.
+
 [discrete]
 [[diagnose-high-cpu-usage]]
 ==== Diagnose high CPU usage
 
 **Check CPU usage**
 
+You can check the CPU usage per node using the <>:
+
+// tag::cpu-usage-cat-nodes[]
+[source,console]
+----
+GET _cat/nodes?v=true&s=cpu:desc
+----
+
+The response's `cpu` column contains the current CPU usage as a percentage.
+The `name` column contains the node's name. Elevated but transient CPU usage is
+normal. However, if CPU usage is elevated for an extended duration, it should be
+investigated.
+
+To track CPU usage over time, we recommend enabling monitoring:
+
 include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]
 
 **Check hot threads**
@@ -24,11 +41,13 @@ threads API>> to check for resource-intensive threads running on the node.
 
 [source,console]
 ----
-GET _nodes/my-node,my-other-node/hot_threads
+GET _nodes/hot_threads
 ----
 // TEST[s/\/my-node,my-other-node//]
 
-This API returns a breakdown of any hot threads in plain text.
+This API returns a breakdown of any hot threads in plain text. High CPU usage
+frequently correlates with <>.
 
 [discrete]
 [[reduce-cpu-usage]]
diff --git a/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc b/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc
index 497bddc562c69..c863709775fcd 100644
--- a/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/rejected-requests.asciidoc
@@ -23,9 +23,52 @@ To check the number of rejected tasks for each thread pool, use the
 
 [source,console]
 ----
-GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
+GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
 ----
 
+`write` thread pool rejections frequently appear in the response of the erring
+API and in the correlating node log as an `EsRejectedExecutionException` that
+mentions either `QueueResizingEsThreadPoolExecutor` or `queue capacity`.
+
+These errors are often related to <>.
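+
+For example, one way to focus on the `write` thread pool alone is to pass its
+name to the same cat API. This is a minimal sketch; request whichever columns
+you need.
+
+[source,console]
+----
+GET /_cat/thread_pool/write?v=true&h=id,name,queue,active,rejected,completed
+----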
+
+[discrete]
+[[check-circuit-breakers]]
+==== Check circuit breakers
+
+To check the number of tripped <>, use the
+<>.
+
+[source,console]
+----
+GET /_nodes/stats/breaker
+----
+
+These statistics are cumulative from node startup. For more information, see
+<>.
+
+[discrete]
+[[check-indexing-pressure]]
+==== Check indexing pressure
+
+To check the number of <> rejections, use the
+<>.
+
+[source,console]
+----
+GET _nodes/stats?human&filter_path=nodes.*.indexing_pressure
+----
+
+These stats are cumulative from node startup.
+
+Indexing pressure rejections appear as an `EsRejectedExecutionException` and
+indicate that a request was rejected due to `coordinating_and_primary_bytes`,
+`coordinating`, `primary`, or `replica`.
+
+These errors are often related to <>,
+<> sizing, or the ingest target's
+<>.
+
 [discrete]
 [[prevent-rejected-requests]]
 ==== Prevent rejected requests
@@ -34,9 +77,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
 If {es} regularly rejects requests and other tasks, your cluster likely has
 high CPU usage or high JVM memory pressure. For tips, see
 <> and
-<>.
-
-**Prevent circuit breaker errors**
-
-If you regularly trigger circuit breaker errors, see <>
-for tips on diagnosing and preventing them.
\ No newline at end of file
+<>.
\ No newline at end of file
diff --git a/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc b/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
index 1ff5bf2e5c311..5aa6a0129c2d4 100644
--- a/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
+++ b/docs/reference/troubleshooting/common-issues/task-queue-backlog.asciidoc
@@ -1,10 +1,10 @@
 [[task-queue-backlog]]
 === Task queue backlog
 
-A backlogged task queue can prevent tasks from completing and
-put the cluster into an unhealthy state.
-Resource constraints, a large number of tasks being triggered at once,
-and long running tasks can all contribute to a backlogged task queue.
+A backlogged task queue can prevent tasks from completing and put the cluster
+into an unhealthy state. Resource constraints, a large number of tasks being
+triggered at once, and long running tasks can all contribute to a backlogged
+task queue.
 
 [discrete]
 [[diagnose-task-queue-backlog]]
@@ -12,39 +12,77 @@ and long running tasks can all contribute to a backlogged task queue.
 
 **Check the thread pool status**
 
-A <> can result in <>.
+A <> can result in
+<>.
 
-You can use the <> to
-see the number of active threads in each thread pool and
-how many tasks are queued, how many have been rejected, and how many have completed.
+Thread pool depletion might be restricted to a specific <>. If <> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.
+
+You can use the <> to see the number of
+active threads in each thread pool and how many tasks are queued, how many
+have been rejected, and how many have completed.
 
 [source,console]
 ----
 GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
 ----
 
+The `active` and `queue` statistics are instantaneous, while the `rejected` and
+`completed` statistics are cumulative from node startup.
+
 **Inspect the hot threads on each node**
 
-If a particular thread pool queue is backed up,
-you can periodically poll the <> API
-to determine if the thread has sufficient
-resources to progress and gauge how quickly it is progressing.
+If a particular thread pool queue is backed up, you can periodically poll the
+<> API to determine if the thread
+has sufficient resources to progress and gauge how quickly it is progressing.
 
 [source,console]
 ----
 GET /_nodes/hot_threads
 ----
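+
+As a sketch of one way to tune the sample, the hot threads API also accepts
+optional `threads` and `interval` query parameters, so you can report more
+threads over a longer sampling window. The values below are examples only;
+the defaults are usually a reasonable starting point.
+
+[source,console]
+----
+GET /_nodes/hot_threads?threads=10&interval=1s
+----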
 
-**Look for long running tasks**
+**Look for long running node tasks**
+
+Long-running tasks can also cause a backlog. You can use the <> API to get
+information about the node tasks that are running. Check the
+`running_time_in_nanos` to identify tasks that are taking an excessive amount
+of time to complete.
+
+[source,console]
+----
+GET /_tasks?pretty=true&human=true&detailed=true
+----
 
-Long-running tasks can also cause a backlog.
-You can use the <> API to get information about the tasks that are running.
-Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.
+If a particular `action` is suspected, you can filter the tasks further. The
+most common long-running tasks are <>- or search-related.
 
+* Filter for <> actions:
++
 [source,console]
 ----
-GET /_tasks?filter_path=nodes.*.tasks
+GET /_tasks?human&detailed&actions=indices:data/write/bulk
+----
+
+* Filter for search actions:
++
+[source,console]
 ----
+GET /_tasks?human&detailed&actions=indices:data/read/search
+----
+
+The API response may contain additional task information, including the
+`description` and `headers` fields, which provide the task parameters, target,
+and requestor. You can use this information to perform further diagnosis.
+
+**Look for long running cluster tasks**
+
+A task backlog might also appear as a delay in synchronizing the cluster state.
+You can use the <> to get information
+about the pending cluster state sync tasks that are running.
+
+[source,console]
+----
+GET /_cluster/pending_tasks
+----
+
+Check the `time_in_queue` to identify tasks that are taking an excessive amount
+of time to complete.
 
 [discrete]
 [[resolve-task-queue-backlog]]