(Docs+) Flush out Resource+Task troubleshooting (elastic#111773)
* (Docs+) Flush out Resource+Task troubleshooting

---------

Co-authored-by: shainaraskas <58563081+shainaraskas@users.noreply.github.com>
Co-authored-by: David Turner <david.turner@elastic.co>
3 people committed Sep 12, 2024
1 parent 5b2d861 commit 42bd3ea
Showing 5 changed files with 135 additions and 49 deletions.
3 changes: 2 additions & 1 deletion docs/reference/modules/indices/circuit_breaker.asciidoc
@@ -175,7 +175,8 @@ an `OutOfMemory` exception which would bring down the node.
To prevent this from happening, a special <<circuit-breaker, circuit breaker>> is used,
which limits the memory allocation during the execution of a <<eql-sequences, sequence>>
query. When the breaker is triggered, an `org.elasticsearch.common.breaker.CircuitBreakingException`
is thrown and a descriptive error message is returned to the user.
is thrown and a descriptive error message including `circuit_breaking_exception`
is returned to the user.
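
For reference, a tripped breaker typically surfaces in the API response with an
HTTP `429` status and a body along the following lines. This is an illustrative
sketch only; the exact `reason` text and byte values depend on the breaker and
the {es} version.

[source,js]
----
{
  "error": {
    "type": "circuit_breaking_exception",
    "reason": "[eql_sequence] Data too large, ...",
    "bytes_wanted": 167772160,
    "bytes_limit": 157286400,
    "durability": "TRANSIENT"
  },
  "status": 429
}
----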

This <<circuit-breaker, circuit breaker>> can be configured using the following settings:

34 changes: 12 additions & 22 deletions docs/reference/tab-widgets/cpu-usage.asciidoc
@@ -1,30 +1,20 @@
// tag::cloud[]
From your deployment menu, click **Performance**. The page's **CPU Usage** chart
shows your deployment's CPU usage as a percentage.
* (Recommended) Enable {cloud}/ec-monitoring-setup.html[logs and metrics]. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
+
You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
High CPU usage can also deplete your CPU credits. CPU credits let {ess} provide
smaller clusters with a performance boost when needed. The **CPU credits**
chart shows your remaining CPU credits, measured in seconds of CPU time.
* From your deployment menu, view the {cloud}/ec-saas-metrics-accessing.html[**Performance**] page. On this page, you can view two key metrics:
** **CPU usage**: Your deployment's CPU usage, represented as a percentage.
** **CPU credits**: Your remaining CPU credits, measured in seconds of CPU time.
You can also use the <<cat-nodes,cat nodes API>> to get the current CPU usage
for each node.

// tag::cpu-usage-cat-nodes[]
[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc
----

The response's `cpu` column contains the current CPU usage as a percentage. The
`name` column contains the node's name.
// end::cpu-usage-cat-nodes[]
{ess} grants {cloud}/ec-vcpu-boost-instance.html[CPU credits] per deployment
to provide smaller clusters with performance boosts when needed. High CPU
usage can deplete these credits, which might lead to {cloud}/ec-scenario_why_is_performance_degrading_over_time.html[performance degradation] and {cloud}/ec-scenario_why_are_my_cluster_response_times_suddenly_so_much_worse.html[increased cluster response times].
// end::cloud[]
// tag::self-managed[]

Use the <<cat-nodes,cat nodes API>> to get the current CPU usage for each node.

include::cpu-usage.asciidoc[tag=cpu-usage-cat-nodes]

* Enable <<monitoring-overview,{es} monitoring>>. When logs and metrics are enabled, monitoring information is visible on {kib}'s {kibana-ref}/xpack-monitoring.html[Stack Monitoring] page.
+
You can also enable the {kibana-ref}/kibana-alerts.html[CPU usage threshold alert] to be notified about potential issues through email.
// end::self-managed[]
@@ -9,12 +9,29 @@ If a thread pool is depleted, {es} will <<rejected-requests,reject requests>>
related to the thread pool. For example, if the `search` thread pool is
depleted, {es} will reject search requests until more threads are available.

You might experience high CPU usage if a <<data-tiers,data tier>>, and therefore the nodes assigned to that tier, is experiencing more traffic than other tiers. This imbalance in resource utilization is also known as <<hotspotting,hot spotting>>.

[discrete]
[[diagnose-high-cpu-usage]]
==== Diagnose high CPU usage

**Check CPU usage**

You can check the CPU usage per node using the <<cat-nodes,cat nodes API>>:

// tag::cpu-usage-cat-nodes[]
[source,console]
----
GET _cat/nodes?v=true&s=cpu:desc
----

The response's `cpu` column contains the current CPU usage as a percentage.
The `name` column contains the node's name. Elevated but transient CPU usage is
normal. However, if CPU usage is elevated for an extended duration, it should be
investigated.
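
An illustrative response, sorted with the busiest node first (node names and
values are examples only):

[source,txt]
----
ip        heap.percent ram.percent cpu load_1m load_5m load_15m node.role   master name
10.0.0.3            45          98  90    6.10    5.80     5.50 cdfhilmrstw -      node-3
10.0.0.1            38          96  12    0.50    0.45     0.40 cdfhilmrstw *      node-1
10.0.0.2            41          97   8    0.30    0.35     0.30 cdfhilmrstw -      node-2
----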

To track CPU usage over time, we recommend enabling monitoring:

include::{es-ref-dir}/tab-widgets/cpu-usage-widget.asciidoc[]

**Check hot threads**
@@ -24,11 +41,13 @@ threads API>> to check for resource-intensive threads running on the node.

[source,console]
----
GET _nodes/my-node,my-other-node/hot_threads
GET _nodes/hot_threads
----
// TEST[s/\/my-node,my-other-node//]

This API returns a breakdown of any hot threads in plain text.
This API returns a breakdown of any hot threads in plain text. High CPU usage
frequently correlates with <<task-queue-backlog,a long-running task or a
backlog of tasks>>.

[discrete]
[[reduce-cpu-usage]]
@@ -23,9 +23,52 @@ To check the number of rejected tasks for each thread pool, use the

[source,console]
----
GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed
GET /_cat/thread_pool?v=true&h=id,name,queue,active,rejected,completed
----

`write` thread pool rejections frequently surface in the response of the
erring API and in the correlated node logs as an `EsRejectedExecutionException`
that mentions either `QueueResizingEsThreadPoolExecutor` or `queue capacity`.
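
For example, a `write` rejection in the node logs or in the API response might
look similar to the following (node names, sizes, and counts are illustrative):

[source,txt]
----
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException:
rejected execution of ... on EsThreadPoolExecutor[name = node-1/write,
queue capacity = 10000, ... active threads = 8, queued tasks = 10000, ...]
----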

These errors are often related to <<task-queue-backlog,backlogged tasks>>.

[discrete]
[[check-circuit-breakers]]
==== Check circuit breakers

To check the number of tripped <<circuit-breaker,circuit breakers>>, use the
<<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET /_nodes/stats/breaker
----

These statistics are cumulative from node startup. For more information, see
<<circuit-breaker,circuit breaker errors>>.
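
To focus on how often each breaker has tripped, you can narrow the response
with the `filter_path` query parameter (the set of breakers returned varies by
node and version):

[source,console]
----
GET /_nodes/stats/breaker?filter_path=nodes.*.breakers.*.tripped
----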

[discrete]
[[check-indexing-pressure]]
==== Check indexing pressure

To check the number of <<index-modules-indexing-pressure,indexing pressure>>
rejections, use the <<cluster-nodes-stats,node stats API>>.

[source,console]
----
GET _nodes/stats?human&filter_path=nodes.*.indexing_pressure
----

These stats are cumulative from node startup.
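
To narrow the output to just the rejection counters, you can use a more
specific `filter_path`, for example:

[source,console]
----
GET _nodes/stats?filter_path=nodes.*.indexing_pressure.memory.total.*_rejections
----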

Indexing pressure rejections appear as an `EsRejectedExecutionException` and
indicate whether the request was rejected due to `coordinating_and_primary_bytes`,
`coordinating`, `primary`, or `replica` limits.

These errors are often related to <<task-queue-backlog,backlogged tasks>>,
<<docs-bulk,bulk index>> sizing, or the ingest target's
<<index-modules,`refresh_interval` setting>>.
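
If frequent refreshes are contributing to the load, one option is to trial a
longer refresh interval on the affected index. The index name below is a
placeholder; choose an interval that fits your search freshness requirements.

[source,console]
----
PUT /my-index-000001/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}
----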

[discrete]
[[prevent-rejected-requests]]
==== Prevent rejected requests
@@ -34,9 +77,4 @@ GET /_cat/thread_pool?v=true&h=id,name,active,rejected,completed

If {es} regularly rejects requests and other tasks, your cluster likely has high
CPU usage or high JVM memory pressure. For tips, see <<high-cpu-usage>> and
<<high-jvm-memory-pressure>>.

**Prevent circuit breaker errors**

If you regularly trigger circuit breaker errors, see <<circuit-breaker-errors>>
for tips on diagnosing and preventing them.
<<high-jvm-memory-pressure>>.
@@ -1,50 +1,88 @@
[[task-queue-backlog]]
=== Task queue backlog

A backlogged task queue can prevent tasks from completing and
put the cluster into an unhealthy state.
Resource constraints, a large number of tasks being triggered at once,
and long running tasks can all contribute to a backlogged task queue.
A backlogged task queue can prevent tasks from completing and put the cluster
into an unhealthy state. Resource constraints, a large number of tasks being
triggered at once, and long running tasks can all contribute to a backlogged
task queue.

[discrete]
[[diagnose-task-queue-backlog]]
==== Diagnose a task queue backlog

**Check the thread pool status**

A <<high-cpu-usage,depleted thread pool>> can result in <<rejected-requests,rejected requests>>.
A <<high-cpu-usage,depleted thread pool>> can result in
<<rejected-requests,rejected requests>>.

You can use the <<cat-thread-pool,cat thread pool API>> to
see the number of active threads in each thread pool and
how many tasks are queued, how many have been rejected, and how many have completed.
Thread pool depletion might be restricted to a specific <<data-tiers,data tier>>. If <<hotspotting,hot spotting>> is occurring, one node might experience depletion faster than other nodes, leading to performance issues and a growing task backlog.

You can use the <<cat-thread-pool,cat thread pool API>> to see the number of
active threads in each thread pool and how many tasks are queued, how many
have been rejected, and how many have completed.

[source,console]
----
GET /_cat/thread_pool?v&s=t,n&h=type,name,node_name,active,queue,rejected,completed
----

The `active` and `queue` statistics are instantaneous while the `rejected` and
`completed` statistics are cumulative from node startup.
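
An illustrative response, trimmed to two thread pools on a single node (your
pools and values will differ):

[source,txt]
----
type  name   node_name active queue rejected completed
fixed search node-1         8    42        5    210543
fixed write  node-1         4     0        0    980231
----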

**Inspect the hot threads on each node**

If a particular thread pool queue is backed up,
you can periodically poll the <<cluster-nodes-hot-threads,Nodes hot threads>> API
to determine if the thread has sufficient
resources to progress and gauge how quickly it is progressing.
If a particular thread pool queue is backed up, you can periodically poll the
<<cluster-nodes-hot-threads,Nodes hot threads>> API to determine if the thread
has sufficient resources to progress and gauge how quickly it is progressing.

[source,console]
----
GET /_nodes/hot_threads
----
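
The hot threads API also accepts parameters that tune the sampling. For
example, to report the five busiest CPU-consuming threads per node, sampled
over a one-second interval:

[source,console]
----
GET /_nodes/hot_threads?threads=5&interval=1s&type=cpu
----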

**Look for long running tasks**
**Look for long-running node tasks**

Long-running tasks can also cause a backlog. You can use the <<tasks,task
management>> API to get information about the node tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an
excessive amount of time to complete.

[source,console]
----
GET /_tasks?pretty=true&human=true&detailed=true
----

Long-running tasks can also cause a backlog.
You can use the <<tasks,task management>> API to get information about the tasks that are running.
Check the `running_time_in_nanos` to identify tasks that are taking an excessive amount of time to complete.
If a particular `action` is suspected, you can filter the tasks further. The most common long-running tasks are <<docs-bulk,bulk index>>- or search-related.

* Filter for <<docs-bulk,bulk index>> actions:
+
[source,console]
----
GET /_tasks?filter_path=nodes.*.tasks
GET /_tasks?human&detailed&actions=indices:data/write/bulk
----

* Filter for search actions:
+
[source,console]
----
GET /_tasks?human&detailed&actions=indices:data/read/search
----

The API response may contain additional task fields, including `description` and `headers`, which provide the task parameters, target, and requestor. You can use this information to perform further diagnosis.
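
Once a problematic task has been identified, and if it is safe to abandon, you
may be able to stop it with the task management API's cancellation endpoint.
The task ID below is a placeholder, and not every task supports cancellation:

[source,console]
----
POST /_tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel
----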

**Look for long-running cluster tasks**

A task backlog might also appear as a delay in synchronizing the cluster state. You
can use the <<cluster-pending,cluster pending tasks API>> to get information
about the pending cluster state sync tasks that are running.

[source,console]
----
GET /_cluster/pending_tasks
----

Check the `timeInQueue` to identify tasks that are taking an excessive amount
of time to complete.

[discrete]
[[resolve-task-queue-backlog]]
