docs: health report first pass #16513

Draft
wants to merge 3 commits into
base: 8.x
Changes from 2 commits
156 changes: 154 additions & 2 deletions docs/static/monitoring/monitoring-apis.asciidoc
@@ -2,13 +2,13 @@
[[monitoring]]
== APIs for monitoring {ls}

{ls} provides monitoring APIs for retrieving runtime metrics
about {ls}:
{ls} provides monitoring APIs for retrieving runtime information about {ls}:

* <<node-info-api>>
* <<plugins-api>>
* <<node-stats-api>>
* <<hot-threads-api>>
* <<logstash-health-report-api>>


You can use the root resource to retrieve general information about the Logstash instance, including
@@ -1184,3 +1184,155 @@ Example of a human-readable response:
org.jruby.internal.runtime.NativeThread.join(NativeThread.java:75)

--------------------------------------------------


[[logstash-health-report-api]]
=== Health report API

An API that reports the health status of Logstash.

[source,js]
--------------------------------------------------
curl -XGET 'localhost:9600/_health_report?pretty'
--------------------------------------------------
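
The same request can also be made from a script. The following is a minimal sketch in Python using only the standard library. The helper names `health_report_url` and `fetch_health_report` are hypothetical, and the default host and port are taken from the curl example above.

```python
import json
from urllib.request import urlopen


def health_report_url(host="localhost", port=9600):
    """Build the health report URL (defaults mirror the curl example)."""
    return f"http://{host}:{port}/_health_report"


def fetch_health_report(host="localhost", port=9600, timeout=5.0):
    """Request the health report and decode the JSON body into a dict.

    Requires a running Logstash instance whose API is reachable.
    """
    with urlopen(health_report_url(host, port), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_health_report()` against a running instance should return the decoded report, whose fields are described under "Response body".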

==== Description

The health API returns a report with the health status of Logstash and the pipelines that are running inside of it.
The report contains a list of indicators that compose Logstash functionality.

Each indicator has a health status of: `green`, `unknown`, `yellow`, or `red`.
The indicator will provide an explanation and metadata describing the reason for its current health status.

The top-level status is controlled by the worst indicator status.
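
As an illustration of "worst indicator status", here is a minimal Python sketch. It assumes the statuses rank in the order they are listed above (`green`, `unknown`, `yellow`, `red`); that ranking, in particular where `unknown` falls, is an assumption and is not stated by this document. The name `worst_status` is hypothetical.

```python
# Assumed severity order, least to most severe, following the order in
# which the statuses are listed above. This ranking is an assumption.
SEVERITY_ORDER = ["green", "unknown", "yellow", "red"]


def worst_status(statuses):
    """Return the most severe status among the given indicator statuses."""
    # list.index gives each status its rank; max picks the highest rank.
    return max(statuses, key=SEVERITY_ORDER.index)
```

For example, under this assumed ordering a set of indicators reporting `green`, `yellow`, and `green` would produce a top-level status of `yellow`.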

When an indicator's status is non-green, the indicator result may include a list of impacts that detail the functionality negatively affected by the health issue.
Each impact carries with it a severity level, an area of the system that is affected, and a simple description of the impact on the system.

Some health indicators can determine the root cause of a health problem and prescribe a set of steps that can be performed in order to improve the health of the system.
The root cause and remediation steps are encapsulated in a `diagnosis`.
A diagnosis contains a cause detailing a root cause analysis, an action containing a brief description of the steps to take to fix the problem, and the URL for detailed troubleshooting help.

NOTE: The health indicators perform root cause analysis of non-green health statuses.
This can be computationally expensive when called frequently.

==== Response body

`status`::
(Optional, string) Health status of {ls}, based on the aggregated status of all indicators. Statuses are:

`green`:::
{ls} is healthy.

`unknown`:::
The health of {ls} could not be determined.

`yellow`:::
The functionality of {ls} is in a degraded state and may need remediation to avoid the health becoming `red`.

`red`:::
{ls} is experiencing an outage or certain features are unavailable for use.

`indicators`::
(object) Information about the health of the {ls} indicators.

+
.Properties of `indicators`
[%collapsible%open]
====
`<indicator>`::
(object) Contains health results for an indicator.
+
.Properties of `<indicator>`
[%collapsible%open]
=======
`status`::
(string) Health status of the indicator. Statuses are:

`green`:::
The indicator is healthy.

`unknown`:::
The health of the indicator could not be determined.

`yellow`:::
The functionality of an indicator is in a degraded state and may need remediation to avoid the health becoming `red`.

`red`:::
The indicator is experiencing an outage or certain features are unavailable for use.

`symptom`::
(string) A message providing information about the current health status.

`details`::
(Optional, object) An object that contains additional information about the indicator that has led to the current health status result.
Each indicator has <<logstash-health-api-response-details, a unique set of details>>.

`impacts`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of impacts that this health status will have on {ls}.
+
.Properties of `impacts`
[%collapsible%open]
========
`severity`::
(integer) How important this impact is to the functionality of {ls}.
A value of 1 is the highest severity, with larger values indicating lower severity.

`description`::
(string) A description of the impact on {ls}.

`impact_areas`::
(array of strings) The areas of {ls} functionality that this impact affects.
Possible values are:
+
--
* `pipeline_execution`
--

========

`diagnosis`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of diagnosis that encapsulate the cause of the health issue and an action to take in order to remediate the problem.
Contributor

😆 (as a non-native-speaker this is my most-time mistake diagnosis vs diagnoses 🙈)

Suggested change
(Optional, array) If a non-healthy status is returned, indicators may include a list of diagnosis that encapsulate the cause of the health issue and an action to take in order to remediate the problem.
(Optional, array) If a non-healthy status is returned, indicators may include a list of diagnoses that encapsulate the cause of the health issue and an action to take in order to remediate the problem.

Member Author

This is a verbatim from the Elasticsearch docs 😩

+
.Properties of `diagnosis`
[%collapsible%open]
========
`cause`::
(string) A description of a root cause of this health problem.

`action`::
(string) A brief description of the steps that should be taken to remediate the problem.
A more detailed step-by-step guide to remediate the problem is provided by the `help_url` field.

`help_url`::
(string) A link to the troubleshooting guide for resolving the health problem.
========
=======
====
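
To show how the fields described above fit together, here is a minimal Python sketch that walks a decoded report and lists each non-green indicator with its impacts and diagnoses. The function name `summarize_problems` and every value in `SAMPLE_REPORT` are hypothetical and written by hand for illustration; only the field names come from the description above.

```python
def summarize_problems(report):
    """Return summary lines for every non-green indicator in a report."""
    lines = []
    for name, indicator in report.get("indicators", {}).items():
        if indicator.get("status") == "green":
            continue
        lines.append(f"{name}: {indicator.get('status')} - {indicator.get('symptom')}")
        # A severity of 1 is the highest, so sort ascending to list it first.
        impacts = sorted(indicator.get("impacts", []), key=lambda impact: impact["severity"])
        for impact in impacts:
            areas = ", ".join(impact.get("impact_areas", []))
            lines.append(f"  impact (severity {impact['severity']}; {areas}): {impact['description']}")
        for diagnosis in indicator.get("diagnosis", []):
            lines.append(f"  cause: {diagnosis['cause']}")
            lines.append(f"  action: {diagnosis['action']} (see {diagnosis['help_url']})")
    return lines


# Hypothetical report: all values are made up, not real Logstash output.
SAMPLE_REPORT = {
    "status": "yellow",
    "indicators": {
        "pipelines": {
            "status": "yellow",
            "symptom": "example symptom",
            "impacts": [
                {
                    "severity": 1,
                    "description": "example impact",
                    "impact_areas": ["pipeline_execution"],
                }
            ],
            "diagnosis": [
                {
                    "cause": "example cause",
                    "action": "example action",
                    "help_url": "https://example.invalid/help",
                }
            ],
        }
    },
}

for line in summarize_problems(SAMPLE_REPORT):
    print(line)
```

Because `details`, `impacts`, and `diagnosis` are optional, the sketch reads them with defaults rather than assuming they are present.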

[role="child_attributes"]
[[logstash-health-api-response-details]]
==== Indicator Details

Each health indicator in the health API returns a set of details that further explains the state of the system.
The details have contents and a structure that is unique to each indicator.

[[logstash-health-api-response-details-pipeline]]
===== Pipeline Indicator Details

`+pipelines/indicators/<pipeline_id>/details+`::
(object) Information about the specified pipeline.
+
.Properties of `+pipelines/indicators/<pipeline_id>/details+`
[%collapsible%open]
====
`status`::
(object) Details related to the pipeline's current status and run-state.
+
.Properties of `status`
[%collapsible%open]
========
`state`::
(string) The current state of the pipeline, such as `loading`, `running`, `finished`, or `terminated`.
========
====
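
The path `pipelines/indicators/<pipeline_id>/details` can be followed programmatically. The Python sketch below collects the `state` of each pipeline from a decoded report. It assumes the `pipelines` indicator sits under the top-level `indicators` object described earlier; the function name `pipeline_states` and the sample data are hypothetical.

```python
def pipeline_states(report):
    """Map each pipeline ID to the `state` reported in its details."""
    # Assumption: the `pipelines` indicator is nested under top-level `indicators`.
    pipelines = report.get("indicators", {}).get("pipelines", {})
    states = {}
    for pipeline_id, indicator in pipelines.get("indicators", {}).items():
        status_details = indicator.get("details", {}).get("status", {})
        # `details` is optional, so a missing state is recorded as None.
        states[pipeline_id] = status_details.get("state")
    return states


# Hypothetical report fragment with made-up pipeline IDs.
SAMPLE = {
    "indicators": {
        "pipelines": {
            "indicators": {
                "main": {"details": {"status": {"state": "running"}}},
                "batch": {"details": {"status": {"state": "finished"}}},
            }
        }
    }
}

print(pipeline_states(SAMPLE))
```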
37 changes: 37 additions & 0 deletions docs/static/troubleshoot/health-pipeline-status.asciidoc
@@ -0,0 +1,37 @@
[[health-report-pipeline-status]]
=== Health Report Pipeline Status

The Pipeline indicator has a `status` probe that is capable of producing one of several diagnoses about the pipeline's lifecycle, indicating whether the pipeline is currently running.

[[health-report-pipeline-status-diagnosis-loading]]
==== [[loading]]Loading Pipeline

A pipeline that is loading is not yet processing data, and is considered to be in a temporarily degraded state.
Some plugins perform actions or pre-validation that can delay the starting of the pipeline, such as when a plugin pre-establishes a connection to an external service before allowing the pipeline to start.
When these plugins take significant time to start up, the whole pipeline can remain in a loading state for an extended time.

If your pipeline does not come up in a reasonable amount of time, consider checking the Logstash logs to see if the plugin shows evidence of being caught in a retry loop.

[[health-report-pipeline-status-diagnosis-finished]]
==== [[finished]]Finished Pipeline

A Logstash pipeline whose input plugins have all completed will be shut down once events have finished processing.

Many plugins can be configured to run indefinitely, either by listening for new inbound events or by polling for events on a schedule.
A finished pipeline will not produce or process any more events until it is restarted, which will occur if the pipeline's definition is changed and pipeline reloads are enabled.
If you wish to keep your pipeline running, consider configuring its input to run on a schedule or otherwise listen for new events.

[[health-report-pipeline-status-diagnosis-terminated]]
==== [[terminated]]Terminated Pipeline

When a Logstash pipeline's filter or output plugins crash, the entire pipeline is terminated and intervention is required.

A terminated pipeline will not produce or process any more events until it is restarted, which will occur if the pipeline's definition is changed and pipeline reloads are enabled.
Check the logs to determine the cause of the crash, and report the issue to the plugin maintainers.

[[health-report-pipeline-status-diagnosis-unknown]]
==== [[unknown]]Unknown Pipeline

When a Logstash pipeline either cannot be created or has recently been deleted, the health report doesn't know enough to produce a meaningful status.

Check the logs to determine if the pipeline crashed during creation, and report the issue to the plugin maintainers.
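
The guidance in the sections above can be condensed into a small lookup for use in tooling. The Python sketch below is illustrative only: the names `NEXT_STEPS` and `next_step` are hypothetical, the suggestions are paraphrased from this page, and treating `unknown` as a state value alongside the others is an assumption.

```python
# Follow-up suggestions paraphrased from the sections above, keyed by state.
# Assumption: `unknown` appears as a state value like the other three.
NEXT_STEPS = {
    "loading": "Check the Logstash logs for a plugin caught in a retry loop.",
    "finished": "Configure the input to run on a schedule or listen for new events.",
    "terminated": "Check the logs for the cause of the crash and report it to the plugin maintainers.",
    "unknown": "Check the logs to see whether the pipeline crashed during creation.",
}


def next_step(state):
    """Return the suggested follow-up for a state, or None if none applies."""
    return NEXT_STEPS.get(state)
```

A state without an entry, such as `running`, returns `None`, which a caller can treat as "no action needed".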
1 change: 1 addition & 0 deletions docs/static/troubleshoot/troubleshooting.asciidoc
@@ -28,3 +28,4 @@ include::ts-logstash.asciidoc[]
include::ts-plugins-general.asciidoc[]
include::ts-plugins.asciidoc[]
include::ts-other-issues.asciidoc[]
include::health-pipeline-status.asciidoc[]