Commit

[Alerting][Docs] Adds Alerting & Task Manager Scalability Guidance & Health Monitoring (#91171)

Documentation for scaling Kibana alerting: which configurations can be changed, what impact they have, and so on.
Scaling Alerting relies heavily on scaling Task Manager, so these docs also cover Task Manager health monitoring and scaling.
gmmorris authored Mar 4, 2021
1 parent 462fe08 commit 79134b3
Showing 18 changed files with 1,232 additions and 287 deletions.
133 changes: 133 additions & 0 deletions docs/api/task-manager/health.asciidoc
@@ -0,0 +1,133 @@
[[task-manager-api-health]]
=== Get Task Manager health API
++++
<titleabbrev>Get Task Manager health</titleabbrev>
++++

Retrieve the health status of the {kib} Task Manager.

[[task-manager-api-health-request]]
==== Request

`GET <kibana host>:<port>/api/task_manager/_health`

[[task-manager-api-health-codes]]
==== Response code

`200`::
Indicates a successful call.

[[task-manager-api-health-example]]
==== Example

Retrieve the health status of the {kib} Task Manager:

[source,sh]
--------------------------------------------------
$ curl -X GET api/task_manager/_health
--------------------------------------------------
// KIBANA

The API returns the following:

[source,json]
--------------------------------------------------
{
  "id": "15415ecf-cdb0-4fef-950a-f824bd277fe4",
  "timestamp": "2021-02-16T11:38:10.077Z",
  "status": "OK",
  "last_update": "2021-02-16T11:38:09.934Z",
  "stats": {
    "configuration": {
      "timestamp": "2021-02-16T11:29:05.055Z",
      "value": {
        "request_capacity": 1000,
        "max_poll_inactivity_cycles": 10,
        "monitored_aggregated_stats_refresh_rate": 60000,
        "monitored_stats_running_average_window": 50,
        "monitored_task_execution_thresholds": {
          "default": {
            "error_threshold": 90,
            "warn_threshold": 80
          },
          "custom": {}
        },
        "poll_interval": 3000,
        "max_workers": 10
      },
      "status": "OK"
    },
    "runtime": {
      "timestamp": "2021-02-16T11:38:09.934Z",
      "value": {
        "polling": {
          "last_successful_poll": "2021-02-16T11:38:09.934Z",
          "last_polling_delay": "2021-02-16T11:29:05.053Z",
          "duration": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "claim_conflicts": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "claim_mismatches": {
            "p50": 0,
            "p90": 0,
            "p95": 0,
            "p99": 0
          },
          "result_frequency_percent_as_number": {
            "Failed": 0,
            "NoAvailableWorkers": 0,
            "NoTasksClaimed": 0,
            "RanOutOfCapacity": 0,
            "RunningAtCapacity": 0,
            "PoolFilled": 0
          }
        },
        "drift": {
          "p50": 0,
          "p90": 0,
          "p95": 0,
          "p99": 0
        },
        "load": {
          "p50": 0,
          "p90": 0,
          "p95": 0,
          "p99": 0
        },
        "execution": {
          "duration": {},
          "result_frequency_percent_as_number": {}
        }
      },
      "status": "OK"
    },
    "workload": {
      "timestamp": "2021-02-16T11:38:05.826Z",
      "value": {
        "count": 26,
        "task_types": {},
        "schedule": [],
        "overdue": 0,
        "estimated_schedule_density": []
      },
      "status": "OK"
    }
  }
}
--------------------------------------------------

The health API response is described in <<making-sense-of-task-manager-health-stats>>.

The health monitoring API exposes three sections:

* `configuration` is described in detail under <<task-manager-health-evaluate-the-configuration>>
* `workload` is described in detail under <<task-manager-health-evaluate-the-workload>>
* `runtime` is described in detail under <<task-manager-health-evaluate-the-runtime>>
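
To inspect a single section without scanning the entire response, you can pipe the output through a JSON processor such as `jq`. This is a minimal sketch only: the URL and credentials are placeholders for your own deployment, and `jq` is assumed to be installed.

[source,sh]
--------------------------------------------------
# Placeholder URL and credentials: substitute the address and user of your own Kibana deployment.
# Prints only the runtime section of the Task Manager health stats.
curl -s -u elastic:changeme "http://localhost:5601/api/task_manager/_health" | jq '.stats.runtime'
--------------------------------------------------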
1 change: 1 addition & 0 deletions docs/developer/plugin-list.asciidoc
@@ -527,6 +527,7 @@ routes, etc.
|{kib-repo}blob/{branch}/x-pack/plugins/task_manager/README.md[taskManager]
|The task manager is a generic system for running background tasks.
Documentation: https://www.elastic.co/guide/en/kibana/master/task-manager-production-considerations.html
|{kib-repo}blob/{branch}/x-pack/plugins/telemetry_collection_xpack/README.md[telemetryCollectionXpack]
13 changes: 13 additions & 0 deletions docs/settings/task-manager-settings.asciidoc
@@ -28,5 +28,18 @@ Task Manager runs background tasks by polling for work on an interval. You can
| `xpack.task_manager.max_workers`
| The maximum number of tasks that this Kibana instance will run simultaneously. Defaults to 10.
Starting in 8.0, it will not be possible to set the value greater than 100.
|===

[float]
[[task-manager-health-settings]]
==== Task Manager Health settings

Settings that configure the <<task-manager-health-monitoring>> endpoint.

[cols="2*<"]
|===
| `xpack.task_manager.`
`monitored_task_execution_thresholds`
| Configures the threshold of failed task executions at which point the `warn` or `error` health status is set under each task type execution status (under `stats.runtime.value.execution.result_frequency_percent_as_number[${task type}].status`). This setting supports both a default level and custom levels for specific task types. By default, the health of every task type is marked as `warn` when failed executions exceed 80% and as `error` when they exceed 90%. Custom configurations let you lower these thresholds to catch failures sooner for task types you consider critical, such as alerting tasks (see the example following this table). A threshold can be any number between 0 and 100, and it is hit when the observed value *exceeds* that number. This means a threshold of 100 can never be exceeded, so the corresponding status is never set, while a threshold of 0 is exceeded as soon as a single task fails.

|===
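
As a sketch of what this might look like in `kibana.yml` (the threshold values and the `alerting:.index-threshold` task type below are illustrative choices, not recommendations):

[source,yml]
--------------------------------------------------
xpack.task_manager.monitored_task_execution_thresholds:
  # Default thresholds applied to every task type.
  default:
    error_threshold: 70
    warn_threshold: 50
  # Stricter thresholds for a specific task type, keyed by its task type name.
  custom:
    "alerting:.index-threshold":
      error_threshold: 50
      warn_threshold: 0
--------------------------------------------------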
2 changes: 1 addition & 1 deletion docs/setup/settings.asciidoc
@@ -683,5 +683,5 @@ include::{kib-repo-dir}/settings/reporting-settings.asciidoc[]
include::secure-settings.asciidoc[]
include::{kib-repo-dir}/settings/security-settings.asciidoc[]
include::{kib-repo-dir}/settings/spaces-settings.asciidoc[]
include::{kib-repo-dir}/settings/telemetry-settings.asciidoc[]
include::{kib-repo-dir}/settings/task-manager-settings.asciidoc[]
include::{kib-repo-dir}/settings/telemetry-settings.asciidoc[]
8 changes: 8 additions & 0 deletions docs/user/alerting/alerting-getting-started.asciidoc
@@ -164,6 +164,14 @@ If you are using an *on-premises* Elastic Stack deployment with <<using-kibana-w

* You must enable Transport Layer Security (TLS) for communication <<configuring-tls-kib-es, between {es} and {kib}>>. {kib} alerting uses <<api-keys, API keys>> to secure background alert checks and actions, and API keys require {ref}/configuring-tls.html#tls-http[TLS on the HTTP interface]. A proxy will not suffice.

[float]
[[alerting-setup-production]]
== Production considerations and scaling guidance

When relying on alerts and actions as mission-critical services, make sure you follow the <<alerting-production-considerations,Alerting production considerations>>.

See <<alerting-scaling-guidance>> for more information on the scalability of {kib} alerting.

[float]
[[alerting-security]]
== Security
35 changes: 0 additions & 35 deletions docs/user/alerting/alerting-production-considerations.asciidoc

This file was deleted.

55 changes: 55 additions & 0 deletions docs/user/alerting/alerting-troubleshooting.asciidoc
@@ -0,0 +1,55 @@
[role="xpack"]
[[alerting-troubleshooting]]
== Alerting Troubleshooting

This page describes how to resolve common problems you might encounter with Alerting.
If your problem isn’t described here, please review open issues in the following GitHub repository:

* https://github.com/elastic/kibana/issues[kibana] (https://github.com/elastic/kibana/issues?q=is%3Aopen+is%3Aissue+label%3AFeature%3AAlerting[Alerting issues])

Have a question? Contact us in the https://discuss.elastic.co/[discuss forum].

[float]
[[alerts-small-check-interval-run-late]]
=== Alerts with small check intervals run late

*Problem*:

Alerts with a small check interval, such as every two seconds, run later than scheduled.

*Resolution*:

Alerts run as background tasks at a cadence defined by their *check interval*.
When an alert's *check interval* is smaller than the Task Manager <<task-manager-settings,`poll_interval`>>, the alert will run late.

Either tweak the <<task-manager-settings,{kib} Task Manager settings>> or increase the *check interval* of the alerts in question.

For more details, see <<task-manager-health-scheduled-tasks-small-schedule-interval-run-late>>.
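
As an illustration only (not a recommendation), lowering the poll interval in `kibana.yml` below the smallest *check interval* you rely on lets such alerts be picked up on time, at the cost of more frequent Task Manager queries against {es}:

[source,yml]
--------------------------------------------------
# Example value only: poll for work every second instead of the default three seconds.
xpack.task_manager.poll_interval: 1000
--------------------------------------------------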


[float]
[[scheduled-alerts-run-late]]
=== Alerts run late

*Problem*:

Scheduled alerts run at an inconsistent cadence, often running late.

Actions run long after the status of an alert changes, sending a notification of the change too late.

*Resolution*:

Alerts and actions run as background tasks by each {kib} instance at a default rate of ten tasks every three seconds.

If many alerts or actions are scheduled to run at the same time, pending tasks will queue in {es}. Each {kib} instance then polls for pending tasks at a rate of up to ten tasks at a time, at three-second intervals. Because alerts and actions are backed by tasks, it is possible for pending tasks in the queue to exceed this capacity and run late.

For details on diagnosing the underlying causes of such delays, see <<task-manager-health-tasks-run-late>>.

Alerting and action tasks are identified by their type.

* Alert tasks always begin with `alerting:`. For example, the `alerting:.index-threshold` tasks back the <<alert-type-index-threshold, index threshold stack alert>>.
* Action tasks always begin with `actions:`. For example, the `actions:.index` tasks back the <<index-action-type, index action>>.

When diagnosing issues related to Alerting, focus on the tasks that begin with `alerting:` and `actions:`.

For more details on monitoring and diagnosing task execution in Task Manager, see <<task-manager-health-monitoring>>.
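
If you want to see how many of the tasks in your workload are alerting and action tasks, one option is to filter the `task_types` section of the <<task-manager-api-health,health API>> output. This is a sketch that assumes `jq` is installed and uses placeholder connection details:

[source,sh]
--------------------------------------------------
# Placeholder URL and credentials: substitute your own Kibana deployment.
# Keeps only the task types whose names begin with "alerting:" or "actions:".
curl -s -u elastic:changeme "http://localhost:5601/api/task_manager/_health" \
  | jq '.stats.workload.value.task_types
        | with_entries(select(.key | startswith("alerting:") or startswith("actions:")))'
--------------------------------------------------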
2 changes: 1 addition & 1 deletion docs/user/alerting/index.asciidoc
Expand Up @@ -2,4 +2,4 @@ include::alerting-getting-started.asciidoc[]
include::defining-alerts.asciidoc[]
include::action-types.asciidoc[]
include::alert-types.asciidoc[]
include::alerting-production-considerations.asciidoc[]
include::alerting-troubleshooting.asciidoc[]
2 changes: 2 additions & 0 deletions docs/user/index.asciidoc
@@ -13,6 +13,8 @@ include::monitoring/monitoring-kibana.asciidoc[leveloffset=+2]

include::security/securing-kibana.asciidoc[]

include::production-considerations/index.asciidoc[]

include::discover.asciidoc[]

include::dashboard/dashboard.asciidoc[]
@@ -0,0 +1,51 @@
[role="xpack"]
[[alerting-production-considerations]]
== Alerting production considerations

++++
<titleabbrev>Alerting</titleabbrev>
++++

Alerting runs both alert checks and actions as persistent background tasks managed by the Task Manager.

When relying on alerts and actions as mission-critical services, make sure you follow the <<task-manager-production-considerations, production considerations>> for Task Manager.

[float]
[[alerting-background-tasks]]
=== Running background alert checks and actions

{kib} uses background tasks to run alerts and actions, distributed across all {kib} instances in the cluster.

By default, each {kib} instance polls for work at three-second intervals, and can run a maximum of ten concurrent tasks.
These tasks are then run on the {kib} server.

Alerts are recurring background tasks which are rescheduled according to the <<defining-alerts-general-details, check interval>> on completion.
Actions are non-recurring background tasks which are deleted on completion.

For more details on Task Manager, see <<task-manager-background-tasks>>.
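
Both of these defaults are controlled by <<task-manager-settings,Task Manager settings>>. The following `kibana.yml` snippet simply spells out the documented defaults; treat it as a reference point for tuning rather than a recommended configuration:

[source,yml]
--------------------------------------------------
# Documented defaults: poll for work every three seconds and run at most ten concurrent tasks.
xpack.task_manager.poll_interval: 3000
xpack.task_manager.max_workers: 10
--------------------------------------------------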

[IMPORTANT]
==============================================
Alert and action tasks can run late or at an inconsistent schedule.
This is typically a symptom of the specific usage of the cluster in question.
You can address such issues by tweaking the <<task-manager-settings,Task Manager settings>> or scaling the deployment to better suit your use case.
For detailed guidance, see <<alerting-troubleshooting, Alerting Troubleshooting>>.
==============================================

[float]
[[alerting-scaling-guidance]]
=== Scaling Guidance

Because alerts and actions rely on background tasks to perform the majority of their work, you can scale Alerting by following the <<task-manager-scaling-guidance,Task Manager Scaling Guidance>>.

When estimating the required task throughput, keep the following in mind:

* Each alert uses a single recurring task that is scheduled to run at the cadence defined by its <<defining-alerts-general-details,check interval>>.
* Each action uses a single task. However, because <<alerting-concepts-suppressing-duplicate-notifications,actions are taken per instance>>, alerts can generate a large number of non-recurring tasks.

It is difficult to predict how much throughput is needed to ensure all alerts and actions are executed at consistent schedules.
By counting alerts as recurring tasks and actions as non-recurring tasks, a rough throughput <<task-manager-rough-throughput-estimation,can be estimated>> as a _tasks per minute_ measurement.

Predicting the buffer required to account for actions depends heavily on the alert types you use, the number of alert instances they might detect, and the number of actions you might assign to action groups. With that in mind, regularly <<task-manager-health-monitoring,monitor the health>> of your Task Manager instances.
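
As a purely hypothetical illustration of such an estimate: with the defaults of ten concurrent tasks polled every three seconds, a single {kib} instance provides roughly 200 tasks per minute of throughput. A deployment running 150 alerts on a one-minute check interval, plus an estimated 50 action executions per minute, would consume about 200 tasks per minute and leave no headroom, which suggests adding a {kib} instance or increasing `xpack.task_manager.max_workers`. These numbers are invented for illustration only; measure your actual workload with the <<task-manager-health-monitoring,health monitoring endpoint>> before scaling.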
5 changes: 5 additions & 0 deletions docs/user/production-considerations/index.asciidoc
@@ -0,0 +1,5 @@
include::production.asciidoc[]
include::alerting-production-considerations.asciidoc[]
include::task-manager-production-considerations.asciidoc[]
include::task-manager-health-monitoring.asciidoc[]
include::task-manager-troubleshooting.asciidoc[]
@@ -1,5 +1,9 @@
[[production]]
== Use {kib} in a production environment
= Use {kib} in a production environment

++++
<titleabbrev>Production considerations</titleabbrev>
++++

* <<configuring-kibana-shield>>
* <<csp-strict-mode>>