(Doc+) Capture Elasticsearch diagnostic
stefnestor committed May 3, 2024
1 parent a561958 commit 3f83dd0
Showing 1 changed file with 115 additions and 0 deletions.
115 changes: 115 additions & 0 deletions docs/reference/troubleshooting/diagnostic.asciidoc
@@ -0,0 +1,115 @@
[[diagnostic]]
=== Diagnostic
++++
<titleabbrev>Capturing Diagnostic</titleabbrev>
++++
:keywords: Elasticsearch diagnostic, diagnostics

An https://github.com/elastic/support-diagnostics[{es} diagnostic] captures
a point-in-time snapshot of cluster statistics and most settings.
It works against all {es} versions and requires a JRE or JDK, version 1.8 or
later. It is useful when escalating to https://support.elastic.co[Elastic Support] or
https://discuss.elastic.co[Elastic Discuss], because it minimizes turnaround time.
Its point-in-time view is also useful when troubleshooting; see
https://www.elastic.co/blog/why-does-elastic-support-keep-asking-for-diagnostic-files[this
blog post for examples].
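
Because the diagnostic needs a Java runtime, it can help to confirm the
installed version before downloading anything. The following is a minimal
sketch, not part of the diagnostic itself: `java_major`,
`java_version_string`, and `check_java` are hypothetical helper names, and
the parsing assumes the usual `java -version` output, which prints a quoted
version string on its first line.

[source,sh]
----
# Hypothetical helper: reduce a Java version string to its major version.
# Handles the legacy "1.x" scheme (1.8.0_392) and the modern one (17.0.9).
java_major() {
  case "$1" in
    1.*) v=${1#1.}; echo "${v%%.*}" ;;
    *)   echo "${1%%.*}" ;;
  esac
}

# Hypothetical helper: extract the quoted version from `java -version`,
# which writes to stderr. Prints nothing if no version string is found.
java_version_string() {
  java -version 2>&1 | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}

# Report whether the Java found on PATH is version 1.8 (major 8) or later.
check_java() {
  major=$(java_major "$(java_version_string)")
  if [ "${major:-0}" -ge 8 ] 2>/dev/null; then
    echo "Java $major found: sufficient for the diagnostic"
  else
    echo "Java 1.8 or later not found on PATH" >&2
    return 1
  fi
}
----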

[TIP]
====
The {es} diagnostic is included as a sub-library within Elastic's platforms:

* {ece}, where you can pull it under {ece} > Deployment > Operations >
Prepare Bundle > {es}.
* {eck}, whose https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-take-eck-dump.html[diagnostic]
pulls it by default.
====

[discrete]
[[diagnostic-capture]]
==== Capture

To capture an {es} diagnostic:

. Download the latest `diagnostics-X.X.X-dist.zip` file (_not_ the "source
code" archive) from
https://github.com/elastic/support-diagnostics/releases/latest[the latest
release]. The steps below refer to the unzipped executable as
`./diagnostics.sh`, which is for Unix-based systems; on Windows, use
`.\diagnostics.bat` instead.

. There are https://github.com/elastic/support-diagnostics#diagnostic-types[three
available `type` values] for capturing your {es} diagnostic:

** `local` (default, **recommended**): polls the <<rest-apis,{es} API>>,
gathers operating system info, and captures cluster and GC logs.
Alternatively, you can use `remote`, which establishes an SSH session
to the target server to pull the same info.

** `api`: polls the <<rest-apis,{es} API>> only; all other data must be
collected manually.

. Verify that network access and user permissions are sufficient to connect
to your {es} cluster by checking its <<cluster-health,Cluster Health>>. For
example, for `host:localhost`, `port:9200`, and `username:elastic`, the curl
request is:
+
[source,sh]
----
# -u without a password makes curl prompt for it; -k skips TLS verification.
curl -X GET -k -u elastic https://localhost:9200/_cluster/health
----
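+
When scripting this check, it can be convenient to capture only the HTTP
status code. A minimal sketch, assuming curl's `-w` write-out option;
`health_status` and the `CURL` override are hypothetical names, and the
override exists only so the function can be exercised without a live cluster:
+
[source,sh]
----
# Print only the HTTP status code of the cluster health request.
# $1 = base URL (for example https://localhost:9200), $2 = username.
# curl prompts for the password because -u is given without one.
health_status() {
  "${CURL:-curl}" -s -k -o /dev/null -w '%{http_code}' \
    -u "$2" "$1/_cluster/health"
}
----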

. Expect an HTTP 200 `OK` response that reports the cluster's
`status`. If you can't successfully curl your {es} host, pause and
review the resulting error, because the diagnostic might not
produce the expected results. Common errors and their next steps:

** HTTP 401 `UNAUTHORIZED` (authentication failed): the error usually tells
you either that your `username:password` pair is invalid or that your
`.security` index is unavailable. In the latter case, set up a temporary
<<file-realm,file-based realm>> user with `role:superuser` to authenticate.

** HTTP 403 `FORBIDDEN` (authorization failed): your `username` is recognized
but has insufficient permissions to run the diagnostic. Either use a different
username or elevate this user's privileges.

** HTTP 429 `TOO_MANY_REQUESTS` (for example `circuit_breaking_exception`):
your username authenticated and was authorized, but the cluster is under
enough strain that it is not reliably responding to API calls. These
responses are usually intermittent, so you can often still proceed with
running the diagnostic, which will pull what it can.

** HTTP 504 `GATEWAY_TIMEOUT`: your network is having trouble reaching
the cluster (for example, because of a proxy or firewall). You might
change where you connect from, confirm your port, or try targeting
the host's IP address instead of its domain name.

** HTTP 503 `SERVICE_UNAVAILABLE` (for example `master_not_discovered_exception`):
your cluster does not currently have an elected master node, which is
required for it to be API-responsive. This may be temporary while a new
master node is elected. Otherwise, do not run step 5; instead, investigate
and resolve <<cluster-fault-detection,cluster fault detection>>
before proceeding.
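
+
The triage in this step can be summarized as a lookup from status code to
next step. A minimal sketch; `explain_status` is a hypothetical helper name,
and its messages only paraphrase the list in this step:
+
[source,sh]
----
# Map an HTTP status code from the health check to a suggested next step.
explain_status() {
  case "$1" in
    200) echo "OK: proceed to run the diagnostic" ;;
    401) echo "Check username:password, or whether .security is unavailable" ;;
    403) echo "User lacks privileges: switch user or elevate privileges" ;;
    429) echo "Cluster under strain: the diagnostic may still pull what it can" ;;
    503) echo "No elected master: resolve cluster fault detection first" ;;
    504) echo "Network path problem: check proxy, firewall, port, or host IP" ;;
    *)   echo "Unexpected status $1: review the response before continuing" ;;
  esac
}
----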

. Once you have a working curl request, use the same parameters to fill in
the https://github.com/elastic/support-diagnostics#standard-options[diagnostic
parameters]. Continuing our example, the most common invocation is:
+
[source,sh]
----
sudo ./diagnostics.sh --type local --host localhost --port 9200 -u elastic -p --bypassDiagVerify --ssl --noVerify
----
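+
One way to avoid a parameter mismatch between the curl check and the
diagnostic is to derive both commands from the same variables. A minimal
sketch: `ES_HOST`, `ES_PORT`, `ES_USER`, `health_cmd`, and `diag_cmd` are
hypothetical names, and the flags are only those already used in this
procedure. The functions print the commands rather than run them, so you can
review them first.
+
[source,sh]
----
# Shared connection parameters (defaults match the example in this procedure).
ES_HOST=${ES_HOST:-localhost}
ES_PORT=${ES_PORT:-9200}
ES_USER=${ES_USER:-elastic}

# Print the health-check command; curl prompts for the password.
health_cmd() {
  echo "curl -X GET -k -u $ES_USER https://$ES_HOST:$ES_PORT/_cluster/health"
}

# Print the matching diagnostic command built from the same parameters.
diag_cmd() {
  echo "sudo ./diagnostics.sh --type local --host $ES_HOST --port $ES_PORT -u $ES_USER -p --bypassDiagVerify --ssl --noVerify"
}
----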

. Once the script has completed, verify that no errors were emitted in
`diagnostic.log`. Common errors to resolve:

** `Error: Could not find or load main class com.elastic.support.diagnostics.DiagnosticApp`
indicates that you downloaded the "source code" archive
instead of the `-dist.zip` diagnostic in step 1.

** `Could not retrieve the {es} version due to a system or network error - unable to continue.`
indicates that the diagnostic could not reach the cluster. Either
step 3 failed, or the parameters differ between
step 3 and step 5.

** `security_exception` with `is unauthorized for user` indicates
insufficient permissions to run the diagnostic tool. Use another
user, or grant the current user `role:superuser` privileges.
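
+
The checks in this step can be sketched as a small log scan.
`scan_diag_log` is a hypothetical helper name, the search strings are
fragments of the messages listed in this step, and the default path assumes
`diagnostic.log` is in the current directory:
+
[source,sh]
----
# Scan a diagnostic log for the common errors and print a hint for each.
# Returns 0 when none are found and 1 otherwise.
scan_diag_log() {
  log=${1:-diagnostic.log}
  found=0
  if grep -q 'Could not find or load main class' "$log"; then
    echo "Wrong download: use the -dist.zip release, not the source code"
    found=1
  fi
  if grep -q 'version due to a system or network error' "$log"; then
    echo "Cluster unreachable: recheck the curl request and its parameters"
    found=1
  fi
  if grep -q 'is unauthorized for user' "$log"; then
    echo "Insufficient privileges: use another user or grant superuser"
    found=1
  fi
  return "$found"
}
----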
