(Doc+) Capture Elasticsearch diagnostic

stefnestor · May 3, 2024 · 3f83dd0 · 3f83dd0
1 parent a561958
commit 3f83dd0
Showing 1 changed file with 115 additions and 0 deletions.
diff --git a/docs/reference/troubleshooting/diagnostic.asciidoc b/docs/reference/troubleshooting/diagnostic.asciidoc
@@ -0,0 +1,115 @@
+[[diagnostic]]
+=== Diagnostic
+++++
+<titleabbrev>Capturing Diagnostic</titleabbrev>
+++++
+:keywords: Elasticsearch diagnostic, diagnostics
+
+An https://github.com/elastic/support-diagnostics[{es} diagnostic] allows 
+you to capture a point-in-time snapshot of cluster statistics and most settings. 
+It works against all {es} versions and requires JRE/JDK 1.8 or later. It is 
+useful when escalating to https://support.elastic.co[Elastic Support] or 
+https://discuss.elastic.co[Elastic Discuss] to minimize turnaround time. 
+Its point-in-time view is also useful when troubleshooting; see 
+https://www.elastic.co/blog/why-does-elastic-support-keep-asking-for-diagnostic-files[this 
+blog post for examples].
+
+[TIP]
+====
+The {es} diagnostic is included as a sub-library within Elastic's platforms: 
+
+* {ece} which you can pull under {ece} > Deployment > Operations > 
+Prepare Bundle > {es}. 
+* {eck}'s https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-take-eck-dump.html[diagnostic] 
+pulls this by default. 
+====
+
+[discrete]
+[[diagnostic-capture]]
+==== Capture
+
+To capture an {es} diagnostic: 
+
+. Download the latest `diagnostics-X.X.X-dist.zip` file (_not_ the "source 
+code" archive) from https://github.com/elastic/support-diagnostics/releases/latest[the 
+latest release] and unzip it. The steps below refer to the executable as 
+`./diagnostics.sh`, which is for Unix-based systems; on Windows, use 
+`.\diagnostics.bat` instead. 
+
+. There are https://github.com/elastic/support-diagnostics#diagnostic-types[three 
+available `type` values] for capturing your {es} diagnostic: 
+
+** `local` (default, **recommended**): polls the <<rest-apis,{es} API>>, 
+gathers Operating System info, and captures cluster and GC logs. 
+Alternatively, you can use `remote` which will establish an ssh session 
+to the applicable target server to pull the same info.
+
+** `api` polls the <<rest-apis,{es} API>> but all other data must be 
+collected manually.
+
+. Verify that network access and user permissions are sufficient to connect 
+to your {es} cluster by checking its <<cluster-health,Cluster Health>>. For 
+example, for `host:localhost`, `port:9200`, and `username:elastic`, the curl 
+request would be: 
++ 
+[source,sh]
+----
+curl -X GET -k -u elastic https://localhost:9200/_cluster/health
+----
+
+. Expect an HTTP 200 `OK` response that reports the cluster's 
+`status`. If you can't successfully curl your {es} host, 
+pause and review the resulting error, as the diagnostic might 
+not produce the expected results. Common errors and their next steps:
+
+** HTTP 401 `UNAUTHORIZED`: authentication failed. The error will usually 
+tell you either that your `username:password` pair is invalid or that your 
+`.security` index is unavailable, in which case you'll need to set up a 
+temporary <<file-realm,file-based realm>> user with `role:superuser` to 
+authenticate.
+
+** HTTP 403 `FORBIDDEN`: your attempted `username` is recognized but 
+has insufficient permissions to run the diagnostic. Either use a different 
+username or elevate this user's privileges.
+
+** HTTP 429 `TOO_MANY_REQUESTS` (for example `circuit_breaking_exception`): 
+your username was authenticated and authorized, but the cluster is under 
+sufficiently high strain that it's not responding to API calls. These 
+responses are usually intermittent, so you can typically 
+proceed with running the diagnostic (which will pull what it can). 
+
+** HTTP 504 `GATEWAY_TIMEOUT`: your network is experiencing issues reaching 
+the cluster (for example because of a proxy or firewall). You might 
+change where you attempt from, confirm your port, or try targeting 
+the host's IP address instead of its domain name. 
+
+** HTTP 503 `SERVICE_UNAVAILABLE` (for example `master_not_discovered_exception`): 
+your cluster does not currently have an elected master node (which is 
+required for it to be API-responsive). This may be temporary while the 
+master node rotates. Otherwise, do not run Step#5; instead, investigate 
+and resolve <<cluster-fault-detection,cluster fault detection>> issues 
+before proceeding. 
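The status-code list above could be sketched as a small shell helper. This is only an illustration, not part of the diagnostic tool: the function name `next_step_for_status` and its messages are hypothetical, and the curl line in the trailing comment assumes the sample host, port, and user from Step#3.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the diagnostic): map the HTTP status
# returned by the Step#3 health check to the next step described above.
next_step_for_status() {
  case "$1" in
    200) echo "proceed: run the diagnostic" ;;
    401) echo "stop: fix credentials or add a temporary file-realm superuser" ;;
    403) echo "stop: use another user or elevate privileges" ;;
    429) echo "proceed: cluster is strained, the diagnostic pulls what it can" ;;
    503) echo "retry: if it persists, resolve master election first" ;;
    504) echo "stop: check proxy, firewall, port, or target the host IP" ;;
    *)   echo "review: unexpected status $1" ;;
  esac
}

# Example wiring (assumed sample values; curl prompts for the password):
#   code=$(curl -s -o /dev/null -w '%{http_code}' -k -u elastic \
#     https://localhost:9200/_cluster/health)
#   next_step_for_status "$code"
```

Keeping the mapping in a function separate from the curl call lets you check the decision logic without a reachable cluster.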
+
+. Once you have a working curl request, use those same parameters to fill in 
+the https://github.com/elastic/support-diagnostics#standard-options[diagnostic 
+parameters]. Continuing our example, the most common invocation is:
++ 
+[source,sh]
+----
+sudo ./diagnostics.sh --type local --host localhost --port 9200 -u elastic -p --bypassDiagVerify --ssl --noVerify
+----
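One way to keep the health check and the diagnostic run on the same parameters is to derive both from shared variables. The sketch below is a dry run that only prints the two commands. The variable names `ES_HOST`, `ES_PORT`, and `ES_USER` and both function names are hypothetical; the flags are the ones used in the example above.

```shell
#!/bin/sh
# Sample values from the example above; adjust to your cluster.
ES_HOST="localhost"
ES_PORT="9200"
ES_USER="elastic"

# Print the health check for the shared parameters.
# curl prompts for the password when -u is given a user name only.
health_command() {
  printf 'curl -X GET -k -u %s https://%s:%s/_cluster/health\n' \
    "$ES_USER" "$ES_HOST" "$ES_PORT"
}

# Print the diagnostic invocation for the same parameters (dry run only).
diag_command() {
  printf 'sudo ./diagnostics.sh --type local --host %s --port %s -u %s -p --bypassDiagVerify --ssl --noVerify\n' \
    "$ES_HOST" "$ES_PORT" "$ES_USER"
}

health_command
diag_command
```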
+
+. Once this script has completed, verify that no errors were emitted in 
+`diagnostic.log`. Common errors to resolve: 
+
+** `Error: Could not find or load main class com.elastic.support.diagnostics.DiagnosticApp` 
+indicates that you accidentally downloaded the "source code" file 
+instead of the diagnostic in Step#1 above.
+
+** `Could not retrieve the {es} version due to a system or network error - unable to continue.` 
+indicates that the diagnostic could not connect to the cluster. Either 
+Step#3 failed or there's a parameter mismatch between 
+Step#3 and Step#5 above. 
+
+** `security_exception` with `is unauthorized for user` suggests 
+insufficient admin permissions to run the diagnostic tool. Use another 
+user or grant the current user `role:superuser` privileges 
+to run the diagnostic.
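The log check above could be scripted roughly as follows. This is a sketch under assumptions: `scan_diag_log` is a hypothetical helper, you pass it the path to your own `diagnostic.log`, and it matches only fragments of the three messages listed above.

```shell
#!/bin/sh
# Hypothetical helper: report which of the common errors listed above
# appear in a given log file.
# Usage: scan_diag_log path/to/diagnostic.log
scan_diag_log() {
  log="$1"
  found=0
  if grep -q 'Could not find or load main class' "$log"; then
    echo "source code archive downloaded: revisit Step#1"
    found=1
  fi
  if grep -q 'version due to a system or network error' "$log"; then
    echo "connection problem: compare Step#3 and Step#5 parameters"
    found=1
  fi
  if grep -q 'is unauthorized for user' "$log"; then
    echo "insufficient privileges: use a superuser"
    found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "no common errors found"
  fi
  return 0
}
```

A clean result only means none of these three messages matched; it is still worth reading the log for other errors.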