dcos-diagnostics is a monitoring agent which exposes a HTTP API for querying from the /system/health/v1
DC/OS API.
dcos-diagnostics puller collects the data from agents and represents individual node health for things like
system resources as well as DC/OS-specific services.
dcos-diagnostics generate historical mesos-states bundles. For more context see: dcos/dcos#5907.
Originally dcos-diagnostics was designed in Master/Agent model. It's running on every DC/OS node.
- Master
Master runs on DC/OS Masters. There is the point of entry to dcos-diagnostics from remote systems (e.g., UI). Master is able to query other nodes for health status. Master is responsible for generating cluster diagnostics bundle.
- Public Agent and Agent
Agent runs on every non Master node (excluding bootstrap node). The main responsibility of Agent is providing JSON report of DC/OS Systemd components health. Agent also provides logs that should appear in cluster bundle.
Diagnostics bundle is just a ZIP file with all files useful when debugging problems. It can be treated as flight recorder (blackbox) but for clusters. List of interesting files, commands and endpoints, that should be fetch in bundle is configurable and deployed with dcos-diagnostics binary. Diagnostic bundle generation process fetches all configured files and stores them in single ZIP. ZIP contains directories named after nodes' IP and role (see: api/rest/coordinator.go).
The contents of the generated bundle are not stable over time and any internal or third party bundle analysis tooling should be programmed very defensively in this regard. See: dcos-docs-site#2253
API documentation could be find in docs directory. It's using OpenAPI v3.0 You can see rendered version here. There are two versions of bundle API.
- Old serial API – single master calls every node for data. This API is deprecated and should be removed in DC/OS 2.2
- New parallel API – single master schedules local bundle creation for every node in a cluster. Then master wait until nodes finish bundles. Master downloads finished bundles and merges them into a single cluster bundle zip.
Old API is faster for smaller clusters but it's slow for large clusters, so we recommend to only use the new API that's available since DC/OS 2.0.
To get more information read the design doc
In the past dcos-diagnostics was bundled with:
– see: #35
In that time dcos-diagnostics was called 3dt
(DC/OS Distributed Diagnostics Tool).
It was deprecated in Jun, 2017 but some references might still exist.
go get github.com/dcos/dcos-diagnostics
cd $GOPATH/src/github.com/dcos/dcos-diagnostics
make
build/dcos-diagnostics --version
Run dcos-diagnostics once, on a DC/OS host to check systemd units:
dcos-diagnostics --diag
Get verbose log output:
dcos-diagnostics --diag --verbose
Run the dcos-diagnostics aggregation service to query all cluster hosts for health state:
dcos-diagnostics daemon --pull
Start the dcos-diagnostics health API endpoint:
dcos-diagnostics daemon
Flag | Type | Description |
---|---|---|
agent-port | int | Use TCP port to connect to agents. (default 1050) |
ca-cert | string | Use certificate authority. |
command-exec-timeout | int | Set command executing timeout (default 50) |
debug | bool | Enable pprof debugging endpoints. |
diagnostics-bundle-dir | string | Set a path to store diagnostic bundles (default "/var/run/dcos/dcos-diagnostics/diagnostic_bundles") |
diagnostics-job-timeout | int | Set a global diagnostics job timeout (default 720) |
diagnostics-units-since | string | Collect systemd units logs since (default "24h") |
diagnostics-url-timeout | int | Set a local timeout for every single GET request to a log endpoint (default 1) |
endpoint-config | strings | Use endpoints_config.json (default [/opt/mesosphere/etc/endpoints_config.json]) |
exhibitor-url | string | Use Exhibitor URL to discover master nodes. (default "http://127.0.0.1:8181/exhibitor/v1/cluster/status") |
fetchers-count | int | Set a number of concurrent fetchers gathering nodes logs (default 1) |
force-tls | bool | Use HTTPS to do all requests. |
health-update-interval | int | Set update health interval in seconds. (default 60) |
hostname | string | A host name (by default it uses system hostname) (default "orion") |
iam-config | string | A path to identity and access management config |
ip-discovery-command-location | string | A command used to get local IP address |
master-port | int | Use TCP port to connect to masters. (default 1050) |
no-unix-socket | bool | Disable use unix socket provided by systemd activation. |
port | int | Web server TCP port. (default 1050) |
pull | bool | Try to pull runner from DC/OS hosts. |
pull-interval | int | Set pull interval in seconds. (default 60) |
pull-timeout | int | Set pull timeout. (default 3) |
make test
Starting with DC/OS 2.0 we deprecated "old" bundle API and proposed new parallel API. The deprecation process should be finished with DC/OS 2.3 and all code responsible for old API can be deleted. In order to do this we need to change all scripts in other DC/OS components to use new DC/OS Diagnostics CLI.
New Diagnostics Bundle API gives us opportunity to create diagnostics bundle on a single node even if DC/OS Cluster is down. Next step should be making dcos-diagnostics independent from DC/OS. Currently, Cluster bundle will not be generated if Mesos, Admin Router or DNS is down. To do it we should move from single service to binary deployed on cluster. This idea is described in design doc
We keep user stories in this doc Tasks are gathered under DCOS-57837.