A cross-platform tool to monitor and remediate unhealthy Docker containers
Written in Rust and designed to be OS agnostic, flexible, and performant in large environments via concurrency and multi-threading
The docker-autoheal binary may be executed natively on the host OS or from a Docker container
- See https://docs.docker.com/engine/reference/builder/#healthcheck for details
Variable | Default | Description |
---|---|---|
AUTOHEAL_CONNECTION_TYPE | local | Determines how docker-autoheal connects to Docker (one of: local, socket, http, ssl) |
AUTOHEAL_STOP_TIMEOUT | 10 | Docker waits n seconds for a container to stop before killing it during restarts (override via label; see below) |
AUTOHEAL_INTERVAL | 5 | Check container health every n seconds |
AUTOHEAL_START_DELAY | 0 | Wait n seconds before first health check |
AUTOHEAL_POST_ACTION | | The absolute path of an executable to be run after restart attempts; container name, id, and stop-timeout are passed as arguments in that order |
AUTOHEAL_MONITOR_ALL | FALSE | Set to TRUE to simply monitor all containers on the host or leave as FALSE and control via autoheal.monitor.enable |
AUTOHEAL_LOG_ALL | FALSE | Allow (TRUE/FALSE) logging (and webhook/apprise if set) for containers with autoheal.restart.enable=FALSE |
AUTOHEAL_LOG_PERSIST | FALSE | Allow (TRUE/FALSE) external persistent logging and reporting of historical data |
AUTOHEAL_TCP_HOST | localhost | Address of Docker host |
AUTOHEAL_TCP_PORT | 2375 (ssl: 2376) | Port on which to connect to the Docker host |
AUTOHEAL_TCP_TIMEOUT | 10 | Time in n seconds before failing connection attempt |
AUTOHEAL_PEM_PATH | /opt/docker-autoheal/tls | Absolute path to requisite ssl certificate files (key.pem, cert.pem, ca.pem) when AUTOHEAL_CONNECTION_TYPE=ssl |
AUTOHEAL_APPRISE_URL | | URL to post messages to apprise following actions on an unhealthy container |
AUTOHEAL_WEBHOOK_KEY | | KEY to post messages to the webhook following actions on an unhealthy container |
AUTOHEAL_WEBHOOK_URL | | URL to post messages to the webhook following actions on an unhealthy container |
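
A post-action executable can be any program that accepts the three positional arguments listed for AUTOHEAL_POST_ACTION. The following is a minimal sketch and not part of docker-autoheal: the function name and the message format are hypothetical, and the script only prints what it received.

```shell
#!/bin/sh
# Hypothetical post-action hook for AUTOHEAL_POST_ACTION / --post-action.
# docker-autoheal passes, in order: container name, container id, stop-timeout.

# Build a one-line summary of the restart attempt (format is illustrative).
format_post_action() {
    name="$1"
    id="$2"
    stop_timeout="$3"
    printf 'restart attempted: name=%s id=%s stop_timeout=%ss\n' \
        "$name" "$id" "$stop_timeout"
}

# Only act when invoked with the three expected arguments.
if [ "$#" -eq 3 ]; then
    format_post_action "$@"
fi
```

Make the file executable and point AUTOHEAL_POST_ACTION at its absolute path; the output could be redirected to a file or handed to another tool as needed.
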
Label | Default | Description |
---|---|---|
autoheal.stop.timeout | | Per container override (in seconds) of AUTOHEAL_STOP_TIMEOUT during restart (e.g. for a container that routinely takes longer to exit cleanly) |
autoheal.monitor.enable | FALSE | Per container override (true/false) to control whether the container is monitored (e.g. if you have a large number of containers that you wish to monitor and restart, set AUTOHEAL_MONITOR_ALL to TRUE and apply this label as FALSE to the few that you do not wish to monitor) |
autoheal.restart.enable | TRUE | Per container override (true/false) to control whether the container is restarted when unhealthy (e.g. if you have a large number of containers that you wish to monitor and restart, set AUTOHEAL_MONITOR_ALL to TRUE and apply this label as FALSE to the few that you do not wish to restart) |
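
As an illustration of how the labels above might be applied, the sketch below starts a container with a health check and two autoheal labels. The image name, the health command, and port 8080 are assumptions for the example only, and the helper function name is hypothetical; the --health-* and --label flags are standard docker run options.

```shell
# Set DOCKER=echo to preview the command instead of running it.
DOCKER="${DOCKER:-docker}"

# start_monitored NAME IMAGE
# Starts IMAGE with a health check, opts it in to monitoring, and gives it
# a 30 second stop timeout in place of AUTOHEAL_STOP_TIMEOUT.
start_monitored() {
    "$DOCKER" run -d \
        --name "$1" \
        --health-cmd "wget -q --spider http://localhost:8080/ || exit 1" \
        --health-interval 30s \
        --health-retries 3 \
        --label autoheal.monitor.enable=true \
        --label autoheal.stop.timeout=30 \
        "$2"
}
```

Calling start_monitored web my-app:latest would launch a hypothetical image my-app:latest under the name web; the health command assumes that image ships wget and serves HTTP on port 8080.
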
Used when executed in native OS (NOTE: The environment variables are also accepted)
Options:
-a, --apprise-url <APPRISE_URL>
The apprise url
-c, --connection-type <CONNECTION_TYPE>
One of local, socket, http, or ssl
-d, --start-delay <START_DELAY>
Time in seconds to wait for first check
-h, --help Print help
-i, --interval <INTERVAL>
Time in seconds to check health
-j, --webhook-key <WEBHOOK_KEY>
The webhook json key string
-k, --key-path <KEY_PATH>
The absolute path to requisite ssl PEM files
-l, --log-all Enable logging of unhealthy containers where restart
is disabled (WARNING, this could be chatty)
-m, --monitor-all Enable monitoring of all containers that have a
healthcheck
-n, --tcp-host <TCP_HOST>
The hostname or IP address of the Docker host (when -c
http or ssl)
-p, --tcp-port <TCP_PORT>
The tcp port number of the Docker host (when -c http
or ssl)
-s, --stop-timeout <STOP_TIMEOUT>
Time in seconds to wait for action to complete
-t, --tcp-timeout <TCP_TIMEOUT>
Time in seconds to wait for connection to complete
-w, --webhook-url <WEBHOOK_URL>
The webhook url
-L, --log-persist Enable external persistent logging and reporting of historical
data
-P, --post-action <SCRIPT_PATH>
The absolute path to a script that should be executed
after container restart
-V, --version Print version information
/usr/local/bin/docker-autoheal --monitor-all --log-persist > /var/log/docker-autoheal.log &
Will connect to the local Docker host, monitor all containers, and generate a persistent log at /opt/docker-autoheal/log.json
docker run -d --read-only \
--user=[uid]:[gid] \
--name docker-autoheal \
--network=none \
--restart=always \
--env="AUTOHEAL_CONNECTION_TYPE=socket" \
--env="AUTOHEAL_MONITOR_ALL=true" \
--env="AUTOHEAL_LOG_PERSIST=true" \
--volume=/var/run/docker.sock:/var/run/docker.sock:ro \
--volume=/opt/docker-autoheal/log.json:/opt/docker-autoheal/log.json:rw \
tmknight88/docker-autoheal:latest
Will connect to the Docker host via unix socket location /var/run/docker.sock or Windows named pipe location //./pipe/docker_engine, monitor all containers, and write persistent log data to /opt/docker-autoheal/log.json as the user with the specified uid:gid
docker run -d --read-only \
--user=[uid]:[gid] \
--name docker-autoheal \
--restart=always \
--env="AUTOHEAL_CONNECTION_TYPE=http" \
--env="AUTOHEAL_TCP_HOST=MYHOST" \
--env="AUTOHEAL_TCP_PORT=2375" \
--env="AUTOHEAL_LOG_PERSIST=true" \
--volume=/opt/docker-autoheal/log.json:/opt/docker-autoheal/log.json:rw \
tmknight88/docker-autoheal:latest
Will connect to the Docker host via hostname or IP and the specified port, monitor only containers with the label autoheal.monitor.enable=true, and write persistent log data to /opt/docker-autoheal/log.json as the user with the specified uid:gid
2024-01-23 03:03:23-0500 [WARNING] [nordvpn] Container (886d37fd9f5c) is unhealthy with 3 failures
2024-01-23 03:03:23-0500 [WARNING] [nordvpn] Container (886d37fd9f5c) last output: [4] Status: Unstable
2024-01-23 03:03:23-0500 [WARNING] [nordvpn] Restarting container (886d37fd9f5c) with 10s timeout
2024-01-23 03:03:34-0500 [ INFO] [nordvpn] Restart of container (886d37fd9f5c) was successful
2024-01-23 03:03:34-0500 [ INFO] [nordvpn] Container (886d37fd9f5c) has been unhealthy 1 time
2024-01-23 03:04:48-0500 [WARNING] [privoxy] Container (74f74eb7b2d0) is unhealthy with 3 failures
2024-01-23 03:04:48-0500 [WARNING] [privoxy] Container (74f74eb7b2d0) last output: [-1] Health check exceeded timeout (3s)
2024-01-23 03:04:48-0500 [WARNING] [privoxy] Restarting container (74f74eb7b2d0) with 10s timeout
2024-01-23 03:04:59-0500 [ INFO] [privoxy] Restart of container (74f74eb7b2d0) was successful
2024-01-23 03:04:59-0500 [ INFO] [privoxy] Container (74f74eb7b2d0) has been unhealthy 1 time
Example output when docker-autoheal is in action
Examples of working with log.json:
jq -s 'group_by(.name) | map({name: .[0].name, data: (group_by(.id) | map({id: .[0].id, data: .}))})' /opt/docker-autoheal/log.json
Group all entries by name and then group by container id
jq -s 'map(select(.name=="privoxy"))' /opt/docker-autoheal/log.json
Find all occurrences of 'privoxy'
jq -s 'map(select(.name=="privoxy")) | group_by(.name) | map({name: .[0].name, data: (group_by(.id) | map({id: .[0].id, data: .}))})' /opt/docker-autoheal/log.json
Find all occurrences of 'privoxy' and group by container id
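
One further query in the same style, offered as a sketch: counting how many entries each container name has in the log. It assumes only that every entry carries a name field, as the filters above already do; the function name is hypothetical.

```shell
# Count log entries per container name.
# Assumes each entry in log.json has a "name" field, as the filters above do.
count_by_name() {
    jq -c -s 'group_by(.name) | map({name: .[0].name, count: length})' \
        "${1:-/opt/docker-autoheal/log.json}"
}
```

Run count_by_name with no argument to read the default log location, or pass another path as the first argument.
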
a) Apply the label autoheal.monitor.enable=true to your container to have it watched
OR
b) Set ENV AUTOHEAL_MONITOR_ALL=true (or apply --monitor-all to the binary) to watch all running containers
See https://docs.docker.com/engine/security/https/ for how to configure TCP with mTLS
The certificates and keys need these names:
- ca.pem
- cert.pem
- key.pem
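
Before switching AUTOHEAL_CONNECTION_TYPE to ssl, it may help to confirm that all three files are present and readable. The helper below is a hypothetical pre-flight check, not something docker-autoheal provides; it defaults to the AUTOHEAL_PEM_PATH default of /opt/docker-autoheal/tls.

```shell
# check_pem_dir [DIR]
# Prints one line per missing or unreadable file and returns non-zero
# if any of ca.pem, cert.pem, or key.pem is absent.
check_pem_dir() {
    dir="${1:-/opt/docker-autoheal/tls}"
    missing=0
    for f in ca.pem cert.pem key.pem; do
        if [ ! -r "$dir/$f" ]; then
            echo "missing: $dir/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_pem_dir /opt/docker-autoheal/tls && echo "certificates ready" prints the confirmation only when all three files are found.
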
Additional security can be obtained by:
- Using a unique user for monitoring and remediating
  - Create a new user
  - Add that user to the docker group
  - Execute the binary or docker container with that uid:gid
- Running docker in rootless mode
If you need the docker-autoheal container timezone to match the local machine, you can map /etc/localtime
docker run ... -v /etc/localtime:/etc/localtime:ro
- The payload includes the following, separated by a pipe (|): Docker system hostname, the last health output, and the result of the restart action
- Excluding a container from restarts and enabling logging for excluded containers will generate numerous log messages whenever that container becomes unhealthy
- Additionally, if a webhook or apprise is also configured, they will be executed at each monitoring interval for those containers
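
A receiver of the pipe-separated payload described above could split it into its three parts as sketched below. The sample value and the function name are hypothetical, and the sketch assumes exactly three fields with no padding around the separators; anything after the second pipe is kept in the last field.

```shell
# parse_payload STRING
# Splits "hostname|last health output|restart result" into labelled lines.
parse_payload() {
    # read assigns the remainder of the line to the final variable,
    # so extra pipes end up in "result" rather than being dropped.
    IFS='|' read -r host output result <<EOF
$1
EOF
    printf 'host=%s\noutput=%s\nresult=%s\n' "$host" "$output" "$result"
}
```

If the health output itself can contain a pipe character, this simple split would misplace the fields, so a stricter parser would be needed in that case.
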