FIXME: data/bootstrap/files/usr/local/bin/installer-gather: Look for unit restarts

From [1]:

> Note that service restart is subject to unit start rate limiting
> configured with StartLimitIntervalSec= and StartLimitBurst=, see
> systemd.unit(5) for details. A restarted service enters the failed
> state only after the start limits are reached.

And [2]:

> Configure unit start rate limiting. Units which are started more
> than burst times within an interval time interval are not permitted
> to start any more.

We don't set those StartLimit* properties on our units, so they are
endlessly restarted without ever entering the 'failed' state and being
collected by failed-units.txt [3]:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224/artifacts/e2e-aws/ipi-install-install/log-bundle-20201006155840.tar >log-bundle.tar.gz
  $ tar xOz log-bundle-20201006155840/bootstrap/journals/bootkube.log <log-bundle.tar.gz | tail
  Oct 06 15:58:33 ip-10-0-1-187 bootkube.sh[15702]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
  Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
  Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Started Bootstrap a Kubernetes cluster.
  Oct 06 15:58:38 ip-10-0-1-187 bootkube.sh[15762]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
  Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
  $ tar xOz log-bundle-20201006155840/failed-units.txt <log-bundle.tar.gz
  0 loaded units listed. Pass --all to see loaded but inactive units, too.
  To show all installed unit files use 'systemctl list-unit-files'.
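
Why the loop above never trips the start limit can be sketched with rough arithmetic. This is only a model, not systemd's implementation: it assumes the service fails immediately on every start, so starts are spaced RestartSec apart, and it assumes systemd's default limits (StartLimitIntervalSec=10s, StartLimitBurst=5) apply when a unit does not set them. The function name `would_hit_start_limit` is hypothetical.

```shell
# Rough model (an assumption, not systemd's code): a service that fails
# immediately is started once every restart_sec seconds.
# Usage: would_hit_start_limit <restart_sec> <interval_sec> <burst>
# Returns 0 if the start limit would be reached, non-zero otherwise.
would_hit_start_limit() {
  local restart_sec="$1"
  local interval_sec="$2"
  local burst="$3"
  local starts_per_window

  # With no delay between restarts, any finite burst is exhausted at once.
  [ "${restart_sec}" -gt 0 ] || return 0

  # Upper bound on the number of starts that fit inside one interval window.
  starts_per_window=$(( interval_sec / restart_sec + 1 ))

  # The limit trips only when more than <burst> starts land in one window.
  [ "${starts_per_window}" -gt "${burst}" ]
}
```

With the 5s RestartSec seen in the journal above and the assumed defaults, `would_hit_start_limit 5 10 5` returns non-zero: at most three starts fit in a 10s window, which is consistent with the unit never reaching the 'failed' state.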

With this commit, we look for log entries with automatic-restart
events [4], and use those to identify units which may be having
trouble.
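
As a complement to matching on the message ID, the restart counter in the plain-text journal can also be summarized per unit. The sketch below is an illustration only: the helper name `summarize_restarts` is hypothetical, and it assumes the message wording shown in the bootkube.log excerpt above ("Scheduled restart job, restart counter is at N."), which may differ between systemd versions.

```shell
# Hypothetical helper: reads short-format journal lines on stdin and prints
# "<unit> <highest restart counter seen>" for each restarting unit.
summarize_restarts() {
  awk '
    /Scheduled restart job, restart counter is at [0-9]+\.$/ {
      # The unit name is the field just before "Scheduled", with a trailing colon.
      unit = ""
      for (i = 2; i <= NF; i++) {
        if ($i == "Scheduled") { unit = $(i - 1); break }
      }
      sub(/:$/, "", unit)

      # The counter is the last field, with a trailing period.
      count = $NF
      sub(/\.$/, "", count)

      if (unit != "" && count + 0 >= max[unit] + 0) max[unit] = count + 0
    }
    END { for (u in max) print u, max[u] }
  ' | sort
}
```

It could be fed from something like `LANG=POSIX journalctl --no-pager _PID=1 | summarize_restarts`; a unit with a large counter and no entry in failed-units.txt is likely stuck in the kind of loop described above.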

[1]: https://www.freedesktop.org/software/systemd/man/systemd.service.html
[2]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224
[4]: https://github.com/systemd/systemd/blob/4b28e50f9ef7655542a5ce5bc05857508ddf1495/catalog/systemd.catalog.in#L341-L342
wking committed Nov 4, 2020
1 parent 0c03f4d commit 61bf20c
Showing 1 changed file with 1 addition and 0 deletions.
@@ -11,6 +11,7 @@ mkdir -p "${ARTIFACTS}"

echo "Gathering bootstrap systemd summary ..."
LANG=POSIX systemctl list-units --state=failed >& "${ARTIFACTS}/failed-units.txt"
LANG=POSIX journalctl -o json-pretty MESSAGE_ID=5eb03494b6584870a536b337290809b3 > "${ARTIFACTS}/fixme-restarts.json"

cgwalters Nov 4, 2020

Member

Looks like a useful start, though with just slightly more effort I think we can extract the failing unit name and get its logs as text.

cgwalters Nov 4, 2020

Member

Also specifying _PID=1 is a really important general best practice, because any userspace process can provide an arbitrary message ID.
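
To illustrate the point, here is a small sketch that filters `journalctl -o export` output on the client side, keeping only automatic-restart entries that were logged by PID 1. In practice the `_PID=1` match passed to journalctl does this already; the helper name `restart_units_from_pid1` is hypothetical, and the sketch assumes plain-text fields separated by blank lines, as the export format uses for ordinary entries.

```shell
# Hypothetical helper: reads `journalctl -o export` output on stdin and prints
# the UNIT of each automatic-restart entry that was logged by PID 1.
restart_units_from_pid1() {
  awk -v id="5eb03494b6584870a536b337290809b3" '
    function emit() {
      # Only trust entries that carry the message ID *and* come from PID 1.
      if (pid == "1" && msg == id && unit != "") print unit
      pid = ""; msg = ""; unit = ""
    }
    /^$/           { emit(); next }            # a blank line ends an entry
    /^_PID=/       { pid  = substr($0, 6) }
    /^MESSAGE_ID=/ { msg  = substr($0, 12) }
    /^UNIT=/       { unit = substr($0, 6) }
    END            { emit() }                  # last entry may lack a blank line
  ' | sort -u
}
```

An entry with the same MESSAGE_ID from any other PID is dropped, so a process that logs a forged message ID cannot add units to the list.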

cgwalters Nov 4, 2020

Member

Here's a crude version:

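# Units that PID 1 has automatically restarted, deduplicated.
failing_units=$(journalctl -o json-pretty _PID=1 MESSAGE_ID=5eb03494b6584870a536b337290809b3 | jq -r '.UNIT // empty' | sort -u)
for unit in ${failing_units}; do
  # Show the last 50 journal lines for each restarting unit.
  journalctl --no-pager --lines=50 -u "${unit}"
done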
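
Building on the idea of getting each unit's logs as text, one possible shape for the gathering step is sketched below. It is not part of the commit: the helper name `gather_unit_logs` and the `unit-restarts` directory used in the usage note are hypothetical, and the 50-line limit is simply carried over from the crude version.

```shell
# Hypothetical helper: reads unit names on stdin (one per line) and writes the
# last 50 journal lines of each unit to "<output-dir>/<unit>.log".
gather_unit_logs() {
  local outdir="${1:?usage: gather_unit_logs <output-dir>}"
  local unit

  mkdir -p "${outdir}"
  while IFS= read -r unit; do
    # Skip blank lines so we never create a file named ".log".
    [ -n "${unit}" ] || continue
    # Redirect stdin so journalctl cannot consume the remaining unit names.
    LANG=POSIX journalctl --no-pager --lines=50 -u "${unit}" \
      > "${outdir}/${unit}.log" 2>&1 < /dev/null
  done
}
```

It could be fed by the pipeline from the comment above, for example `journalctl -o json-pretty _PID=1 MESSAGE_ID=5eb03494b6584870a536b337290809b3 | jq -r '.UNIT // empty' | sort -u | gather_unit_logs "${ARTIFACTS}/unit-restarts"`. Reading names line by line avoids word-splitting surprises and keeps one file per unit in the log bundle.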

echo "Gathering bootstrap failed systemd unit status ..."
mkdir -p "${ARTIFACTS}/unit-status"