
cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567

Closed
wants to merge 2 commits

Commits on Nov 6, 2020

  1. data/bootstrap/files/usr/local/bin/installer-gather: Look for unit restarts
    
    From [1]:
    
    > Note that service restart is subject to unit start rate limiting
    > configured with StartLimitIntervalSec= and StartLimitBurst=, see
    > systemd.unit(5) for details. A restarted service enters the failed
    > state only after the start limits are reached.
    
    And [2]:
    
    > Configure unit start rate limiting. Units which are started more
    > than burst times within an interval time interval are not permitted
    > to start any more
    
    We don't set those StartLimit* properties on our units, so they are
    endlessly restarted without ever entering the 'failed' state and being
    collected by failed-units.txt [3]:
    
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224/artifacts/e2e-aws/ipi-install-install/log-bundle-20201006155840.tar >log-bundle.tar.gz
      $ tar xOz log-bundle-20201006155840/bootstrap/journals/bootkube.log <log-bundle.tar.gz | tail
      Oct 06 15:58:33 ip-10-0-1-187 bootkube.sh[15702]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
      Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
      Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Started Bootstrap a Kubernetes cluster.
      Oct 06 15:58:38 ip-10-0-1-187 bootkube.sh[15762]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
      $ tar xOz log-bundle-20201006155840/failed-units.txt <log-bundle.tar.gz
      0 loaded units listed. Pass --all to see loaded but inactive units, too.
      To show all installed unit files use 'systemctl list-unit-files'.
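
    For context, this is what the unset properties would look like if we did
    set them; the drop-in below is a hypothetical sketch, not part of this
    change, and DROPIN_DIR stands in for a real path like
    /etc/systemd/system/bootkube.service.d on a host:

```shell
# Hypothetical drop-in: with StartLimit* set, a crash-looping unit
# eventually enters the 'failed' state (and would then be picked up by
# failed-units.txt) instead of restarting forever.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
cat > "${DROPIN_DIR}/10-start-limit.conf" <<'EOF'
[Unit]
# At most 3 start attempts in any 30-second window; the next attempt
# fails the unit instead of scheduling another restart.
StartLimitIntervalSec=30
StartLimitBurst=3
EOF
```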
    
    With this commit, we look for log entries with automatic-restart
    events [4], and use those to identify units which may be having
    trouble, even if they aren't formally 'failed'.
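
    The detection idea can be sketched with plain text tools: count
    "Scheduled restart job" messages per unit in a journal dump.  The
    sample journal.log below is a stand-in excerpt modeled on the
    bootkube.log lines above; the actual gather script may differ.

```shell
# Stand-in excerpt from bootstrap/journals/bootkube.log in a log bundle:
cat > journal.log <<'EOF'
Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273.
Oct 06 15:58:43 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 274.
EOF
# Units systemd has automatically restarted, with a count of restarts
# seen in this excerpt:
grep -o '[[:alnum:]@._-]*\.service: Scheduled restart job' journal.log \
  | sed 's/: Scheduled restart job//' | sort | uniq -c
```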
    
    Recording json-pretty output in ${UNIT}.log.json will be a bit chatty,
    but JSON will allow us to do consumer-side processing to pull out the
    restart count (N_RESTARTS), logs from the final failed invocation
    (INVOCATION_ID), and similar useful subsets.  json-pretty takes more
    transport space than '-o json', but it is also easier to process with
    grep and other tools that are not JSON-aware, for folks who are just
    hunting for matching strings in the tarball.
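
    A rough sketch of that consumer-side processing: because json-pretty
    puts one field per line, plain grep can pull out the restart count
    without a JSON parser.  The bootkube.service.log.json content below
    is a hand-made stand-in for a file from the bundle, not real output:

```shell
# Stand-in for a ${UNIT}.log.json entry (json-pretty, one field per line):
cat > bootkube.service.log.json <<'EOF'
{
        "MESSAGE" : "bootkube.service: Scheduled restart job, restart counter is at 273.",
        "N_RESTARTS" : "273",
        "UNIT" : "bootkube.service"
}
EOF
# Highest restart count recorded for the unit:
grep -o '"N_RESTARTS" : "[0-9]*"' bootkube.service.log.json \
  | tr -dc '0-9\n' | sort -n | tail -n1
```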
    
    [1]: https://www.freedesktop.org/software/systemd/man/systemd.service.html
    [2]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html
    [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224
    [4]: https://github.com/systemd/systemd/blob/4b28e50f9ef7655542a5ce5bc05857508ddf1495/catalog/systemd.catalog.in#L341-L342
    wking committed Nov 6, 2020, commit 26cf75e
  2. cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs

    Instead of just dropping bootstrap tarballs into the user's lap
    ("here's a big tarball, have fun"), look through them for obvious
    things that we can summarize.  With:
    
       func runGatherBootstrapCmd(directory string) error {
      +       return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz")
    
    to feed [1] into the analysis logic, the output looks like:
    
      WARNING control-plane/10.0.134.229 had failing systemd units: crio.service
      WARNING control-plane/10.0.134.229: crio.service:
      ● crio.service - Open Container Initiative Daemon
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-default-env.conf
         Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago
           Docs: https://github.com/cri-o/cri-o
        Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 8491 (code=exited, status=1/FAILURE)
            CPU: 144ms
    
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
      Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon.
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time
      WARNING control-plane/10.0.134.243 had failing systemd units: crio.service
      WARNING control-plane/10.0.134.243: crio.service:
      ● crio.service - Open Container Initiative Daemon
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-default-env.conf
         Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago
           Docs: https://github.com/cri-o/cri-o
        Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 8439 (code=exited, status=1/FAILURE)
            CPU: 151ms
    
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon...
      Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'.
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon.
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time
      WARNING control-plane/10.0.157.61 had failing systemd units: crio.service
      WARNING control-plane/10.0.157.61: crio.service:
      ● crio.service - Open Container Initiative Daemon
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-default-env.conf
         Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago
           Docs: https://github.com/cri-o/cri-o
        Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 8379 (code=exited, status=1/FAILURE)
            CPU: 158ms
    
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon...
      Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'.
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon.
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time
    
    That's maybe a bit noisy, but mostly because all three control-plane
    machines failed the same way.
    
    Journal field docs are linked from the Go docstrings, but if these
    evolve later, the ones I'm using are from [2], except for UNIT,
    where I have opened an upstream documentation request [3].
    
    It might be worth exposing this as:
    
      $ openshift-install analyze bootstrap PATH
    
    so folks could look at bootstrap logs which had been gathered by
    third-parties, but I'm punting on that for now.
    
    [1]: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.3/2455/artifacts/e2e-aws/installer/log-bundle-20191024111122.tar
    [2]: https://github.com/systemd/systemd/blob/v246/man/systemd.journal-fields.xml
    [3]: systemd/systemd#17538
    wking committed Nov 6, 2020, commit fe43889