
cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567

Closed
wants to merge 2 commits

Commits on Nov 6, 2020

  1. data/bootstrap/files/usr/local/bin/installer-gather: Look for unit restarts
    
    From [1]:
    
    > Note that service restart is subject to unit start rate limiting
    > configured with StartLimitIntervalSec= and StartLimitBurst=, see
    > systemd.unit(5) for details. A restarted service enters the failed
    > state only after the start limits are reached.
    
    And [2]:
    
    > Configure unit start rate limiting. Units which are started more
    > than burst times within an interval time interval are not permitted
    > to start any more
    
    We don't set those StartLimit* properties on our units, so they are
    endlessly restarted without ever entering the 'failed' state and being
    collected by failed-units.txt [3]:
    
      $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224/artifacts/e2e-aws/ipi-install-install/log-bundle-20201006155840.tar >log-bundle.tar.gz
      $ tar xOz log-bundle-20201006155840/bootstrap/journals/bootkube.log <log-bundle.tar.gz | tail
      Oct 06 15:58:33 ip-10-0-1-187 bootkube.sh[15702]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
      Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
      Oct 06 15:58:33 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Stopped Bootstrap a Kubernetes cluster.
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: Started Bootstrap a Kubernetes cluster.
      Oct 06 15:58:38 ip-10-0-1-187 bootkube.sh[15762]: /usr/local/bin/bootkube.sh: line 6: i-am-a-command-that-does-not-exist: command not found
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Main process exited, code=exited, status=127/n/a
      Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Failed with result 'exit-code'.
      $ tar xOz log-bundle-20201006155840/failed-units.txt <log-bundle.tar.gz
      0 loaded units listed. Pass --all to see loaded but inactive units, too.
      To show all installed unit files use 'systemctl list-unit-files'.
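
    For context, this is what the unset properties would look like if we did
    set them; the drop-in below is a hypothetical sketch, not part of this
    change, and DROPIN_DIR stands in for a real path like
    /etc/systemd/system/bootkube.service.d on a host:

```shell
# Hypothetical drop-in: with StartLimit* set, a crash-looping unit
# eventually enters the 'failed' state (and would then be picked up by
# failed-units.txt) instead of restarting forever.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
cat > "${DROPIN_DIR}/10-start-limit.conf" <<'EOF'
[Unit]
# At most 3 start attempts in any 30-second window; the next attempt
# fails the unit instead of scheduling another restart.
StartLimitIntervalSec=30
StartLimitBurst=3
EOF
```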
    
    With this commit, we look for log entries with automatic-restart
    events [4], and use those to identify units which may be having
    trouble, even if they aren't formally 'failed'.
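
    The detection idea can be sketched with plain text tools: count
    "Scheduled restart job" messages per unit in a journal dump.  The
    sample journal.log below is a stand-in excerpt modeled on the
    bootkube.log lines above; the actual gather script may differ.

```shell
# Stand-in excerpt from bootstrap/journals/bootkube.log in a log bundle:
cat > journal.log <<'EOF'
Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Service RestartSec=5s expired, scheduling restart.
Oct 06 15:58:38 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 273.
Oct 06 15:58:43 ip-10-0-1-187 systemd[1]: bootkube.service: Scheduled restart job, restart counter is at 274.
EOF
# Units systemd has automatically restarted, with a count of restarts
# seen in this excerpt:
grep -o '[[:alnum:]@._-]*\.service: Scheduled restart job' journal.log \
  | sed 's/: Scheduled restart job//' | sort | uniq -c
```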
    
    Recording json-pretty output in ${UNIT}.log.json will be a bit chatty,
    but JSON will allow us to do consumer-side processing to pull out the
    restart count (N_RESTARTS), logs from the final failed invocation
    (INVOCATION_ID), and similar useful subsets.  json-pretty takes more
    transport space than '-o json', but it is also easier to process with
    grep and other tools that are not JSON-aware, for folks who are just
    hunting for matching strings in the tarball.
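
    A rough sketch of that consumer-side processing: because json-pretty
    puts one field per line, plain grep can pull out the restart count
    without a JSON parser.  The bootkube.service.log.json content below
    is a hand-made stand-in for a file from the bundle, not real output:

```shell
# Stand-in for a ${UNIT}.log.json entry (json-pretty, one field per line):
cat > bootkube.service.log.json <<'EOF'
{
        "MESSAGE" : "bootkube.service: Scheduled restart job, restart counter is at 273.",
        "N_RESTARTS" : "273",
        "UNIT" : "bootkube.service"
}
EOF
# Highest restart count recorded for the unit:
grep -o '"N_RESTARTS" : "[0-9]*"' bootkube.service.log.json \
  | tr -dc '0-9\n' | sort -n | tail -n1
```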
    
    [1]: https://www.freedesktop.org/software/systemd/man/systemd.service.html
    [2]: https://www.freedesktop.org/software/systemd/man/systemd.unit.html
    [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/2567/pull-ci-openshift-installer-master-e2e-aws/1313493438984884224
    [4]: https://github.com/systemd/systemd/blob/4b28e50f9ef7655542a5ce5bc05857508ddf1495/catalog/systemd.catalog.in#L341-L342
    wking committed Nov 6, 2020, commit 26cf75e
  2. cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs

    Instead of just dropping bootstrap tarballs into the user's lap
    ("here's a big tarball, have fun"), look through them for obvious
    things that we can summarize.  With:
    
       func runGatherBootstrapCmd(directory string) error {
      +       return analyzeGatheredBootstrap("/tmp/log-bundle.tar.gz")
    
    to feed [1] into the analysis logic, the output looks like:
    
      WARNING control-plane/10.0.134.229 had failing systemd units: crio.service
      WARNING control-plane/10.0.134.229: crio.service:
      ● crio.service - Open Container Initiative Daemon
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-default-env.conf
         Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:31 UTC; 320ms ago
           Docs: https://github.com/cri-o/cri-o
        Process: 8491 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 8491 (code=exited, status=1/FAILURE)
            CPU: 144ms
    
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Starting Open Container Initiative Daemon...
      Oct 24 11:11:31 ip-10-0-134-229 crio[8491]: time="2019-10-24 11:11:31.895986612Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Failed with result 'exit-code'.
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: Failed to start Open Container Initiative Daemon.
      Oct 24 11:11:31 ip-10-0-134-229 systemd[1]: crio.service: Consumed 144ms CPU time
      WARNING control-plane/10.0.134.243 had failing systemd units: crio.service
      WARNING control-plane/10.0.134.243: crio.service:
      ● crio.service - Open Container Initiative Daemon
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-default-env.conf
         Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:35 UTC; 8s ago
           Docs: https://github.com/cri-o/cri-o
        Process: 8439 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 8439 (code=exited, status=1/FAILURE)
            CPU: 151ms
    
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Starting Open Container Initiative Daemon...
      Oct 24 11:11:35 ip-10-0-134-243 crio[8439]: time="2019-10-24 11:11:35.238163016Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Failed with result 'exit-code'.
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: Failed to start Open Container Initiative Daemon.
      Oct 24 11:11:35 ip-10-0-134-243 systemd[1]: crio.service: Consumed 151ms CPU time
      WARNING control-plane/10.0.157.61 had failing systemd units: crio.service
      WARNING control-plane/10.0.157.61: crio.service:
      ● crio.service - Open Container Initiative Daemon
         Loaded: loaded (/usr/lib/systemd/system/crio.service; disabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/crio.service.d
                 └─10-default-env.conf
         Active: failed (Result: exit-code) since Thu 2019-10-24 11:11:36 UTC; 1s ago
           Docs: https://github.com/cri-o/cri-o
        Process: 8379 ExecStart=/usr/bin/crio $CRIO_STORAGE_OPTIONS $CRIO_NETWORK_OPTIONS $CRIO_METRICS_OPTIONS (code=exited, status=1/FAILURE)
       Main PID: 8379 (code=exited, status=1/FAILURE)
            CPU: 158ms
    
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Starting Open Container Initiative Daemon...
      Oct 24 11:11:36 ip-10-0-157-61 crio[8379]: time="2019-10-24 11:11:36.807284677Z" level=fatal msg="opening seccomp profile (/etc/crio/seccomp.json) failed: open /etc/crio/seccomp.json: no such file or directory"
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Main process exited, code=exited, status=1/FAILURE
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Failed with result 'exit-code'.
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: Failed to start Open Container Initiative Daemon.
      Oct 24 11:11:36 ip-10-0-157-61 systemd[1]: crio.service: Consumed 158ms CPU time
    
    That's maybe a bit noisy, but mostly because all three control-plane
    machines failed the same way.
    
    Journal field docs are linked from the Go docstrings, but if these
    evolve later, the ones I'm using are from [2], except for UNIT,
    where I have opened an upstream documentation request [3].
    
    It might be worth exposing this as:
    
      $ openshift-install analyze bootstrap PATH
    
    so folks could look at bootstrap logs which had been gathered by
    third-parties, but I'm punting on that for now.
    
    [1]: https://storage.googleapis.com/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.3/2455/artifacts/e2e-aws/installer/log-bundle-20191024111122.tar
    [2]: https://github.com/systemd/systemd/blob/v246/man/systemd.journal-fields.xml
    [3]: systemd/systemd#17538
    wking committed Nov 6, 2020, commit fe43889