Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd: diagnose problems downloading release image #4751

Merged

Conversation

staebler
Copy link
Contributor

When an installation fails during bootstrapping, analyze the bootstrap gather bundle for problems with downloading the release image. This analysis uses data from the new service entries files.

https://issues.redhat.com/browse/CORS-1533

Sample output when pull secret is not valid.

$ ./bin/openshift-install gather bootstrap
INFO Pulling debug logs from the bootstrap machine
INFO Bootstrap gather logs captured here "log-bundle-20210313180900.tar.gz"
ERROR The bootstrap machine failed to download the release image
INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required
INFO Pull failed. Retrying registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603...
INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required

Also, add an analyze command to analyze an existing bootstrap gather bundle.

@staebler
Copy link
Contributor Author

/hold as this builds on #4742.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2021
@staebler staebler force-pushed the release_image_failures branch 4 times, most recently from 0dbbd6e to 2b20472 Compare March 15, 2021 02:11
@staebler
Copy link
Contributor Author

#4742 has merged
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2021
@staebler
Copy link
Contributor Author

/retest

pkg/gather/service/analyze.go Outdated Show resolved Hide resolved
pkg/gather/service/analyze.go Outdated Show resolved Hide resolved
pkg/gather/service/analyze.go Outdated Show resolved Hide resolved
@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2021
When an installation fails during bootstrapping, analyze the bootstrap
gather bundle for problems with downloading the release image. This
analysis uses data from the new service entries files.

https://issues.redhat.com/browse/CORS-1533

Sample output when pull secret is not valid.
```
$ ./bin/openshift-install gather bootstrap
INFO Pulling debug logs from the bootstrap machine
INFO Bootstrap gather logs captured here "log-bundle-20210313180900.tar.gz"
ERROR The bootstrap machine failed to download the release image
INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required
INFO Pull failed. Retrying registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603...
INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required
```
The unit tests for analyzing bootstrap gather bundles using
the github.com/sirupsen/logrus/hooks/test package. The
github.com/sirupsen/logrus module is already included, so these
changes just add the hooks/test pacakge from that module to
the vendor directory.
Add an analyze command to analyze an existing bootstrap gather
bundle.
Lay the framework for checking the analysis of other bootstrap
services to present the user with more helpful diagnostics.
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2021
logrus.Infof("Could not analyze the release-image.service: %v", err)
}
break
serviceName := serviceEntriesFileSubmatch[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we using the second match (or submatch--whatever the right term is)?

Perhaps it is a safe assumption that there are at least two matches to this regex pattern but I think it would be a good idea to check the length as well when checking for nil above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The length of the submatches is determined by the regex. If the length is less than 2, that is a coding error not a runtime error. Throwing a panic is the right thing to do in the face of a coding error.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the second submatch is the match of ([^.]+) in the overall ^[^\/]+\/bootstrap\/services\/([^.]+)\.json$ expression.

a := serviceAnalyses[check.name]
if a.starts == 0 {
logrus.Errorf("The bootstrap machine did not execute the %s.service systemd unit", check.name)
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once we add more checks then we want to remove this break, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. If the higher-priority service check determines that the service did not start, then we want to stop looking for errors in lower-priority service checks. For example, if we determine that the release-image service did not start, then we don't want to print errors that we find in the bootkube service, because any errors in the bootkube service can be traced back to the release-image service failing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that the check priority was determined by the order in the slice. This makes sense now.

break
}
if !check.check(a) {
break
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This break too. IIUC these seem like optimizations based on the limited number of checks, which is fine for the time being.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same response as above, where this break is what we want long-term.

@patrickdillon
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2021
@staebler
Copy link
Contributor Author

staebler commented Apr 6, 2021

/approve

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2021
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Apr 6, 2021

@staebler: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 779c6ee link /test e2e-libvirt
ci/prow/e2e-aws-workers-rhel7 779c6ee link /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit f14d9ea into openshift:master Apr 6, 2021
wking added a commit to wking/openshift-installer that referenced this pull request Jan 26, 2022
… logs above gather logs

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

I wanted to move the gather attempt and the gathere-analysis attempt
closer together, so where to move the installer-client's bootstrap
error logging?  Options:

a. Right after we decide that bootstrap failed.
b. After we log any surprising ClusterOperator conditions, but before
   the gather stuff.
c. After the gather stuff, right before the fatal exit.

I've gone with (b) here, because logClusterOperatorConditions can be
very chatty, and I don't want feedback like [1]:

  level=error msg=Bootstrap failed to complete: timed out waiting for the condition
  level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

to get lost in the scrollback.  And the feedback from the gather
handling is likely to be very directed, so I've kept that towards the
end.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A51
wking added a commit to wking/openshift-installer that referenced this pull request Jan 26, 2022
…gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
wking added a commit to wking/openshift-installer that referenced this pull request Feb 3, 2022
… logs above gather logs

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

I wanted to move the gather attempt and the gathere-analysis attempt
closer together, so where to move the installer-client's bootstrap
error logging?  Options:

a. Right after we decide that bootstrap failed.
b. After we log any surprising ClusterOperator conditions, but before
   the gather stuff.
c. After the gather stuff, right before the fatal exit.

I've gone with (b) here, because logClusterOperatorConditions can be
very chatty, and I don't want feedback like [1]:

  level=error msg=Bootstrap failed to complete: timed out waiting for the condition
  level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane.

to get lost in the scrollback.  And the feedback from the gather
handling is likely to be very directed, so I've kept that towards the
end.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A51
wking added a commit to wking/openshift-installer that referenced this pull request Feb 3, 2022
…gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
wking added a commit to wking/openshift-installer that referenced this pull request Feb 3, 2022
…gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
wking added a commit to wking/openshift-installer that referenced this pull request Feb 3, 2022
…gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
wking added a commit to wking/openshift-installer that referenced this pull request Feb 3, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
wking added a commit to wking/openshift-installer that referenced this pull request Feb 4, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

There's also no need to pick a unique name for is block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
wking added a commit to wking/openshift-installer that referenced this pull request Feb 4, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for is block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
wking added a commit to wking/openshift-installer that referenced this pull request Feb 4, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for is block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
miyamotoh pushed a commit to miyamotoh/installer that referenced this pull request Feb 16, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for is block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
AnnaZivkovic pushed a commit to AnnaZivkovic/installer that referenced this pull request Apr 1, 2022
… gather logs

Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose
problems downloading release image, 2021-03-13, openshift#4751).  This will
avoid distraction like [1]:

  level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided
  ...
  level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory

where the first line usefully explains that we failed to gather the
log bundle, while the second line uselessly adds that without a log
bundle, there can be no log bundle analysis.

== Order of operations

When they landed in cad45bd (installer-create: Provide user
friendly error messages during failures, 2021-03-29, openshift#4800), the
installer-client's bootstrap error logs were after the gather attempt,
and right before the fatal "Bootstrap failed to complete" bail-out.

fdb04a7 (cmd: diagnose problems downloading release image,
2021-03-13, openshift#4751) landed later, adding an attempt to analyze the
gathered logs.  At this point, the installer-client's bootstrap error
logging sat in between the gather attempt and the gather-analysis
attempt, which seems like uneccessary context switching.

Matthew suggested [2]:

1. Perform bootstrap wait and store error.
2. Perform bootstrap gather, and print any error.
3. Print failing console operators.
4. Print error from bootstrap wait.
5. If no error from bootstrap gather, perform analysis of gather bundle.
6. If no error from bootstrap gather, print out location of gather bundle.
7. Print generic message that bootstrapping failed.

so that's what I've gone with here.

In order to get step 6 that late, I had to pull it out of
logGatherBootstrap (and now that that function no longer logs on
success, I renamed it to gatherBootstrap).  Now the
newGatherBootstrapCmd function includes its own explicit logging of
the successful gather path, and we can place the create-cluster
logging down at 6 where Matthew wants it.

== Error scoping and naming

There's also no need to pick a unique name for is block-scoped
variable.  We can use the usual 'err' without clobbering the local
'err' that's scoped to the waitForBootstrapComplete block.

We only need a specific name for gatherError because we need it to
feed the deferred conditionals for 5 and 6.

[1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
[2]: openshift#5582 (comment)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants