cmd: diagnose problems downloading release image #4751

staebler · 2021-03-14T00:02:00Z

When an installation fails during bootstrapping, analyze the bootstrap gather bundle for problems with downloading the release image. This analysis uses data from the new service entries files.

https://issues.redhat.com/browse/CORS-1533

Sample output when pull secret is not valid.

$ ./bin/openshift-install gather bootstrap
INFO Pulling debug logs from the bootstrap machine
INFO Bootstrap gather logs captured here "log-bundle-20210313180900.tar.gz"
ERROR The bootstrap machine failed to download the release image
INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required
INFO Pull failed. Retrying registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603...
INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required

Also, add an analyze command to analyze an existing bootstrap gather bundle.

staebler · 2021-03-14T00:02:36Z

/hold as this builds on #4742.

staebler · 2021-03-21T02:16:24Z

#4742 has merged
/hold cancel

cmd/openshift-install/analyze.go

staebler · 2021-03-24T20:30:58Z

/retest

pkg/gather/service/analyze.go

When an installation fails during bootstrapping, analyze the bootstrap gather bundle for problems with downloading the release image. This analysis uses data from the new service entries files. https://issues.redhat.com/browse/CORS-1533 Sample output when pull secret is not valid. ``` $ ./bin/openshift-install gather bootstrap INFO Pulling debug logs from the bootstrap machine INFO Bootstrap gather logs captured here "log-bundle-20210313180900.tar.gz" ERROR The bootstrap machine failed to download the release image INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required INFO Pull failed. Retrying registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603... INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required ```

The unit tests for analyzing bootstrap gather bundles using the github.com/sirupsen/logrus/hooks/test package. The github.com/sirupsen/logrus module is already included, so these changes just add the hooks/test pacakge from that module to the vendor directory.

Add an analyze command to analyze an existing bootstrap gather bundle.

Lay the framework for checking the analysis of other bootstrap services to present the user with more helpful diagnostics.

patrickdillon · 2021-04-05T20:27:19Z

pkg/gather/service/analyze.go

-				logrus.Infof("Could not analyze the release-image.service: %v", err)
-			}
-			break
+		serviceName := serviceEntriesFileSubmatch[1]


Why are we using the second match (or submatch--whatever the right term is)?

Perhaps it is a safe assumption that there are at least two matches to this regex pattern but I think it would be a good idea to check the length as well when checking for nil above.

The length of the submatches is determined by the regex. If the length is less than 2, that is a coding error not a runtime error. Throwing a panic is the right thing to do in the face of a coding error.

And the second submatch is the match of ([^.]+) in the overall ^[^\/]+\/bootstrap\/services\/([^.]+)\.json$ expression.

patrickdillon · 2021-04-05T20:43:27Z

pkg/gather/service/analyze.go

+		a := serviceAnalyses[check.name]
+		if a.starts == 0 {
+			logrus.Errorf("The bootstrap machine did not execute the %s.service systemd unit", check.name)
+			break


once we add more checks then we want to remove this break, right?

No. If the higher-priority service check determines that the service did not start, then we want to stop looking for errors in lower-priority service checks. For example, if we determine that the release-image service did not start, then we don't want to print errors that we find in the bootkube service, because any errors in the bootkube service can be traced back to the release-image service failing.

I missed that the check priority was determined by the order in the slice. This makes sense now.

patrickdillon · 2021-04-05T20:52:45Z

pkg/gather/service/analyze.go

+			break
+		}
+		if !check.check(a) {
+			break


This break too. IIUC these seem like optimizations based on the limited number of checks, which is fine for the time being.

Same response as above, where this break is what we want long-term.

patrickdillon · 2021-04-06T13:07:41Z

/lgtm

staebler · 2021-04-06T15:57:29Z

/approve

openshift-ci-robot · 2021-04-06T15:57:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: staebler

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [staebler]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2021-04-06T16:17:48Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-04-06T20:24:05Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-04-06T21:16:10Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2021-04-06T21:42:10Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci · 2021-04-06T21:54:32Z

@staebler: The following tests failed, say /retest to rerun all failed tests:

Test name	Commit	Details	Rerun command
ci/prow/e2e-libvirt	`779c6ee`	link	`/test e2e-libvirt`
ci/prow/e2e-aws-workers-rhel7	`779c6ee`	link	`/test e2e-aws-workers-rhel7`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

… logs above gather logs When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. I wanted to move the gather attempt and the gathere-analysis attempt closer together, so where to move the installer-client's bootstrap error logging? Options: a. Right after we decide that bootstrap failed. b. After we log any surprising ClusterOperator conditions, but before the gather stuff. c. After the gather stuff, right before the fatal exit. I've gone with (b) here, because logClusterOperatorConditions can be very chatty, and I don't want feedback like [1]: level=error msg=Bootstrap failed to complete: timed out waiting for the condition level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. to get lost in the scrollback. And the feedback from the gather handling is likely to be very directed, so I've kept that towards the end. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A51

…gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50

… logs above gather logs When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. I wanted to move the gather attempt and the gathere-analysis attempt closer together, so where to move the installer-client's bootstrap error logging? Options: a. Right after we decide that bootstrap failed. b. After we log any surprising ClusterOperator conditions, but before the gather stuff. c. After the gather stuff, right before the fatal exit. I've gone with (b) here, because logClusterOperatorConditions can be very chatty, and I don't want feedback like [1]: level=error msg=Bootstrap failed to complete: timed out waiting for the condition level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. to get lost in the scrollback. And the feedback from the gather handling is likely to be very directed, so I've kept that towards the end. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A51

…gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50

… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50

… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)

… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 14, 2021

openshift-ci-robot requested review from jhixson74 and patrickdillon March 14, 2021 00:03

staebler force-pushed the release_image_failures branch 4 times, most recently from 0dbbd6e to 2b20472 Compare March 15, 2021 02:11

openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 21, 2021

staebler force-pushed the release_image_failures branch from 2b20472 to cb71976 Compare March 21, 2021 02:19

wking reviewed Mar 21, 2021

View reviewed changes

cmd/openshift-install/analyze.go Outdated Show resolved Hide resolved

staebler force-pushed the release_image_failures branch from cb71976 to d2ea19b Compare March 23, 2021 16:36

patrickdillon reviewed Mar 31, 2021

View reviewed changes

pkg/gather/service/analyze.go Outdated Show resolved Hide resolved

pkg/gather/service/analyze.go Outdated Show resolved Hide resolved

pkg/gather/service/analyze.go Outdated Show resolved Hide resolved

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2021

staebler added 4 commits April 1, 2021 18:40

cmd: add analyze command

3441e2a

Add an analyze command to analyze an existing bootstrap gather bundle.

gather: framework for analyzing other bootstrap services

779c6ee

Lay the framework for checking the analysis of other bootstrap services to present the user with more helpful diagnostics.

staebler force-pushed the release_image_failures branch from 4f52b6e to 779c6ee Compare April 1, 2021 22:41

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 1, 2021

patrickdillon reviewed Apr 5, 2021

View reviewed changes

openshift-ci-robot assigned patrickdillon Apr 6, 2021

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 6, 2021

openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 6, 2021

openshift-merge-robot merged commit f14d9ea into openshift:master Apr 6, 2021

wking mentioned this pull request Jun 14, 2021

cmd/openshift-install/analyze: Attempt to analyze bootstrap tarballs #2567

Closed

wking mentioned this pull request Jan 26, 2022

cmd/openshift-install/create: Do not attempt analysis when we fail to gather logs #5582

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

cmd: diagnose problems downloading release image #4751

cmd: diagnose problems downloading release image #4751

staebler commented Mar 14, 2021

staebler commented Mar 14, 2021

staebler commented Mar 21, 2021

staebler commented Mar 24, 2021

patrickdillon Apr 5, 2021

staebler Apr 5, 2021

staebler Apr 5, 2021

patrickdillon Apr 5, 2021

staebler Apr 5, 2021

patrickdillon Apr 6, 2021

patrickdillon Apr 5, 2021

staebler Apr 5, 2021

patrickdillon commented Apr 6, 2021

staebler commented Apr 6, 2021

openshift-ci-robot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-ci bot commented Apr 6, 2021 •

edited

Loading

cmd: diagnose problems downloading release image #4751

cmd: diagnose problems downloading release image #4751

Conversation

staebler commented Mar 14, 2021

staebler commented Mar 14, 2021

staebler commented Mar 21, 2021

staebler commented Mar 24, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

patrickdillon commented Apr 6, 2021

staebler commented Apr 6, 2021

openshift-ci-robot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-bot commented Apr 6, 2021

openshift-ci bot commented Apr 6, 2021 • edited Loading

openshift-ci bot commented Apr 6, 2021 •

edited

Loading