-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd: diagnose problems downloading release image #4751
cmd: diagnose problems downloading release image #4751
Conversation
/hold as this builds on #4742. |
0dbbd6e
to
2b20472
Compare
#4742 has merged |
2b20472
to
cb71976
Compare
cb71976
to
d2ea19b
Compare
/retest |
When an installation fails during bootstrapping, analyze the bootstrap gather bundle for problems with downloading the release image. This analysis uses data from the new service entries files. https://issues.redhat.com/browse/CORS-1533 Sample output when pull secret is not valid. ``` $ ./bin/openshift-install gather bootstrap INFO Pulling debug logs from the bootstrap machine INFO Bootstrap gather logs captured here "log-bundle-20210313180900.tar.gz" ERROR The bootstrap machine failed to download the release image INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required INFO Pull failed. Retrying registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603... INFO Error: unable to pull registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error initializing source docker://registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-03-12-142603: Error reading manifest 4.8.0-0.nightly-2021-03-12-142603 in registry.ci.openshift.org/ocp/release: unauthorized: authentication required ```
The unit tests for analyzing bootstrap gather bundles using the github.com/sirupsen/logrus/hooks/test package. The github.com/sirupsen/logrus module is already included, so these changes just add the hooks/test pacakge from that module to the vendor directory.
Add an analyze command to analyze an existing bootstrap gather bundle.
Lay the framework for checking the analysis of other bootstrap services to present the user with more helpful diagnostics.
4f52b6e
to
779c6ee
Compare
logrus.Infof("Could not analyze the release-image.service: %v", err) | ||
} | ||
break | ||
serviceName := serviceEntriesFileSubmatch[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we using the second match (or submatch--whatever the right term is)?
Perhaps it is a safe assumption that there are at least two matches to this regex pattern but I think it would be a good idea to check the length as well when checking for nil above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The length of the submatches is determined by the regex. If the length is less than 2, that is a coding error not a runtime error. Throwing a panic is the right thing to do in the face of a coding error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And the second submatch is the match of ([^.]+)
in the overall ^[^\/]+\/bootstrap\/services\/([^.]+)\.json$
expression.
a := serviceAnalyses[check.name] | ||
if a.starts == 0 { | ||
logrus.Errorf("The bootstrap machine did not execute the %s.service systemd unit", check.name) | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
once we add more checks then we want to remove this break
, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. If the higher-priority service check determines that the service did not start, then we want to stop looking for errors in lower-priority service checks. For example, if we determine that the release-image service did not start, then we don't want to print errors that we find in the bootkube service, because any errors in the bootkube service can be traced back to the release-image service failing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed that the check priority was determined by the order in the slice. This makes sense now.
break | ||
} | ||
if !check.check(a) { | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This break
too. IIUC these seem like optimizations based on the limited number of checks, which is fine for the time being.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same response as above, where this break is what we want long-term.
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: staebler The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
3 similar comments
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
/retest Please review the full test history for this PR and help us cut down flakes. |
@staebler: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
… logs above gather logs When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. I wanted to move the gather attempt and the gathere-analysis attempt closer together, so where to move the installer-client's bootstrap error logging? Options: a. Right after we decide that bootstrap failed. b. After we log any surprising ClusterOperator conditions, but before the gather stuff. c. After the gather stuff, right before the fatal exit. I've gone with (b) here, because logClusterOperatorConditions can be very chatty, and I don't want feedback like [1]: level=error msg=Bootstrap failed to complete: timed out waiting for the condition level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. to get lost in the scrollback. And the feedback from the gather handling is likely to be very directed, so I've kept that towards the end. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A51
…gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
… logs above gather logs When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. I wanted to move the gather attempt and the gathere-analysis attempt closer together, so where to move the installer-client's bootstrap error logging? Options: a. Right after we decide that bootstrap failed. b. After we log any surprising ClusterOperator conditions, but before the gather stuff. c. After the gather stuff, right before the fatal exit. I've gone with (b) here, because logClusterOperatorConditions can be very chatty, and I don't want feedback like [1]: level=error msg=Bootstrap failed to complete: timed out waiting for the condition level=error msg=Failed to wait for bootstrapping to complete. This error usually happens when there is a problem with control plane hosts that prevents the control plane operators from creating the control plane. to get lost in the scrollback. And the feedback from the gather handling is likely to be very directed, so I've kept that towards the end. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A51
…gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
…gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
…gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
… gather logs Adding a conditional that we overlooked in fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751). This will avoid distraction like [1]: level=error msg=Attempted to gather debug logs after installation failure: bootstrap host address and at least one control plane host address must be provided ... level=error msg=Attempted to analyze the debug logs after installation failure: could not open the gather bundle: open : no such file or directory where the first line usefully explains that we failed to gather the log bundle, while the second line uselessly adds that without a log bundle, there can be no log bundle analysis. == Order of operations When they landed in cad45bd (installer-create: Provide user friendly error messages during failures, 2021-03-29, openshift#4800), the installer-client's bootstrap error logs were after the gather attempt, and right before the fatal "Bootstrap failed to complete" bail-out. fdb04a7 (cmd: diagnose problems downloading release image, 2021-03-13, openshift#4751) landed later, adding an attempt to analyze the gathered logs. At this point, the installer-client's bootstrap error logging sat in between the gather attempt and the gather-analysis attempt, which seems like uneccessary context switching. Matthew suggested [2]: 1. Perform bootstrap wait and store error. 2. Perform bootstrap gather, and print any error. 3. Print failing console operators. 4. Print error from bootstrap wait. 5. If no error from bootstrap gather, perform analysis of gather bundle. 6. If no error from bootstrap gather, print out location of gather bundle. 7. Print generic message that bootstrapping failed. so that's what I've gone with here. In order to get step 6 that late, I had to pull it out of logGatherBootstrap (and now that that function no longer logs on success, I renamed it to gatherBootstrap). Now the newGatherBootstrapCmd function includes its own explicit logging of the successful gather path, and we can place the create-cluster logging down at 6 where Matthew wants it. == Error scoping and naming There's also no need to pick a unique name for is block-scoped variable. We can use the usual 'err' without clobbering the local 'err' that's scoped to the waitForBootstrapComplete block. We only need a specific name for gatherError because we need it to feed the deferred conditionals for 5 and 6. [1]: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.10-e2e-ovirt-ovn/1486127081728249856#1:build-log.txt%3A50 [2]: openshift#5582 (comment)
When an installation fails during bootstrapping, analyze the bootstrap gather bundle for problems with downloading the release image. This analysis uses data from the new service entries files.
https://issues.redhat.com/browse/CORS-1533
Sample output when pull secret is not valid.
Also, add an
analyze
command to analyze an existing bootstrap gather bundle.