-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cmd/openshift-install/create: Fatal Kube API wait timeouts #786
cmd/openshift-install/create: Fatal Kube API wait timeouts #786
Conversation
I've also pushed 0a875c192 to ensure we log fatal errors to CC @crawford. |
13b7608
to
0a875c1
Compare
all errors from should all other cli errors like |
So we can say "just send us your
I'm fine leaving those out, since they're generally easier to resolve than "my Terraform/watcher/destroy invocation broke". But by the time we get down into code that has info/debug-level stuff written to |
Since it landed in 127219f (cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete, 2018-10-30, openshift#579), the Kube-API wait.Until loop has lacked "did we timeout?" error checking. That means that a hung bootstrap node could lead to logs like: ... DEBUG Still waiting for the Kubernetes API: Get https://wking-api.installer.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused INFO Waiting 30m0s for the bootstrap-complete event... WARNING Failed to connect events watcher: Get https://wking-api.installer.testing:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 192.168.126.10:6443: connect: connection refused ... where the Kube API never comes up and we waste 30 minutes of failed event-watcher connection attempts anyway. With this commit, we'll fail if apiContext expires without an API coming up. From [1]: If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error. [1]: https://golang.org/pkg/context/#Context
Before this commit, we had a number of callers with the following pattern: func doSomething(args) error { cleanup, err := setupFileHook(rootOpts.dir) if err != nil { return errors.Wrap(err, "failed to setup logging hook") } defer cleanup() err := someHelper() if err != nil { return err } return nil } The problem with that approach is than any errors returned by doSomething will not be logged to .openshift_install.log, and we definitely want those errors logged to the file. With this commit, I've shuffled things around to ensure we call logrus.Fatal(err) before the deferred cleanup() fires.
0a875c1
to
6bd59df
Compare
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abhinavdahiya, wking The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
/retest Please review the full test history for this PR and help us cut down flakes. |
Since it landed in 127219f (#579), the Kube-API
wait.Until
loop has lacked "did we timeout?" error checking. That means that a hung bootstrap node could lead to logs like:where the Kube API never comes up and we waste 30 minutes of failed event-watcher connection attempts anyway. With this commit, we'll fail if apiContext expires without an API coming up. From the
Context.Err
docs: