cmd/openshift-install/create: Fatal Kube API wait timeouts #786

wking · 2018-12-05T04:40:54Z

Since it landed in 127219f (#579), the Kube-API wait.Until loop has lacked "did we timeout?" error checking. That means that a hung bootstrap node could lead to logs like:

...
DEBUG Still waiting for the Kubernetes API: Get https://wking-api.installer.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
INFO Waiting 30m0s for the bootstrap-complete event...
WARNING Failed to connect events watcher: Get https://wking-api.installer.testing:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 192.168.126.10:6443: connect: connection refused
...

where the Kube API never comes up and we waste 30 minutes of failed event-watcher connection attempts anyway. With this commit, we'll fail if apiContext expires without an API coming up. From the Context.Err docs:

If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error.

wking · 2018-12-05T05:08:15Z

I've also pushed 0a875c192 to ensure we log fatal errors to .openshift_install.log. Details in the commit message.

CC @crawford.

abhinavdahiya · 2018-12-05T17:26:58Z

> The problem with that approach is that any errors returned by
> doSomething will not be logged to .openshift_install.log, and we
> definitely want those errors logged to the file. With this commit,
> I've shuffled things around to ensure we call logrus.Fatal(err) before
> the deferred cleanup() fires.

All errors returned by doSomething are already fatal through the cobra CLI's RunE, so they are already visible to the user. Why put them in .openshift_install.log?

Should all other CLI errors, like an invalid flag or an invalid subcommand, end up in .openshift_install.log too?

wking · 2018-12-05T17:43:58Z

> why put them in .openshift_install.log

So we can say "just send us your .openshift_install.log" when debugging user-reported issues and still see the final error.

> should all other cli errors like invalid flag provided / invalid subcommand end up in .openshift_install.log ?

I'm fine leaving those out, since they're generally easier to resolve than "my Terraform/watcher/destroy invocation broke". But by the time we get down into code that has info/debug-level stuff written to .openshift_install.log, I want to have fatal/error-level output from that code written to the log too ;).

Since it landed in 127219f (cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete, 2018-10-30, openshift#579), the Kube-API wait.Until loop has lacked "did we timeout?" error checking. That means that a hung bootstrap node could lead to logs like:

  ...
  DEBUG Still waiting for the Kubernetes API: Get https://wking-api.installer.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
  INFO Waiting 30m0s for the bootstrap-complete event...
  WARNING Failed to connect events watcher: Get https://wking-api.installer.testing:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 192.168.126.10:6443: connect: connection refused
  ...

where the Kube API never comes up and we waste 30 minutes of failed event-watcher connection attempts anyway. With this commit, we'll fail if apiContext expires without an API coming up. From [1]:

  If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error.

[1]: https://golang.org/pkg/context/#Context

Before this commit, we had a number of callers with the following pattern:

  func doSomething(args) error {
    cleanup, err := setupFileHook(rootOpts.dir)
    if err != nil {
      return errors.Wrap(err, "failed to setup logging hook")
    }
    defer cleanup()

    err = someHelper()
    if err != nil {
      return err
    }
    return nil
  }

The problem with that approach is that any errors returned by doSomething will not be logged to .openshift_install.log, and we definitely want those errors logged to the file. With this commit, I've shuffled things around to ensure we call logrus.Fatal(err) before the deferred cleanup() fires.

wking · 2018-12-11T07:17:14Z

Rebased onto master around #806 with 0a875c1 -> 6bd59df.

abhinavdahiya · 2018-12-11T19:50:51Z

/lgtm

openshift-ci-robot · 2018-12-11T19:51:09Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2018-12-12T11:55:59Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2018-12-12T17:59:00Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot added size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Dec 5, 2018

openshift-ci-robot requested review from rajatchopra and steveej December 5, 2018 04:41

openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Dec 5, 2018

wking force-pushed the error-out-if-create-kube-api-wait-fails branch from 13b7608 to 0a875c1 Compare December 5, 2018 05:29

openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 8, 2018

wking added 2 commits December 10, 2018 23:13

wking force-pushed the error-out-if-create-kube-api-wait-fails branch from 0a875c1 to 6bd59df Compare December 11, 2018 07:16

openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 11, 2018

openshift-ci-robot assigned abhinavdahiya Dec 11, 2018

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 11, 2018

openshift-merge-robot merged commit 5748299 into openshift:master Dec 12, 2018

wking deleted the error-out-if-create-kube-api-wait-fails branch December 12, 2018 20:51

wking mentioned this pull request Apr 16, 2019

New gather subcommand to assist debugging bootstrap failures. #1627

Merged
