
cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete #579

Merged: 2 commits into openshift:master on Nov 2, 2018

Conversation

@wking (Member) commented Oct 31, 2018:

Once the cluster is up, the bootstrap stuff is just a waste of resources. With this commit we make a reasonable effort to wait for the bootstrap-complete event, and, if we see it, we tear down the bootstrap resources.

The Kubernetes client initialization is based on this example and the local wait.go is based on kubernetes/kubernetes#50102. Both of those are, like this project, under the Apache 2.0 license. Once kubernetes/kubernetes#50102 lands, we can drop the local wait.go and use the upstream implementation.

Still a few FIXME nits to deal with, but I thought I'd put it up for initial review.
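
(For illustration only: a minimal sketch of the approach described above, assuming client-go v9.0.0. It builds a client from the generated kubeconfig and watches kube-system events until one named bootstrap-complete arrives. The PR itself relies on client-go's watch tooling, including UntilWithoutRetry, rather than this hand-rolled loop, and the kubeconfig path and helper name below are illustrative, not the installer's exact code.)

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForBootstrapComplete builds a client from a kubeconfig and watches
// kube-system events until one named "bootstrap-complete" shows up or the
// timeout expires.
func waitForBootstrapComplete(kubeconfigPath string, timeout time.Duration) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	watcher, err := client.CoreV1().Events("kube-system").Watch(metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-watcher.ResultChan():
			if !ok {
				return fmt.Errorf("event watch closed before bootstrap-complete")
			}
			if ev.Type != watch.Added {
				continue
			}
			if event, ok := ev.Object.(*corev1.Event); ok && event.Name == "bootstrap-complete" {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("timed out after %s waiting for the bootstrap-complete event", timeout)
		}
	}
}

func main() {
	// Illustrative usage; "auth/kubeconfig" stands in for wherever the installer wrote the kubeconfig.
	if err := waitForBootstrapComplete("auth/kubeconfig", 30*time.Minute); err != nil {
		fmt.Println(err)
	}
}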

@openshift-ci-robot added the do-not-merge/work-in-progress, approved, and size/XXL labels on Oct 31, 2018.
@abhinavdahiya (Contributor):

@wking can we close #574? The consumer for that code is this PR, so I'd rather evaluate the bump in this PR.

@wking force-pushed the bootstrap-teardown branch from ce15364 to 223f200 on October 31, 2018 04:16.
@wking changed the title from "WIP: cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete" to "cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete" on Oct 31, 2018.
@openshift-ci-robot removed the do-not-merge/work-in-progress label on Oct 31, 2018.
@wking (Member, Author) commented Oct 31, 2018:

I've pushed ce15364 -> 223f200 fixing the "wait for the API" timeout and some gofmt issues. I'm fine with this PR landing as it stands now; I can handle the in-memory kubeconfig/metadata transfer here or in follow-up work.

@wking (Member, Author) commented Oct 31, 2018:

e2e-aws:

2018/10/31 04:21:43 error: unable to signal to artifacts container to terminate in pod release-latest, triggering deletion: could not run remote command: unable to upgrade connection: container not found ("artifacts")

/test e2e-aws

There are still some gofmt issues as well. I'll get to those in the morning, but wanted to see if I could get a clean e2e-aws through during the quiet night so I can see what kind of logs we'll generate.

@sjenning (Contributor):

All I ask is that, if the install fails, the bootstrap node is not deleted, so we can debug. In the past, the bootstrap has sometimes been under the impression it completed when, in fact, it did not complete successfully.

@abhinavdahiya (Contributor):

a7b6b9a we might not be able to update to these versions because of all the cluster-api stuff in #573

@abhinavdahiya (Contributor):

@wking, watches were not resilient when we're watching the API that we're still bringing up.

@wking force-pushed the bootstrap-teardown branch 2 times, most recently from c898075 to ca9ab23, on October 31, 2018 16:14.
@wking (Member, Author) commented Oct 31, 2018:

> a7b6b9a we might not be able to update to these versions because of all the cluster-api stuff in #573

How can we decouple #573 from the apimachinery version? Ideally we'd just be pulling in stable types there, and I'd rather not pin mutable client-go logic to an older version because of some static types. Maybe we need to pull the validation stuff out into a separate package so we can drop this?

@abhinavdahiya (Contributor):

> I'd rather not pin mutable client-go logic to an older version because of some static types.

apimachinery is tied to api (static types), isn't it?

The same case can be made in reverse: why bump apimachinery for the client-go bump?

@wking (Member, Author) commented Oct 31, 2018:

> I'd rather not pin mutable client-go logic to an older version because of some static types.

> apimachinery is tied to api (static types), isn't it?

But don't those static types persist into future versions of apimachinery? I'm not clear on how Kubernetes deprecates old types; I'd just assumed that it kept them around for backwards-compat.

> The same case can be made in reverse: why bump apimachinery for the client-go bump?

Are they that decoupled? If I can do that, I'm happy to bump client-go here and leave apimachinery alone.

@abhinavdahiya (Contributor):

Can we try with just the client-go bump, or make client-go an override rather than a constraint?
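
(For reference, the override-versus-constraint distinction in dep's Gopkg.toml: a constraint only applies to direct imports and must be reconcilable with every dependency's own constraints, while an override wins unconditionally. The stanzas below are illustrative; the exact version markers the installer ended up pinning may differ.)

# A constraint is only honored for packages this project imports directly,
# and must be satisfiable alongside all transitive constraints.
[[constraint]]
  name = "k8s.io/client-go"
  version = "v9.0.0"

# An override applies unconditionally and silences conflicting constraints
# coming from transitive dependencies (e.g. the cluster-api vendoring).
[[override]]
  name = "k8s.io/client-go"
  version = "v9.0.0"

With an override in place, dep stops trying to reconcile other projects' client-go pins against this one, which is why it is the escape hatch suggested here.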

@abhinavdahiya (Contributor):

Also do you think we can squash all the vendor bumps together?

@wking (Member, Author) commented Oct 31, 2018:

Hmm, this approach is going to break down if the caller hasn't set up DNS. For example, see #93 where @cgwalters had not set up DNS. Are we ok hanging for 30 minutes waiting for something that will never resolve? Or do I need to catch that error and bail early if I see it?

@abhinavdahiya (Contributor):

> Hmm, this approach is going to break down if the caller hasn't set up DNS. For example, see #93 where @cgwalters had not set up DNS. Are we ok hanging for 30 minutes waiting for something that will never resolve? Or do I need to catch that error and bail early if I see it?

Distinguishing "can't resolve DNS now" from "will never resolve DNS" is a problem for both AWS and libvirt. On AWS, DNS resolution can take quite a while to propagate, depending on your domain's SOA, so I think it's fine to wait for 30 minutes.

The user can always decide to cancel the wait and exit.
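
(A rough sketch of that wait-with-escape-hatch behavior in Go, assuming only the standard library's context and os/signal packages; waitWithCancel and its callback are illustrative names rather than the installer's actual API.)

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"time"
)

// waitWithCancel runs fn under a 30-minute deadline, and also cancels early if
// the user hits Ctrl-C, so a DNS setup that will never resolve doesn't force a
// full 30-minute hang.
func waitWithCancel(fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	interrupts := make(chan os.Signal, 1)
	signal.Notify(interrupts, os.Interrupt)
	defer signal.Stop(interrupts)
	go func() {
		<-interrupts
		cancel() // the user decided to cancel the wait and exit
	}()

	return fn(ctx)
}

func main() {
	// Illustrative usage: the callback stands in for the real API/event wait.
	err := waitWithCancel(func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
	fmt.Println(err)
}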

@wking (Member, Author) commented Oct 31, 2018:

> Can we try with just the client-go bump, or make client-go an override rather than a constraint?

Looks like a no-go:

$ dep ensure
Solving failure: package k8s.io/apimachinery/pkg/util/naming does not exist within project k8s.io/apimachinery

naming is new in 1.12 (kubernetes/apimachinery@3c4948ecf2) and we're pulling it with:

$ git grep util/naming vendor
vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:	"k8s.io/apimachinery/pkg/util/naming"
vendor/k8s.io/client-go/tools/cache/reflector.go:	"k8s.io/apimachinery/pkg/util/naming"
$ git grep tools/cache vendor/k8s.io/client-go/tools/watch
vendor/k8s.io/client-go/tools/watch/informerwatcher.go:	"k8s.io/client-go/tools/cache"
vendor/k8s.io/client-go/tools/watch/until.go:	"k8s.io/client-go/tools/cache"

So what do I need to do to catch cluster-API up to 1.12?

@abhinavdahiya (Contributor):

Let's try this: rebase on #573 with overrides for your changes and see if that works.

@wking force-pushed the bootstrap-teardown branch from ca9ab23 to 2fddb04 on October 31, 2018 22:50.
@wking (Member, Author) commented Oct 31, 2018:

Rebasing onto #573:

$ dep ensure
Solving failure: package k8s.io/api/autoscaling/v2beta2 does not exist within project k8s.io/api

I've pushed 2fddb04 squashing down to a single vendor commit anyway.

@wking (Member, Author) commented Nov 2, 2018:

I've pushed another little tweak to the logging with 00fa119 -> faee023 (now the bootstrap-complete event gets logged by the same Debugf as the rest of the add events).

@crawford (Contributor) commented Nov 2, 2018:

/lgtm

@openshift-ci-robot added the lgtm label on Nov 2, 2018.
@wking added 2 commits on November 2, 2018 14:08:
Once the cluster is up, the bootstrap stuff is just a waste of
resources.  With this commit we make a reasonable effort to wait for
the bootstrap-complete event, and, if we see it, we tear down the
bootstrap resources.

The Kubernetes client initialization is based on [1] and the local
wait.go is based on [2].  Both of those are, like this project, under
the Apache 2.0 license.  Once kubernetes/kubernetes#50102 lands, we
can drop the local wait.go and use the upstream implementation.

I've dropped the errors.Cause call from main.go, because it was
dropping the wrapped context, which is useful context for debugging.
That Cause call is from dc118f2 (cmd: switch to already vendored
cobra, 2018-10-08, openshift#429), but that commit doesn't explicitly motivate
the call.

[1]: https://github.com/kubernetes/client-go/blob/v9.0.0/examples/out-of-cluster-client-configuration/main.go
[2]: https://github.com/kubernetes/kubernetes/pull/50102/files
client-go 8.0.0 is the version that's compatible with Kubernetes 1.11 [1], and
we've been using Kubernetes 1.11.3 for apimachinery since 2a606ef
(vendor: bump apimachinery to 1.11.3, 2018-10-01, openshift#382).  client-go
9.0.0 is nominally compatible with Kubernetes 1.12 [2], and I want
this for UntilWithoutRetry [3] which is new in 9.0.0 [4].

I edited Gopkg.toml by hand and regenerated everything else with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0
   build date  :
   git hash    : 22125cf
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

I need to bump apimachinery into the 1.12 channel as well (and now I match
client-go's declared dependency [2]) to avoid:

  $ dep ensure
  Solving failure: package k8s.io/apimachinery/pkg/util/naming does not exist within project k8s.io/apimachinery

naming is new in 1.12 [5] and we're pulling it with:

  $ git grep util/naming vendor
  vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:				"k8s.io/apimachinery/pkg/util/naming"
  vendor/k8s.io/client-go/tools/cache/reflector.go:				"k8s.io/apimachinery/pkg/util/naming"
  $ git grep tools/cache vendor/k8s.io/client-go/tools/watch
  vendor/k8s.io/client-go/tools/watch/informerwatcher.go:	"k8s.io/client-go/tools/cache"
  vendor/k8s.io/client-go/tools/watch/until.go:						"k8s.io/client-go/tools/cache"

Then I needed to change client-go to an override to avoid:

  $ dep ensure
  Solving failure: package k8s.io/api/autoscaling/v2beta2 does not exist within project k8s.io/api

Besides the version bump, dep also pulled in libraries for waiting on
Kubernetes events.

[1]: kubernetes/client-go@7d04d0e
[2]: kubernetes/client-go@1638f89
[3]: https://github.com/kubernetes/client-go/blob/v9.0.0/tools/watch/until.go#L47-L58
[4]: kubernetes/client-go@cbdb98d
[5]: kubernetes/apimachinery@3c4948e
@wking force-pushed the bootstrap-teardown branch from faee023 to d9ae9b3 on November 2, 2018 21:08.
@openshift-ci-robot removed the lgtm label on Nov 2, 2018.
@wking (Member, Author) commented Nov 2, 2018:

Oops, rebased onto master with faee023 -> d9ae9b3, now that #573 has landed. @crawford, can you re-/lgtm?

@crawford (Contributor) commented Nov 2, 2018:

/lgtm

@openshift-ci-robot added the lgtm label on Nov 2, 2018.
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit f7e6e08 into openshift:master on Nov 2, 2018.
@wking deleted the bootstrap-teardown branch on November 2, 2018 21:59.
wking added a commit to wking/openshift-release that referenced this pull request Nov 3, 2018
Don't run the tests before 'create cluster' exits successfully (if it
fails, the /tmp/shared/exit check will kill the 'test' container).
With this, we no longer have to wait on the shared kubeconfig, because
we know it was created before setup-success.  And once 'create
cluster' successfully blocks on the bootstrap-complete event [1], this
commit will ensure that we run our tests on the production control
plane, and not while the bootstrap control plane is still involved.

[1]: openshift/installer#579
     Although we've been seeing some issues with the re-watch logic
     from that pull request.  Once that logic solidifies, we can drop
     the API_UP check as well.
wking added a commit to wking/openshift-installer that referenced this pull request Dec 11, 2018
Since it landed in 127219f (cmd/openshift-install/create: Destroy
bootstrap on bootstrap-complete, 2018-10-30, openshift#579), the Kube-API
wait.Until loop has lacked "did we timeout?" error checking.  That
means that a hung bootstrap node could lead to logs like:

  ...
  DEBUG Still waiting for the Kubernetes API: Get https://wking-api.installer.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
  INFO Waiting 30m0s for the bootstrap-complete event...
  WARNING Failed to connect events watcher: Get https://wking-api.installer.testing:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 192.168.126.10:6443: connect: connection refused
  ...

where the Kube API never comes up and we waste 30 minutes of failed
event-watcher connection attempts anyway.  With this commit, we'll
fail if apiContext expires without an API coming up.  From [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context was
  canceled or DeadlineExceeded if the context's deadline passed.
  After Err returns a non-nil error, successive calls to Err return
  the same error.

[1]: https://golang.org/pkg/context/#Context
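
(For illustration, the pattern that commit describes looks roughly like the sketch below: poll the API under a deadline-bound context, then inspect the context's error so a deadline expiry becomes a hard failure instead of silently falling through to the event-watcher loop. The function name, poll interval, and log strings are illustrative rather than the installer's exact code.)

package example

import (
	"context"
	"time"

	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAPI polls the cluster's /version endpoint until the API answers or
// the 30-minute deadline passes, then turns a deadline expiry into an error.
func waitForAPI(ctx context.Context, client *kubernetes.Clientset) error {
	apiContext, cancel := context.WithTimeout(ctx, 30*time.Minute)
	defer cancel()

	wait.Until(func() {
		if _, err := client.Discovery().ServerVersion(); err == nil {
			logrus.Info("Kubernetes API is up")
			cancel() // stop polling; apiContext.Err() becomes context.Canceled
		} else {
			logrus.Debugf("Still waiting for the Kubernetes API: %v", err)
		}
	}, 2*time.Second, apiContext.Done())

	// DeadlineExceeded means the timeout fired before the API ever answered;
	// Canceled means we stopped the loop ourselves because it did.
	if err := apiContext.Err(); err == context.DeadlineExceeded {
		return errors.Wrap(err, "waiting for Kubernetes API")
	}
	return nil
}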