
cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete #579

Merged: 2 commits into openshift:master on Nov 2, 2018

Conversation

@wking (Member) commented Oct 31, 2018:

Once the cluster is up, the bootstrap stuff is just a waste of resources. With this commit we make a reasonable effort to wait for the bootstrap-complete event, and, if we see it, we tear down the bootstrap resources.

The Kubernetes client initialization is based on this example and the local wait.go is based on kubernetes/kubernetes#50102. Both of those are, like this project, under the Apache 2.0 license. Once kubernetes/kubernetes#50102 lands, we can drop the local wait.go and use the upstream implementation.

Still a few FIXME nits to deal with, but I thought I'd put it up for initial review.
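
(For illustration only: a minimal sketch of the approach described above, assuming client-go v9.0.0. It builds a client from the generated kubeconfig and watches kube-system events until one named bootstrap-complete arrives. The PR itself relies on client-go's watch tooling, including UntilWithoutRetry, rather than this hand-rolled loop, and the kubeconfig path and helper name below are illustrative, not the installer's exact code.)

package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForBootstrapComplete builds a client from a kubeconfig and watches
// kube-system events until one named "bootstrap-complete" shows up or the
// timeout expires.
func waitForBootstrapComplete(kubeconfigPath string, timeout time.Duration) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	watcher, err := client.CoreV1().Events("kube-system").Watch(metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer watcher.Stop()

	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-watcher.ResultChan():
			if !ok {
				return fmt.Errorf("event watch closed before bootstrap-complete")
			}
			if ev.Type != watch.Added {
				continue
			}
			if event, ok := ev.Object.(*corev1.Event); ok && event.Name == "bootstrap-complete" {
				return nil
			}
		case <-deadline:
			return fmt.Errorf("timed out after %s waiting for the bootstrap-complete event", timeout)
		}
	}
}

func main() {
	// Illustrative usage; "auth/kubeconfig" stands in for wherever the installer wrote the kubeconfig.
	if err := waitForBootstrapComplete("auth/kubeconfig", 30*time.Minute); err != nil {
		fmt.Println(err)
	}
}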

@openshift-ci-robot added the do-not-merge/work-in-progress, approved, and size/XXL labels on Oct 31, 2018.
@abhinavdahiya (Contributor):

@wking can we close #574? The consumer for that code is this PR, so I'd rather evaluate the bump in this PR.

@wking force-pushed the bootstrap-teardown branch from ce15364 to 223f200 on October 31, 2018 04:16.
@wking changed the title from "WIP: cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete" to "cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete" on Oct 31, 2018.
@openshift-ci-robot removed the do-not-merge/work-in-progress label on Oct 31, 2018.
@wking (Member, Author) commented Oct 31, 2018:

I've pushed ce15364 -> 223f200 fixing the "wait for the API" timeout and some gofmt issues. I'm fine with this PR landing as it stands now; I can handle the in-memory kubeconfig/metadata transfer here or in follow-up work.

@wking (Member, Author) commented Oct 31, 2018:

e2e-aws:

2018/10/31 04:21:43 error: unable to signal to artifacts container to terminate in pod release-latest, triggering deletion: could not run remote command: unable to upgrade connection: container not found ("artifacts")

/test e2e-aws

There are still some gofmt issues as well. I'll get to those in the morning, but wanted to see if I could get a clean e2e-aws through during the quiet night so I can see what kind of logs we'll generate.

@sjenning (Contributor):

All I ask is that, if the install fails, the bootstrap node is not deleted, so we can debug. In the past, the bootstrap has sometimes been under the impression it completed when, in fact, it did not complete successfully.

@abhinavdahiya (Contributor):

a7b6b9a we might not be able to update to these versions because of all the cluster-api stuff in #573

@abhinavdahiya (Contributor):

@wking, watches were not resilient when we're watching the API that we're still bringing up.

@wking force-pushed the bootstrap-teardown branch 2 times, most recently from c898075 to ca9ab23, on October 31, 2018 16:14.
@wking (Member, Author) commented Oct 31, 2018:

> a7b6b9a we might not be able to update to these versions because of all the cluster-api stuff in #573

How can we decouple #573 from the apimachinery version? Ideally we'd just be pulling in stable types there, and I'd rather not pin mutable client-go logic to an older version because of some static types. Maybe we need to pull the validation stuff out into a separate package so we can drop this?

@abhinavdahiya (Contributor):

> I'd rather not pin mutable client-go logic to an older version because of some static types.

apimachinery is tied to api (static types), isn't it?

The same case can be made in reverse: why bump apimachinery for the client-go bump?

@wking (Member, Author) commented Oct 31, 2018:

> I'd rather not pin mutable client-go logic to an older version because of some static types.

> apimachinery is tied to api (static types), isn't it?

But don't those static types persist into future versions of apimachinery? I'm not clear on how Kubernetes deprecates old types; I'd just assumed that it kept them around for backwards-compat.

> The same case can be made in reverse: why bump apimachinery for the client-go bump?

Are they that decoupled? If I can do that, I'm happy to bump client-go here and leave apimachinery alone.

@abhinavdahiya (Contributor):

Can we try with just the client-go bump, or make client-go an override rather than a constraint?
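
(For reference, the override-versus-constraint distinction in dep's Gopkg.toml: a constraint only applies to direct imports and must be reconcilable with every dependency's own constraints, while an override wins unconditionally. The stanzas below are illustrative; the exact version markers the installer ended up pinning may differ.)

# A constraint is only honored for packages this project imports directly,
# and must be satisfiable alongside all transitive constraints.
[[constraint]]
  name = "k8s.io/client-go"
  version = "v9.0.0"

# An override applies unconditionally and silences conflicting constraints
# coming from transitive dependencies (e.g. the cluster-api vendoring).
[[override]]
  name = "k8s.io/client-go"
  version = "v9.0.0"

With an override in place, dep stops trying to reconcile other projects' client-go pins against this one, which is why it is the escape hatch suggested here.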

@abhinavdahiya (Contributor):

Also do you think we can squash all the vendor bumps together?

@wking (Member, Author) commented Oct 31, 2018:

Hmm, this approach is going to break down if the caller hasn't set up DNS. For example, see #93 where @cgwalters had not set up DNS. Are we ok hanging for 30 minutes waiting for something that will never resolve? Or do I need to catch that error and bail early if I see it?

@abhinavdahiya (Contributor):

> Hmm, this approach is going to break down if the caller hasn't set up DNS. For example, see #93 where @cgwalters had not set up DNS. Are we ok hanging for 30 minutes waiting for something that will never resolve? Or do I need to catch that error and bail early if I see it?

Distinguishing "can't resolve DNS now" from "will never resolve DNS" is a problem for both AWS and libvirt. On AWS, DNS resolution can take quite a while to propagate, depending on your domain's SOA, so I think it's fine to wait for 30 minutes.

The user can always decide to cancel the wait and exit.
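
(A rough sketch of that wait-with-escape-hatch behavior in Go, assuming only the standard library's context and os/signal packages; waitWithCancel and its callback are illustrative names rather than the installer's actual API.)

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"time"
)

// waitWithCancel runs fn under a 30-minute deadline, and also cancels early if
// the user hits Ctrl-C, so a DNS setup that will never resolve doesn't force a
// full 30-minute hang.
func waitWithCancel(fn func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	interrupts := make(chan os.Signal, 1)
	signal.Notify(interrupts, os.Interrupt)
	defer signal.Stop(interrupts)
	go func() {
		<-interrupts
		cancel() // the user decided to cancel the wait and exit
	}()

	return fn(ctx)
}

func main() {
	// Illustrative usage: the callback stands in for the real API/event wait.
	err := waitWithCancel(func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
	fmt.Println(err)
}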

@wking (Member, Author) commented Oct 31, 2018:

> Can we try with just the client-go bump, or make client-go an override rather than a constraint?

Looks like a no-go:

$ dep ensure
Solving failure: package k8s.io/apimachinery/pkg/util/naming does not exist within project k8s.io/apimachinery

naming is new in 1.12 (kubernetes/apimachinery@3c4948ecf2) and we're pulling it with:

$ git grep util/naming vendor
vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:	"k8s.io/apimachinery/pkg/util/naming"
vendor/k8s.io/client-go/tools/cache/reflector.go:	"k8s.io/apimachinery/pkg/util/naming"
$ git grep tools/cache vendor/k8s.io/client-go/tools/watch
vendor/k8s.io/client-go/tools/watch/informerwatcher.go:	"k8s.io/client-go/tools/cache"
vendor/k8s.io/client-go/tools/watch/until.go:	"k8s.io/client-go/tools/cache"

So what do I need to do to catch cluster-API up to 1.12?

@abhinavdahiya (Contributor):

Let's try this: rebase on #573 with overrides for your changes and see if that works.

@wking force-pushed the bootstrap-teardown branch from ca9ab23 to 2fddb04 on October 31, 2018 22:50.
@wking (Member, Author) commented Oct 31, 2018:

Rebasing onto #573:

$ dep ensure
Solving failure: package k8s.io/api/autoscaling/v2beta2 does not exist within project k8s.io/api

I've pushed 2fddb04 squashing down to a single vendor commit anyway.

@wking (Member, Author) commented Nov 2, 2018:

I've pushed another little tweak to the logging with 00fa119 -> faee023 (now the bootstrap-complete event gets logged by the same Debugf as the rest of the add events).

@crawford (Contributor) commented Nov 2, 2018:

/lgtm

@openshift-ci-robot added the lgtm label on Nov 2, 2018.
@wking added 2 commits on November 2, 2018 14:08:
Once the cluster is up, the bootstrap stuff is just a waste of
resources.  With this commit we make a reasonable effort to wait for
the bootstrap-complete event, and, if we see it, we tear down the
bootstrap resources.

The Kubernetes client initialization is based on [1] and the local
wait.go is based on [2].  Both of those are, like this project, under
the Apache 2.0 license.  Once kubernetes/kubernetes#50102 lands, we
can drop the local wait.go and use the upstream implementation.

I've dropped the errors.Cause call from main.go, because it was
dropping the wrapped context, which is useful context for debugging.
That Cause call is from dc118f2 (cmd: switch to already vendored
cobra, 2018-10-08, openshift#429), but that commit doesn't explicitly motivate
the call.

[1]: https://github.com/kubernetes/client-go/blob/v9.0.0/examples/out-of-cluster-client-configuration/main.go
[2]: https://github.com/kubernetes/kubernetes/pull/50102/files
client-go 8.0.0 is the version that's compatible with Kubernetes 1.11 [1], and
we've been using Kubernetes 1.11.3 for apimachinery since 2a606ef
(vendor: bump apimachinery to 1.11.3, 2018-10-01, openshift#382).  client-go
9.0.0 is nominally compatible with Kubernetes 1.12 [2], and I want
this for UntilWithoutRetry [3] which is new in 9.0.0 [4].

I edited Gopkg.toml by hand and regenerated everything else with:

  $ dep ensure

using:

  $ dep version
  dep:
   version     : v0.5.0
   build date  :
   git hash    : 22125cf
   go version  : go1.10.3
   go compiler : gc
   platform    : linux/amd64
   features    : ImportDuringSolve=false

I need to bump apimachinery into the 1.12 channel as well (and now I match
client-go's declared dependency [2]) to avoid:

  $ dep ensure
  Solving failure: package k8s.io/apimachinery/pkg/util/naming does not exist within project k8s.io/apimachinery

naming is new in 1.12 [5] and we're pulling it with:

  $ git grep util/naming vendor
  vendor/k8s.io/apimachinery/pkg/runtime/scheme.go:				"k8s.io/apimachinery/pkg/util/naming"
  vendor/k8s.io/client-go/tools/cache/reflector.go:				"k8s.io/apimachinery/pkg/util/naming"
  $ git grep tools/cache vendor/k8s.io/client-go/tools/watch
  vendor/k8s.io/client-go/tools/watch/informerwatcher.go:	"k8s.io/client-go/tools/cache"
  vendor/k8s.io/client-go/tools/watch/until.go:						"k8s.io/client-go/tools/cache"

Then I needed to change client-go to an override to avoid:

  $ dep ensure
  Solving failure: package k8s.io/api/autoscaling/v2beta2 does not exist within project k8s.io/api

Besides the version bump, dep also pulled in libraries for waiting on
Kubernetes events.

[1]: kubernetes/client-go@7d04d0e
[2]: kubernetes/client-go@1638f89
[3]: https://github.com/kubernetes/client-go/blob/v9.0.0/tools/watch/until.go#L47-L58
[4]: kubernetes/client-go@cbdb98d
[5]: kubernetes/apimachinery@3c4948e
@wking force-pushed the bootstrap-teardown branch from faee023 to d9ae9b3 on November 2, 2018 21:08.
@openshift-ci-robot removed the lgtm label on Nov 2, 2018.
@wking (Member, Author) commented Nov 2, 2018:

Oops, rebased onto master with faee023 -> d9ae9b3, now that #573 has landed. @crawford, can you re-/lgtm?

@crawford (Contributor) commented Nov 2, 2018:

/lgtm

@openshift-ci-robot added the lgtm label on Nov 2, 2018.
@openshift-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot merged commit f7e6e08 into openshift:master on Nov 2, 2018.
@wking deleted the bootstrap-teardown branch on November 2, 2018 21:59.
wking added a commit to wking/openshift-release that referenced this pull request Nov 3, 2018
Don't run the tests before 'create cluster' exits successfully (if it
fails, the /tmp/shared/exit check will kill the 'test' container).
With this, we no longer have to wait on the shared kubeconfig, because
we know it was created before setup-success.  And once 'create
cluster' successfully blocks on the bootstrap-complete event [1], this
commit will ensure that we run our tests on the production control
plane, and not while the bootstrap control plane is still involved.

[1]: openshift/installer#579
     Although we've been seeing some issues with the re-watch logic
     from that pull request.  Once that logic solidifies, we can drop
     the API_UP check as well.
wking added a commit to wking/openshift-installer that referenced this pull request Dec 11, 2018
Since it landed in 127219f (cmd/openshift-install/create: Destroy
bootstrap on bootstrap-complete, 2018-10-30, openshift#579), the Kube-API
wait.Until loop has lacked "did we timeout?" error checking.  That
means that a hung bootstrap node could lead to logs like:

  ...
  DEBUG Still waiting for the Kubernetes API: Get https://wking-api.installer.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
  INFO Waiting 30m0s for the bootstrap-complete event...
  WARNING Failed to connect events watcher: Get https://wking-api.installer.testing:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 192.168.126.10:6443: connect: connection refused
  ...

where the Kube API never comes up and we waste 30 minutes of failed
event-watcher connection attempts anyway.  With this commit, we'll
fail if apiContext expires without an API coming up.  From [1]:

  If Done is not yet closed, Err returns nil.  If Done is closed, Err
  returns a non-nil error explaining why: Canceled if the context was
  canceled or DeadlineExceeded if the context's deadline passed.
  After Err returns a non-nil error, successive calls to Err return
  the same error.

[1]: https://golang.org/pkg/context/#Context
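
(For illustration, the pattern that commit describes looks roughly like the sketch below: poll the API under a deadline-bound context, then inspect the context's error so a deadline expiry becomes a hard failure instead of silently falling through to the event-watcher loop. The function name, poll interval, and log strings are illustrative rather than the installer's exact code.)

package example

import (
	"context"
	"time"

	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAPI polls the cluster's /version endpoint until the API answers or
// the 30-minute deadline passes, then turns a deadline expiry into an error.
func waitForAPI(ctx context.Context, client *kubernetes.Clientset) error {
	apiContext, cancel := context.WithTimeout(ctx, 30*time.Minute)
	defer cancel()

	wait.Until(func() {
		if _, err := client.Discovery().ServerVersion(); err == nil {
			logrus.Info("Kubernetes API is up")
			cancel() // stop polling; apiContext.Err() becomes context.Canceled
		} else {
			logrus.Debugf("Still waiting for the Kubernetes API: %v", err)
		}
	}, 2*time.Second, apiContext.Done())

	// DeadlineExceeded means the timeout fired before the API ever answered;
	// Canceled means we stopped the loop ourselves because it did.
	if err := apiContext.Err(); err == context.DeadlineExceeded {
		return errors.Wrap(err, "waiting for Kubernetes API")
	}
	return nil
}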