cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete #579
Conversation
Force-pushed from ce15364 to 223f200.
I've pushed ce15364 -> 223f200 fixing the "wait for the API" timeout and some …
/test e2e-aws

There are still some …
All I ask is that if the install fails, the bootstrap node is not deleted, so we can debug. It seems like in the past the bootstrap node was under the impression it completed when, in fact, it did not complete successfully.
@wking watches were not resilient when watching the API we are still bringing up.
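The re-watch idea behind that comment can be sketched as a small retry loop. This is only an illustration; `retryWatch` and its `connect` parameter are hypothetical names, not installer API, and the actual backoff details in the PR may differ.

```go
package example

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)

// retryWatch re-runs connect (which should establish a watch and block until
// it ends) until connect succeeds or ctx is done.
func retryWatch(ctx context.Context, connect func(context.Context) error) error {
	for {
		err := connect(ctx)
		if err == nil {
			return nil
		}
		logrus.Warningf("Failed to connect events watcher: %v", err)
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(2 * time.Second): // brief pause before reconnecting
		}
	}
}
```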
Force-pushed from c898075 to ca9ab23.
How can we decouple #573 from the apimachinery version? Ideally we'd just be pulling in stable types there, and I'd rather not pin mutable client-go logic to an older version because of some static types. Maybe we need to pull the validation stuff out into a separate package so we can drop this?
apimachinery is tied to api (static types)?? The same case can be made for why we'd bump apimachinery for a client-go bump.
But don't those static types persist into future versions of apimachinery? I'm not clear on how Kubernetes deprecates old types; I'd just assumed that it kept them around for backwards-compat.
Are they that decoupled? If I can do that, I'm happy to bump client-go here and leave apimachinery alone.
Can we try with just a client-go bump, or make client-go an override rather than a constraint?
Also, do you think we can squash all the vendor bumps together?
Hmm, this approach is going to break down if the caller hasn't set up DNS. For example, see #93 where @cgwalters had not set up DNS. Are we ok hanging for 30 minutes waiting for something that will never resolve? Or do I need to catch that error and bail early if I see it?
"Can't resolve DNS now" versus "will never resolve DNS" is a problem for both AWS and libvirt. On AWS, DNS resolution might take pretty long to propagate depending on your domain's SOA, so I think it is fine to wait for 30 minutes; the user can decide to cancel the wait and exit...
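One way to read that trade-off is a 30-minute cap that the user can still interrupt. A minimal sketch, assuming the wait is driven by a context; `waitContext` is an illustrative helper, not code from this PR.

```go
package example

import (
	"context"
	"os"
	"os/signal"
	"time"
)

// waitContext caps the wait at 30 minutes but also lets a Ctrl-C from the
// user cancel it early (e.g. DNS that will never resolve and they do not
// want to sit out the full timeout).
func waitContext(parent context.Context) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithTimeout(parent, 30*time.Minute)

	interrupt := make(chan os.Signal, 1)
	signal.Notify(interrupt, os.Interrupt)
	go func() {
		select {
		case <-interrupt:
			cancel()
		case <-ctx.Done():
		}
		signal.Stop(interrupt)
	}()

	return ctx, cancel
}
```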
Looks like a no-go:

    $ dep ensure
    Solving failure: package k8s.io/apimachinery/pkg/util/naming does not exist within project k8s.io/apimachinery
    $ git grep util/naming vendor
    vendor/k8s.io/apimachinery/pkg/runtime/scheme.go: "k8s.io/apimachinery/pkg/util/naming"
    vendor/k8s.io/client-go/tools/cache/reflector.go: "k8s.io/apimachinery/pkg/util/naming"
    $ git grep tools/cache vendor/k8s.io/client-go/tools/watch
    vendor/k8s.io/client-go/tools/watch/informerwatcher.go: "k8s.io/client-go/tools/cache"
    vendor/k8s.io/client-go/tools/watch/until.go: "k8s.io/client-go/tools/cache"

So what do I need to do to catch cluster-API up with 1.12?
Let's try this: rebase on #573 with overrides for your changes and see if that works?
Force-pushed from ca9ab23 to 2fddb04.
Rebasing onto #573:

    $ dep ensure
    Solving failure: package k8s.io/api/autoscaling/v2beta2 does not exist within project k8s.io/api

I've pushed 2fddb04 squashing down to a single vendor commit anyway.
I've pushed another little tweak to the logging with 00fa119 -> faee023 (now the …)
/lgtm
Once the cluster is up, the bootstrap stuff is just a waste of resources. With this commit we make a reasonable effort to wait for the bootstrap-complete event, and, if we see it, we tear down the bootstrap resources.

The Kubernetes client initialization is based on [1] and the local wait.go is based on [2]. Both of those are, like this project, under the Apache 2.0 license. Once kubernetes/kubernetes#50102 lands, we can drop the local wait.go and use the upstream implementation.

I've dropped the errors.Cause call from main.go, because it was dropping the wrapped context, which is useful for debugging. That Cause call is from dc118f2 (cmd: switch to already vendored cobra, 2018-10-08, openshift#429), but that commit doesn't explicitly motivate the call.

[1]: https://github.com/kubernetes/client-go/blob/v9.0.0/examples/out-of-cluster-client-configuration/main.go
[2]: https://github.com/kubernetes/kubernetes/pull/50102/files
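For readers who want to see the shape of the flow this commit message describes, here is a hedged sketch of the client initialization and the event wait, assuming client-go v9.0.0 (pre-context `Watch` signatures). The namespace and event name follow the commit text; the function name, poll structure, and error handling are illustrative rather than the installer's exact code.

```go
package example

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	watchtools "k8s.io/client-go/tools/watch"
)

// waitForBootstrapComplete builds a client from the generated kubeconfig and
// blocks until the bootstrap-complete event shows up in kube-system.
func waitForBootstrapComplete(kubeconfigPath string) error {
	// Client initialization, as in the out-of-cluster example referenced in [1].
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	// Watch kube-system events and stop once we see bootstrap-complete.
	w, err := client.CoreV1().Events("kube-system").Watch(metav1.ListOptions{})
	if err != nil {
		return err
	}
	_, err = watchtools.UntilWithoutRetry(ctx, w, func(event watch.Event) (bool, error) {
		ev, ok := event.Object.(*corev1.Event)
		if !ok {
			return false, nil
		}
		return ev.Name == "bootstrap-complete", nil
	})
	if err != nil {
		return fmt.Errorf("waiting for bootstrap-complete: %v", err)
	}
	// At this point the caller would tear down the bootstrap resources.
	return nil
}
```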
8.0.0 is the version that's compatible with Kubernetes 1.11 [1], and we've been using Kubernetes 1.11.3 for apimachinery since 2a606ef (vendor: bump apimachinery to 1.11.3, 2018-10-01, openshift#382). client-go 9.0.0 is nominally compatible with Kubernetes 1.12 [2], and I want this for UntilWithoutRetry [3], which is new in 9.0.0 [4].

I edited Gopkg.toml by hand and regenerated everything else with:

    $ dep ensure

using:

    $ dep version
    dep:
     version     : v0.5.0
     build date  :
     git hash    : 22125cf
     go version  : go1.10.3
     go compiler : gc
     platform    : linux/amd64
     features    : ImportDuringSolve=false

I need to bump apimachinery into the 1.12 channel as well (and now I match client-go's declared dependency [2]) to avoid:

    $ dep ensure
    Solving failure: package k8s.io/apimachinery/pkg/util/naming does not exist within project k8s.io/apimachinery

naming is new in 1.12 [5] and we're pulling it with:

    $ git grep util/naming vendor
    vendor/k8s.io/apimachinery/pkg/runtime/scheme.go: "k8s.io/apimachinery/pkg/util/naming"
    vendor/k8s.io/client-go/tools/cache/reflector.go: "k8s.io/apimachinery/pkg/util/naming"
    $ git grep tools/cache vendor/k8s.io/client-go/tools/watch
    vendor/k8s.io/client-go/tools/watch/informerwatcher.go: "k8s.io/client-go/tools/cache"
    vendor/k8s.io/client-go/tools/watch/until.go: "k8s.io/client-go/tools/cache"

Then I needed to change client-go to an override to avoid:

    $ dep ensure
    Solving failure: package k8s.io/api/autoscaling/v2beta2 does not exist within project k8s.io/api

Besides the version bump, dep also pulled in libraries for waiting on Kubernetes events.

[1]: kubernetes/client-go@7d04d0e
[2]: kubernetes/client-go@1638f89
[3]: https://github.com/kubernetes/client-go/blob/v9.0.0/tools/watch/until.go#L47-L58
[4]: kubernetes/client-go@cbdb98d
[5]: kubernetes/apimachinery@3c4948e
Force-pushed from faee023 to d9ae9b3.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing …
Don't run the tests before 'create cluster' exits successfully (if it fails, the /tmp/shared/exit check will kill the 'test' container). With this, we no longer have to wait on the shared kubeconfig, because we know it was created before setup-success.

And once 'create cluster' successfully blocks on the bootstrap-complete event [1], this commit will ensure that we run our tests on the production control plane, and not while the bootstrap control plane is still involved. We have been seeing some issues with the re-watch logic from that pull request, though; once that logic solidifies, we can drop the API_UP check as well.

[1]: openshift/installer#579
Since it landed in 127219f (cmd/openshift-install/create: Destroy bootstrap on bootstrap-complete, 2018-10-30, openshift#579), the Kube-API wait.Until loop has lacked "did we timeout?" error checking. That means that a hung bootstrap node could lead to logs like:

    ...
    DEBUG Still waiting for the Kubernetes API: Get https://wking-api.installer.testing:6443/version?timeout=32s: dial tcp 192.168.126.10:6443: connect: connection refused
    INFO Waiting 30m0s for the bootstrap-complete event...
    WARNING Failed to connect events watcher: Get https://wking-api.installer.testing:6443/api/v1/namespaces/kube-system/events?watch=true: dial tcp 192.168.126.10:6443: connect: connection refused
    ...

where the Kube API never comes up and we waste 30 minutes of failed event-watcher connection attempts anyway. With this commit, we'll fail if apiContext expires without an API coming up. From [1]:

> If Done is not yet closed, Err returns nil. If Done is closed, Err returns a non-nil error explaining why: Canceled if the context was canceled or DeadlineExceeded if the context's deadline passed. After Err returns a non-nil error, successive calls to Err return the same error.

[1]: https://golang.org/pkg/context/#Context
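A sketch of the check this commit message describes, assuming the API poll is a `wait.Until` loop against the discovery `/version` endpoint; `waitForAPI` and its 2-second poll interval are illustrative, not the exact installer code.

```go
package example

import (
	"context"
	"time"

	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForAPI polls /version until it answers or the deadline passes, then
// checks apiContext.Err() so a hung bootstrap node fails here instead of
// falling through to the 30-minute event watch.
func waitForAPI(ctx context.Context, client kubernetes.Interface) error {
	apiContext, cancel := context.WithTimeout(ctx, 30*time.Minute)
	defer cancel()

	discovery := client.Discovery()
	wait.Until(func() {
		if _, err := discovery.ServerVersion(); err == nil {
			cancel() // the API answered; stop polling
		} else {
			logrus.Debugf("Still waiting for the Kubernetes API: %v", err)
		}
	}, 2*time.Second, apiContext.Done())

	// cancel() above produces context.Canceled (success); only a deadline
	// expiry without the API coming up is reported as an error.
	if err := apiContext.Err(); err != nil && err != context.Canceled {
		return errors.Wrap(err, "waiting for Kubernetes API")
	}
	return nil
}
```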
Once the cluster is up, the bootstrap stuff is just a waste of resources. With this commit we make a reasonable effort to wait for the `bootstrap-complete` event, and, if we see it, we tear down the bootstrap resources.

The Kubernetes client initialization is based on this example and the local `wait.go` is based on kubernetes/kubernetes#50102. Both of those are, like this project, under the Apache 2.0 license. Once kubernetes/kubernetes#50102 lands, we can drop the local `wait.go` and use the upstream implementation.

Still a few FIXME nits to deal with, but I thought I'd put it up for initial review.