installer/*: Add --continue-on-error to destroy workflow #252

Merged

Conversation

eparis
Member

@eparis eparis commented Sep 14, 2018

This would let us skip destroying things like worker machines, which is
useful if the API never came up, since those resources can't exist yet.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 14, 2018
@eparis
Member Author

eparis commented Sep 14, 2018

@bison
Easy to see the value: start an install, then call tectonic destroy before the API is running.

@eparis
Member Author

eparis commented Sep 14, 2018

@bison
Contributor

bison commented Sep 14, 2018

Makes sense to me. I'm not an owner here though. You'll need someone else to approve.

@wking
Member

wking commented Sep 14, 2018

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/252/pull-ci-openshift-installer-e2e-aws/274 doesn't look like my fault

Pasting from here, so I can find this later if we see it again:

Failure [592.691 seconds]
[BeforeSuite] BeforeSuite 
/tmp/openshift/build-rpms/rpm/BUILD/origin-4.0.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/e2e.go:140

  Node 1 disappeared before completing BeforeSuite

  /tmp/openshift/build-rpms/rpm/BUILD/origin-4.0.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/e2e.go:140

Member

@wking wking left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 14, 2018
@wking
Member

wking commented Sep 14, 2018

Same "Node 1 disappeared before completing BeforeSuite" this time. I'll see if I can reproduce in a locally-launched cluster on master.

@wking
Member

wking commented Sep 14, 2018

This probably won't address the "Node 1 disappeared..." failure, because openshift/release/pull/1517 only updated the smoke config, but I'm going to kick these again now that it has landed:

/test e2e-aws
/test e2e-aws-smoke

@wking
Member

wking commented Sep 14, 2018

Smoke is worse:

2018/09/14 22:18:01 Ran for 6m13s
error: could not run steps: template e2e-aws-smoke has required parameter LOCAL_IMAGE_INSTALLER_SMOKE which is not defined

@wking
Member

wking commented Sep 14, 2018

We're back to an earlier release config with openshift/release#1518. Kicking again:

/retest

@wking
Member

wking commented Sep 14, 2018

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2018
@wking
Member

wking commented Sep 14, 2018

@crawford pointed out that this approach doesn't work well for CI. We want something like --keep-going to not die on in-cluster errors. Then CI can always set that flag, get full cleanup when the cluster is working, and get out-of-cluster cleanup when the cluster is broken.

@wking
Member

wking commented Sep 14, 2018

And e2e-aws is still dying with "Node 1 disappeared before completing BeforeSuite" :/.

@wking
Member

wking commented Sep 14, 2018

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 14, 2018
@eparis
Member Author

eparis commented Sep 14, 2018

As I think about it, we should probably make it something like "don't fail" instead of "skip", so we can use it in CI. Maybe something like:

This would let us skip destroying things like worker machines, which is
useful if the API never came up, since those resources can't exist yet.

I'm currently not passing continue-on-error down into the individual
steps, but we could do that later if we need more granularity.
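For illustration, here's a minimal sketch of what the top-level behaviour could look like. The destroyStep type, runDestroy helper, and step names below are hypothetical stand-ins, not the installer's actual code:

// A sketch of a destroy workflow that honours a continue-on-error flag.
// Everything here is illustrative; names and signatures are assumptions.
package main

import "log"

// destroyStep models one stage of a destroy workflow, e.g. scaling down
// workers or running the Terraform teardown.
type destroyStep struct {
    name string
    run  func() error
}

// runDestroy executes the steps in order. With continueOnError set, a
// failing step is logged and the remaining steps still run; the first
// error is returned at the end so callers still see that something failed.
func runDestroy(steps []destroyStep, continueOnError bool) error {
    var firstErr error
    for _, step := range steps {
        if err := step.run(); err != nil {
            if !continueOnError {
                return err
            }
            log.Printf("step %q failed, continuing: %v", step.name, err)
            if firstErr == nil {
                firstErr = err
            }
        }
    }
    return firstErr
}

func main() {
    steps := []destroyStep{
        {name: "scale down workers", run: func() error { return nil }},
        {name: "terraform destroy", run: func() error { return nil }},
    }
    if err := runDestroy(steps, true); err != nil {
        log.Fatal(err)
    }
}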
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 16, 2018
@enxebre
Member

enxebre commented Sep 17, 2018

For the uninstall story, how about we ping the API health check before trying to scale down? If the API is not responding, then just forget about the MachineSet and go ahead with the Terraform destroy.
cc @eparis @bison

@wking
Member

wking commented Sep 17, 2018

For the uninstall story, how about we ping the API health check before trying to scale down...

This is racy. What if the API dies right after you check it? I'd rather just try to scale down, but keep going if it fails (wking/openshift-installer@945e0a91c0b0fd).
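Roughly, the "just try it" approach looks like the sketch below; scaleDownWorkers and terraformDestroy are hypothetical stand-ins, not the functions in the linked commit:

package main

import "log"

// scaleDownWorkers is a stand-in for the in-cluster cleanup that talks to
// the API server, e.g. scaling the worker MachineSets to zero.
func scaleDownWorkers() error {
    // ... call the cluster API here ...
    return nil
}

// terraformDestroy is a stand-in for tearing down the out-of-cluster
// resources with Terraform.
func terraformDestroy() error {
    // ... run the Terraform teardown here ...
    return nil
}

func main() {
    // No up-front health check: attempt the in-cluster cleanup and tolerate
    // failure, so a dead (or dying) API doesn't block the rest of the teardown.
    if err := scaleDownWorkers(); err != nil {
        log.Printf("failed to scale down workers, continuing: %v", err)
    }
    if err := terraformDestroy(); err != nil {
        log.Fatalf("terraform destroy failed: %v", err)
    }
}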

@eparis eparis changed the title installer/*: allow skipping destruction of resources inside the API installer/*: Add --continue-on-error to destroy workflow Sep 17, 2018
clusterDestroyCommand = kingpin.Command("destroy", "Destroy an existing Tectonic cluster")
clusterDestroyDirFlag = clusterDestroyCommand.Flag("dir", "Cluster directory").Default(".").ExistingDir()
clusterDestroyContOnErr = clusterDestroyCommand.Flag("continue-on-error", "Log errors, but attempt to continue cleaning up the cluster. THIS MAY LEAK RESOURCES, because you may not have enough state left after a partial removal to be able to perform a second destroy.").Default("false").Bool()
Member

nit: Is there a precedent for --continue-on-error? --keep-going is in Make. Consistency with existing tools isn't critical, but it's nice to reuse existing wording where we can.

Member Author

I was thinking of spf13/pflag, which is, admittedly, obscure.

@wking
Member

wking commented Sep 17, 2018

/hold cancel

I'm fine with 23915f2 as it stands, or shifted back to --keep-going, either way. Does anyone else want to take a look before I /lgtm?

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 17, 2018
@crawford
Contributor

@wking Looking

@crawford
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 17, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, eparis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2018
@openshift-merge-robot openshift-merge-robot merged commit 4acb6c5 into openshift:master Sep 17, 2018
wking added a commit to wking/openshift-release that referenced this pull request Sep 18, 2018
…ontinue-on-error

Taking advantage of openshift/installer@23915f2e (installer/*: Add
--continue-on-error to destroy workflow, 2018-09-16,
openshift/installer#252) so we can reap at least most of our resources
even if the cluster doesn't come up enough for the machine API
operator to be able to destroy workers.  With various stages of
cluster health:

1. Cluster never comes up at all.
2. Cluster healthy enough to create workers.
3. Cluster healthy enough to destroy workers.

we're only worried about leakage in the space between 2 and 3.
Hopefully there isn't any space there, but without this commit we're
currently leaking resources from 1 as well.

The two-part destroy attempts are originally from 51df634 (Support an
aws installer CI job, 2018-06-07, openshift#928), although there's not much to
motivate them there.  With --continue-on-error destruction, we're
already trying pretty hard to clean everything up.  So excepting brief
network hiccups and such, I think a single pass is sufficient.  And
we'll want a better backstop to catch any resources that leak through
(e.g. orphaned workers), so I'm dropping the retry here.
wking added a commit to wking/openshift-installer that referenced this pull request Sep 19, 2018
Taking advantage of 23915f2 (installer/*: Add --continue-on-error to
destroy workflow, 2018-09-16, openshift#252) so we can reap at least most of
our resources even if the cluster doesn't come up enough for the
machine API operator to be able to destroy workers.  With various
stages of cluster health:

1. Cluster never comes up at all.
2. Cluster healthy enough to create workers.
3. Cluster healthy enough to destroy workers.

we're only worried about leakage in the space between 2 and 3.
Hopefully there isn't any space there, but without this commit we're
currently leaking resources from 1 as well.
@eparis eparis deleted the skip-in-cluster-destroy branch February 18, 2019 20:03