installer/*: Add --continue-on-error to destroy workflow #252

Merged

Conversation

eparis
Member

@eparis eparis commented Sep 14, 2018

This would let us skip destroying things like worker machines, which is
useful if the API never came up, since those resources can't exist yet.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 14, 2018
@eparis
Member Author

eparis commented Sep 14, 2018

@bison
Easy to see the value: start an install, then call tectonic destroy before the API is running.

@eparis
Member Author

eparis commented Sep 14, 2018

@bison
Contributor

bison commented Sep 14, 2018

Makes sense to me. I'm not an owner here though. You'll need someone else to approve.

@wking
Member

wking commented Sep 14, 2018

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/252/pull-ci-openshift-installer-e2e-aws/274 doesn't look like my fault

Pasting from here, so I can find this later if we see it again:

Failure [592.691 seconds]
[BeforeSuite] BeforeSuite 
/tmp/openshift/build-rpms/rpm/BUILD/origin-4.0.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/e2e.go:140

  Node 1 disappeared before completing BeforeSuite

  /tmp/openshift/build-rpms/rpm/BUILD/origin-4.0.0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/e2e.go:140

Member

@wking wking left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 14, 2018
@wking
Member

wking commented Sep 14, 2018

Same "Node 1 disappeared before completing BeforeSuite" this time. I'll see if I can reproduce in a locally-launched cluster on master.

@wking
Member

wking commented Sep 14, 2018

This probably won't address the "Node 1 disappeared..." failure, because openshift/release/pull/1517 only updated the smoke config, but I'm going to kick these again now that it has landed:

/test e2e-aws
/test e2e-aws-smoke

@wking
Member

wking commented Sep 14, 2018

Smoke is worse:

2018/09/14 22:18:01 Ran for 6m13s
error: could not run steps: template e2e-aws-smoke has required parameter LOCAL_IMAGE_INSTALLER_SMOKE which is not defined

@wking
Member

wking commented Sep 14, 2018

We're back to an earlier release config with openshift/release#1518. Kicking again:

/retest

@wking
Member

wking commented Sep 14, 2018

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2018
@wking
Member

wking commented Sep 14, 2018

@crawford pointed out that this approach doesn't work well for CI. We want something like --keep-going to not die on in-cluster errors. Then CI can always set that flag, get full cleanup when the cluster is working, and get out-of-cluster cleanup when the cluster is broken.

@wking
Member

wking commented Sep 14, 2018

And e2e-aws is still dying with "Node 1 disappeared before completing BeforeSuite" :/.

@wking
Member

wking commented Sep 14, 2018

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Sep 14, 2018
@eparis
Member Author

eparis commented Sep 14, 2018

As I think about it, we should probably make it something like "don't fail" instead of "skip", so we can use it in CI. Maybe something like:

This would let us skip destroying things like worker machines, which is
useful if the API never came up, since those resources can't exist yet.

I'm currently not passing continue-on-error down into the individual
steps, but we could do that later if we need more granularity.
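For illustration, here's a minimal sketch of what the top-level behaviour could look like. The destroyStep type, runDestroy helper, and step names below are hypothetical stand-ins, not the installer's actual code:

// A sketch of a destroy workflow that honours a continue-on-error flag.
// Everything here is illustrative; names and signatures are assumptions.
package main

import "log"

// destroyStep models one stage of a destroy workflow, e.g. scaling down
// workers or running the Terraform teardown.
type destroyStep struct {
    name string
    run  func() error
}

// runDestroy executes the steps in order. With continueOnError set, a
// failing step is logged and the remaining steps still run; the first
// error is returned at the end so callers still see that something failed.
func runDestroy(steps []destroyStep, continueOnError bool) error {
    var firstErr error
    for _, step := range steps {
        if err := step.run(); err != nil {
            if !continueOnError {
                return err
            }
            log.Printf("step %q failed, continuing: %v", step.name, err)
            if firstErr == nil {
                firstErr = err
            }
        }
    }
    return firstErr
}

func main() {
    steps := []destroyStep{
        {name: "scale down workers", run: func() error { return nil }},
        {name: "terraform destroy", run: func() error { return nil }},
    }
    if err := runDestroy(steps, true); err != nil {
        log.Fatal(err)
    }
}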
@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Sep 16, 2018
@enxebre
Member

enxebre commented Sep 17, 2018

For the uninstall story, how about we ping the API health check before trying to scale down? If the API is not responding, then just forget about the MachineSet and go ahead with the Terraform destroy.
cc @eparis @bison

@wking
Member

wking commented Sep 17, 2018

For the uninstall story, how about we ping the API health check before trying to scale down...

This is racy. What if the API dies right after you check it? I'd rather just try to scale down, but keep going if it fails (wking/openshift-installer@945e0a91c0b0fd).
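Roughly, the "just try it" approach looks like the sketch below; scaleDownWorkers and terraformDestroy are hypothetical stand-ins, not the functions in the linked commit:

package main

import "log"

// scaleDownWorkers is a stand-in for the in-cluster cleanup that talks to
// the API server, e.g. scaling the worker MachineSets to zero.
func scaleDownWorkers() error {
    // ... call the cluster API here ...
    return nil
}

// terraformDestroy is a stand-in for tearing down the out-of-cluster
// resources with Terraform.
func terraformDestroy() error {
    // ... run the Terraform teardown here ...
    return nil
}

func main() {
    // No up-front health check: attempt the in-cluster cleanup and tolerate
    // failure, so a dead (or dying) API doesn't block the rest of the teardown.
    if err := scaleDownWorkers(); err != nil {
        log.Printf("failed to scale down workers, continuing: %v", err)
    }
    if err := terraformDestroy(); err != nil {
        log.Fatalf("terraform destroy failed: %v", err)
    }
}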

@eparis eparis changed the title installer/*: allow skipping destruction of resources inside the API installer/*: Add --continue-on-error to destroy workflow Sep 17, 2018
clusterDestroyCommand = kingpin.Command("destroy", "Destroy an existing Tectonic cluster")
clusterDestroyDirFlag = clusterDestroyCommand.Flag("dir", "Cluster directory").Default(".").ExistingDir()
clusterDestroyContOnErr = clusterDestroyCommand.Flag("continue-on-error", "Log errors, but attempt to continue cleaning up the cluster. THIS MAY LEAK RESOURCES, because you may not have enough state left after a partial removal to be able to perform a second destroy.").Default("false").Bool()
Member

nit: Is there a precedent for --continue-on-error? --keep-going is in Make. Consistency with existing tools isn't critical, but it's nice to reuse existing wording where we can.

Member Author

I was thinking of spf13/pflag, which is, admittedly, obscure.

@wking
Member

wking commented Sep 17, 2018

/hold cancel

I'm fine with 23915f2 as it stands, or shifted back to --keep-going, either way. Does anyone else want to take a look before I /lgtm?

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 17, 2018
@crawford
Contributor

@wking Looking

@crawford
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 17, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crawford, eparis

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2018
@openshift-merge-robot openshift-merge-robot merged commit 4acb6c5 into openshift:master Sep 17, 2018
wking added a commit to wking/openshift-release that referenced this pull request Sep 18, 2018
…ontinue-on-error

Taking advantage of openshift/installer@23915f2e (installer/*: Add
--continue-on-error to destroy workflow, 2018-09-16,
openshift/installer#252) so we can reap at least most of our resources
even if the cluster doesn't come up enough for the machine API
operator to be able to destroy workers.  With various stages of
cluster health:

1. Cluster never comes up at all.
2. Cluster healthy enough to create workers.
3. Cluster healthy enough to destroy workers.

we're only worried about leakage in the space between 2 and 3.
Hopefully there isn't any space there, but without this commit we're
currently leaking resources from 1 as well.

The two-part destroy attempts are originally from 51df634 (Support an
aws installer CI job, 2018-06-07, openshift#928), although there's not much to
motivate them there.  With --continue-on-error destruction, we're
already trying pretty hard to clean everything up.  So excepting brief
network hiccups and such, I think a single pass is sufficient.  And
we'll want a better backstop to catch any resources that leak through
(e.g. orphaned workers), so I'm dropping the retry here.
wking added a commit to wking/openshift-installer that referenced this pull request Sep 19, 2018
Taking advantage of 23915f2 (installer/*: Add --continue-on-error to
destroy workflow, 2018-09-16, openshift#252) so we can reap at least most of
our resources even if the cluster doesn't come up enough for the
machine API operator to be able to destroy workers.  With various
stages of cluster health:

1. Cluster never comes up at all.
2. Cluster healthy enough to create workers.
3. Cluster healthy enough to destroy workers.

we're only worried about leakage in the space between 2 and 3.
Hopefully there isn't any space there, but without this commit we're
currently leaking resources from 1 as well.
@eparis eparis deleted the skip-in-cluster-destroy branch February 18, 2019 20:03