installer/*: Add --continue-on-error to destroy workflow #252
Conversation
@bison
/retest
Makes sense to me. I'm not an owner here though. You'll need someone else to approve.
Pasting from here, so I can find this later if we see it again:
/lgtm
Same "Node 1 disappeared before completing BeforeSuite" failure this time. I'll see if I can reproduce in a locally-launched cluster on master.
Probably won't address the "Node 1 disappeared..." failure, because it only updated the smoke config, but I'm going to kick these again now that openshift/release/pull/1517 has landed: /test e2e-aws
We're back to an earlier release config with openshift/release#1518. Kicking again: /retest
/hold |
@crawford pointed out that this approach doesn't work well for CI. We want something like
And e2e-aws is still dying with "Node 1 disappeared before completing BeforeSuite" :/.
/lgtm cancel |
As I think about it, we should probably make it something like "don't fail" instead of "skip", so we can use it in CI? Maybe
This would let us skip destroying things like worker machines, which is useful if the API never came up, since those resources can't exist yet. I'm currently not passing continue-on-error down into the individual steps, but we could do that later if we need more granularity.
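For illustration, here's a minimal sketch (hypothetical function and variable names, not the installer's actual code) of the behavior described above: run each destroy step in order, and when continue-on-error is set, log failures and keep going instead of aborting.

```go
package main

import (
	"errors"
	"log"
)

// runDestroySteps is a hypothetical sketch: run each destroy step in order.
// Without continue-on-error the first failure aborts the workflow; with it,
// failures are logged, the remaining steps still run, and a single aggregate
// error is reported at the end.
func runDestroySteps(steps []func() error, continueOnError bool) error {
	failed := false
	for _, step := range steps {
		if err := step(); err != nil {
			if !continueOnError {
				return err
			}
			log.Printf("destroy step failed, continuing: %v", err)
			failed = true
		}
	}
	if failed {
		return errors.New("one or more destroy steps failed")
	}
	return nil
}

func main() {
	// Example run: the first step fails (e.g. the API never came up), but the
	// rest of the teardown still happens because continueOnError is true.
	steps := []func() error{
		func() error { return errors.New("scale down workers: API never came up") },
		func() error { log.Print("terraform destroy: ok"); return nil },
	}
	if err := runDestroySteps(steps, true); err != nil {
		log.Printf("destroy finished with errors: %v", err)
	}
}
```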
Force-pushed from 2e8094f to 23915f2 (compare).
This is racy. What if the API dies right after you check it? I'd rather just try and scale down, but keep going if it fails (wking/openshift-installer@945e0a91c0b0fd).
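A small sketch of the difference being discussed (the function names below are stand-ins, not installer code): checking the API first is a classic check-then-act race, whereas simply attempting the scale-down and tolerating failure avoids it.

```go
package main

import (
	"errors"
	"log"
)

// Stand-in stubs for the real API check and worker scale-down (assumptions).
func apiReachable() bool      { return true }
func scaleDownWorkers() error { return errors.New("API died mid-request") }

func main() {
	// Racy check-then-act: the API can disappear between the check and the call.
	if apiReachable() {
		_ = scaleDownWorkers() // may still fail, and the failure is dropped
	}

	// Non-racy alternative: just attempt the call and treat failure as non-fatal.
	if err := scaleDownWorkers(); err != nil {
		log.Printf("could not scale down workers, continuing: %v", err)
	}
}
```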
clusterDestroyCommand = kingpin.Command("destroy", "Destroy an existing Tectonic cluster")
clusterDestroyDirFlag = clusterDestroyCommand.Flag("dir", "Cluster directory").Default(".").ExistingDir()
clusterDestroyContOnErr = clusterDestroyCommand.Flag("continue-on-error", "Log errors, but attempt to continue cleaning up the cluster. THIS MAY LEAK RESOURCES, because you may not have enough state left after a partial removal to be able to perform a second destroy.").Default("false").Bool()
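For context, a minimal self-contained sketch of how flag definitions like these are typically consumed with kingpin; the dispatch code below is assumed for illustration and is not taken from the PR.

```go
package main

import (
	"fmt"

	"gopkg.in/alecthomas/kingpin.v2"
)

var (
	clusterDestroyCommand   = kingpin.Command("destroy", "Destroy an existing Tectonic cluster")
	clusterDestroyDirFlag   = clusterDestroyCommand.Flag("dir", "Cluster directory").Default(".").ExistingDir()
	clusterDestroyContOnErr = clusterDestroyCommand.Flag("continue-on-error", "Log errors, but attempt to continue cleaning up the cluster.").Default("false").Bool()
)

func main() {
	// kingpin.Parse() returns the name of the selected command and fills in
	// the flag pointers declared above.
	switch kingpin.Parse() {
	case clusterDestroyCommand.FullCommand():
		fmt.Printf("destroying cluster in %s (continue-on-error=%v)\n",
			*clusterDestroyDirFlag, *clusterDestroyContOnErr)
	}
}
```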
nit: Is there a precedent for --continue-on-error? --keep-going is in Make. Consistency with existing tools isn't critical, but it's nice to reuse existing wording where we can.
I was thinking of spf13/pflag, which is, admittedly, obscure.
/hold cancel
I'm fine with 23915f2 as it stands, or shifted back to
@wking Looking
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: crawford, eparis
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
…ontinue-on-error
Taking advantage of openshift/installer@23915f2e (installer/*: Add --continue-on-error to destroy workflow, 2018-09-16, openshift/installer#252) so we can reap at least most of our resources even if the cluster doesn't come up enough for the machine API operator to be able to destroy workers.
With various stages of cluster health:
1. Cluster never comes up at all.
2. Cluster healthy enough to create workers.
3. Cluster healthy enough to destroy workers.
we're only worried about leakage in the space between 2 and 3. Hopefully there isn't any space there, but without this commit we're currently leaking resources from 1 as well.
The two-part destroy attempts are originally from 51df634 (Support an aws installer CI job, 2018-06-07, openshift#928), although there's not much to motivate them there. With --continue-on-error destruction, we're already trying pretty hard to clean everything up. So excepting brief network hiccups and such, I think a single pass is sufficient. And we'll want a better backstop to catch any resources that leak through (e.g. orphaned workers), so I'm dropping the retry here.