Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal #2979

stbenjam · 2020-01-24T16:02:33Z

Baremetal servers differ significantly from other platforms, especially
due to the length of time it can take to boot real hardware: servers go
through POST (power-on self-test) and hardware initialization that
can take many minutes to complete. When deploying the control plane as
well as workers in the installer, it reliably takes more than 30
minutes.

Recently, we added code to the installer and machine-api-operator that
now allows the installer to deploy workers on day 1. With this change,
even virtualized baremetal runs up against the 30-minute time limit. On
real baremetal servers, the install process is all but guaranteed to
take closer to 45 minutes.

stbenjam · 2020-01-24T16:04:45Z

Sorry to open this can of worms again.

I know this has been a recurring discussion, but it's a real problem for baremetal now that we try to deploy workers on day 1: we need the control plane to come up, and then we deploy a worker afterwards, which together takes more than 30 minutes but less than 60.

You can see our repeated CI failures due to timeouts when deploying workers on day 1, and success when we ran a second wait-for install-complete:
openshift-metal3/dev-scripts#897

sdodson · 2020-01-24T20:05:09Z

In UPI workflows, where the admin may have to go reconfigure something like the registry to use alternative storage, we absolutely expect these commands to be run multiple times. Unless we devise better mechanisms to differentiate between "wait longer" and terminal failures, I'd be really hesitant to double that timeout. Even in IPI, I think we should be leaning into the situation and making it clearer that the installer has done its best but the admin needs to take a look at the cluster to determine what's preventing it from completing in under 30 minutes. I think the perceived finality of wait-for install-complete is leading people to throw their hands up and walk away from salvageable clusters. We should be improving the messaging and troubleshooting documentation.

I've done the same in our e2e-metal jobs, FWIW: I try twice for both bootstrap-complete and install-complete. I don't see this as a bug, though.

stbenjam · 2020-01-24T20:37:15Z

Something close to 100% of real baremetal IPI clusters are going to take more than 45 minutes to install if they have worker replicas > 0.

The way the installer exits is pretty awful and makes it look like a catastrophic failure; I don't think that's a good user experience on baremetal. There's just not much we can do about how long it takes real servers to boot, and the default wait should cover most cases, IMHO.

stbenjam · 2020-01-24T20:43:29Z

What would it take to try a different approach, where we say that after 30 minutes a certain list of operators should be available, and then permit another 30 minutes for things like machine-api to deploy workers? That would let us identify clusters that are probably broken without making users wait an extra 30 minutes.
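
A minimal sketch of that two-phase idea, assuming the installer could get a map of operator availability at the 30-minute mark. The `requiredEarly` list and the `missingOperators` helper are hypothetical; which operators would actually belong on such a list is not decided in this thread.

```go
package main

import "fmt"

// requiredEarly is a hypothetical list of operators that would have to be
// available after the first 30 minutes for the wait to be extended.
var requiredEarly = []string{"kube-apiserver", "etcd", "machine-api"}

// missingOperators returns the required operators that are not yet available.
func missingOperators(required []string, available map[string]bool) []string {
	var missing []string
	for _, name := range required {
		if !available[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	// Example status at the 30-minute mark (made-up data).
	status := map[string]bool{
		"kube-apiserver": true,
		"etcd":           true,
		"machine-api":    true,
		"ingress":        false, // still waiting on workers
	}

	if missing := missingOperators(requiredEarly, status); len(missing) > 0 {
		fmt.Println("cluster is probably broken, not extending the wait; missing:", missing)
		return
	}
	fmt.Println("core operators available; allowing another 30 minutes for workers")
}
```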

sdodson · 2020-01-25T14:37:25Z

/retitle Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal
(not an endorsement that we should merge this, just linking things since there's a bug too)

openshift-ci-robot · 2020-01-25T14:37:28Z

@stbenjam: This pull request references Bugzilla bug 1794755, which is invalid:

expected the bug to target the "4.4.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

sdodson · 2020-01-25T14:39:36Z

/bugzilla refresh

openshift-ci-robot · 2020-01-25T14:39:42Z

@sdodson: This pull request references Bugzilla bug 1794755, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

dhellmann · 2020-01-29T18:07:33Z

> In UPI workflows, where the admin may have to go reconfigure something like the registry to use alternative storage, we absolutely expect these commands to be run multiple times. Unless we devise better mechanisms to differentiate between "wait longer" and terminal failures, I'd be really hesitant to double that timeout. Even in IPI, I think we should be leaning into the situation and making it clearer that the installer has done its best but the admin needs to take a look at the cluster to determine what's preventing it from completing in under 30 minutes. I think the perceived finality of wait-for install-complete is leading people to throw their hands up and walk away from salvageable clusters. We should be improving the messaging and troubleshooting documentation.
>
> I've done the same in our e2e-metal jobs, FWIW: I try twice for both bootstrap-complete and install-complete. I don't see this as a bug, though.

How do we expect Hive to figure out if a cluster is up all the way? Should it be running the command multiple times, too, and then maybe trying to access the cluster's API?

Does everything that wraps the installer have to do that for itself?

sdodson · 2020-01-29T19:39:09Z

Since this is limited in scope to the baremetal platform, we'll accept this.
/approve
/lgtm

openshift-ci-robot · 2020-01-29T19:40:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [sdodson]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2020-01-29T19:45:07Z

/retest

Please review the full test history for this PR and help us cut down flakes.

stbenjam · 2020-01-29T19:54:22Z

> Since this is limited in scope to the baremetal platform, we'll accept this.

Thanks. At some point, I think it'd be worth having a discussion about how to bubble up more information from the CVO to the installer: having someone sit around for an hour with a cluster that we might've been able to determine at minute 15 was never going to succeed is obviously less than ideal. Although I know that's easier said than done.

sdodson · 2020-01-29T20:02:22Z

> Does everything that wraps the installer have to do that for itself?

I think that is highly dependent on the expectations of whatever is calling the installer. Really, all the installer is capable of reporting is that it waited $x minutes and $y operators did not achieve their goals. The installer doesn't know if they will eventually achieve their goals, or even why they weren't able to, so it shouldn't be seen as a terminal failure in its current form.

> Thanks. At some point, I think it'd be worth having a discussion about how to bubble up more information from the CVO to the installer: having someone sit around for an hour with a cluster that we might've been able to determine at minute 15 was never going to succeed is obviously less than ideal.

Yeah, I think this is becoming necessary.

openshift-bot · 2020-01-29T20:24:18Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-01-29T21:16:41Z

/retest

Please review the full test history for this PR and help us cut down flakes.

sdodson · 2020-01-29T21:19:30Z

/lgtm cancel

stbenjam · 2020-01-29T21:29:51Z

Import ordering is fixed.

stbenjam · 2020-01-31T17:29:45Z

@abhinavdahiya PTAL, I've addressed your concern regarding imports

sdodson · 2020-02-03T20:24:46Z

/lgtm

openshift-bot · 2020-02-03T20:26:46Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-02-03T20:39:37Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-02-03T21:05:20Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-02-03T21:44:15Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-02-03T22:30:25Z

/retest

Please review the full test history for this PR and help us cut down flakes.

abhinavdahiya · 2020-02-03T23:13:44Z

/skip

openshift-bot · 2020-02-03T23:15:17Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-02-04T00:00:12Z

@stbenjam: All pull requests linked via external trackers have merged. Bugzilla bug 1794755 has been moved to the MODIFIED state.

In response to this:

Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal

openshift-ci-robot · 2020-02-04T00:30:31Z

@stbenjam: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-libvirt | `3d87766` | link | `/test e2e-libvirt` |
| ci/prow/e2e-aws-scaleup-rhel7 | `3d87766` | link | `/test e2e-aws-scaleup-rhel7` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

e-minguez · 2020-02-17T13:35:34Z

Can we have this backported to 4.3? Thanks!

stbenjam · 2020-02-17T13:42:12Z

/cherry-pick release-4.3

openshift-cherrypick-robot · 2020-02-17T13:42:33Z

@stbenjam: new pull request created: #3116

In response to this:

/cherry-pick release-4.3

sdodson · 2020-02-17T13:44:40Z

> Can we have this backported to 4.3? Thanks!

Just FYI, this only applies to the baremetal platform, which is only used for baremetal IPI. It is not used by baremetal UPI, which uses a platform of none.

e-minguez · 2020-02-17T13:49:09Z

> > Can we have this backported to 4.3? Thanks!
>
> Just FYI, this only applies to the baremetal platform, which is only used for baremetal IPI. It is not used by baremetal UPI, which uses a platform of none.

Thanks! Baremetal IPI is what we use here (https://github.com/openshift-kni/baremetal-deploy), and even though the focus is 4.4, we are still deploying 4.3 these days.

openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) Jan 24, 2020

openshift-ci-robot requested review from jstuever and mtnbikenc January 24, 2020 16:03

openshift-ci-robot changed the title ~~cmd/openshift-install/create: wait 60 minutes for baremetal~~ Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal Jan 25, 2020

openshift-ci-robot added the bugzilla/invalid-bug label (indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) Jan 25, 2020

openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) and removed the bugzilla/invalid-bug label Jan 25, 2020

stbenjam mentioned this pull request Jan 29, 2020

baremetal: increasing failures due to timeouts #2741

Closed

openshift-ci-robot assigned sdodson Jan 29, 2020

openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Jan 29, 2020

openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jan 29, 2020

abhinavdahiya reviewed Jan 29, 2020

Review comment on cmd/openshift-install/create.go (outdated)

openshift-ci-robot removed the lgtm label Jan 29, 2020

stbenjam force-pushed the baremetal-timeout branch from acde421 to eb11ffe January 29, 2020 21:28

stbenjam force-pushed the baremetal-timeout branch from eb11ffe to 3d87766 January 29, 2020 21:29

stbenjam requested a review from abhinavdahiya January 31, 2020 17:29

openshift-ci-robot added the lgtm label Feb 3, 2020

openshift-merge-robot merged commit 406e907 into openshift:master Feb 4, 2020

openshift-cherrypick-robot mentioned this pull request Feb 17, 2020

[release-4.3] Bug 1803805: cmd/openshift-install/create: wait 60 minutes for baremetal #3116

Merged

hardys mentioned this pull request Jun 8, 2022

Bug 2090816: Make bootstrap timeout configurable #5979

Closed
