Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal #2979

Merged
merged 1 commit into from
Feb 4, 2020

Conversation

stbenjam
Copy link
Member

Baremetal servers differ significantly from other platforms, especially
due to the length of time it can take to boot real hardware: servers go
through POST (power-on self checks) and hardware initializations that
can take many minutes to complete. When deploying the control plane as
well as workers in the installer, it reliably takes more than 30
minutes.

Recently, we added code to the installer and machine-api-operator that
now allows the installer to deploy workers on day 1. This change even on
virtualized baremetal is running up against the 30-minute time limit. On
real baremetal servers, it's guaranteed the install process is closer to
45 minutes.

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 24, 2020
@stbenjam
Copy link
Member Author

stbenjam commented Jan 24, 2020

Sorry to open this can of worms again.

I know this has been a recurring discussion but it's definitely a real problem for baremetal now when we try to deploy workers as day 1, as we need the control plane to come up, and then we deploy a worker after which takes > 30 minutes but < 60.

You can see our repeated CI failures due to timeouts when doing workers as day 1, and success when we did a second wait-for install-complete
openshift-metal3/dev-scripts#897

@sdodson
Copy link
Member

sdodson commented Jan 24, 2020

In UPI workflows where the admin may have to go reconfigure something like the registry to use alternative storage we absolutely expect these commands to be run multiple times and unless we devise better mechanisms to differentiate between wait longer and terminal failures I'd be really hesitant to double that timeout. Even in IPI I think we should be leaning into the situation and make it clearer that the installer has done its best but the admin needs to take a look at the cluster to determine what's preventing it from completing in under 30 minutes. I think the perceived finality of wait-for install-complete is leading people to throw their hands up and walk away from salvageable clusters. We should be improving the messaging and troubleshooting documentation.

I've done the same in our e2e-metal jobs FWIW, i try twice for both bootstrap-complete and install-complete. I don't see this as a bug though.

@stbenjam
Copy link
Member Author

stbenjam commented Jan 24, 2020

Something close to 100% of real baremetal IPI clusters are going to take more than 45 minutes to install if they have worker replicas > 0.

The way the installer exits is pretty awful and makes it look like a catastrophic failure, I don't think that's really a good user experience on baremetal. There's just not much we can do about how long it takes real servers to boot, and the defaults we wait should cover most cases IMHO.

@stbenjam
Copy link
Member Author

What would it take to take a different approach where we say after 30 minutes a certain list of operators should be available, and then permit another 30 minutes to allow things like machine-api to deploy workers? That would let us identify clusters that are probably broken and not have users waiting an extra 30 minutes?

@sdodson
Copy link
Member

sdodson commented Jan 25, 2020

/retitle Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal
(not an endorsement that we should merge this, just linking things since there's a bug too)

@openshift-ci-robot openshift-ci-robot changed the title cmd/openshift-install/create: wait 60 minutes for baremetal Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal Jan 25, 2020
@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 25, 2020
@openshift-ci-robot
Copy link
Contributor

@stbenjam: This pull request references Bugzilla bug 1794755, which is invalid:

  • expected the bug to target the "4.4.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sdodson
Copy link
Member

sdodson commented Jan 25, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jan 25, 2020
@openshift-ci-robot
Copy link
Contributor

@sdodson: This pull request references Bugzilla bug 1794755, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dhellmann
Copy link
Contributor

In UPI workflows where the admin may have to go reconfigure something like the registry to use alternative storage we absolutely expect these commands to be run multiple times and unless we devise better mechanisms to differentiate between wait longer and terminal failures I'd be really hesitant to double that timeout. Even in IPI I think we should be leaning into the situation and make it clearer that the installer has done its best but the admin needs to take a look at the cluster to determine what's preventing it from completing in under 30 minutes. I think the perceived finality of wait-for install-complete is leading people to throw their hands up and walk away from salvageable clusters. We should be improving the messaging and troubleshooting documentation.

I've done the same in our e2e-metal jobs FWIW, i try twice for both bootstrap-complete and install-complete. I don't see this as a bug though.

How do we expect Hive to figure out if a cluster is up all the way? Should it be running the command multiple times, too, and then maybe trying to access the cluster's API?

Does everything that wraps the installer have to do that for itself?

@sdodson
Copy link
Member

sdodson commented Jan 29, 2020

Since this is limited in scope to baremetal platform we'll accept this.
/approve
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jan 29, 2020
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sdodson

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 29, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@stbenjam
Copy link
Member Author

stbenjam commented Jan 29, 2020

Since this is limited in scope to baremetal platform we'll accept this.

Thanks. At some point, I think it'd be worth having a discussion about how to bubble up more information from the CVO to the installer: having someone sit around for an hour with a cluster that we might've been able to determine at minute 15 was never going to succeed is obviously less than ideal. Although I know that's easier said than done

@sdodson
Copy link
Member

sdodson commented Jan 29, 2020

Does everything that wraps the installer have to do that for itself?

I think that is highly dependent on the expectations of whatever is calling the installer. Really all the installer is capable of reporting is that it waited $x minutes and $y operators did not achieve their goals. The installer doesn't know if they will eventually achieve their goals or even why they weren't able to so it shouldn't be seen as a terminal failure in its current form.

Thanks. At some point, I think it'd be worth having a discussion about how to bubble up more information from the CVO to the installer: having someone sit around for an hour with a cluster that we might've been able to determine at minute 15 was never going to succeed is obviously less than ideal.

Yeah, I think this is becoming necessary.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@sdodson
Copy link
Member

sdodson commented Jan 29, 2020

/lgtm cancel

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Jan 29, 2020
Baremetal servers differ significantly from other platforms, especially
due to the length of time it can take to boot real hardware: servers go
through POST (power-on self checks) and hardware initializations that
can take many minutes to complete. When deploying the control plane as
well as workers in the installer, it reliably takes more than 30
minutes.

Recently, we added code to the installer and machine-api-operator that
now allows the installer to deploy workers on day 1. This change even on
virtualized baremetal is running up against the 30-minute time limit. On
real baremetal servers, it's guaranteed the install process is closer to
45 minutes.
@stbenjam
Copy link
Member Author

Import ordering is fixed.

@stbenjam
Copy link
Member Author

@abhinavdahiya PTAL, I've addressed your concern regarding imports

@sdodson
Copy link
Member

sdodson commented Feb 3, 2020

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 3, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@abhinavdahiya
Copy link
Contributor

/skip

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 406e907 into openshift:master Feb 4, 2020
@openshift-ci-robot
Copy link
Contributor

@stbenjam: All pull requests linked via external trackers have merged. Bugzilla bug 1794755 has been moved to the MODIFIED state.

In response to this:

Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot
Copy link
Contributor

@stbenjam: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-libvirt 3d87766 link /test e2e-libvirt
ci/prow/e2e-aws-scaleup-rhel7 3d87766 link /test e2e-aws-scaleup-rhel7

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@e-minguez
Copy link
Contributor

Can we have this backported to 4.3? Thanks!

@stbenjam
Copy link
Member Author

/cherry-pick release-4.3

@openshift-cherrypick-robot

@stbenjam: new pull request created: #3116

In response to this:

/cherry-pick release-4.3

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sdodson
Copy link
Member

sdodson commented Feb 17, 2020

Can we have this backported to 4.3? Thanks!

Just FYI, this only applies to baremetal platform which is only used for baremetal IPI, this is not used by baremetal UPI which uses a platform of none.

@e-minguez
Copy link
Contributor

Can we have this backported to 4.3? Thanks!

Just FYI, this only applies to baremetal platform which is only used for baremetal IPI, this is not used by baremetal UPI which uses a platform of none.

Thanks! Baremetal IPI is what we use here https://github.com/openshift-kni/baremetal-deploy and even if the focus is 4.4 we are still deploying 4.3 those days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants