Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal #2979
Sorry to open this can of worms again. I know this has been a recurring discussion but it's definitely a real problem for baremetal now when we try to deploy workers as day 1, as we need the control plane to come up, and then we deploy a worker after which takes > 30 minutes but < 60. You can see our repeated CI failures due to timeouts when doing workers as day 1, and success when we did a second |
In UPI workflows where the admin may have to go reconfigure something like the registry to use alternative storage we absolutely expect these commands to be run multiple times and unless we devise better mechanisms to differentiate between wait longer and terminal failures I'd be really hesitant to double that timeout. Even in IPI I think we should be leaning into the situation and make it clearer that the installer has done its best but the admin needs to take a look at the cluster to determine what's preventing it from completing in under 30 minutes. I think the perceived finality of I've done the same in our e2e-metal jobs FWIW, i try twice for both |
Something close to 100% of real baremetal IPI clusters are going to take more than 45 minutes to install if they have worker replicas > 0. The way the installer exits is pretty awful and makes it look like a catastrophic failure; I don't think that's a good user experience on baremetal. There's just not much we can do about how long it takes real servers to boot, and the default wait should cover most cases IMHO.
What would it take to take a different approach where we say after 30 minutes a certain list of operators should be available, and then permit another 30 minutes to allow things like machine-api to deploy workers? That would let us identify clusters that are probably broken and not have users waiting an extra 30 minutes? |
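The two-phase idea in the comment above could be sketched roughly as follows. This is only an illustration of the proposal, not code from the installer: the `assess` function, the `verdict` values, and the contents of `coreOperators` are all hypothetical, and the two deadlines simply mirror the 30 + 30 minutes mentioned in the comment.

```go
package main

import (
	"fmt"
	"time"
)

// Both deadlines mirror the numbers in the discussion; they are
// assumptions for this sketch, not values read from the installer.
const (
	coreDeadline  = 30 * time.Minute
	finalDeadline = 60 * time.Minute
)

// coreOperators is a hypothetical list of operators that would be
// expected to be available by the first deadline.
var coreOperators = []string{"kube-apiserver", "etcd"}

type verdict string

const (
	done         verdict = "done"
	keepWaiting  verdict = "keep waiting"
	likelyBroken verdict = "likely broken"
	timedOut     verdict = "timed out"
)

// allAvailable reports whether every named operator is marked available.
func allAvailable(available map[string]bool, names []string) bool {
	for _, n := range names {
		if !available[n] {
			return false
		}
	}
	return true
}

// assess decides what to do given the elapsed time and operator state.
// all is the full set of operators required for the install to complete.
func assess(elapsed time.Duration, available map[string]bool, all []string) verdict {
	switch {
	case allAvailable(available, all):
		return done
	case elapsed >= finalDeadline:
		return timedOut
	case elapsed >= coreDeadline && !allAvailable(available, coreOperators):
		// Core operators missed the first deadline: give up early.
		return likelyBroken
	default:
		return keepWaiting
	}
}

func main() {
	all := []string{"kube-apiserver", "etcd", "machine-api"}
	state := map[string]bool{"kube-apiserver": true, "etcd": true}
	// Core operators are up at minute 40, so workers get more time.
	fmt.Println(assess(40*time.Minute, state, all))
}
```

With this shape, a cluster whose core operators are still unavailable at minute 30 is flagged early, while a cluster that is only waiting on worker deployment gets the second 30 minutes.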
/retitle Bug 1794755: cmd/openshift-install/create: wait 60 minutes for baremetal |
@stbenjam: This pull request references Bugzilla bug 1794755, which is invalid:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@sdodson: This pull request references Bugzilla bug 1794755, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
How do we expect Hive to figure out if a cluster is up all the way? Should it be running the command multiple times, too, and then maybe trying to access the cluster's API? Does everything that wraps the installer have to do that for itself? |
Since this is limited in scope to baremetal platform we'll accept this. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sdodson The full list of commands accepted by this bot can be found here. The pull request process is described here
/retest Please review the full test history for this PR and help us cut down flakes. |
Thanks. At some point, I think it'd be worth having a discussion about how to bubble up more information from the CVO to the installer: having someone sit around for an hour with a cluster that we might've been able to determine at minute 15 was never going to succeed is obviously less than ideal. Although I know that's easier said than done |
I think that is highly dependent on the expectations of whatever is calling the installer. Really, all the installer is capable of reporting is that it waited $x minutes and $y operators did not achieve their goals. The installer doesn't know whether they will eventually achieve their goals, or even why they weren't able to, so this shouldn't be seen as a terminal failure in its current form.
Yeah, I think this is becoming necessary. |
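As a rough illustration of the "waited $x minutes and $y operators did not achieve their goals" report described above, a wrapper might summarize state like this. The function names, the operator names, and the message wording are hypothetical and are not the installer's actual output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"time"
)

// pendingOperators returns the sorted names of operators that are not
// yet marked available.
func pendingOperators(available map[string]bool) []string {
	pending := []string{}
	for name, ok := range available {
		if !ok {
			pending = append(pending, name)
		}
	}
	sort.Strings(pending)
	return pending
}

// summarize words the result as "still waiting" rather than as a
// terminal failure, since the caller cannot tell the two apart.
func summarize(waited time.Duration, available map[string]bool) string {
	pending := pendingOperators(available)
	if len(pending) == 0 {
		return fmt.Sprintf("all operators available after %v", waited)
	}
	return fmt.Sprintf("waited %v; %d operator(s) not yet available: %s; this is not necessarily a terminal failure",
		waited, len(pending), strings.Join(pending, ", "))
}

func main() {
	// Hypothetical operator state for the example.
	state := map[string]bool{"machine-api": false, "ingress": false, "etcd": true}
	fmt.Println(summarize(30*time.Minute, state))
}
```

A caller such as a CI job could use a summary like this to decide whether to run the wait again or to inspect the cluster.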
/lgtm cancel |
acde421 to eb11ffe
Baremetal servers differ significantly from other platforms, especially in how long it can take to boot real hardware: servers go through POST (power-on self-test) and hardware initialization that can take many minutes to complete. When the installer deploys workers as well as the control plane, it reliably takes more than 30 minutes. Recently, we added code to the installer and machine-api-operator that allows the installer to deploy workers on day 1. Even on virtualized baremetal, this change runs up against the 30-minute time limit. On real baremetal servers, the install process reliably takes closer to 45 minutes.
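The change described in this commit message amounts to choosing the wait by platform. A minimal sketch, assuming the platform is identified by a plain string; `completionTimeout` and `baremetalPlatform` are hypothetical names for illustration and not necessarily how the installer implements it:

```go
package main

import (
	"fmt"
	"time"
)

// baremetalPlatform is a hypothetical stand-in for the installer's
// baremetal platform name.
const baremetalPlatform = "baremetal"

// completionTimeout returns how long to wait for the cluster to finish
// initializing: 60 minutes on baremetal, 30 minutes everywhere else.
func completionTimeout(platform string) time.Duration {
	if platform == baremetalPlatform {
		return 60 * time.Minute
	}
	return 30 * time.Minute
}

func main() {
	// "none" is the platform used by baremetal UPI, so it keeps the
	// 30-minute default.
	for _, p := range []string{"aws", "baremetal", "none"} {
		fmt.Printf("%s: %v\n", p, completionTimeout(p))
	}
}
```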
eb11ffe to 3d87766
Import ordering is fixed. |
@abhinavdahiya PTAL, I've addressed your concern regarding imports |
/lgtm |
/retest Please review the full test history for this PR and help us cut down flakes. |
/skip |
@stbenjam: All pull requests linked via external trackers have merged. Bugzilla bug 1794755 has been moved to the MODIFIED state. In response to this:
@stbenjam: The following tests failed, say
Can we have this backported to 4.3? Thanks! |
/cherry-pick release-4.3 |
@stbenjam: new pull request created: #3116 In response to this:
Just FYI, this only applies to the baremetal platform, which is only used for baremetal IPI. It does not apply to baremetal UPI, which uses a platform of none.
Thanks! Baremetal IPI is what we use here: https://github.com/openshift-kni/baremetal-deploy. Even though the focus is 4.4, we are still deploying 4.3 these days.