
Bug 1816904: Ensure the Installer checks for the correct number of compute replicas before exiting #4071

Closed
wants to merge 1 commit

Conversation

kirankt
Contributor

@kirankt kirankt commented Aug 19, 2020

This PR ensures that the installer prints a non-fatal error message before exiting if the number of worker replicas doesn't match the replicas requested in the install config.

These errors happen mostly on the baremetal platform, where nodes might take longer to finish introspection or might enter an error state, and the installer can exit before then, leaving the cluster with an incorrect number of compute/worker replicas. Since we ask the user for the number of replicas in the install config, it makes sense to at least verify that those replicas have been successfully deployed before exiting.
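The check proposed in this PR could be sketched as follows. This is a minimal illustration, not the installer's actual code: the `Node` type and `checkComputeReplicas` helper are hypothetical stand-ins for what would, in practice, be a client-go query against the cluster's node list.

```go
package main

import "fmt"

// Node is a minimal stand-in for a Kubernetes node as seen by the
// installer; real code would obtain these via client-go and inspect
// the node-role.kubernetes.io/worker label and Ready condition.
type Node struct {
	Name   string
	Worker bool
	Ready  bool
}

// checkComputeReplicas compares the number of Ready worker nodes against
// the compute replica count requested in the install config. It returns
// a non-fatal warning message when they differ, and "" when they match.
func checkComputeReplicas(requested int, nodes []Node) string {
	ready := 0
	for _, n := range nodes {
		if n.Worker && n.Ready {
			ready++
		}
	}
	if ready == requested {
		return ""
	}
	return fmt.Sprintf(
		"expected %d compute replicas, but only %d worker nodes are ready; "+
			"remaining nodes may still be provisioning", requested, ready)
}

func main() {
	nodes := []Node{
		{Name: "worker-0", Worker: true, Ready: true},
		{Name: "worker-1", Worker: true, Ready: false},
		{Name: "master-0", Worker: false, Ready: true},
	}
	fmt.Println(checkComputeReplicas(2, nodes))
}
```

Because the result is only a message, the caller can decide whether to log it as a warning or treat it as a fatal error, which is exactly the point debated in this thread.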

@openshift-ci-robot
Contributor

@kirankt: This pull request references Bugzilla bug 1816904, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

Bug 1816904: Ensure the Installer checks for the correct number of compute replicas before exiting

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the bugzilla/severity-high (referenced Bugzilla bug's severity is high for the branch this PR is targeting) and bugzilla/valid-bug (referenced Bugzilla bug is valid for the branch this PR is targeting) labels on Aug 19, 2020
@kirankt
Contributor Author

kirankt commented Aug 19, 2020

/label platform/baremetal

@openshift-ci-robot added the platform/baremetal (IPI bare metal hosts platform) label on Aug 19, 2020
@kirankt
Contributor Author

kirankt commented Aug 19, 2020

/assign @abhinavdahiya

Contributor

@abhinavdahiya abhinavdahiya left a comment


Personally I don't think it is the installer's responsibility to check whether the compute nodes are running, or how many.

We should probably work with machine-api-operator to track a condition on its cluster operator that reports the machines that haven't yet become nodes... the installer can then surface that condition as part of the error message when cluster completion fails.

If cluster completion succeeds, worker nodes that are still provisioning or never provisioned were evidently not required for install-complete, so it's odd for the installer to flag them; the cluster already has an alert for such machines in those cases.
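The condition-based approach suggested in this review could look roughly like the sketch below. Note that the `MachinesNotYetNodes` condition type and the `degradedComputeMessage` helper are purely illustrative assumptions; machine-api-operator does not currently expose such a condition, which is exactly what the reviewer is proposing be added.

```go
package main

import "fmt"

// Condition mirrors the general shape of a ClusterOperator status
// condition (type/status/message); field names are illustrative, not
// the real machine-api-operator API.
type Condition struct {
	Type    string
	Status  string
	Message string
}

// degradedComputeMessage looks for a hypothetical condition reporting
// machines that have not yet become nodes, so the installer could append
// its message to the install-complete failure output.
func degradedComputeMessage(conds []Condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "MachinesNotYetNodes" && c.Status == "True" {
			return c.Message, true
		}
	}
	return "", false
}

func main() {
	conds := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "MachinesNotYetNodes", Status: "True",
			Message: "2 machines have not joined the cluster as nodes"},
	}
	if msg, ok := degradedComputeMessage(conds); ok {
		fmt.Println("install failed; machine-api reports:", msg)
	}
}
```

The advantage of this design is that the component that actually owns machine lifecycle (machine-api-operator) reports the state, and the installer merely relays it on failure.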

Review comments on cmd/openshift-install/create.go (outdated; resolved)
@kirankt
Contributor Author

kirankt commented Aug 20, 2020

Personally I don't think it is the installer's responsibility to check whether the compute nodes are running, or how many.

We should probably work with machine-api-operator to track a condition on its cluster operator that reports the machines that haven't yet become nodes... the installer can then surface that condition as part of the error message when cluster completion fails.

Hi. Thanks for the feedback. I was wondering if there is an example of how we currently use the cluster operator in the installer.

If cluster completion succeeds, worker nodes that are still provisioning or never provisioned were evidently not required for install-complete, so it's odd for the installer to flag them; the cluster already has an alert for such machines in those cases.

Instead of an error, can we just print a warning before we call logcomplete? That's what I'm doing in the newest commit: just printing a warning, so it's not too disruptive. WDYT?

@stbenjam
Member

Personally I don't think it is the installer's responsibility to check whether the compute nodes are running, or how many.

Everyone points to everyone else as responsible for handling this 🤷‍♂️

We've previously discussed this on openshift/enhancements#328 (comment)

@kirankt
Contributor Author

kirankt commented Sep 1, 2020

@abhinavdahiya What do you think of the new changes? We're just printing a warning message before exiting. IMO it's pretty harmless and will not really affect other platforms.

@kirankt
Contributor Author

kirankt commented Sep 1, 2020

/retest

1 similar comment
@kirankt
Contributor Author

kirankt commented Sep 14, 2020

/retest

@openshift-ci-robot
Contributor

openshift-ci-robot commented Sep 14, 2020

@kirankt: The following tests failed, say /retest to rerun all failed tests:

All failed tests ran against commit 1258aae:
  • ci/prow/e2e-ovirt (rerun: /test e2e-ovirt)
  • ci/prow/e2e-openstack (rerun: /test e2e-openstack)
  • ci/prow/e2e-aws-workers-rhel7 (rerun: /test e2e-aws-workers-rhel7)
  • ci/prow/e2e-aws-fips (rerun: /test e2e-aws-fips)
  • ci/prow/e2e-libvirt (rerun: /test e2e-libvirt)
  • ci/prow/e2e-crc (rerun: /test e2e-crc)



@kirankt
Contributor Author

kirankt commented Sep 15, 2020

@abhinavdahiya Ideally we would like to exit with a non-zero return code rather than just the warning this PR currently prints. Also, I've ensured that this affects only the baremetal platform. PTAL and let me know what you think.
/cc @hardys

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign abhinavdahiya after the PR has been reviewed.
You can assign the PR to them by writing /assign @abhinavdahiya in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…h an error message if the number of worker replicas doesn't match the replicas requested in the install config.

The error is non-fatal and the cluster will be operational; the message only reports the inconsistency between the requested worker replicas and what was available after installation.
@kirankt
Contributor Author

kirankt commented Oct 19, 2020

/retest

@openshift-merge-robot
Contributor

openshift-merge-robot commented Oct 19, 2020

@kirankt: The following tests failed, say /retest to rerun all failed tests:

All failed tests ran against commit b3195aa:
  • ci/prow/e2e-crc (rerun: /test e2e-crc)
  • ci/prow/e2e-ovirt (rerun: /test e2e-ovirt)
  • ci/prow/e2e-aws-workers-rhel7 (rerun: /test e2e-aws-workers-rhel7)



@abhinavdahiya
Contributor

@abhinavdahiya Ideally we would like to exit with a non-zero return code rather than just the warning this PR currently prints. Also, I've ensured that this affects only the baremetal platform. PTAL and let me know what you think.
/cc @hardys

Even for baremetal my concerns are still the same as in #4071 (review), and none of them have been fundamentally answered.

How do you ensure, or figure out, whether a compute node that is missing or still not running was actually required for the install-complete signal? Marking installs as failed in those cases is IMO inaccurate and should be avoided.

If you can update this PR to honor that constraint, I'm more than happy to have the installer run this check for install-complete.

Secondly, people also use this code path for UPI clusters, and expecting/requiring those customers to match the install-config replica count seems like an unnecessary requirement.

@kirankt
Contributor Author

kirankt commented Oct 22, 2020

@abhinavdahiya Ideally we would like to exit with a non-zero return code rather than just the warning this PR currently prints. Also, I've ensured that this affects only the baremetal platform. PTAL and let me know what you think.
/cc @hardys

Even for baremetal my concerns are still the same as in #4071 (review), and none of them have been fundamentally answered.

How do you ensure, or figure out, whether a compute node that is missing or still not running was actually required for the install-complete signal? Marking installs as failed in those cases is IMO inaccurate and should be avoided.

Thanks, @abhinavdahiya for taking time to re-review.

I understand your concern; it makes no sense to error out an otherwise perfectly installed cluster. What about how this PR stands now? Currently we only print a warning message about compute replicas. Can we at least do this (for the baremetal platform only) and then continue to the normal exit? I'd be OK with that compromise. These bugs are being reported a lot, and this doesn't seem to be a corner case. If the warning message exists, we can at least inform end users so they can take action on day 2.

If you can update this PR to that constraint, more than happy to have the installer run this check for install complete.

secondly the people also use this code-path for UPI clusters and expecting/requiring those customers to match the install config replica count seems like an unnecessary requirement.

FWIW, UPI uses platform=none for its installs, and AFAIK it should not affect them. :-)
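The compromise described in this comment (a warning instead of a failure, gated to the baremetal platform so UPI's platform=none path is untouched) could be sketched as below. The function name and signature are hypothetical, not the actual code in this PR.

```go
package main

import "fmt"

// warnOnReplicaMismatch emits a warning (never an error) and only for
// the baremetal platform, leaving other platforms and UPI installs
// (platform "none") unaffected. This is an illustrative sketch of the
// approach discussed, not the installer's merged code.
func warnOnReplicaMismatch(platform string, requested, ready int) (string, bool) {
	if platform != "baremetal" || ready == requested {
		return "", false
	}
	return fmt.Sprintf(
		"WARNING: %d of %d requested compute replicas are ready; "+
			"check the remaining hosts on day 2", ready, requested), true
}

func main() {
	// Baremetal with a mismatch: warning printed, installer still exits 0.
	if msg, ok := warnOnReplicaMismatch("baremetal", 3, 2); ok {
		fmt.Println(msg)
	}
	// UPI (platform "none"): no warning, regardless of counts.
	if _, ok := warnOnReplicaMismatch("none", 3, 2); ok {
		fmt.Println("unexpected warning")
	}
}
```

Gating on the platform string keeps the change additive: every other platform's exit behavior is byte-for-byte identical.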

@stbenjam
Member

I moved this bug out of the baremetal installer queue and back to the core installer team. This bug was reported in March and we've been playing musical chairs with it ever since. I don't think we're any closer to a resolution, and IMHO it really should be fixed for every platform in OCP.

Let's close this PR. Feel free to reopen if there's a path forward that fixes it for everyone.

/close

@openshift-ci-robot
Contributor

@stbenjam: Closed this PR.

In response to this:

I moved this bug out of the baremetal installer queue and back to the core installer team. This bug was reported in March and we've been playing musical chairs with it ever since. I don't think we're any closer to a resolution, and IMHO it really should be fixed for every platform in OCP.

Let's close this PR. Feel free to reopen if there's a path forward that fixes it for everyone.

/close


@openshift-ci-robot
Contributor

@kirankt: This pull request references Bugzilla bug 1816904. The bug has been updated to no longer refer to the pull request using the external bug tracker.

In response to this:

Bug 1816904: Ensure the Installer checks for the correct number of compute replicas before exiting

