
OCPBUGS-434: Absorb PodsScheduled condition into MinAvailable #854

Conversation

Miciah
Contributor

@Miciah Miciah commented Nov 4, 2022

Absorb PodsScheduled condition into MinAvailable

Remove the PodsScheduled status condition on IngressController objects, and report the information that PodsScheduled reported instead in the DeploymentReplicasMinAvailable status condition's message when its status is "False".

Before this change, the PodsScheduled status condition could erroneously signal that there was a problem when some misscheduled pods existed even if there were enough correctly scheduled and available pods.

The PodsScheduled status condition was added in order to provide diagnostic information when there was a problem. That is, PodsScheduled was intended to indicate why there was a problem, not whether there was a problem. The DeploymentReplicasMinAvailable status condition definitively indicates whether there is a problem, and its status message should indicate why there is a problem. Having a separate status condition for the "why" is superfluous and unnecessarily complicated in this instance.

Follow-up to #465.

  • pkg/operator/controller/ingress/controller.go (IngressControllerPodsScheduledConditionType): Delete const.
  • pkg/operator/controller/ingress/metrics_test.go (TestDeleteIngressControllerConditionsMetric): Delete a reference to the "PodsScheduled" status condition in a test case.
  • pkg/operator/controller/ingress/status.go (syncIngressControllerStatus): Delete the call to computeDeploymentPodsScheduledCondition. Pass the list of pods to computeDeploymentReplicasMinAvailableCondition.
    (PruneConditions): Delete pruning of the "DeploymentDegraded" status condition (which was removed in 4.6.0) and instead prune the "PodsScheduled" status condition.
    (computeDeploymentPodsScheduledCondition): Rename...
    (checkPodsScheduledForDeployment): ...to this. Change the return value from an operator status condition to an error value.
    (computeDeploymentReplicasMinAvailableCondition): Call checkPodsScheduledForDeployment and include the error string in the status condition's message if the function returns an error.
    (computeIngressDegradedCondition): Remove IngressControllerPodsScheduledConditionType from the list of conditions to check.
  • pkg/operator/controller/ingress/status_test.go (TestComputePodsScheduledCondition): Rename...
    (Test_checkPodsScheduledForDeployment): ...to this. Update for the function's new name and result value.
    (TestComputeIngressDegradedCondition): Delete test cases for PodsScheduled.
    (TestComputeDeploymentReplicasMinAvailableCondition): Update to set the deployment's label selector and pass in a list of pods (which is empty).

checkPodsScheduledForDeployment: Add nil check

checkPodsScheduledForDeployment takes a pointer to a deployment, which should never be nil, so return an error indicating that there is a bug if the deployment is somehow nil.

  • pkg/operator/controller/ingress/status.go (checkPodsScheduledForDeployment): Check whether deployment is nil and return an error if it is.
  • pkg/operator/controller/ingress/status_test.go (Test_checkPodsScheduledForDeployment): Add a "nil deployment" test case.
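A minimal sketch of the pattern these two commits describe: the scheduling check becomes a plain error-returning helper with a nil guard, and the min-available computation folds that error into its condition message. The reason strings, the simplified replica arithmetic, and the NodeName-based scheduling test are illustrative assumptions, not the operator's actual code.

package ingress

import (
	"errors"
	"fmt"

	operatorv1 "github.com/openshift/api/operator/v1"
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// checkPodsScheduledForDeployment returns an error describing any pods from
// the given list that are not yet scheduled, or nil if every pod has been
// scheduled.  The nil guard mirrors the follow-up commit described above.
func checkPodsScheduledForDeployment(deployment *appsv1.Deployment, pods []corev1.Pod) error {
	if deployment == nil {
		return errors.New("system error detected: deployment was nil")
	}
	var unscheduled []string
	for _, pod := range pods {
		// Treating an empty NodeName as "not scheduled" is a simplification
		// for this sketch.
		if pod.Spec.NodeName == "" {
			unscheduled = append(unscheduled, pod.Name)
		}
	}
	if len(unscheduled) > 0 {
		return fmt.Errorf("pods not scheduled: %v", unscheduled)
	}
	return nil
}

// computeDeploymentReplicasMinAvailableCondition reports False when fewer
// replicas are available than desired and, in that case, appends whatever
// the scheduling check found to the condition message as a diagnostic hint.
func computeDeploymentReplicasMinAvailableCondition(deployment *appsv1.Deployment, pods []corev1.Pod) operatorv1.OperatorCondition {
	desired := int32(1)
	if deployment.Spec.Replicas != nil {
		desired = *deployment.Spec.Replicas
	}
	if deployment.Status.AvailableReplicas >= desired {
		return operatorv1.OperatorCondition{
			Type:   "DeploymentReplicasMinAvailable",
			Status: operatorv1.ConditionTrue,
			Reason: "DeploymentReplicasAvailable",
		}
	}
	message := fmt.Sprintf("%d/%d of replicas are available", deployment.Status.AvailableReplicas, desired)
	if err := checkPodsScheduledForDeployment(deployment, pods); err != nil {
		message = fmt.Sprintf("%s: %s", message, err)
	}
	return operatorv1.OperatorCondition{
		Type:    "DeploymentReplicasMinAvailable",
		Status:  operatorv1.ConditionFalse,
		Reason:  "DeploymentMinimumReplicasNotMet",
		Message: message,
	}
}

The real functions live in pkg/operator/controller/ingress/status.go, and the operator's minimum-availability arithmetic is more involved than the single desired-replica comparison used here.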

@openshift-ci-robot added the jira/invalid-bug label on Nov 4, 2022
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-434, which is invalid:

  • expected the bug to target the "4.13.0" version, but it targets "4.11.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Remove the PodsScheduled status condition on IngressController objects, and report the information that PodsScheduled reported instead in the DeploymentReplicasAllAvailable status condition's message when its status is "False".

Before this change, the PodsScheduled status condition could erroneously signal that there was a problem when some misscheduled pods existed even if there were enough correctly scheduled and available pods.

The PodsScheduled status condition was added in order to provide diagnostic information when there was a problem. That is, PodsScheduled was intended to indicate why there was a problem, not whether there was a problem. The DeploymentReplicasAllAvailable status condition definitively indicates whether there is a problem, and its status message should indicate why there is a problem. Having a separate status condition for the "why" is superfluous and unnecessarily complicated in this instance.

Follow-up to #465.

  • pkg/operator/controller/ingress/controller.go (IngressControllerPodsScheduledConditionType): Delete const.
  • pkg/operator/controller/ingress/status.go (syncIngressControllerStatus): Delete the call to computeDeploymentPodsScheduledCondition. Pass the list of pods to computeDeploymentReplicasAllAvailableCondition.
    (computeDeploymentPodsScheduledCondition): Rename...
    (checkPodsScheduledForDeployment): ...to this. Change the return value from an operator status condition to an error value.
    (computeDeploymentReplicasAllAvailableCondition): Call checkPodsScheduledForDeployment and include the error string in the status condition's message if the function returns an error.
    (computeIngressDegradedCondition): Remove IngressControllerPodsScheduledConditionType from the list of conditions to check.
  • pkg/operator/controller/ingress/status_test.go (TestComputePodsScheduledCondition): Rename...
    (Test_checkPodsScheduledForDeployment): ...to this. Update for the function's new name and result value.
    (TestComputeIngressDegradedCondition): Delete test cases for PodsScheduled.
    (TestComputeDeploymentReplicasAllAvailableCondition): Update to set the deployment's label selector and pass in a list of pods (which is empty).


@Miciah
Contributor Author

Miciah commented Nov 4, 2022

/jira refresh

@openshift-ci-robot added the jira/valid-bug and bugzilla/valid-bug labels and removed the jira/invalid-bug label on Nov 4, 2022
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-434, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh


@openshift-ci bot requested a review from lihongan on November 4, 2022 15:35
@Miciah force-pushed the OCPBUGS-434-absorb-PodsScheduled-condition-into-AllAvailable branch from 6947f72 to 509baf5 on November 4, 2022 16:45
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-434, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan


@Miciah changed the title from "OCPBUGS-434: Absorb PodsScheduled condition into AllAvailable" to "OCPBUGS-434: Absorb PodsScheduled condition into MinAvailable" on Nov 4, 2022
@Miciah
Contributor Author

Miciah commented Nov 4, 2022

The latest push changes the reporting of information about unscheduled/unschedulable pods from DeploymentReplicasAllAvailable to DeploymentReplicasMinAvailable: the former is surfaced in Degraded only after a 60-minute grace period, whereas the latter is surfaced after a 60-second grace period, so the cluster admin is more likely to see useful information sooner if it is in DeploymentReplicasMinAvailable.

@Miciah
Contributor Author

Miciah commented Nov 11, 2022

e2e-aws-ovn-upgrade failed because deprovisioning failed. e2e-gcp-ovn-serial failed because etcd had an unhealthy member. e2e-azure-ovn failed because Terraform failed to configure the vnet.
/retest-required

@candita
Contributor

candita commented Nov 30, 2022

/assign @gcs278
/assign

pods []corev1.Pod
expect operatorv1.ConditionStatus
name string
deployment *appsv1.Deployment
Contributor

I always test to ensure functions can handle nil pointers. Do you agree with testing this edge case?

Contributor Author

I'm ambivalent in this case and in the general case. Adding nil checks everywhere can make code safer (less likely to panic), but it can also obscure bugs (maybe it's a bug in the caller if the argument is nil). In this case, the caller of checkPodsScheduledForDeployment itself dereferences deployment, so it cannot be nil. For short helper functions like this that are used in only one or two places, I would be fine with saying something like, "The caller must ensure that the deployment argument is not nil." Would a comment in the godoc for checkPodsScheduledForDeployment suffice, or do you still prefer an explicit nil check in checkPodsScheduledForDeployment?

Contributor Author

Actually, checkPodsScheduledForDeployment returns an error value, so I'll go ahead and add a check that returns an error if deployment is nil.

Contributor Author

Added.

Contributor

So I was a bit of a nil-pointer-check absolutist until you mentioned it can also obscure bugs. So I agree, it's not that cut-and-dried whether every function needs a nil pointer check. The Go community seems torn on it too, based on my web searches.

So I think I don't mind either way.

@candita
Contributor

candita commented Dec 7, 2022

@Miciah I have a small concern with this: what if SD or customers have set up their own monitoring for specific conditions, and we are changing things here? Will there be a good release note to explain the changes to the condition set?

@Miciah
Contributor Author

Miciah commented Dec 7, 2022

@Miciah I have a small concern with this: what if SD or customers have set up their own monitoring for specific conditions, and we are changing things here? Will there be a good release note to explain the changes to the condition set?

We don't document the status condition, and we don't declare it in openshift/api, so we have made no promises to anyone that they can rely on it. We could make a release note if you think that's appropriate, though I would want to state clearly in any release note that no one should be relying on undocumented status conditions anyway.

Your question does spark a thought—we should prune the removed status condition on upgrade to make sure we don't leave a stale status condition around.
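As a rough illustration of that pruning idea (the changelog names the real function PruneConditions in status.go; the helper name and filtering details below are assumptions), an upgrade-safe prune can simply drop condition types the operator no longer manages:

package ingress

import (
	operatorv1 "github.com/openshift/api/operator/v1"
)

// pruneStaleConditions drops status conditions that the current operator
// version no longer manages, so a cluster upgraded from an older version
// does not keep reporting a stale "PodsScheduled" entry forever.
func pruneStaleConditions(conditions []operatorv1.OperatorCondition) []operatorv1.OperatorCondition {
	retired := map[string]bool{
		"PodsScheduled": true, // removed by this PR, so prune it on upgrade
	}
	pruned := make([]operatorv1.OperatorCondition, 0, len(conditions))
	for _, cond := range conditions {
		if retired[cond.Type] {
			continue
		}
		pruned = append(pruned, cond)
	}
	return pruned
}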

Remove the PodsScheduled status condition on IngressController objects, and
report the information that PodsScheduled reported instead in the
DeploymentReplicasMinAvailable status condition's message when its status
is "False".

Before this commit, the PodsScheduled status condition could erroneously
signal that there was a problem when some misscheduled pods existed even
if there were enough correctly scheduled and available pods.

The PodsScheduled status condition was added in order to provide diagnostic
information when there was a problem.  That is, PodsScheduled was intended
to indicate *why* there was a problem, not *whether* there was a problem.
The DeploymentReplicasMinAvailable status condition definitively
indicates *whether* there is a problem, and its status message should
indicate *why* there is a problem.  Having a separate status condition for
the "why" is superfluous and unnecessarily complicated in this instance.

Follow-up to commit c5e0f4a.

This bug fixes OCPBUGS-434.

https://issues.redhat.com/browse/OCPBUGS-434

* pkg/operator/controller/ingress/controller.go
(IngressControllerPodsScheduledConditionType): Delete const.
* pkg/operator/controller/ingress/metrics_test.go
(TestDeleteIngressControllerConditionsMetric): Delete a reference to the
"PodsScheduled" status condition in a test case.
* pkg/operator/controller/ingress/status.go (syncIngressControllerStatus):
Delete the call to computeDeploymentPodsScheduledCondition.  Pass the list
of pods to computeDeploymentReplicasMinAvailableCondition.
(PruneConditions): Delete pruning of the "DeploymentDegraded" status
condition (which was removed in 4.6.0) and instead prune the
"PodsScheduled" status condition.
(computeDeploymentPodsScheduledCondition): Rename...
(checkPodsScheduledForDeployment): ...to this.  Change the return value
from an operator status condition to an error value.
(computeDeploymentReplicasMinAvailableCondition): Call
checkPodsScheduledForDeployment and include the error string in the status
condition's message if the function returns an error.
(computeIngressDegradedCondition): Remove
IngressControllerPodsScheduledConditionType from the list of conditions to
check.
* pkg/operator/controller/ingress/status_test.go
(TestComputePodsScheduledCondition): Rename...
(Test_checkPodsScheduledForDeployment): ...to this.  Update for the
function's new name and result value.
(TestComputeIngressDegradedCondition): Delete test cases for PodsScheduled.
(TestComputeDeploymentReplicasMinAvailableCondition): Update to set the
deployment's label selector and pass in a list of pods (which is empty).
@Miciah force-pushed the OCPBUGS-434-absorb-PodsScheduled-condition-into-AllAvailable branch from f4da6f3 to 436fd31 on December 7, 2022 00:20
@openshift-ci-robot
Contributor

@Miciah: This pull request references Jira Issue OCPBUGS-434, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

The bug has been updated to refer to the pull request using the external bug tracker.


// result of that check.
func checkPodsScheduledForDeployment(deployment *appsv1.Deployment, pods []corev1.Pod) error {
	if deployment == nil {
		return errors.New("Deployment was nil. This is a bug!")
Contributor

Suggested change
return errors.New("Deployment was nil. This is a bug!")
return errors.New("System or configuration error detected - deployment was nil.")

Contributor Author

Is that better than just saying, "This is a bug"? We don't want to mislead the cluster admin into thinking that something is wrong with the configuration that they can fix, when really the problem is that the operator has a bug that needs to be fixed.

Contributor

Then maybe just "System error detected." and put a link to the bug reporting tool!

Contributor Author

I updated the error message as you suggested. The Jira link is a bit unwieldy though.

Comment on lines -282 to -301
{
	name: "pods not scheduled for <10m",
	conditions: []operatorv1.OperatorCondition{
		cond(IngressControllerPodsScheduledConditionType, operatorv1.ConditionFalse, "", clock.Now().Add(time.Minute*-9)),
	},
	expectIngressDegradedStatus: operatorv1.ConditionFalse,
	expectRequeue: true,
	// Grace period is 10 minutes, subtract the 9 minute spoofed last transition time
	expectAfter: time.Minute,
},
{
	name: "pods not scheduled for >10m",
	conditions: []operatorv1.OperatorCondition{
		cond(IngressControllerPodsScheduledConditionType, operatorv1.ConditionFalse, "", clock.Now().Add(time.Minute*-31)),
	},
	expectIngressDegradedStatus: operatorv1.ConditionTrue,
	expectRequeue: true,
	// Just use the one minute retry duration for this degraded condition
	expectAfter: time.Minute,
},
Contributor

Do we no longer have a grace period for which pods can remain unscheduled and not yet considered unavailable?

Contributor Author

There is no longer a grace period for the PodsScheduled status condition, because the PodsScheduled status condition no longer even exists after this PR.

The point of this PR is to remove the PodsScheduled status condition because it was spuriously indicating a condition (some pods weren't scheduled) as a problem when really it wasn't necessarily a problem at all (the not-scheduled pods might actually be pods that are waiting to be garbage-collected). However, the operator still reports the DeploymentReplicasMinAvailable status condition (which has a 60-second grace period, before and after this PR), which conveys the information that really does matter (whether there are enough pods that are scheduled, running, and ready).

Contributor

So will this automatically continue to reconcile while there are not enough pods scheduled?

Contributor Author

Yeah, if there are not enough available pods, then DeploymentReplicasMinAvailable will have status False, and the grace logic will queue reconciliation for the end of the condition's grace period. This is another way in which PodsScheduled is superfluous: DeploymentReplicasMinAvailable has a shorter grace period than PodsScheduled had, so if DeploymentReplicasMinAvailable is False, its grace period already would have triggered reconciliation (and still does trigger it) before PodsScheduled's grace period would have triggered reconciliation.
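A hedged sketch of the grace-period logic described in this thread, assuming a helper that maps a False condition and a grace period to a Degraded verdict plus a requeue delay; the function name and the fixed one-minute retry are illustrative, not the operator's actual code:

package ingress

import (
	"time"

	operatorv1 "github.com/openshift/api/operator/v1"
)

// evaluateGracePeriod decides whether a condition that is currently False
// should already make the IngressController Degraded, and if not, how long
// to wait before reconciling again so the condition is re-checked when its
// grace period expires.
func evaluateGracePeriod(cond operatorv1.OperatorCondition, gracePeriod time.Duration, now time.Time) (degraded bool, requeueAfter time.Duration) {
	if cond.Status != operatorv1.ConditionFalse {
		return false, 0
	}
	elapsed := now.Sub(cond.LastTransitionTime.Time)
	if elapsed < gracePeriod {
		// Still within the grace period: not Degraded yet, but requeue at
		// the end of the grace period rather than waiting for another event.
		return false, gracePeriod - elapsed
	}
	// The condition has been False longer than its grace period, so report
	// Degraded and keep retrying on a short fixed interval.
	return true, time.Minute
}

With the 60-second grace period that DeploymentReplicasMinAvailable uses and the longer period that PodsScheduled had, the shorter period always expires first, which is the point made in the reply above.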

@gcs278
Contributor

gcs278 commented Dec 7, 2022

I forgot about status pruning, good catch.

/lgtm

@openshift-ci bot added the lgtm label on Dec 7, 2022
checkPodsScheduledForDeployment takes a pointer to a deployment, which
should never be nil, so return an error indicating that there is a bug
if the deployment is somehow nil.

* pkg/operator/controller/ingress/status.go
(checkPodsScheduledForDeployment): Check whether deployment is nil and
return an error if it is.
* pkg/operator/controller/ingress/status_test.go
(Test_checkPodsScheduledForDeployment): Add a "nil deployment" test case.
@Miciah force-pushed the OCPBUGS-434-absorb-PodsScheduled-condition-into-AllAvailable branch from 436fd31 to 67b0e92 on December 8, 2022 00:56
@openshift-ci bot removed the lgtm label on Dec 8, 2022
// result of that check.
func checkPodsScheduledForDeployment(deployment *appsv1.Deployment, pods []corev1.Pod) error {
	if deployment == nil {
		return errors.New("System error detected: deployment was nil. Please report this issue to Red Hat: https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332330&issuetype=1&components=12367900&priority=10300&customfield_12316142=26752")
Contributor

I have some concern that this link may break in the future. Would it be better to just use https://issues.redhat.com/?

Contributor Author

I think that that would lead to more misrouted bug reports.

I could say "open a support case", but I don't even know the link for that, and neither would OKD users.

Contributor

🤷‍♂️ this is fine, hopefully we never hit this logic

@gcs278
Contributor

gcs278 commented Dec 15, 2022

/lgtm

@openshift-ci bot added the lgtm label on Dec 15, 2022
@candita
Contributor

candita commented Dec 16, 2022

/retest-required

@candita
Contributor

candita commented Dec 16, 2022

This test output for e2e-gcp-ovn-serial is unusual in that it mentions two bugs:

ERRO[2022-12-08T03:23:26Z] Some steps failed:
...

  • could not run steps: step e2e-gcp-ovn-serial failed: "e2e-gcp-ovn-serial" test steps failed: "e2e-gcp-ovn-serial" pod "e2e-gcp-ovn-serial-openshift-e2e-test" failed: the pod ci-op-jk9ifky1/e2e-gcp-ovn-serial-openshift-e2e-test failed after 1h31m9s (failed containers: test): ContainerFailed one or more containers exited
    Container test exited with code 1, reason Error

57746Z","updated_at":"2022-12-08T02:54:41.100513Z","deleted_at":null,"status":"ON_QA","last_change_time":"2022-11-28T10:13:32Z","summary":"masters repeatedly losing connection to API and going NotReady","affects_versions":["4.12","4.13"],"fix_versions":[],"components":["Machine Config Operator / platform-baremetal"],"labels":["sippy-watchlist","sippy-watchlist","trt","trt"],"url":"https://issues.redhat.com/browse/OCPBUGS-3874"},{"id":14935611,"key":"OCPBUGS-3642","created_at":"2022-11-14T18:07:08.62315Z","updated_at":"2022-12-08T02:54:40.527553Z","deleted_at":null,"status":"New","last_change_time":"2022-11-14T16:25:50Z","summary":"Readiness and Liveliness Probe errors in openshift-oauth-apiserver and test failures","affects_versions":["4.12"],"fix_versions":[],"components":["apiserver-auth"],"labels":[],"url":"https://issues.redhat.com/browse/OCPBUGS-3642"}]},{"Name":"[sig-network-edge] ns/openshift-authentication route/oauth-openshift disruption/ingress-to-oauth-server connection/new should be available throughout the test","Risk":{"Level":{"Name":"Low","Level":1},"Reasons":["This test has passed 10.53% of 19 runs on release 4.13 [amd64 gcp ha ovn serial] in the last week."]},"OpenBugs":[]},{"Name":"[sig-network-edge] ns/openshift-console route/console disruption/ingress-to-console connection/reused should be available throughout the test","Risk":{"Level":{"Name":"Low","Level":1},"Reasons":["This test has passed 21.05% of 19 runs on release 4.13 [amd64 gcp ha ovn serial] in the last week."]},"OpenBugs":[]}],"OverallRisk":{"Level":{"Name":"High","Level":10},"Reasons":["Maximum failed test risk: High"]},"OpenBugs":[]}

@candita
Contributor

candita commented Dec 17, 2022

/retest-required

@candita
Contributor

candita commented Dec 17, 2022

/approve

@openshift-ci
Contributor

openshift-ci bot commented Dec 17, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: candita

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label on Dec 17, 2022
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d9d1a2b and 2 for PR HEAD 67b0e92 in total

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 560b2d3 and 1 for PR HEAD 67b0e92 in total

@openshift-ci
Contributor

openshift-ci bot commented Dec 19, 2022

@Miciah: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                        Commit   Required  Rerun command
ci/prow/e2e-aws-ovn-single-node  67b0e92  false     /test e2e-aws-ovn-single-node
ci/prow/e2e-gcp-ovn-serial       67b0e92  true      /test e2e-gcp-ovn-serial

Full PR test history. Your PR dashboard.


@openshift-bot
Contributor

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target OCP 4.12), recalculating validity.

@openshift-bot
Contributor

/bugzilla refresh

The requirements for Bugzilla bugs have changed (BZs linked to PRs on main branch need to target OCP 4.12), recalculating validity.

@openshift-ci-robot
Contributor

@openshift-bot: This pull request references Jira Issue OCPBUGS-434, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

The requirements for Jira bugs have changed (Jira issues linked to PRs on main branch need to target OCP 4.12), recalculating validity.


@openshift-ci
Contributor

openshift-ci bot commented Jan 1, 2023

@openshift-bot: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

Retaining the bugzilla/valid-bug label as it was manually added.

In response to this:

/bugzilla refresh

The requirements for Bugzilla bugs have changed (BZs linked to PRs on main branch need to target OCP 4.12), recalculating validity.


@candita
Contributor

candita commented Jan 13, 2023

@Miciah There's a new frequent (but not permanent) failure in TestUnmanagedDNSToManagedDNSInternalIngressController. Could this change have caused that?

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD fc69408 and 0 for PR HEAD 67b0e92 in total

@openshift-merge-robot merged commit b81ee72 into openshift:master on Jan 13, 2023
@openshift-ci-robot
Contributor

@Miciah: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-434 has been moved to the MODIFIED state.


Labels
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • lgtm: Indicates that a PR is ready to be merged.