OCPBUGS-10846: Fix TestClientTLS flakes #904

Conversation

rfredette
Contributor

Wait for old router pods to be cleaned up before testing new mTLS config.

After the ingresscontroller configuration is updated, the router deployment reports that all new pods are ready, but sometimes pods from the older generation still haven't terminated. If those older-generation pods are still marked ready when TestClientTLS curls an endpoint, the connections are sometimes handled by an older-generation router, and that test case fails.

This PR makes the test wait until the older-generation pod(s) have completely terminated before running curl, ensuring that only the correct router pods are used.
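
In outline, the added wait helper looks something like the following (a simplified sketch assuming controller-runtime's client; the function name waitForOldRouterPodsGone and its details are illustrative, not the PR's exact code):

```go
package e2e

import (
	"context"
	"testing"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForOldRouterPodsGone polls until the pods matching the deployment's
// selector are down to spec.replicas, i.e. the old generation has finished
// terminating and only the new router pods remain.
func waitForOldRouterPodsGone(t *testing.T, cl client.Client, name types.NamespacedName, timeout time.Duration) error {
	t.Helper()
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		deployment := &appsv1.Deployment{}
		if err := cl.Get(context.TODO(), name, deployment); err != nil {
			t.Logf("failed to get deployment %q: %v", name.Name, err)
			return false, nil
		}
		podList := &corev1.PodList{}
		if err := cl.List(context.TODO(), podList,
			client.InNamespace(name.Namespace),
			client.MatchingLabels(deployment.Spec.Selector.MatchLabels),
		); err != nil {
			t.Logf("failed to list pods for deployment %q: %v", name.Name, err)
			return false, nil
		}
		expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, -1))
		return len(podList.Items) == expectedReplicas, nil
	})
}
```

As the review discussion below notes, the counting logic was later refined to look only at ready pods.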

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 4, 2023
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-10846, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from gcs278 and miheer April 4, 2023 21:09
@rfredette
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 4, 2023
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-10846, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

/jira refresh


@openshift-ci openshift-ci bot requested a review from ShudiLi April 4, 2023 23:07
@rfredette
Contributor Author

/retest

@Miciah
Contributor

Miciah commented Apr 5, 2023

Why isn't making sure that status.replicas isn't greater than spec.replicas sufficient?

if deployment.Spec.Replicas != nil {
	replicas = *deployment.Spec.Replicas
}
if replicas != deployment.Status.Replicas {
	return false, nil

status.replicas is the "number of non-terminated pods targeted by this deployment" according to https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#deploymentstatus-v1-apps. If there are pods that are marked for deletion but haven't been terminated yet, are they not counted?
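
One way to check that empirically (a hypothetical debugging helper, not part of this PR; it assumes the same imports as the sketch under the PR description, plus fmt) is to print the deployment's status counters next to the actual pods and their deletion timestamps:

```go
// dumpReplicaAccounting prints status.replicas alongside the live pods so you
// can see whether pods that are already marked for deletion are still counted.
func dumpReplicaAccounting(ctx context.Context, cl client.Client, name types.NamespacedName) error {
	deployment := &appsv1.Deployment{}
	if err := cl.Get(ctx, name, deployment); err != nil {
		return err
	}
	fmt.Printf("spec.replicas=%d status.replicas=%d status.readyReplicas=%d\n",
		pointer.Int32Deref(deployment.Spec.Replicas, 0),
		deployment.Status.Replicas,
		deployment.Status.ReadyReplicas)

	podList := &corev1.PodList{}
	if err := cl.List(ctx, podList,
		client.InNamespace(name.Namespace),
		client.MatchingLabels(deployment.Spec.Selector.MatchLabels),
	); err != nil {
		return err
	}
	for _, pod := range podList.Items {
		fmt.Printf("pod=%s deletionTimestamp=%v phase=%s\n",
			pod.Name, pod.DeletionTimestamp, pod.Status.Phase)
	}
	return nil
}
```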

@candita
Contributor

candita commented Apr 5, 2023

/assign @frobware

@rfredette
Contributor Author

It appears that terminating pods are not counted in status.replicas, although I didn't observe that field directly. Before making this change, I ran the test with a second terminal open running oc get pods -w -n openshift-ingress, and noticed that an old openshift-router pod was still terminating but ready when the curl output at line 377 (https://github.com/openshift/cluster-ingress-operator/pull/904/files#diff-29d1d3601438a24c0a4a49463c65a350beed9bc9f0a3fad47734b193fb7d0c1bR377) resumed.

	t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
	return false, nil
}
return len(pods.Items) == int(*deployment.Spec.Replicas), nil
Contributor

@frobware frobware Apr 6, 2023

Is deployment.Spec.Replicas always non-nil?

Edit: ah, yes it is created in this function.

Contributor

I'm not so sure. deployment may always be non-nil, but deployment.Spec.Replicas could very well be nil. Am I missing something? I think it needs to be initialized like in waitForDeploymentComplete

replicas = *deployment.Spec.Replicas

Contributor

We could change the Spec.Replicas deref to:

return len(pods.Items) == int(pointer.Int32Deref(deployment.Spec.Replicas, -1)), nil

Contributor

39c6aef changes this to use Int32Deref. I kept the explicit expectedReplicas variable so that we aren't calling Int32Deref repeatedly; I hope that's fine.

@@ -3579,6 +3579,36 @@ func waitForDeploymentEnvVar(t *testing.T, cl client.Client, deployment *appsv1.
	return err
}

// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithCleanup(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
Contributor

The "Cleanup" part was not immediately obvious to me. How about waitForDeploymentRolloutAndOldPodsTermination?

}

return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
	pods := &corev1.PodList{}
Contributor

s/pods/pod.

Contributor Author

pods is a list of possibly multiple pods, so I think plural makes more sense. That said, if it's unclear as is, I do think podList as a name would work.

Contributor

In my head I read this as a podList, but Items brings the plurality.

@gcs278
Contributor

gcs278 commented Apr 13, 2023

I'll take a look at this too, since it seems PRs such as #901 are blocked on this.
/assign

Contributor

@gcs278 gcs278 left a comment

I experimented with rolling out a new router pod, and I can very much confirm what @rfredette is seeing: status.replicas counts terminating replicas and the terminating replicas can still be ready for a solid moment.

I provided an alternative solution if you'd like to consider it, but I could be missing something.

	return fmt.Errorf("failed to get deployment %s: %w", deploymentName.Name, err)
}

if deployment.Generation != deployment.Status.ObservedGeneration {
Contributor

Would it be cleaner/safer to skip this check and always do waitForDeploymentComplete? It already does this check internally, along with the other replica count checks.

	t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
	return false, nil
}
return len(pods.Items) == int(*deployment.Spec.Replicas), nil
Contributor

Isn't this the same as doing spec.replicas == status.replicas, which is what waitForDeploymentComplete already does? You are counting the pods manually, but I don't see how this is filtering out terminating pods.

Shouldn't you check for no metadata.deletionTimestamp on each of the pods before comparing the pod count to spec.replicas?
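
A sketch of that suggestion (it assumes the poll loop and imports from the sketch under the PR description, and is not the code the PR ultimately used):

```go
// Terminating pods carry a non-nil metadata.deletionTimestamp even while their
// readiness condition is still True, so filter them out before comparing the
// count against spec.replicas.
activePods := 0
for _, pod := range podList.Items {
	if pod.DeletionTimestamp == nil {
		activePods++
	}
}
return activePods == int(pointer.Int32Deref(deployment.Spec.Replicas, -1)), nil
```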

Contributor

@gcs278 gcs278 Apr 13, 2023

Alternatively, a better solution might instead be to ensure there are no terminating pods (of the deployment) that are also ready. This saves about 15 seconds of blocking when there is one terminating router pod that is not ready (it's not going to get requests) and our two other router pods are up.

Contributor Author

I think the reason we're seeing this issue is that status.replicas seems to either exclude pods that are terminating or only include pods from the current generation. Either way, I've found that spec.replicas == status.replicas doesn't indicate whether there are still terminating pods, while manually listing the pods does.

I do like the idea of checking for pods that are terminating but still ready. I'll try that and if it works, I'll add it to the PR.

Contributor Author

It turns out that "terminating" isn't actually a pod state; oc/kubectl will show that, but it's not a part of the pod status.

However, I did update the function to only count the number of ready pods, so it should return once the old pods become not ready.
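
Roughly, that refinement replaces the raw pod count in the earlier sketch with a ready-pod count (illustrative, not the verbatim diff):

```go
// Count only pods whose PodReady condition is True; old-generation pods drop
// out of this count as soon as they stop being ready, even before they are
// fully deleted.
readyPods := 0
for _, pod := range podList.Items {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			readyPods++
			break
		}
	}
}
return readyPods == expectedReplicas, nil
```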

…eteWithOldPodTermination

Also:
- Rename pods to podList
- When checking for old pod termination, only count the currently ready
  pods, instead of all pods
@rfredette
Contributor Author

The AWS and GCP operator suites failed, but TestClientTLS passed. Other suites also failed, but since this PR only touches the operator suite, those failures don't seem related.

/retest

@rfredette
Contributor Author

More test flakes in unrelated tests
/retest

@gcs278
Contributor

gcs278 commented Apr 25, 2023

Not related.
/retest-required

}

expectedReplicas := 1
if deployment.Spec.Replicas != nil && int(*deployment.Spec.Replicas) != 0 {
Contributor

The int(*deployment.Spec.Replicas) != 0 seems a little funny to me. replicas: 0 is indeed an allowed value in a deployment, and if you did have a deployment with it, this function would fail since it overrides it to 1.

I know it's a highly unlikely use-case. Did you have another reason for ignoring replicas: 0 and overriding it to 1?

Contributor

Fixed in 39c6aef.

@ShudiLi
Member

ShudiLi commented Apr 26, 2023

/label qe-approved
thanks

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 26, 2023
Comment on lines 3582 to 3584
// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
Contributor

Suggested change
// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
// waitForDeploymentCompleteWithOldPodTermination waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {

Contributor

Fixed in 39c6aef.

@gcs278
Contributor

gcs278 commented Apr 26, 2023

Ship it.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2023
Follow-up to commit 20e4e38.

* test/e2e/operator_test.go
(waitForDeploymentCompleteWithOldPodTermination): Correct the function name
in the godoc.  Use "k8s.io/utils/pointer".Int32Deref, and respect the value
in spec.replicas even if it is set explicitly to 0.
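
Presumably the updated check ends up along these lines (a sketch based on that description; the default value and variable names are illustrative, not the verbatim commit):

```go
// Int32Deref falls back to the default only when spec.replicas is nil, so an
// explicit replicas: 0 is respected instead of being overridden to 1.
expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, 1))
```
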
@Miciah Miciah force-pushed the ocpbugs-10846-TestClientTLS-fix branch from 39c6aef to 5cd2a01 Compare April 26, 2023 16:03
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2023
@frobware
Contributor

/approve

@gcs278
Contributor

gcs278 commented Apr 26, 2023

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Apr 26, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Apr 26, 2023
@gcs278
Contributor

gcs278 commented Apr 26, 2023

/test e2e-aws-operator
TestUnmanagedDNSToManagedDNSInternalIngressController failure

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD a29464e and 2 for PR HEAD 5cd2a01 in total

@gcs278
Contributor

gcs278 commented Apr 26, 2023

TestInternalLoadBalancer failure. I haven't seen that before, but this change only affects TestClientTLS

/test e2e-gcp-operator

@gcs278
Contributor

gcs278 commented Apr 27, 2023

TestUnmanagedDNSToManagedDNSInternalIngressController failure...
/test e2e-gcp-operator

@openshift-ci
Contributor

openshift-ci bot commented Apr 27, 2023

@rfredette: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 5cd2a01 link false /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit e93984f into openshift:master Apr 27, 2023
@openshift-ci-robot
Contributor

@rfredette: Jira Issue OCPBUGS-10846: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-10846 has been moved to the MODIFIED state.


@Miciah
Contributor

Miciah commented May 3, 2023

The test is also failing for 4.13 and 4.12 CI.
/cherry-pick release-4.13

@openshift-cherrypick-robot

@Miciah: new pull request created: #923

In response to this:

The test is also failing for 4.13 and 4.12 CI.
/cherry-pick release-4.13


@Miciah
Contributor

Miciah commented Jul 20, 2023

/cherry-pick release-4.12

@openshift-cherrypick-robot

@Miciah: new pull request created: #964

In response to this:

/cherry-pick release-4.12

