OCPBUGS-10846: Fix TestClientTLS flakes #904

Conversation

rfredette
Contributor

Wait for old router pods to be cleaned up before testing new mTLS config.

After the ingresscontroller configuration is updated, the router deployment reports that all new pods are ready, but sometimes pods from the older generation still haven't terminated. If those older-generation pods are still marked ready when TestClientTLS curls an endpoint, the connections are sometimes handled by an older-generation router, and that test case fails.

This PR makes the test wait until the older-generation pod(s) have completely terminated before running curl, ensuring that only the correct router pods are used.
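
In outline, the added wait helper looks something like the following (a simplified sketch assuming controller-runtime's client; the function name waitForOldRouterPodsGone and its details are illustrative, not the PR's exact code):

```go
package e2e

import (
	"context"
	"testing"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/utils/pointer"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForOldRouterPodsGone polls until the pods matching the deployment's
// selector are down to spec.replicas, i.e. the old generation has finished
// terminating and only the new router pods remain.
func waitForOldRouterPodsGone(t *testing.T, cl client.Client, name types.NamespacedName, timeout time.Duration) error {
	t.Helper()
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		deployment := &appsv1.Deployment{}
		if err := cl.Get(context.TODO(), name, deployment); err != nil {
			t.Logf("failed to get deployment %q: %v", name.Name, err)
			return false, nil
		}
		podList := &corev1.PodList{}
		if err := cl.List(context.TODO(), podList,
			client.InNamespace(name.Namespace),
			client.MatchingLabels(deployment.Spec.Selector.MatchLabels),
		); err != nil {
			t.Logf("failed to list pods for deployment %q: %v", name.Name, err)
			return false, nil
		}
		expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, -1))
		return len(podList.Items) == expectedReplicas, nil
	})
}
```

As the review discussion below notes, the counting logic was later refined to look only at ready pods.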

@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 4, 2023
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-10846, which is invalid:

  • expected the bug to target the "4.14.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested review from gcs278 and miheer April 4, 2023 21:09
@rfredette
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 4, 2023
@openshift-ci-robot
Contributor

@rfredette: This pull request references Jira Issue OCPBUGS-10846, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @ShudiLi

In response to this:

/jira refresh


@openshift-ci openshift-ci bot requested a review from ShudiLi April 4, 2023 23:07
@rfredette
Contributor Author

/retest

@Miciah
Contributor

Miciah commented Apr 5, 2023

Why isn't making sure that status.replicas isn't greater than spec.replicas sufficient?

if deployment.Spec.Replicas != nil {
	replicas = *deployment.Spec.Replicas
}
if replicas != deployment.Status.Replicas {
	return false, nil

status.replicas is the "number of non-terminated pods targeted by this deployment" according to https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#deploymentstatus-v1-apps. If there are pods that are marked for deletion but haven't been terminated yet, are they not counted?
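
One way to check that empirically (a hypothetical debugging helper, not part of this PR; it assumes the same imports as the sketch under the PR description, plus fmt) is to print the deployment's status counters next to the actual pods and their deletion timestamps:

```go
// dumpReplicaAccounting prints status.replicas alongside the live pods so you
// can see whether pods that are already marked for deletion are still counted.
func dumpReplicaAccounting(ctx context.Context, cl client.Client, name types.NamespacedName) error {
	deployment := &appsv1.Deployment{}
	if err := cl.Get(ctx, name, deployment); err != nil {
		return err
	}
	fmt.Printf("spec.replicas=%d status.replicas=%d status.readyReplicas=%d\n",
		pointer.Int32Deref(deployment.Spec.Replicas, 0),
		deployment.Status.Replicas,
		deployment.Status.ReadyReplicas)

	podList := &corev1.PodList{}
	if err := cl.List(ctx, podList,
		client.InNamespace(name.Namespace),
		client.MatchingLabels(deployment.Spec.Selector.MatchLabels),
	); err != nil {
		return err
	}
	for _, pod := range podList.Items {
		fmt.Printf("pod=%s deletionTimestamp=%v phase=%s\n",
			pod.Name, pod.DeletionTimestamp, pod.Status.Phase)
	}
	return nil
}
```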

@candita
Contributor

candita commented Apr 5, 2023

/assign @frobware

@rfredette
Contributor Author

It appears that terminating pods are not counted in status.replicas, although I didn't observe that field directly. Before making this change, I ran the test with a second terminal open running oc get pods -w -n openshift-ingress, and noticed that an old openshift-router pod was still terminating but ready when the curl output at line 377 (https://github.com/openshift/cluster-ingress-operator/pull/904/files#diff-29d1d3601438a24c0a4a49463c65a350beed9bc9f0a3fad47734b193fb7d0c1bR377) resumed.

	t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
	return false, nil
}
return len(pods.Items) == int(*deployment.Spec.Replicas), nil
Contributor

@frobware frobware Apr 6, 2023

Is deployment.Spec.Replicas always non-nil?

Edit: ah, yes it is created in this function.

Contributor

I'm not so sure. deployment may always be non-nil, but deployment.Spec.Replicas could very well be nil. Am I missing something? I think it needs to be initialized like in waitForDeploymentComplete

replicas = *deployment.Spec.Replicas

Contributor

We could change the Spec.Replicas deref to:

return len(pods.Items) == int(pointer.Int32Deref(deployment.Spec.Replicas, -1)), nil

Contributor

39c6aef changes this to use Int32Deref. I kept the explicit expectedReplicas variable so that we aren't calling Int32Deref repeatedly; I hope that's fine.

@@ -3579,6 +3579,36 @@ func waitForDeploymentEnvVar(t *testing.T, cl client.Client, deployment *appsv1.
	return err
}

// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithCleanup(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
Contributor

The "Cleanup" part was not immediately obvious to me. How about waitForDeploymentRolloutAndOldPodsTermination?

}

return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
	pods := &corev1.PodList{}
Contributor

s/pods/pod.

Contributor Author

pods is a list of possibly multiple pods, so I think plural makes more sense. That said, if it's unclear as is, I do think podList as a name would work.

Contributor

In my head I read this as a podList, but Items brings the plurality.

@gcs278
Contributor

gcs278 commented Apr 13, 2023

I'll take a look at this too, since it seems PRs such as #901 are blocked on this.
/assign

Contributor

@gcs278 gcs278 left a comment

I experimented with rolling out a new router pod, and I can very much confirm what @rfredette is seeing: status.replicas counts terminating replicas and the terminating replicas can still be ready for a solid moment.

I provided an alternative solution if you'd like to consider it, but I could be missing something.

	return fmt.Errorf("failed to get deployment %s: %w", deploymentName.Name, err)
}

if deployment.Generation != deployment.Status.ObservedGeneration {
Contributor

Would it be cleaner/safer to skip this check and always do waitForDeploymentComplete? It already does this check internally, along with the other replica count checks.

	t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
	return false, nil
}
return len(pods.Items) == int(*deployment.Spec.Replicas), nil
Contributor

Isn't this the same as doing spec.replicas == status.replicas, which is what waitForDeploymentComplete already does? You are counting the pods manually, but I don't see how this is filtering out terminating pods.

Shouldn't you check for no metadata.deletionTimestamp on each of the pods before comparing the pod count to spec.replicas?
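
A sketch of that suggestion (it assumes the poll loop and imports from the sketch under the PR description, and is not the code the PR ultimately used):

```go
// Terminating pods carry a non-nil metadata.deletionTimestamp even while their
// readiness condition is still True, so filter them out before comparing the
// count against spec.replicas.
activePods := 0
for _, pod := range podList.Items {
	if pod.DeletionTimestamp == nil {
		activePods++
	}
}
return activePods == int(pointer.Int32Deref(deployment.Spec.Replicas, -1)), nil
```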

Contributor

@gcs278 gcs278 Apr 13, 2023

Alternatively, a better solution might instead be to ensure there are no terminating pods (of the deployment) that are also ready. This saves about 15 seconds of blocking when there is one terminating router pod that is not ready (it's not going to get requests) and our two other router pods are up.

Contributor Author

I think the reason we're seeing this issue is that status.replicas seems to either exclude pods that are terminating or only include pods from the current generation. Either way, I've found that spec.replicas == status.replicas doesn't indicate whether there are still terminating pods, while manually listing the pods does.

I do like the idea of checking for pods that are terminating but still ready. I'll try that and if it works, I'll add it to the PR.

Contributor Author

It turns out that "terminating" isn't actually a pod state; oc/kubectl will show that, but it's not a part of the pod status.

However, I did update the function to only count the number of ready pods, so it should return once the old pods become not ready.
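
Roughly, that refinement replaces the raw pod count in the earlier sketch with a ready-pod count (illustrative, not the verbatim diff):

```go
// Count only pods whose PodReady condition is True; old-generation pods drop
// out of this count as soon as they stop being ready, even before they are
// fully deleted.
readyPods := 0
for _, pod := range podList.Items {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
			readyPods++
			break
		}
	}
}
return readyPods == expectedReplicas, nil
```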

…eteWithOldPodTermination

Also:
- Rename pods to podList
- When checking for old pod termination, only count the currently ready
  pods, instead of all pods
@rfredette
Contributor Author

The AWS and GCP operator suites failed, but TestClientTLS passed. Other suites also failed, but since this PR only touches the operator suite, those failures don't seem related.

/retest

@rfredette
Contributor Author

More test flakes in unrelated tests
/retest

@gcs278
Contributor

gcs278 commented Apr 25, 2023

Not related.
/retest-required

}

expectedReplicas := 1
if deployment.Spec.Replicas != nil && int(*deployment.Spec.Replicas) != 0 {
Contributor

The int(*deployment.Spec.Replicas) != 0 seems a little funny to me. replicas: 0 is indeed an allowed value in a deployment, and if you did have a deployment with it, this function would fail since it overrides it to 1.

I know it's a highly unlikely use-case. Did you have another reason for ignoring replicas: 0 and overriding it to 1?

Contributor

Fixed in 39c6aef.

@ShudiLi
Member

ShudiLi commented Apr 26, 2023

/label qe-approved
thanks

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Apr 26, 2023
Comment on lines 3582 to 3584
// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
Contributor

Suggested change
// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
// waitForDeploymentCompleteWithOldPodTermination waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {

Contributor

Fixed in 39c6aef.

@gcs278
Contributor

gcs278 commented Apr 26, 2023

Ship it.
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2023
Follow-up to commit 20e4e38.

* test/e2e/operator_test.go
(waitForDeploymentCompleteWithOldPodTermination): Correct the function name
in the godoc.  Use "k8s.io/utils/pointer".Int32Deref, and respect the value
in spec.replicas even if it is set explicitly to 0.
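
Presumably the updated check ends up along these lines (a sketch based on that description; the default value and variable names are illustrative, not the verbatim commit):

```go
// Int32Deref falls back to the default only when spec.replicas is nil, so an
// explicit replicas: 0 is respected instead of being overridden to 1.
expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, 1))
```
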
@Miciah Miciah force-pushed the ocpbugs-10846-TestClientTLS-fix branch from 39c6aef to 5cd2a01 Compare April 26, 2023 16:03
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 26, 2023
@frobware
Contributor

/approve

@gcs278
Contributor

gcs278 commented Apr 26, 2023

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Apr 26, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: frobware

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. labels Apr 26, 2023
@gcs278
Contributor

gcs278 commented Apr 26, 2023

/test e2e-aws-operator
TestUnmanagedDNSToManagedDNSInternalIngressController failure

@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD a29464e and 2 for PR HEAD 5cd2a01 in total

@gcs278
Contributor

gcs278 commented Apr 26, 2023

TestInternalLoadBalancer failure. I haven't seen that before, but this change only affects TestClientTLS

/test e2e-gcp-operator

@gcs278
Contributor

gcs278 commented Apr 27, 2023

TestUnmanagedDNSToManagedDNSInternalIngressController failure...
/test e2e-gcp-operator

@openshift-ci
Contributor

openshift-ci bot commented Apr 27, 2023

@rfredette: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 5cd2a01 link false /test e2e-aws-ovn-single-node

Full PR test history. Your PR dashboard.


@openshift-merge-robot openshift-merge-robot merged commit e93984f into openshift:master Apr 27, 2023
@openshift-ci-robot
Contributor

@rfredette: Jira Issue OCPBUGS-10846: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-10846 has been moved to the MODIFIED state.


@Miciah
Contributor

Miciah commented May 3, 2023

The test is also failing for 4.13 and 4.12 CI.
/cherry-pick release-4.13

@openshift-cherrypick-robot

@Miciah: new pull request created: #923

In response to this:

The test is also failing for 4.13 and 4.12 CI.
/cherry-pick release-4.13


@Miciah
Contributor

Miciah commented Jul 20, 2023

/cherry-pick release-4.12

@openshift-cherrypick-robot

@Miciah: new pull request created: #964

In response to this:

/cherry-pick release-4.12

