OCPBUGS-10846: Fix TestClientTLS flakes #904
Conversation
…esting new mTLS config
@rfredette: This pull request references Jira Issue OCPBUGS-10846, which is invalid. The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/jira refresh
@rfredette: This pull request references Jira Issue OCPBUGS-10846, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.
Requesting review from QA contact.
/retest
Why isn't the existing check in cluster-ingress-operator/test/e2e/operator_test.go (lines 3535 to 3539 in a29464e) sufficient?
status.replicas is the "number of non-terminated pods targeted by this deployment" according to https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.26/#deploymentstatus-v1-apps. If there are pods that are marked for deletion but haven't been terminated yet, are they not counted?
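For context, the kind of check being questioned is essentially a spec-versus-status replica comparison. A rough, illustrative sketch (not the repository's actual code at the referenced lines) would be:

// Illustrative only: compare the desired replica count from the spec against
// the counts the Deployment controller reports in status. Per the API docs
// quoted above, status.replicas counts non-terminated pods targeted by the
// deployment.
replicas := int32(1)
if deployment.Spec.Replicas != nil {
    replicas = *deployment.Spec.Replicas
}
return deployment.Status.Replicas == replicas && deployment.Status.AvailableReplicas == replicas, nil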
/assign @frobware
It appears that terminating pods are not counted in status.replicas, although I didn't observe that field directly. Before making this change, I ran the test and had a second terminal open running
test/e2e/operator_test.go (Outdated)
t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
return false, nil
}
return len(pods.Items) == int(*deployment.Spec.Replicas), nil
Is deployment.Spec.Replicas always non-nil?
Edit: ah, yes it is created in this function.
I'm not so sure. deployment may always be non-nil, but deployment.Spec.Replicas could very well be nil. Am I missing something? I think it needs to be initialized like in waitForDeploymentComplete:
replicas = *deployment.Spec.Replicas
We could change the Spec.Replicas deref to:
return len(pods.Items) == int(pointer.Int32Deref(deployment.Spec.Replicas, -1)), nil
39c6aef changes this to use Int32Deref. I kept the explicit expectedReplicas variable so that we aren't calling Int32Deref repeatedly; I hope that's fine.
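For illustration, the resulting pattern (dereference once, reuse the value) might look roughly like this; expectedReplicas and podList follow the names used in the PR, but this is a sketch rather than the merged code:

// Sketch: dereference Spec.Replicas once, using a sentinel default of -1
// (which can never equal a real pod count), then compare against
// expectedReplicas inside the poll callback instead of calling Int32Deref
// repeatedly.
expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, -1))
return len(podList.Items) == expectedReplicas, nil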
test/e2e/operator_test.go (Outdated)
@@ -3579,6 +3579,36 @@ func waitForDeploymentEnvVar(t *testing.T, cl client.Client, deployment *appsv1.
return err
}

// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithCleanup(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
The "Cleanup" part was not immediately obvious for me. How about waitForDeploymentRolloutAndOldPodsTermination
?
test/e2e/operator_test.go (Outdated)
}

return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
pods := &corev1.PodList{}
s/pods/pod.
pods is a list of possibly multiple pods, so I think plural makes more sense. That said, if it's unclear as is, I do think podList as a name would work.
In my head I read this as a podList, but Items brings the plurality.
I'll take a look at this too, since it seems PRs such as #901 are blocked on this.
I experimented with rolling out a new router pod, and I can very much confirm what @rfredette is seeing: status.replicas counts terminating replicas, and the terminating replicas can still be ready for a noticeable amount of time.
I provided an alternative solution if you'd like to consider it, but I could be missing something.
test/e2e/operator_test.go (Outdated)
return fmt.Errorf("failed to get deployment %s: %w", deploymentName.Name, err)
}

if deployment.Generation != deployment.Status.ObservedGeneration {
Would it be cleaner/safer to skip this check and always do waitForDeploymentComplete? It does this check internally, along with the other replica count checks.
test/e2e/operator_test.go (Outdated)
t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
return false, nil
}
return len(pods.Items) == int(*deployment.Spec.Replicas), nil
Isn't this the same as doing spec.replicas == status.replicas, which is what waitForDeploymentComplete already does? You are counting the pods manually, but I don't see how this is filtering out terminating pods. Shouldn't you check that each pod has no metadata.deletionTimestamp before comparing the pod count to spec.replicas?
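A minimal sketch of that deletionTimestamp idea, assuming podList is the corev1.PodList already fetched for the deployment and expectedReplicas is the dereferenced spec value (illustrative, not the PR's code):

// Count only pods that are not being deleted; a terminating pod has a
// non-nil metadata.deletionTimestamp.
nonTerminating := 0
for i := range podList.Items {
    if podList.Items[i].DeletionTimestamp == nil {
        nonTerminating++
    }
}
return nonTerminating == expectedReplicas, nil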
Alternatively, a better solution might be to ensure there are no terminating pods (of the deployment) that are also ready. This saves about 15 seconds of blocking when there is one terminating router pod that is not ready (it's not going to get requests) and our two other router pods are up.
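A sketch of that alternative, under the same assumptions about podList (corev1 is k8s.io/api/core/v1); again illustrative rather than the PR's code:

// Succeed as soon as no pod of the deployment is both marked for deletion
// and still reporting Ready, since only such pods can still receive traffic.
for i := range podList.Items {
    pod := &podList.Items[i]
    if pod.DeletionTimestamp == nil {
        continue
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
            return false, nil
        }
    }
}
return true, nil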
I think the reason we're seeing this issue is that status.replicas seems either to exclude pods that are terminating or to include only pods in the current generation. Either way, I've found that spec.replicas == status.replicas doesn't tell you whether there are still terminating pods, while manually listing the pods does.
I do like the idea of checking for pods that are terminating but still ready. I'll try that, and if it works, I'll add it to the PR.
It turns out that "terminating" isn't actually a pod state; oc/kubectl will display it, but it's not part of the pod status.
However, I did update the function to count only the ready pods, so it should return once the old pods become not ready.
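A small helper along the lines of what "count only ready pods" implies (corev1 is k8s.io/api/core/v1; the helper name is illustrative, not necessarily what the PR uses):

// podIsReady reports whether the pod's PodReady condition is True. Counting
// only pods for which this holds means the wait ends once the old-generation
// pods stop being ready, even if they haven't been fully deleted yet.
func podIsReady(pod *corev1.Pod) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady {
            return cond.Status == corev1.ConditionTrue
        }
    }
    return false
}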
…eteWithOldPodTermination

Also:
- Rename pods to podList
- When checking for old pod termination, only count the currently ready pods, instead of all pods
The aws and gcp operator suites failed, but TestClientTLS passed. Other suites also failed, but since this PR only touches the operator suite, those failures don't seem related.
/retest
More test flakes in unrelated tests.
Not related.
test/e2e/operator_test.go (Outdated)
}

expectedReplicas := 1
if deployment.Spec.Replicas != nil && int(*deployment.Spec.Replicas) != 0 {
The int(*deployment.Spec.Replicas) != 0 seems a little funny to me. replicas: 0 is indeed an allowed value in a deployment, and if you did have a deployment with it, this function would fail since it overrides the value to 1. I know it's a highly unlikely use case. Did you have another reason for ignoring replicas: 0 and overriding it to 1?
Fixed in 39c6aef.
/label qe-approved
test/e2e/operator_test.go (Outdated)
// waitForDeploymentCompleteWithCleanup waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
Suggested change:
// waitForDeploymentCompleteWithOldPodTermination waits for a deployment to complete its rollout, then waits for the old
// generation's pods to finish terminating.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
Fixed in 39c6aef.
Ship it.
Follow-up to commit 20e4e38.

* test/e2e/operator_test.go (waitForDeploymentCompleteWithOldPodTermination): Correct the function name in the godoc. Use "k8s.io/utils/pointer".Int32Deref, and respect the value in spec.replicas even if it is set explicitly to 0.
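In code, that follow-up presumably boils down to something like the line below; the default of 1 is an assumption matching the Deployment API's default when replicas is unset:

// Use the spec value whenever it is set, even if it is explicitly 0; fall
// back to 1 only when Spec.Replicas is nil.
expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, 1))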
39c6aef to 5cd2a01
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: frobware. The full list of commands accepted by this bot can be found here. The pull request process is described here.
/test e2e-aws-operator
/test e2e-gcp-operator
@rfredette: The following test failed.
@rfredette: Jira Issue OCPBUGS-10846: All pull requests linked via external trackers have merged. Jira Issue OCPBUGS-10846 has been moved to the MODIFIED state.
The test is also failing for 4.13 and 4.12 CI.
@Miciah: new pull request created: #923
/cherry-pick release-4.12
@Miciah: new pull request created: #964
Wait for old router pods to be cleaned up before testing new mTLS config.
After updating the ingresscontroller configuration, the router deployment reports that all new pods are ready, but sometimes the pods from the older generation still haven't terminated. If those older-generation pods are still marked ready when TestClientTLS curls an endpoint, the connections are sometimes handled by the older-generation router, and that test case fails.
This PR makes the test wait until the older-generation pod(s) are completely terminated before running curl, ensuring that only the correct router pods handle the requests.
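Putting the thread together, a self-contained sketch of a helper in the spirit of waitForDeploymentCompleteWithOldPodTermination is shown below. It illustrates the approach discussed above (poll until only the expected number of pods are still ready), not the merged code; in particular, selecting the deployment's pods via its spec selector labels is an assumption.

// (Package declaration omitted; this would live in the e2e test package.)
import (
    "context"
    "testing"
    "time"

    appsv1 "k8s.io/api/apps/v1"
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/utils/pointer"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForDeploymentCompleteWithOldPodTermination (sketch) polls until the
// deployment's latest spec has been observed and the number of Ready pods
// matches spec.replicas, i.e. old-generation pods are gone or no longer ready.
func waitForDeploymentCompleteWithOldPodTermination(t *testing.T, cl client.Client, deploymentName types.NamespacedName, timeout time.Duration) error {
    deployment := &appsv1.Deployment{}
    return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
        if err := cl.Get(context.TODO(), deploymentName, deployment); err != nil {
            t.Logf("failed to get deployment %q: %v", deploymentName.Name, err)
            return false, nil
        }
        if deployment.Generation != deployment.Status.ObservedGeneration {
            // The controller hasn't observed the latest spec change yet.
            return false, nil
        }
        // Respect an explicit replicas: 0; default to 1 only when the field is unset.
        expectedReplicas := int(pointer.Int32Deref(deployment.Spec.Replicas, 1))
        podList := &corev1.PodList{}
        // Assumption: the deployment's pods can be listed via its selector labels.
        if err := cl.List(context.TODO(), podList, client.InNamespace(deploymentName.Namespace), client.MatchingLabels(deployment.Spec.Selector.MatchLabels)); err != nil {
            t.Logf("failed to list pods for deployment %q: %v", deploymentName.Name, err)
            return false, nil
        }
        // Count only pods that are currently Ready; terminating old-generation
        // pods stop being Ready before they disappear.
        readyPods := 0
        for i := range podList.Items {
            for _, cond := range podList.Items[i].Status.Conditions {
                if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
                    readyPods++
                    break
                }
            }
        }
        return readyPods == expectedReplicas, nil
    })
}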