OCPBUGS-23435: workloadctl: account for terminating pods #1732

stlaz · 2024-05-06T12:33:25Z

Don't set Progressing=False if some pods from the previous generation are still running.

/assign @openshift/openshift-team-auth
/cc @deads2k

openshift-ci-robot · 2024-05-06T12:33:31Z

@stlaz: This pull request references Jira Issue OCPBUGS-23435, which is invalid:

expected the bug to target either version "4.16." or "openshift-4.16.", but it targets "4.15.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Don't set Progressing=False if some pods from the previous generation are still running.

/assign @openshift/openshift-team-auth
/cc @deads2k

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2024-05-06T12:33:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stlaz
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/operator/apiserver/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

stlaz · 2024-05-06T13:41:56Z

/hold
I accidentally used a live client instead of a lister

stlaz · 2024-05-06T13:53:43Z

/hold cancel


openshift-ci · 2024-05-06T15:17:13Z

@stlaz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

atiratree · 2024-05-06T20:30:01Z

pkg/operator/apiserver/controller/workload/workload.go

+	} else if tooManyMatchingPods {
+		deploymentProgressingCondition.Status = operatorv1.ConditionTrue
+		deploymentProgressingCondition.Reason = "PreviousGenPodsPresent"
+		deploymentProgressingCondition.Message = fmt.Sprintf("deployment/%s.%s: %d pod(s) from the previous generation are still present", workload.Name, c.targetNamespace, len(matchingPods)-int(desiredReplicas))


This may not always be correct. There may be extra pods for other reasons. For example, if pods are disrupted or evicted, the deployment controller will create extra pods to account for that.

stlaz · May 7, 2024

How would you word it, then? Just "too many pods"?

atiratree · May 7, 2024 (edited)

In k8s, this is considered a complete deployment, so it depends on what you want to convey. There can also be extra pods during a rollout with maxSurge, but in that case the pods will have different revisions/hashes.

stlaz · May 7, 2024

I suppose the message should be about pods of a different revision still existing. Can you think of any other case where extra pods might cause unexpected behavior?

atiratree · May 7, 2024

Apart from the terminating pods, which are the subject of this bug, no.

I guess there might be some exotic cases where the pods are owned by another controller, but that can be safely ignored here.
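As a rough illustration of the point raised in this thread, here is a minimal sketch (not the code in this PR) of a more neutral message that reports the surplus over the desired replica count and, separately, how many of the matching pods are terminating. The `Pod` struct, its `Terminating` field, and the functions `countTerminating` and `extraPodsMessage` are hypothetical stand-ins; in a real controller the check would be whether the pod's `metadata.deletionTimestamp` is set.

```go
package main

import "fmt"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
// Terminating stands in for "metadata.deletionTimestamp is set".
type Pod struct {
	Name        string
	Terminating bool
}

// countTerminating returns how many of the given pods are terminating.
func countTerminating(pods []Pod) int {
	n := 0
	for _, p := range pods {
		if p.Terminating {
			n++
		}
	}
	return n
}

// extraPodsMessage returns a message and true when more pods match the
// workload than desired. The wording does not claim that the surplus pods
// belong to a previous generation, since they may exist for other reasons
// (for example, replacements for evicted pods).
func extraPodsMessage(name, namespace string, pods []Pod, desiredReplicas int) (string, bool) {
	extra := len(pods) - desiredReplicas
	if extra <= 0 {
		return "", false
	}
	msg := fmt.Sprintf(
		"deployment/%s.%s: %d pod(s) beyond the desired replica count are present (%d terminating)",
		name, namespace, extra, countTerminating(pods),
	)
	return msg, true
}

func main() {
	// Illustrative data only: two running pods plus one that is terminating.
	pods := []Pod{{Name: "a"}, {Name: "b"}, {Name: "old", Terminating: true}}
	if msg, ok := extraPodsMessage("example", "example-ns", pods, 2); ok {
		fmt.Println(msg)
	}
}
```

Reporting the terminating count separately lets a reader tell a lingering old pod apart from other kinds of surplus without the message asserting a cause.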

atiratree · 2024-05-06T20:32:10Z

pkg/operator/apiserver/controller/workload/workload.go

+	// contribute to unexpected behavior if we report Progressing=False.
+	// The case of too many pods might occur for example if `TerminationGracePeriodSeconds`
+	// is set.
+	tooManyMatchingPods := int32(len(matchingPods)) > desiredReplicas


It would be safer to test whether all the pods have the same hash (the pod-template-hash label). This would work well in combination with workloadIsBeingUpdated.

stlaz · May 7, 2024

Is pod-template-hash set on 100% of our deployments?

atiratree · May 7, 2024

Yes, it is:

https://github.com/kubernetes/kubernetes/blob/0590bb1ac495ae8af2a573f879408e48800da2c5/pkg/controller/deployment/sync.go#L191

stlaz · May 7, 2024

I think the pod spec can stay the same while only the underlying config changes. Would that still work?

atiratree · May 7, 2024

The deployment rollout has to be triggered somehow, and it seems that you are triggering it with the resource revisions of the dependencies. So yes, it should work.

https://github.com/openshift/cluster-authentication-operator/blob/b415439ebab2829c8da1ea17c05f2ac75fe5dbe8/pkg/controllers/deployment/default_deployment.go#L54
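The hash comparison suggested in this thread could look roughly like the sketch below. It is an assumption-laden illustration, not the change that landed in this PR: the `Pod` struct and the function `allPodsShareTemplateHash` are hypothetical, and only the `pod-template-hash` label key comes from the discussion. The combination with workloadIsBeingUpdated mentioned above is not modeled here.

```go
package main

import "fmt"

// podTemplateHashLabel is the label key discussed in the review thread.
const podTemplateHashLabel = "pod-template-hash"

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod.
type Pod struct {
	Name   string
	Labels map[string]string
}

// allPodsShareTemplateHash reports whether every pod carries the same,
// non-empty pod-template-hash label. An empty slice returns true because
// there is no pod from a different revision to wait for.
func allPodsShareTemplateHash(pods []Pod) bool {
	if len(pods) == 0 {
		return true
	}
	first := pods[0].Labels[podTemplateHashLabel]
	if first == "" {
		// A pod without the label cannot be matched to a revision.
		return false
	}
	for _, p := range pods[1:] {
		if p.Labels[podTemplateHashLabel] != first {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative data only: the hash values are made up.
	sameRevision := []Pod{
		{Name: "a", Labels: map[string]string{podTemplateHashLabel: "abc123"}},
		{Name: "b", Labels: map[string]string{podTemplateHashLabel: "abc123"}},
	}
	mixedRevisions := append([]Pod{}, sameRevision...)
	mixedRevisions = append(mixedRevisions, Pod{
		Name:   "old",
		Labels: map[string]string{podTemplateHashLabel: "def456"},
	})

	fmt.Println(allPodsShareTemplateHash(sameRevision))
	fmt.Println(allPodsShareTemplateHash(mixedRevisions))
}
```

Compared with counting pods against the desired replica count, a check like this would not flag replacement pods created after an eviction, because those share the current revision's hash.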

openshift-ci bot requested a review from deads2k May 6, 2024 12:33

stlaz force-pushed the wl_terminating_pods branch from 84214a9 to 141390e May 6, 2024 12:38

openshift-ci bot added the do-not-merge/hold label May 6, 2024

stlaz force-pushed the wl_terminating_pods branch from 141390e to 04d6124 May 6, 2024 13:53

openshift-ci bot removed the do-not-merge/hold label May 6, 2024

workloadctl: account for terminating pods

c238cda

Don't set Progressing=False if some pods from the previous generation are still running.

stlaz force-pushed the wl_terminating_pods branch from 04d6124 to c238cda May 6, 2024 14:54

atiratree reviewed May 6, 2024


