Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OCPBUGS-23435: workloadctl: account for terminating pods #1732

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

stlaz
Copy link
Contributor

@stlaz stlaz commented May 6, 2024

Don't set Progressing=False if some pods from the previous generation are still running.

/assign @openshift/openshift-team-auth
/cc @deads2k

@openshift-ci openshift-ci bot requested a review from deads2k May 6, 2024 12:33
@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels May 6, 2024
@openshift-ci-robot
Copy link

@stlaz: This pull request references Jira Issue OCPBUGS-23435, which is invalid:

  • expected the bug to target either version "4.16." or "openshift-4.16.", but it targets "4.15.z" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Don't set Progressing=False if some pods from the previous generation are still running.

/assign @openshift/openshift-team-auth
/cc @deads2k

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented May 6, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stlaz
Once this PR has been reviewed and has the lgtm label, please assign p0lyn0mial for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stlaz
Copy link
Contributor Author

stlaz commented May 6, 2024

/hold
I accidentally used a live client instead of a lister

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
@stlaz
Copy link
Contributor Author

stlaz commented May 6, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 6, 2024
Don't set Progressing=False if some pods from the previous generation
are still running.
Copy link
Contributor

openshift-ci bot commented May 6, 2024

@stlaz: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

} else if tooManyMatchingPods {
deploymentProgressingCondition.Status = operatorv1.ConditionTrue
deploymentProgressingCondition.Reason = "PreviousGenPodsPresent"
deploymentProgressingCondition.Message = fmt.Sprintf("deployment/%s.%s: %d pod(s) from the previous generation are still present", workload.Name, c.targetNamespace, len(matchingPods)-int(desiredReplicas))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may not always be correct. There may be extra pods for different reasons. E.g. if they are disrupted/evicted for some reason, the deployment controller will create extra pods to account for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you word it, then? Just too many pods?

Copy link
Member

@atiratree atiratree May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In k8s, this is considered a complete deployment, so it depends on what you want to convey. There can also be extra pods during a rollout with maxSurge, but in that case the pods will have different revisions/hashes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the message should be about pods of a different revision still existing. Would you be able to think of any other case where extra pods might cause unexpected behavior?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the terminating pods which are a subject of this bug, no.

I guess, there might be some exotic cases where the pods can be owned by another controller, but that can be safely ignored here.

// contribute to unexpected behavior if we report Progressing=False.
// The case of too many pods might occur for example if `TerminationGracePeriodSeconds`
// is set.
tooManyMatchingPods := int32(len(matchingPods)) > desiredReplicas
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Safer would be to test if all the pods have the same hash (pod-template-hash label). This would work well in combination with workloadIsBeingUpdated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is pod-template-hash set on 100% of our deployments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the pod spec can be the same, only the underlying config changed. Would that still work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deployment rollout has to be triggered somehow and it seems that you are triggering it with the resource revisions of the dependencies. So yes it should work.

https://github.com/openshift/cluster-authentication-operator/blob/b415439ebab2829c8da1ea17c05f2ac75fe5dbe8/pkg/controllers/deployment/default_deployment.go#L54

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants