Update pod controller to reconcile not found pods #1512

Merged
merged 1 commit into from
Dec 27, 2023

Conversation

achernevskii
Contributor

@achernevskii achernevskii commented Dec 22, 2023

What type of PR is this?

/kind bug

What this PR does / why we need it:

  • If a pod is not found, it shouldn't be skipped, since there could be a dangling workload that still needs to be finalized.

  • Update the Skip method of the pod controller to return false if the pod is not found.

  • Add an integration test for the case where pods are finalized and deleted before the workload finalizer is removed.
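
A minimal sketch of the Skip behaviour described in the list above, not the actual Kueue code. The `Pod` type, the `found` flag, and the `managedLabel` name are assumptions made for illustration; the only point taken from the PR is that a pod that was not found must not be skipped.

```go
package main

import "fmt"

// Pod is a hypothetical, minimal stand-in for corev1.Pod; only the field
// needed for this illustration is included.
type Pod struct {
	Labels map[string]string
}

// managedLabel is an assumed label name used only in this sketch.
const managedLabel = "example.com/managed"

// skip reports whether reconciliation should be skipped for a pod.
// found is false when the pod lookup returned NotFound.
func skip(pod *Pod, found bool) bool {
	// A missing pod must not be skipped: a workload created for it may
	// still carry a finalizer that the reconciler has to remove.
	if !found {
		return false
	}
	// Pods that exist but are not marked as managed can be skipped.
	return pod.Labels[managedLabel] != "true"
}

func main() {
	managed := &Pod{Labels: map[string]string{managedLabel: "true"}}
	fmt.Println(skip(nil, false))    // pod not found: reconcile
	fmt.Println(skip(&Pod{}, true))  // unmanaged pod: skip
	fmt.Println(skip(managed, true)) // managed pod: reconcile
}
```

With this shape, the not-found case falls through to the regular reconcile path, where any leftover workload can be finalized.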

Which issue(s) this PR fixes:

Fixes #1450

Special notes for your reviewer:

The integration test could be replaced with a unit test to speed up testing.

Does this PR introduce a user-facing change?

Fix a bug in the pod integration that unexpected errors will occur when the pod isn't found

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Dec 22, 2023

netlify bot commented Dec 22, 2023

Deploy Preview for kubernetes-sigs-kueue ready!

Name Link
🔨 Latest commit 1d8a45c
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/6585fc7af580f40008baa5fd
😎 Deploy Preview https://deploy-preview-1512--kubernetes-sigs-kueue.netlify.app

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Dec 22, 2023
@alculquicondor
Contributor

  1. Can you explain what happens in this line when the Pod doesn't exist?

if !object.GetDeletionTimestamp().IsZero() {

  2. I think this fix is still incomplete. This is relying on the fact that a key for the Pod made it to the work queue. This will not be true if the kueue controller restarts after the Pods have been deleted. Hence, I think a better solution is to remove the workload finalizer in the workload controller when it has a Finished condition or the ownerReferences are empty (see "fix: remove finalizer if workload finished", #1454 (comment)).
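
The alternative proposed in point 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from #1454 or #1523: the `Workload` struct, the `Finished` flag (standing in for a true Finished condition), and `finalizerName` are all hypothetical.

```go
package main

import "fmt"

// Workload is a hypothetical, trimmed-down stand-in for a Kueue Workload.
type Workload struct {
	Finalizers      []string
	OwnerReferences []string // owner names only, for illustration
	Finished        bool     // stands in for a true "Finished" condition
}

// finalizerName is an assumed value used only in this sketch.
const finalizerName = "example.com/resource-in-use"

// shouldDropFinalizer reports whether the workload controller itself may
// remove the finalizer: the workload is finished or no owner is left.
func shouldDropFinalizer(wl Workload) bool {
	return wl.Finished || len(wl.OwnerReferences) == 0
}

// dropFinalizer returns the finalizers without finalizerName.
func dropFinalizer(finalizers []string) []string {
	kept := make([]string, 0, len(finalizers))
	for _, f := range finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	return kept
}

func main() {
	// All owning pods are gone, so no pod key will ever reach the work queue.
	wl := Workload{Finalizers: []string{finalizerName}}
	if shouldDropFinalizer(wl) {
		wl.Finalizers = dropFinalizer(wl.Finalizers)
	}
	fmt.Println(len(wl.Finalizers))
}
```

Because the decision depends only on the workload's own state, it would not rely on a key for the deleted Pod being present in the work queue after a controller restart.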

@achernevskii
Contributor Author

Can you explain what happens in this line when the Pod doesn't exist?

GetDeletionTimestamp will return a nil pointer to metav1.Time, IsZero will return true:
https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/time.go#L60-L62
And the job reconciler won't try to finalize a non-existing job.

I could add an additional check to make sure that the job is found, rather than relying on the behaviour of IsZero and GetDeletionTimestamp.
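
The nil-pointer behaviour described in this comment can be reproduced with a small self-contained sketch. The `Time` type below is a local stand-in that mirrors the nil-safe `IsZero` of `metav1.Time` at the linked lines; `object` and `isBeingDeleted` are hypothetical names introduced only for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Time is a local stand-in mirroring the nil-safe IsZero of metav1.Time.
type Time struct {
	time.Time
}

// IsZero returns true for a nil receiver as well as for the zero time.
func (t *Time) IsZero() bool {
	if t == nil {
		return true
	}
	return t.Time.IsZero()
}

// object is a hypothetical stand-in for the job object held by the reconciler.
type object struct {
	deletionTimestamp *Time
}

func (o *object) GetDeletionTimestamp() *Time { return o.deletionTimestamp }

// isBeingDeleted mirrors the condition quoted in the review question.
func isBeingDeleted(o *object) bool {
	return !o.GetDeletionTimestamp().IsZero()
}

func main() {
	// An object that was never populated, as after a failed lookup.
	notFound := &object{}
	fmt.Println(isBeingDeleted(notFound))

	now := Time{time.Now()}
	deleting := &object{deletionTimestamp: &now}
	fmt.Println(isBeingDeleted(deleting))
}
```

Calling `IsZero` on the nil pointer is safe because the method has a pointer receiver and checks for nil before dereferencing, so the finalization branch is not taken for an object that was not found.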

@alculquicondor
Contributor

I could add an additional check to make sure that the job is found, rather than relying on the behaviour of IsZero and GetDeletionTimestamp.

No, I think your answer is enough.

Can you open a separate PR to solve the problem via the workload finalizer instead? I want to see which of the PRs is simpler so that it can be cherry-picked to release-0.5.

@alculquicondor
Contributor

Actually, the bug shouldn't be exclusive to Pods. It could also happen if a Job disappears before the Workload is finalized.

So I really think we should take the other approach.

@achernevskii
Contributor Author

Can you open a separate PR to solve the problem via the workload finalizer instead? I want to see which of the PRs is simpler so that it can be cherry-picked to release-0.5.

Here's a PR for the finalization in the workload reconciler: #1523

I think we should still merge this one. Pod reconciliation should still run when the pod is not found, as it does for any other job.

@alculquicondor
Contributor

Ah right, other jobs don't have Skip.

Can you also cherry-pick into release-0.5 if the automatic one doesn't work?

/cherry-pick
/lgtm
/approve

@k8s-infra-cherrypick-robot

@alculquicondor: once the present PR merges, I will cherry-pick it on top of /lgtm in a new PR and assign it to you.

In response to this:

Ah right, other jobs don't have Skip.

Can you also cherry-pick into release-0.5 if the automatic one doesn't work?

/cherry-pick
/lgtm
/approve

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 27, 2023
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: b4a264aaa0ea9966d7f85b2255688a4d5e2f3d2d

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: achernevskii, alculquicondor


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 27, 2023
@tenzen-y
Member

/cherry-pick release-0.5

@k8s-infra-cherrypick-robot

@tenzen-y: once the present PR merges, I will cherry-pick it on top of release-0.5 in a new PR and assign it to you.


@k8s-ci-robot k8s-ci-robot merged commit a0757fa into kubernetes-sigs:main Dec 27, 2023
14 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.6 milestone Dec 27, 2023
@k8s-infra-cherrypick-robot

@alculquicondor: cannot checkout /lgtm: error checking out "/lgtm": exit status 128 fatal: /lgtm: '/lgtm' is outside repository at '/var/tmp/gitrepo1351745998'


@k8s-infra-cherrypick-robot

@tenzen-y: #1512 failed to apply on top of branch "release-0.5":

Applying: Update pod controller to reconcile not found pods
Using index info to reconstruct a base tree...
M	pkg/controller/jobs/pod/pod_controller.go
M	test/integration/controller/jobs/pod/pod_controller_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/integration/controller/jobs/pod/pod_controller_test.go
Auto-merging pkg/controller/jobs/pod/pod_controller.go
CONFLICT (content): Merge conflict in pkg/controller/jobs/pod/pod_controller.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Update pod controller to reconcile not found pods
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".


@tenzen-y
Member


@achernevskii Can you create a cherry-pick PR?

@achernevskii
Contributor Author

Created a separate fix pull request #1524

@tenzen-y
Member

/release-note-edit

Fix a bug in the pod integration that unexpected error will occur when the pod isn't find

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Jan 15, 2024
@tenzen-y
Member

/release-note-edit

Fix a bug in the pod integration that unexpected errors will occur when the pod isn't found

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

High churn cluster with pod only causing stuck queue and overcommitment
5 participants