Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[multikueue] Remove remote objects synchronously when reachable. #2347

Conversation

trasc
Copy link
Contributor

@trasc trasc commented Jun 3, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Remove the multikueue remote objects synchronously when reachable based on the filters cache workload content.

Which issue(s) this PR fixes:

Fixes #2320

Special notes for your reviewer:

Does this PR introduce a user-facing change?

MultiKueue: Remove remote objects synchronously when the worker cluster is reachable.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jun 3, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 3, 2024
Copy link

netlify bot commented Jun 3, 2024

Deploy Preview for kubernetes-sigs-kueue canceled.

Name Link
🔨 Latest commit d0ee7df
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-sigs-kueue/deploys/665ef68733997a0008b2600d

@trasc trasc force-pushed the multikueue-delete-delegates-on-local-wl-delete branch from 85ee548 to 8c87ac6 Compare June 3, 2024 14:32
@alculquicondor
Copy link
Contributor

/assign @mimowo

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation lgtm. As for testing, I would like to see a test case for the baseline scenario.

@@ -59,6 +61,7 @@ type wlReconciler struct {
clusters *clustersReconciler
origin string
workerLostTimeout time.Duration
deletedWlCache *utilmaps.SyncMap[string, *kueue.Workload]
Copy link
Contributor

@mimowo mimowo Jun 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if this could leak a key-value pair if MultiKueueGC removes the workload in the meanwhile, when the synchronous delete could fail, leaking the entry. Should we also delete the entry when performing MultiKueueGC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, in the reconciler the group will be empty the object deletion skipped and the workload deleted from the cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, but I think there could be a race condition when the group contains a worker, but in the meanwhile the workload is removed by GC. In that case the delete request in the new code would probably fail as the object does not exist, and thus it would not delete the key. Can you double check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NotFound errors are ignored during the object deletion, other errors will trigger a new reconcile that will eventually get to remove the cache entry.
(I have extended a bit the unit tests)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking and the update to the tests, the extra check is useful

Copy link
Contributor

@mimowo mimowo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
@alculquicondor @tenzen-y I think it would be good to include this in 0.7.x line. WDYT?

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 4, 2024
@k8s-ci-robot
Copy link
Contributor

LGTM label has been added.

Git tree hash: 83604b2ac7412b89720fa3ca5896a150db3a7714

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, trasc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2024
@k8s-ci-robot k8s-ci-robot merged commit a63984c into kubernetes-sigs:main Jun 4, 2024
16 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v0.8 milestone Jun 4, 2024
@trasc trasc deleted the multikueue-delete-delegates-on-local-wl-delete branch June 4, 2024 12:15
@alculquicondor
Copy link
Contributor

ok with me to cherry-pick.

@tenzen-y
Copy link
Member

tenzen-y commented Jun 5, 2024

/lgtm /approve @alculquicondor @tenzen-y I think it would be good to include this in 0.7.x line. WDYT?

SGTM

@trasc
Copy link
Contributor Author

trasc commented Jun 5, 2024

/cherry-pick release-0.7

@k8s-infra-cherrypick-robot

@trasc: new pull request created: #2360

In response to this:

/cherry-pick release-0.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@alculquicondor
Copy link
Contributor

/release-note-edit

MultiKueue: Remove remote objects synchronously when the worker cluster is reachable.

Fiona-Waters pushed a commit to Fiona-Waters/kueue that referenced this pull request Jun 25, 2024
…ernetes-sigs#2347)

* [multikueue] Remove remote objects synchronously when reachable.

* Extend unit test coverage.

* Review remarks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MultiKueue] When a JobSet is deleted on management cluster it takes up to 1min to delete from workers
6 participants