[MultiKueue] When a JobSet is deleted on management cluster it takes up to 1min to delete from workers #2320

Closed · mimowo opened this issue May 29, 2024 · 12 comments · Fixed by #2347
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

mimowo (Contributor) commented on May 29, 2024:

What happened:

When running a MultiKueue environment, if you delete a JobSet on the manager, it is not deleted instantly on the worker.
It can take up to 1 min for the garbage collector to delete the JobSet on the worker.

What you expected to happen:

When a user deletes the JobSet on the manager, it should be deleted immediately on the worker.
Only if the delete request fails should we fall back to the garbage-collector mechanism.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a MultiKueue environment. A single worker is enough for reproduction.
  2. Create a JobSet.
  3. Delete the JobSet on the management cluster.

Issue: the mirror JobSet remains running on the worker for a prolonged amount of time.

mimowo added the kind/bug label on May 29, 2024
mimowo (Contributor, Author) commented on May 29, 2024:

/cc @alculquicondor @trasc

trasc (Contributor) commented on May 29, 2024:

This is not a bug, it's a known limitation: the MultiKueue workload reconciler is unable to know whether a workload that is currently not found was the subject of delegation.

Sure, we can try to work around this (by using a finalizer, trying to detect this in the Delete event predicate, ...).
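
For illustration, a minimal sketch of the Delete-event-predicate option with controller-runtime; the label key and the delegation check are hypothetical placeholders, not Kueue's actual implementation:

```go
package multikueue

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Unlike a later Get (which only returns NotFound), the Delete event still
// carries the final state of the object, so a reconciler can tell whether
// the workload had been delegated to a worker cluster.
var workloadDeleted = predicate.Funcs{
	DeleteFunc: func(e event.DeleteEvent) bool {
		// Hypothetical marker set at delegation time; a real check would
		// inspect the workload's MultiKueue admission state instead.
		_, wasDelegated := e.Object.GetLabels()["example.com/multikueue-delegated"]
		return wasDelegated
	},
}
```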

/assign

mimowo (Contributor, Author) commented on May 29, 2024:

I didn't know this was a known limitation.

I think this is certainly bug-like behavior from the user's perspective. If the improvement does not require an API change, I think we can categorize it as a bugfix rather than a new feature.

EDIT: I think one advantage of categorizing this as a bug is that we could cherry-pick the fix for 0.7.1 if it does not make 0.7.0. We don't cherry-pick new features.

mimowo (Contributor, Author) commented on May 29, 2024:

Maybe we can avoid the use of finalizers.

I think that, based on the delete event, we could enqueue cleanup of the worker clusters. So we would have an in-memory queue of workloads to clean. Sure, if the Kueue controller is restarted, it means falling back to the regular garbage collector, but this is fine.
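
A minimal sketch of that queue idea, assuming a hypothetical deleteRemote callback that issues the delete request against the worker cluster (illustrative only, not the actual Kueue code):

```go
package multikueue

import (
	"context"
	"log"

	"k8s.io/client-go/util/workqueue"
)

// In-memory queue of deleted workloads whose remote mirrors still need
// removal. It is lost on restart, in which case the periodic garbage
// collector remains the fallback.
var cleanupQueue = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

// enqueueCleanup would be called from the Delete event handler on the manager.
func enqueueCleanup(workloadKey string) {
	cleanupQueue.Add(workloadKey)
}

// runCleanupWorker drains the queue, deleting the mirror object on the
// worker cluster; failed deletes are retried with rate limiting.
func runCleanupWorker(ctx context.Context, deleteRemote func(context.Context, string) error) {
	for {
		item, shutdown := cleanupQueue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := deleteRemote(ctx, key); err != nil {
			log.Printf("remote delete of %s failed, will retry: %v", key, err)
			cleanupQueue.AddRateLimited(key)
		} else {
			cleanupQueue.Forget(key)
		}
		cleanupQueue.Done(item)
	}
}
```

If the process restarts, the queue starts empty and the existing garbage collector covers any leftovers, matching the fallback described above.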

mimowo (Contributor, Author) commented on May 29, 2024:

I'm thinking about a queue-based mechanism, something similar to what we use in the Job controller to clean up the orphan pods (ref).

EDIT: I'm open to something more involved, or simpler, if we have other ideas.

trasc (Contributor) commented on Jun 3, 2024:

> EDIT: I think one advantage of categorizing this as a bug is that we could cherry-pick the fix for 0.7.1 if it does not make 0.7.0. We don't cherry-pick new features.

I don't see any issue with backporting small features; we did it in the past (v0.6.3, v0.6.1).

mimowo (Contributor, Author) commented on Jun 4, 2024:

I was surprised by the delay, and I expect other users who are not intimately familiar with the MultiKueue code to consider the delay a bug. AFAIK this isn't documented as a known limitation, so I'm not sure how users would know that.

Having said that, I'm fine either way; I will leave the final tagging to @alculquicondor, who updates the tagging anyway when preparing release notes.

tenzen-y (Member) commented:

> I was surprised by the delay, and I expect other users who are not intimately familiar with the MultiKueue code to consider the delay a bug. AFAIK this isn't documented as a known limitation, so I'm not sure how users would know that.
>
> Having said that, I'm fine either way; I will leave the final tagging to @alculquicondor, who updates the tagging anyway when preparing release notes.

I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.

mimowo (Contributor, Author) commented on Jun 12, 2024:

> I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.

I'm not sure; this is still an alpha feature, so some small issues are probably acceptable. We have also cherry-picked the fix onto the 0.7 branch.

tenzen-y (Member) commented:

> > I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.
>
> I'm not sure; this is still an alpha feature, so some small issues are probably acceptable. We have also cherry-picked the fix onto the 0.7 branch.

I agree with you. It would be better to mention this limitation when MultiKueue graduates to beta.

mimowo (Contributor, Author) commented on Jun 12, 2024:

Yeah, but it is fixed already, so IMO there is nothing to mention.

tenzen-y (Member) commented:

> Yeah, but it is fixed already, so IMO there is nothing to mention.

Oh, I hadn't noticed that :) NVM
Thank you!
