[MultiKueue] When a JobSet is deleted on management cluster it takes up to 1min to delete from workers #2320
Comments
This is not a bug; it's a known limitation. The MultiKueue workload reconciler is unable to know whether a workload that is currently not found was the subject of delegation. Sure, we can try to work around this (by using a finalizer, or by trying to detect this in the Delete event predicate...). /assign
I didn't know this was a known limitation. I think this is certainly bug-like behavior from the user's perspective. If the improvement does not require an API change, I think we can categorize it as a bugfix rather than a new feature. EDIT: I think one advantage of categorizing it as a bug is that we could cherry-pick it for 0.7.1 if the fix does not make it into 0.7.0. We don't cherry-pick new features.
Maybe we can avoid the use of finalizers. I think that, based on the delete event, we could enqueue cleanup of the worker clusters? So we would have an in-memory queue of workloads to clean up. Sure, if the Kueue controller is restarted we fall back to the regular garbage collector, but this is fine.
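To make the in-memory queue idea concrete, here is a minimal sketch in plain Go. It is not Kueue's implementation: `WorkloadKey`, `RemoteDeleter`, `CleanupQueue`, and `fakeWorker` are hypothetical names invented for illustration, and the Delete event handler that would call `Enqueue` is assumed rather than shown.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// WorkloadKey identifies a workload by namespace and name (hypothetical type).
type WorkloadKey struct {
	Namespace, Name string
}

// RemoteDeleter is a hypothetical stand-in for a worker-cluster client.
type RemoteDeleter interface {
	Delete(key WorkloadKey) error
}

// CleanupQueue is an in-memory, de-duplicating queue of workloads whose
// remote copies should be removed. Its contents are lost on restart, in
// which case the periodic garbage collector is expected to take over.
type CleanupQueue struct {
	mu      sync.Mutex
	pending map[WorkloadKey]struct{}
	order   []WorkloadKey
}

func NewCleanupQueue() *CleanupQueue {
	return &CleanupQueue{pending: map[WorkloadKey]struct{}{}}
}

// Enqueue would be called from the Delete event handler on the manager.
func (q *CleanupQueue) Enqueue(key WorkloadKey) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, ok := q.pending[key]; ok {
		return // already queued
	}
	q.pending[key] = struct{}{}
	q.order = append(q.order, key)
}

// Drain attempts to delete every queued workload from every worker.
// It returns the keys for which at least one deletion failed; those are
// dropped from the queue and left to the garbage collector.
func (q *CleanupQueue) Drain(workers []RemoteDeleter) []WorkloadKey {
	q.mu.Lock()
	keys := q.order
	q.order = nil
	q.pending = map[WorkloadKey]struct{}{}
	q.mu.Unlock()

	var failed []WorkloadKey
	for _, key := range keys {
		ok := true
		for _, w := range workers {
			if err := w.Delete(key); err != nil {
				ok = false
			}
		}
		if !ok {
			failed = append(failed, key)
		}
	}
	return failed
}

// fakeWorker is an in-memory worker cluster used only for this demo.
type fakeWorker struct {
	objects map[WorkloadKey]bool
	broken  bool
}

func (f *fakeWorker) Delete(key WorkloadKey) error {
	if f.broken {
		return errors.New("worker unreachable")
	}
	delete(f.objects, key)
	return nil
}

func main() {
	key := WorkloadKey{Namespace: "default", Name: "jobset-a"}
	worker := &fakeWorker{objects: map[WorkloadKey]bool{key: true}}
	q := NewCleanupQueue()
	q.Enqueue(key)
	q.Enqueue(key) // duplicate delete events collapse into one entry
	failed := q.Drain([]RemoteDeleter{worker})
	fmt.Println("failed:", len(failed), "remaining on worker:", len(worker.objects))
}
```

Failed keys are deliberately not re-queued in this sketch, which matches the "fall back to the garbage collector" behavior described above; a real controller might prefer a rate-limited retry instead.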
I'm thinking about a queue-based mechanism, something similar to what we use in the Job controller to clean up orphan pods, ref. EDIT: I'm open to something more involved or simpler if we have other ideas.
I don't see any issues with backporting small features; we did it in the past (v0.6.3, v0.6.1).
I was surprised by the delay, and I expect other users who are not intimately familiar with the MultiKueue code to consider the delay a bug. AFAIK this isn't documented as a known limitation, so I'm not sure how users would know that. Having said that, I'm fine either way. I will leave the final tagging to @alculquicondor, who updates tagging anyway when preparing release notes.
I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.
I'm not sure. This is still an alpha feature, so some small issues are probably acceptable. We have also cherry-picked the fix to the 0.7 branch.
I agree with you. It would be better to mention this limitation when MultiKueue graduates to beta.
Yeah, but it is fixed already, so IMO there is nothing to mention.
Oh, I didn't find that :) NVM
What happened:
When you delete a JobSet on the manager in a MultiKueue environment, it is not deleted instantly on the worker.
It takes up to 1 min for the garbage collector to delete the JobSet on the worker.
What you expected to happen:
When a user deletes the JobSet on the manager, it should be deleted immediately on the worker.
Only if the delete request fails should we fall back to the garbage-collector mechanism.
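The expected behavior can be sketched as a toy model in plain Go: try the delete on the worker right away, and let a periodic garbage-collection pass pick up whatever the immediate attempt missed. All names here (`worker`, `onManagerDelete`, `gcPass`) are hypothetical and only illustrate the control flow; they are not taken from the Kueue code base.

```go
package main

import (
	"errors"
	"fmt"
)

// worker is a hypothetical in-memory stand-in for a worker cluster.
type worker struct {
	jobsets     map[string]bool
	unreachable bool
}

func (w *worker) deleteJobSet(name string) error {
	if w.unreachable {
		return errors.New("worker unreachable")
	}
	delete(w.jobsets, name)
	return nil
}

// onManagerDelete tries to remove the mirror JobSet immediately.
// It reports whether the mirror is gone; false means the mirror is
// left for the next garbage-collection pass.
func onManagerDelete(w *worker, name string) bool {
	return w.deleteJobSet(name) == nil
}

// gcPass models one periodic garbage-collection run: it removes every
// mirror whose original no longer exists on the manager and returns
// the number of mirrors removed.
func gcPass(w *worker, manager map[string]bool) int {
	if w.unreachable {
		return 0
	}
	removed := 0
	for name := range w.jobsets {
		if !manager[name] {
			delete(w.jobsets, name)
			removed++
		}
	}
	return removed
}

func main() {
	manager := map[string]bool{} // the JobSet was already deleted on the manager
	w := &worker{jobsets: map[string]bool{"jobset-a": true}, unreachable: true}

	// The immediate delete fails because the worker is unreachable.
	fmt.Println("deleted immediately:", onManagerDelete(w, "jobset-a"))

	// Once the worker is reachable again, the GC pass cleans up the orphan.
	w.unreachable = false
	fmt.Println("removed by GC:", gcPass(w, manager))
}
```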
How to reproduce it (as minimally and precisely as possible):
Issue: the mirror JobSet remains running on the worker for a prolonged amount of time.