[MultiKueue] When a JobSet is deleted on management cluster it takes up to 1min to delete from workers #2320

Closed · mimowo opened this issue May 29, 2024 · 12 comments · Fixed by #2347
Labels: kind/bug (Categorizes issue or PR as related to a bug.)

mimowo (Contributor) commented on May 29, 2024:

What happened:

When running a MultiKueue environment, if you delete a JobSet on the manager, it is not deleted instantly on the worker.
It can take up to 1 min for the garbage collector to delete the JobSet on the worker.

What you expected to happen:

When a user deletes the JobSet on the manager, it should be deleted immediately on the worker.
Only if the delete request fails should we fall back to the garbage-collector mechanism.

How to reproduce it (as minimally and precisely as possible):

  1. Set up a MultiKueue environment. A single worker is enough for reproduction.
  2. Create a JobSet.
  3. Delete the JobSet on the management cluster.

Issue: the mirror JobSet remains running on the worker for a prolonged amount of time.

mimowo added the kind/bug label on May 29, 2024
mimowo (Contributor, Author) commented on May 29, 2024:

/cc @alculquicondor @trasc

trasc (Contributor) commented on May 29, 2024:

This is not a bug, it's a known limitation: the MultiKueue workload reconciler is unable to know whether a workload that is currently not found was the subject of delegation.

Sure, we can try to work around this (by using a finalizer, trying to detect this in the Delete event predicate, ...).
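
For illustration, a minimal sketch of the Delete-event-predicate option with controller-runtime; the label key and the delegation check are hypothetical placeholders, not Kueue's actual implementation:

```go
package multikueue

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// Unlike a later Get (which only returns NotFound), the Delete event still
// carries the final state of the object, so a reconciler can tell whether
// the workload had been delegated to a worker cluster.
var workloadDeleted = predicate.Funcs{
	DeleteFunc: func(e event.DeleteEvent) bool {
		// Hypothetical marker set at delegation time; a real check would
		// inspect the workload's MultiKueue admission state instead.
		_, wasDelegated := e.Object.GetLabels()["example.com/multikueue-delegated"]
		return wasDelegated
	},
}
```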

/assign

mimowo (Contributor, Author) commented on May 29, 2024:

I didn't know this was a known limitation.

I think this is certainly bug-like behavior from the user's perspective. If the improvement does not require an API change, I think we can categorize it as a bugfix rather than a new feature.

EDIT: I think one advantage of categorizing this as a bug is that we could cherry-pick the fix for 0.7.1 if it does not make 0.7.0. We don't cherry-pick new features.

mimowo (Contributor, Author) commented on May 29, 2024:

Maybe we can avoid the use of finalizers.

I think that, based on the delete event, we could enqueue cleanup of the worker clusters. So we would have an in-memory queue of workloads to clean. Sure, if the Kueue controller is restarted, it means falling back to the regular garbage collector, but this is fine.
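
A minimal sketch of that queue idea, assuming a hypothetical deleteRemote callback that issues the delete request against the worker cluster (illustrative only, not the actual Kueue code):

```go
package multikueue

import (
	"context"
	"log"

	"k8s.io/client-go/util/workqueue"
)

// In-memory queue of deleted workloads whose remote mirrors still need
// removal. It is lost on restart, in which case the periodic garbage
// collector remains the fallback.
var cleanupQueue = workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

// enqueueCleanup would be called from the Delete event handler on the manager.
func enqueueCleanup(workloadKey string) {
	cleanupQueue.Add(workloadKey)
}

// runCleanupWorker drains the queue, deleting the mirror object on the
// worker cluster; failed deletes are retried with rate limiting.
func runCleanupWorker(ctx context.Context, deleteRemote func(context.Context, string) error) {
	for {
		item, shutdown := cleanupQueue.Get()
		if shutdown {
			return
		}
		key := item.(string)
		if err := deleteRemote(ctx, key); err != nil {
			log.Printf("remote delete of %s failed, will retry: %v", key, err)
			cleanupQueue.AddRateLimited(key)
		} else {
			cleanupQueue.Forget(key)
		}
		cleanupQueue.Done(item)
	}
}
```

If the process restarts, the queue starts empty and the existing garbage collector covers any leftovers, matching the fallback described above.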

mimowo (Contributor, Author) commented on May 29, 2024:

I'm thinking about a queue-based mechanism, something similar to what we use in the Job controller to clean up the orphan pods (ref).

EDIT: I'm open to something more involved, or simpler, if we have other ideas.

trasc (Contributor) commented on Jun 3, 2024:

> EDIT: I think one advantage of categorizing this as a bug is that we could cherry-pick the fix for 0.7.1 if it does not make 0.7.0. We don't cherry-pick new features.

I don't see any issue with backporting small features; we did it in the past (v0.6.3, v0.6.1).

mimowo (Contributor, Author) commented on Jun 4, 2024:

I was surprised by the delay, and I expect other users who are not intimately familiar with the MultiKueue code to consider the delay a bug. AFAIK this isn't documented as a known limitation, so I'm not sure how users would know that.

Having said that, I'm fine either way; I will leave the final tagging to @alculquicondor, who updates the tagging anyway when preparing release notes.

tenzen-y (Member) commented:

> I was surprised by the delay, and I expect other users who are not intimately familiar with the MultiKueue code to consider the delay a bug. AFAIK this isn't documented as a known limitation, so I'm not sure how users would know that.
>
> Having said that, I'm fine either way; I will leave the final tagging to @alculquicondor, who updates the tagging anyway when preparing release notes.

I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.

mimowo (Contributor, Author) commented on Jun 12, 2024:

> I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.

I'm not sure; this is still an alpha feature, so some small issues are probably acceptable. We have also cherry-picked the fix onto the 0.7 branch.

tenzen-y (Member) commented:

> > I wonder whether mentioning this limitation in the troubleshooting guide would be worth it.
>
> I'm not sure; this is still an alpha feature, so some small issues are probably acceptable. We have also cherry-picked the fix onto the 0.7 branch.

I agree with you. It would be better to mention this limitation when MultiKueue graduates to beta.

mimowo (Contributor, Author) commented on Jun 12, 2024:

Yeah, but it is fixed already, so IMO there is nothing to mention.

tenzen-y (Member) commented:

> Yeah, but it is fixed already, so IMO there is nothing to mention.

Oh, I hadn't noticed that :) NVM
Thank you!
