Do not create Jobs and the service when JobSet is suspended #535

Open
mimowo opened this issue Apr 19, 2024 · 13 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@mimowo
Contributor

mimowo commented Apr 19, 2024

To reproduce, create a simple JobSet with spec.suspend=true. The Jobs and the service are still created, even though the JobSet is suspended.

This is wasteful for Jobs queued by Kueue, which may stay in the queue for a long time.

@mimowo
Contributor Author

mimowo commented Apr 19, 2024

@alculquicondor

alculquicondor commented Apr 19, 2024

I think this could be particularly wasteful when the Jobs are rather small but have a large number of replicas.

@kannon92
Contributor

/kind feature

k8s-ci-robot added the kind/feature label on Apr 19, 2024
@kannon92
Contributor

When I implemented this, I figured I would just take JobSet.Spec.Suspend and set it on all the Jobs that are created. Resuming the JobSet then means resuming the individual Jobs.

I can see why maybe we would want to go a different route. I tagged this as a feature.

@kannon92
Contributor

Is it a big deal to have the service created?

@mimowo
Contributor Author

mimowo commented Apr 19, 2024

Maybe not a "big" deal, but Kueue is typically used to hold long queues of suspended Jobs (or JobSets), say 50k, so it would be nice not to create them.

I imagine it would be fine to keep an already-created service for a JobSet that got suspended (it was running, but got preempted). There should not be too many preemptions, and we would avoid recreating the service if the JobSet is quickly re-admitted.
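
As a rough illustration of the behavior proposed above, here is a minimal Go sketch. The types `JobSetState` and `Plan` and the function `planCreation` are hypothetical names made up for this example; they are not part of the JobSet API or controller. The sketch assumes a reconcile pass only decides what to create and never deletes an existing service.

```go
package main

import "fmt"

// JobSetState is a hypothetical, simplified view of a JobSet used only for
// this illustration. It is not the real JobSet API type.
type JobSetState struct {
	Suspended     bool // mirrors spec.suspend
	ServiceExists bool // the service was created by an earlier pass
	JobsExist     bool // the child Jobs were created by an earlier pass
}

// Plan lists which child objects one reconcile pass would create.
type Plan struct {
	CreateService bool
	CreateJobs    bool
}

// planCreation sketches the proposal: a suspended JobSet creates nothing,
// while a resumed JobSet creates whatever is still missing. Nothing here
// deletes a service, so one created before a preemption is simply reused.
func planCreation(s JobSetState) Plan {
	if s.Suspended {
		return Plan{}
	}
	return Plan{
		CreateService: !s.ServiceExists,
		CreateJobs:    !s.JobsExist,
	}
}

func main() {
	queued := JobSetState{Suspended: true}
	admitted := JobSetState{}
	readmitted := JobSetState{ServiceExists: true}

	fmt.Printf("queued:     %+v\n", planCreation(queued))
	fmt.Printf("admitted:   %+v\n", planCreation(admitted))
	fmt.Printf("readmitted: %+v\n", planCreation(readmitted))
}
```

With this shape, a long queue of suspended JobSets produces no child objects at all, and a preempted JobSet that is re-admitted only needs its Jobs created again.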

@kannon92
Contributor

I think the main tricky point would be support for startup policy and suspend.

Our implementation with suspend and startup policy was to resume the replicated jobs in order of their listing.

I guess this change could clean that up, as we would only create the Jobs once they are resumed. But it may be a bit tricky to implement...
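
To make the interaction concrete, here is a small Go sketch of "create only on resume, in listing order". The `ReplicatedJob` struct and `nextToCreate` function are hypothetical and invented for this example. The `Ready` flag is an assumption standing in for whatever condition the startup policy actually waits on.

```go
package main

import "fmt"

// ReplicatedJob is a hypothetical, simplified stand-in for one entry of the
// JobSet's replicated job list. It is not the real API type.
type ReplicatedJob struct {
	Name    string
	Created bool // child Jobs for this entry already exist
	Ready   bool // assumed readiness signal the startup policy waits on
}

// nextToCreate returns the name of the replicated job whose Jobs should be
// created next, walking the list in its declared order. It returns false
// when nothing should be created in this pass.
func nextToCreate(suspended bool, jobs []ReplicatedJob) (string, bool) {
	if suspended {
		// Nothing is created while the JobSet is suspended.
		return "", false
	}
	for _, rj := range jobs {
		if !rj.Created {
			return rj.Name, true
		}
		if !rj.Ready {
			// An earlier entry exists but is not ready yet, so wait.
			return "", false
		}
	}
	// Every entry has been created and is ready.
	return "", false
}

func main() {
	jobs := []ReplicatedJob{
		{Name: "driver", Created: true, Ready: true},
		{Name: "workers"},
	}
	if name, ok := nextToCreate(false, jobs); ok {
		fmt.Println("create next:", name)
	}
	if _, ok := nextToCreate(true, jobs); !ok {
		fmt.Println("suspended: nothing to create")
	}
}
```

In this model, resuming in order and creating in order become the same step, which is the cleanup hinted at above.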

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Jul 18, 2024
@mimowo
Contributor Author

mimowo commented Jul 26, 2024

/remove-lifecycle stale
I'm investigating this so that we can fix #625

k8s-ci-robot removed the lifecycle/stale label (denotes an issue or PR that has remained open with no activity and has become stale) on Jul 26, 2024
@kannon92
Contributor

I don't quite follow why this is required for #625. Can you explain?

@mimowo
Contributor Author

mimowo commented Jul 26, 2024

> I don't quite follow why this is required for #625. Can you explain?

Consider the following chain of states: suspend - resume 1 - suspend - resume 2.

Here resume 1 and resume 2 may use different Pod templates (updated during the suspend), so we need to recreate the Jobs at some point.

Deleting the Jobs in the suspend phase seems simplest. The Job controller also deletes pods on suspend.
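
The chain above can be sketched as a toy Go model. `JobSet`, `Job`, and `reconcile` here are hypothetical simplifications written for this example, not the real controller code, and the `PodTemplate` string stands in for the full Pod template.

```go
package main

import "fmt"

// JobSet and Job are hypothetical, simplified models for this illustration.
type JobSet struct {
	Suspended   bool
	PodTemplate string // stands in for the full Pod template
}

type Job struct {
	PodTemplate string // template the Job was created with
}

// reconcile returns the child Job that should exist after one pass.
// On suspend the Job is deleted (nil), so the next resume always creates
// a fresh Job from whatever template the JobSet carries at that time.
func reconcile(js JobSet, existing *Job) *Job {
	if js.Suspended {
		return nil
	}
	if existing == nil {
		return &Job{PodTemplate: js.PodTemplate}
	}
	return existing
}

func main() {
	steps := []JobSet{
		{Suspended: true, PodTemplate: "v1"},  // suspend
		{Suspended: false, PodTemplate: "v1"}, // resume 1
		{Suspended: true, PodTemplate: "v2"},  // suspend, template updated
		{Suspended: false, PodTemplate: "v2"}, // resume 2
	}
	var job *Job
	for i, js := range steps {
		job = reconcile(js, job)
		if job == nil {
			fmt.Printf("step %d: no Job\n", i+1)
		} else {
			fmt.Printf("step %d: Job with template %s\n", i+1, job.PodTemplate)
		}
	}
}
```

If `reconcile` kept the existing Job across the second suspend instead of returning nil, resume 2 would still run with the template from resume 1, which is the situation the deletion is meant to avoid.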

@kannon92
Contributor

I see. So Jobs delete pods on a resume/suspend but JobSet was keeping the Jobs around?

@mimowo
Contributor Author

mimowo commented Jul 26, 2024

Correct. The alternative would be for JobSet to update the existing Jobs, but this is rather complex.

First, due to mutability constraints in Jobs, it would require multiple requests, similar to what we do in Kueue.

Second, applying the new Pod template would revert any changes made to the Jobs by create webhooks that users may have.

5 participants