This repository has been archived by the owner on Sep 12, 2023. It is now read-only.

Add job suspend semantics #196

Open: wants to merge 1 commit into base: master

xiaoxubeii

To support job suspend semantics like Kubernetes batch job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign gaocegege after the PR has been reviewed.
You can assign the PR to them by writing /assign @gaocegege in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gaocegege
Member

/ok-to-test

@gaocegege
Member

Thanks for the PR, is it ready to review?

@xiaoxubeii
Author

> Thanks for the PR, is it ready to review?

@gaocegege Ready for review. Thanks :)

if err != nil {
	return err
}
if commonutil.IsSucceeded(jobStatus) || commonutil.IsFailed(jobStatus) || (jobSuspended != nil && *jobSuspended) {


My concern here is that Suspend is just a transitional state. Should we delete all the pods, or only the active ones, leaving the completed pods (succeeded/failed) as they are?

Contributor

If anything, it should have the same semantics as the Kubernetes Job, where we delete the running pods.

Author

Yes, the current implementation is consistent with batch/job.
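
To make the semantics discussed in this thread concrete, here is a minimal sketch of the "delete only the active pods" behavior that a suspended Kubernetes batch Job follows. The `Pod` and `PodPhase` types and the `podsToDeleteOnSuspend` helper are hypothetical stand-ins invented for illustration. They are not the types or functions used in kubeflow/common or in this PR.

```go
package main

import "fmt"

// PodPhase and Pod are hypothetical stand-ins for the Kubernetes core/v1 types.
type PodPhase string

const (
	PodPending   PodPhase = "Pending"
	PodRunning   PodPhase = "Running"
	PodSucceeded PodPhase = "Succeeded"
	PodFailed    PodPhase = "Failed"
)

type Pod struct {
	Name  string
	Phase PodPhase
}

// isActive reports whether a pod has not yet reached a terminal phase.
func isActive(p Pod) bool {
	return p.Phase != PodSucceeded && p.Phase != PodFailed
}

// podsToDeleteOnSuspend returns the pods to delete when a job is suspended.
// A nil or false suspended flag means nothing is deleted; completed pods
// (succeeded or failed) are always left in place.
func podsToDeleteOnSuspend(suspended *bool, pods []Pod) []Pod {
	if suspended == nil || !*suspended {
		return nil
	}
	var active []Pod
	for _, p := range pods {
		if isActive(p) {
			active = append(active, p)
		}
	}
	return active
}

func main() {
	suspended := true
	pods := []Pod{
		{Name: "worker-0", Phase: PodRunning},
		{Name: "worker-1", Phase: PodSucceeded},
		{Name: "worker-2", Phase: PodPending},
	}
	for _, p := range podsToDeleteOnSuspend(&suspended, pods) {
		fmt.Println("delete", p.Name)
	}
}
```

Treating a nil flag the same as false keeps the sketch compatible with controllers that do not report a suspend state at all.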

@@ -357,3 +361,8 @@ func (jc *JobController) CleanupJob(runPolicy *apiv1.RunPolicy, jobStatus apiv1.
func (jc *JobController) calcPGMinResources(minMember int32, replicas map[apiv1.ReplicaType]*apiv1.ReplicaSpec) *v1.ResourceList {
	return CalcPGMinResources(minMember, replicas, jc.PriorityClassLister.Get)
}

func (jc *JobController) JobSuspended(job interface{}) (*bool, error) {
	log.Infof("Not implemented.")
Member

I am wondering if we should merge this since the feature is not completed.

Author

This is a base-class default function, in case Job subclasses (TFJob, MPIJob, etc.) do not implement this method.

Author

Actually, it will be overridden in Job subclasses that support job suspend.

Contributor

If DeletePodsAndServices has an implementation, I don't see why this function wouldn't have one.
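
The "default in the base, override in the subclass" idea from this thread can be sketched in Go as follows. Everything here is an assumption made for illustration: the `Controller` interface is a trimmed-down, hypothetical version of what a job-specific controller would implement, and `BaseController`, `SuspendableJob`, `SuspendableController`, and `isSuspended` are invented names, not code from this PR. Because Go has no virtual dispatch through embedding, the caller goes through the interface so that the override is the method that runs.

```go
package main

import "fmt"

// Controller is a hypothetical, trimmed-down version of the interface a
// job-specific controller would implement.
type Controller interface {
	JobSuspended(job interface{}) (*bool, error)
}

// BaseController provides the default: suspend is not supported, so it
// returns nil rather than a pointer to false.
type BaseController struct{}

func (BaseController) JobSuspended(job interface{}) (*bool, error) {
	return nil, nil
}

// SuspendableJob is a hypothetical job type carrying a suspend flag.
type SuspendableJob struct {
	Suspend *bool
}

// SuspendableController embeds BaseController and overrides JobSuspended.
type SuspendableController struct {
	BaseController
}

func (SuspendableController) JobSuspended(job interface{}) (*bool, error) {
	j, ok := job.(*SuspendableJob)
	if !ok {
		return nil, fmt.Errorf("unexpected job type %T", job)
	}
	return j.Suspend, nil
}

// isSuspended calls through the interface so the override is the one that runs.
func isSuspended(c Controller, job interface{}) (bool, error) {
	s, err := c.JobSuspended(job)
	if err != nil {
		return false, err
	}
	return s != nil && *s, nil
}

func main() {
	yes := true
	job := &SuspendableJob{Suspend: &yes}
	base, _ := isSuspended(BaseController{}, job)
	sub, _ := isSuspended(SuspendableController{}, job)
	fmt.Println(base, sub)
}
```

Returning `*bool` lets a controller distinguish "suspend not supported" (nil) from "supported but not suspended" (false), which is one possible reason for the pointer return type in the diff above.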

@ggaaooppeenngg

How is this PR going now?

@alculquicondor
Contributor

Is this actively being worked on? Or will we get rid of the common repo first?

@tenzen-y
Member

tenzen-y commented Jan 6, 2023

> Is this actively being worked on? Or will we get rid of the common repo first?

@alculquicondor We will probably work on the Job suspend feature in the next Kubeflow release cycle (maybe Kubeflow v1.8?), since we didn't propose this feature for the enhancement list of the next Kubeflow release (v1.7) and the v1.7 feature freeze is coming up.

kubeflow/training-operator#1683

Wed Jan 25th 2023 Week 18 Release Team Feature Freeze

https://github.com/kubeflow/community/blob/6ba2e0e754166989d2f0d06aae827ceafdb65b29/releases/release-1.7/README.md

@johnugeorge
Member

Agreed. We will take this up in the next release, after merging kubeflow/common as planned in kubeflow/training-operator#1714 (comment).

@alculquicondor
Contributor

alculquicondor commented Jan 20, 2023

@tenzen-y how do you feel about starting with the integration for mpi-operator v2 and following through with training-operator later?
It might give us a better chance to iterate faster and learn.

@tenzen-y
Member

tenzen-y commented Jan 20, 2023

> @tenzen-y how do you feel about starting with the integration for mpi-operator v2 and follow through with training-operator later? It might give us a better chance to iterate faster and learn.

@alculquicondor Yes, that is a good idea. I was thinking the same.
However, we need to move kubernetes-sigs/kueue#369 forward before we adapt mpi-operator to Kueue.

@alculquicondor
Contributor

Excellent!
We can work on the kueue side in parallel, while we add support for suspend in the mpi-operator.

@tenzen-y
Member

> Excellent! We can work on the kueue side in parallel, while we add support for suspend in the mpi-operator.

You are right. I will work on the following steps after the Kubeflow feature freeze date (1/25), since I don't have enough bandwidth for mpi-operator v2 right now:

  1. Upgrade Kubernetes dependencies mpi-operator#502
  2. Support coscheduling plugin mpi-operator#500
  3. Support suspend in mpi-operator

That said, anyone else can take step 3 once step 1 is completed.

@alculquicondor
Contributor

@mimowo will help with suspend in mpi-operator kubeflow/mpi-operator#504

@tenzen-y
Member

Great! Thanks to @mimowo!

@xiaoxubeii
Author

> > Is this actively being worked on? Or will we get rid of the common repo first?
>
> @alculquicondor Maybe, we will work on the Job suspend feature in the next kubeflow release cycle (maybe kubeflow v1.8?). Since we didn't push this feature to the enhancement lists for the next kubeflow release (v1.7) and the feature freeze for the next kubeflow version (v1.7) is coming up.
>
> kubeflow/training-operator#1683
>
> Wed Jan 25th 2023 Week 18 Release Team Feature Freeze
>
> https://github.com/kubeflow/community/blob/6ba2e0e754166989d2f0d06aae827ceafdb65b29/releases/release-1.7/README.md

Agreed. We could try to work on the Job suspend feature for Kubeflow v1.8.

@alculquicondor
Contributor

@johnugeorge how are we doing with the branch creation?
Can we proceed with this PR or move it to training-operator?


7 participants