-
Notifications
You must be signed in to change notification settings - Fork 71
Add job suspend semantics #196
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/ok-to-test |
Thanks for the PR, is it ready to review? |
@gaocegege Ready for review. Thanks :) |
if err != nil { | ||
return err | ||
} | ||
if commonutil.IsSucceeded(jobStatus) || commonutil.IsFailed(jobStatus) || (jobSuspended != nil && *jobSuspended) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern here is Suspend
is just a transition state, should we delete all the pods or just the active ones, leaving the completed pods(succeeded/failed) as they are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If anything, it should have the same semantics as kubernetes Job, where we delete the running pods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the current implementation is consistent with batch/job.
@@ -357,3 +361,8 @@ func (jc *JobController) CleanupJob(runPolicy *apiv1.RunPolicy, jobStatus apiv1. | |||
func (jc *JobController) calcPGMinResources(minMember int32, replicas map[apiv1.ReplicaType]*apiv1.ReplicaSpec) *v1.ResourceList { | |||
return CalcPGMinResources(minMember, replicas, jc.PriorityClassLister.Get) | |||
} | |||
|
|||
func (jc *JobController) JobSuspended(job interface{}) (*bool, error) { | |||
log.Infof("Not implemented.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if we should merge this since the feature is not completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a base class default function in case Job subclasses(TFJob, MPIJob, etc.) do not implement this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it will be override in Job subclass which supports job suspend.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If DeletePodsAndServices
has an implementation, I don't see why this function wouldn't have one.
How is this PR going now? |
Is this actively being worked on? Or will we get rid of the common repo first? |
@alculquicondor Maybe, we will work on the Job suspend feature in the next kubeflow release cycle (maybe kubeflow v1.8?). Since we didn't push this feature to the enhancement lists for the next kubeflow release (v1.7) and the feature freeze for the next kubeflow version (v1.7) is coming up. kubeflow/training-operator#1683
|
Agree. we will take this up in next release after our merging kubeflow/common as planned in kubeflow/training-operator#1714 (comment) |
@tenzen-y how do you feel about starting with the integration for mpi-operator v2 and follow through with training-operator later? |
@alculquicondor Yes. that is a good idea. I was thinking of the same. |
Excellent! |
You are right. I will work on the following steps after kubeflow feature freeze date (1/25) since I have no enough bandwidth for mpi-operator v2, now:
Although, other anyone can take step 3 after step 1 is completed. |
@mimowo will help with suspend in mpi-operator kubeflow/mpi-operator#504 |
Great! Thanks to @mimowo! |
Agreed. We could try to work on Job suspend feature for kubeflow v1.8. |
@johnugeorge how are we doing with the branch creation? |
To support job suspend semantics like Kubernetes batch job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job