
fix the condition #617

Open
wants to merge 1 commit into master

Conversation

wang-mask

fixes #615

The original logic is:

if launcher == nil {
	if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || c.countReadyWorkerPods(worker) == len(worker) {
		launcher, err = c.kubeClient.BatchV1().Jobs(namespace).Create(context.TODO(), c.newLauncherJob(mpiJob), metav1.CreateOptions{})
		// ...
	}
}

If the MPIJob is suspended, len(worker) and c.countReadyWorkerPods(worker) are both 0. The condition is then met, and the launcher pod is created even though the job is suspended.
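
A minimal sketch of the idea behind the fix (the workersExistAndReady name is illustrative, and the actual change in this PR may be expressed differently):

if launcher == nil {
	// A suspended MPIJob has no worker pods, so "0 ready out of 0" must not
	// count as "all workers ready" under WaitForWorkersReady.
	workersExistAndReady := len(worker) > 0 && c.countReadyWorkerPods(worker) == len(worker)
	if mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyAtStartup || workersExistAndReady {
		launcher, err = c.kubeClient.BatchV1().Jobs(namespace).Create(context.TODO(), c.newLauncherJob(mpiJob), metav1.CreateOptions{})
		// ...
	}
}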

Signed-off-by: wang-mask <2018091609006@std.uestc.edu.cn>

google-cla bot commented Jan 26, 2024

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alculquicondor for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alculquicondor
Collaborator

Then this judgment condition will be met, and the launcher pod will be created.

But we only create the launcher Job in a suspended state.

Are you saying that the problem arises when you unsuspend the job and suddenly all the pods start?

Not sure if it's worth always creating the Job and only unsuspending if the workers are ready.

@wang-mask
Author

I think the launcher pod should be created after all the workers are ready when spec.launcherCreationPolicy is set to "WaitForWorkersReady", even if the MPIJob is suspended.
The current logic is not really consistent with "WaitForWorkersReady".

@tenzen-y
Member

I think the launcher pod should be created after all the workers are ready when spec.launcherCreationPolicy is set to "WaitForWorkersReady", even if the MPIJob is suspended.

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator.
Could you clarify such a situation?

@wang-mask
Author

wang-mask commented Jan 29, 2024

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady".
Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?

@wang-mask
Author

My understanding of spec.launcherCreationPolicy set to "WaitForWorkersReady" is that the controller waits for all workers to be ready before creating a launcher. So in a suspended state, where the workers have not been created and therefore are not ready, the launcher should not be created.

The current code is designed to create a launcher but suspend it when the MPIJob is suspended, regardless of launcherCreationPolicy. Is that the expected behavior?

@tenzen-y
Member

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady".

Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?

In this situation, who creates the workers?

@wang-mask
Author

I think that when the MPIJob is suspended, workers aren't created by the mpi-operator. Could you clarify such a situation?

In this case the workers are not created by the mpi-operator, but the mpi-operator creates the launcher when the spec.launcherCreationPolicy is set to "WaitForWorkersReady".
Shouldn't it create the launcher after the mpi job is unsuspended and workers are ready?

In this situation, who creates the workers?

Unsuspending the MPIJob is what triggers the creation of the workers.

@alculquicondor
Collaborator

With this implementation: what happens if the job is running, then it is suspended and unsuspended?

Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.

@wang-mask
Author

With this implementation: what happens if the job is running, then it is suspended and unsuspended?

Is a launcher pod created as soon as it is unsuspended the second time? If so, this solution is not sufficient.

Currently the controller deletes the worker pods and suspends the launcher when the MPIJob is suspended.

Maybe it could also delete the launcher when the MPIJob is suspended and spec.launcherCreationPolicy is "WaitForWorkersReady". After the second or a subsequent suspension is lifted, the controller would then wait for all workers to be ready before creating the launcher again.

If this implementation is reasonable, I would be happy to modify this PR.
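
A rough sketch of that approach (the isMPIJobSuspended helper, the LauncherCreationPolicyWaitForWorkersReady constant name, and the deletion options are assumptions rather than the controller's actual code; apierrors stands for k8s.io/apimachinery/pkg/api/errors):

// When the MPIJob is suspended and the policy is WaitForWorkersReady,
// remove any existing launcher Job so that the policy is re-evaluated
// once the job is resumed and the workers become ready again.
if isMPIJobSuspended(mpiJob) && launcher != nil &&
	mpiJob.Spec.LauncherCreationPolicy == kubeflow.LauncherCreationPolicyWaitForWorkersReady {
	propagation := metav1.DeletePropagationBackground
	if err := c.kubeClient.BatchV1().Jobs(namespace).Delete(context.TODO(), launcher.Name,
		metav1.DeleteOptions{PropagationPolicy: &propagation}); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	launcher = nil
}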

@alculquicondor
Collaborator

Or you could still create the Job object but not flip the suspend flag until the workers are ready.
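
A hedged sketch of that variant, assuming the launcher is a batch/v1 Job with a Spec.Suspend field (isMPIJobSuspended and the surrounding reconcile variables are assumptions):

// Always create the launcher Job, but start it suspended; flip the
// suspend flag only once the MPIJob itself is not suspended and all
// worker pods are ready.
if launcher == nil {
	job := c.newLauncherJob(mpiJob)
	startSuspended := true
	job.Spec.Suspend = &startSuspended
	launcher, err = c.kubeClient.BatchV1().Jobs(namespace).Create(context.TODO(), job, metav1.CreateOptions{})
	if err != nil {
		return err
	}
}
workersReady := len(worker) > 0 && c.countReadyWorkerPods(worker) == len(worker)
if !isMPIJobSuspended(mpiJob) && workersReady && launcher.Spec.Suspend != nil && *launcher.Spec.Suspend {
	resumed := launcher.DeepCopy()
	unsuspend := false
	resumed.Spec.Suspend = &unsuspend
	if _, err := c.kubeClient.BatchV1().Jobs(namespace).Update(context.TODO(), resumed, metav1.UpdateOptions{}); err != nil {
		return err
	}
}

With this shape, a suspend/resume cycle leaves the launcher Job in place but suspended, and it is only unsuspended once the workers are ready again.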
