Support v2 MPIJob #1972

Open
chenxi-seu opened this issue Dec 26, 2023 · 8 comments

@chenxi-seu

We noticed that the MPIJob used in the Kubeflow community documentation (https://www.kubeflow.org/docs/components/training/mpi/) is the v2beta1 version, but training-operator seems to support only the v1 MPIJob. Does the training-operator community have plans to support v2?

Currently, users who need both MPIJob and PyTorchJob must first install mpi-operator to get the v2beta1 MPIJob, and then install training-operator to use the v1 PyTorchJob.
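For reference, a minimal v2beta1 MPIJob as served by mpi-operator looks roughly like this (a sketch; the image and command are illustrative). The v1 controller in training-operator does not reconcile this apiVersion, which is why both operators are needed today:

    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: pi
    spec:
      slotsPerWorker: 1
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
              - name: launcher
                image: mpioperator/mpi-pi  # illustrative image
                command: ["mpirun", "-n", "2", "/home/mpiuser/pi"]
        Worker:
          replicas: 2
          template:
            spec:
              containers:
              - name: worker
                image: mpioperator/mpi-pi  # illustrative image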

@tenzen-y
Member

@chenxi-seu Yes, we have a plan to support MPIJob v2. Please see #1906.

@chenxi-seu
Author

@tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct?
I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.
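Concretely, I mean something like this in the training-operator Deployment (a sketch, assuming the stock manifest; only pytorchjob is enabled):

    spec:
      containers:
      - args:
        - --enable-scheme=pytorchjob
        command:
        - /manager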

@tenzen-y
Member

> @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.

Yes, you're right.

@mupeifeiyi

> @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.

Sorry, I have the same problem now, but I don't understand how you did this step.

@tenzen-y
Member

tenzen-y commented Mar 25, 2024

> > @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.
>
> Sorry, I have the same problem now, but I don't understand how you did this step.

@mupeifeiyi This would be a good example: #1777 (comment)

@mupeifeiyi

mupeifeiyi commented Mar 28, 2024

> > > @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.
> >
> > Sorry, I have the same problem now, but I don't understand how you did this step.
>
> @mupeifeiyi This would be a good example: #1777 (comment)

Thanks, the following is useful to me:

    spec:
      containers:
      - args:
        - --enable-scheme=tfjob
        - --enable-scheme=pytorchjob
        - --enable-scheme=mxjob
        - --enable-scheme=xgboostjob
        - --enable-scheme=paddlejob
        command:
        - /manager

Note: the scheme name is mxjob, not mxnetjob.
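If it helps, you can verify which schemes ended up enabled by inspecting the args of the running Deployment, e.g. kubectl get deployment training-operator -n kubeflow -o yaml (assuming the default install name and namespace).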


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

For context: we are planning to implement support for MPIJob V2 as part of the Kubeflow Training V2 proposal: https://bit.ly/3WzjTlw
