Support v2 MPIJob #1972

Open
chenxi-seu opened this issue Dec 26, 2023 · 8 comments

@chenxi-seu

We noticed that the MPIJob used in the Kubeflow community documentation (https://www.kubeflow.org/docs/components/training/mpi/) is the v2beta1 version, but training-operator seems to support only the v1 MPIJob. Does the training-operator community have plans to support v2?

Currently, users who need both MPIJob and PyTorchJob must first install mpi-operator to get the v2beta1 MPIJob, and then install training-operator to use the v1 PyTorchJob.
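For reference, a minimal v2beta1 MPIJob as served by mpi-operator looks roughly like this (a sketch; the image and command are illustrative). The v1 controller in training-operator does not reconcile this apiVersion, which is why both operators are needed today:

    apiVersion: kubeflow.org/v2beta1
    kind: MPIJob
    metadata:
      name: pi
    spec:
      slotsPerWorker: 1
      mpiReplicaSpecs:
        Launcher:
          replicas: 1
          template:
            spec:
              containers:
              - name: launcher
                image: mpioperator/mpi-pi  # illustrative image
                command: ["mpirun", "-n", "2", "/home/mpiuser/pi"]
        Worker:
          replicas: 2
          template:
            spec:
              containers:
              - name: worker
                image: mpioperator/mpi-pi  # illustrative image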

@tenzen-y
Member

@chenxi-seu Yes, we have a plan to support MPIJob v2. Please see #1906.

@chenxi-seu
Author

@tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct?
I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.
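Concretely, I mean something like this in the training-operator Deployment (a sketch, assuming the stock manifest; only pytorchjob is enabled):

    spec:
      containers:
      - args:
        - --enable-scheme=pytorchjob
        command:
        - /manager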

@tenzen-y
Member

> @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.

Yes, you're right.

@mupeifeiyi

> @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.

Sorry, I have the same problem now, but I don't understand how you did this step.

@tenzen-y
Member

tenzen-y commented Mar 25, 2024

> > @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.
>
> Sorry, I have the same problem now, but I don't understand how you did this step.

@mupeifeiyi This would be a good example: #1777 (comment)

@mupeifeiyi

mupeifeiyi commented Mar 28, 2024

> > > @tenzen-y Thank you for your response. Could you please confirm whether my current approach is correct? I plan to first install mpi-operator and then install training-operator with --enable-scheme configured. I will enable only PyTorchJob to avoid any conflict between the two operators over different versions of MPIJob.
> >
> > Sorry, I have the same problem now, but I don't understand how you did this step.
>
> @mupeifeiyi This would be a good example: #1777 (comment)

Thanks, the following is useful to me:

    spec:
      containers:
      - args:
        - --enable-scheme=tfjob
        - --enable-scheme=pytorchjob
        - --enable-scheme=mxjob
        - --enable-scheme=xgboostjob
        - --enable-scheme=paddlejob
        command:
        - /manager

Note: the scheme name is mxjob, not mxnetjob.
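If it helps, you can verify which schemes ended up enabled by inspecting the args of the running Deployment, e.g. kubectl get deployment training-operator -n kubeflow -o yaml (assuming the default install name and namespace).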


This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@andreyvelich
Member

For context: we are planning to implement support for MPIJob V2 as part of the Kubeflow Training V2 proposal: https://bit.ly/3WzjTlw
