Deprecate MPIJob v1 #1906

alculquicondor · 2023-09-07T12:12:34Z

Ideally, we should migrate the v2 implementations to the training operator, then remove the v1 implementation from the training-operator to reduce maintenance costs. However, we cannot take that route immediately because there are many issues in the training operator (e.g. inconsistent job conditions, not using headless svc, and so on). So, I think it would be better to mark the v1 implementation as deprecated, stop adding new features to it, and only provide bug fixes. We would then suggest that users switch to the mpi-operator if they would like the new features.

Originally posted by @tenzen-y in #1768 (comment)

alculquicondor · 2023-09-07T12:13:21Z

@kubeflow/wg-training-leads

terrytangyuan · 2023-09-08T00:47:00Z

migrate the v2 implementations to the training operator

Are you suggesting moving the entire codebase to training-operator? Or use mpi-operator as a library?

tenzen-y · 2023-09-08T06:15:36Z

migrate the v2 implementations to the training operator

Are you suggesting moving the entire codebase to training-operator? Or use mpi-operator as a library?

Use mpi-operator as a library. I think a separate binary for mpi-operator would be worth it since mpi-operator doesn't focus on ML Training.

terrytangyuan · 2023-09-08T11:01:48Z

Sounds good

terrytangyuan · 2023-09-08T17:01:14Z

there are many issues in the training operator (e.g. inconsistent job conditions, not using headless svc, and so on)

Can you expand on this? This would be helpful for estimating work and allocating sufficient resources.

tenzen-y · 2023-09-08T19:40:52Z

there are many issues in the training operator (e.g. inconsistent job conditions, not using headless svc, and so on)

Can you expand on this? This would be helpful for estimating work and allocating sufficient resources.

Sure. Actually, there are already issues:

Headless SVC issue: #1030
Inconsistent job conditions: #1703

tenzen-y · 2023-09-14T17:04:54Z

Friendly ping @johnugeorge :)

johnugeorge · 2023-10-09T11:51:06Z

Sorry for the late reply.

Agree. I am good with deprecating v1 in favor of v2. We need to take it up sometime. Can you explain more about your idea of creating a library? Do you mean that the reconcile logic from the MPI operator repo would be used within the training-operator? Is it easy to manage manifests, etc.?

We will target all prerequisites (#1030, #1703) for the next training operator 1.8 release, followed by MPI v2 support in the training operator if we have time. What do you think?

tenzen-y · 2023-11-15T18:11:33Z

Sorry for the late reply.

Agree. I am good with deprecating v1 in favor of v2. We need to take it up sometime. Can you explain more about your idea of creating a library? Do you mean that the reconcile logic from the MPI operator repo would be used within the training-operator? Is it easy to manage manifests, etc.?

We will target all prerequisites (#1030, #1703) for the next training operator 1.8 release, followed by MPI v2 support in the training operator if we have time. What do you think?

I have discussed this with @johnugeorge offline. We keep the standalone mpi-operator, and the training-operator uses the mpi-operator as a library. This means that users can deploy MPIJob v2 either as part of the training operator or through the mpi-operator.

We have tasks to realize this migration and deprecation:

Training Operator Side:

Resolve Headless SVC issue ([feature] Can we use one headless service for one job? #1030)
Resolve Inconsistent job conditions (Some suggestions about engineering optimization #1703)
Import the mpi-operator as a library to the training-operator and provide MPIJob v2
Notify users that MPIJob v1 is no longer maintained

MPI Operator Side:

Refactor MPI Operator (kubeflow/mpi-operator) so that the training-operator can use the mpi-operator as a library.
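
For orientation, the object at the center of this plan is the `kubeflow.org/v2beta1` MPIJob, which would be served whichever operator is deployed. The following is a minimal Python sketch of such a manifest built as a plain dict. The helper name `mpijob_v2_manifest`, the container names, and the `/app/run` command are hypothetical; only the top-level field names are taken from the v2beta1 API, and the result should be checked against the CRD of the release you install.

```python
import json


def mpijob_v2_manifest(name, image, workers=2, slots_per_worker=1):
    """Build a minimal MPIJob manifest for the v2beta1 API as a plain dict."""

    def replica(container_name, replicas, command=None):
        # One container per replica type; real jobs usually need more fields.
        container = {"name": container_name, "image": image}
        if command:
            container["command"] = command
        return {
            "replicas": replicas,
            "template": {"spec": {"containers": [container]}},
        }

    total_slots = workers * slots_per_worker
    return {
        "apiVersion": "kubeflow.org/v2beta1",  # MPIJob v1 uses kubeflow.org/v1
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": slots_per_worker,
            "runPolicy": {"cleanPodPolicy": "Running"},
            "mpiReplicaSpecs": {
                # "/app/run" is a placeholder binary used only for illustration.
                "Launcher": replica(
                    "mpi-launcher", 1, ["mpirun", "-n", str(total_slots), "/app/run"]
                ),
                "Worker": replica("mpi-worker", workers),
            },
        },
    }


if __name__ == "__main__":
    # The image reference is a placeholder, not a published image.
    print(json.dumps(mpijob_v2_manifest("demo", "example.com/mpi-demo:latest"), indent=2))
```

The JSON printed by the sketch can be piped to `kubectl apply -f -`, since kubectl accepts JSON as well as YAML.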

terrytangyuan · 2023-11-15T18:15:20Z

We keep the standalone mpi-operator, and the training-operator uses the mpi-operator as a library. This means that users can deploy MPIJob v2 either as part of the training operator or through the mpi-operator.

@tenzen-y Thanks! This approach looks good.

eero-t · 2023-11-15T18:32:50Z

Sounds great!

I assume that would also fix #1807, and maybe some other MPIJob tickets: https://github.com/kubeflow/training-operator/issues?q=is%3Aissue+is%3Aopen+mpijob.

More important, though, could be whether there will be regressions compared to the current v1 features.

Would training-operator MPIJob tests be updated to v2:

$ find training-operator/ -name '*test.go' | grep -i mpi
training-operator/pkg/apis/kubeflow.org/v1/mpi_validation_test.go
training-operator/pkg/apis/kubeflow.org/v1/mpi_defaults_test.go
training-operator/pkg/controller.v1/mpi/suite_test.go
training-operator/pkg/controller.v1/mpi/mpijob_controller_test.go

And/or mpi-operator tests brought to training-operator?

$ find mpi-operator/ -name '*test.go'
mpi-operator/test/integration/main_test.go
mpi-operator/test/integration/mpi_job_controller_test.go
mpi-operator/test/e2e/e2e_suite_test.go
mpi-operator/test/e2e/mpi_job_test.go
mpi-operator/pkg/controller/mpi_job_controller_test.go
mpi-operator/pkg/controller/podgroup_test.go
mpi-operator/pkg/apis/kubeflow/v2beta1/default_test.go
mpi-operator/pkg/apis/kubeflow/validation/validation_test.go

tenzen-y · 2023-11-15T18:42:29Z

I assume that would also fix #1807, and maybe some other MPIJob tickets: https://github.com/kubeflow/training-operator/issues?q=is%3Aissue+is%3Aopen+mpijob.

Yes, that's right.

Would training-operator MPIJob tests be updated to v2

Yes, we should have proper tests.

And/or mpi-operator tests brought to training-operator?

No, I think that we wouldn't have tests for the MPI-Operator library in this repo. However, I think we should implement unit and e2e tests alongside the training-operator.

johnugeorge · 2023-11-16T08:03:26Z

+1
One point that is yet to be finalized is the location of the mpi-operator v2 manifests. How do users install the mpi operator with the training operator? How do the Kubeflow manifests sync the mpi operator manifests during a release?

tenzen-y · 2023-11-16T08:44:17Z

the location of the mpi-operator v2 manifests.

I think that we can use a kustomize remote reference, as in the following:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - kubeflow.org_tfjobs.yaml
  - kubeflow.org_mxjobs.yaml
  - kubeflow.org_pytorchjobs.yaml
  - kubeflow.org_xgboostjobs.yaml
  - github.com/kubeflow/mpi-operator/manifests/crd.yaml?ref=v0.4.0
  - kubeflow.org_paddlejobs.yaml

Then, I think that we can keep pre-built all-in-one manifests in this repository for users without internet access.
That means saving the manifests (deploy.yaml) built with kustomize build github.com/kubeflow/training-operator/manifests/overlays/standalone > deploy.yaml.
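
Producing such a pre-built file could be automated roughly as follows. This is a sketch under stated assumptions, not project tooling: it assumes a `kustomize` binary on `PATH`, it uses the overlay path quoted above, and the helper names and the example tag are hypothetical.

```python
import subprocess

# Overlay path quoted in the comment above.
OVERLAY = "github.com/kubeflow/training-operator/manifests/overlays/standalone"


def kustomize_build_argv(ref=None, overlay=OVERLAY):
    """Return the argv for `kustomize build`, optionally pinned to a git ref."""
    target = f"{overlay}?ref={ref}" if ref else overlay
    return ["kustomize", "build", target]


def write_all_in_one(output_path="deploy.yaml", ref=None):
    """Render the overlay once and save it for clusters without internet access."""
    result = subprocess.run(
        kustomize_build_argv(ref), check=True, capture_output=True, text=True
    )
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write(result.stdout)
    return output_path


if __name__ == "__main__":
    # "v1.7.0" is a placeholder tag used only for illustration.
    print(" ".join(kustomize_build_argv(ref="v1.7.0")))
```

Pinning the build to a `ref` keeps the saved `deploy.yaml` reproducible, because the remote mpi-operator resource and the local CRDs are then resolved from a fixed revision.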

How do users install mpi operator with training operator?

If users want to install both operators, they need to disable MPIJob on the training-operator side, as in the past.
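
Disabling MPIJob on the training-operator side can be sketched as a JSON patch that restricts the controller to an explicit list of job kinds. This is a minimal Python sketch, not the project's tooling: it assumes that the training-operator binary accepts a repeatable `--enable-scheme` flag and that the operator is the first container in its Deployment. The scheme names and the helper name `build_enable_scheme_patch` are illustrative and should be verified against the installed release.

```python
import json

# Job kinds handled by the training-operator in this sketch. The names are an
# assumption derived from the CRDs listed in the kustomization above.
ALL_SCHEMES = ["tfjob", "mxjob", "pytorchjob", "xgboostjob", "paddlejob", "mpijob"]


def build_enable_scheme_patch(disabled=("mpijob",), schemes=ALL_SCHEMES):
    """Return a JSON Patch (RFC 6902) that sets the container args so that
    every scheme except those in `disabled` is enabled explicitly."""
    disabled_set = {name.lower() for name in disabled}
    args = [
        f"--enable-scheme={name}"
        for name in schemes
        if name.lower() not in disabled_set
    ]
    return [{
        "op": "add",
        # Assumes the operator is the first container in the pod template.
        "path": "/spec/template/spec/containers/0/args",
        "value": args,
    }]


if __name__ == "__main__":
    # Print a payload that could be passed to
    # `kubectl patch deployment <name> --type=json -p '<payload>'`.
    print(json.dumps(build_enable_scheme_patch()))
```

Listing the remaining schemes explicitly, rather than naming the one to drop, matches how an allow-list flag behaves: anything not mentioned is left disabled.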

andreyvelich · 2023-11-16T12:07:35Z

@tenzen-y Does it mean that we are going to maintain separate releases for MPI Operator and Training Operator?

tenzen-y · 2023-11-16T21:11:33Z

@tenzen-y Does it mean that we are going to maintain separate releases for MPI Operator and Training Operator?

Yes, that's right.

itayvallach · 2023-11-17T11:36:41Z

migrate the v2 implementations to the training operator

Are you suggesting moving the entire codebase to training-operator? Or use mpi-operator as a library?

Use mpi-operator as a library. I think a separate binary for mpi-operator would be worth it since mpi-operator doesn't focus on ML Training.

@tenzen-y Can you explain why mpi-operator doesn't focus on ML Training?

tenzen-y · 2023-11-17T13:29:58Z

migrate the v2 implementations to the training operator

Are you suggesting moving the entire codebase to training-operator? Or use mpi-operator as a library?

Use mpi-operator as a library. I think a separate binary for mpi-operator would be worth it since mpi-operator doesn't focus on ML Training.

@tenzen-y Can you explain why mpi-operator doesn't focus on ML Training?

MPIJob isn't used only for machine learning. MPIJob is used in generic HPC use cases like simulations.
So, I think that we shouldn't focus only on ML use cases.

Any thoughts? @terrytangyuan @alculquicondor

github-actions · 2024-02-15T15:01:52Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tenzen-y · 2024-02-15T15:03:33Z

/remove-lifecycle frozen

tenzen-y · 2024-02-15T15:03:44Z

/remove-lifecycle stale

tenzen-y · 2024-04-26T17:00:19Z

/retitle Deprecate MPIJob v1

vsoch · 2024-07-17T03:07:59Z

MPIJob isn't used only for machine learning. MPIJob is used in generic HPC use cases like simulations. So, I think that we shouldn't focus only on ML use cases.

+1

alculquicondor mentioned this issue Sep 7, 2023

MPIJob doesn't support exitcode restartPolicy #1768

Closed

tenzen-y mentioned this issue Sep 29, 2023

MPIJob example failure #1926

Closed

tenzen-y mentioned this issue Dec 26, 2023

Support v2 MPIJob #1972

Open

tenzen-y mentioned this issue Jan 25, 2024

[Release] Training Operator 1.8 Roadmap #1994

Closed

11 tasks

github-actions bot added the lifecycle/stale label Feb 15, 2024

google-oss-prow bot removed the lifecycle/stale label Feb 15, 2024

tenzen-y changed the title ~~Deprecate MPI Operator v1~~ Deprecate MPIJob v1 Apr 26, 2024

tenzen-y mentioned this issue Apr 29, 2024

MPIJob requires service names for the pods. #2090

Closed
