Add DeepSpeed Example with Pytorch Operator #2235
Conversation
@tenzen-y @andreyvelich @kuizhiqing @terrytangyuan This PR is ready for review. PTAL, Thanks!
Thank you for adding this great example @Syulin7!
/assign @kubeflow/wg-training-leads @kuizhiqing
DeepSpeed can be deployed by different launchers such as torchrun, the deepspeed launcher, or Accelerate.
See [deepspeed](https://huggingface.co/docs/transformers/main/en/deepspeed?deploy=multi-GPU&pass-config=path+to+file&multinode=torchrun#deployment).
Do we set the appropriate env variables for the `deepspeed` or `accelerate` launchers in PyTorchJob, or can only `torchrun` be used?
When using the `deepspeed` launcher, it defaults to pdsh (machines accessible via passwordless SSH) to send commands to the workers for execution, which is the launcher-worker mode.
The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate. DeepSpeed does not require setting env variables; it reads the information from the hostfile.
# The --hostfile path defaults to /job/hostfile
deepspeed --hostfile=/etc/mpi/hostfile /train_bert_ds.py --checkpoint_dir /root/deepspeed_data
About the hostfile, see: https://github.com/microsoft/DeepSpeed/blob/3b09d945ead6acb15a172e9a379fc3de1f64d2b2/docs/_tutorials/getting-started.md?plain=1#L173-L187
# hostfile
worker-1 slots=4
worker-2 slots=4
I can add an example in mpi-operator (MPI v2) later.
In PyTorchJob, `torchrun` and `accelerate` can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.
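For reference, a rough sketch of the rendezvous env the operator injects into each pod; the variable names are the ones mentioned in this thread, the values here are purely illustrative:
# Illustrative values; the operator sets these per pod.
export MASTER_ADDR=pytorch-deepspeed-demo-master-0  # Master pod's DNS name
export MASTER_PORT=23456                            # port the Master listens on
export WORLD_SIZE=2                                 # total number of replicas
export RANK=0                                       # this pod's node rank
Both launchers can then rendezvous without repeating these values as CLI flags.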
The mpi-operator in the training operator executes commands through kubectl exec, and it is uncertain whether DeepSpeed can support that. Currently, using MPI v2 (via passwordless SSH) would be more appropriate.
Thanks for this info! I think we can support it once we migrate to MPI V2 in the TrainJob API. cc @tenzen-y @alculquicondor
So we can build a specific `deepspeed` runtime that will leverage MPI orchestration to create hostfiles.
In PyTorchJob, torchrun and accelerate can be used. If I remember correctly, the environment variables for torchrun and accelerate are similar.
As far as I know, `accelerate` is compatible with torchrun. However, it might have some additional parameters that torchrun doesn't allow to be set, e.g. mixed precision: https://huggingface.co/docs/accelerate/en/basic_tutorials/launch#:~:text=MIXED_PRECISION%3D%22fp16%22
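For example, a hypothetical invocation along these lines (flag names are from the Accelerate CLI; the env variable names and script path reuse the ones discussed above and are assumptions for illustration):
# Sketch: accelerate exposes knobs torchrun has no flag for, e.g. mixed precision.
accelerate launch --mixed_precision fp16 \
    --num_machines "$WORLD_SIZE" --machine_rank "$RANK" \
    --main_process_ip "$MASTER_ADDR" --main_process_port "$MASTER_PORT" \
    /train_bert_ds.py --checkpoint_dir /root/deepspeed_data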
`deepspeed` is already compatible with mpi-operator (the one outside of training-operator).
Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610
`deepspeed` is already compatible with mpi-operator (the one outside of training-operator). Someone started a PR to add an example, but they abandoned it: kubeflow/mpi-operator#610
Yes, the image used in this example is one I built earlier. I can provide the Dockerfile for reference. cc @alculquicondor @kuizhiqing
I'm happy to accept a PR for this in the mpi-operator repo.
I think once we merge this PR, we can refer to this training script in the MPI-Operator repo as well and add a simple YAML with MPIJob.
@Syulin7 Yes, thanks for your original work on the base image. The plan in kubeflow/mpi-operator#610 has somewhat stalled for some reason. You are really welcome to continue it.
name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:
Why do you need a Master replica for this example?
Actually, the complete command is as follows; torchrun will read the environment variables MASTER_ADDR, MASTER_PORT, and RANK (which are set by the training operator in the pod env):
# node1
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
# node2
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=hostname1 \
    --master_port=9901 your_program.py <normal cl args>
so the command can be simplified as follows:
torchrun --nproc_per_node=8 --nnodes=2 your_program.py <normal cl args>
Yeah, I think we have a problem with the V1 Training Operator: we only set the MASTER_PORT when a Master replica is set. Eventually, you shouldn't need a dedicated Master replica if the PodTemplateSpec is the same across all nodes.
I think we have a problem with the V1 Training Operator: we only set the MASTER_PORT when a Master replica is set.
Yes, so we need a Master replica for this example.
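For context, the shape this example ends up with — a trimmed sketch of the manifest, assuming a Worker replica alongside the Master and omitting resources and args:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-deepspeed-demo
spec:
  pytorchReplicaSpecs:
    Master:           # presence of Master makes the operator set MASTER_PORT
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest
    Worker:           # same PodTemplateSpec as the Master
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: kubeflow/pytorch-deepspeed-demo:latest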
# Checkpoint Related Functions

def load_model_checkpoint(
How do we use it?
The function is not used; this script was copied directly from DeepSpeedExamples:
https://github.com/microsoft/DeepSpeedExamples/blob/master/training/HelloDeepSpeed/README.md
The tutorial shows the changes necessary to integrate DeepSpeed and some of the advantages of doing so.
`load_model_checkpoint` is used in train_bert.py, which is the original script that does not use DeepSpeed.
I'm not sure whether we should delete it or stay consistent with DeepSpeedExamples.
I am fine with both; eventually we can add it when we have a more dedicated example/notebook where we can show how to resume training from a checkpoint.
Any thoughts, @kubeflow/wg-training-leads?
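If we do add that later, a minimal resume sketch could look like this (assuming `model_engine` comes from `deepspeed.initialize()`; `resume_if_possible` is a hypothetical helper, and the client-state key matches the `save_checkpoint` call in this script):
def resume_if_possible(model_engine, exp_dir):
    # load_checkpoint returns (None, None) when no checkpoint exists yet.
    load_path, client_state = model_engine.load_checkpoint(load_dir=exp_dir)
    if load_path is None:
        return 0  # start from scratch
    return client_state["checkpoint_step"] + 1  # resume after the saved step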
return uuid

def create_experiment_dir(
Do we need this experiment dir in this example?
Similar to the issue above: it creates a directory in the checkpoint_dir on rank 0. I think we can stay consistent with DeepSpeedExamples.
)
# Save the last checkpoint if not saved yet
if step % checkpoint_every != 0:
    model.save_checkpoint(save_dir=exp_dir, client_state={"checkpoint_step": step})
Will the model checkpointing happen only on the rank 0 node?
Yes, generally, the model will be saved on shared storage (using a PVC).
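e.g. a pod-spec fragment like this in both replica templates (the claim name is illustrative; the mount path matches the --checkpoint_dir used in this example):
# Sketch: back the checkpoint dir with a shared PVC so all ranks see it.
volumes:
  - name: deepspeed-data
    persistentVolumeClaim:
      claimName: deepspeed-data   # illustrative claim name
containers:
  - name: pytorch
    volumeMounts:
      - name: deepspeed-data
        mountPath: /root/deepspeed_data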
@Syulin7 Do we have this check as part of the save_checkpoint() API, or do we need to verify it?
Like in this FSDP example from PyTorch.
@andreyvelich All processes must call save_checkpoint(), so we don't need to verify it.
https://www.deepspeed.ai/getting-started/#model-checkpointing
Important: all processes must call this method and not just the process with rank 0. It is because each process needs to save its master weights and scheduler+optimizer states. This method will hang waiting to synchronize with other processes if it's called just for the process with rank 0.
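In other words, the call stays unguarded in the training loop — a sketch, reusing the names from the snippet above:
# Correct: every rank calls save_checkpoint; it synchronizes internally.
model.save_checkpoint(save_dir=exp_dir, client_state={"checkpoint_step": step})

# Wrong: a rank-0 guard would leave the other ranks hanging at the barrier.
# if torch.distributed.get_rank() == 0:
#     model.save_checkpoint(save_dir=exp_dir, ...)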
@Syulin7 @kubeflow/wg-training-leads Are we ready to merge this PR?
@andreyvelich Yes, I think this PR can be merged.
Thanks @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads
Good work.
- name: pytorch
  image: kubeflow/pytorch-deepspeed-demo:latest
  command:
    - torchrun
@Syulin7 No, actually, you don't need to set the parameters for `torchrun`; setting the correct environment-related parameters (or env, in the operator case) is the responsibility of the operator.
If you set them, the parameters will overwrite the env, which will work no doubt, but we don't encourage our users to use it this way. Since we use the operator, we leave that stuff to the operator.
@andreyvelich @kuizhiqing Thanks for the review! I addressed all comments. PTAL.
LGTM
Thank you for doing this @Syulin7!
/lgtm
/assign @kubeflow/wg-training-leads
I think we can merge it.
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: andreyvelich, kuizhiqing. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
What this PR does / why we need it:
Add DeepSpeed Example with Pytorch Operator. The script used is HelloDeepSpeed from DeepSpeedExamples.
Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Part-of #2091
Checklist: